I can teach you analytics! I have never seen a database do analytics faster than Yellowbrick. This week we are working on the Cumulative Sum, which is interchangeable with the word CSUM. You use a Cumulative Sum (CSUM) to get a running total with subtotals and grand totals.
In each example, you will see an ORDER BY statement, but it will not come at the end of the query. The ORDER BY keywords is always within the CSUM calculation. It is the ORDER BY statement that runs first, and once the system orders the data, the totals calculate. It is the initial ordering of the data set that gives these analytics the name “Ordered Analytics.”
Notice in the picture below that we are doing a SUM(Daily_Sales), but this is not an aggregate of Daily_Sales because of the OVER keyword. The OVER keyword made the SUM(Daily_Sales) an ordered analytic and not a mere SUM of Daily_Sales.
The most difficult part of the example below is the keywords “ROWS UNBOUNDED PRECEDING.” The ROWS UNBOUNDED PRECEDING is something we refer to as the moving window, or in other words, how many rows should we calculate? The ROWS UNBOUNDED PRECEDING means that after the system orders the data with the ORDER BY clause, calculate the running total from the first row to the last, always adding up the preceding totals with the current row’s Daily_Sales value. Check out the example below.
Let me summarize. Order the data first by the column Sale_Date and after the data sorts begin with row one and Sum the Daily_Sales. Then, continue to the next row and add to the sum of that row’s Daily_Sales to the running total. Continue to do this until you finish with the last row.
In the picture below, we are running the same calculation, but casting the column csumansi so it only delivers two decimal points. The cast command stands for (Convert and Store), which means it changes the data type temporarily for the life of the query.
In the picture below, we are running the same calculation, but ordering the data using two columns in the ORDER BY clause. The only reason the grand totals are different between the above and below picture is that we are not displaying all of the data.
Each CSUM example will have an ORDER BY statement, but sometimes you will also have a PARTITION statement. In the example below, you see the keywords PARTITION BY Product_ID, and that means the CSUM function calculates within each Product_ID only. Check out the next example below.
If you want to learn more about modern data analytics, sign up for the ANSWERS seminar that Yellowbrick is holding on September 29, 2020. Coffing Data is a sponsor.
I hope you enjoyed today’s Yellowbrick analytic lesson. See you next week.
Tom Coffing, better known as Tera-Tom, is the founder of Coffing Data Warehousing where he has been CEO for the past 25 years. Tom has written over 75 books on all aspects of Teradata, Netezza, Yellowbrick, Snowflake, Redshift, Aurora, Vertica, SQL Server, and Greenplum. Tom has taught over 1,000 classes worldwide, and he is the designer of the Nexus Product Line.