Azure Synapse Analytics – Moving Average
I can teach you analytics! Azure Synapse is built for analytics, and this blog will brilliantly show you the moving average. A moving average allows you to look for trends in your data. The most famous moving average is the Dow Jones Industrial 65-day moving average, but you can use a moving average to analyze trends in your data, and I will show you the SQL to make it happen.
A moving average is a statistical calculation used to smooth out fluctuations in data over time by creating a series of averages of different subsets of the full dataset. It helps reveal trends and patterns by reducing noise or random fluctuations, making it easier to identify the underlying direction of change in the data.
All of these examples have come from my books and training classes. Please do me a favor and tell your training coordinator that you know the best technical trainer in the World. Ask them to hire me to train at your company, either on-site or with a virtual class. They can see our classes, outlines, and a sample of my teaching at this link on our website. I hope to meet you and say thanks in my next class at your company:)
https://coffingdw.com/education/
The picture below shows a moving average. At first, it appears as a simple aggregate because we begin the moving average with AVG(daily_sales). However, that is followed by the keyword OVER, which implies an ordered analytic called a window function. It is called an ordered analytic because, after the word OVER, we see (in parenthesis) an ORDER BY statement.
Before the moving average is calculated, the data is sorted using the ORDER BY statement, which, in this case, is by product_id (major sort) and sale_date (minor sort). The moving sum is also called a window function because the calculation works on a window of rows signified by ROWS 2 PRECEDING. So, after ordering the data by product_id and sale_date, the calculation will provide the average daily_sales for the current row and the preceding two rows, which is a window of three rows.
The first row is an outlier and should not be analyzed for trending because it is a one-row average. The second row is also an outlier and should not be analyzed because it is a two-row average.
However, the third row is a three-row average because it calculates the average between the daily_sales of the current row (third row) and the first and second row. After the first two rows, each row is a three-row average (current row’s daily_sales and the previous two rows).
The example below shows the results in color. The moving average of the third row in pink is 46450.23, which is the average of 36000.07 (current row’s daily_sales) and 54500.22 and 48850.40 (previous two rows).
Each row’s moving average calculation (other than the first two rows) is a three-row calculation (the current row’s daily_sales) and the daily_sales of the previous two rows.
The big point is that the query has a window of three rows, as indicated by the keywords ROWS 2 PRECEDING.
To clarify the point, notice the last row’s moving average is 46579.63 in blue. That moving average was calculated by 49850.03, 48000.00, and 41888.88.
It was quite enlightening when I realized that a moving average was part of the ordered analytic family because it orders that data using an order by statement inside the analytic function. In the example below, we have a multi-column sort. Notice that the data is sorted by product_id because that is the major sort. Since there are many product_id 1000 rows, the ties are handled by the sale_date, which is the minor sort. Once the data is ordered, the moving average begins from the first row to the last.
Now that you know enough to be dangerous, examine the example below and describe how the moving average in the third row (46450.230000) was derived. I have colored it red for your convenience.
I have a hunch you nailed this quiz. The third row in red had a moving average of 46450.230000, derived by adding up the daily_sales of the current row with the daily_sales of the previous two rows and then dividing by three. In other words, 36000.07 + 54500.22 + 48850.40 divided by three.
Below is an example of a weekly moving average because of the keywords ROWS 6 PRECEDING. However, this example works because we are storing our data in a table that captures the daily_sales for each product_id once per day. Plus, I have to order the data by product_id and then sale_date. All of these factors lead to a seven-day moving average.
The first six days are outliers. After all, they cannot calculate the current row and six preceding rows because they don’t have six preceding them. However, each row has a seven-day moving average on the seventh day onward (blue arrow).
Our best seven-day moving average was January 11, 2024, and our worst was January 17. The goal of the moving average is to show positive and negative trends, and these two dates should be studied to find out why these results have occurred.
Notice that I put an extra ORDER BY statement at the end of the query. Azure Synapse doesn’t always return the data as you might anticipate, so I used the same ordering at the end that I did inside the moving average analytic.
One of the brilliant things about analytics is that you can have many of them within the same query. Below, we have two moving averages. The first uses ROWS 2 PRECEDING, and the second uses ROWS UNBOUNDED PRECEDING.
We have already learned that the ROWS 2 PRECEDING is a moving window of three rows (current row and previous two rows). The ROWS UNBOUNDED PRECEDING means beginning to average each row’s daily_sales amount with the daily_sales amount for all rows preceding. So, the first row is a one-row average, the second row is a two-row average, the third row is a three-row average, etc. The last row averages 46788.17 with the daily_sales of all previous rows.
The moving average aliased CONTINUOUS gives you the average over all preceding days up to the current row, showing how daily_sales behave over time without the constraint of a fixed window size.
In our next example below, we see the keywords PARTITION BY, and since we are partitioning by the product_id, the moving average will reset and start over with each product_id. In simpler words, the moving average will be calculated for product_id 1000 rows and then begin again for product_id 2000 rows.
This brings me to another important point: notice the fifth row of 36333.66, the worst three-day average for product_id 1000 rows. We now see the worst trend for our data for product_id 1000. Notice that the last row for product_id 1000 has a moving average of 50551.20, the best three-day average in our product_id 1000 three-day window.
Now that we can see the worst and best three-day trends, we can attempt to explore more to find out why, perhaps eliminate the downward trend and duplicate the environment that led to the best trend.
Although the second row had a moving average of 51675.31, it was not a three-row window but a two-row window, so I ignored it.
The example below is designed to blow your mind and teach you many advanced aspects of the moving average.
I am using the DATE_FORMAT function to return the day of the week for each sale_date, and I am aliasing it with the name DOW. I am using the dow in both queries below. The first query on the left is partitioned by product_id and DOW because I use the same DATE_FORMAT command in the partition statement, meaning we will reset the calculation for product_id 1000 when we change it to a new day of the week.
Notice that the example on the left has the data sorted by product_id, and all the Friday dates come first. Notice that each date within those Friday dates is in ascending order. The reason for that is when you use a partition by and then an order by the data is logically sorted by the partition and then the order by statement.
Notice how, on the left example, I use a ROWS 1 PRECEDING. We will average the current row (Friday) with the previous Friday’s daily_sales, which is a moving window of two. The only row, not the average of two rows, is the first row of each weekday for each product_id.
Please notice that I also put an order by statement at the end of the query on the left, which is exactly the ordering done by the moving average based on the combination of the partition by and order by statements within the moving average calculation.
Now, notice the second query on the right. It is an exact duplicate of the first query except for the final order by statement at the end.
The brilliance and learning lesson of the query on the right is that the calculation is done by the combination of the partition by and order by statement, so product_id 1000 Fridays compare with the previous Friday and so on, but when we order the data by product_id and then sale_date, the data comes out in logical order. However, we are still comparing product_id 1000 Friday rows with the previous Friday, and so on.
I put arrows in the answer set of the example to show you that the second Friday of product_id 1000 has a moving average of 73.00.
I have a table with five years of sales data for product_id 1000 and 2000. I am demonstrating SQL for a 65-day moving average. Notice the ROWS 64 PRECEDING in the moving average, a window of 65 rows.
The first 64 rows are not 65-day moving averages, so they are outliers, but from the 65th row on, we have a moving average that calculates the daily_sales value of the current row with the daily_sales value of the previous 64 rows.
The DOW 65-day Moving Average calculates the average closing price of the Dow Jones Industrial Average over 65 trading days.
By averaging the closing prices over a specific period, the moving average smooths out short-term fluctuations in the market, providing a clearer picture of the overall trend.
When the current price exceeds the 65-day moving average, it is bullish, but when it is below, it is bearish.
I hope you enjoyed this blog post. You now know why people use the moving average and how to do it with SQL.
Did you know that Coffing Data Warehousing was the first to create software that joins data across all systems? Download your free Nexus trial at www.CoffingDW.com.
Watch how Nexus performs federated queries to join an Azure Synapse table to tables from twenty systems in a single query. Watch the video.
To find out more about Nexus, you can watch our 60-second videos here: