I can teach you Yellowbrick analytics!  I have never seen a database perform analytics faster than Yellowbrick.  Yellowbrick can load billions of rows and still perform thousands of queries simultaneously. The analytic we will learn about today is CUME_DIST.

Sometimes, you want to make a report that contains the top or bottom percentage of values from a data set.  For example, you might wish to only see the top 20% of salespeople by net sales. One way to achieve this with Yellowbrick is to use the CUME_DIST() function.

The CUME_DIST() function calculates the cumulative distribution of a value within a group of values. Simply put, it calculates the relative position of a value in a group of values.

CUME_DIST finds the cumulative distribution of a value concerning other values within the same window partition. Each row’s value in the result set is greater than 0 and less than or equal to 1.  Values that tie get the same CUME_DIST value.

The complicated explanation is CUME_DIST represents the number of rows with values less than or equal to that row’s weight divided by the total number of rows. Check out the example below.

The example below is similar, but it uses a derived table.  Our goal is only to see only the best 50  percent of the sales.  When you place the same query in a derived table first, you can then use a WHERE clause because the data has been calculated and exists in the derived table.  The WHERE clause will ask for only the cdist values >= 0.5.

The example below shows you how Cume_Dist calculates when there are equal values among the rows.  We have updated our table, and now there are two rows with the same total_sales value.  After the sort by total_sales ascending rows, five and six have the same total_sales amount.  There are ten rows in the window, and since rows five and six ties with an amount of 5.00, the calculation becomes 6 / 10 = 0.6.

Next, we are going to take the next example even further by adding a PARTITION BY Clause.  PARTITION BY resets the Cume_Dist calculation with each region break. There are two regions, which are the North and South regions.  Our partition statement tells the system only to calculate the CUME_DIST within each region separately.

My introduction to Yellowbrick came from Ed Bernier.  I have known Ed for over 20 years.  We met while he was working at Netezza.  Ed knew that my Nexus product line could query every system, move data to and from most systems, and automate cross-system joins (federated queries).  Ed told me he was now working at Yellowbrick, and his enthusiasm was exhilarating.  Ed mentioned how fast the loading was to Yellowbrick and that the system could load billions of rows while simultaneously running thousands of queries.

I directed my programming team to implement ybload, ybunload, and ybrelay into our product line and get every system on the market moving to Yellowbrick in the Nexus Database Mover.  I then directed the team to make sure Yellowbrick tables could join with tables across all on-premises and cloud systems and automate the joins in the Nexus Super Join Builder. I then challenged my team to use ybrelay and get the Nexus to import flat files from Hadoop, Spark, and MapR with the click of a button.

I am proud to say that I am now a Yellowbrick partner.  Yellowbrick supplies the world’s only modern-day hybrid-cloud solution, and I provide the software that helps customers migrate to Yellowbrick.  Ed Bernier is one of the greatest technical geniuses I have seen in my career.

Ed Bernier, Yellowbrick

Watch the video below of Ed Bernier speaking at the Gartner conference about Yellowbrick. 

Are you looking for directions to the Hybrid-Cloud?  Follow the Yellowbrick road in this two-minute video and set your path to success.

I hope you enjoyed today’s Yellowbrick analytic lesson.  See you next week. 

Thank you,

Tera-Tom

Tom Coffing
CEO, Coffing Data Warehousing
Direct: 513 300-0341
Website: www.CoffingDW.com
YouTube channel: CoffingDW
Email: Tom.Coffing@CoffingDW.com