I have never seen analytic queries perform faster than on a Yellowbrick Data Warehouse. Today, I begin a series on how to tune a Yellowbrick system, starting with one of the most critical fundamentals for tuning any parallel processing system, including Yellowbrick.
Yellowbrick does not require tuning, but you can quickly enhance the already blistering performance with a few tricks.
I recently did a webinar with Yellowbrick titled “Load and Go with Yellowbrick and Nexus.” Yellowbrick supplies the data warehouse, and my Nexus provides the ability to migrate data from any data source directly to Yellowbrick. Nexus also automates cross-system joins (federated queries) between Yellowbrick tables and any combination of tables from other database platforms.
Using Nexus, Lorenzo Danesi (Yellowbrick System Engineer) ran a query against a one-billion-row table on Yellowbrick in one second. He then ran a federated query joining Yellowbrick and Teradata tables containing 10 million rows in two seconds. That kind of speed is mind-boggling.
I will first cover the basics of table options and then show you the essential tuning technique.
First of all, why is Yellowbrick so fast? There are multiple reasons for the blistering speeds. Yellowbrick is a massively parallel database that uses modern hardware along with SSD and memory techniques that move data from disk directly into CPU cache, reserving main memory for larger jobs. Yellowbrick loads data with its high-speed ybload utility into a row store, which is then converted to columnar storage for queries. When you combine parallel processing, columnar storage, and high-speed reads from disk directly to the CPU, you get the best architecture for analytics.
The next fundamental to understand is this: whenever you want to tune a parallel processing system for maximum speed, focus first on how tables join.
Why is focusing on joining tables so important when designing tables? A parallel processing system processes data in parallel. The best way to explain parallel processing is to paint a picture through a story. Here goes!
Two guys were out on a Saturday night having a great time dancing, singing, and celebrating. At midnight one of the fellows said, “I have to leave now to go do my laundry.” The other man was confused and said, “But we are having so much fun tonight. You can’t leave to do laundry. It is Saturday night!”
The friend explained that he does his laundry at a laundromat, and if he waits until Sunday, the laundromat will be crowded. He will be lucky to get a single washing machine, and he will be there all day. However, if he goes on Saturday night, he will have ten washing machines to himself and can get all of his wash done in an hour.
That is parallel processing!
Like a laundromat, where each washing machine is a separate entity that shares nothing, a parallel processing machine has separate processes that do not share CPU, disk, or memory. A table's data distributes evenly across the parallel processes (workers), and each worker processes the data it owns simultaneously. The key is that each worker has a memory pool and CPU separate from all the others.
See the picture below and notice that we have four workers, but we can scale up the number of workers by orders of magnitude. Also, notice that the Orders table's rows distribute evenly across the workers, with each worker responsible for its portion of the work. Although the example is overly simplistic, each worker, just like a washing machine, is a separate entity. Still, each worker takes commands from the Manager Node so they work together, as a conductor guides an orchestra.
The next critical subject to understand when working with a Massively Parallel Processing (MPP) data warehouse is columnar storage. Here is what is essential to know: the rows of a table spread across the workers, but then each worker takes all of the rows it owns and stores each column separately.
The example below depicts a single worker, which is what Yellowbrick calls a parallel processing unit. On Netezza, these were called SPUs (Snippet Processing Units), and on Teradata, they are called Access Module Processors (AMPs). The worker below owns four rows of a table, which I have color-coded for you; once the worker receives all of the rows it holds for a table, it stores each column in a separate data block. This technique is where the name columnar comes from.
When joining tables, all parallel processing systems must have the joining rows from the different tables on the same worker. If not, the system must redistribute or copy the data so the joining rows are in the same worker's memory.
Think of it as doing your laundry and sorting the clothes that need to be washed in hot, warm, or cold water. You must get all of the clothes that need hot water into one washing machine and all of the clothes that need cold water into another. Because each Yellowbrick worker processes data using its own CPU and memory, the matching rows must be on that worker.
You can also think of it this way: you might have family living in different places across the city, but everyone gathers at one house for a birthday celebration. Rows from different tables might be on other workers, but they must all be on the same worker when you join them together.
Yellowbrick offers three brilliant techniques for creating a table that make it easy to tune for joins. The idea here goes back to our birthday celebration. The concept is to take all of your relatives from across the city and make sure the entire family lives in the same apartment complex. Then, gathering for the birthday celebration in one apartment is quick and easy because everyone lives in the complex and no one has to travel.
The first technique is to create the table with a distribution key in the CREATE TABLE statement. See the picture below and notice the DISTRIBUTE ON (column name) clause at the end.
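As a minimal sketch of the syntax (the table and column names here are hypothetical, not from the webinar), the distribution key is declared at the end of the CREATE TABLE statement:

```sql
-- Hypothetical Orders table. DISTRIBUTE ON hashes each row's order_id
-- to decide which worker owns and stores that row.
CREATE TABLE orders (
    order_id    BIGINT NOT NULL,
    customer_id BIGINT,
    order_date  DATE,
    order_total DECIMAL(12,2)
)
DISTRIBUTE ON (order_id);
```

Every row with the same order_id hash lands on the same worker, which is what makes the join tuning below possible.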
When two huge tables join together, the matching rows will be together on the same worker if you give both tables the same distribution key as the join column. Take a look at the picture below. Notice we have two huge tables that join thousands of times per day. If you make the distribution key of both tables the column they join on, all matching rows will automatically be on the same worker.
Here is what two joining tables look like on Yellowbrick workers when the join column (PK/FK) is the distribution key of each table. Imagine we have two large tables with millions of rows joining together. When you make the join column the distribution key of both tables, the matching rows are “Worker Local.” Worker Local means that each worker naturally has the matching rows for Table A and Table B on the same worker, so the join runs super fast.
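Here is a hedged sketch of the idea, using hypothetical definitions for Table A and Table B. Because both tables distribute on the join column, each worker can join the rows it already owns:

```sql
-- Table A (parent) and Table B (child) share the same distribution
-- key, which is also the PK/FK join column.
CREATE TABLE table_a (
    a_id  BIGINT NOT NULL,
    descr VARCHAR(50)
)
DISTRIBUTE ON (a_id);

CREATE TABLE table_b (
    b_id   BIGINT NOT NULL,
    a_id   BIGINT NOT NULL,
    amount DECIMAL(12,2)
)
DISTRIBUTE ON (a_id);

-- Matching rows are "Worker Local," so no rows move over the
-- network when this join runs.
SELECT a.descr, SUM(b.amount)
FROM table_a a
JOIN table_b b ON a.a_id = b.a_id
GROUP BY a.descr;
```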
If the matching color-coded rows were on different workers, the system would have to find a way to bring those rows together. The query plan would transfer rows between workers, and this takes time and energy. Below, no movement over the network has to happen because no rows need to move. Joining rows “Worker Local” is the number one priority for tuning any parallel processing system.
The previous technique is for joining two huge tables, but what happens when you join a colossal fact table with many smaller dimension tables? Here is a method for creating the smaller dimension tables. Instead of using a distribution key in the CREATE TABLE statement, you can use a DISTRIBUTE REPLICATE clause instead, which copies the entire table to every worker. When the dimension joins the huge table, the matching rows are always available because the smaller table is present on every worker.
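A sketch of the replicate option, again with hypothetical names; the dimension table carries DISTRIBUTE REPLICATE instead of a distribution key:

```sql
-- Small dimension table copied in full to every worker, so any join
-- against it is automatically "Worker Local."
CREATE TABLE product_dim (
    product_id   BIGINT NOT NULL,
    product_name VARCHAR(100)
)
DISTRIBUTE REPLICATE;
```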
This concept is again like doing your laundry and needing to put a softener with each load. You put in a softener cup with each load of laundry, and it works equally for each washing machine.
It might seem wasteful to duplicate data on each worker, but joins run so much faster that it makes a great deal of sense. Moving rows at join time is expensive; replication is not.
The next example below shows why a table that replicates on all workers joins so well. Table B has a DISTRIBUTE REPLICATE clause, so the entire table copies to each worker. Although the pictorial shows Table A with only a few rows, imagine there are millions of rows. The concept is that the matching rows are “Worker Local” because the replicated table duplicates all rows on all workers. Although some might argue that we are duplicating data unnecessarily, it speeds up joins so much that it is well worth it.
The final example is to create a table with a DISTRIBUTE RANDOM clause. In the picture below, the illustration shows a table with random distribution, which spreads the rows of the table evenly across the workers. Use this technique when you want a staging table, or when the table is enormous and you need every parallel process to handle the same number of rows. The key here is to make sure that when you join this table to smaller tables, the smaller tables replicate across all workers.
The illustration below shows Table A with random distribution, which distributes the data randomly but evenly. Table B replicates and copies in its entirety to all workers. Although the pictorial shows Table A with only a few rows, imagine there are millions of rows. The concept is that the matching rows are “Worker Local” because the replicated table duplicates all rows on all workers.
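A sketch combining the two options, with hypothetical names: a big staging table spread randomly, joined to a replicated dimension:

```sql
-- Enormous staging table: rows spread randomly but evenly so every
-- worker processes the same share of the data.
CREATE TABLE stage_sales (
    sale_id    BIGINT,
    state_code CHAR(2),
    sale_amt   DECIMAL(12,2)
)
DISTRIBUTE RANDOM;

-- Small dimension replicated to all workers.
CREATE TABLE state_dim (
    state_code CHAR(2) NOT NULL,
    state_name VARCHAR(50)
)
DISTRIBUTE REPLICATE;

-- The join is "Worker Local" because every worker holds the full
-- dimension, no matter where each staging row landed.
SELECT d.state_name, SUM(s.sale_amt)
FROM stage_sales s
JOIN state_dim d ON s.state_code = d.state_code
GROUP BY d.state_name;
```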
A DBA creates a table with DISTRIBUTE REPLICATE for one reason: to ensure that the matching rows are present on each worker when the table joins to a large table. When a smaller table replicates, it works perfectly whether the larger table has a distribution key or a random distribution. Either way, the joins are now “Worker Local.”
When Nexus migrates data from another database to Yellowbrick, these techniques are automatic. If the migrating source tables come from Netezza, Teradata, Redshift, or Azure Synapse, Nexus honors the source's previous distribution style. However, you can also have Nexus assign distribution styles by size, so the smaller tables get a distribution style of replication.
If you want to learn more about a Yellowbrick Data Warehouse or how Nexus migrates and joins data to help manage a multi-vendor hybrid cloud environment, check out the webinar recording below.
Webinar Recording of Load and Go with Yellowbrick Data Warehouse and Nexus
Here is the link to the webinar recording: https://www.yellowbrick.com/media/load-and-go-with-yellowbrick-and-nexus-webinar
Yellowbrick Data Warehouse is a modern, standards-based MPP analytic database that shatters ceilings on price/performance, achieving 100X performance versus legacy alternatives at a fraction of their cost. With its unified hybrid cloud architecture, applications can deploy anywhere with the same data and performance, de-risking cloud migrations and unlocking multi-cloud innovation.
Coffing Data Warehousing’s Nexus Desktop and Nexus Server are the perfect combination for a remote workforce managing a hybrid-cloud environment. From the desktop, users can query every system, migrate to every system, and join data across every database. Users set up everything with the click of a button, but the Nexus Server executes the work. The Nexus Desktop and Server ensure that all processing runs on a high-speed network so users can work from any location. We are migrating millions of rows per second to Yellowbrick.
Why not try out a Proof of Concept (POC) of Yellowbrick and Nexus? The POC is free for 30 days. Please contact me for more information or check out http://www.Yellowbrick.com.
Tom Coffing, better known as Tera-Tom, is the founder of Coffing Data Warehousing where he has been CEO for the past 25 years. Tom has written over 75 books on all aspects of Teradata, Netezza, Yellowbrick, Snowflake, Redshift, Aurora, Vertica, SQL Server, and Greenplum. Tom has taught over 1,000 classes worldwide, and he is the designer of the Nexus Product Line.