What is Parallel Processing?
Two guys were having fun on a Saturday night when one said, “I’ve got to go and do my laundry.” The other said, “What!?” The first man explained that if he went to the laundry mat the next morning, he would be lucky to get one machine and be there all day. But if he went on Saturday night, he could get all the laundry machines. Then, he could do all his wash and dry in two hours. Now that’s parallel processing mixed in with a little dry humor!
Yellowbrick has Linear Scalability
Yellowbrick was born to be parallel. Parallel processing is by Worker. One worker per node or blade. A Yellowbrick system can grow from 8 nodes to up to 45 nodes. Three different density NVMe drives are supported. HD, VHD, and EHD. A Worker node has eight physical NVMe drives and can hold between 8TB and 32TB uncompressed. The range of data is because of the 4x compression, so they start at 24 TB per node on the low-end system and 128 TB per node at the high end. Each worker node has 36 cores. The above specs are why Yellowbrick is so incredibly fast.
Yellowbrick Table Distribution Style Options
Yellowbrick gives you three great choices to distribute your tables. If you have two tables that join a lot and they are about the same size, then you want to give them both the same distribution key as the join key. Identical distribution keys on the join co-locate the matching rows on the same worker. Two rows that join must be on the same worker (or Yellowbrick will move one or both rows temporarily to satisfy the joining requirement). If you join two tables a lot, but one table is humongous, and the other is small, then you want to have a little table distributed by REPLICATE. Use your distribution key to ensure joins happen faster, but also use it to spread the data as evenly among the workers as possible.
Distribution Key Where the Data is Unique
The entire row of a table is on a worker, but each column in the row is in a separate container (block). A Unique Distribution Key spreads the rows of a table evenly across the workers. A proper Distribution Key is the key to good distribution! Only one column can be a distribution key.
A Table with a Sort Key
We have chosen the Emp_No column as both the distribution key and the sort key. Yellowbrick can control both!
Distribution Key Where the Data is Non-Unique
The data did not spread evenly among the workers for this table. Do you know why? The Distribution Key is dept_no. All like values went to the same worker. This distribution isn’t perfect, but it is reasonable, so it is an acceptable practice. Be careful because this is how skew among workers can become a problem.
Distribution Key is Replicate
When replicate is selected as the distribution key, the entire table copies to each worker. This brilliant technique is used for smaller tables, such as a dimension table. Now, joins will be super fast.
Random Distribution Key
The data is spread evenly among the workers for this table. Do you know why? The Distribution Key is Random. The information divides randomly but equally. The purpose of random is to spread the data evenly. This technique ensures that your largest fact table will have perfect parallel processing. When you join a fact table that distributes random with multiple dimension tables that distribute replicate you are guaranteeing perfect parallelism and joins tuned for speed.
Fact and Dimension Table Distribution Key Design
The fact table (line_order_fact_table) is the largest, but the part_table is the largest dimension table. That is why you make Part_Key the distribution key for both tables. Now, when these two tables join, the matching Part_Key rows are already on the same worker. You can then distribute by Replicate on the other dimension tables. Each of these tables will have all their rows on each worker. Now, everything that joins to the fact table co-locates naturally!
Improving Performance by Defining a Sort Key
There are three fundamental reasons to use the sortkey keyword when creating a table. 1) If more recent data query most frequently, specify the timestamp or date column as the leading column for the sort key. 2) If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. 3) If you frequently join a (dimension) table, specify the join column as the sort key. Above, you can see we have made our sortkey the Order_Date column. Look at how the data sorts!
Range Partitioning Per Day
Range partitioning allows you to define partitions on a table based on ranges of values bounded by an interval. Optionally, you can create two separate partitions specifically for values that fall outside (above or below) the specified range. One of the partitions will contain values below the range, and one will contain values above the range.
Visual of Range Partitioning Per Day
Creating a Table with a Cluster Key
A clustered table maintains an internal map that organizes data as a set of points across the dimensions with cluster column definitions. This map optimizes query performance because the optimizer can skip the blocks of data that do not qualify for multiple restrictions. One pass over the internal map finds the intersection of rows that meet the WHERE clause criteria, so most data in the base table can be eliminated very early in the query plan.
Do you want to receive the entire Yellowbrick Architecture and SQL book for free? The book is over 600-pages and has every SQL command. Just email me at Tom.Coffing@CoffingDW.com. I will send the PDF to you within 24-hours!
Tom Coffing, better known as Tera-Tom, is the founder of Coffing Data Warehousing where he has been CEO for the past 25 years. Tom has written over 75 books on all aspects of Teradata, Netezza, Yellowbrick, Snowflake, Redshift, Aurora, Vertica, SQL Server, and Greenplum. Tom has taught over 1,000 classes worldwide, and he is the designer of the Nexus Product Line.