Amazon Redshift, a widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post describes one such new feature: sort keys for multidimensional data layouts.
Amazon Redshift improves query performance by supporting sort keys for multidimensional data layouts. This is a new type of sort key that sorts a table's data by filter predicates rather than by the table's physical columns. Sort keys for multidimensional data layouts can significantly improve table scan performance, especially when your query workload includes repetitive scan filters.
Amazon Redshift already offers Automatic Table Optimization (ATO), which automatically optimizes table design by applying sort and distribution keys without requiring administrator intervention. This post introduces sort keys for multidimensional data layouts as an additional feature provided by ATO and powered by Amazon Redshift's Sort Key Advisor algorithm.
Sort keys for multidimensional data layouts
When you define a table with an AUTO sort key, Amazon Redshift ATO analyzes your query history and automatically selects either a single-column sort key or a multidimensional data layout sort key for the table, based on which option better suits your workload. When a multidimensional data layout is selected, Amazon Redshift builds a multidimensional sort function that co-locates rows that are typically accessed by the same queries. The sort function is then used during query execution to skip data blocks and even to skip scanning the individual predicate columns.
Consider the following user query. This is the dominant query pattern in user workloads.
Amazon Redshift stores the data for each column in 1 MB disk blocks, and stores the minimum and maximum values of each block as part of the table's metadata. When a query uses range-restricted predicates, Amazon Redshift can use the minimum and maximum values to quickly skip many blocks during table scans. However, this query's filter on the subregion column cannot be evaluated against minimum and maximum values to determine which blocks to skip. As a result, Amazon Redshift scans every row in the titles table.
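The following Python sketch illustrates the idea of skipping blocks by their minimum and maximum values, and why a substring filter gets no help from that metadata. It is a simplified model rather than Amazon Redshift's implementation: the Block class, the helper functions, and the subregion values are all hypothetical, and each block holds two values instead of 1 MB of data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One storage block of a single column (hypothetical model)."""
    values: List[str]

    @property
    def lo(self) -> str:
        # Minimum value kept as block metadata.
        return min(self.values)

    @property
    def hi(self) -> str:
        # Maximum value kept as block metadata.
        return max(self.values)


def blocks_for_range(blocks: List[Block], low: str, high: str) -> List[int]:
    """Indexes of blocks whose [lo, hi] interval overlaps [low, high]."""
    return [i for i, b in enumerate(blocks) if b.hi >= low and b.lo <= high]


def blocks_for_substring(blocks: List[Block], needle: str) -> List[int]:
    """Min/max values say nothing about substrings, so no block is ruled out."""
    return list(range(len(blocks)))


# Hypothetical subregion values, sorted on the column and split into blocks.
blocks = [
    Block(["central europe", "east asia"]),
    Block(["eastern united states", "northern africa"]),
    Block(["south america", "western united states"]),
]

# A range predicate touches only the blocks whose metadata overlaps it.
range_hits = blocks_for_range(blocks, "a", "d")

# A substring predicate has to read every block, even ones with no match.
substring_hits = blocks_for_substring(blocks, "united states")
blocks_with_matches = [
    i for i, b in enumerate(blocks) if any("united states" in v for v in b.values)
]

print(range_hits, substring_hits, blocks_with_matches)
```

In this toy data the matching rows sit in two different blocks even though the column is sorted, and the block without a match still has to be read, which is the situation the paragraph above describes.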
When the user's query is run against the titles table using a single-column sort key on subregion, the result shows that the table scan read 2,164,081,640 rows.
To improve scans of the titles table, Amazon Redshift might automatically decide to use a sort key for a multidimensional data layout. All rows that satisfy the lower(subregion) like '%united states%' predicate are co-located in a dedicated region of the table, so Amazon Redshift scans only the data blocks that satisfy the predicate.
When the user's query is run against the titles table using a multidimensional data layout sort key that includes lower(subregion) like '%united states%' as a predicate, the result of the sys_query_detail query shows that the table scan read 152,324,046 rows, which is only 7% of the original row count, and that it used the multidimensional data layout sort key.
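As a quick check of the 7% figure, the following Python snippet divides the two row counts reported in this post. The variable names are chosen here for illustration only.

```python
# Row counts reported for the two table scans in this post.
rows_single_column_sort = 2_164_081_640
rows_multidimensional = 152_324_046

# Fraction of the original rows still read with the multidimensional layout.
fraction_read = rows_multidimensional / rows_single_column_sort
fraction_skipped = 1 - fraction_read

print(f"{fraction_read:.2%} of rows read, {fraction_skipped:.2%} skipped")
```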
Although this example uses a single query to demonstrate the multidimensional data layout feature, Amazon Redshift considers all queries running against the table and can create multiple regions to satisfy the most commonly run predicates.
Let's look at another example, this time with a more complex predicate and multiple queries.
Imagine that you have a table items (cost int, available int, demand int) that consists of four rows, as shown in the following example.
id | cost | available | demand |
1 | 4 | 3 | 3 |
2 | 2 | 23 | 6 |
3 | 5 | 4 | 5 |
4 | 1 | 1 | 2 |
The main workload consists of two queries.
- 70% query pattern:
- 20% query pattern:
With traditional sorting techniques, you might choose to sort the table on the cost column so that evaluating cost > 3 benefits from the sort. The items table after sorting on the single cost column looks like the following:
id | cost | available | demand | region |
4 | 1 | 1 | 2 | Region #1, cost <= 3 |
2 | 2 | 23 | 6 | Region #1, cost <= 3 |
1 | 4 | 3 | 3 | Region #2, cost > 3 |
3 | 5 | 4 | 5 | Region #2, cost > 3 |
With this traditional sort, Amazon Redshift can immediately exclude the top two rows, ID 4 and ID 2, because they do not satisfy cost > 3.
On the other hand, a sort key for a multidimensional data layout sorts the table based on a combination of the two predicates that commonly occur in the workload, cost > 3 and available < demand. As a result, the rows of the table are sorted into four regions.
id | cost | available | demand | region |
4 | 1 | 1 | 2 | Region #1, cost <= 3, available < demand |
2 | 2 | 23 | 6 | Region #2, cost <= 3, available >= demand |
3 | 5 | 4 | 5 | Region #3, cost > 3, available < demand |
1 | 4 | 3 | 3 | Region #4, cost > 3, available >= demand |
This concept is even more powerful when applied to entire blocks rather than single rows, when applied to complex predicates that use operators not suited to traditional sorting techniques (such as like), and when applied to more than two predicates.
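The four-region layout can be sketched in Python as follows. This is only an illustration of sorting rows by a combination of predicate results, not the sort function that Amazon Redshift actually builds. The region and regions_to_scan helpers are hypothetical, the row values are the four example rows read as integers, and the filter combinations at the end are illustrative rather than the workload's actual queries.

```python
from typing import Dict, List, Optional

Row = Dict[str, int]

# The four example rows of the items table.
items: List[Row] = [
    {"id": 1, "cost": 4, "available": 3, "demand": 3},
    {"id": 2, "cost": 2, "available": 23, "demand": 6},
    {"id": 3, "cost": 5, "available": 4, "demand": 5},
    {"id": 4, "cost": 1, "available": 1, "demand": 2},
]


def region_number(high_cost: bool, short_supply: bool) -> int:
    """Map the two predicate results to regions 1 through 4."""
    return 1 + 2 * int(high_cost) + int(not short_supply)


def region(row: Row) -> int:
    """Region of a row under cost > 3 and available < demand."""
    return region_number(row["cost"] > 3, row["available"] < row["demand"])


# Sorting by region co-locates rows that satisfy the same predicates.
layout_ids = [row["id"] for row in sorted(items, key=region)]


def regions_to_scan(
    cost_gt_3: Optional[bool] = None,
    available_lt_demand: Optional[bool] = None,
) -> List[int]:
    """Regions that can hold matches; None means no filter on that predicate."""
    wanted = []
    for high_cost in (False, True):
        for short_supply in (True, False):
            if cost_gt_3 is not None and high_cost != cost_gt_3:
                continue
            if available_lt_demand is not None and short_supply != available_lt_demand:
                continue
            wanted.append(region_number(high_cost, short_supply))
    return wanted


print(layout_ids)
print(regions_to_scan(cost_gt_3=True, available_lt_demand=True))
print(regions_to_scan(available_lt_demand=True))
```

A filter on both predicates needs only one of the four regions, and a filter on available < demand alone needs two, whereas a sort on cost alone cannot narrow the second filter at all.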
System tables
The following Amazon Redshift system tables show whether tables and queries use multidimensional data layouts:
- To determine whether a particular table uses a multidimensional data layout sort key, check whether sortkey1 in svv_table_info is equal to AUTO(SORTKEY(padb_internal_mddl_key_col)).
- To determine whether a particular query uses a multidimensional data layout to speed up table scans, check step_attribute in the sys_query_detail view. The value is equal to multi-dimensional if the table's multidimensional data layout sort key was used during the scan.
Performance benchmark
We ran internal benchmark tests against multiple workloads with repetitive scan filters and found that introducing sort keys for multidimensional data layouts yields the following results:
- Total execution time is reduced by 74% compared to no sort key.
- Total execution time is reduced by 40% compared to using an optimal single-column sort key for each table.
- The total number of rows read from the tables is reduced by 80% compared to no sort key.
- The total number of rows read from the tables is reduced by 47% compared to using an optimal single-column sort key for each table.
Feature comparison
With the introduction of sort keys for multidimensional data layouts, you can now sort tables by expressions based on filter predicates that commonly occur in your workload. The following table compares the features of Amazon Redshift with those of two competitors.
Feature | Amazon Redshift | Competitor A | Competitor B |
Support for sorting by column | yes | yes | yes |
Support for sorting by expression | yes | yes | no |
Automatic column selection for sorting | yes | no | yes |
Automatic selection of expressions for sorting | yes | no | no |
Automatic selection between column sort or expression sort | yes | no | no |
Automatic use of expression sorting properties during scanning | yes | no | no |
Considerations
When using multidimensional data layouts, keep the following in mind:
- Setting a table as SORTKEY AUTO enables multidimensional data layout.
- Amazon Redshift Advisor automatically chooses either a single column sort key or a multidimensional data layout for your table by analyzing past workloads.
- Amazon Redshift ATO adjusts the sorting results of multidimensional data layouts based on how ongoing queries interact with your workload.
- Amazon Redshift ATO maintains sort keys for multidimensional data layouts in the same way as existing sort keys today. For more information about ATO, see Using Automatic Table Optimization.
- Sort keys in multidimensional data layouts work in both provisioned clusters and serverless workgroups.
- Sort keys for multidimensional data layouts work with existing data as long as the table has SORTKEY AUTO enabled and a workload with repetitive scan filters is detected. The table is reorganized based on the results of the multidimensional sort function.
- To override the multidimensional data layout sort key of a table, use ALTER TABLE table_name ALTER SORTKEY NONE. This disables the AUTO sort key feature for the table.
- Sort keys for multidimensional data layouts are preserved when you restore or migrate a provisioned cluster to a serverless cluster, and vice versa.
Conclusion
In this post, we showed that sort keys for multidimensional data layouts can significantly improve query runtime performance for workloads where the dominant queries include repetitive scan filters.
To create a preview cluster from the Amazon Redshift console, go to the Clusters page and choose Create preview cluster. You can create clusters and test your workloads in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm) regions.
We welcome your feedback on this new feature. We also welcome your comments on this post.
About the authors
Milind Oke is a data warehouse specialist solution architect based in New York. He has been building data warehouse solutions for over 15 years, specializing in Amazon Redshift.
Jialin Ding is an applied scientist in the Learned Systems Group, specializing in applying machine learning and optimization techniques to improve the performance of data systems such as Amazon Redshift.
Yanzhu Ji is a product manager on the Amazon Redshift team. She has product vision and strategy experience in industry-leading data products and platforms. She has strong skills in building substantial software products using web development, systems design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.