Synapse Fundamentals for Tuning – Indexes Part1

If you have been around SQL Server for any length of time, you know by now that indexes are critical for performance. In Synapse Dedicated SQL Pools, indexes play a lesser role in query tuning because they do not impact the DSQL plans but rather impact the SQL plans on each of the distributions for each of the steps. When tuning a Synapse query, indexes are still important but are generally not the lowest hanging fruit when looking for a quick performance boost. Generally speaking, the number one thing you can do to boost performance quickly with very little effort, is simply to update statistics. If statistics are all accurate and you notice a particular step or multiple steps in the DSQL plan taking a long time, you should then look at indexes.

Defaults

When creating a table in Synapse Dedicated SQL Pools, the default index on that table is a clustered columnstore index (CCI) if not specified. However, CCI is not always the best option for tables but is a good starting place for large tables which should be a good portion of the tables in your data warehouse. Below I will summarize the index types and when they are and are not ideal.

Index Types

Clustered Columnstore indexes are great for large tables with lots of integers. That doesn’t mean that they aren’t good for character fields, but those fields just don’t compress as well as integers/numerics/bits/dates. (Note that LOB datatypes such as varchar(max) are not supported in CCI’s.) CCI’s store the data in a columnar format, grouped into chunks called rowgroups. Rowgroups are usually asynchronously compressed (via a background process called the tuple mover) to improve storage and performance many times over rowstore storage. For datasets smaller than 60 million records in Synapse, a clustered index may perform better due to the overhead of any uncompressed records which are stored in rowstore format (in what is called the delta store) until they are compressed (generally when the rowgroup reaches ~1 million records).

Clustered Indexes are exactly what they always have been in SQL Server – a rowstore b-tree index that stores data in sequential pages. Records are ordered for fast data retrieval. The only difference in Synapse is that the data is sharded across your distributions first based upon the distribution property of the table. In an ideal world, your clustered index will be 1/60^th of the total size of the table and therefore perform much faster than in regular SQL Server.

Heaps just like clustered indexes are simply data pages stored in rowstore format with no ordering. As data is deleted from a heap, you will need to rebuild it to help performance. This is no different than SQL Server just distributed across 60 distributions.

Nonclustered Indexes: Sometimes these are called secondary indexes. The data is not physically stored here, but this is a b-tree index that exists in addition to the physical storage of the table (CCI, clustered, or heap). Nonclustered indexes are created to improve performance of specific queries. Generally, composite indexes can be created for filter and join criteria to improve performance. In Synapse Dedicated SQL Pools, nonclustered indexes are generally not needed as frequently since a combination of CCI rowgroups and the distribution of data separates data into small enough pieces that a secondary index doesn’t produce significant gain in performance (in most but not all cases).

I will discuss index maintenance in a follow up post which will provide greater insight into CCI internals.