Daniel Crawford's Data Blog

Tips for performant database designs

Synapse Fundamentals for Performance Tuning – Distribution

Posted on August 12, 2022 (updated September 28, 2022) by Daniel Crawford

If you are coming from a SQL Server world, what are the fundamental concepts in Synapse that are essential to building a performant Dedicated SQL Pool? How does Synapse differ from regular SQL Server? Let's start with the most foundational principle: distribution.

You might hear the term “MPP” thrown around when we talk about Synapse. MPP stands for Massively Parallel Processing, which is just a fancy way of saying that multiple compute nodes process data in parallel. This contrasts with SQL Server, which is traditionally an SMP, or Symmetric Multi-Processing, platform (a single compute node with multiple processors). Synapse Dedicated SQL Pools consist of 60 distributions under the covers and can therefore scale to as many as 60 compute nodes, each with multiple processors of its own. This allows for scalability at a massive level.

So how does this explain why queries can be slower than expected in a Dedicated SQL Pool?

Say, for example, you have a query running in SQL Server: it scans 100,000 records and returns 1,000 records to the client in 10 seconds. If the same dataset is moved to Synapse, you might expect it to run in a fraction of the time (sub-second, in fact) because there is so much more compute power behind it. But you might be surprised when it doesn't complete any faster and is potentially even slower for small datasets. That is because data must be distributed effectively to utilize all the compute power available.

Because we have 60 distributions, our data is sharded across 60 different SQL instances, with a DW query engine coordinating the work across them. If you have ever partitioned data in SQL Server and created a partition scheme that separates partitions onto separate disks, this is an almost identical concept. The difference is that instead of a partition column we use a distribution column, and neither a partition function nor a scheme needs to be defined manually in Synapse. A hash algorithm hashes the distribution field and spreads the data across the 60 distributions automatically. This, however, doesn't necessarily mean the data is distributed evenly, and that is where we can run into problems.
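For illustration, here is a minimal sketch of a hash-distributed table (the table and column names are hypothetical, not from this post):

    CREATE TABLE dbo.FactSales
    (
        SalesOrderNumber  varchar(20) NOT NULL,
        CustomerKey       int         NOT NULL,
        OrderDateKey      int         NOT NULL,
        SalesAmount       money       NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(SalesOrderNumber),  -- rows with the same hash value land on the same distribution
        CLUSTERED COLUMNSTORE INDEX             -- the typical storage choice for large fact tables
    );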

We run into performance problems when data is skewed, that is, distributed unevenly, putting large quantities of data on a few distributions instead of spreading the workload evenly across all of them. Skew can be introduced in the physical storage of the data as well as in the querying of data (think query filters or aggregations on heavily skewed fields). A query can only run as fast as its slowest distribution, so we want to avoid skew not only on disk but also in our queries. Additionally, and frequently, choosing the wrong distribution key causes data movement between distributions. This data movement is typically the largest consumer of resources and contributes the most to the overall wall-clock time of a query. Choosing the right distribution key is critical to good performance in Synapse.
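If you want to see whether a query spent its time moving data, the sys.dm_pdw_request_steps DMV exposes shuffle and broadcast move steps per request. A minimal sketch (the request ID is a hypothetical placeholder you would look up in sys.dm_pdw_exec_requests):

    -- Find the data movement steps for a given request and how long each took
    SELECT request_id, step_index, operation_type, row_count, total_elapsed_time
    FROM sys.dm_pdw_request_steps
    WHERE request_id = 'QID1234'  -- hypothetical request id
      AND operation_type IN ('ShuffleMoveOperation', 'BroadcastMoveOperation')
    ORDER BY total_elapsed_time DESC;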

How to choose the right distribution key

As stated above, there are two components to choosing a good distribution key. 

  • First, choose a field with high cardinality. Having 1,000 or more unique values is a good place to start; the more unique the field, the better the distribution will be. (A quick way to check this is sketched after this list.)
  • Second, choose a field that will frequently be used to join to other tables and/or be included in aggregations (GROUP BY clauses). The goal is to move as little data as possible to complete queries.
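To gauge cardinality for a candidate column, a simple distinct count is enough. A minimal sketch, again with hypothetical table and column names:

    -- Compare unique values to total rows for a candidate distribution key
    SELECT COUNT(DISTINCT SalesOrderNumber) AS distinct_values,
           COUNT(*)                         AS total_rows
    FROM dbo.FactSales;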

Do surrogate keys make for a good hash distribution key?

In most scenarios, I recommend using natural keys instead of a surrogate key. Natural or business keys are more likely to be used for aggregations. From an end-user standpoint, it also makes more sense to distribute on a natural key when using reporting tools such as Power BI, where data consumers may not be aware of the internals of Synapse or the metadata behind the tables.

While surrogate keys have great cardinality due to their uniqueness, they bring a list of other issues. Probably the biggest challenge is that they have to be created sequentially at load time, which can make the load process very costly. Synapse doesn't enforce primary key or unique constraints, so you must ensure uniqueness manually in your load process, and that is very costly when not done in batch. In short, I recommend not using surrogate keys at all (unless absolutely necessary due to overlapping source-system business keys or other reasons) and using natural keys whenever possible.

If you have already loaded data into your Dedicated SQL Pool, you can easily check skew using the guidance here:  Distributed tables design guidance – Azure Synapse Analytics | Microsoft Docs
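One quick check from that guidance is DBCC PDW_SHOWSPACEUSED, which reports row counts and space used per distribution so skew is visible at a glance (table name hypothetical):

    -- One result row per distribution; large differences in ROWS indicate skew
    DBCC PDW_SHOWSPACEUSED('dbo.FactSales');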

What are my other options?

Until now, I have been talking about hash distribution in Synapse, but there are two other distribution types we haven't covered.

A round-robin table is a distributed table without a hash key; records are assigned to distributions in round-robin fashion as they arrive. This usually distributes records evenly, which is great because you are unlikely to have any skew. (My teammate Mark Price-Maher discusses an exception to this rule here.) The problem with round-robin distribution is that it always results in data movement unless you are querying the entire table with no aggregations and no joins to other tables. Round-robin tables should typically be used where fast writes are more important than subsequent reads, or where the entire table will be read (think staging tables). When creating tables, if you don't specify a distribution type, the default is round robin. Be careful not to fall into the trap of letting every table be a round-robin table; performance across your Dedicated SQL Pool will be severely impacted.
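A staging table is the classic use case. A minimal sketch (names hypothetical):

    CREATE TABLE dbo.StageSales
    (
        SalesOrderNumber  varchar(20) NOT NULL,
        CustomerKey       int         NOT NULL,
        SalesAmount       money       NOT NULL
    )
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,  -- rows spread evenly as they arrive, no hash key needed
        HEAP                         -- a heap keeps the write path fast for staging loads
    );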

Replicated tables are just what they say: data is replicated into a cache on each of the compute nodes, so it becomes co-located for joins and no additional data movement is needed to complete a query. But there are a few gotchas to watch out for:

  1. Replicated tables are physically stored as round robin tables.
  2. The cache on each of the compute nodes is invalidated every time you write to the table (insert, update, delete).
  3. The cache rebuilds on the first read of the data from disk and therefore the first read is slower.
  4. A maximum of 2 cache rebuild processes can run at a time and they do consume resources on your environment.

While replicated tables sound very appealing because they eliminate data movement, they can also hurt you if the table is too dynamic, too large (less than 2 GB is a rule of thumb), or if you join too many of them together, defeating the purpose of an MPP platform built to distribute work across multiple compute nodes.
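For completeness, a minimal sketch of a small dimension table set up as replicated (names hypothetical):

    CREATE TABLE dbo.DimProduct
    (
        ProductKey   int           NOT NULL,
        ProductName  nvarchar(100) NOT NULL,
        Category     nvarchar(50)  NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,      -- a full copy is cached on each compute node
        CLUSTERED INDEX (ProductKey)   -- small dimensions often do better with a clustered index than columnstore
    );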

This has been a very abbreviated overview of distribution, but it provides the context necessary for the next fundamental of Synapse query performance: statistics.
