Daniel Crawford's Data Blog

Tips for performant database designs

Synapse Fundamentals for Tuning – Replicated Tables

Posted on January 20, 2023 (updated June 23, 2023) by Daniel Crawford

I covered some of the basics of table geometry in this post: Synapse Fundamentals for Performance Tuning – Distribution; however, it’s critical to dive deeper into the performance impact of choosing the correct table distribution. Sometimes we think we have the correct distribution chosen for a table but are painfully mistaken as performance suffers. That is usually fairly obvious when execution plans show shuffle or broadcast moves taking longer than expected. But what happens when you don’t see data movement and queries on an individual node still take too long? Because replicated tables by design eliminate data movement, their performance impact looks different from round-robin or hash-distributed tables. So here I want to drill into replicated tables and their pitfalls. Replicated tables are often adopted in the name of limiting data movement, but used incorrectly they can have the opposite effect on performance. Below I briefly list some causes of poor performance and the symptoms you should watch out for when a replicated table is used incorrectly. (Reminder: if you need a refresher on distribution and how replicated tables work, see the post linked above.)
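To ground the discussion, here is a minimal sketch of how the two geometries are declared in a dedicated SQL pool; the table and column names are my own illustration, not from this post:

```sql
-- Hash-distributed table: rows are spread across the pool's 60 distributions
-- by hashing the chosen column (hypothetical table and column names).
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT        NOT NULL,
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);

-- Replicated table: a full copy is cached on every compute node,
-- eliminating data movement for joins against it.
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT         NOT NULL,
    Region     VARCHAR(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
```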

  • Large replicated tables are a problem because they can take too long to cache, which consumes resources for extended periods of time and can cause queuing. 
  • Replicated tables that are consistently joined to other replicated tables can be a problem because they limit the compute power of the environment.  Replicated tables essentially remove the MPP aspect of the environment by focusing all data processing on a single node when they are not joined to a round-robin or hash-distributed table.  This is problematic for large tables, which benefit from being spread across multiple compute nodes and processed in parallel rather than on a single node (which is the result of joining two or more replicated tables).  Note that this is the same principle as choosing the wrong column for the hash key of a distributed table: don’t consolidate processing onto a single processor when the work could be spread evenly across many.
  • Dynamic replicated tables can be troublesome because the in-memory cache is invalidated by every write action (insert, update, or delete).  This means that on every subsequent read there is the additional performance cost of re-caching the entire table to every node. This is a very common issue, one that may go unnoticed in initial development and only becomes a problem in production when the table is used differently than planned.
  • Frequently scaled dedicated SQL pools are a problem because cached tables are invalidated when the environment is scaled.  Paused environments typically retain the cache (unless maintenance is performed during pause/resume), but scaling adds or removes compute nodes, which discards the in-memory cache.  After the environment is scaled, the cache must be rebuilt on the first read of each table. 
  • Too many replicated tables can be a problem if they are all processed at the same time each day and are then all asked to re-cache their data upon first read.  On first read, a background system process runs the “BuildReplicatedTableCache” operation using a small resource class.  More importantly, only two tables are cached in parallel, so multiple simultaneous recache requests will cause significant queuing. Note that you can see this process as an additional request in the DMV “sys.dm_pdw_exec_requests”.
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE command LIKE '%BuildReplicatedTableCache%';
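You can also check which replicated tables currently have a valid cache through the sys.pdw_replicated_table_cache_state DMV; a short sketch, joining to sys.tables for readable names:

```sql
-- Cache state per replicated table: a state of 'NotReady' means the next
-- read will trigger a BuildReplicatedTableCache operation.
SELECT t.[name], c.[state]
FROM sys.pdw_replicated_table_cache_state AS c
JOIN sys.tables AS t
    ON c.[object_id] = t.[object_id];
```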

Replicated tables can be a boon to performance if they are implemented correctly and sparingly.  Too often users see that making one table replicated solves a performance problem in one instance, then go overboard making many or most tables replicated, which comes back to bite them.  Replicated tables should be the exception, not the default.

There is at least one anti-pattern that I do want to highlight here.  While, academically speaking, we should only replicate tables with less than 2 GB of data, it is not unusual for tables to be many times larger than 2 GB.  If you have a query pattern where statistics cannot be calculated for a table as part of a join (which results in unpredictable data movement, possibly of the wrong dataset), you should consider replicating the smaller dimension table even when it is larger than 2 GB.  Note that this decision should be proportional to the size of the fact table.  For example, replicating a 5 GB dimension table should be considered instead of risking the data movement of a 50 TB fact table.
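Because a table’s distribution cannot be altered in place, switching an oversized dimension like this to replicated is typically done with CTAS followed by a rename; a sketch with hypothetical table names:

```sql
-- Re-create the dimension as replicated via CTAS (hypothetical names).
CREATE TABLE dbo.DimProduct_replicated
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.DimProduct;

-- Swap the new table in under the original name.
RENAME OBJECT dbo.DimProduct TO DimProduct_old;
RENAME OBJECT dbo.DimProduct_replicated TO DimProduct;
```

Remember that the first read after the swap still has to build the replicated cache, so schedule it ahead of peak query hours.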

To dive deeper into replicated tables, be sure to check out this post: All you need to know about Replicated Tables in Synapse dedicated SQL pool – Microsoft Community Hub


