This performance consideration is not unique to Synapse but can be exacerbated in Synapse due to the sheer quantity of data being worked with. The CPU-intensive operations I am referring to come in three different flavors:
- Data Type Conversions
- Runtime Calculations
- Scalar Functions
These topics all have one thing in common: they are operations that could have been pushed further upstream into data warehouse preparation and handled in a batch process rather than at runtime.
Data Type Conversions done as part of report or ad hoc queries usually indicate that the appropriate transformations and preparation were not done when ingesting data into your data warehouse. Many times, you will see this type of activity when transforming strings to dates for filter predicates, or strings to numeric fields for calculations. While this is a common query pattern, you should be mindful that you are doing this conversion over and over again instead of doing it just once at the time of ingestion and preparation.
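To illustrate, here is a minimal sketch (the table and column names are hypothetical) contrasting a query that converts strings at runtime with one that reads natively typed columns prepared at ingestion:

```sql
-- Anti-pattern (hypothetical names): converting strings at query time.
-- The CASTs run for every row, on every execution of the report.
SELECT SUM(CAST(s.SaleAmountString AS decimal(18,2))) AS TotalSales
FROM dbo.FactSales s
WHERE CAST(s.SaleDateString AS date) >= '2021-01-01';

-- Preferred: convert once during ingestion/preparation so the table stores
-- native types, and report queries filter and aggregate directly.
SELECT SUM(s.SaleAmount) AS TotalSales
FROM dbo.FactSales s
WHERE s.SaleDate >= '2021-01-01';
```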
Runtime Calculations follow the same thinking as data type conversions and point back to whether the data was prepared properly. If you are running the same calculation again and again, materialize the calculation to the fact table. If the calculation is based upon a slowly changing dimension, you will have to decide whether to incur the cost of the calculation at runtime or the cost of transforming the data as part of ELT. Remember that in scenarios where you have a large ratio of change on fact tables, you can leverage CTAS and RENAME syntax to perform the transformation in batch and parallelize the update, as sketched below. For a reference of the different things you can do with CTAS, check out this article I posted several years ago: https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/common-data-warehouse-functionality-redefined-by-ctas/ba-p/718187
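As a sketch of that CTAS and RENAME pattern, assuming a hypothetical `FactSales` table and a currency conversion that is being recomputed at runtime:

```sql
-- Hypothetical sketch: materialize a repeated calculation in batch with CTAS,
-- then swap the new table in with RENAME.
CREATE TABLE dbo.FactSales_New
WITH (DISTRIBUTION = HASH(SaleKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT s.SaleKey,
       s.SaleDate,
       s.SaleAmount,
       s.ExchangeRate,
       s.SaleAmount * s.ExchangeRate AS SaleAmountUSD  -- calculated once, in parallel
FROM dbo.FactSales s;

RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;
```

Because CTAS is a fully parallelized operation across the distributions, this batch transformation is typically far cheaper than updating rows in place or recomputing the value on every query.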
Scalar Functions are functions that return a single value based upon an input. If they are used in the context of a large dataset that is being returned, the scalar function essentially performs a record-by-record process that is extremely inefficient. This can be further compounded if you are applying complex if/then/else logic and/or doing additional lookups. Scalar functions have a place, but that place is not evaluating data in large datasets that are repeatedly returned to the end user. We should love the idea of reusable code, but we should hate the idea of inefficient reusable code. Instead of scalar functions for result sets, think about preparing your data in batch. What is being done on a record-by-record basis can likely be done in a batch statement. If it is too complex and cannot be done in batch at the time of ingestion, consider breaking the steps apart to get to your final result set. Chances are that there is a series of transformations that needs to be done in a particular order but can still be done in batch.
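As a hedged illustration, assume a hypothetical margin-banding UDF named `dbo.fn_GetMarginBand`; the same logic can be folded into a batch CTAS step so downstream queries read a plain column:

```sql
-- Anti-pattern (hypothetical names): a scalar UDF invoked for every row
-- of a large result set, on every execution.
SELECT dbo.fn_GetMarginBand(s.Margin) AS MarginBand,
       COUNT_BIG(*) AS SalesCount
FROM dbo.FactSales s
GROUP BY dbo.fn_GetMarginBand(s.Margin);

-- Batch alternative: evaluate the same logic once during preparation so
-- queries read a materialized column instead of calling the function.
CREATE TABLE dbo.FactSales_Prepared
WITH (DISTRIBUTION = HASH(SaleKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT s.SaleKey,
       s.Margin,
       CASE WHEN s.Margin >= 0.30 THEN 'High'
            WHEN s.Margin >= 0.10 THEN 'Medium'
            ELSE 'Low'
       END AS MarginBand
FROM dbo.FactSales s;
```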
CPU-intensive operations become a problem when your entire environment is under CPU pressure. Workload management will throttle CPU when there is CPU pressure, but when overall CPU pressure is low, a single resource-intensive query may get more CPU than the percentage specified in its WLM group configuration. If you generally don’t have CPU pressure and are simply running one or two heavy queries, this article may not apply to you on the surface, but you still need to be cognizant of the overall runtime of the query, whether objects are locked for an extended period of time and blocking other queries, and whether the query is also causing memory pressure. CPU-intensive operations are many times a lower priority when tuning individual queries but become more obvious when tuning the entire workload on an environment.
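For reference, here is a minimal sketch of a workload group definition in a dedicated SQL pool (the group name and percentages are illustrative assumptions, not recommendations); the cap percentage is the configured value referenced above:

```sql
-- Minimal sketch of a workload group (hypothetical name, illustrative values).
-- Under low overall CPU pressure, a single heavy query may still receive
-- more CPU than CAP_PERCENTAGE_RESOURCE implies.
CREATE WORKLOAD GROUP wgReporting
WITH (
    MIN_PERCENTAGE_RESOURCE = 25,
    CAP_PERCENTAGE_RESOURCE = 50,
    REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5
);
```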