SQL Server Performance: Troubleshooting Skewed Parallelism
The advent of multi-core processors has pushed many areas of computing, including SQL Server, toward multi-threaded processing to improve performance and efficiency. Paul White wrote a great introductory post on parallelism in SQL Server that is well worth reading: Understanding and Using Parallelism in SQL Server - Simple Talk (red-gate.com)
The SQL Server optimizer can choose to run operations (such as scanning a table) using multiple threads/workers if it believes a parallel query plan will improve performance. That’s pretty neat, isn’t it? However, like anything, it can backfire under certain conditions.
In this blog, I will discuss an issue with parallelism that I personally encountered and walk through how I troubleshot it, from identifying the problem to resolving it.
Troubleshooting using Wait Stats
As DBAs, we often receive vague descriptions of issues from users, such as “Oh, we just noticed the server was pretty slow this morning from 5AM to 8AM.” It's our job to ask more questions and gather data from monitoring tools and system tables to pinpoint the problem and identify the queries that need attention.
I have found that SQL Diagnostic Manager is helpful for detecting outliers and performing in-depth performance analysis.
When users report SQL Server slowness, I typically start by examining the wait stats to identify which resources the queries are waiting on, as well as which queries are responsible for those waits.
Looking at the query wait statistics on SQL Diagnostic Manager, I identified a specific query that consistently experienced long waits between 5AM and 8AM.
Examining the waits over time, the dominant ones were CXPACKET and CXCONSUMER. High values for these wait types may indicate inefficient parallelism.
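As a starting point for this kind of investigation, instance-wide waits can be pulled from the `sys.dm_os_wait_stats` DMV. A minimal sketch (the excluded wait types are just a small sample of common benign ones; the stats are cumulative since the last restart or manual clear):

```sql
-- Top waits accumulated on the instance, excluding a few benign/idle types
SELECT TOP (10)
      wait_type
    , waiting_tasks_count
    , wait_time_ms
    , wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'XE_TIMER_EVENT', N'BROKER_TASK_STOP',
                        N'REQUEST_FOR_DEADLOCK_SEARCH', N'CHECKPOINT_QUEUE')
ORDER BY wait_time_ms DESC;
```

On the server in question, CXPACKET and CXCONSUMER floated to the top of a list like this during the problem window.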
Next, I queried the Query Store to get the average wait time spent on CPU, Memory, Latch, and Parallelism for the query. I analyzed the data for the last 9 days to see if the slowness was a performance regression.
SELECT *
FROM (
    SELECT qsrsi.start_time AT TIME ZONE 'Central Standard Time' AS start_time
         , qsrsi.end_time AT TIME ZONE 'Central Standard Time' AS end_time
         , qsrs.wait_category_desc
         , avg_query_wait_time_ms
    FROM sys.query_store_runtime_stats_interval AS qsrsi
    INNER JOIN sys.query_store_wait_stats AS qsrs
        ON qsrsi.runtime_stats_interval_id = qsrs.runtime_stats_interval_id
    INNER JOIN sys.query_store_plan AS qsp
        ON qsrs.plan_id = qsp.plan_id
    INNER JOIN sys.query_store_query AS qsq
        ON qsq.query_id = qsp.query_id
    INNER JOIN sys.query_store_query_text AS qsqt
        ON qsq.query_text_id = qsqt.query_text_id
    WHERE qsrsi.start_time AT TIME ZONE 'Central Standard Time' >= '2024-10-01 12:00:00.0000000 -05:00'
        AND qsq.object_id = OBJECT_ID('Store_procedure_name')
        AND qsqt.query_sql_text LIKE '%INSERT INTO #%'
) AS SourceTable
PIVOT (
    AVG(avg_query_wait_time_ms)
    FOR wait_category_desc IN (CPU, Memory, Parallelism, Latch, [Network IO])
) AS PivotTable
ORDER BY start_time;
After plotting the results in Excel, I noticed that the parallelism wait time per execution doubled on the day the incident occurred.
Parallelism
In this section, I will introduce some key concepts related to parallelism and point out the root cause. You can also read about parallelism in more detail in the excellent post by Paul White mentioned earlier.
Below is a screenshot of a partial execution plan for the query in question. The operation in the top right is serial, since it scans a very small clustered index of 16 rows; the optimizer determined that parallelism was unnecessary for that operation.
Any operation with the double-arrow yellow icon is a parallel operation. When we examine the properties of the Index Seek operator on Table2, we can see the work is distributed across six threads. Why six? Because the server's MAXDOP (maximum degree of parallelism) is set to 6. For more information, refer to the Microsoft documentation on configuring MAXDOP.
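If you want to confirm the effective setting on your own server, the server-wide value can be read from `sys.configurations` (a sketch; a database-scoped configuration or a query hint can override it):

```sql
-- Server-wide MAXDOP setting
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'max degree of parallelism';

-- An individual query can also override it with a hint, e.g.:
-- SELECT ... FROM dbo.Table2 OPTION (MAXDOP 4);
```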
We also notice that the threads don't share the work evenly: threads 1, 2, and 3 handle more work than the others. We will return to the topic of skewed parallelism shortly, as it is the core of our problem.
A side note: thread 0 is not a parallel thread; its function is to coordinate the six parallel threads.
The red-highlighted operations are exchange operators, which are responsible for moving rows between workers. In the example below, the "Repartition Streams" exchange operator redistributes the rows returned from the Index Seek operation and sends them to the consuming workers for the next step, the Hash Match operation. Usually, rows are redistributed based on a hash function over one or more columns in the row.
SQL Sentry Plan Explorer provides a summary of how rows are distributed across threads at each operation. When reading a plan tree, we start at the innermost level of the tree and work our way outward, as that is the order in which the plan executes. As shown in the screenshot, many operations are parallel, meaning a significant amount of CPU is being dedicated to this query.
More importantly, the parallel threads processing the query show an uneven distribution of work. Threads 2 and 6 are doing the bulk of the work, while the others handle significantly less. This skew means that the threads with lighter workloads finish earlier and must wait for the slower threads to complete before moving on to the next operation. This is why the query experienced high parallelism waits and used CPU inefficiently.
But then we should ask: what caused the imbalance? Remember how I mentioned earlier that the optimizer partitions the rows among workers?
It uses a hash algorithm on the value of one or more columns in a row to assign the row to a specific worker. If most of the hash values are 2, for example, then most of the rows will be routed to worker 2. In this case, all the repartitioning events use hashes of Column_pli in Table1.
Now, let's take a closer look at the number of qualifying rows for each distinct value in Column_pli and how they correlate with the number of rows assigned to the parallel threads. Two observations stand out: most rows belong to pli 2 and were processed by thread 6, while the 103 rows belonging to pli 7 were processed by thread 3. In short, the uneven distribution of values in Column_pli causes the uneven distribution of work across the threads.
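The value distribution itself is easy to inspect with a simple aggregate. A minimal sketch against the tables from this post (Table1 and Column_pli; the schema prefix is an assumption):

```sql
-- Row counts per distinct Column_pli value; a few dominant values here
-- translate directly into a few overloaded parallel threads after hashing
SELECT Column_pli, COUNT(*) AS row_count
FROM dbo.Table1
GROUP BY Column_pli
ORDER BY row_count DESC;
```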
My solution is to separate the portion of the code that joins the two tables on Column_pli from the part of the query that requires heavy parallelism, specifically the section that accesses the XML data. By doing that, the optimizer may partition the work that accesses the XML data more evenly, and the stored procedure overall may run more efficiently.
However, the code that accesses and parses the XML isn't efficient either, since it parses and filters the XML records 10 times to access the 10 data fields. Instead, we can rewrite it as follows so that the parsing occurs only three times.
The final query now consists of two smaller queries: the first creates #temptable1 to hold the intermediate results from parsing the XML records, and the second joins to that temp table.
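The shape of that rewrite can be sketched as follows. This is not the actual production code; the table, column, and XML path names (XmlData, /Root/Record, FieldA, etc.) are placeholders, and the point is the pattern: shred each XML document once with `nodes()` into a temp table, then perform the Column_pli join against the temp table.

```sql
-- Step 1: parse the XML once per record into a temp table
SELECT t.Id
     , x.n.value('(FieldA/text())[1]', 'varchar(50)') AS FieldA
     , x.n.value('(FieldB/text())[1]', 'varchar(50)') AS FieldB
     , x.n.value('(FieldC/text())[1]', 'int')         AS FieldC
INTO #temptable1
FROM dbo.Table2 AS t
CROSS APPLY t.XmlData.nodes('/Root/Record') AS x(n);

-- Step 2: join the skew-prone tables to the already-parsed results
SELECT t1.Column_pli, tt.FieldA, tt.FieldB, tt.FieldC
FROM dbo.Table1 AS t1
INNER JOIN #temptable1 AS tt
    ON tt.Id = t1.Id;
```

Because the expensive XML shredding now happens in a query that never partitions on Column_pli, its parallel threads are free to split the work evenly.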
Performance Improvement:
The top query runs in parallel, and the threads share an even amount of work.
For the second query, the first few operators still have skewed parallelism (the red area) because Column_pli from Table1 is used as the partition column. However, as soon as the query touches #temptable1, it repartitions the streams using a different set of columns from #temptable1, and the later operators share the work evenly across threads (the green area).
I then opened two different sessions in SSMS to compare the results:
1. In the window with session id 626, I ran the old query and it took 12 seconds.
2. In the window with session id 333, I ran the tuned query and it took 2 seconds.
I also verified the reduction in parallelism waits by querying the sys.dm_exec_session_wait_stats DMV: the CXPACKET and CXCONSUMER wait times were significantly lower for the tuned query.
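A sketch of that check, using the two session IDs from the test above (the stats are cumulative for the lifetime of each session, so run it right after the queries finish):

```sql
-- Compare parallelism waits between the old and tuned query sessions
SELECT session_id, wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id IN (626, 333)
  AND wait_type IN (N'CXPACKET', N'CXCONSUMER')
ORDER BY session_id, wait_time_ms DESC;
```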
If you’re interested in learning more about the CXPACKET wait type, check out Paul Randal’s post: Knee-Jerk Wait Statistics : CXPACKET - SQLPerformance.com. In his post, Paul explains that CXPACKET waits occur in two situations: first, when the controlling thread (thread 0) registers a wait because it is waiting for the worker threads to finish processing their portions of the parallel plan; and second, when a non-controlling thread registers a wait because it finishes before the other threads in the operator. Our case is the second. He also emphasizes that CXPACKET waits don’t necessarily indicate problematic parallelism; you need to investigate deeper (e.g., via execution plans) to determine that.
Key Takeaways
Parallel plans can become inefficient when work is unevenly distributed across threads. Understanding the source of the skew is crucial, as it will guide you in deciding how to eliminate or mitigate it.