Thursday, September 19, 2019

SQL Server Unequal NUMA CPU engagement

With physical servers using a NUMA configuration - or VMs using vNUMA - a number of factors can lead to uneven CPU utilization among the NUMA nodes.

Consider a 2x8 VM (two vNUMA nodes, eight vCPUs each), with SQL Server maxdop set at 8. Assume a single maxdop 8 parallel query running on that system. The parallel workers will most likely be sitting on 8 user schedulers. If that parallel query has multiple zones (branches), the parallel workers will still most likely be on 8 user schedulers, with as many parallel workers for the query stacked on each of those schedulers as there are zones in the plan.

All of the schedulers hosting parallel workers for that query may be co-resident with each other in vNUMA node 0. Or 1. Or they may be spread across them in some manner.

The "single query with maxdop less than scheduler count" scenario gives a good starting point for understanding how CPU engagement of the NUMA nodes may be different at any given time.

(Location of context ID 0 for the session adds another detail to this scenario. It may be co-resident on a scheduler with some number of parallel workers for the query, or it may not. For example, if all parallel workers are in vNUMA node 0 and context ID 0 is also in vNUMA node 0 in this example, context ID 0 is co-resident on a scheduler with at least one parallel worker for the query.)
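To see where the workers for a given parallel query actually landed, a quick point-in-time query along these lines maps each task to its scheduler and vNUMA node. This is just a sketch - the session_id of 52 below is a stand-in for the session running the parallel query.

SELECT sch.parent_node_id, ot.scheduler_id, ot.exec_context_id, ot.task_state
FROM sys.dm_os_tasks ot WITH (NOLOCK)
JOIN sys.dm_os_schedulers sch WITH (NOLOCK) ON ot.scheduler_id = sch.scheduler_id
WHERE ot.session_id = 52 /* hypothetical session_id of the parallel query */
ORDER BY sch.parent_node_id, ot.scheduler_id, ot.exec_context_id;

exec_context_id 0 is the coordinator (context ID 0 above); the rest are the parallel workers.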

Imbalances can arise even if all queries are maxdop 1. SQL Server distributes incoming connections to an endpoint among all nodes available to that endpoint in a round-robin scheme. But! Due to the use of ADO.NET connection pooling and a pre-existing pool, two successive connections from a client - in the same pool - may very well end up on the same node. The fairly fast ramp-up of pools to their maximum size (by default 100) and fairly slow ramp-down (documentation says an idle pooled connection is disposed of after 4 to 8 minutes of idle time, down to the minimum pool size) can lead to some unexpected distributions of connections.
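A quick way to check how connections have actually been distributed - using the same DMV the capture procedure below relies on - is a rollup like this:

SELECT ec.node_affinity, conn_count = COUNT(*)
FROM sys.dm_exec_connections ec WITH (NOLOCK)
GROUP BY ec.node_affinity
ORDER BY ec.node_affinity;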

Now, it's even possible that an equal number of active workers are on each scheduler of each node, but the vNUMA nodes *still* show unequal CPU engagement. I'll mention just one query-based possibility and one system-state-based possibility.

What if the queries on node 1 are optimized CCI-based queries, while - through some stroke of luck open to the author of hypothetical situations - the queries on node 0 are heavily read-dependent rowstore queries?

The distribution of PAGEIOLATCH waits would lean toward node 0 due to the read-heavy pattern there (especially if read-ahead for some reason isn't engaged, or the reads are slow enough that they still produce waits). And those waits could suppress CPU engagement on node 0, while no comparable effect would impede engagement on node 1.
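A point-in-time sketch of which node the PAGEIOLATCH waiters are sitting on can be had by joining the waiting tasks back to their schedulers. It only catches whatever happens to be waiting at that instant, so it complements rather than replaces the repeated captures below.

SELECT sch.parent_node_id, owt.wait_type, waiting_tasks = COUNT(*)
FROM sys.dm_os_waiting_tasks owt WITH (NOLOCK)
JOIN sys.dm_os_tasks ot WITH (NOLOCK) ON owt.waiting_task_address = ot.task_address
JOIN sys.dm_os_schedulers sch WITH (NOLOCK) ON ot.scheduler_id = sch.scheduler_id
WHERE owt.wait_type LIKE N'PAGEIOLATCH%'
GROUP BY sch.parent_node_id, owt.wait_type
ORDER BY sch.parent_node_id, owt.wait_type;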

Now let's talk about how system state could result in uneven CPU engagement across the vNUMA nodes, even if work is evenly distributed.
Various memory states can result in that pattern.
What if another application is running on the same VM? If its memory primarily comes from NUMA node 0, its memory management can come into conflict with SQL Server, which will try to balance its own memory utilization across the NUMA nodes.

The perfmon counters [\NUMA Node Memory(*)\Free & Zero Page List MBytes] can be good to peruse if this is suspected. If only one of two nodes is flirting with bottoming out its Free & Zero Page List memory, it can disproportionately suffer from memory stalls and memory-related SQL Server waits, as well as from content being sent to and retrieved from pagefile.sys.
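The perfmon counters show the OS side of the story. From inside SQL Server, a sketch like the following (assuming SQL Server 2012 or later, where pages_kb is available) shows how the memory clerks are spread across the memory nodes - memory_node_id 64 is the DAC node and can be ignored.

SELECT mc.memory_node_id, clerk_pages_mb = SUM(mc.pages_kb) / 1024
FROM sys.dm_os_memory_clerks mc WITH (NOLOCK)
GROUP BY mc.memory_node_id
ORDER BY mc.memory_node_id;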

OK, enough speculation from me. :-)

Here's a stored procedure that, if run at a fairly short interval like every 10 seconds, can give insight into uneven CPU utilization on a NUMA server based on information from within SQL Server.


CREATE PROCEDURE dbo.SASQ_GRAB
AS
IF object_id ('dbo.SASQ_SCHEDULER_TASKS') IS NULL
BEGIN
    CREATE TABLE dbo.SASQ_SCHEDULER_TASKS (
     capture_time DATETIME NOT NULL,
     scheduler_id INT NOT NULL,
     session_id SMALLINT NULL,
     wait_type NVARCHAR(60) NULL,
     task_count SMALLINT NULL
    );

    CREATE CLUSTERED INDEX CI ON dbo.SASQ_SCHEDULER_TASKS (capture_time) 
    WITH (data_compression = row);
END

IF object_id ('dbo.SASQ_NODE_CONNS') IS NULL
BEGIN
    CREATE TABLE dbo.SASQ_NODE_CONNS (
     capture_time DATETIME NOT NULL,
     node_affinity TINYINT NOT NULL,
     client_net_address VARCHAR(50) NULL,
     conn_count SMALLINT NULL
    );

    CREATE CLUSTERED INDEX CI ON dbo.SASQ_NODE_CONNS (capture_time)
    WITH (data_compression = row);
END

IF object_id ('dbo.SASQ_WAIT_STATS') IS NULL
BEGIN
    CREATE TABLE dbo.SASQ_WAIT_STATS (
     capture_time        DATETIME NOT NULL,
     wait_type           NVARCHAR(60) NOT NULL,
     waiting_tasks_count BIGINT NOT NULL,
     wait_time_ms        BIGINT NOT NULL,
     max_wait_time_ms    BIGINT NOT NULL,
     signal_wait_time_ms BIGINT NOT NULL
    );

    CREATE CLUSTERED INDEX CI ON dbo.SASQ_WAIT_STATS (capture_time)
    WITH (data_compression=row);
END

DECLARE @ct DATETIME;
SET @ct = GETDATE();

INSERT INTO dbo.SASQ_SCHEDULER_TASKS WITH (TABLOCK)
SELECT @ct, ot.scheduler_id, ot.session_id, owt.wait_type, task_count = COUNT(*)
FROM sys.dm_os_tasks ot WITH (NOLOCK)
LEFT OUTER JOIN sys.dm_os_waiting_tasks owt WITH (NOLOCK) ON ot.task_address = owt.waiting_task_address
WHERE ot.scheduler_id < 1048576 /* exclude internal schedulers, eg the DAC scheduler */
GROUP BY ot.scheduler_id, ot.session_id, owt.wait_type
ORDER BY ot.scheduler_id ASC
OPTION (MAXDOP 1);

INSERT INTO dbo.SASQ_NODE_CONNS WITH (TABLOCK)
SELECT @ct, node_affinity, client_net_address, conn_count = COUNT(*)
FROM sys.dm_exec_connections WITH (NOLOCK)
GROUP BY node_affinity, client_net_address
OPTION (MAXDOP 1);


INSERT INTO dbo.SASQ_WAIT_STATS WITH (TABLOCK)
SELECT capture_time = @ct, /* same capture_time as the other grabs, to line the rows up */
       wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats ws WITH (NOLOCK)
WHERE waiting_tasks_count <> 0
OPTION (MAXDOP 1);

/* 20190919
   grab some info from DMOs in iterations of eg 10s to understand
   discrepancy of cpu engagement between NUMA nodes
*/
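
One hypothetical way to drive the captures at a 10 second interval, if an Agent job isn't handy:

WHILE 1 = 1
BEGIN
    EXEC dbo.SASQ_GRAB;
    WAITFOR DELAY '00:00:10';
END

And once a few captures are in the tables, a rollup along these lines - a sketch that assumes the scheduler-to-node mapping in sys.dm_os_schedulers hasn't changed across the capture window - shows the task distribution per node over time:

SELECT st.capture_time, sch.parent_node_id, task_count = SUM(st.task_count)
FROM dbo.SASQ_SCHEDULER_TASKS st
JOIN sys.dm_os_schedulers sch ON st.scheduler_id = sch.scheduler_id
GROUP BY st.capture_time, sch.parent_node_id
ORDER BY st.capture_time, sch.parent_node_id;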
