sql.sasquatch: September 2019

Monday, September 30, 2019

Observing the [DBCC CHECKDB] parallel object check

This is a query i use to keep tabs on all of the workers when i run [dbcc checkdb] at high DOP :-)

Note: this query won't catch the initial portions of checkdb that always run at DOP 1 - the checkalloc, checkcatalog, etc.

;WITH owt AS
(SELECT exec_context_id, wait_type, 
        MAX(wait_duration_ms) wait_duration_ms
 FROM sys.dm_os_waiting_tasks owt 
 GROUP BY exec_context_id, wait_type)
SELECT capture_tm = getdate(), owt.wait_type, count(*) waiters, 
       MIN(owt.wait_duration_ms) min_wait_ms, 
       MAX(owt.wait_duration_ms) max_wait_ms
FROM sys.dm_exec_requests req
JOIN sys.dm_os_tasks ot ON req.session_id = ot.session_id
LEFT OUTER JOIN owt
     ON ot.session_id = owt.session_id AND ot.exec_context_id = owt.exec_context_id
WHERE req.command IN ('dbcc table check') 
GROUP BY owt.wait_type
ORDER BY waiters DESC;

Thursday, September 19, 2019

SQL Server Unequal NUMA CPU engagement

With physical servers using NUMA configuration - or vms using vNUMA - a number of factors can lead to uneven CPU utilization among the NUMA nodes.

Consider a 2x8 vm, with SQL Server maxdop set at 8. Assume a single maxdop 8 parallel query running on that system. The parallel workers will most likely be sitting on 8 user schedulers. If that parallel query has multiple zones, the parallel workers will still most likely be on 8 user schedulers, with as many parallel workers for the query stacked on each of those schedulers as there are zones in the plan.

All of the schedulers hosting parallel workers for that query may be co-resident with each other in vNUMA node 0. Or 1. Or they may be spread across them in some manner.

The "single query with maxdop less than scheduler count" scenario gives a good starting point for understanding how CPU engagement of the NUMA nodes may be different at any given time.

(Location of context ID 0 for the session adds another detail to this scenario. It may be co-resident on a scheduler with some number of parallel workers for the query, or it may not. For example, if all parallel workers are in vNUMA node 0 and context ID 0 is also in vNUMA node 0 in this example, context ID 0 is co-resident on a scheduler with at least one parallel worker for the query.)

Imbalances can even arise if all queries are maxdop 1. SQL Server distributed incoming connections to an end-point among all nodes available to that end-point with a round-robin scheme. But! Due to use of ADO.NET connection pooling and a pre-existing pool, two successive connections from a client - in the same pool - may very well end up on the same node. The fairly fast ramp up of pools to their maximum size (by default 100) and fairly slow ramp down (documentation says an idle pool connection will be disposed after 4 to 8 minutes of idle time, down to minimum pool size) can lead to some unexpected distributions of connections.

Now, its even possible that an equal number of active workers are on each scheduler of each node, but the vNUMA nodes *still* show unequal CPU engagement. I'll just mention 1 query-based possibility, and one system state-based possibility.

What if the queries on node 1 are optimized CCI-based queries, while through some stroke of luck open to the author of hypothetical situations the queries on node 0 are heavily read-dependent rowstore queries?

The distribution of pageio_latch waits would lean toward Node 0 due to the operation pattern there(especially if readahead for some reason or other isn't engaged or is still so slow it produces waits). And those waits could suppress CPU engagement on Node 0, while the same level of effect would not impede engagement on node 1.

Now let's talk how system state could result in uneven CPU engagement across the vNUMA nodes, even if work is evenly distributed.
Various memory states can result in that pattern.
What if another application is running on the same VM? If its memory primarily comes from NUMA node 0, its memory management can come into conflict with SQL Server, which will try to balance its memory utilization across the NUMA nodes.

Perfmon counters [\NUMA Node Memory(*)\Free & Zero Page List MBytes] can be good to peruse if this is suspected. If only one of 2 nodes is flirting with bottoming out Free and Zero Page List memory, it can disproportionately suffer from memory stalls and memory-related SQL Server waits, as well as potentially suffering from content send to and retrieve from pagefile.sys.

OK, enough speculation from me. :-)

Here's a stored procedure that, if run in a fairly short interval like every 10 seconds, can give insight into uneven CPU utilization on a NUMA server based on information from within SQL Server.

CREATE PROCEDURE dbo.SASQ_GRAB
AS
IF object_id ('dbo.SASQ_SCHEDULER_TASKS') IS NULL
BEGIN
    CREATE TABLE dbo.SASQ_SCHEDULER_TASKS (
     capture_time DATETIME NOT NULL,
     scheduler_id INT NOT NULL,
     session_id SMALLINT NULL,
     wait_type NVARCHAR(60) NULL,
     task_count SMALLINT NULL
    );

    CREATE CLUSTERED INDEX CI ON dbo.SASQ_SCHEDULER_TASKS (capture_time) 
    WITH (data_compression = row);
END

IF object_id ('dbo.SASQ_NODE_CONNS') IS NULL
BEGIN
    CREATE TABLE dbo.SASQ_NODE_CONNS (
     capture_time DATETIME NOT NULL,
     node_affinity TINYINT NOT NULL,
     client_net_address VARCHAR(50) NULL,
     conn_count SMALLINT NULL
    );

    CREATE CLUSTERED INDEX CI ON dbo.SASQ_NODE_CONNS (capture_time)
    WITH (data_compression = row);
END

IF object_id ('dbo.SASQ_WAIT_STATS') IS NULL
BEGIN
    CREATE TABLE dbo.SASQ_WAIT_STATS (
     capture_time        DATETIME NOT NULL,
     wait_type           NVARCHAR(60) NOT NULL,
     waiting_tasks_count BIGINT NOT NULL,
     wait_time_ms        BIGINT NOT NULL,
     max_wait_time_ms    BIGINT NOT NULL,
     signal_wait_time_ms BIGINT NOT NULL
    );

    CREATE CLUSTERED INDEX CI ON dbo.SASQ_WAIT_STATS (capture_time)
    WITH (data_compression=row);
END

DECLARE @ct DATETIME;
SET @ct = GETDATE();

INSERT INTO dbo.SASQ_SCHEDULER_TASKS WITH (TABLOCK)
SELECT @ct, ot.scheduler_id, ot.session_id, owt.wait_type, task_count = count(*)
FROM sys.dm_os_tasks ot with (nolock)
LEFT OUTER JOIN sys.dm_os_waiting_tasks owt with (nolock) ON ot.task_address = owt.waiting_task_address
WHERE scheduler_ID < 1048576
GROUP BY ot.scheduler_id, ot.session_id, owt.wait_type
ORDER BY ot.scheduler_id ASC
OPTION (MAXDOP 1);

INSERT INTO dbo.SASQ_NODE_CONNS with (tablock)
SELECT @ct, node_affinity, client_net_address, conn_count = count(*) 
FROM SYS.dm_exec_connections with (nolock)
GROUP BY node_affinity, client_net_address
OPTION (MAXDOP 1);


INSERT INTO dbo.SASQ_WAIT_STATS with (tablock)
SELECT capture_time = GETDATE(), wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats ws WITH (NOLOCK)
WHERE waiting_tasks_count <> 0
OPTION (MAXDOP 1);

/* 20190919
   grab some info from DMOs in iterations of eg 10s to understand
   discrepancy of cpu engagement between NUMA nodes
*/

Friday, September 13, 2019

SQL Server NUMA Memory

;WITH tgt AS (SELECT instance_name, cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Target Node Memory (KB)'
     UNION ALL
     SELECT 'TOTAL', cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Target Server Memory (KB)'),
tot AS (SELECT instance_name, cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Total Node Memory (KB)'
     UNION ALL
     SELECT 'TOTAL', cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Total Server Memory (KB)'),
dbc AS (SELECT instance_name, cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Database Node Memory (KB)'
     UNION ALL
     SELECT 'TOTAL', cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Database Cache Memory (KB)'),
stl AS (SELECT instance_name, cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Stolen Node Memory (KB)'
     UNION ALL
     SELECT 'TOTAL', cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Stolen Server Memory (KB)'),
fre AS (SELECT instance_name, cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Free Node Memory (KB)'
     UNION ALL
     SELECT 'TOTAL', cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Free Memory (KB)'),
frn AS (SELECT instance_name, cntr_value
              FROM SYS.DM_OS_PERFORMANCE_COUNTERS
              WHERE COUNTER_NAME = 'Foreign Node Memory (KB)'
     UNION ALL
     SELECT 'TOTAL', cntr_value = CONVERT(INT, NULL))
SELECT tgt.instance_name, target_kb =  tgt.cntr_value,
       total_kb = tot.cntr_value, dbCache_kb = dbc.cntr_value,
    stolen_kb = stl.cntr_value, free_kb = fre.cntr_value,
       foreign_kb = frn.cntr_value
FROM tgt 
JOIN tot ON tgt.instance_name = tot.instance_name
JOIN frn ON tgt.instance_name = frn.instance_name
JOIN dbc ON tgt.instance_name = dbc.instance_name
JOIN stl ON tgt.instance_name = stl.instance_name
JOIN fre ON tgt.instance_name = fre.instance_name;

Tuesday, September 10, 2019

T-SQL Tuesday 118; My Fantasy SQL Server Feature

This month the T-SQL Tuesday is hosted by Kevin Chant (b | t).
The topic: My Fantasy SQL Server Feature.

Here's my wish: meta-data only merges for populated CCI partitions when the partition scheme fully aligns.

Imagine a table - CCI only with no nonclustered indexes, no PK - into which millions of rows are inserted every day. The partition column is a datetime corresponding to the day the row was inserted.

It may make good sense to partition daily. But as the data ages, there's likely to be a time when that partition size may as well be monthly. And after 2 years, might as well have yearly partitions.

Rather than physically moving data into newly defined larger partitions, I'd rather a meta-data operation to merge the 28-31 daily partitions for an old month into a monthly partition. And a metadata only operation to merge 12 older monthly partitions into an yearly partition.

The older the data gets, the more time there's been to prune the rowgroups. A meta-data only operation to combine the daily rowgroup partitions into a monthly partition reduces the number of partitions *and* keeps partition elimination on the datetime partitioning column at an equivalent level to what partition elimination was before.

Whew. My shortest blog post in quite a while, I think. 😀

Monday, September 9, 2019

Decoding sys.dm_os_waiting_tasks.resource_description column info for lock waiters

I've been looking at some locking and blocking scenarios lately, and wanted to do some real-time investigation as well as some logging over time.

Have you seen Erland Sommarskog's beta_lockinfo? It's pretty nice. But it didn't quite fit my needs.
http://www.sommarskog.se/sqlutil/beta_lockinfo.html

I went with this instead. With each iteration, grab just the lock waiters from dm_os_waiting_tasks - and just those with waits longer than 1000 ms. Convert resource_description to XML, and decode the attributes so I can potentially manipulate it.

The REPLACE occurrences below are to correct for formatting issues in the resource_description text which keep it from converting to conforming XML. Hope I got them all.

;WITH lock_waiters AS
(SELECT waiting_task_address, session_id, exec_context_id, wait_duration_ms, wait_type, 
        resource_address, blocking_task_address, blocking_session_id, 
        blocking_exec_context_id, resource_description
 FROM sys.dm_os_waiting_tasks owt 
 WHERE owt.wait_type LIKE 'LCK%' AND wait_duration_ms > 1000)
SELECT track.session_id, track.blocking_session_id, track.wait_duration_ms,
       x1.[type], x1.mode, x1.associatedObjectId, x1.[dbid], x1.subresource, x1.lockPartition, 
       x1.fileid, x1.pageid, x1.hobtid, x1.[hash], x1.databasePrincipalId, x1.classid, x1.[objid], x1.id
FROM lock_waiters AS track
CROSS APPLY (SELECT info = CONVERT(XML, '<lockinfo type="' + REPLACE(REPLACE(REPLACE(REPLACE(resource_description,', ',','), ' = ','|'), ' ', '" '), '=', '="') + '"/>')) XML_table
CROSS APPLY
     (SELECT [type] = XML_table.info.value(N'(/lockinfo/@type)[1]', N'varchar(50)'),
             mode = XML_table.info.value(N'(/lockinfo/@mode)[1]', N'varchar(20)'),
             associatedObjectId = XML_table.info.value(N'(/lockinfo/@associatedObjectId)[1]', N'varchar(20)'),
             [dbid] = XML_table.info.value(N'(/lockinfo/@dbid)[1]', N'varchar(20)'),
             subresource = XML_table.info.value(N'(/lockinfo/@subresource)[1]', N'varchar(20)'),
             lockpartition = XML_table.info.value(N'(/lockinfo/@lockPartition)[1]', N'varchar(20)'),
             fileid = XML_table.info.value(N'(/lockinfo/@fileid)[1]', N'varchar(20)'),
             pageid = XML_table.info.value(N'(/lockinfo/@pageid)[1]', N'varchar(20)'),
             hobtid = XML_table.info.value(N'(/lockinfo/@hobtid)[1]', N'varchar(20)'),
             [hash] = XML_table.info.value(N'(/lockinfo/@hash)[1]', N'varchar(20)'),
             databasePrincipalId = XML_table.info.value(N'(/lockinfo/@databasePrincipalId)[1]', N'varchar(20)'),
             classid = XML_table.info.value(N'(/lockinfo/@classid)[1]', N'varchar(20)'),
             id = XML_table.info.value(N'(/lockinfo/@id)[1]', N'varchar(20)'),
             [objid] = XML_table.info.value(N'(/lockinfo/@objid)[1]', N'varchar(20)')) x1;

Here's the BOL link for SQL Server 2017 sys.dm_os_waiting_tasks.
https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-waiting-tasks-transact-sql?view=sql-server-2017