Monday, January 14, 2019

Evaluating SQL Server System Resource Governor Workload Group "Maximum Memory Grant %" for Adjustment

I'm looking at a SQL Server system with 4 NUMA nodes, 18 cores on each NUMA node.  Total RAM: 2 TB.
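If you want to verify that layout from inside SQL Server, something like this works - a quick sketch against DMVs available in recent versions (column names from memory):

-- logical CPUs and RAM visible to SQL Server
SELECT cpu_count,
       hyperthread_ratio,
       physical_memory_kb / 1024 / 1024 AS physical_memory_gb
FROM sys.dm_os_sys_info;

-- one row per node; online scheduler count shows cores per NUMA node
SELECT node_id, node_state_desc, online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc = N'ONLINE';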


We can easily verify that the SQL Server default workload group is the main CPU consumer on the server.
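One way to check, sketched against the Resource Governor DMV (cumulative since the stats were last reset):

-- cumulative CPU per workload group; 'default' should dominate here
SELECT name AS workload_group,
       total_request_count,
       total_cpu_usage_ms
FROM sys.dm_resource_governor_workload_groups
ORDER BY total_cpu_usage_ms DESC;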





I like to trend active requests and parallel threads against CPU utilization.  On the systems I watch, when they are healthy there should be a fairly strong correlation between CPU and requests, or CPU and parallel threads, or both.  Hm. During this time, threads have a stronger correlation than requests.  That's not concerning by itself.  But with this many requests and this many parallel threads against 72 logical processors/schedulers (HT is not enabled), I want higher CPU utilization and a higher rate of work.
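The counters behind that trending are in the Workload Group Stats perfmon object, also visible from T-SQL; a point-in-time sample might look like this (counter names as I remember them):

-- instance_name is the workload group; sample these on an interval to trend
SELECT counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Workload Group Stats%'
  AND counter_name IN (N'Active requests', N'Active parallel threads')
  AND instance_name = N'default';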





Either the work is poorly distributed (a less likely condition) or there is a wait type (or types) that can be identified as suppressing CPU utilization.
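Ruling out poor distribution is cheap; a per-NUMA-node rollup of the schedulers shows whether tasks are lopsided:

-- is the work evenly distributed across the 4 nodes?
SELECT parent_node_id,
       COUNT(*)                  AS schedulers,
       SUM(current_tasks_count)  AS current_tasks,
       SUM(runnable_tasks_count) AS runnable_tasks
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
GROUP BY parent_node_id
ORDER BY parent_node_id;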


Perfmon captures some wait types, and I strike gold by looking at resource_semaphore waits. 
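The full wait picture is in sys.dm_os_wait_stats; the memory grant wait is RESOURCE_SEMAPHORE:

-- cumulative waits; the LIKE also catches RESOURCE_SEMAPHORE_QUERY_COMPILE etc.
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'RESOURCE_SEMAPHORE%'
ORDER BY wait_time_ms DESC;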




I don't like switching between area and line graphs for resource utilization, but the graph below is easiest to read with CPU as a line graph over stacked area graphs of memory grants (pending and outstanding).  The point of this graph is to show that if, at every point in time, the pending memory grants had been converted to outstanding grants (so those queries were executing rather than waiting), there would have been enough CPU to allow them to run.  When I intervene on a system for scalability or performance, I hope not just to replace one wait type with another 😎.  If, for example, CPU were near 100% utilized while there were pending memory grants, I'd expect simply to replace resource_semaphore waits with SOS_SCHEDULER_YIELD waits.  But that's not the case here.
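The two grant counters are in the Memory Manager perfmon object, and sys.dm_exec_query_resource_semaphores tells the same story per semaphore; a sketch of both:

-- the two Memory Manager counters graphed above
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Memory Manager%'
  AND counter_name IN (N'Memory Grants Pending', N'Memory Grants Outstanding');

-- grantees vs waiters per resource semaphore, with grant sizes
SELECT pool_id, resource_semaphore_id, grantee_count, waiter_count,
       granted_memory_kb, available_memory_kb
FROM sys.dm_exec_query_resource_semaphores;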



So I have in mind to lower the "Maximum Memory Grant %" for the relevant workload group.  On this system, the default workload group is still at the default value of 25%.
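That setting is easy to confirm from the DMV:

-- current per-request memory grant cap for each workload group
SELECT name, request_max_memory_grant_percent
FROM sys.dm_resource_governor_workload_groups;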

The graph below has reserved memory in dark blue layered in front of granted memory in lighter blue.  (There really is only one active user workload group on this system - default.  If there were additional workload groups and/or additional resource pools, this type of analysis would still be possible, but information from DMVs would be necessary to augment perfmon.) So the exposed light blue on the top is the amount of used granted query memory.  It's just a fraction of the granted memory, and sometimes a quite small fraction at that.  By reducing the maximum memory grant percentage, there will be a better fit between granted and used query memory (e.g., I expect reserved memory as well as granted memory to be lower at the same query concurrency).
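For a point-in-time equivalent of this graph, the subtraction can be done right in the query (counter names as I remember them; "reserved" is the granted-but-not-yet-used portion):

-- granted vs reserved workspace memory; the difference is used query memory
SELECT MAX(CASE WHEN counter_name = N'Granted Workspace Memory (KB)' THEN cntr_value END) AS granted_kb,
       MAX(CASE WHEN counter_name = N'Reserved Server Memory (KB)'   THEN cntr_value END) AS reserved_kb,
       MAX(CASE WHEN counter_name = N'Granted Workspace Memory (KB)' THEN cntr_value END)
     - MAX(CASE WHEN counter_name = N'Reserved Server Memory (KB)'   THEN cntr_value END) AS used_query_memory_kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Memory Manager%';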


But there are lots of queries running on the system.  Some of them are probably already spilling to tempdb.  Those queries will spill more once the max memory grant percent is lowered.  And some queries that weren't spilling before, but were close, will start spilling into tempdb.
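On builds that expose spill columns in sys.dm_exec_query_stats (SQL Server 2016 SP2 / 2017 CU3 and later, if memory serves), a rough look at the worst spillers in cache is possible:

-- total_spills is pages spilled to tempdb across all executions of the plan
SELECT TOP (10)
       qs.execution_count,
       qs.total_spills,
       qs.last_spills,
       SUBSTRING(st.text, 1, 100) AS query_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE qs.total_spills > 0
ORDER BY qs.total_spills DESC;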

Let's look at how much tempdb was already in use.  Perfmon has counters for the data file size in tempdb, and the amount of tempdb used.  In the graph below the used tempdb is in light blue layered in front of the dark blue size of tempdb files.  So the exposed dark blue at the top is the unused tempdb space.
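The same numbers are available from a DMV; a sketch (8 KB pages, summed across tempdb data files):

-- tempdb data file size vs allocated (used) vs free
SELECT SUM(total_page_count)               * 8 / 1024 AS tempdb_data_file_mb,
       SUM(allocated_extent_page_count)    * 8 / 1024 AS tempdb_used_mb,
       SUM(unallocated_extent_page_count)  * 8 / 1024 AS tempdb_free_mb
FROM tempdb.sys.dm_db_file_space_usage;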




Let's do the subtraction for used query memory and used tempdb, and stack them on top of each other.  The upper border is the upper limit for sort/hash activity on the system at any given time.  (Upper limit because "used tempdb" also includes temp tables, version store, etc.)
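Combining the two samples above into one point-in-time number looks something like this:

-- used query memory (granted - reserved) stacked with used tempdb
;WITH mem AS (
    SELECT MAX(CASE WHEN counter_name = N'Granted Workspace Memory (KB)' THEN cntr_value END)
         - MAX(CASE WHEN counter_name = N'Reserved Server Memory (KB)'   THEN cntr_value END) AS used_query_memory_kb
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE N'%Memory Manager%'
), tdb AS (
    SELECT SUM(allocated_extent_page_count) * 8 AS used_tempdb_kb
    FROM tempdb.sys.dm_db_file_space_usage
)
SELECT mem.used_query_memory_kb,
       tdb.used_tempdb_kb,
       mem.used_query_memory_kb + tdb.used_tempdb_kb AS sort_hash_upper_limit_kb
FROM mem CROSS JOIN tdb;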

Assuming that all of the queries from 10:00 am to 6:00 pm ran with the same plans but with maximum memory grant percent at 6% rather than 25%, none of their sort/hash needs would change.  Assuming the pace of every query remained the same, eliminating only resource_semaphore wait time (that's not realistic but work with me here 😇), the total area under the upper border would stay constant.  Like moving Tetris blocks around.  The share of tempdb is likely to increase vs used query memory.  But it looks like bringing the maximum query memory grant percent down from 25% to 6% is going to be a win for this workload.  After the change, I recommend doing the same type of analysis to make sure the behavior is as expected.
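For reference, the change itself is small; sketched here for the default workload group (REQUEST_MAX_MEMORY_GRANT_PERCENT is the T-SQL name behind the "Maximum Memory Grant %" label):

-- lower the per-request memory grant cap from 25% to 6%
ALTER WORKLOAD GROUP [default] WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 6);

-- takes effect for new requests after reconfigure
ALTER RESOURCE GOVERNOR RECONFIGURE;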




Thursday, December 6, 2018

Fun with SQL Server Plan Cache, Trace Flag 8666, and Trace Flag 2388

Ok, last one for a while.  This time, pulling the stats relevant to a plan from the plan XML in the plan cache (thanks to trace flag 8666), then getting a little bit of stats update history via trace flag 2388.

I'm not particularly happy with performance; it took between 10 and 12 seconds to return a 12 row resultset 😥

But it's good enough for the immediate troubleshooting needs.

Some sample output is below; note that the output is sorted by database, schema, table, and stats name.  But the updates of any one stat are NOT sorted based on the date in [Updated].  I don't have the brainpower right now to figure out how to convert the [Updated] values to DATETIME to get them to sort properly.  Maybe sometime soon 😏
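If the [Updated] values come back in a form like 'Jul 31 2018 10:51AM' (that's how I remember the trace flag 2388 output), TRY_CONVERT might be all it takes; an untested sketch:

-- hypothetical sort fix: TRY_CONVERT yields NULL rather than an error
-- for any rows that don't parse as dates
SELECT *
FROM #stats_hist
ORDER BY dbName, schemaName, tableName, statName,
         TRY_CONVERT(DATETIME, Updated) DESC;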

CREATE OR ALTER PROCEDURE sasquatch__hist_one_plan_stats_T8666 @plan_handle VARBINARY(64)
AS
/* stored procedure based on work explained in the following blog post
https://sql-sasquatch.blogspot.com/2018/06/harvesting-sql-server-trace-flag-8666.html

supply a plan_handle and, if trace flag 8666 was enabled at the system or session level when the plan was compiled
and trace flag 2388 is enabled at the session level when this stored procedure is executed, the recent update history of the stats used to compile the plan will be displayed
*/

DECLARE @startdb NVARCHAR(256),
        @db      NVARCHAR(256),
        @schema  NVARCHAR(256),
        @table   NVARCHAR(256),
        @stats   NVARCHAR(256),
        @cmd     NVARCHAR(MAX);
        
SET @startdb = DB_NAME();
DROP TABLE IF EXISTS #plan;
CREATE TABLE #plan(planXML XML);

INSERT INTO #plan
SELECT CONVERT(XML, query_plan) planXML
FROM sys.dm_exec_query_plan(@plan_handle);

;WITH XMLNAMESPACES(default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT DbName.DbName_Node.value('@FieldValue','NVarChar(128)')              dbName,
       SchemaName.SchemaName_Node.value('@FieldValue','NVarChar(128)')      schemaName,
       TableName.TableName_Node.value('@FieldValue','NVarChar(128)')        tableName,
       [Stats].StatsName_Node.value('@FieldValue','NVarChar(128)')          statName,
       CONVERT(INT, NULL) done
INTO #QO_stats
FROM #plan qsp
CROSS APPLY qsp.planXML.nodes(N'//InternalInfo/EnvColl/Recompile')       AS Recomp(Recomp_node)
CROSS APPLY Recomp.Recomp_Node.nodes(N'Field[@FieldName="wszDb"]')       AS DbName(DbName_Node)
CROSS APPLY Recomp.Recomp_Node.nodes(N'Field[@FieldName="wszSchema"]')   AS SchemaName(SchemaName_Node)
CROSS APPLY Recomp.Recomp_Node.nodes(N'Field[@FieldName="wszTable"]')    AS TableName(TableName_Node)
CROSS APPLY Recomp.Recomp_Node.nodes(N'ModTrackingInfo')                 AS [Table](Table_Node)
CROSS APPLY Table_Node.nodes(N'Field[@FieldName="wszStatName"]')         AS [Stats](StatsName_Node)
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS #one_stat_hist;
CREATE TABLE #one_stat_hist
(Updated NVARCHAR(50), [Table Cardinality] BIGINT, [Snapshot Ctr] BIGINT, 
 Steps INT, Density FLOAT, [Rows Above] BIGINT, [Rows Below] BIGINT, 
 [Squared Variance Error] NUMERIC(16,16), [Inserts Since Last Update] BIGINT, 
 [Deletes Since Last Update] BIGINT, [Leading Column Type] NVARCHAR(50));

DROP TABLE IF EXISTS #stats_hist;

CREATE TABLE #stats_hist
(dbName NVARCHAR(50), schemaName NVARCHAR(50), tableName NVARCHAR(50), statName NVARCHAR(50),
 Updated NVARCHAR(50), [Table Cardinality] BIGINT, [Snapshot Ctr] BIGINT, 
 Steps INT, Density FLOAT, [Rows Above] BIGINT, [Rows Below] BIGINT, 
 [Squared Variance Error] NUMERIC(16,16), [Inserts Since Last Update] BIGINT, 
 [Deletes Since Last Update] BIGINT, [Leading Column Type] NVARCHAR(50));

WHILE EXISTS (SELECT TOP 1 1 FROM #QO_stats WHERE done IS NULL)
BEGIN
     SELECT TOP 1 @db = dbName, @schema = schemaName, @table = tableName, @stats = statName
     FROM #QO_stats
     WHERE done IS NULL
     ORDER BY dbName, schemaName, tableName, statName;

     /* QUOTENAME guards against names that need delimiting */
     SET @cmd = N'USE ' + QUOTENAME(@db) + N'; DBCC SHOW_STATISTICS(''' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)
              + N''', ' + QUOTENAME(@stats) + N') WITH NO_INFOMSGS';

     INSERT INTO #one_stat_hist
     EXEC (@cmd);

     INSERT INTO #stats_hist
     SELECT @db, @schema, @table, @stats, *
     FROM #one_stat_hist;

     UPDATE #QO_stats
     SET done = 1 
     WHERE @db =   #QO_stats.dbName
     AND @schema = #QO_stats.schemaName
     AND @table =  #QO_stats.tableName 
     AND @stats =  #QO_stats.statName;

     TRUNCATE TABLE #one_stat_hist;

END

SELECT * FROM #stats_hist ORDER BY dbName, schemaName, tableName, statName;

/* USE inside dynamic SQL only lasts for that batch, so this is just a safety net */
EXEC (N'USE ' + @startdb);

/* 20181206 */
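
Usage looks something like this; the plan_handle below is grabbed arbitrarily from the cache just for demonstration - in practice, supply the one you're troubleshooting:

DECLARE @ph VARBINARY(64);
SELECT TOP (1) @ph = plan_handle FROM sys.dm_exec_cached_plans;

/* trace flag 2388 switches DBCC SHOW_STATISTICS to the history format the
   procedure expects; session-scoped here.  remember the plan itself must have
   been compiled with trace flag 8666 enabled for the stats info to be present */
DBCC TRACEON(2388);
EXEC sasquatch__hist_one_plan_stats_T8666 @plan_handle = @ph;
DBCC TRACEOFF(2388);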