Thursday, September 21, 2017

SQL Server - exploring parallel query hash row distribution & parallel page supplier Part I


This is a half-baked draft for two reasons.

First, I got distracted by an unexpected finding.  At lower row counts, such as 1023 rows (for values 1 to 10), the table I create below had 1023 data pages, or 2^10 - 1 (the expected number, 1 data page per row). But at 65535 total rows, the table seemed to have 65527 data pages according to sys.dm_db_database_page_allocations. I'm not sure why 8 fewer data pages than expected were reported. I'll investigate in another blog post and link here.  Already found some interesting stuff while investigating, and I have a theory. 😁

Second... well, while I'm sharing the table I'll use for investigation... uhhhh... I'm still working on the queries.  When I've got test queries with the type of behavior I'm looking for I'll include them here, and when I've got interesting findings posted I'll change the status of this post to "fully baked."


*****

The SQL below will create a table with 2^(n-1) rows with num value n, for each integer n between 1 and 16.  The total number of rows in the table is 2^16 - 1, or 65535.

The code isn't elegant or necessarily very efficient.  It took 8 minutes and 11 seconds to complete on the test VM.  But a table with this structure should be very valuable in experiments to learn how hash distribution of rows among parallel threads works.  Hopefully it will also lead to learning about how the parallel page supplier works.

DROP TABLE IF EXISTS lnieders_bitmap;
CREATE TABLE lnieders_bitmap(num INT, fluff VARCHAR(4100));
CREATE CLUSTERED INDEX ci__lnieders_bitmap ON lnieders_bitmap(num);

DECLARE @p2 INT = 16, @fluff VARCHAR(4100) = REPLICATE('Z', 4100);
;WITH nums_1 AS (SELECT TOP (POWER(2, @p2)-1) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) rn
                 FROM master..spt_values t1 CROSS JOIN master..spt_values t2),
      ctrl_cte AS (SELECT TOP (@p2) rn, SUM(POWER(2, rn - 1)) OVER(ORDER BY rn ROWS UNBOUNDED PRECEDING) rt FROM nums_1),
      nums_2 AS (SELECT nums_1.rn, ctrl_cte.rn target_value FROM nums_1 LEFT OUTER JOIN ctrl_cte ON nums_1.rn = ctrl_cte.rt)
INSERT INTO lnieders_bitmap SELECT TOP(POWER(2, @p2)-1) cx.tv, @fluff FROM nums_2 t1
CROSS APPLY (SELECT MIN(target_value) tv FROM nums_2 t2 WHERE t2.rn >= t1.rn) cx ORDER BY t1.rn ASC;

SELECT COUNT(*) data_page__count 
FROM sys.dm_db_database_page_allocations(db_id(), object_id('lnieders_bitmap'), 1, NULL, 'DETAILED')
WHERE page_type_desc = 'DATA_PAGE';

Joe Obbish read my code above and had some comments:
  • that code is so confusing to read
  • if you change it like so, it'll finish in 2 seconds
  • still not the way I'd probably do it if starting from scratch

Thanks, Joe!!  Sure enough, I tried his code and it finished in 2 seconds.  So I increased the expected total number of rows from 65535 to 131071, which is where I wanted to be anyway.  That took four seconds.  To create 131071 rows, my code would have taken... a *lot* longer.  😁

Here's what this looks like with Joe's help.


DROP TABLE IF EXISTS #t;
DECLARE @p2 INT = 17;
WITH nums_1 AS (
 SELECT TOP (POWER(2, @p2)-1) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) rn
 FROM master..spt_values t1
 CROSS JOIN master..spt_values t2
)
, ctrl_cte AS (
 SELECT rn
   , SUM(POWER(2, rn - 1)) OVER (ORDER BY rn ROWS UNBOUNDED PRECEDING) rt
 FROM nums_1
 WHERE rn <= @p2
)
, nums_2 AS (
 SELECT nums_1.rn, ctrl_cte.rn target_value
 FROM nums_1
 LEFT OUTER JOIN ctrl_cte ON nums_1.rn = ctrl_cte.rt
)
SELECT * INTO #t
FROM nums_2
OPTION (RECOMPILE);

DROP TABLE IF EXISTS lnieders_bitmap;
CREATE TABLE lnieders_bitmap(num INT, fluff VARCHAR(4101));
CREATE CLUSTERED INDEX ci__lnieders_bitmap ON lnieders_bitmap(num); 

DECLARE @fluff VARCHAR(4101) = REPLICATE('Z', 4101);
INSERT INTO lnieders_bitmap WITH (TABLOCK)
SELECT MIN(target_value) OVER (ORDER BY rn DESC ROWS UNBOUNDED PRECEDING), @fluff
FROM #t t1
OPTION (RECOMPILE);

Much faster, much nicer. So here's the idea: since each integer value n from 1 to 17 is represented in the table with 2^(n-1) rows, when rows are distributed to parallel threads by hash, the number of rows on a thread will tell me which values that thread got, too.
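Here's a sketch of that decoding idea.  The thread row count below is a hypothetical value, not from an actual run: since value n contributes 2^(n-1) rows, each bit of a thread's total row count is a presence flag for one value.

```sql
-- hypothetical per-thread row count: 65552 = 2^16 + 2^4,
-- i.e. the thread received all rows for num values 17 and 5
DECLARE @thread_rows BIGINT = 65552;

;WITH n(n) AS (SELECT TOP (17) ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
               FROM master..spt_values)
SELECT n AS num_value                -- values present on this thread
FROM n                               -- value n contributes 2^(n-1) rows,
WHERE @thread_rows & POWER(CONVERT(BIGINT, 2), n - 1) > 0;   -- so test bit n-1
```

For the hypothetical count 65552 this returns num values 5 and 17.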

Now, this should also allow me to make observations about the parallel page supplier.  The threads of a parallel scan get their page sets from the parallel page supplier... but there's not much info publicly available beyond that.  A few reliable test cases (which I haven't invented yet) should help me see whether the page sets for a clustered index scan are determined based on index key values.  I should also be able to suss out whether the page sets given to threads are affected by table stats, or strictly by "whoever asks next gets the next set I'm ready to hand out."

Stay tuned.

*****

Later I thought "hey, I'll save myself some thought if I do this in base 10 rather than base 2."  And I wanted to see how Joe Obbish would have approached creating this type of table. I asked, and he taught me a new trick (which I'll likely have to keep coming back to here, otherwise I won't remember how it worked).

This isn't exactly how Joe did it, but it's as close as I can get from memory.

This query shows how it works.

;WITH base(base)  AS (SELECT 10),
pwr(pwr_plus1)    AS (SELECT 6),
nums_1(num_1)     AS (SELECT ROW_NUMBER() OVER(ORDER BY(SELECT NULL))
                      FROM master..spt_values t1 CROSS JOIN master..spt_values t2),
nums_2(num_2, tv) AS (SELECT num_1, POWER(base, num_1-1)
                      FROM nums_1 cross join pwr cross join base WHERE num_1 <= pwr.pwr_plus1),
nums_3(num_3)     AS (SELECT num_2 FROM nums_1 JOIN nums_2 ON nums_1.num_1 <= nums_2.tv)
SELECT num_3 val, COUNT(*) num_rows_at_val FROM nums_3 GROUP BY num_3 ORDER BY num_3;
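And here's the payoff of base 10 over base 2, again with a hypothetical thread row count: since value n contributes POWER(10, n-1) rows, each decimal digit of a thread's row count is a 0-or-1 flag you can read by eye.  A sketch:

```sql
-- hypothetical per-thread row count from the base-10 version of the table:
-- 101010 = 10^5 + 10^3 + 10^1, i.e. the thread received values 6, 4, and 2
DECLARE @thread_rows BIGINT = 101010;

;WITH n(n) AS (SELECT TOP (6) ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
               FROM master..spt_values)
SELECT n AS num_value
FROM n
WHERE (@thread_rows / POWER(CONVERT(BIGINT, 10), n - 1)) % 10 = 1;
```

In practice no query is even needed - just read the digits of the count.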

So here's how it comes together to make my table for exploring.





Wednesday, September 20, 2017

SQL Server 2016 Execution Context ID: Trust but Verify II - Short & Sweet

In a previous post, I mentioned that sometimes two separate tasks/workers in a parallel query will identify themselves with the same execution context ID.  A join to sys.dm_os_threads allows identifying tasks by their kpid, clearing up any possible confusion.

Trust but Verify: SQL Server 2016 SP1 CU4 Execution Context ID
http://sql-sasquatch.blogspot.com/2017/09/trust-but-verify-sql-server-2016-sp1.html 

Today I'm just going to pretend that whole issue doesn't exist because I'm feeling lazy. 😊  Let me show another reason that - when it comes to execution context ID - I recommend "trust but verify."

The system I grabbed the graphic below from is, same as in the earlier post, SQL Server 2016 SP1 CU4.
 
Got a parallel query running, and I want to grab some wait & blocking info for each of the parallel threads.  It makes sense to start in sys.dm_os_tasks and left outer join to sys.dm_os_waiting_tasks, because I might have some tasks working at the time my query captures them - they'd be unmatched in the view of waiting tasks.  I've underlined my left outer join back to sys.dm_os_tasks because it was originally unintuitive to me.  If I want to grab the session ID and execution context ID of the task that's blocking... isn't sys.dm_os_waiting_tasks good enough? Well... no. Execution context ID 0 can show up as NULL in the blocking_exec_context_id column, even though blocking_task_address is populated.  Huh.

No problem, though.  Using that blocking_task_address for a join back to sys.dm_os_tasks allows me to see that execution context ID 2 below is blocked by execution context ID 0 within the same query.
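For reference, here's a sketch of that query shape - the column list is trimmed and the session_id is hypothetical, so this isn't the exact query behind the screenshot:

```sql
-- sketch: wait & blocking info per parallel thread; the second join to
-- sys.dm_os_tasks recovers the blocker's exec_context_id even when
-- blocking_exec_context_id comes back NULL for execution context 0
SELECT t.session_id, t.exec_context_id,
       wt.wait_type, wt.wait_duration_ms,
       blk.session_id      AS blocking_session_id,
       blk.exec_context_id AS blocking_exec_context_id
FROM sys.dm_os_tasks t
LEFT OUTER JOIN sys.dm_os_waiting_tasks wt
       ON wt.waiting_task_address = t.task_address
LEFT OUTER JOIN sys.dm_os_tasks blk
       ON blk.task_address = wt.blocking_task_address
WHERE t.session_id = 66;   -- hypothetical session_id
```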



Good.  That's enough for tonight.

Tuesday, September 19, 2017

#SQLServer 2016 SP1 CU4 vs CU2: Deep Sleepers and a Nervous Waker in the Deadzone

We've got a deadzone in SQL Server 2016 SP1 CU4.  A small amount of testing on SQL Server 2016 SP1 CU3 indicates the deadzone behavior was introduced in CU4 and not earlier.  Rather than re-generate in CU3 a bunch of stuff I already have from CU2, though - for the sake of finally getting some info about this scenario into the blog-o-sphere - I'll be comparing the behavior of SP1 CU4 to SP1 CU2 here.

The deadzone... for hours, four deep sleepers and one nervous waker from a MAXDOP 2 query try to get out of it... usually they do.  But it can take hours... sometimes days.

Joe Obbish used a heading of "Putting the Dead in Deadlock" in his related blog post... I'm so impressed with that turn of phrase I have to try some of my own. :-) Joe's post is here...

Hash Partitioned Exchange Spills
https://orderbyselectnull.com/2017/09/15/hash-partitioned-exchange-spills/

And there is a Connect item...
Order Preserving Repartition Streams Can Cause Extreme Performance Degradations
https://connect.microsoft.com/SQLServer/feedback/details/3141119/order-preserving-repartition-streams-can-cause-extreme-performance-degradations

Let's start this adventure the way I started it: by looking at some frustrating behavior on SQL Server 2016 SP1 CU4. Below is the setup, followed by the troubled query.  The table created in the setup is as simple as it gets: a single BIGINT column, in a clustered index.  13890 zeroes are inserted, followed by 13890 ones.  Stats are updated on the clustered index before the first attempt at producing an egregious hang, so every execution is on a level playing field. Trace flag 8649 is used here so cost doesn't prevent a parallel plan.  The merge join is hinted to get order-preserving exchanges.  The same problem occurs at DOP higher than 2... but when investigating a problem with parallel queries, I recommend working at the lowest DOP where the behavior can be reproduced, for the sake of simplicity.

/* the setup: drop/create a single BIGINT column table w/CI
              populate the table with 13890 0s and 13890 1s */ 
DROP TABLE IF EXISTS DEADLOCK_6_L;
CREATE TABLE DEADLOCK_6_L(ID BIGINT);
CREATE CLUSTERED INDEX CI__DEADLOCK_6_L ON DEADLOCK_6_L(ID);

INSERT INTO DEADLOCK_6_L SELECT 0 FROM
(SELECT TOP (13890) row_number() over(order by (select null)) RN
 FROM master..spt_values t1 CROSS JOIN master..spt_values t2) subQ;

INSERT INTO DEADLOCK_6_L SELECT 1 FROM
(SELECT TOP (13890) row_number() over(order by (select null)) RN
 FROM master..spt_values t1 CROSS JOIN master..spt_values t2) subQ;

UPDATE STATISTICS DEADLOCK_6_L(CI__DEADLOCK_6_L) WITH FULLSCAN;

/* might take up to 20 repetitions of the query below 
   to get exchange spill (CU2) or "the deadzone" (CU4) */ 

SELECT * FROM DEADLOCK_6_L t1 WHERE EXISTS
(SELECT 1 FROM DEADLOCK_6_L t2 WHERE t2.ID = t1.ID)
ORDER BY t1.ID OPTION (querytraceon 8649, merge join, maxdop 2);

I've run the setup and the trouble query in a user database with 8 equally-sized data files, and in a single-file master database, with the same results.  When the query is fast... it's fast.  But when it's slow, it can take hours to complete.  Sometimes days.

Here's live query stats from an example - in fact one that's running as I'm typing in this post.



As I was running these tests, I made sure "live query stats" and "include actual plan" were toggled on for my session, so sys.dm_exec_query_profiles info was available for the session.
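Something like the query below (hypothetical session_id) returns per-operator, per-thread progress while the query runs - a sketch, not the exact query behind the captures:

```sql
-- per-operator, per-thread row counts for a running query; requires the
-- session to have live query stats / actual plan collection enabled
SELECT node_id, physical_operator_name, thread_id,
       row_count, estimate_row_count
FROM sys.dm_exec_query_profiles
WHERE session_id = 54        -- hypothetical session_id
ORDER BY node_id, thread_id;
```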




Whew!!  That one finally completed. 21 hours, 16 minutes, 10 seconds as reported by SSMS.


Let's take a look at that post-execution plan for clues.  There's a warning on the Repartition Streams - what does it say?


OK, what about the waits recorded by the post-execution plan?  Huh. That's a total of only 176 ms of wait time, 4071 ms of CPU time, and 76569190 ms of elapsed time (21h16m9s190ms).  That doesn't add up in any way, shape, or form.


What does sys.dm_exec_session_wait_stats say?  *That* makes more sense.  306270624 ms of CXPACKET wait time - 85h4m30s624ms among the 5 threads that lasted long enough to be observed.
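For the record, this is the shape of the query I use against that DMV (hypothetical session_id):

```sql
-- unlike the plan's wait summary above, this DMV kept accumulating
-- CXPACKET time for the session's threads the whole time
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id = 54        -- hypothetical session_id
ORDER BY wait_time_ms DESC;
```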












Thursday, September 14, 2017

Trust but Verify: SQL Server 2016 SP1 CU4 Execution Context ID

The @@version of the system I'm currently working on:

 

I spend a lot of time looking at sys.dm_os_waiting_tasks for parallel queries these days. For parallel queries, multiple sys.dm_os_waiting_tasks rows for a single execution context ID within a session is not unusual.  For example...




Got two rows for session_id 68, exec_context_id 1 above.  Matching the blocking_task_address in those rows to waiting_task_address... OK, exec_context_id 1 is slotted in a CXPACKET wait for exec_context_id 3 and 4.  That's reasonable.  As DOP increases, the number of exec_context_id values with multiple rows may also increase.  At really high DOP the number of CXPACKET rows can look downright scary :-)  But it's still legit once you're used to seeing it.

Here's the waiting tasks for another parallel query.



Like the previous results, there are two rows for exec_context_id 1.  Yeah, we've seen that before :-)

There are also 2 rows for exec_context_id 4.  Hmmm.  That seems plausible: they show the same resource_address and the same blocking_task_address.

However... don't be too quick to brush off multiple rows for a given exec_context_id as due to the parallel CXPACKET wait implementation we just discussed.  Or to the "not always transactionally consistent" nature of some of the DMVs.

Check it out below.   The two rows with exec_context_id 4 have different values for waiting_task_address, and different wait_duration_ms values.  They are different tasks, on different workers.  They really shouldn't be reported with the same exec_context_id.
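To tell such tasks apart, surfacing the kpid works.  Here's a sketch; the join path through sys.dm_os_workers is my assumption of the simplest route from tasks to threads, and the session_id is hypothetical:

```sql
-- expose the OS thread id (kpid) per task, so two tasks reporting the
-- same exec_context_id can be distinguished
SELECT t.session_id, t.exec_context_id, t.task_address,
       th.os_thread_id AS kpid
FROM sys.dm_os_tasks t
JOIN sys.dm_os_workers w ON w.worker_address = t.worker_address
JOIN sys.dm_os_threads th ON th.thread_address = w.thread_address
WHERE t.session_id = 68;   -- hypothetical session_id
```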




Does this matter?  Well - the presence of these duplicate exec_context_id values doesn't seem to be an indication of anything nefarious under the hood.  But if you try to piece together wait information for parallel queries, things can get confusing if you're not ready to accommodate this possibility.  And monitoring/trending utilities that collect and correlate information from the DMVs can compound the confusion.

Trust but verify, y'all.

Ciao, thanks for reading!

And if you really have a hankerin' for more of the same...

SQL Server 2016 Execution Context ID: Trust but Verify II - Short & Sweet
http://sql-sasquatch.blogspot.com/2017/09/sql-server-2016-execution-context-id.html 

PowerShell connection to SQL Server: MARS enabled, pooling disabled


Sometimes I have to troubleshoot connection issues... or even query performance issues... for various applications.  It can be very valuable to mimic the application's connection settings, including MARS and connection pooling.

Here's how to do that - code snippets at the end of the post.

In the capture below I construct my desired connection string - in this case with MARS enabled and connection pooling disabled.  I bring back one row/one column with a SQL statement just to confirm it worked :-)


Here I specifically enabled MARS because it's something I'm looking into right now.

Can we see the MARS connection from the database server side?  Yep.  Code snippet for this also at the end of the blog post.


OK.  Let's be nice and call Dispose.  Oops - typo. Try again.


OK.  Since Connection Pooling was disabled for this connection, it should disappear right away.  And it is really gone. (It can take 4 to 8 minutes for an idle pool connection to disappear by default.)


The PowerShell code...


$scon = New-Object System.Data.SqlClient.SqlConnection
$scon.ConnectionString = "Data Source=localhost;Initial Catalog=MASTER;Connection Timeout=15;Integrated Security=true;MultipleActiveResultSets=True;Pooling=False"
$cmd = New-Object System.Data.SqlClient.SqlCommand
$cmd.Connection = $scon
$cmd.CommandText = "SELECT db_name();"
$cmd.CommandTimeout = 0
$scon.Open()
$adp = New-Object System.Data.SqlClient.SqlDataAdapter $cmd
$data = New-Object System.Data.DataSet
$adp.Fill($data) | Out-Null
$data.Tables
$scon.Dispose()

The SQL query to see the PowerShell MARS session...


select conn.net_transport, conn.connection_id, conn.parent_connection_id, conn.session_id, 
       conn.num_reads, conn.last_read, conn.num_writes, conn.last_write,
       sess.login_time, sess.client_interface_name from sys.dm_exec_connections conn
join sys.dm_exec_sessions sess on sess.session_id = conn.session_id
where exists 
(select 1 from sys.dm_exec_connections ic where ic.session_id = conn.session_id and ic.net_transport = 'Session')
AND sess.login_time > '2017-09-14 10:20:00'
order by conn.session_id;



Friday, August 25, 2017

#SQLServer 2016 CCI COUNT(*) Aggregate Pushdown Tipping Point - Part I

I've been working with SQL Server 2016 SP1 CCIs more lately... anything I figure out before Niko Neugebauer or Joe Obbish have given it full treatment in their blogs, I'm gonna try to get in here right quick :-) Niko's and Joe's blogs are both on my 'must read' list, by the way.
Niko's blog can be found here:
http://www.nikoport.com/

And Joe's blog can be found here:
https://orderbyselectnull.com/

Today I was tinkering with some CCIs... doing something that I often do: speculating about a system I haven't been able to access directly, trying to come up with theories to explain unexpected behavior.  Along the way I found something interesting I hadn't heard anyone mention: a tipping point for count(*) aggregate pushdown against a CCI.

*****Update 26 August 2017*****
Niko checked out my plans and tried to reproduce on both SQL Server 2017 and SQL Server 2016... no dice.  That reminds me - I should have included this for system context :-)




Without the same system context, results may differ.  I'll update the blog post again once I determine whether trace flags, compatibility level, optimizer hotfixes, or CE version are necessary components of this behavior. (Turns out the main consideration for the tipping point on a BIGINT CCI column is MAXDOP.  CE plays a minor role - the new CE gives integral row estimates for BIGINT columns.)

*****

Here's a quick repro if you want to follow along.


SET NOCOUNT ON;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
----------------------------------------------------------------------
-- © Itzik Ben-Gan
----------------------------------------------------------------------
/* http://tsql.solidq.com/SourceCodes/GetNums.txt */
GO

CREATE TABLE dbo.Tipper 
([Key] bigint NOT NULL,
 INDEX CCI_Tipper CLUSTERED COLUMNSTORE);

set statistics time on;
INSERT INTO dbo.Tipper
SELECT TOP (52428800) nums.n
FROM dbo.getNums(1, 52428800) nums;

create statistics Tipper_key on Tipper([key]) with fullscan;

/* aggregate pushdown!! */
select count(*) from 
(SELECT [key] FROM Tipper where [key] > 0 and [key] < 27213.9) subQ

/* NO aggregate pushdown!! */
select count(*) from 
(SELECT [key] FROM Tipper where [key] > 0 and [key] < 27213.8) subQ

Let me walk through it quickly.

I use Itzik Ben-Gan's getnums function a lot, so I'm more than happy to keep citing it :-)

So create the getnums function and the Tipper CCI in my database.


OK, let's use the getnums function to add 50 fully populated rowgroups (1024 * 1024 rows each) to the CCI, with incrementing values in the [key] column.


Now let's create fullscan stats on the [key] column.  Fullscan stats, and values starting at 1 incrementing by 1, give a little more predictability to row estimates in the plans.


Below is what we want post-execution plans to look like when counting rows in a range - the thin arrow coming out of the columnstore scan is a hint that aggregate pushdown was successful.  I didn't specify MAXDOP in a query hint, and Resource Governor isn't supplying MAXDOP; MAXDOP 8 is coming from the database scoped configuration.  The degree of parallelism turns out to be a significant factor in determining the tipping point.  The [key] column is a BIGINT.  Maybe it's surprising that I'm using 27213.9 as the upper bound.  But... check out the estimated number of rows :-) Again - this estimate is coming from the legacy CE, specified in the database scoped configuration for my database.



Look what happens if the upper limit is 27213.8 at MAXDOP 8.  Row estimate of 27213.8 and... no more aggregate pushdown!



Above I mentioned that DOP is an important part of determining the tipping point, with the cardinality estimator version playing a minor role.  Here are the tipping point values for legacy and new CE, at DOP 1 to DOP 8.


The tipping point values are row estimates, remember.  That's why the CE plays only a minor role here - it's responsible for the estimates.  The new CE's estimates are slightly different from the legacy CE's - and in this case, where the filter is against a BIGINT column, it looks like the new CE gives integral row estimates, while the legacy CE's estimates are not necessarily integral.
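If you'd rather not flip the database scoped configuration back and forth to compare the two CE versions, the USE HINT arguments that shipped with 2016 SP1 work per-query.  A sketch, reusing the bounds from the repro above:

```sql
-- same predicate, two CE versions; compare 'Estimated Number of Rows'
-- on the columnstore scan in the two plans
SELECT COUNT(*) FROM dbo.Tipper WHERE [Key] > 0 AND [Key] < 27213.9
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'), MAXDOP 8);

SELECT COUNT(*) FROM dbo.Tipper WHERE [Key] > 0 AND [Key] < 27213.9
OPTION (USE HINT('FORCE_DEFAULT_CARDINALITY_ESTIMATION'), MAXDOP 8);
```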

Anyway... a graph of the Legacy CE tipping point values against MAXDOP makes the nearly linear relationship clear.


Thursday, July 20, 2017

SQL Server Memory Accounting: Aligning Perfmon & DMVs

Short one for today, trying to get back into the habit of posting stuff here ;-)

I remember staring at all of the SQL Server memory-related counters available in perfmon, and wondering how I was going to make sense of all of them.

I remember the relief when I settled on "total = database + stolen + free", which aligned with the formula I usually used when evaluating UNIX server memory: total = file + computational + free.

Here's a link to a post from a few years ago about #SQLServer memory accounting...
Perfmon: SQL Server Database pages + Stolen pages + Free pages = Total pages
http://sql-sasquatch.blogspot.com/2013/09/perfmon-database-pages-stolen-pages.html

And here's a post going through a tricky detail when accounting for memory in AIX...
IBMPower AIX Memory Accounting: Fun with vmstat... Numperm Surprise!!
http://sql-sasquatch.blogspot.com/2014/10/ibmpower-aix-memory-accounting-fun-with.html

It's time to return to squaring the numbers between the DMVs and perfmon.  I want to make sense of resource_semaphore_query_compile waits; I'll have to evaluate threshold values from sys.dm_exec_query_optimizer_memory_gateways in light of optimizer memory, granted memory, used granted memory, and total stolen memory.  The gateway threshold values aren't available in perfmon, so I'd like to grab all of the numbers from the DMVs.  We'll see.
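As a starting point, the gateway thresholds come straight out of that DMV (2016 SP1 and later).  The column list here is from memory, so verify it against your build:

```sql
-- the compile gates behind resource_semaphore_query_compile waits;
-- threshold is the compile-memory level at which a query must pass the gate
SELECT pool_id, name, max_count, active_count,
       waiter_count, threshold_factor, threshold, is_active
FROM sys.dm_exec_query_optimizer_memory_gateways;
```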

Here's a query I'm using to square up the basics between perfmon and the DMVs.

;WITH calc(counter_name, dmv_calc) AS
 (SELECT 'Target Server Memory (KB)' AS counter_name, 
         CONVERT(INT, sc.value_in_use) * 1024 AS dmv_calc
  FROM   sys.configurations sc
  WHERE [name] = 'max server memory (MB)'
  UNION ALL
  SELECT 'Database Cache Memory (KB)' AS counter_name,
         pages_kb AS dmv_calc
  FROM sys.dm_os_memory_clerks mc
  WHERE mc.[type] = 'MEMORYCLERK_SQLBUFFERPOOL'
  AND memory_node_id < 64
  UNION ALL 
  SELECT 'Total Server Memory (KB)' AS counter_name,
         SUM(virtual_address_space_committed_kb) 
         + SUM(locked_page_allocations_kb) AS dmv_calc
  FROM sys.dm_os_memory_nodes
  UNION ALL
  SELECT 'Stolen Server Memory (KB)' AS counter_name,
         SUM(allocations_kb) AS dmv_calc
  FROM sys.dm_os_memory_brokers
  UNION ALL
  SELECT 'Free Memory (KB)' AS counter_name,
         a - b - c AS dmv_calc 
  FROM
  (SELECT SUM(virtual_address_space_committed_kb) 
          + SUM(locked_page_allocations_kb) AS a
   FROM sys.dm_os_memory_nodes) a
  CROSS JOIN
  (SELECT SUM(allocations_kb) AS b
   FROM sys.dm_os_memory_brokers) b
  CROSS JOIN
  (SELECT pages_kb AS c
   FROM sys.dm_os_memory_clerks mc
   WHERE mc.[type] = 'MEMORYCLERK_SQLBUFFERPOOL'
   AND memory_node_id < 64) c)
 SELECT opc.[object_name], 
        opc.counter_name, 
        opc.cntr_value, 
        calc.dmv_calc,
  opc.cntr_value - calc.dmv_calc AS diff 
 FROM sys.dm_os_performance_counters opc
 JOIN calc ON calc.counter_name = opc.counter_name
 WHERE opc.counter_name IN
('Target Server Memory (KB)',
 'Total Server Memory (KB)',
 'Database Cache Memory (KB)',
 'Stolen Server Memory (KB)',
 'Free Memory (KB)')
 ORDER BY opc.cntr_value DESC;

The results are pretty close.  If anyone knows where the variance in database cache/stolen memory comes from (or how to get rid of it), please let me know.  I'd really like to get it dead on, even though production system values will typically be changing over time, even when measured in small increments.

Here's what I got on my test system as it was ramping up...