Wednesday, March 29, 2017

#sqlserver tempdb: "insert into... select" vs "select... into" vs index sort_in_tempdb write behavior

I believe pretty strongly that a collection of best practices should keep workload in mind.  Form fit to function, if you will.

So when it comes to tempdb best practices, I like to remind folks that not all use of tempdb looks the same from the system side.

On the systems I work with, sort/hash spill is the main component of tempdb use.  On many OLTP systems, rapid create/drop of temp tables is a much more prominent factor in tempdb considerations.  On some systems, row versioning or perhaps use of large local or global temp tables is a significant consideration.

Two concerns arise from the various uses of tempdb:
  • do they all stress allocation pages in the same way?  Tempdb allocation page contention and resulting pagelatch waits are the classic drivers of increasing the tempdb data file count for a given workload.
  • do they all utilize the storage subsystem in the same way? This is another, separate consideration in tempdb design which may influence decisions about which RAID level or disk group to put tempdb on, whether to use flash or HDD, how many Windows volumes to use, etc.
I'll have to return to that first consideration on another day :-)

Today I want to turn to the question of how some tempdb activity engages with storage.

Sort and hash spill is my primary concern, but today I'm going to weasel out by using CREATE INDEX with the sort_in_tempdb option as a proxy for sort spill.  At a later time (once my other tests behave a bit more) I'll come back to this and compare sort and hash spills to sort_in_tempdb activity.

The system I'm using is an 8 vcpu VM, set up with storage from an all flash array. SQL Server 2016 SP1.
Here's the setup for my test. 

/* test setup */
create table test_small (num1 BIGINT,
                         string VARCHAR(4000),
                         num2 BIGINT,
                         constraint pk_test_small PRIMARY KEY CLUSTERED (num1))
WITH (data_compression = page);

DECLARE @line VARCHAR(80) = 
DECLARE @string VARCHAR(4000) = 
   @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line
 + @line + @line + @line + @line + @line + @line + @line + @line + @line + @line;

;WITH L0 AS (SELECT 1 AS c UNION ALL SELECT 1),      --          2
      L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0), --          4
      L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1), --         16
      L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2), --        256
      L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3), --      65536
      L5 AS (SELECT 1 AS c FROM L4 A CROSS JOIN L4), -- 4294967296
INSERT INTO test_small
SELECT N, @string, N
WHERE N % 3 = 0

Crude, but effective.  You can probably tell that I was initially working with a table 3 times as large and scaled it back :-) The large string in each row forces 1 row per page, giving a fairly predictable table size when the table is not compressed.

For the first test scenario, lets create a temp table, and insert all rows into it.

/* scenario 1 */
create table #test_small_1 (num1 BIGINT,
                            string VARCHAR(4000),
                            num2 BIGINT);

INSERT INTO #test_small_1
FROM test_small;

OK, now lets use a "SELECT... INTO" query.

/* scenario 2 */
INTO #test_small_2
FROM test_small;

Finally, lets use "CREATE INDEX" with sort_in_tempdb.

/* scenario 3 */
CREATE INDEX nci_small_test ON test_small(Num2)
INCLUDE (String, Num1)

Now... lets see how that all turned out. 1 second perfmon captures below.

Interesting that the "select... into" took a little longer, isn't it :-)  But the two temp table operations were close to each other in write bytes/sec to tempdb.  Look how aggressive that "create index" was though!  Maxed out adapter bandwidth.

Let's look at writes/sec rather than write bytes/sec.
Whoa.  The "select... into" in the middle issued wwwaaaayyyy more writes, and way more writes/sec to tempdb than either of the other operations.  But... each operation wrote the same amount to tempdb.  So the "select... into" was using some pretty small writes in comparison.  

Based on the number of writes, the fact that the "select... into" showed higher disk queue length than the first table option isn't too surprising.  But... maybe its a little surprising that the "create index" has such high queue length, considering the much lower number of writes/second?

But if we look at sec/write... aha!   The table insert and "select... into" operations have write latency staying well below 64ms.  Its a little surprising to me that the "insert into" had higher peak write latency than the "select into..."  That may certainly be an oddity of that particular test run.  But I don't have a long enough attention span to repeat the test 10 or more times to even it out :-)  Nah, the real surpise is the write latency over 300 ms in the "create index".  On flash storage, too.

So... different uses of tempdb stress storage in different ways.  If I'm correct that "create index" is a good proxy for hash/sort spills, then tempdb spills require special consideration in how to destage aggressive write bytes/sec.

And... a theme I'll come back to in the future... relevant to the high write latency experienced by the "Create index"...
strange things happen to service time at saturation.


  1. Interesting, thanks! It makes sense that the latency is so high for the create index, precisely because of the fast flash storage -- it maxed out the adapter bandwidth, so the I/O queued up on the VM and/or the hypervisor.
    Could be especially dangerous if you're maxing out the bandwidth of all the blades on the chassis, if you're using a blade. May be a good use for resource governor to limit the IOPS.

    1. Exactly right about the saturated bandwidth resulting in very high latency.

      Interesting that you should bring up the resource governor :-)
      General IOPs limits to prevent write floods tend to work poorly. In this instance I have a database spread evenly over 8 files in a filegroup, each on own Windows volume and RDM.

      RG gives proportional governance to IOPs per volume if defined and if there is competition.

      In this case a single CREATE INDEX at maxdop 4 can flood with writes - if uncontested, governance won't kick in.

      VMware sioc is another approach (one I'm not fond of because consolidation isn't a goal of mine). If latency crosses the threshold in the datastore, IO against the datastore gets throttled. Also wouldn't work too well here.

      Many storage arrays have QoS settings to limit bytes/sec or IOPs for a host or LUN at the front end adapter. Host level wouldn't work well since I want to max out reads/writes to persistent database files, transaction logs. And reads from tempdb haven't been a problem (at least not yet).

      So where does that leave the idea of throttling tempdb spill writes?

      Maybe... just maybe... the -k startup option might work. Haven't tried it yet. We'll see. It would also potentially affect max write rate for checkpoint to persistent databases though, so thorough testing is warranted. :-)