Friday, November 22, 2013

Multiple tempdb data files for performance - What could go wrong?

Sure enough, there can be trouble with too many tempdb files on a single LUN.  I am seeing it more and more often these days.  Believe me, its none too easy to convince folks to change when, for the most part, the most respected names in SQL Server performance talk frequently about multiple tempdb data files, but almost never about using multiple LUNs to host those files.
Lets dive into that, at least for a quick lap.

OK.  Lets start with a reasonable measuring stick for whether or not there is a performance problem.  I'll use Paul Randal's recent categorization of disk latency (though for DSS/Data Warehouse systems I typically categorize IO performance in a different manner).

    Excellent: < 1ms
    Very good: < 5ms
    Good: 5 – 10ms
    Poor: 10 – 20ms
    Bad: 20 – 100ms
    Shockingly bad: 100 – 500ms
    WOW!: > 500ms

Ok... here's the system: HP DL380 with two 3.07GHz hex-core Xeon x5675 processors, just shy of 96GB RAM, 2 dual port Emulex HBA adapters.
The OS is WS2008R2 SP1, and the database version is Microsoft SQL Server 2008 R2 (SP2).

Details as they appear in a smash-n-grab...

OS Name:                   Microsoft Windows Server 2008 R2 Enterprise
OS Version:                6.1.7601 Service Pack 1 Build 7601

Total Physical Memory:     98,294 MB

System Manufacturer:       HP
System Model:              ProLiant DL380 G7
System Type:               x64-based PC
Processor(s):              2 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 44 Stepping 2 GenuineIntel ~3066 Mhz

Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz

Emulex AJ763A/AH403A, PCI Slot 2, Storport Miniport Driver 
Emulex AJ763A/AH403A, PCI Slot 2, Storport Miniport Driver                         
Emulex AJ763A/AH403A, PCI Slot 3, Storport Miniport Driver 
Emulex AJ763A/AH403A, PCI Slot 3, Storport Miniport Driver

This is not a monster system by any means.  The storage is pretty standard: virtual pools on an EMC VNX.

The file layout includes 24 tempdb files - one for each logical processor (hyperthreading is enabled).  There are 12 tempdb files on each of two LUNs.


Perfmon data was collected from the system with a 15 second interval.
Lets tackle an easy question first.  On this system is it accurate to say that 'most of tempdb is in the buffer pool'?  Lets use these perfmon counters to determine that.
\\SASQ1024\SQLServer:Buffer Manager\Database pages
\\SASQ1024\SQLServer:Transactions\Free Space in tempdb (KB)

During the observed window, tempdb ranges from 172 GB to 320 GB of free space.  At a minimum, the highwater mark of tempdb data footprint was 148 GB during this timeframe.
At the same time, the bpool database pages ranged from 38.3 GB to 58.8 GB.  Even if the bpool database pages contained only tempdb, it would contain less than half of the highwater mark of the tempdb data footprint.
So, on this system disk concerns for tempdb can be a significant performance factor.

How busy is this system?  Not so busy, according to these perfmon metrics.
\\SASQ1024\Processor(_Total)\% Processor Time   
\\SASQ1024\SQLServer:Buffer Manager\Page lookups/sec

 So, the server is not too busy.  How do read and write latency look?
\\SASQ1024\LogicalDisk(U:)\Avg. Disk sec/Read
\\SASQ1024\LogicalDisk(V:)\Avg. Disk sec/Read

Uh-oh.  The server is not very busy, but spending a lot of time above the 'poor' threshold, with lots of 'shockingly bad' values and some 'Wow!'.

Lets try to correlate the high read latency to other perfmon metrics, in order to find out how to help.

\\SASQ1024\LogicalDisk(U:)\Disk Reads/sec
\\SASQ1024\LogicalDisk(V:)\Disk Reads/sec
\\SASQ1024\LogicalDisk(U:)\Disk Read Bytes/sec   
\\SASQ1024\LogicalDisk(V:)\Disk Read Bytes/sec


Hmmm... so read throughput, whether operations or bytes, doesn't seem to correlate well to the spikes in latency.

What about write throughput?

\\SASQ1024\LogicalDisk(U:)\Disk Writes/sec
\\SASQ1024\LogicalDisk(V:)\Disk Writes/sec
\\SASQ1024\LogicalDisk(U:)\Disk Write Bytes/sec
\\SASQ1024\LogicalDisk(V:)\Disk Write Bytes/sec

 So, writes and write bytes to tempdb seem pretty well correlate to the spikes in read and write latency.  Interesting. Why would writes interfere with read latency in such a strongly correlated way?
The 'current disk queue length' give us a clue, although they don't explain the whole story.  And neither will I.  Not today.  I'm sleepy.
\\SASQ1024\LogicalDisk(U:)\Current Disk Queue Length
\\SASQ1024\LogicalDisk(V:)\Current Disk Queue Length

I'll end for now by repeating my recommendations for tempdb, which I recently gave out on the #sqlhelp hashtag. 

Go ahead and use multiple tempdb files (a reasonable number) - but one file per LUN.  This helps:
1. limit IO weaving
2. limit filesystem fragmentation when files expand
3. limit the contention for the disk queue when multiple files on a single LUN are hot

Consider using trace flags 1117 (trigger all database files in the same filegroup to grow together) and trace flag 1118 (no mixed extents).

Increase the maximum transfer size on the fibre channel HBA from the WIndows default of 512kb to 2mb.  This is good for all kindsa reasons - fewer round trips on the data network when SQL Server wants a big read, when larger reads are common there will be lower disk queue length for the same total amount of data, etc.

OK.. Be well, y'all.


  1. Very interesting. How was the issue resolved? Did the recommendations help?

  2. I'll post a followup in the future. Increasing the maximum transfer size had the intended effect of lowering the IOPs and current disk queue length across the board. But along the way we've realized there is a yet-undiagnosed issue limiting the overall throughput to the server, as well. Even though the full bandwidth at the host HBA and at the array front end adapter ports should allow for 1.6 gb/second, the server is never getting above 800 mb/second even though there is CPU to spare, long queue lengths (although shorter than before) and read/write latency still well below the targets.
    So we opted not to change the tempdb configuration yet - once the bandwidth issue is addressed we can re-baseline. Then comparing that baseline against new perfmon data after tempdb config has changed will show the benefit of changing tempdb configuration. If we changed tempdb config while bandwidth was constrained, part of the benefit would be obscured. Removing the bandwidth bottleneck later would result in a performance benefit from the combined effect of increased bandwidth and more favorable tempdb config, and it would be very hard to sort out the level of benefit to the individual changes.

  3. I am trying to understand high tempdb writes. If the SAN/VMWare metrics look healthy, does this mean that there is too many IO requests for Windows to handle? And we need to look for a spill to resolve? Thanks for your articles!

    1. What do you mean by "high"? Maybe high average sec/write?
      Next question would be - what do you mean by healthy VM & SAN metrics?
      IME most SANs provide metrics in a pretty wide observation window - maybe every 5 minutes. But in SQL Server, we typically pay attention to things in much smaller windows - I usually use a 30 second interval for perfmon, for example. Numbers that look bad in 30 second intervals can even out pretty well in 5 minute intervals.
      Also - writes might not look so bad when averaged in with reads - often VMware stats (as well as stats in most flavors of UNIX except AIX) will lump together reads and writes. Really bad write latency can be a number that's just "meh" for read latency - so again putting read and write numbers together can hide a problem.
      Tempdb disk IO patterns - especially spills - are very different than the IO patterns to persistent database files.
      For one thing, spills are nearly always 50% read/50% write. Many "write heavy" workloads still have a 75% read/25% write pattern to the persistent database files.
      Another differentiator is that spills to tempdb are much more aggressive - much higher queue lengths - than other writes.
      If there are multiple tempdb files on a single Windows volume which is a single Windows physical disk, during a spill its not too hard for the disk queue to be overwhelmed. For a pvscsi vHBA, the default queue depth for the attached "physicalDisks" is 64. If the "current disk queue length" in perfmon shows a number higher than 64 - that is an indication of use of the wait queue at that level. It is possible to increase the queue depth of the individual physicalDisks and the pvscsi vHBA adapter. But that will help the most if the LUN queue depth of the physical LUNs in the ESXi host can handle it. If using RDM disks (more rare all the time), its a 1 to 1 relationship for host LUNs to guest physicalDisks. If not, multiple guest physicalDisks could have VMDKs in a single VMFS filesystem in a single datastore. Maybe that datastore is providing VMDKs to multiple VMs. In cases like that, the LUN queue depth and adapter queue depth in the VMware host can be overwhelmed.
      If using VMDKs, it can also be important to check up on sioc (that kicks in to add additional throttling after average latency exceeds a threshold) and adaptive queue throttling (that kicks in when the array signals back that its queues are full).
      Finally, full or nearly full write cache in the array is still a common performance challenge for some database workloads, and tempdb spills stress write cache.

      Eliminating and taming spills can give storage a much needed break. But, I also encourage reviewing system configuration so that some amount of aggressive tempdb use can be absorbed.