Thursday, April 16, 2015

SQL Server PLE is a *very* tricky performance metric

PLE, or page life expectancy, is a metric that can be retrieved from SQL Server DMVs or from perfmon when monitoring a system.

It is a measure of the expected time a database page entering the buffer cache pool will remain before being evicted.  Pressure on the buffer cache pool causes pages from the pool to be evicted in order that the bpool page can be added to the free list - that pressure can come from database pages being read in, or from pages being allocated and written to the first time (first writes to a page don't require the page to be read), or from buffer pool pages being stolen by other memory consumers such as plan cache, query memory, etc.  Finally, the buffer pool may be pressured to evict contents so that the "total server memory" can shrink in response to low server memory.

In addition to the mix of factors which can change the buffer pool eviction rate, PLE is further complicated by the fact that on a NUMA server, the buffer pool is by default split into sections such that there are as many sections of the bpool as NUMA nodes - each section then actually has its own PLE.  The general PLE reported for the instance is an aggregation of the PLEs for each NUMA node.

But - it gets trickier :-)  Online there are still lots of references to a rule-of-thumb of 300 as a PLE performance threshold number.  My response to that particular recommendation, which is typically recognized as an outdated recommendation, is that the perfomance target for PLE is extremely workload and system dependent.  Without a baseline of PLE for a workload on a system, and without other supporting metrics for the same workload on the system, most decisions based on PLE will really boil down to guesses

I'll show you what I mean.  Here are two graphs, of the same system running the same ETL (although with daily variation in data) separated by a few weeks.

On the earlier ETL execution, from 2:10 to 3:00 there were up to 100 queries executing at a time, and PLE ranged from just over 100 to almost 400.  Knowing that the old rule-of-thumb is 300 for PLE, maybe I'd want to start investigating.  But at the same time - it didn't stay below 300 - so maybe its not a problem?





Fast forward by a few weeks.  Still less than 100 concurrent queries - but most of the time from 2:10-3:00 there are fewer queries running in this graph than there were before. But PLE never gets above 100!  Now there's for sure a problem!



Not so fast.  Lets throw in CPU utilization for SQL Server and compare.





Huh.  Quite a bit more CPU utilization in the second, green graph than in the first red one.  Since the workload is the same (same ETL jobs/packages running in same order although threading/dependencies can cause some timing differences), we can also look at request completion as a gauge for the pace of work.





Just a moment ago it looked like the second day's ETL was experiencing a performance problem.  Now it looks like maybe performance was better on the second day - it was able to accomplish more work per unit of time?

Because this particular system has a working set much larger than server RAM, bytes/sec processed can also be used as a measure of work.





Oh.  So, during the first timeperiod the system was experiencing major performance degradation due to sky-high read response times, which limited read bytes/sec.  During the second timeperiod latency was much, much lower and throughput was much higher.

No queries were tuned in the interim - I ain't necessarily the guy you want tuning queries :-)  But we took steps to make the storage accommodate the workload much better.  In turn, the pace of work increased - the pace of query completion increased.  Because data was returning from disk faster, disk IO wait decreased and CPU utilization increased.   Reads were making it into the buffer pool faster, forcing pages to need to be evicted sooner.  That lowers PLE.  Because it was an ETL, increased pace of reads allowed for an increased rate of logical writes(assuming transactions weren't impeded by high write log waits, or log buffer waits).  Every database page newly allocated in the buffer pool by writes also worked to lower PLE.

So in this case, lower PLE was certainly not a symptom of a performance problem.  Rather, it was a sign that efforts to improve system performance were successful. :-)

This can happen often.  Replace an HDD storage system with an all flash array(AFA) on a system that performs a workload like this?  Everything else being equal, I'd expect IO wait to drop, CPU utilization to increase, and PLE to drop.  That drop in PLE wouldn't be a sign of a performance problem - but rather a sign of the system performance improvement rendered by the AFA.  Similarly, if new, faster CPUs are applied to the workload with the same amount of RAM - to the extent that the workload finishes faster I'd expect the PLE to drop.

This is a very important type of performance analysis and evaluation for batch-focused workloads which sometimes gets lost in the (largely) OLTP-focused world.  PLE and similar system metrics have a whole different meaning for OLTP systems - especially for end user OLTPs.  That's because improving system performance for a given end user system with a fixed number of end users has a hard stopping point.  No amount of system tuning can get rid of the minimal required 'user think time' required to continue driving the workload :-)  Also, batch workloads can be greatly assisted by readahead.  The smaller queries typically executed in an OLTP setting reap minimal benefits from readahead, instead relying on data to already be in cache to avoid disk IO wait time.  So PLE becomes a much more important concept.

I don't want anyone to ignore the PLE - it can be a very important measure.  But it can also be very tricky - gotta have supporting evidence to be sure of what it tells you.

No comments:

Post a Comment