Sunday, February 17, 2013

Trace flag 8015 - single memory node and scheduler group

I applied trace flag 8015 and 8048 to SQL Server on the dl560 and restarted.

The trace flags are in place...
dbcc tracestatus
TraceFlag    Status    Global    Session
8015         1         1         0
8048         1         1         0

Trace flag 8015 has resulted in only one CPU group (parent node id for the schedulers)... 

select distinct parent_node_id from sys.dm_os_schedulers

Trace flag 8015 has also resulted in a single memory node for the server...

select distinct memory_node_id from sys.dm_os_memory_nodes

**** UPDATE by sql_sasquatch 20130426 ****
Seems like interest in trace flag 8015 is starting to bring folks here.  See a susequent blog post for a more comprehensive discussion of trace flag 8015 with lotsa links to the Microsoft docuemtnation about SQL Server NUMA support and the consequences of disabling it.
**** END UPDATE ****

**** UPDATE by sql_sasquatch 20130810 ****
Finally got around to adding some graphs to explain my interest in disabling NUMA support at the SQL Server layer.  Check it out.
**** END UPDATE ****

Thursday, February 14, 2013

FCoE gives me MSSQL headaches!

Converging networks is all the rage.  I personally am a fibre channel guy (maybe I just like the word 'lossless'), but we've done some testing on FCoE due to interest from others.  The testing has not gone well.

Under a moderate load changes from our ETL, IO related errors occur.  Eventually, pending writes reach the system limit and SQL Server crashes. Changing nothing other than the data network protocol allows the workflow to complete.  The errors are not reproducible with batch report workloads, only the ETL workload.

Although the errors listed below mention corruption as a possible cause, dbcc checkdb never indicated any type of corruption remaining after the system was recovered.

Here's the matrix of what we've tested.  Every FCoE test failed on physical server and succeeded on VMWare.  Fibre channel succeeded in every test.  The iSCSI protocol was not thoroughly tested, but some workflows were performed against it and did not present the error conditions.

(Physical Server, VMware ESXi Hypervisor)
(Emulex CNA, Cisco CNA)
(Windows Server 2008 R2 SP1,Windows Server 2012)
(SQL Server 2008 R2, SQL Server 2012)

My colleague describes one of the first bouts with this condition at the following link, including discussion of the test scenario and some of the hardware setups we've worked with.

The failed sessions start their anguish with a report of delayed IO return, even though perfmon shows good IO response time, and SQLio tests on the storage subsystem showed good response times above the level of peak activity during testing.

The order of events in the SQL Server error log, with excerpted examples below:
1. Delayed io message 833. (Although the message indicates the C drive, the database file is on a mounted volume at directory c:\db\mydb\8).
2.  Flushcache due to delayed checkpoint.
3.  Latch timeouts
4.  Failed checkpoint leads attempted rollback
5.  Rollback fails.

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [c:\db\mydb\8\MyDB_08.ndf] in database [MyDB_24] (5).  The OS file handle is 0x0000000000000DD8.  The offset of the latest long I/O is: 0x000001a8ef0000
2012-10-09 16:04:44.15 spid16s     FlushCache: cleaned up 102633 bufs with 27634 writes in 346041 ms (avoided 488 new dirty bufs) for db 5:0
2012-10-09 16:04:44.15 spid16s                 average throughput:   2.32 MB/sec, I/O saturation: 18149, context switches 22248
2012-10-09 16:04:44.15 spid16s                 last target outstanding: 4800, avgWriteLatency 0

2012-10-09 16:11:48.00 spid116     A time-out occurred while waiting for buffer latch -- type 3, bp 000000190555A5C0, page 3:2094792, stat 0x1010f, database id: 5, allocation unit Id: 6488064, task 0x0000001596308928 : 0, waittime 300 seconds, flags 0x19, owning task 0x000000178E73C188. Not continuing to wait.

2012-10-09 16:11:50.04 spid72      A time-out occurred while waiting for buffer latch -- type 4, bp 0000001939A62C40, page 3:11267982, stat 0x1010f, database id: 5, allocation unit Id: 72057595887288320, task 0x0000001814E05498 : 0, waittime 300 seconds, flags 0x19, owning task 0x000000178E73C188. Not continuing to wait.

2012-10-09 16:12:56.66 spid90      A time-out occurred while waiting for buffer latch -- type 2, bp 000000192ACEA380, page 10:4012592, stat 0x40d, database id: 5, allocation unit Id: 536870912/281475513581568, task 0x000000146EB48188 : 12, waittime 300 seconds, flags 0x3a, owning task 0x000000146EB48188. Not continuing to wait.

2012-10-09 16:17:33.69 spid60      Timeout occurred while waiting for latch: class 'ACCESS_METHODS_DATASET_PARENT', id 00000017D7C426C0, type 4, Task 0x00000012EE1A8928 : 31, waittime 300 seconds, flags 0x1a, owning task 0x00000017D7F7E558. Continuing to wait.


Some background information on checkpoint, for my own future reference.

Poor storage performance could cause something like this - its the usual cause of the delayed IO messages.  But there is no evidence of a performance problem on the storage.  The guy that configured the storage knows what he's doing, from a general storage config standpoint and certainly from a SQL Server standpoint.  Anyone can make a mistake, but after weeks of wondering about this, if there was an easy-to-catch storage config error, he would have found it.
Database corruption could cause a problem like this.  But dbcc checkdb shows nothing.  And, finally - just changing to access the disks by fibre channel fixes the problem.
By cranking up the logging level from the Emulex driver, we were able to get an IO error out of the Windows log.  Before doing that, there was no indication of trouble in the error log.

I used this 97 page driver documentation pdf as the decoder ring.
On page 72 you'll find the beginning of:
Appendix A Error/Event Log Information, FC/FCoE Error/Event Logs

Here's what we got out of the system error log.
Log Name:      System  Source:        elxcna  Date:          10/17/2012 2:31:55 PM  Event ID:      11  Task Category: None  Level:   Error  Keywords:   Classic  User:   N/A  Computer:  Description: The driver detected a controller error on \Device\RaidPort1. Event Xml: <Event xmlns=""> <System> <Provider Name="elxcna" /> <EventID Qualifiers="49156">11</EventID> <Level>2</Level> <Task>0</Task> <Keywords>0x80000000000000</Keywords> <TimeCreated SystemTime="2012-10-17T19:31:55.129962700Z" /> <EventRecordID>6255</EventRecordID> <Channel>System</Channel> <Computer></Computer> <Security /> </System> <EventData> <Data>\Device\RaidPort1</Data> <Binary>0F00180001000000000000000B0004C0AD02000000000000000000000000000000000000000000000000000000000000000000000B0004C00000000000000000</Binary> </EventData> </Event> 
The Emulex codes had me thinking for a moment that there was a physical problem with the cabling. Until the same workload on the same server with SQL Server running on VMWare, accessing the storage through FCoE was successful.
We've pulled in Emulex and Microsoft.  No resolution yet.  Failures reproduced with Cisco CNAs as well.  So - what's up?  
Anyone else seen such issues with FCoE and SQL Server?

Wednesday, February 13, 2013

MSSQL bpool disfavoring

I expect trace flag 8015 + trace flag 8048 to be an all around win for the systems I work with, when on NUMA servers. These systems have two primary workloads: regularly scheduled batch ETL (thousands of tables, concurrency managed by an external agent) and regularly scheduled report batches (thousands of reports, concurrency managed from report app servers).

Some simulated workflow testing, and live system observation, can confirm or negate my expectations. But messing with memory management and task scheduling with a single trace flag is a big deal. The main workflows are not the only important considerations. To etl and batch report, I'll add as important considerations, at least: dbcc checkdb. Index rebuilds. Statistics updates. Backups. Recovery. Restore. Tx logging.

In order to understand more fully the implications of trace flag 8015 on a NUMA system, I want to learn more about the management of the bpool.

Trace flag 8015 is expected to help in three ways:
1. More balanced scheduling of tasks across all schedulers.
2. More consistently balanced use of bpool across the server, instead of very imbalanced PLE and buffer churn across memory nodes.
3. Elimination of potential query memory inflation due to away buffer lists, leading to more consistent query memory use for a given query for trending, investigation, and optimization purposes.

The following blog post includes some points about bpool management that I haven't seen addressed anywhere else. (There's also mention in Bob Dorr's SQL server io presentation that the touch count algorithm previously used has been abandoned in favor of time-of-last-access; must be time of penultimate access for LRU-K/2 algorithm? Oracle still uses touch count, I believe.)

No questions on the following two matters, yet :)
-"bstat" in DBCC PAGE output in page header has one bit, that says if the page is disfavoured or not, but these bits are not documented
-DBCC CHECKDB disfavours pages read in the buffer pool
-sys.dm_db_index_physical_stats also uses disfavouring

But the following two points raise lots of questions.
##UPDATE 2/13/2013## I don't feel too badly about having lots of questions about the disfavor of scans based on size relative to the bpool.  Paul Randal posted about the disfvoring done by dbcc checkdb today, and mentioned that he hasn't reproduced the scan resulting in disfavoring yet.

-Lazywriter uses the BUF structure to know the last 2 times, when a page was accessed
-When a scan reads data of more than 10% of the buffer pool, the pages are disfavoured immediately

The questions so far:
*Does BUF structure track physical access times(such as physical write and readahead), or logical only?

*In Oracle there is a stat "prefetched blocks aged out before use". Under bpool pressure can SQL Server evict readahead blocks before they are used?

* On a NUMA server, does disfavor of a scan begin at 10% of relevant NUMA node buffer pool?

* How does this work with bpool ramp up?

* is the decision to disfavor made before the scan starts?

* Do all pages of the scan get disfavored, or only those after the 10% threshold is exceeded?

* if the first scan pages to be disfavored are those after the 10% threshold, are the pages already in the bpool 'retro-disfavored'?

Sunday, February 10, 2013

MSSQL - Drive Letters vs Mounted Volumes?

Ok, time to gather info for updating best practices.

I had previously assumed that MSSQL behavior with LUNs provisioned as Windows drive letters and LUNs provisioned as mounted volumes under a drive letter was the same.

I'd heard of clustering challenges if the mounted volume was not within the directory tree of a drive letter.  And I know it can present some challenges for monitoring IO utilization and performance.  But its such a widespread practice and general performance testing has not yielded any specific differences.

However, dbcc checkdb was not included in our performance testing.  And, this kb article points out a potential difference in dbcc checkdb behavior with drive letters vs mounted partitions.

A single database with a single filegroup, and 8 database files with each file on its own drive letter.  By default, dbcc checkdb will build 8 lists of page reads.

Same situation with the 8 database files each on their own mounted volume, all within the directory tree of a single drive letter.  By default dbcc checkdb will build only one list for page reads, because there is only 1 drive letter.

Enter trace flag 2549.  With this trace flag, dbcc checkdb will assume each database file is on its own LUN, and build 8 page lists even though there's only one drive letter.

Be careful - if you've got several database files for a database on a single LUN (I consider that contrary to best practices in the first place), trace flag 2549 could lead to overtaxing IO resources.  But if you adhere to a one database file per one LUN model, trace flag 2549 can make dbcc checkdb behavior the same for drive letters or mounted volumes.

I wonder if there's anything else out there that mounted volume setups need to be aware of to have the same behavior as a similar system with drive letters?

SQL.Sasquatch, SQL_Sasquatch, SQL-Sasquatch

I don't like hyphens.  In handles, names, URLs I'd much rather have periods or underscores.  Alas, blogger will only let me have a hyphen in my blogspot URL.  So, it is.

I am square peg in a round hole, maybe a bit of a bull in a china shop. 

The SQL queries I write often give me away as a novice: I'm not from around here.

I built most of my expertise working with a non-relational database on UNIX platforms.  And I paid a lot of attention to memory management, processor scheduling, and storage IO.  I've seen long-forgotten or ignored memory management or processor scheduling settings be the key to unlocking the performance potential of systems.  I've seen crippling effects of storage bottlenecks caused by insufficient IO buffers or request slots, imbalanced storage utilization, writes-blocking-reads, copy-on-write overhead, filesystem fragmentation, file fragmentation, database object fragmentation, cache saturation, back end port/bus saturation, front end port saturation, RAID 5 random write overhead, replication overhead, and a few others that I will only remember (if I'm lucky) the next time I encounter them.

I'm glad when I hear people talk about capturing a baseline for future performance and resource utilization evaluation.  But few recognize the value in challenging the assumption that the system was optimally configured when the baseline was captured.  Some system limits only cause havoc after a certain resource utilization or concurrency threshold is crossed.  Identifying pre-existing latent performance or scalability limitations can be really hard.  Often the most obvious performance opportunity is purported to be the most significant performance opportunity.  Sometimes this leads to a never-ending engagement of code tuning in order to achieve healthy performance.  Sometimes it leads to provisioning additional system resources (throwing money at it).  In some cases, intense investigation and some system tuning could give enough back that code and system provisioning could be reviewed without the stress of an ongoing crisis to provide additional benefits in the future.

One of the reasons I'm valuable to have around is that I'll challenge the assumptions of optimal configuration at baseline, and that configuration or activity changes since baseline are the most significant factor in performance changes.  Not necessarily with the finesse you'd expect out of someone named sql_sorcier, for example.  Maybe only fittingly for someone named sql_sasquatch.

Friday, February 8, 2013

Big-Big table migration; table spool and uniquifier

Helping out today with a copy from one database to another of a table with over 9.5 billion rows in SQL Server 2008 R2.  Yikes.

SELECT text FROM sys.messages WHERE message_id = 666 and language_id=1033

The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID %I64d. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.
 I think SQL 2005 kb 937533 is relevant.  Table spool may only be able to handle 2,147,483,648 input rows before exhausting the available values for uniquifier.

Also discussed a little here.

Here's a recurrence in SQL Server 2008.

I'd like to add an indexed "done" column to the source table.  Select the TOP million rows where done is NULL, insert to the target, set done = 1 on the TOP million rows where done is NULL, commit and repeat while select from source where done is NULL exists.

But first the route of common table expression and NTILE will be tried.

Trouble is... maybe both methods might still be vulnerable to exhausting the uniquifier.  Kevin Boles thinks that even bcp in with batch parameter might hit the same limit.

I'm not sure how and when the table spool will start over with the uniquifier.  Anyone know?