Under moderate load from our ETL, IO-related errors occur. Eventually, pending writes reach the system limit and SQL Server crashes. Changing nothing other than the data network protocol allows the workflow to complete. The errors are not reproducible with batch report workloads, only with the ETL workload.
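While the backlog is building, the pending writes are visible from inside SQL Server. A minimal sketch of one way to watch them, using the standard sys.dm_io_pending_io_requests DMV (this query is my illustration, not part of the original diagnosis):

-- Outstanding IO requests as SQL Server sees them.
-- io_pending = 1 means the request is still pending at the OS level;
-- io_pending_ms_ticks is how long it has been outstanding.
SELECT io_type,
       io_pending,
       io_pending_ms_ticks,
       scheduler_address
FROM sys.dm_io_pending_io_requests
ORDER BY io_pending_ms_ticks DESC;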
Although the errors listed below mention corruption as a possible cause, DBCC CHECKDB never indicated any corruption remaining after the system was recovered.
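For reference, the check was along these lines; the database name mydb is a placeholder taken from the file path mentioned below:

-- Full consistency check; report every error, suppress informational messages.
DBCC CHECKDB (N'mydb') WITH NO_INFOMSGS, ALL_ERRORMSGS;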
Here's the matrix of what we've tested. Every FCoE test failed on a physical server and succeeded on VMware. Fibre Channel succeeded in every test. iSCSI was not thoroughly tested, but the workflows we did run against it did not present the error conditions.
(Physical Server, VMware ESXi Hypervisor)
(Emulex CNA, Cisco CNA)
(Windows Server 2008 R2 SP1, Windows Server 2012)
(SQL Server 2008 R2, SQL Server 2012)
My colleague describes one of our first bouts with this condition at the following link, including a discussion of the test scenario and some of the hardware setups we've worked with.
The failed sessions begin their anguish with a report of delayed IO, even though perfmon shows good IO response times, and SQLIO tests of the storage subsystem showed good response times at activity levels above the peak we saw during testing.
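To corroborate the perfmon numbers from inside SQL Server, per-file write latency can be checked with sys.dm_io_virtual_file_stats; this query is a sketch I'm adding for illustration, not part of the original evidence:

-- Cumulative write counts and stalls per database file since instance startup.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes
       END AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_write_stall_ms DESC;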
The order of events in the SQL Server error log, with excerpted examples below:
1. Delayed IO message 833. (Although the message indicates the C drive, the database file is on a volume mounted at directory c:\db\mydb\8.) A sketch for pulling these messages out of the error log follows this list.
2. FlushCache due to a delayed checkpoint.
3. Latch timeouts.
4. The failed checkpoint leads to an attempted rollback.
5. The rollback fails.
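The 833 messages can be fished out of the current error log with the undocumented but widely used xp_readerrorlog extended procedure; the search string below matches the message text ("I/O requests taking longer than 15 seconds to complete"), and the call itself is my illustration:

-- Search the current SQL Server error log (log 0, type 1 = SQL Server)
-- for the text of message 833.
EXEC master.dbo.xp_readerrorlog 0, 1, N'15 seconds';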
Poor storage performance could cause something like this - it's the usual cause of the delayed IO messages. But there is no evidence of a performance problem on the storage. The guy that configured the storage knows what he's doing, from a general storage config standpoint and certainly from a SQL Server standpoint. Anyone can make a mistake, but after weeks of wondering about this, an easy-to-catch storage config error would have been found by now.
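One more way to corroborate that from inside SQL Server is the cumulative IO wait statistics; this is a sketch I'm adding for illustration, not part of the original investigation:

-- IO-related waits accumulated since instance startup. Modest totals here
-- would back up the claim that storage performance itself is healthy.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'PAGEIOLATCH%'
   OR wait_type = N'WRITELOG'
ORDER BY wait_time_ms DESC;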
Database corruption could cause a problem like this, but DBCC CHECKDB shows nothing. And finally - just switching to accessing the disks over Fibre Channel fixes the problem.
By cranking up the logging level on the Emulex driver, we were able to get an IO error out of the Windows system log. Before doing that, there was no indication of trouble in that log.
I used this 97-page driver documentation PDF as the decoder ring.
Here's what we got out of the system error log.
The Emulex codes had me thinking for a moment that there was a physical problem with the cabling - until the same workload on the same server, with SQL Server running on VMware and accessing the storage through FCoE, was successful.

Log Name: System
Source: elxcna
Date: 10/17/2012 2:31:55 PM
Event ID: 11
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: R710-07.domain.com
Description: The driver detected a controller error on \Device\RaidPort1.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="elxcna" />
    <EventID Qualifiers="49156">11</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2012-10-17T19:31:55.129962700Z" />
    <EventRecordID>6255</EventRecordID>
    <Channel>System</Channel>
    <Computer>R710-07.domain.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\RaidPort1</Data>
    <Binary>0F00180001000000000000000B0004C0AD02000000000000000000000000000000000000000000000000000000000000000000000B0004C00000000000000000</Binary>
  </EventData>
</Event>
We've pulled in Emulex and Microsoft. No resolution yet. Failures were reproduced with Cisco CNAs as well. So - what's up?
Anyone else seen such issues with FCoE and SQL Server?