Wednesday, April 3, 2013

Windows Port Exhaustion and Connection Failures

Port exhaustion is far from unique to Windows.  I first became familiar with port exhaustion and idle connection behavior while recommending intervention, and eventually writing up some best practices, for a tiered database architecture on AIX and HP-UX systems.  But here I'll describe just the Windows details.
  
Although I've seen sources discuss Windows idle connection termination behavior, other sources discuss the TIME_WAIT state, and still other sources discuss port exhaustion, I can't think of a single place where all of those concepts are tied together in a neat little bundle.  That might still be true once I publish this blog post :)  At least it'll be here for me to refer to in the future :)

If you want to evaluate a particular delayed/rejected connection situation for port exhaustion, the PowerShell script at the following location is a great place to start:
http://blogs.msdn.com/b/debuggingtoolbox/archive/2010/10/11/powershell-script-troubleshooting-for-port-exhaustion-using-netstat.aspx

I recommend evaluating the idle connection and TIME_WAIT behavior regardless, from a best practice standpoint.
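If you just want a quick ad hoc look first, something along these lines summarizes TCP connections by state.  This is a minimal sketch, and it assumes English-locale netstat output:

# Count TCP connections by state (ESTABLISHED, TIME_WAIT, etc.)
netstat -ano -p tcp |
    Where-Object { $_ -match '^\s*TCP' } |
    ForEach-Object { ($_.Trim() -split '\s+')[3] } |
    Group-Object -NoElement |
    Sort-Object Count -Descending

A large and growing TIME_WAIT count, or an ESTABLISHED count approaching the size of the dynamic port range, are the things to look for.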

Client/server communication typically relies on dynamic ports, also known as ephemeral or anonymous ports.
A client will often use an available dynamic port on its own system to initiate communication with a specified port on the server.  The maximum number of simultaneous connections from that client is then limited by the number of dynamic ports.
Some server activities use dynamic ports as well: a communication request initially arrives at a specific server port listener, and the client connection is handed off to a dynamic server port.  FTP is an example of server activity that uses dynamic ports.
Port exhaustion occurs when there are no available dynamic ports for new connections.  Since dynamic ports can be used on both the client and server sides of a connection, port exhaustion can occur on either system.

What can lead to port exhaustion?  I'll only give a few examples here.  But the main factors involved are the limited number of dynamic ports on the client and/or server, the time it takes to terminate idle connections, and the time before a "closed" dynamic port can be reused.
If a web service goes crazy and starts gobbling up dynamic ports, all available dynamic ports can get chewed up real fast.  If it keeps those connections open, no new connections for that protocol will succeed for as long as they remain open.
Even if it closes the connections, there can still be a problem, because of the time a closed port waits in the TIME_WAIT state before it can be reused - by default, 4 minutes.  A crazy web service could keep the rate of new connection requests matched to the rate at which ports become available again for a long time, leading to lots of connection failures.
Here's another thing to add to the mix: if a connection isn't properly closed - for example, because the link between client and server was severed (power outage, switch crash, or even intrusion prevention) - the connection may need to time out as an idle connection before it's terminated.  By default, that takes more than 2 hours.

So, here's where I've typically seen this come into play in an OLTP setting:
A power outage drops a large number of Citrix or web servers from a client server farm.  All of the connections on the 'server' server are now idle.  They aren't terminated immediately; they need to go through the TCP idle timeout protocol.  Assume the number of connections at the time of power loss was 80% of the available dynamic ports.  Assume that many connections will be attempted again when those client servers come back online 20 minutes after the power loss.

That's 160% of the available dynamic ports.  Of the new connections, only 1/4 will succeed before hitting the dynamic port limit: the 20% of dynamic ports still free can satisfy only a quarter of the returning 80%.  Since power was restored 20 minutes after it was lost, 3 out of 4 new connections will be waiting - either for one of the other new connections to terminate, making its port available, or for up to the remaining 1 hour and 40 minutes of the default idle timeout, until the stranded connections from the power loss are terminated.
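To make that arithmetic concrete, here's a quick PowerShell sketch; the pool size is hypothetical and only the ratios matter:

$pool   = 16384          # hypothetical count of dynamic ports in the range
$idle   = 0.8 * $pool    # ports stranded in idle connections after the outage
$free   = $pool - $idle  # ports still free for new connections
$wanted = $idle          # reconnection attempts when the farm comes back
$free / $wanted          # 0.25 - only 1 in 4 attempts gets a port immediately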
 
So, what can be done about this?

*Step 1
Verify dynamic port range is sufficient.

For Windows versions up to and including Windows Server 2003, the default dynamic port range was 1025 to 5000.  In those versions, the HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\MaxUserPort registry value could set a nondefault maximum dynamic port number.  The hotfix for Windows Server 2003 security bulletin MS08-037 introduced a new default dynamic port range of 49152 to 65535.  That default range is still current through Windows Server 2012 (6.2.9200).
http://support.microsoft.com/kb/953230
http://support.microsoft.com/kb/956188

Starting at least with Windows Server 2008 (and maybe with Windows Server 2003 after the security bulletin, I'm not sure), the dynamic port range is specified per protocol.
Check it for each protocol with these commands:
netsh int ipv4 show dynamicport tcp
netsh int ipv4 show dynamicport udp
netsh int ipv6 show dynamicport tcp
netsh int ipv6 show dynamicport udp

A typical recommendation for Microsoft Exchange 2007 systems was a dynamic port range of 1025 to 65535.  That seems reasonable to me for most Windows server systems, unless there is a particular reason against it (such as the possibility of consuming too much connection-related memory).
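If a wider range is called for, it can be set per protocol with netsh as well.  Here's a sketch for the 1025 to 65535 range above - note that num is a count of ports rather than an upper bound, so 65535 - 1025 + 1 = 64511:

netsh int ipv4 set dynamicport tcp start=1025 num=64511
netsh int ipv4 set dynamicport udp start=1025 num=64511
netsh int ipv6 set dynamicport tcp start=1025 num=64511
netsh int ipv6 set dynamicport udp start=1025 num=64511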

*Step 2
Verify idle connection behavior is appropriate.
For an idle TCP connection, the default Windows behavior is to wait for 2 hours before sending a TCP keepalive packet.
The registry value KeepAliveTime (milliseconds) can specify an alternative elapsed time.
If a keepalive packet is not acknowledged within 1 second (the default), another keepalive packet is sent; the registry value KeepAliveInterval (milliseconds) specifies this interval.
On some Windows versions, the number of keepalive retries is determined by the registry value MaxDataRetries.  Various documents indicate this is hardcoded at 10 in Windows Server 2008 R2.
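None of these values exist in the registry by default.  To check whether any of them have already been set, reg.exe can query the parameters key; an error just means the default is in effect:

reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v KeepAliveTime
reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v KeepAliveInterval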

I recommend setting KeepAliveTime to 300000 milliseconds (5 minutes), dropping the expected idle time before termination to 5 minutes and 10 seconds: 5 minutes to the first keepalive probe, plus up to 10 unacknowledged probes at the 1 second interval.
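A sketch of setting it with reg.exe - the value is in milliseconds, and a reboot is typically required for Tcpip parameter changes to take effect:

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v KeepAliveTime /t REG_DWORD /d 300000 /f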

I do not recommend changing KeepAliveInterval from the 1 second default.  There isn't much to gain there, and I haven't seen recommendations to change this value in any of the sources I've consulted.

While "Additional Registry Entries" below recommends changing MaxDataRetries from default 5 to 3 retries, there just isn't much to gain in terms of idle time expected before termination.

TCP/IP Registry Values for Microsoft Windows Vista and Windows Server 2008
http://www.microsoft.com/en-us/download/details.aspx?id=9152

Windows Server 2008 R2 and Windows Server 2008
>Secure Windows Server
>>Threats and Vulnerabilities Mitigation
>>>Threats and Countermeasures Guide: Security Settings in Windows Server 2008 and Windows Vista
>>>>Additional Registry Entries
http://technet.microsoft.com/en-us/library/dd349797%28v=ws.10%29.aspx

*Step 3
Verify appropriate TIME_WAIT behavior.
The registry value TCPTimedWaitDelay in HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters determines how long a port remains in TIME_WAIT status after it has been closed.  The default is 240 seconds, or 4 minutes, so a dynamic port typically cannot be reused until 4 minutes after communication is complete.  The intent of this delay is to allow efficient handling of a new communication request from the same client.  However, the combination of a small dynamic port range and a long wait before reuse can easily lead to port exhaustion.
This KB article describes how the size of the dynamic port range and the time period specified by TCPTimedWaitDelay can cause port exhaustion for SQL Server client connections.
http://support.microsoft.com/kb/328476
I recommend a TCPTimedWaitDelay of 30 seconds.  This aligns with the typical Exchange 2007 recommendation.
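As above, a sketch of setting it with reg.exe - this value is in seconds, and takes effect after a reboot:

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f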

*UPDATE 4/29/2013 sql.sasquatch
First time I've noticed this: an 'application layer' TCP keepalive was implemented in SQL Server 2005, with a default timeout of 30 seconds and a default interval of 1 second. This is separate from the 'LooksAlive' and 'IsAlive' checks used with clustering.
END UPDATE*
http://msdn.microsoft.com/en-us/library/ms190771(v=sql.105).aspx
http://blogs.msdn.com/b/sql_protocols/archive/2006/03/09/546852.aspx

5 comments:

  1. All right... no-one else will comment, so I will... like I usually do :)

    I'm not an Azure guy myself, but lots of other folks are heading into this new territory. In addition to making your servers more tolerant of high rates of port use as above - and often before doing so - it's a good idea to find the cause if the port occupation level is unexpected. Here's a recently corrected port leak that can lead to exhaustion.

    ".NET Clients encountering Port Exhaustion after installing KB2750149"
    http://blogs.msdn.com/b/windowsazurestorage/archive/2013/05/25/net-clients-encountering-port-exhaustion-after-installing-kb2750149.aspx

  2. Looks like my link in the previous comment is now orphaned.
    Go here instead.
    ".NET Clients encountering Port Exhaustion after installing KB2750149 or KB2805227"
    http://blogs.msdn.com/b/windowsazurestorage/archive/2013/08/08/net-clients-encountering-port-exhaustion-after-installing-kb2750149-or-kb2805227.aspx

  3. Here's a more recent Microsoft blog post with an updated PowerShell script for monitoring port exhaustion.
    http://blogs.technet.com/b/clinth/archive/2013/08/09/detecting-ephemeral-port-exhaustion.aspx
