Saturday, February 20, 2016

"Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached"

An email inbox full of this error message is not the best way to start a morning. Our service had been running in production smoothly for months, but software engineering is a cruel profession and bad things are bound to happen.

Luckily only one of our three production servers was down so we quickly pulled the problem child out of the load balancer rotation and began to investigate. We used perfmon to monitor the open connections on the server and were easily able to fill up the connection pool with only a few hundred concurrent requests from our tests scripts. After waiting a few minutes the connections were reclaimed and the service would become responsive again.

At this point it was rather obvious we were leaking connections, but how? This error message is an indication that connections were being taken from the pool but not released again. So the first obvious thing to do was check to make sure all our SQL clients were being properly disposed. This would be an easy fix if somehow we missed putting our SQL clients in using blocks, but after looking through the code all of our clients were in using blocks and should be disposed properly.

Connection pooling
SQL connections are expensive to create. It consists of creating a socket, doing an initial handshake, authenticating the credentials, doing transaction checks, etc. Connection pooling is used as an optimization technique used to minimize the number of times a connection must be established. So, when a SQL client wants to connect to the database a connection will be created or one taken from the existing pool of connections. Once the SQL client is finished with the connection it will close the connection when Dispose() is called and the connection will go back to the connection pool. Each connection pool is associated with a distinct connection string. If the maximum number of connections in the pool has been reached and there are no more available connections in the pool then a timeout will occur when a connection is requested.
Duplicating a production problem locally is the holy grail of fixing production bugs. Once local duplication has been reached fixing the bug is usually straight-forward. In this case I was not able to reproduce this problem locally even when sending 10,000 concurrent requests at my local service. Am I hitting the same endpoint? Yes. Similar CPU/memory specs as production? Yes. After running through many of these questions we tried setting <gcServer enabled="true"/> on my local dev machine and YES!! Duplication. This server setting is used in production and was required to duplicate this problem.

Garbage collection
The gcServer element specifies whether or not to use server or workstation garbage collection. Workstation garbage collection collects on the same user thread that triggered the garbage collection, so it must compete with other threads when garbage collecting. Server garbage collection dedicates threads specifically to garbage collection at high priority, so this setting will perform better on multi-core CPU machines.
After being able to reproduce the problem locally it was easy to narrow down the offending code (F#).

use cmd = new GetRecords(connectionString)
let! dbResult = cmd.AsyncExecute(contactID)
let records =
    dbResult
    |> Seq.map Convert.convertRecords
That doesn't look so bad right? The SQL command is in a using and Dispose() should be called on it when it goes out of scope. Well, we had our suspicions about the fact that the database call was retuning a sequence and we were curious how this code would perform:

use cmd = new GetRecords(connectionString)
let! dbResult = cmd.AsyncExecute(contactID)
let records =
    dbResult
    |> Seq.toArray
    |> Array.toSeq
    |> Seq.map Convert.convertRecords
Voila! Materializing the sequence to an array before returning actually fixed the problem. But why? Sequences are not computed until it is traversed, so returning a sequence of objects retrieved from a database will keep open the database connection until the sequence is fully traversed. This is bad because we want that database connection returned back tot he pool as quick as possible so it can be used by other threads.

Summary
Be weary of sequences of items returned from a database that are lazily evaluated. It is best practice to materialize the sequence to an array before returning it out of the function. Also, make sure your load tests are run before EVERY release in a QA environment that exactly mirrors production. Hopefully that will save you headaches in the future.

No comments:

Post a Comment