I don't know how Slick manages its thread pool, but someone needs to make sure that the pool is sized properly for both the number of threads and the number of Connections that may be obtained by a single thread.
val tasks: IndexedSeq[Future[Int]] = 1 to 20 map { i =>
  val action = { table += i }.map { identity }
  database.run(action.transactionally)
}
This can lead to a deadlock between the thread pool and the
connection pool. The "map" call causes the action to be suspended
(because the mapping function needs to be executed on a different
thread) while holding a connection (due to "transactionally"). The
continuation is scheduled with high priority to ensure that it can
always be inserted into the task queue. If you start enough of these
actions concurrently you will end up with a situation where the
connection pool is exhausted, thus getConnection blocks until an
existing connection is freed up. However, this blocking happens on the database thread pool, preventing the continuations that would eventually commit the transaction and return the connection to the pool from running, thus causing a deadlock.

If HikariCP is not timing out each call to getConnection() after the configured connectionTimeout, it would come as a surprise and would certainly be a bug. A simple reproduction scenario (in Scala, Java, or whatever) would be greatly appreciated.
val tasks = 1 to 5 map { i =>
  val a = sql"select 1".as[Int].head.map(identity)
  dc.db.run(a.transactionally)
}
Await.result(Future.sequence(tasks), Duration(30, TimeUnit.SECONDS))

The expectation is that this will fail after a bit over 1 second, but in fact it takes over 12 seconds for the SQLTimeoutException to arrive.
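For context, this is roughly the kind of setup such a reproduction assumes: a small HikariCP-backed pool with a 1000 ms connectionTimeout, so a blocked getConnection call should fail after about one second. A minimal sketch, assuming Slick 3.1-style config keys (numThreads, maxConnections, connectionTimeout) and an H2 in-memory database; the names and values are illustrative only:

import com.typesafe.config.ConfigFactory
import slick.driver.H2Driver.api._

// Hypothetical configuration: one Slick worker thread and one pooled
// connection, with HikariCP's connectionTimeout set to 1000 ms.
val config = ConfigFactory.parseString("""
  testdb {
    url = "jdbc:h2:mem:test"
    driver = "org.h2.Driver"
    connectionPool = "HikariCP"
    numThreads = 1
    maxConnections = 1
    connectionTimeout = 1000
  }
""")
val db = Database.forConfig("testdb", config)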
def transactionally: DBIOAction[R, S, E with Effect.Transactional] =
  SynchronousDatabaseAction.fuseUnsafe(
    StartTransaction.andThen(a)
      .cleanUp(eo => if (eo.isEmpty) Commit else Rollback)(DBIO.sameThreadExecutionContext)
      .asInstanceOf[DBIOAction[R, S, E with Effect.Transactional]]
  )

This schedules a Rollback even when StartTransaction fails. Fixing this needs further investigation. I don't think it would be as simple as moving the "StartTransaction andThen ..." out of the "cleanUp", because StartTransaction has already pinned the session if it fails in the actual "setAutoCommit(false)" call, so we'll have to distinguish between failures during connection acquisition and later ones. The same problem exists in "withPinnedSession", but it will require a different kind of fix because DBIO.Pin doesn't acquire a connection, so the actual failure comes later. Maybe it *should* force the connection, then we could immediately unpin the session if this fails.
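At the user level, the cleanUp combinator has exactly the semantics that cause this: the cleanup function runs whether or not the preceding action succeeded, receiving Some(error) on failure and None on success. A minimal sketch of that behaviour (the failingSetup name and the println calls are made up for illustration; they merely stand in for a failing StartTransaction and for Commit/Rollback):

import scala.concurrent.ExecutionContext.Implicits.global
import slick.driver.H2Driver.api._

// Stand-in for a StartTransaction that fails, e.g. in setAutoCommit(false).
val failingSetup: DBIO[Unit] = DBIO.failed(new RuntimeException("could not start transaction"))

val withCleanup =
  failingSetup.andThen(sql"select 1".as[Int].head).cleanUp { eo =>
    // This function runs even though failingSetup failed; in transactionally
    // the else branch is Rollback, which is why a Rollback gets scheduled
    // after a failed StartTransaction.
    if (eo.isEmpty) { println("cleanup: would Commit"); DBIO.successful(()) }
    else            { println("cleanup: would Rollback"); DBIO.successful(()) }
  }
// db.run(withCleanup) prints "cleanup: would Rollback" and then fails the
// returned Future with the original exception (keepFailure defaults to true).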
Regarding the connectionTimeout in general, while you think a connectionTimeout of 0 sounds attractive, I think in practice it certainly is not. Consider a pool of 10 connections. At some point a Connection reaches its maxLifetime and is retired from the pool. A thread is scheduled to replace the Connection in the pool. While this may not take long, Connection setup still requires a non-zero amount of time -- possibly a few milliseconds, possibly triple-digit milliseconds in the case of a busy server or SSL setup for an encrypted connection. Either way, for some period of time (very likely less than 1000ms), the pool has 9 connections, not 10. If nine of the ten connections are in use and we allowed a connectionTimeout of 0ms, a caller to getConnection() would immediately fail, even though the pool is below capacity. This is certainly not what most users would expect or want (including Slick users, I suspect).
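For reference, the knobs in that scenario map onto the plain HikariCP API roughly as follows; this is only a sketch with illustrative values, not a recommendation:

import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// Illustrative values: a pool of 10 connections where each connection is
// retired after 30 minutes (maxLifetime). While a retired connection is being
// replaced the pool is briefly below 10; connectionTimeout is how long a
// getConnection() caller waits for a free connection before failing, which is
// why a value of 0 would fail callers during that replacement window.
val hc = new HikariConfig()
hc.setJdbcUrl("jdbc:h2:mem:test")   // hypothetical URL
hc.setMaximumPoolSize(10)
hc.setMaxLifetime(30 * 60 * 1000L)  // 30 minutes
hc.setConnectionTimeout(1000L)      // 1 second
val ds = new HikariDataSource(hc)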
In a traditional, blocking application design, you spawn lots of threads and they get stuck in getConnection calls concurrently if you overload the pool. In this case the connectionTimeout serves as a timeout to fail individual calls. In Slick's semi-asynchronous design it has a very different effect. The only way to overload the pool is by keeping transactions open while waiting for asynchronous calls to return. This can lead to the observed deadlocks. The simplest way to avoid those would be to fail immediately when there are no free connections in the pool. You could also treat the case where you are waiting to replace an expired connection differently and only fail immediately when the pool is at its maximum capacity. OTOH, Slick could implement this feature directly and make it work with any pool implementation. The overhead would be one atomic int operation on each connection acquisition and release.
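A sketch of that atomic-counter idea as a pool-agnostic wrapper (none of these names exist in Slick; this just illustrates the proposal):

import java.sql.{Connection, SQLTransientConnectionException}
import java.util.concurrent.atomic.AtomicInteger
import javax.sql.DataSource

// Hypothetical wrapper: count connections in use and fail immediately when
// the configured maximum is reached, instead of blocking in the underlying
// pool's getConnection().
class FailFastPool(maxConnections: Int, ds: DataSource) {
  private val inUse = new AtomicInteger(0)

  def acquire(): Connection = {
    if (inUse.incrementAndGet() > maxConnections) {
      inUse.decrementAndGet()
      throw new SQLTransientConnectionException("pool exhausted, failing fast")
    }
    try ds.getConnection()
    catch { case t: Throwable => inUse.decrementAndGet(); throw t }
  }

  def release(c: Connection): Unit = {
    c.close() // returns the connection to the underlying pool
    inUse.decrementAndGet()
  }
}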
Another option for Slick is different prioritization of continuations. It's more work because we have to implement a custom queue (well, even more custom than the one we already have). Currently we use low-priority tasks for new actions and high-priority tasks for continuations. The difference is that high-priority tasks can use the entire task queue size (which is actually double what you configure) whereas low-priority tasks can only use the configured size. This ensures that continuations can always be enqueued, but it doesn't affect scheduling, which is always FIFO. We could use an even higher priority for continuations which hold a connection, and enqueue those in front of all other tasks. This should prevent the deadlocks in the first place.
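A sketch of the shape of such a queue (not Slick's actual implementation, just the idea): new actions are limited to the configured size, continuations may use the extra headroom, and connection-holding continuations go to the front:

import java.util.concurrent.LinkedBlockingDeque

sealed trait Priority
case object Fresh          extends Priority // new action, limited to maxSize
case object Continuation   extends Priority // may use the extra headroom
case object WithConnection extends Priority // continuation holding a connection

// Hypothetical queue sketch; the capacity is 2 * maxSize so that continuations
// can always be enqueued. (The size check is racy here; a real implementation
// would synchronize, but that is beside the point of the sketch.)
class PrioritizedTaskQueue[A](maxSize: Int) {
  private val deque = new LinkedBlockingDeque[A](2 * maxSize)

  def enqueue(task: A, prio: Priority): Boolean = prio match {
    case Fresh          => if (deque.size() < maxSize) deque.offerLast(task) else false
    case Continuation   => deque.offerLast(task)
    case WithConnection => deque.offerFirst(task) // runs before everything else
  }

  def take(): A = deque.takeFirst()
}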
On Wednesday, 25 November 2015 at 15:39:26 UTC+1, Stefan Zeiger wrote: "[...] The simplest way to avoid those would be to fail immediately when there are no free connections in the pool. [...] The overhead would be one atomic int operation on each connection acquisition and release."
I'm not sure I'm in favour of unrelated DBIO actions suddenly starting to fail because of other transactional operations waiting for their callbacks to finish. In my opinion, once a DBIO has been scheduled on the Slick queue it should get at least a chance to execute; failing it up front would be a violation of the principle of least astonishment.
An alternative design might be to use a second thread pool which only executes the continuations that are holding connections. Since there can never be more such continuations than threads in the main thread pool, you only need a small queue. And because a connection is already available, the continuation is always guaranteed to proceed.
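A sketch of that two-pool idea (only the shape of it, not Slick internals): one pool for new work and a second, equally sized pool that only ever runs continuations which already hold a connection, so they can never be starved by queued new actions:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical setup; numThreads matches the main database thread pool. At
// most numThreads actions can hold a connection at once, so the second pool
// and its queue can stay small.
val numThreads = 10
val mainPool         = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(numThreads))
val continuationPool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(numThreads))

// New actions go to the main pool, connection-holding continuations to the
// dedicated pool, where nothing else can queue up in front of them.
def runNewAction[A](body: => A): Future[A]          = Future(body)(mainPool)
def runHeldConnectionWork[A](body: => A): Future[A] = Future(body)(continuationPool)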
After thinking about this some more, the high-priority scheduling alone can almost avoid the deadlocks, but not quite. Let's take the following example with a thread pool size of 1 and a connection pool size of 1:

val tasks = 1 to 50 map { i =>
  val a = sql"select 1".as[Int].head.map { i => Thread.sleep(100); i } // sleep 100 ms
  dc.db.run(a.transactionally)
}
Await.result(Future.sequence(tasks), Duration(30, TimeUnit.SECONDS))
1) All 50 actions are started and put into the task queue (because each begins with a SynchronousDatabaseAction which has been fused from StartTransaction plus the query).
2) The first action runs, acquiring the only Connection from the pool and hanging on to it (because of the transaction).
3) Without the "map" call, the next action would be another SynchronousDatabaseAction, so we'd put it at the front of the queue, so the rest until the final Commit can run and release the connection. (Note that this doesn't occur in practice and the example already works fine without "map" because all the actions get fused.) But we do have a "map", so it gets passed to the global ExecutionContext for asynchronous execution.
4) While the "map" runs asynchronously, the database thread pool starts processing the next job. The asynchronous call finishes 100ms later and enqueues the continuation (which releases the connection) at the front, but it's too late. The next action is has already been started and will block on getConnection, so we have a deadlock.
Failing without blocking would resolve the deadlock without delay but it would drain the entire task queue and make all tasks except for the first one fail. That's not really what we want.
On the other hand, waiting for one second is not desirable, either, because after task #2 has failed and task #1 has finally completed, we just go back to step 1 and run into the same deadlock again with tasks #3 and #4. This cuts the wait time in half (25 seconds minimum, 75 with the current implementation) and still makes half of the tasks fail.
How about we count the open connections and suspend the database thread pool if no new connections are available and the next task in the queue is not of the highest priority (with an existing connection)? I think this would solve the problem completely. Or am I overlooking yet another corner case?
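A sketch of what that check could look like around the task queue (entirely hypothetical, not Slick code): count the connections in use and refuse to hand a lower-priority task to a worker thread while the pool is exhausted, so the thread stays free for the connection-holding continuation that will eventually release a Connection:

import java.util.concurrent.atomic.AtomicInteger

// Hypothetical scheduler fragment. connectionsInUse would be incremented when
// an action acquires a Connection and decremented when it releases one.
class SuspendingScheduler(maxConnections: Int) {
  val connectionsInUse = new AtomicInteger(0)

  /** May the next queued task be started right now? */
  def mayRun(taskHoldsConnection: Boolean): Boolean =
    taskHoldsConnection || connectionsInUse.get() < maxConnections
  // If this returns false, the worker thread does not dequeue the task and
  // effectively suspends until a connection-holding continuation has run and
  // freed its Connection, which is what prevents the deadlock above.
}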
I don't know if there is a similar concept in Slick, or one that could be introduced, but it might be one solution. For example, all tasks queued by the same map iteration running "transactionally" could be made to acquire the same Connection -- through some sort of ThreadLocal or session association. This might require starting the transaction somehow prior to the map iteration, such that each task would implicitly be pulled into the transaction and share a connection.
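In terms of Slick's existing public API, the closest equivalent is to compose the per-element actions into one DBIO and run that single action transactionally, so that everything executes on one pinned session and only one Connection is held for the whole batch. A minimal sketch, reusing the table and database from the very first example:

import slick.driver.H2Driver.api._ // or whichever profile `table` is defined with
import scala.concurrent.Future

// Instead of 20 independent transactional runs (each needing its own
// Connection), sequence the inserts into one action; .transactionally pins
// the session, so the whole batch shares a single Connection.
val inserts: Seq[DBIO[Int]] = (1 to 20).map(i => table += i)
val oneTransaction: Future[Seq[Int]] = database.run(DBIO.sequence(inserts).transactionally)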