Hi,
I'm using Ecto and Phoenix and seeing some strange stale connection issues.
The app runs periodic jobs that query the DB (using `quantum` and `exq` - but I don't believe they are causing the issue) as well a few REST endpoints.
It actually talks to Amazon Redshift rather than Postgres, but again afaik that is not the primary issue.
When the app is freshly deployed, it works fine, the first few occurrences of the job run fine.
After a time, when the job runs again, I get a
05:56:40.235 [error] GenServer #PID<0.582.0> terminating
** (exit) an exception was raised:
** (ErlangError) erlang error: {:timeout, {:gen_server, :call, [#PID<0.373.0>, {:query, "SELECT ...", [...]}, 1000000]}}
(stdlib) gen_server.erl:212: :gen_server.call/3
(postgrex) lib/postgrex/connection.ex:86: Postgrex.Connection.query/4
(ecto) lib/ecto/adapters/postgres/connection.ex:31: Ecto.Adapters.Postgres.Connection.query/4
(ecto) lib/ecto/adapters/sql.ex:246: Ecto.Adapters.SQL.query/7
(ecto) lib/ecto/pool.ex:159: Ecto.Pool.do_run/4
(ecto) lib/ecto/adapters/sql.ex:231: Ecto.Adapters.SQL.query/6
(ecto) lib/ecto/adapters/sql.ex:209: Ecto.Adapters.SQL.query/5
As you can see, I have set a ridiculously high timeout just to ensure the DB is not legitimately taking that long. Normally these queries run in a few seconds tops. I see a similar thing from the REST endpoints - although a little more inconsistently.
Mostly these return very quickly:
01:33:53.445 request_id=ag26aqr5ohnjn60m0p30 [info] Sent 200 in 10ms
And then randomly times out:
01:42:06.602 request_id=ag266v3foboiaes24e80 [info] Sent 500 in 1000001ms
The app is running on EC2 inside docker - and my suspicion is that the TCP connection for a connection held by poolboy times out, and poolboy continues to try use a stale connection.
In order to debug this further, I have connected an iex session to the running app and run a query on that node.
If I manually checkout a connection from the pool and try run a trivial query, it times out. On the other hand, if I create a new connection, it works fine.
Use a connection from the pool:
query = "SELECT count(*) from production.table_name;"
run_from_pool = fn ->
IO.puts "starting query from pool"
{pool_mod, pool, timeout} = Panoramix.Repo.__pool__
{:ok, worker, {connection_mod, pid_from_pool}, time} = pool_mod.checkout(pool, timeout)
result = Postgrex.Connection.query!(pid_from_pool, query, [])
pool_mod.checkin(pool, worker, timeout)
IO.puts "finished query from pool"
end
Node.spawn app, run_from_pool
>> starting query from pool
>> #PID<16517.2024.0>
And nothing more is printed.
If I run the same query with a new connection, it works fine:
run_new_connection = fn ->
IO.puts "starting query with new connection"
{:ok, pid} = Postgrex.Connection.start_link(hostname: "***", username: "***", password: "***", database: "***", port: 5439)
Postgrex.Connection.query!(pid, query, [])
IO.puts "finished query with new connection"
end
Node.spawn app, run_new_connection
>> #PID<16517.2026.0>
>> starting query with new connection
>> finished query with new connection
I am using the latest phoenix (1.0.2), ecto (1.0.3) and postgrex (0.9.1). We have a number of Rails apps that run in the same network, also in docker, without issue.
Any advice on how to resolve or work around this issue greatly appreciate :)
Regards,
Colin.