Stale Connection issues with Ecto

523 views
Skip to first unread message

Colin Bankier

unread,
Sep 28, 2015, 2:22:02 AM9/28/15
to elixir-lang-talk
Hi,
I'm using Ecto and Phoenix and seeing some strange stale connection issues.
The app runs periodic jobs that query the DB (using `quantum` and `exq` - but I don't believe they are causing the issue) as well a few REST endpoints.
It actually talks to Amazon Redshift rather than Postgres, but again afaik that is not the primary issue.

When the app is freshly deployed, it works fine, the first few occurrences of the job run fine.
After a time, when the job runs again, I get a

05:56:40.235 [error] GenServer #PID<0.582.0> terminating
** (exit) an exception was raised:
    ** (ErlangError) erlang error: {:timeout, {:gen_server, :call, [#PID<0.373.0>, {:query, "SELECT ...", [...]}, 1000000]}}
        (stdlib) gen_server.erl:212: :gen_server.call/3
        (postgrex) lib/postgrex/connection.ex:86: Postgrex.Connection.query/4
        (ecto) lib/ecto/adapters/postgres/connection.ex:31: Ecto.Adapters.Postgres.Connection.query/4
        (stdlib) timer.erl:197: :timer.tc/3
        (ecto) lib/ecto/adapters/sql.ex:246: Ecto.Adapters.SQL.query/7
        (ecto) lib/ecto/pool.ex:159: Ecto.Pool.do_run/4
        (ecto) lib/ecto/adapters/sql.ex:231: Ecto.Adapters.SQL.query/6
        (ecto) lib/ecto/adapters/sql.ex:209: Ecto.Adapters.SQL.query/5

As you can see, I have set a ridiculously high timeout just to ensure the DB is not legitimately taking that long. Normally these queries run in a few seconds tops. I see a similar thing from the REST endpoints - although a little more inconsistently.

Mostly these return very quickly:

01:33:53.445 request_id=ag26aqr5ohnjn60m0p30 [info] Sent 200 in 10ms

And then randomly times out:

01:42:06.602 request_id=ag266v3foboiaes24e80 [info] Sent 500 in 1000001ms

The app is running on EC2 inside docker - and my suspicion is that the TCP connection for a connection held by poolboy times out, and poolboy continues to try use a stale connection.

In order to debug this further, I have connected an iex session to the running app and run a query on that node.

If I manually checkout a connection from the pool and try run a trivial query, it times out. On the other hand, if I create a new connection, it works fine.

Use a connection from  the pool:

query = "SELECT count(*) from production.table_name;"

run_from_pool = fn ->
  IO.puts "starting query from pool"
  {pool_mod, pool, timeout} = Panoramix.Repo.__pool__
  {:ok, worker, {connection_mod, pid_from_pool}, time} = pool_mod.checkout(pool, timeout)
  result = Postgrex.Connection.query!(pid_from_pool, query, [])
  pool_mod.checkin(pool, worker, timeout)
  IO.puts "finished query from pool"
end
Node.spawn app, run_from_pool
>> starting query from pool
>> #PID<16517.2024.0>

And nothing more is printed.

If I run the same query with a new connection, it works fine:

run_new_connection = fn ->
  IO.puts "starting query with new connection"
  {:ok, pid} = Postgrex.Connection.start_link(hostname: "***", username: "***", password: "***", database: "***", port: 5439)
  Postgrex.Connection.query!(pid, query, [])
  IO.puts "finished query with new connection"
end
Node.spawn app, run_new_connection
>> #PID<16517.2026.0>
>> starting query with new connection
>> finished query with new connection

I am using the latest phoenix (1.0.2), ecto (1.0.3) and postgrex (0.9.1). We have a number of Rails apps that run in the same network, also in docker, without issue.

Any advice on how to resolve or work around this issue greatly appreciate :)

Regards,
Colin.

José Valim

unread,
Sep 28, 2015, 3:46:26 AM9/28/15
to elixir-l...@googlegroups.com
I will talk to Eric. Maybe the connections are getting stale and we need to make sure we are checking those cases.



José Valim
Skype: jv.ptec
Founder and Director of R&D

--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-talk/03e3243d-b743-4be0-8ccc-0221171a1572%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Colin Bankier

unread,
Sep 28, 2015, 5:24:09 PM9/28/15
to elixir-l...@googlegroups.com
Great, thanks Jose!

--
You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-talk" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-talk/CPehTFwxetc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-ta...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-talk/CAGnRm4%2Bi%2BNO4cqeBjUzyCOCoh%2Bk89Zmou%2B79T6Cr_GsRQUSJgQ%40mail.gmail.com.

Eric Meadows-Jönsson

unread,
Sep 29, 2015, 9:12:46 AM9/29/15
to elixir-l...@googlegroups.com
I don't know what's going on but will try to investigate. You can try adding `keep_alive: true` to Postgrex's `:socket_options` config.

Unrelated to Postgrex but it's also a good time to revisit your configured timeouts, 1000 seconds (17 minutes) is a too long timeout for Postgrex.


For more options, visit https://groups.google.com/d/optout.



--
Eric Meadows-Jönsson

Eric Meadows-Jönsson

unread,
Sep 29, 2015, 9:14:26 AM9/29/15
to elixir-l...@googlegroups.com
It is worrying that Postgrex is seemingly not able to detect dead connections. I will try to reproduce this scenario.
--
Eric Meadows-Jönsson

Hadi Saloko

unread,
Sep 29, 2015, 9:03:14 PM9/29/15
to elixir-lang-talk
it happen to me too.. same stack trace

last week, I set timeout configuration to a very large integer (like Colin did). there are few errors logged but my server could hang in anytime
this week, I set to 1000 - 3000. I got so many errors logged and it took 30 minutes to few hours for my server to hang :(

Colin Bankier

unread,
Sep 30, 2015, 8:26:37 AM9/30/15
to elixir-l...@googlegroups.com
Thanks Eric, I tried adding

           socket_options: [keepalive: true]

to the config for my repo - as far as I could tell this is passed to postgrex and gen_tcp correctly. It didn't help however.
I also decreased the timeout to ensure that wasn't making things worse, it was really only that big to prove to myself that it was a connection issue and not a database issue.

I'll look into something temporary I can do to ensure a fresh connection is used everytime (any tips appreciated :) ) - and will be keen to hear if you find anything your end. I'm happy to hear from Hadi that I'm not the only one seeing this sort of thing.

Reply all
Reply to author
Forward
0 new messages