Jira (PDB-4760) Broadcast commands transiently fail with "Connection is closed"

0 views
Skip to first unread message

Austin Blatt (Jira)

unread,
May 29, 2020, 6:40:04 PM5/29/20
to puppe...@googlegroups.com
Austin Blatt created an issue
 
PuppetDB / Bug PDB-4760
Broadcast commands transiently fail with "Connection is closed"
Issue Type: Bug Bug
Affects Versions: PDB 6.11.0
Assignee: Unassigned
Created: 2020/05/29 3:39 PM
Fix Versions: PDB 6.11.1, PDB 6.12.0
Priority: Normal Normal
Reporter: Austin Blatt

When running PuppetDB with two write databases configured, and running benchmark to simlate a large number of submissions, there will be occassional errors with "replace catalogs" and "replace facts" commands that fail and log errors like the following.

Retrying after attempt 0 for host-2, due to: clojure.lang.ExceptionInfo: [3] [replace catalog] Unable to broadcast command for host-2 {:kind :puppetlabs.puppetdb.command/retry} [Ljava.lang.Throwable;@1186448b
clojure.lang.ExceptionInfo: [3] [replace catalog] Unable to broadcast command for host-2
        at puppetlabs.puppetdb.command$broadcast_cmd.invokeStatic(command.clj:606)
        at puppetlabs.puppetdb.command$broadcast_cmd.invoke(command.clj:555)
        at puppetlabs.puppetdb.command$process_cmd$fn__39637$fn__39638.invoke(command.clj:631)
        at puppetlabs.puppetdb.command$process_cmd$fn__39637.invoke(command.clj:629)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760$fn__37761.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
        at com.codahale.metrics.Timer.time(Timer.java:101)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760$fn__37761.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
        at com.codahale.metrics.Timer.time(Timer.java:101)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
        at puppetlabs.puppetdb.command$process_cmd.invokeStatic(command.clj:626)
        at puppetlabs.puppetdb.command$process_cmd.invoke(command.clj:612)
        at puppetlabs.puppetdb.command$process_message.invokeStatic(command.clj:720)
        at puppetlabs.puppetdb.command$process_message.invoke(command.clj:677)
        at puppetlabs.puppetdb.command$message_handler$fn__39667.invoke(command.clj:752)
        at puppetlabs.puppetdb.threadpool$dochan$fn__39260$fn__39261.invoke(threadpool.clj:110)
        at puppetlabs.puppetdb.threadpool$gated_execute$fn__39224.invoke(threadpool.clj:64)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
        Suppressed: java.sql.SQLException: Connection is closed
                at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:490)
                at com.sun.proxy.$Proxy11.getAutoCommit(Unknown Source)
                at com.zaxxer.hikari.pool.HikariProxyConnection.getAutoCommit(HikariProxyConnection.java)
                at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:761)
                at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:741)
                at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn$fn__28084.invoke(jdbc.clj:528)
                at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063$fn__28064$fn__28065.invoke(jdbc.clj:500)
                at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063$fn__28064.invoke(jdbc.clj:499)
                at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063.invoke(jdbc.clj:490)
                at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn.invokeStatic(jdbc.clj:526)
                at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:515)
                at puppetlabs.puppetdb.command$exec_replace_catalog.invokeStatic(command.clj:295)
                at puppetlabs.puppetdb.command$exec_replace_catalog.invoke(command.clj:292)
                at puppetlabs.puppetdb.command$exec_command.invokeStatic(command.clj:444)
                at puppetlabs.puppetdb.command$exec_command.invoke(command.clj:438)
                at puppetlabs.puppetdb.command$attempt_exec_command$fn__39565.invoke(command.clj:513)
                at puppetlabs.puppetdb.command$call_with_quick_retry$fn__39558.invoke(command.clj:495)
                at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:494)
                at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:492)
                at puppetlabs.puppetdb.command$attempt_exec_command.invokeStatic(command.clj:512)
                at puppetlabs.puppetdb.command$attempt_exec_command.invoke(command.clj:509)
                at puppetlabs.puppetdb.command$broadcast_cmd$attempt_exec__39600$fn__39603.invoke(command.clj:577)
                at puppetlabs.puppetdb.command$broadcast_cmd$attempt_exec__39600.invoke(command.clj:576)
                at puppetlabs.puppetdb.command$broadcast_cmd$fn__39623$fn__39624$fn__39625.invoke(command.clj:590)
                ... 4 common frames omitted

When turning on debug logging, I could see that the first attempt of each command would fail with a message like

2020-05-29 14:12:31,273 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 1
org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
  Where: SQL statement "SELECT 1 FROM ONLY "public"."certnames" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

and the subsequent retries would fail with

2020-05-29 14:12:31,290 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 2
java.sql.SQLException: Connection is closed
        at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:515)

until it finally threw the error in the first code block.

This is because exec-replace-catalogs and exec-replace-facts end up nesting with-connection macros

(jdbc/with-monitored-db-connection db status
  <catch and attempt retry>
  (jdbc/with-transacted-connection
    <attempt storage>))

When the storage attempt fails, with-transacted-connection marks the connection as closed, but it is still bound by the wrapping with-monitored-db-connection, so it attempts to use a connection that is marked as closed for its retries.

There is a similar issue in the macro retry-with-monitored-connection

(retry-sql 5
           (with-monitored-db-connection db-spec status
             (with-db-transaction [:isolation isolation]
               (f))))

will get a new connection for each retry. This is used by "store report", which is why we don't see the transient ConnectionClosed errors with that command.

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)
Atlassian logo

Austin Blatt (Jira)

unread,
May 29, 2020, 6:42:03 PM5/29/20
to puppe...@googlegroups.com
Austin Blatt updated an issue
Change By: Austin Blatt
When running PuppetDB with two write databases configured, and running benchmark to simlate a large number of submissions, there will be occassional errors with "replace catalogs" and "replace facts" commands that fail and log errors like the following.
{code}
{code}


When turning on debug logging, I could see that the first attempt of each command would fail with a message like
{code}

2020-05-29 14:12:31,273 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 1
org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
  Where: SQL statement "SELECT 1 FROM ONLY "public"."certnames" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
{code}

and the subsequent retries would fail with
{code}

2020-05-29 14:12:31,290 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 2
java.sql.SQLException: Connection is closed
        at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:515)
{code}

until it finally threw the error in the first code block.

This is because exec-replace-catalogs and exec-replace-facts end up nesting {{with-connection}} macros
{code}

(jdbc/with-monitored-db-connection db status
  <catch and attempt retry>
  (jdbc/with-transacted-connection
    <attempt storage>))
{code}
When the
initial storage attempt fails, {{with-transacted-connection}} marks the connection as closed, but it is still bound by the wrapping {{with-monitored-db-connection}}, so it attempts to use a the same connection , that is now marked as closed , for its retries.


There is a similar issue in the macro {{retry-with-monitored-connection}}
{code}

(retry-sql 5
           (with-monitored-db-connection db-spec status
             (with-db-transaction [:isolation isolation]
               (f))))
{code}

will get a new connection for each retry. This is used by "store report", which is why we don't see the transient ConnectionClosed errors with that command.

Austin Blatt (Jira)

unread,
May 29, 2020, 6:57:03 PM5/29/20
to puppe...@googlegroups.com
Austin Blatt commented on Bug PDB-4760
 
Re: Broadcast commands transiently fail with "Connection is closed"

It looks like we also have the improper connection handling in the retry code of with-transacted-connection-fn

Rob Browning (Jira)

unread,
Jun 2, 2020, 12:58:03 PM6/2/20
to puppe...@googlegroups.com

Rob Browning (Jira)

unread,
Jun 12, 2020, 12:27:03 PM6/12/20
to puppe...@googlegroups.com
Rob Browning commented on Bug PDB-4760
 
Re: Broadcast commands transiently fail with "Connection is closed"

After some investigation, it looks like the underlying cause was the fact that in the initial, somewhat hurried version of the broadcast code, I added a db connection "with" wrapper around the call-with-quick-retry loop, and that loop suppresses exceptions, so that wen there was (for example) a serialization failure, instead of getting a new connection for the next attempt, the (new) top-level wrapper would cause us to continue to use the existing connection, the one we were using when we hit the error. That's because the with connection forms nest, and only the topmost "with connection" wrapper will actually close a connection.

We suspected that continuing to use the connection after the failure might be what was causing the closed connection exceptions. To test that, I added some code to exec-replace-facts to force all calls to block just before the storage work, and just after, along with some flushed console output to document the ordering, all within the exec-replace-facts transaction. Then I annotated call-with-quick-retry to refresh the connection's information via (.isValid connection 0) and print (.isClosed connection) whenever an exception is thrown by an attempt (.isClosed may not be accurate without the .isValid refresh). Finally, I added two identical facts to the queue (to ensure competing deliveries) and started PuppetDB with broadcast enabled.

Upon startup the output indicated that PuppetDB attempted concurrent delivery of the duplicate facts, provoking a "could not serialize access due to concurrent update" exception, followed by diagnostic output indicating that the connection was closed. After that the "closed connection" exception was thrown, supporting the initial hypothesis.

Austin Blatt (Jira)

unread,
Jun 15, 2020, 2:57:02 PM6/15/20
to puppe...@googlegroups.com

Rob Browning (Jira)

unread,
Jul 10, 2020, 1:56:04 PM7/10/20
to puppe...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages