Jira (PDB-4504) add-resource-events-pk migration does not use batched inserts


Charlie Sharpsteen (JIRA)

Sep 16, 2019, 2:47:03 PM
to puppe...@googlegroups.com
Charlie Sharpsteen created an issue
 
PuppetDB / Bug PDB-4504
add-resource-events-pk migration does not use batched inserts
Issue Type: Bug
Affects Versions: PDB 6.5.0, PDB 6.3.4
Assignee: Unassigned
Created: 2019/09/16 11:46 AM
Priority: Normal
Reporter: Charlie Sharpsteen

PuppetDB 6.3 added a primary key to the `resource_events` table in order to
facilitate functionality such as repacking and partitioning. In order to create
a unique key, the hashing algorithm for resource events was changed. This
resulted in a migration that re-writes the entire table in order to update the
hash and remove duplicates. The re-write portion does not use batched inserts
when sending transformed data back to the database, which results in a separate
`INSERT` statement being executed for each row in the `resource_events` table.

Reproduction Case

  • Install PuppetDB 5.2
  • Upgrade to PuppetDB 6.3 and wait for migration 69 to finish
  • Compare the number of calls to INSERT INTO resource_events_transform to
    the number of rows modified by that query and the number of rows in the
    resource_events table:

SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
SELECT count(*) FROM resource_events;

Outcome

pg_stat_statements shows that the INSERT is executed once per row in
the resource_events table:

pe-puppetdb=# SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
-[ RECORD 1 ]--
calls | 5698404
rows  | 5698404
 
pe-puppetdb=# SELECT count(*) FROM resource_events;
-[ RECORD 1 ]--
count | 5779611

Expected Outcome

The insertion of transformed rows is batched such that the number of calls
to INSERT is a few orders of magnitude smaller than the number of rows
in the resource_events table.
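
At an illustrative batch size of 1,000, for example, the table above would need roughly:

    5,698,404 rows / 1,000 rows per INSERT ≈ 5,699 calls

instead of one call per row.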

This message was sent by Atlassian JIRA (v7.7.1#77002-sha1:e75ca93)

Charlie Sharpsteen (JIRA)

Sep 16, 2019, 2:49:04 PM
to puppe...@googlegroups.com
Charlie Sharpsteen updated an issue
Change By: Charlie Sharpsteen
PuppetDB 6.3 added a primary key to the {{resource_events}} table in order to
facilitate functionality such as repacking and partitioning. In order to create
a unique key, the hashing algorithm for resource events was changed. This
resulted in a migration that re-writes the entire table in order to update the
hash and remove duplicates. The re-write portion does not use batched inserts
when sending transformed data back to the database, which results in a separate
{{INSERT}} statement being executed for each row in the {{resource_events}} table.


h2. Reproduction Case

  - Install PuppetDB 5.2

  - Populate Postgres with some test data and [enable the pg_stat_statements
    extension|https://www.postgresql.org/docs/9.6/pgstatstatements.html] on the
    PuppetDB database

  - Upgrade to PuppetDB 6.3 and wait for migration 69 to finish

  - Compare the number of calls to {{INSERT INTO resource_events_transform}} to
    the number of rows modified by that query and the number of rows in the
    {{resource_events}} table:

{code:sql}

SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
SELECT count(*) FROM resource_events;
{code}

h3. Outcome


{{pg_stat_statements}} shows that the {{INSERT}} is executed once per row in
the {{resource_events}} table:

{noformat}

pe-puppetdb=# SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
-[ RECORD 1 ]--
calls | 5698404
rows  | 5698404

pe-puppetdb=# SELECT count(*) FROM resource_events;
-[ RECORD 1 ]--
count | 5779611
{noformat}

h3. Expected Outcome


The insertion of transformed rows is batched such that the number of calls
to {{INSERT}} is a few orders of magnitude smaller than the number of rows
in the {{resource_events}} table.

Charlie Sharpsteen (JIRA)

Sep 16, 2019, 3:00:03 PM
to puppe...@googlegroups.com
Charlie Sharpsteen commented on Bug PDB-4504
 
Re: add-resource-events-pk migration does not use batched inserts

The loop resulting in this is here:

https://github.com/puppetlabs/puppetdb/blob/6.3.4/src/puppetlabs/puppetdb/scf/migrate.clj#L1543-L1558

There are a couple items that lead to the INSERT being executed once per row:

  • Reduce will iterate over each row in isolation, so there is only one row passed to jdbc/insert-multi!, instead of a batch of many rows.
  • We’re passing a seq of hashes to jdbc/insert-multi! and the docs for that function indicate that will cause a separate database operation per hash of data.

Item 1 might be addressed by using Clojure's partition function to chunk the list of rows into batches:

http://clojuredocs.org/clojure.core/partition

For item 2, the java.jdbc docs indicate that we should pass a sequence of lists of column values to jdbc/insert-multi!:

https://github.com/clojure/java.jdbc/blob/java.jdbc-0.7.7/src/main/clojure/clojure/java/jdbc.clj#L1574-L1577

Also, it sounds like getting JDBC to do batched inserts with Postgres requires the :reWriteBatchedInserts option to be set on the database connection:

https://github.com/clojure/java.jdbc/blob/java.jdbc-0.7.10/src/main/clojure/clojure/java/jdbc.clj#L1626-L1631
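
Putting those pieces together, a batched version of the loop might look roughly like the sketch below. The connection map, column names, and hash function are all illustrative placeholders, not the actual migration code; it assumes clojure.java.jdbc 0.7.x:

{code}
(require '[clojure.java.jdbc :as jdbc])

;; Illustrative connection map -- :reWriteBatchedInserts asks the Postgres
;; JDBC driver to rewrite a batch of INSERTs into multi-row statements.
(def db-spec {:dbtype "postgresql"
              :dbname "pe-puppetdb"
              :reWriteBatchedInserts true})

(defn compute-new-hash
  "Placeholder for the migration's new resource-event hashing scheme."
  [row]
  (str (hash (select-keys row [:certname :status :property]))))

(defn insert-transformed-batches!
  "Chunk `rows` and send each chunk as one batched insert."
  [rows batch-size]
  (doseq [batch (partition-all batch-size rows)]
    (jdbc/insert-multi! db-spec :resource_events_transform
                        ;; vector-of-columns plus seqs-of-values form; a seq
                        ;; of maps would fall back to one operation per row.
                        [:event_hash :certname :status :property]
                        (for [row batch
                              :let [row' (assoc row :event_hash
                                                (compute-new-hash row))]]
                          ((juxt :event_hash :certname :status :property) row')))))
{code}

Note the use of partition-all rather than partition, so a trailing chunk smaller than batch-size isn't silently dropped.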

Charlie Sharpsteen (JIRA)

Sep 16, 2019, 3:18:03 PM
to puppe...@googlegroups.com
Charlie Sharpsteen updated an issue
Change By: Charlie Sharpsteen
PuppetDB 6.3 added a primary key to the {{resource_events}} table in order to
facilitate functionality such as repacking and partitioning. In order to create
a unique key, the hashing algorithm for resource events was changed. This
resulted in a migration that re-writes the entire table in order to update the
hash and remove duplicates. The re-write portion does not use batched inserts
when sending transformed data back to the database, which results in a separate
{{INSERT}} statement being executed for each row in the {{resource_events}} table.


h2. Reproduction Case

  - Install PuppetDB 5.2

  - Populate Postgres with some test data and enable the pg_stat_statements
    extension on the PuppetDB database

  - Upgrade to PuppetDB 6.3 and wait for migration 69 to finish

  - Compare the number of calls to {{INSERT INTO resource_events_transform}} to
    the number of rows modified by that query and the number of rows in the
    {{resource_events}} table:

{code:sql}
SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
SELECT count(*) FROM resource_events;
{code}
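
For reference, enabling {{pg_stat_statements}} generally means adding it to {{shared_preload_libraries}} in {{postgresql.conf}}, restarting Postgres, and then creating the extension in the PuppetDB database (the reset step is optional but makes the migration's INSERT counts easier to read):

{code:sql}
-- In postgresql.conf (requires a Postgres restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, connected to the PuppetDB database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optionally reset counters before the upgrade:
SELECT pg_stat_statements_reset();
{code}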

h3. Outcome

{{pg_stat_statements}} shows that the {{INSERT}} is executed once per row in
the {{resource_events}} table:

{noformat}
pe-puppetdb=# SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
-[ RECORD 1 ]--
calls | 5698404
rows  | 5698404

pe-puppetdb=# SELECT count(*) FROM resource_events;
-[ RECORD 1 ]--
count | 5779611
{noformat}

h3. Expected Outcome

The insertion of transformed rows is batched such that the number of calls
to {{INSERT}} is a few orders of magnitude smaller than the number of rows
in the {{resource_events}} table.



Nick Walker (JIRA)

Sep 16, 2019, 4:40:05 PM
to puppe...@googlegroups.com
Nick Walker commented on Bug PDB-4504
 
Re: add-resource-events-pk migration does not use batched inserts

Rob Browning any possibility of getting this fixed before the next PE release?


Austin Blatt (JIRA)

Sep 27, 2019, 5:02:04 PM
to puppe...@googlegroups.com
Austin Blatt updated an issue
Change By: Austin Blatt
CS Priority: Reviewed Needs Priority
Release Notes: Not Needed
