Jira (PDB-4504) add-resource-events-pk migration does not use batched inserts


Charlie Sharpsteen (JIRA)

Sep 16, 2019, 2:47:03 PM
to puppe...@googlegroups.com
Charlie Sharpsteen created an issue
 
PuppetDB / Bug PDB-4504
add-resource-events-pk migration does not use batched inserts
Issue Type: Bug
Affects Versions: PDB 6.5.0, PDB 6.3.4
Assignee: Unassigned
Created: 2019/09/16 11:46 AM
Priority: Normal
Reporter: Charlie Sharpsteen

PuppetDB 6.3 added a primary key to the `resource_events` table in order to
facilitate functionality such as repacking and partitioning. In order to create
a unique key, the hashing algorithm for resource events was changed. This
resulted in a migration that re-writes the entire table in order to update the
hash and remove duplicates. The re-write portion does not use batched inserts
when sending transformed data back to the database, which results in a separate
`INSERT` statement being executed for each row in the `resource_events` table.

Reproduction Case

  • Install PuppetDB 5.2
  • Upgrade to PuppetDB 6.3 and wait for migration 69 to finish
  • Compare the number of calls to INSERT INTO resource_events_transform to
    the number of rows modified by that query and the number of rows in the
    resource_events table:

SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
SELECT count(*) FROM resource_events;

Outcome

pg_stat_statements shows that the INSERT is executed once per row in
the resource_events table:

pe-puppetdb=# SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
-[ RECORD 1 ]--
calls | 5698404
rows  | 5698404
 
pe-puppetdb=# SELECT count(*) FROM resource_events;
-[ RECORD 1 ]--
count | 5779611

Expected Outcome

The insertion of transformed rows is batched such that the number of calls
to INSERT is a few orders of magnitude smaller than the number of rows
in the resource_events table.
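
At an illustrative batch size of 1,000, for example, the table above would need roughly:

    5,698,404 rows / 1,000 rows per INSERT ≈ 5,699 calls

instead of one call per row.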

This message was sent by Atlassian JIRA (v7.7.1#77002-sha1:e75ca93)

Charlie Sharpsteen (JIRA)

Sep 16, 2019, 2:49:04 PM
to puppe...@googlegroups.com
Charlie Sharpsteen updated an issue
Change By: Charlie Sharpsteen
PuppetDB 6.3 added a primary key to the {{resource_events}} table in order to
facilitate functionality such as repacking and partitioning. In order to create
a unique key, the hashing algorithm for resource events was changed. This
resulted in a migration that re-writes the entire table in order to update the
hash and remove duplicates. The re-write portion does not use batched inserts
when sending transformed data back to the database, which results in a separate
{{INSERT}} statement being executed for each row in the {{resource_events}} table.


h2. Reproduction Case

  - Install PuppetDB 5.2

  - Populate Postgres with some test data and [enable the pg_stat_statements
    extension|https://www.postgresql.org/docs/9.6/pgstatstatements.html] on the
    PuppetDB database

  - Upgrade to PuppetDB 6.3 and wait for migration 69 to finish

  - Compare the number of calls to {{INSERT INTO resource_events_transform}} to
    the number of rows modified by that query and the number of rows in the
    {{resource_events}} table:

{code:sql}

SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
SELECT count(*) FROM resource_events;
{code}

h3. Outcome


{{pg_stat_statements}} shows that the {{INSERT}} is executed once per row in
the {{resource_events}} table:

{noformat}

pe-puppetdb=# SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
-[ RECORD 1 ]--
calls | 5698404
rows  | 5698404

pe-puppetdb=# SELECT count(*) FROM resource_events;
-[ RECORD 1 ]--
count | 5779611
{noformat}

h3. Expected Outcome


The insertion of transformed rows is batched such that the number of calls
to {{INSERT}} is a few orders of magnitude smaller than the number of rows
in the {{resource_events}} table.

Charlie Sharpsteen (JIRA)

Sep 16, 2019, 3:00:03 PM
to puppe...@googlegroups.com
Charlie Sharpsteen commented on Bug PDB-4504
 
Re: add-resource-events-pk migration does not use batched inserts

The loop resulting in this is here:

https://github.com/puppetlabs/puppetdb/blob/6.3.4/src/puppetlabs/puppetdb/scf/migrate.clj#L1543-L1558

There are a couple items that lead to the INSERT being executed once per row:

  • Reduce will iterate over each row in isolation, so there is only one row passed to jdbc/insert-multi!, instead of a batch of many rows.
  • We’re passing a seq of hashes to jdbc/insert-multi! and the docs for that function indicate that will cause a separate database operation per hash of data.

Item 1 might be addressed by using Clojure's partition function to chunk the list of rows into batches:

http://clojuredocs.org/clojure.core/partition

For item 2, the java.jdbc docs indicate that we should pass a sequence of lists of column values to jdbc/insert-multi!:

https://github.com/clojure/java.jdbc/blob/java.jdbc-0.7.7/src/main/clojure/clojure/java/jdbc.clj#L1574-L1577

Also, it sounds like getting JDBC to do batched inserts with Postgres requires the :reWriteBatchedInserts option to be set on the database connection:

https://github.com/clojure/java.jdbc/blob/java.jdbc-0.7.10/src/main/clojure/clojure/java/jdbc.clj#L1626-L1631
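
Putting those pieces together, a batched version of the loop might look roughly like the sketch below. The connection map, column names, and hash function are all illustrative placeholders, not the actual migration code; it assumes clojure.java.jdbc 0.7.x:

{code}
(require '[clojure.java.jdbc :as jdbc])

;; Illustrative connection map -- :reWriteBatchedInserts asks the Postgres
;; JDBC driver to rewrite a batch of INSERTs into multi-row statements.
(def db-spec {:dbtype "postgresql"
              :dbname "pe-puppetdb"
              :reWriteBatchedInserts true})

(defn compute-new-hash
  "Placeholder for the migration's new resource-event hashing scheme."
  [row]
  (str (hash (select-keys row [:certname :status :property]))))

(defn insert-transformed-batches!
  "Chunk `rows` and send each chunk as one batched insert."
  [rows batch-size]
  (doseq [batch (partition-all batch-size rows)]
    (jdbc/insert-multi! db-spec :resource_events_transform
                        ;; vector-of-columns plus seqs-of-values form; a seq
                        ;; of maps would fall back to one operation per row.
                        [:event_hash :certname :status :property]
                        (for [row batch
                              :let [row' (assoc row :event_hash
                                                (compute-new-hash row))]]
                          ((juxt :event_hash :certname :status :property) row')))))
{code}

Note the use of partition-all rather than partition, so a trailing chunk smaller than batch-size isn't silently dropped.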

Charlie Sharpsteen (JIRA)

Sep 16, 2019, 3:18:03 PM
to puppe...@googlegroups.com
Charlie Sharpsteen updated an issue
Change By: Charlie Sharpsteen
PuppetDB 6.3 added a primary key to the {{resource_events}} table in order to
facilitate functionality such as repacking and partitioning. In order to create
a unique key, the hashing algorithm for resource events was changed. This
resulted in a migration that re-writes the entire table in order to update the
hash and remove duplicates. The re-write portion does not use batched inserts
when sending transformed data back to the database, which results in a separate
{{INSERT}} statement being executed for each row in the {{resource_events}} table.


h2. Reproduction Case

  - Install PuppetDB 5.2

  - Populate Postgres with some test data and enable the pg_stat_statements
    extension on the PuppetDB database

  - Upgrade to PuppetDB 6.3 and wait for migration 69 to finish

  - Compare the number of calls to {{INSERT INTO resource_events_transform}} to
    the number of rows modified by that query and the number of rows in the
    {{resource_events}} table:

{code:sql}
SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
SELECT count(*) FROM resource_events;
{code}
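
For reference, enabling {{pg_stat_statements}} generally means adding it to {{shared_preload_libraries}} in {{postgresql.conf}}, restarting Postgres, and then creating the extension in the PuppetDB database (the reset step is optional but makes the migration's INSERT counts easier to read):

{code:sql}
-- In postgresql.conf (requires a Postgres restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- Then, connected to the PuppetDB database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optionally reset counters before the upgrade:
SELECT pg_stat_statements_reset();
{code}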

h3. Outcome

{{pg_stat_statements}} shows that the {{INSERT}} is executed once per row in
the {{resource_events}} table:

{noformat}
pe-puppetdb=# SELECT sum(calls) AS calls, sum(rows) AS rows FROM pg_stat_statements WHERE query ILIKE 'insert into resource_events_transform%';
-[ RECORD 1 ]--
calls | 5698404
rows  | 5698404

pe-puppetdb=# SELECT count(*) FROM resource_events;
-[ RECORD 1 ]--
count | 5779611
{noformat}

h3. Expected Outcome

The insertion of transformed rows is batched such that the number of calls
to {{INSERT}} is a few orders of magnitude smaller than the number of rows
in the {{resource_events}} table.



Nick Walker (JIRA)

Sep 16, 2019, 4:40:05 PM
to puppe...@googlegroups.com
Nick Walker commented on Bug PDB-4504
 
Re: add-resource-events-pk migration does not use batched inserts

Rob Browning any possibility of getting this fixed before the next PE release?


Austin Blatt (JIRA)

Sep 27, 2019, 5:02:04 PM
to puppe...@googlegroups.com
Austin Blatt updated an issue
Change By: Austin Blatt
CS Priority: Reviewed Needs Priority
Release Notes: Not Needed
