Jira (PDB-3715) PuppetDB doesn't complete shutdown after migration fails

44 weergaven
Naar het eerste ongelezen bericht

Charlie Sharpsteen (JIRA)

ongelezen,
16 okt 2017, 16:43:0316-10-2017
aan puppe...@googlegroups.com
Charlie Sharpsteen created an issue
 
PuppetDB / Bug PDB-3715
PuppetDB doesn't complete shutdown after migration fails
Issue Type: Bug Bug
Affects Versions: PDB 5.1.1
Assignee: Unassigned
Created: 2017/10/16 1:42 PM
Priority: Major Major
Reporter: Charlie Sharpsteen

The PuppetDB service doesn't shut down after a failed migration, despite log messages indicating that the shutdown cycle has been started. This causes the init service to wait for the full START_TIMEOUT before declaring startup as failed and killing the process. Given we've bumped START_TIMEOUT to 4 hours to allow for migration of large databases, this significantly delays detection of failed upgrades due to migration issues.

    1. Reproduction Case

Install PuppetDB 5.0.1 on a CentOS 7 VM using the puppetlabs/puppetdb module:

  • Add package repos for version 5 of the Puppet Platform and install the puppet agent:

rpm -Uvh http://yum.puppetlabs.com/puppet5/puppet-release-el-7.noarch.rpm
yum install -y puppet agent

  • Instal the PuppetDB module

/opt/puppetlabs/bin/puppet module install puppetlabs-puppetdb --version 6.0.1

  • Apply the following manifest to install Postgres and PuppetDB. Two applications may be necessary due to a race condition where Puppet resources attempt to enforce state on the database while Postgres is still starting up.

/opt/puppetlabs/puppet/bin/puppet apply << EOF
class {'puppetdb::globals': version => '5.0.1'}
class {'puppetdb': disable_ssl => true}
EOF

Add some test data to PuppetDB that will trigger the migration failure described in PDB-3692:

/opt/puppetlabs/puppet/bin/curl -X POST \
  -H 'Content-Type:application/json' \
  -H 'Accept:application/json' \
  -d '{"certname":"test1","environment":"DEV","values":{"myfact": null},"producer_timestamp":"2015-01-01", "producer":"master1"}' \
  "http://localhost:8080/pdb/cmd/v1?command=replace_facts&version=5&certname=test1"
/opt/puppetlabs/puppet/bin/curl -X POST \
  -H 'Content-Type:application/json' \
  -H 'Accept:application/json' \
  -d '{"certname":"test2","environment":"DEV","values":{"myfact": null},"producer_timestamp":"2015-01-01", "producer":"master1"}' \
  "http://localhost:8080/pdb/cmd/v1?command=replace_facts&version=5&certname=test2"
 
# Alter the facts table to trigger PDB-3692 on upgrade
sudo -u postgres /usr/bin/psql -d puppetdb -c "UPDATE facts SET value='null' WHERE factset_id=2 AND value_type_id=4"

Decrease the timeout for PuppetDB startup and then upgrade PuppetDB to 5.1.1:

echo 'START_TIMEOUT=60' >> /etc/sysconfig/puppetdb
 
/opt/puppetlabs/bin/puppet resource --evaltrace package puppetdb ensure=5.1.1

      1. Outcome

/var/log/puppetlabs/puppetdb/puppetdb.log shows the migration fails and the service enters shutdown within seconds after starting:

2017-10-16 20:03:52,350 INFO  [p.p.c.services] PuppetDB version 5.1.1
2017-10-16 20:03:52,351 INFO  [c.z.h.HikariDataSource] PDBMigrationsPool - Starting...
2017-10-16 20:03:52,353 INFO  [c.z.h.HikariDataSource] PDBMigrationsPool - Start completed.
2017-10-16 20:03:52,517 INFO  [p.p.s.migrate] Applying database migration version 64
2017-10-16 20:03:52,518 INFO  [p.p.s.migrate] [1/8] Cleaning up unreferenced facts...
2017-10-16 20:03:52,527 INFO  [p.p.s.migrate] [2/8] Creating new fact storage tables...
2017-10-16 20:03:52,543 INFO  [p.p.s.migrate] [3/8] Copying unique fact values into fact_values
2017-10-16 20:03:52,555 INFO  [p.p.s.migrate] [4/8] Reconstructing facts to refer to fact_values...
2017-10-16 20:03:52,557 INFO  [p.p.s.migrate] [5/8] Cleaning up duplicate null values...
2017-10-16 20:03:52,558 INFO  [p.p.s.migrate]   ... none found
2017-10-16 20:03:52,559 INFO  [p.p.s.migrate] [6/8] Computing fact value hashes...
2017-10-16 20:03:52,611 INFO  [p.p.s.migrate] [7/8] Indexing fact_values table...
2017-10-16 20:03:52,640 ERROR [p.p.s.migrate] Caught SQLException during migration
java.sql.BatchUpdateException: Batch entry 3 ALTER TABLE fact_values ADD CONSTRAINT fact_values_value_hash_key UNIQUE (value_hash) was aborted.  Call getNextException to see the cause.
  ...
2017-10-16 20:03:52,641 ERROR [p.p.s.migrate] Unravelled exception
org.postgresql.util.PSQLException: ERROR: could not create unique index "fact_values_value_hash_key"
  Detail: Key (value_hash)=(\x2be88ca4242c76e8253ac62474851065032d6833) is duplicated.
  ...
2017-10-16 20:03:52,668 INFO  [p.t.internal] Shutting down due to JVM shutdown hook.
2017-10-16 20:03:52,669 INFO  [p.t.internal] Beginning shutdown sequence

However, the upgrade takes over a full minute to complete, the init script hangs until START_TIMEOUT kicks in:

# /opt/puppetlabs/bin/puppet resource --evaltrace --verbose package puppetdb ensure=5.1.1
Info: /Package[puppetdb]: Starting to evaluate the resource
Notice: /Package[puppetdb]/ensure: ensure changed '5.0.1-1.el7' to '5.1.1'
Info: /Package[puppetdb]: Evaluated in 132.98 seconds
package { 'puppetdb':
  ensure => '5.1.1-1.el7',
}
 
# journalctl -u puppetdb
...
Oct 16 19:51:10 r1lt6c9aq1o5t4r systemd[1]: Starting puppetdb Service...
Oct 16 19:51:29 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] classname setting has been retired and will be ignored.
Oct 16 19:51:29 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] subprotocol setting has been retired and will be ignored.
Oct 16 19:51:33 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] classname setting has been retired and will be ignored.
Oct 16 19:51:33 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] subprotocol setting has been retired and will be ignored.
Oct 16 19:52:10 r1lt6c9aq1o5t4r puppetdb[13361]: Startup timed out after 60 seconds
Oct 16 19:52:10 r1lt6c9aq1o5t4r puppetdb[13361]: Startup script was terminated before completion
Oct 16 19:53:11 r1lt6c9aq1o5t4r puppetdb[13361]: Process 13368 not terminated gracefully after 60 seconds
Oct 16 19:53:12 r1lt6c9aq1o5t4r puppetdb[13361]: /opt/puppetlabs/server/apps/puppetdb/ezbake-functions.sh: line 79: 13368 Killed                  ${JAVA_BIN} ${JAVA_ARGS} -Djava.sec
Oct 16 19:53:13 r1lt6c9aq1o5t4r puppetdb[13361]: Process 13368 killed after SIGKILL
Oct 16 19:53:13 r1lt6c9aq1o5t4r systemd[1]: puppetdb.service: control process exited, code=exited status=1
Oct 16 19:53:13 r1lt6c9aq1o5t4r systemd[1]: Failed to start puppetdb Service.

      1. Expected Outcome

PuppetDB startup should fail-fast after migration fails and not have to be killed by the init script due START_TIMEOUT.

Add Comment Add Comment
 
This message was sent by Atlassian JIRA (v6.4.14#64029-sha1:ae256fe)
Atlassian logo

Charlie Sharpsteen (JIRA)

ongelezen,
16 okt 2017, 16:48:0416-10-2017
aan puppe...@googlegroups.com
Charlie Sharpsteen updated an issue
Change By: Charlie Sharpsteen
The PuppetDB service doesn't shut down after a failed migration, despite log messages indicating that the shutdown cycle has been started. This causes the init service to wait for the full START_TIMEOUT before declaring startup as failed and killing the process. Given we've bumped START_TIMEOUT to 4 hours to allow for migration of large databases, this significantly delays detection of failed upgrades due to migration issues.

## h2.  Reproduction Case


Install PuppetDB 5.0.1 on a CentOS 7 VM using the puppetlabs/puppetdb module:

  - Add package repos for version 5 of the Puppet Platform and install the puppet agent:

{code:bash}
{code}

  - Instal the PuppetDB module

{code:bash}

/opt/puppetlabs/bin/puppet module install puppetlabs-puppetdb --version 6.0.1
{code}

  - Apply the following manifest to install Postgres and PuppetDB. Two applications may be necessary due to a race condition where Puppet resources attempt to enforce state on the database while Postgres is still starting up.

{code:bash}

/opt/puppetlabs/puppet/bin/puppet apply << EOF
class {'puppetdb::globals': version => '5.0.1'}
class {'puppetdb': disable_ssl => true}
EOF
{code}


Add some test data to PuppetDB that will trigger the migration failure described in PDB-3692:

{code:bash}

/opt/puppetlabs/puppet/bin/curl -X POST \
  -H 'Content-Type:application/json' \
  -H 'Accept:application/json' \
  -d '{"certname":"test1","environment":"DEV","values":{"myfact": null},"producer_timestamp":"2015-01-01", "producer":"master1"}' \
  "http://localhost:8080/pdb/cmd/v1?command=replace_facts&version=5&certname=test1"
/opt/puppetlabs/puppet/bin/curl -X POST \
  -H 'Content-Type:application/json' \
  -H 'Accept:application/json' \
  -d '{"certname":"test2","environment":"DEV","values":{"myfact": null},"producer_timestamp":"2015-01-01", "producer":"master1"}' \
  "http://localhost:8080/pdb/cmd/v1?command=replace_facts&version=5&certname=test2"

# Alter the facts table to trigger PDB-3692 on upgrade
sudo -u postgres /usr/bin/psql -d puppetdb -c "UPDATE facts SET value='null' WHERE factset_id=2 AND value_type_id=4"
{code}


Decrease the timeout for PuppetDB startup and then upgrade PuppetDB to 5.1.1:

{code:bash}

echo 'START_TIMEOUT=60' >> /etc/sysconfig/puppetdb

/opt/puppetlabs/bin/puppet resource --evaltrace package puppetdb ensure=5.1.1
{code}

### h3.  Outcome


/var/log/puppetlabs/puppetdb/puppetdb.log shows the migration fails and the service enters shutdown within seconds after starting:

{noformat}

2017-10-16 20:03:52,350 INFO  [p.p.c.services] PuppetDB version 5.1.1
2017-10-16 20:03:52,351 INFO  [c.z.h.HikariDataSource] PDBMigrationsPool - Starting...
2017-10-16 20:03:52,353 INFO  [c.z.h.HikariDataSource] PDBMigrationsPool - Start completed.
2017-10-16 20:03:52,517 INFO  [p.p.s.migrate] Applying database migration version 64
2017-10-16 20:03:52,518 INFO  [p.p.s.migrate] [1/8] Cleaning up unreferenced facts...
2017-10-16 20:03:52,527 INFO  [p.p.s.migrate] [2/8] Creating new fact storage tables...
2017-10-16 20:03:52,543 INFO  [p.p.s.migrate] [3/8] Copying unique fact values into fact_values
2017-10-16 20:03:52,555 INFO  [p.p.s.migrate] [4/8] Reconstructing facts to refer to fact_values...
2017-10-16 20:03:52,557 INFO  [p.p.s.migrate] [5/8] Cleaning up duplicate null values...
2017-10-16 20:03:52,558 INFO  [p.p.s.migrate]   ... none found
2017-10-16 20:03:52,559 INFO  [p.p.s.migrate] [6/8] Computing fact value hashes...
2017-10-16 20:03:52,611 INFO  [p.p.s.migrate] [7/8] Indexing fact_values table...
2017-10-16 20:03:52,640 ERROR [p.p.s.migrate] Caught SQLException during migration
java.sql.BatchUpdateException: Batch entry 3 ALTER TABLE fact_values ADD CONSTRAINT fact_values_value_hash_key UNIQUE (value_hash) was aborted.  Call getNextException to see the cause.
  ...
2017-10-16 20:03:52,641 ERROR [p.p.s.migrate] Unravelled exception
org.postgresql.util.PSQLException: ERROR: could not create unique index "fact_values_value_hash_key"
  Detail: Key (value_hash)=(\x2be88ca4242c76e8253ac62474851065032d6833) is duplicated.
  ...
2017-10-16 20:03:52,668 INFO  [p.t.internal] Shutting down due to JVM shutdown hook.
2017-10-16 20:03:52,669 INFO  [p.t.internal] Beginning shutdown sequence
{noformat}


However, the upgrade takes over a full minute to complete, the init script hangs until START_TIMEOUT kicks in:

{noformat}

# /opt/puppetlabs/bin/puppet resource --evaltrace --verbose package puppetdb ensure=5.1.1
Info: /Package[puppetdb]: Starting to evaluate the resource
Notice: /Package[puppetdb]/ensure: ensure changed '5.0.1-1.el7' to '5.1.1'
Info: /Package[puppetdb]: Evaluated in 132.98 seconds
package { 'puppetdb':
  ensure => '5.1.1-1.el7',
}

# journalctl -u puppetdb
...
Oct 16 19:51:10 r1lt6c9aq1o5t4r systemd[1]: Starting puppetdb Service...
Oct 16 19:51:29 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] classname setting has been retired and will be ignored.
Oct 16 19:51:29 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] subprotocol setting has been retired and will be ignored.
Oct 16 19:51:33 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] classname setting has been retired and will be ignored.
Oct 16 19:51:33 r1lt6c9aq1o5t4r puppetdb[13361]: The [database] subprotocol setting has been retired and will be ignored.
Oct 16 19:52:10 r1lt6c9aq1o5t4r puppetdb[13361]: Startup timed out after 60 seconds
Oct 16 19:52:10 r1lt6c9aq1o5t4r puppetdb[13361]: Startup script was terminated before completion
Oct 16 19:53:11 r1lt6c9aq1o5t4r puppetdb[13361]: Process 13368 not terminated gracefully after 60 seconds
Oct 16 19:53:12 r1lt6c9aq1o5t4r puppetdb[13361]: /opt/puppetlabs/server/apps/puppetdb/ezbake-functions.sh: line 79: 13368 Killed                  ${JAVA_BIN} ${JAVA_ARGS} -Djava.sec
Oct 16 19:53:13 r1lt6c9aq1o5t4r puppetdb[13361]: Process 13368 killed after SIGKILL
Oct 16 19:53:13 r1lt6c9aq1o5t4r systemd[1]: puppetdb.service: control process exited, code=exited status=1
Oct 16 19:53:13 r1lt6c9aq1o5t4r systemd[1]: Failed to start puppetdb Service.
{noformat}


### h3.  Expected Outcome


PuppetDB startup should fail-fast after migration fails and not have to be killed by the init script due START_TIMEOUT.

Charlie Sharpsteen (JIRA)

ongelezen,
19 okt 2017, 15:42:0219-10-2017
aan puppe...@googlegroups.com

Owen Rodabaugh (JIRA)

ongelezen,
19 okt 2017, 18:53:0319-10-2017
aan puppe...@googlegroups.com
Owen Rodabaugh updated an issue
Change By: Owen Rodabaugh
CS Priority: Needs Priority Major
CS Impact: This issue can cause troubleshooting a failed upgrade to take most of a day. This is a poor experience for our customers especially those who consider PE a critical service. It also leads to a very poor support experience when customers have to wait 4 hours to get final results that the upgrade failed.
CS Severity: 4 - Major
CS Business Value: 5 - $$$$$$
CS Frequency: 3 - 25-50% of Customers

Josh Cooper (JIRA)

ongelezen,
29 mrt 2019, 01:06:0229-03-2019
aan puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Team: Coremunity PuppetDB
This message was sent by Atlassian JIRA (v7.7.1#77002-sha1:e75ca93)
Atlassian logo

Rob Browning (Jira)

ongelezen,
15 jan 2021, 12:00:0415-01-2021
aan puppe...@googlegroups.com
Rob Browning commented on Bug PDB-3715
 
Re: PuppetDB doesn't complete shutdown after migration fails

I suspect/hope we may have fixed this – just need to re-test when someone has a bit of time.

This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)
Atlassian logo

Zachary Kent (Jira)

ongelezen,
17 feb 2021, 15:05:0317-02-2021
aan puppe...@googlegroups.com
Zachary Kent updated an issue
 
Change By: Zachary Kent
Labels: needs-investigation

Sebastian Miclea (Jira)

ongelezen,
13 aug 2021, 04:05:0213-08-2021
aan puppe...@googlegroups.com
Sebastian Miclea assigned an issue to Sebastian Miclea
Change By: Sebastian Miclea
Assignee: Sebastian Miclea
This message was sent by Atlassian Jira (v8.13.2#813002-sha1:c495a97)
Atlassian logo

Sebastian Miclea (Jira)

ongelezen,
13 aug 2021, 04:06:0213-08-2021
aan puppe...@googlegroups.com
Sebastian Miclea assigned an issue to Unassigned

Andrei Filipovici (Jira)

ongelezen,
26 aug 2021, 07:17:0226-08-2021
aan puppe...@googlegroups.com
Andrei Filipovici commented on Bug PDB-3715
 
Re: PuppetDB doesn't complete shutdown after migration fails

This bug is no longer reproducible. Tested it with PuppetDB version 7.9.0

Andrei Filipovici (Jira)

ongelezen,
26 aug 2021, 07:19:0326-08-2021
aan puppe...@googlegroups.com

Andrei Filipovici (Jira)

ongelezen,
26 aug 2021, 07:20:0326-08-2021
aan puppe...@googlegroups.com
Allen beantwoorden
Auteur beantwoorden
Doorsturen
0 nieuwe berichten