Dtx recovery background worker and backends


Ashwin Agrawal

unread,
May 26, 2020, 9:35:06 PM5/26/20
to Greenplum Developers
The DTX recovery background worker's task is to complete distributed recovery, meaning finish (commit / abort) pending prepared transactions. Only after DTX recovery completes can backends proceed. Hence, backends need to wait until DTX recovery completes successfully.

Currently, we have two infinite waits:
  • The DTX recovery process keeps retrying distributed recovery forever. If any error happens during distributed recovery (such as a double fault), the DTX recovery process errors out, terminates, and gets spawned again, failing and respawning indefinitely.
  • Backends wait indefinitely for the DTX recovery process to complete.

It's understood this situation should be rare in practice. In production this might be acceptable, but in CI we certainly need some way to avoid such infinite waiting; we need to time out somehow.

The DTX recovery process retrying forever seems fine, as what else can we do? But I feel we should have some sort of timeout for the backend to bail out and ERROR out if recovery is not completing. I was thinking of using statement_timeout for this, but it's not enabled by default, so it won't serve us for CI and similar cases. I was also thinking of adding a 10-minute timeout, but that felt like an arbitrary number; depending on cluster size and such, distributed recovery could genuinely take longer and we would unnecessarily fail the connection.
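
For illustration, a minimal sketch of what such a bounded backend wait could look like; dtx_recovery_in_progress() and the 10-minute cap below are placeholders made up for this sketch, not actual code from the tree:

    #include "postgres.h"
    #include "miscadmin.h"          /* pg_usleep() */

    /* Placeholder: in reality this would check shared-memory state set by the
     * dtx recovery worker. */
    extern bool dtx_recovery_in_progress(void);

    #define DTX_RECOVERY_WAIT_TIMEOUT_SECS 600  /* arbitrary 10-minute cap */

    static void
    wait_for_dtx_recovery(void)
    {
        int     waited_secs = 0;

        while (dtx_recovery_in_progress())
        {
            if (waited_secs >= DTX_RECOVERY_WAIT_TIMEOUT_SECS)
                ereport(ERROR,
                        (errmsg("timed out waiting for distributed transaction recovery to complete")));

            pg_usleep(1000 * 1000L);    /* poll once per second */
            waited_secs++;
        }
    }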

Does someone have a reasonable value or action to suggest?

Hao Wu

unread,
May 26, 2020, 9:45:27 PM5/26/20
to Ashwin Agrawal, Greenplum Developers

What should the segment/master do after they hit the timeout?

Paul Guo

unread,
May 26, 2020, 10:27:06 PM5/26/20
to Ashwin Agrawal, Greenplum Developers

> The DTX recovery process retrying forever seems fine, as what else can we do? But I feel we should have some sort of timeout for the backend to bail out and ERROR out if recovery is not completing. I was thinking of using statement_timeout for this, but it's not enabled by default, so it won't serve us for CI and similar cases. I was also thinking of adding a 10-minute timeout, but that felt like an arbitrary number; depending on cluster size and such, distributed recovery could genuinely take longer and we would unnecessarily fail the connection.
>
> Does someone have a reasonable value or action to suggest?


Did we see such an issue (DTX recovery hang) on CI recently?

How about enabling the error-out (or shutting down the master?) after a timeout only when cassert is enabled?
Currently CI mainly tests with cassert enabled for each commit. And since it would only be used
with cassert enabled, we could hard-code a rough value; no need for a GUC.


Asim Praveen

unread,
May 27, 2020, 2:54:56 AM5/27/20
to gpdb...@greenplum.org
On 27/05/20 7:15 am, Hao Wu wrote:
>
> What should the segment/master do after they hit the timeout?
>

The backend that was waiting for DTX recovery on the QD couldn't have
dispatched anything to segments.  Upon timeout, as per Ashwin's
suggestion, the backend would report an ERROR, which will be seen by the
user.  The user can retry.

DTX recovery failure should be treated similarly to crash recovery
failure.  The system remains unavailable until the failure is resolved.
+1 to letting the DTX recovery process be spawned and respawned as many
times as necessary until DTX recovery succeeds.

Regarding backend timeout, does it really need to be configurable?  If
the timeout is too short, users can always retry.  How about a
hard-coded limit of 30 seconds?

Asim

Paul Guo

unread,
May 27, 2020, 3:48:34 AM5/27/20
to Ashwin Agrawal, Greenplum Developers
Oops, ignore my previous words; I misunderstood the question.

Since there is no parallel query running while DTX recovery is in progress, the normal DTX recovery time might
not vary much, so maybe hard-code a value for the client. If DTX recovery really needs
more time, I think it is fine to let the client error out even while recovery is ongoing. I also think the timeout should not
be large; think about the single-node crash recovery case, where the client connection fails immediately.

Heikki Linnakangas

unread,
May 27, 2020, 3:59:34 PM5/27/20
to pvtl-cont-paulguo, Ashwin Agrawal, Greenplum Developers
On 27/05/2020 10:48, Paul Guo wrote:
> Since there is no parallel query running while DTX recovery is in
> progress, the normal DTX recovery time might not vary much, so maybe
> hard-code a value for the client. If DTX recovery really needs more
> time, I think it is fine to let the client error out even while
> recovery is ongoing. I also think the timeout should not be large;
> think about the single-node crash recovery case, where the client
> connection fails immediately.

I agree, a short timeout is fine. Maybe even no timeout at all: let the
client connection error out immediately if DTX recovery is still in
progress.

DTX recovery only happens at QD server startup, right? Like Paul said,
this is comparable to single-node crash recovery. During crash recovery,
you get:

FATAL: the database system is starting up
DETAIL: last replayed record at 0/0

We could have the same during DTX recovery. Something like:

FATAL: the database system is starting up
DETAIL: waiting for distributed transactions to complete
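
A minimal sketch of where such a check might live during backend startup on the QD, assuming some
flag exposed by the DTX recovery worker; DtxRecoveryIsComplete() here is a made-up placeholder,
not the actual API:

    #include "postgres.h"
    #include "cdb/cdbvars.h"        /* Gp_role, GP_ROLE_DISPATCH */

    /* Placeholder for whatever shared-memory flag the dtx recovery worker sets. */
    extern bool DtxRecoveryIsComplete(void);

    static void
    check_dtx_recovery_done(void)
    {
        /* Reject new connections on the QD until distributed recovery finishes,
         * mirroring how crash recovery reports "the database system is starting up". */
        if (Gp_role == GP_ROLE_DISPATCH && !DtxRecoveryIsComplete())
            ereport(FATAL,
                    (errcode(ERRCODE_CANNOT_CONNECT_NOW),
                     errmsg("the database system is starting up"),
                     errdetail("waiting for distributed transactions to complete")));
    }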

- Heikki

Ashwin Agrawal

unread,
May 27, 2020, 5:56:56 PM5/27/20
to Heikki Linnakangas, pvtl-cont-paulguo, Greenplum Developers
Thank you for all the inputs. I have created PR [1] for this.
The only doubt I have regarding emitting an immediate error is our utilities.
I think the first few connections are mostly made by our utilities, and I
don't know whether they have retry logic coded or not. At least running the
isolation2 tests revealed no problem. Hence, I opened the PR with no
retries.

Also, while looking into this I realized that cdb_setup() (within it, mainly
InitMotionLayerIPC()) is currently performed at the start of a connection
for all non-utility connections. I am wondering whether there is a way to delay
that initialization to some later point, only if the connection executes
a distributed query. For example, the query:

psql -c "select * from gp_segment_configuration;"

does not need InitMotionLayerIPC(). Currently such a connection would
also fail if DTX recovery is in progress, which really is not necessary
and seems avoidable. Thoughts?

[1] https://github.com/greenplum-db/gpdb/pull/10195

Ashwin Agrawal

unread,
May 28, 2020, 11:13:05 AM5/28/20
to Pengzhou Tang, Heikki Linnakangas, pvtl-cont-paulguo, Greenplum Developers
On 5/27/20 6:23 PM, Pengzhou Tang wrote:

> Hi Ashwin, I was trying to set a retry timeout of 4 seconds in cdb_setup() before, but it
> turned out to make the pipeline flaky, so I reverted that fix and kept the behavior of always
> waiting (see a54a557c212f). I am not sure whether your PR will still cause flakiness.

Things have passed in the PR pipeline and a local run, so I am not sure whether we would face
flakiness similar to last time. If anything, the change is much more aggressive this time compared
to last time, so I would expect any flakiness, if it exists, to show up sooner and more often than
last time. I don't have a better way to shake out the flakiness other than committing the PR and
seeing what happens.

>> Also, while looking into this I realized that cdb_setup() (within it, mainly
>> InitMotionLayerIPC()) is currently performed at the start of a connection
>> for all non-utility connections. I am wondering whether there is a way to delay
>> that initialization to some later point, only if the connection executes
>> a distributed query. For example, the query:
>>
>> psql -c "select * from gp_segment_configuration;"
>>
>> does not need InitMotionLayerIPC(). Currently such a connection would
>> also fail if DTX recovery is in progress, which really is not necessary
>> and seems avoidable. Thoughts?

> Hmm, InitMotionLayerIPC() is used to fork the interconnect thread and set up interconnect
> contexts; we need the listener port information when a QE is initialized.
>
> I agree we can move the logic of waiting for DTX recovery out of cdb_setup() and delay it
> until we perform some distributed query.

I am not planning to work on that piece, so I will leave it for someone else to tackle.

Pengzhou Tang

unread,
May 29, 2020, 9:23:55 AM5/29/20
to Ashwin Agrawal, Heikki Linnakangas, pvtl-cont-paulguo, Greenplum Developers

Hi Ashwin, I was trying to set a retry timeout of 4 seconds in cdb_setup() before, but it
turned out to make the pipeline flaky, so I reverted that fix and kept the behavior of always
waiting (see a54a557c212f). I am not sure whether your PR will still cause flakiness.

> Also, while looking into this I realized that cdb_setup() (within it, mainly
> InitMotionLayerIPC()) is currently performed at the start of a connection
> for all non-utility connections. I am wondering whether there is a way to delay
> that initialization to some later point, only if the connection executes
> a distributed query. For example, the query:
>
> psql -c "select * from gp_segment_configuration;"
>
> does not need InitMotionLayerIPC(). Currently such a connection would
> also fail if DTX recovery is in progress, which really is not necessary
> and seems avoidable. Thoughts?

Hmm, InitMotionLayerIPC() is used to fork the interconnect thread and set up interconnect
contexts; we need the listener port information when a QE is initialized.

I agree we can move the logic of waiting for DTX recovery out of cdb_setup() and delay it
until we perform some distributed query.

Ashwin Agrawal

unread,
May 29, 2020, 5:15:46 PM5/29/20
to Pengzhou Tang, Heikki Linnakangas, pvtl-cont-paulguo, Greenplum Developers
On 5/28/20 8:13 AM, Ashwin Agrawal wrote:
> On 5/27/20 6:23 PM, Pengzhou Tang wrote:
>
>> Hi Ashwin, I was trying to set a retry timeout of 4 seconds in cdb_setup() before, but it
>> turned out to make the pipeline flaky, so I reverted that fix and kept the behavior of always
>> waiting (see a54a557c212f). I am not sure whether your PR will still cause flakiness.
>
> Things have passed in the PR pipeline and a local run, so I am not sure whether we would face
> flakiness similar to last time. If anything, the change is much more aggressive this time
> compared to last time, so I would expect any flakiness, if it exists, to show up sooner and
> more often than last time. I don't have a better way to shake out the flakiness other than
> committing the PR and seeing what happens.

Okay, I can see where the flakiness in the tests comes from. It's mainly
the change in expectation around gpstart.

Crash recovery rejects connections immediately, but pg_ctl start does
not return right away: the default behavior (or with -w) for pg_ctl
start is to wait until it can successfully connect to the database.

In Greenplum, gpstart doesn't give the user an option to specify -w or
any timeout; it always behaves as if -w was specified.  So, at successful
exit of gpstart, connections should succeed and no retries are
required. Hence, tests that perform a restart (via gpstop -air) expect
to be able to connect as soon as the command is done.

With the old logic, the connection internally waited if distributed
transaction recovery was still running. But with immediate failure as
proposed, connections might fail. So, if we make this change, we need to
add logic to gpstart to also wait for distributed transaction recovery
to complete before declaring the start done.

- Ashwin

Asim Praveen

unread,
May 31, 2020, 12:58:17 AM5/31/20
to Ashwin Agrawal, Pengzhou Tang, Heikki Linnakangas, pvtl-cont-paulguo, Greenplum Developers

On 30-May-2020, at 2:46 AM, Ashwin Agrawal <aas...@vmware.com> wrote:

> In Greenplum, gpstart doesn't give the user an option to specify -w or
> any timeout; it always behaves as if -w was specified.  So, at successful
> exit of gpstart, connections should succeed and no retries are
> required. Hence, tests that perform a restart (via gpstop -air) expect
> to be able to connect as soon as the command is done.
>
> With the old logic, the connection internally waited if distributed
> transaction recovery was still running. But with immediate failure as
> proposed, connections might fail. So, if we make this change, we need to
> add logic to gpstart to also wait for distributed transaction recovery
> to complete before declaring the start done.


If the cdb_setup call can be deferred until it’s known that the command needs to be dispatched, the problem can be solved without changing gpstart, correct?  The queries invoked by gpstart should not need dispatch.

Asim

Ashwin Agrawal

unread,
Jun 2, 2020, 12:33:16 AM6/2/20
to Asim Praveen, Pengzhou Tang, Heikki Linnakangas, pvtl-cont-paulguo, Greenplum Developers

The flakiness is less about the queries from gpstart and more about the expectation in tests that, after successful completion of gpstart, further queries/tests will run fine. With the change, that expectation is broken. At the completion of gpstart, we should be ready to accept distributed transactions.

I have created PR https://github.com/greenplum-db/gpdb-postgres-merge/pull/52 against the PostgreSQL 12 merge branch, as it depends on a PostgreSQL 10 change. Please let me know your thoughts.

Asim Praveen

unread,
Jun 2, 2020, 2:57:42 AM6/2/20
to Ashwin Agrawal, Pengzhou Tang, Heikki Linnakangas, pvtl-cont-paulguo, Greenplum Developers


> On 02-Jun-2020, at 10:03 AM, Ashwin Agrawal <aas...@vmware.com> wrote:
>
> On 5/30/20 9:58 PM, Asim Praveen wrote:
>>
>>> On 30-May-2020, at 2:46 AM, Ashwin Agrawal <aas...@vmware.com> wrote:
>>>
>>> In Greenplum, gpstart doesn't give the user an option to specify -w or
>>> any timeout; it always behaves as if -w was specified. So, at successful
>>> exit of gpstart, connections should succeed and no retries are
>>> required. Hence, tests that perform a restart (via gpstop -air) expect
>>> to be able to connect as soon as the command is done.
>>>
>>> With the old logic, the connection internally waited if distributed
>>> transaction recovery was still running. But with immediate failure as
>>> proposed, connections might fail. So, if we make this change, we need to
>>> add logic to gpstart to also wait for distributed transaction recovery
>>> to complete before declaring the start done.
>>>
>>
>> If the cdb_setup call can be deferred until it’s known that the command needs to be dispatched, the problem can be solved without changing gpstart, correct? The queries invoked by gpstart should not need dispatch.
> The flakiness is less about the queries from gpstart and more about the expectation in tests that, after successful completion of gpstart, further queries/tests will run fine. With the change, that expectation is broken. At the completion of gpstart, we should be ready to accept distributed transactions.

Indeed, gpstart must render the cluster with DTM started. I misunderstood the problem earlier.

> I have created this PR https://github.com/greenplum-db/gpdb-postgres-merge/pull/52 against PostgreSQL 12 merge branch, as depends on PostgreSQL 10 change. Please let me know your thoughts.

LGTM!

Asim

David Krieger

unread,
Jun 2, 2020, 12:27:06 PM6/2/20
to Greenplum Developers, pa...@vmware.com, peng...@vmware.com, linnak...@vmware.com, pau...@gmail.com, aas...@vmware.com


On Monday, June 1, 2020 at 9:33:16 PM UTC-7, Ashwin wrote:

> The flakiness is less about the queries from gpstart and more about the expectation in tests that, after successful completion of gpstart, further queries/tests will run fine. With the change, that expectation is broken. At the completion of gpstart, we should be ready to accept distributed transactions.



What is the method for gpstart to determine if DTM is started?

Our usual technique for determining if the cluster can handle distributed transactions is to do:

BEGIN;
CREATE TEMP TABLE canDoDistributedXact(a int);
COMMIT;

in a loop until it succeeds, but perhaps there is a better interface in master?



David 

Ashwin Agrawal

unread,
Jun 2, 2020, 11:42:06 PM6/2/20
to David Krieger (Pivotal), Greenplum Developers, Asim Praveen, Pengzhou Tang, Heikki Linnakangas, pvtl-cont-paulguo
On 6/2/20 9:27 AM, David Krieger wrote:
> What is the method for gpstart to determine if DTM is started?
>
> Our usual technique for determining if the cluster can handle
> distributed transactions is to do:
>
> BEGIN;
> CREATE TEMP TABLE canDoDistributedXact(a int);
> COMMIT;
>
> in a loop until it succeeds, but perhaps there is a better interface
> in master?

The current master branch doesn't have any better mechanism, so the
brute-force way you described above is the only option. But things change
starting with PostgreSQL 10. Upstream changed the postmaster to record its
status in the PID file in the data directory; it updates this file when the
server reaches specific states, such as "ready", "starting",
"stopping", etc. So, leveraging that change, I added a "dtmready" state
to notify pg_ctl (and effectively gpstart) to wait until distributed
transaction recovery is complete. This avoids having pg_ctl -w make
connections to the database and avoids flooding the logs with FATAL
"cannot connect" messages. This change will come to master when the
PostgreSQL 12 merge work lands.
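
For reference, a rough sketch of how such a status could be recorded, modeled on the upstream
PostgreSQL 10 mechanism; AddToDataDirLockFile() and LOCK_FILE_LINE_PM_STATUS are the upstream
pieces, while the "dtmready" token and the helper below are only illustrative and may differ
from what the actual change does:

    #include "postgres.h"
    #include "miscadmin.h"          /* AddToDataDirLockFile() */
    #include "utils/pidfile.h"      /* LOCK_FILE_LINE_PM_STATUS, PM_STATUS_* */

    /* Illustrative status token; upstream defines "ready", "starting",
     * "stopping", etc. in utils/pidfile.h (PostgreSQL 10+). */
    #define PM_STATUS_DTMREADY "dtmready"

    static void
    report_dtm_ready(void)
    {
        /* Rewrite the status line of postmaster.pid so that pg_ctl -w (and
         * hence gpstart) can wait for distributed transaction recovery to
         * finish without repeatedly attempting connections. */
        AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DTMREADY);
    }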
