CI failures due to isolation2 framework problem


Asim

Jun 23, 2021, 8:00:22 AM
to Greenplum Developers
Hello

Some intermittent failures in CI are happening because of the way the
"q" syntax for quitting a session is implemented in the isolation2
framework. An example is this resgroup failure:

https://prod.ci.gpdb.pivotal.io/teams/main/pipelines/6X_STABLE_without_asserts/jobs/resource_group_centos6/builds/1046

The relevant diff from that link is as follows:

--- /home/gpadmin/gpdb_src/src/test/isolation2/expected/resgroup/resgroup_cancel_terminate_concurrency.out
+++ /home/gpadmin/gpdb_src/src/test/isolation2/results/resgroup/resgroup_cancel_terminate_concurrency.out
@@ -293,9 +293,10 @@
 1q: ... <quitting>
 2q: ... <quitting>
 DROP ROLE role_concurrency_test;
-DROP
+ERROR: role "role_concurrency_test" cannot be dropped because some objects depend on it
+DETAIL: owner of table pg_temp_28.tmp

The isolation2 framework's test runner does not wait for the backend
process to exit when processing the "1q:" and "2q:" lines from the spec.
If the test runner moves on to the next line in the spec while the
backend processes for sessions "1" and "2" have not yet exited, the
DROP is bound to fail. This problem is easy to reproduce on Greenplum 7
as well.

Solutions?  One option is to make the isolation2 framework capable of
waiting until the backend process exits when processing "*q:" lines.  A
new connection can be used to check for the existence of the desired
backend pid by querying the "pg_stat_activity" view.  The isolation2
framework supports utility-mode, mirror, and standby connections, in
addition to regular connections to the Greenplum coordinator.  Simply
querying "pg_stat_activity" may not be enough; we may need to obtain
active PIDs from the segments as well.
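A framework-level wait along these lines could look roughly like the
following. This is only a sketch: `wait_for_backend_exit` and the
`pid_is_active` callable are hypothetical names, not actual isolation2
code, and a real implementation would need to run the query against the
coordinator or the appropriate segment depending on the session type.

```python
import time

# Hypothetical query the framework could run on a separate connection;
# for utility-mode sessions it would have to target the right segment.
BACKEND_EXISTS_SQL = "SELECT 1 FROM pg_stat_activity WHERE pid = %s"

def wait_for_backend_exit(pid_is_active, timeout=30.0, interval=0.1):
    """Poll until the session's backend process is gone.

    pid_is_active is any zero-argument callable that reports whether
    the backend pid still shows up (e.g. by running BACKEND_EXISTS_SQL
    on a fresh connection).  Raises TimeoutError if the backend
    lingers past the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_is_active():
            return
        time.sleep(interval)
    raise TimeoutError("backend did not exit within %.1fs" % timeout)
```

The "*q:" handler would call something like this after sending the
quit, before the runner advances to the next line of the spec.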

Another alternative is to place the onus on individual tests to ensure
that the desired backend process has exited.  In the above failed test,
a new fault point could be defined in RemoveTempRelations and waited on
before the "DROP ROLE" statement.

Any other solutions or comments on how to address this issue?

Asim


Ashwin Agrawal

Jun 23, 2021, 6:52:07 PM
to Asim, Greenplum Developers
Great find. Thanks for sharing.

- Do we know all the reasons tests use ":q"?
- How many tests among them really depend on the session exit having completed before follow-up steps? I am guessing very few.

Answers to these might help us decide whether adding a wait for the session termination to actually complete on ":q" is worth it, or whether it would unnecessarily slow down tests.

If only a few tests really care about the real termination, then maybe those tests can have retry logic to wait for the dependency to be cleared. I would prefer that it be done without introducing a fault injector for this, e.g. by retrying based on checks of the pg_depend table and such for the DROP ROLE case.
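A per-test retry of that shape might look roughly like this. It is a
sketch only: `run_sql` is a stand-in for whatever hook the test harness
provides for executing SQL, and note that dependencies on roles, which
are shared objects, are actually recorded in pg_shdepend rather than
pg_depend.

```python
import time

# Dependencies on shared objects such as roles live in pg_shdepend.
DEPENDENCY_COUNT_SQL = """
    SELECT count(*) FROM pg_shdepend d
    JOIN pg_authid a ON a.oid = d.refobjid
    WHERE d.refclassid = 'pg_authid'::regclass
      AND a.rolname = 'role_concurrency_test'
"""

def drop_role_with_retry(run_sql, retries=50, interval=0.1):
    """Retry until nothing depends on the role, then drop it.

    run_sql is a stand-in for the harness's SQL hook: it should return
    a scalar for the count query and execute the DROP as a statement.
    """
    for _ in range(retries):
        if int(run_sql(DEPENDENCY_COUNT_SQL)) == 0:
            run_sql("DROP ROLE role_concurrency_test")
            return
        time.sleep(interval)
    raise TimeoutError("dependencies on role_concurrency_test never cleared")
```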

-- 
Ashwin Agrawal (VMware)

Asim

Jun 24, 2021, 2:59:14 AM
to Ashwin Agrawal, Greenplum Developers
On 24/06/21 4:21 am, Ashwin Agrawal wrote:
>
> - Do we know all the reasons tests use ":q"?
> - How many tests among them really depend on the session exit having
> completed before follow-up steps? I am guessing very few.


The only reason I can imagine is to test on-proc-exit actions such as
cleaning up session state.  That includes temp tables, workfiles, and
such.  Therefore, legitimate uses of ":q" should be quite uncommon, in
line with what you guessed.  Besides the resource group test, I came
across workfile_mgr_test in isolation2, which suffers from flaky
behaviour due to ":q".


> Answers to these might help us decide whether adding a wait for the
> session termination to actually complete on ":q" is worth it, or
> whether it would unnecessarily slow down tests.
>
> If only a few tests really care about the real termination, then maybe
> those tests can have retry logic to wait for the dependency to be
> cleared. I would prefer that it be done without introducing a fault
> injector for this, e.g. by retrying based on checks of the pg_depend
> table and such for the DROP ROLE case.


Adding retry logic in specific tests is a better alternative. Thank you!

Asim

