Hello
Some inconsistent failures are happening in CI due to the way the "q" syntax
for quitting a session is implemented in the isolation2 framework. An
example is this resgroup failure:
https://prod.ci.gpdb.pivotal.io/teams/main/pipelines/6X_STABLE_without_asserts/jobs/resource_group_centos6/builds/1046
The relevant diff from that link is as follows:
--- /home/gpadmin/gpdb_src/src/test/isolation2/expected/resgroup/resgroup_cancel_terminate_concurrency.out
+++ /home/gpadmin/gpdb_src/src/test/isolation2/results/resgroup/resgroup_cancel_terminate_concurrency.out
@@ -293,9 +293,10 @@
 1q: ... <quitting>
 2q: ... <quitting>
 DROP ROLE role_concurrency_test;
-DROP
+ERROR: role "role_concurrency_test" cannot be dropped because some objects depend on it
+DETAIL: owner of table pg_temp_28.tmp
The isolation2 framework's test runner does not wait until the backend
process exits when processing the "1q:" and "2q:" lines from the spec.
If the test runner moves on to process the next line in the spec while
the backend processes for sessions "1" and "2" have not exited yet, the
DROP is bound to fail. This problem is easy to reproduce on Greenplum 7
as well.
Solutions? One option is to make the isolation2 framework capable of
waiting until the backend process exits when processing "*q:" lines. A
new connection can be used to check for the existence of the desired
backend PID by querying the "pg_stat_activity" view. Note that the
isolation2 framework supports utility-mode, mirror, and standby
connections in addition to regular connections to the Greenplum
coordinator, so simply querying "pg_stat_activity" on the coordinator
may not be enough; we may also need to obtain active PIDs from the
segments.
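As a rough sketch of that first option: the runner could record the
backend PID of each session (e.g. from pg_backend_pid()) and, on "*q:",
poll over a separate connection until that PID disappears. The helper
below is hypothetical; query_backend_pids stands in for whatever code
runs "SELECT pid FROM pg_stat_activity" (or its per-segment,
utility-mode equivalent) and returns the set of live backend PIDs.

```python
import time

def wait_for_backend_exit(backend_pid, query_backend_pids,
                          timeout=30.0, poll_interval=0.1):
    """Poll until backend_pid no longer appears among the live backends.

    query_backend_pids is a callable returning the current set of backend
    PIDs (hypothetically, from pg_stat_activity over a fresh connection).
    Returns True once the PID is gone; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if backend_pid not in query_backend_pids():
            return True
        time.sleep(poll_interval)
    raise TimeoutError("backend %d did not exit within %g seconds"
                       % (backend_pid, timeout))
```

The polling loop is the simple approach; it trades a small delay per
"*q:" line for determinism, and the timeout keeps a stuck backend from
hanging the whole test run.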
Another alternative is to place the onus on tests to ensure that the
desired backend process has exited. In the failed test above, a new
fault could be defined in RemoveTempRelations and waited on before the
"DROP ROLE" statement.
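With that approach, the failing spec could be adjusted roughly as
follows. This is only a sketch: the fault name
"remove_temp_relations_done" is hypothetical and would first have to be
added to RemoveTempRelations; gp_inject_fault and
gp_wait_until_triggered_fault are the existing fault-injection UDFs.

```sql
-- Arm a hypothetical fault point at the end of RemoveTempRelations on
-- the coordinator, so each quitting backend triggers it after dropping
-- its temp relations.
3: SELECT gp_inject_fault('remove_temp_relations_done', 'skip', dbid)
   FROM gp_segment_configuration WHERE role = 'p' AND content = -1;

1q:
2q:

-- Block until both session 1's and session 2's backends have passed
-- RemoveTempRelations, i.e. their temp tables are gone.
3: SELECT gp_wait_until_triggered_fault('remove_temp_relations_done', 2, dbid)
   FROM gp_segment_configuration WHERE role = 'p' AND content = -1;

3: SELECT gp_inject_fault('remove_temp_relations_done', 'reset', dbid)
   FROM gp_segment_configuration WHERE role = 'p' AND content = -1;

3: DROP ROLE role_concurrency_test;
```

The downside, as noted, is that every test with this pattern has to
carry the same boilerplate, whereas fixing the framework fixes all
specs at once.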
Any other solutions or comments on how to address this issue?
Asim