I spent some more time on PR 790, nowadays known as PR 9258, to refactor
AO table segment management. I read through the old discussions and the
problems that came up the last time this was looked at.
There are currently three issues that I'm aware of. The first two were
discussed in an earlier thread on gpdb-dev. The third one I just ran
into myself, while playing with the PR and poking holes into the first
two issues.
I believe I have a path forward. See proposed Solutions below for each
of the three issues.
Problem 1: VACUUM may not be able to remove old segments quickly
----------------------------------------------------------------
This is the problem that Asim explained at the start of this thread:
https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/FzL3ascmAgAJ
Essentially, with the proposed PR, a VACUUM on an appendonly table is
not able to remove old AO segments after scanning the table, as long
as there are any transactions still running in the system that had
started before the VACUUM.
Solution:
I think we can live with that. Sure, dropping the old segments sooner
would be nice, but I don't see any safe way to achieve that. Currently,
we have a system in place that tracks serializable transactions, to
allow us to be more aggressive, but that is not quite waterproof either;
see problem 2.
So I propose that we accept that. If you run VACUUM on an AO table, and
there are any older transactions still running when it finishes, the
VACUUM will leave behind the old AO segment. The next VACUUM will drop
it, if the transaction has finished by then. (autovacuum would be nice
here...)
Problem 2: Existing anomaly with UDFs or with a lucky race
----------------------------------------------------------
This is the problem I mentioned in reply to Asim's initial post:
https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ
Session 1: Start transaction, acquire snapshot
Session 2: VACUUM. Removes old tuple versions, and moves everything
that's left to a new segment. It then drops the old segment.
Session 1: Run a query against the table, using the snapshot that was
acquired in first step.
This is easy to demonstrate with a user-defined function with a sleep in
it, but in theory you could run into it with any transaction, as there's
always a small window between acquiring a snapshot and acquiring locks
on tables.
This can be demonstrated on 6X_STABLE (and master) with this
isolation2test script:
--------------------
CREATE TABLE ao (a INT, b INT) WITH (appendonly=true);
INSERT INTO ao SELECT i as a, i as b FROM generate_series(1, 100) AS i;
DELETE FROM ao WHERE a <= 30;
create or replace function myfunc() returns bigint as $$
begin /* inside a function */
perform pg_sleep(10); /* inside a function */
return (select count(*) from ao); /* inside a function */
end; /* inside a function */
$$ stable language plpgsql;
-- Start a query. This will acquire the snapshot quickly, but then pause in
-- the pg_sleep inside the function.
1&: select myfunc();
-- Vacuum the table.
2: VACUUM ao;
-- Wait for the query to finish.
1<:
--------------------
This fails with an error:
ERROR: read beyond eof in table "ao" file "base/162197/92796.1", read
position 0 (small offset 0), actual read length 0 (large read length
720) (cdbbufferedread.c:211)
CONTEXT: SQL statement "SELECT (select count(*) from ao)"
PL/pgSQL function myfunc() line 4 at RETURN
Solution:
This will be fixed by the PR we're discussing. The delayed segment
removal, described above as Problem 1, fixes this.
Problem 3: Confused visibility with mix of distributed vs local transactions
----------------------------------------------------------------------------
To determine the visibility of AO segments, we rely on the auxiliary
pg_aoseg_* heap table. The visibility of the rows in the pg_aoseg table
determines which segments are visible to a snapshot.
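For concreteness, here's one way to inspect that auxiliary table. (This
is just an illustration; the OID suffix in the aoseg relation name
varies per table, and the exact column set depends on the server
version.)

```sql
-- Look up the auxiliary aoseg relation for our example table "ao".
SELECT segrelid::regclass FROM pg_appendonly
WHERE relid = 'ao'::regclass;

-- Suppose that returned pg_aoseg.pg_aoseg_16385. Each row in it
-- describes one AO segment file; whether a snapshot sees a given
-- segment is decided by normal heap MVCC visibility of these rows.
SELECT segno, eof, tupcount, modcount, state
FROM pg_aoseg.pg_aoseg_16385;
```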
That works great currently, but if we switch to updating pg_aoseg in a
local transaction as this PR does, some anomalies become possible. A
distributed transaction might see an inconsistent view of the segments,
because one row version in pg_aoseg is visible according to the
distributed snapshot, while another version of the same row is visible
to its local snapshot.
In fact, you can observe this anomaly even on master without appendonly
tables, if you modify a heap table in utility mode. Here's an example
(as an isolation2test test schedule):
--------------------
DROP TABLE IF EXISTS utiltest;
CREATE TABLE utiltest (a int, t text);
INSERT INTO utiltest SELECT i as a, 'initial' FROM generate_series(1, 10) AS i;
create or replace function myfunc() returns setof utiltest as $$
begin /* inside a function */
perform pg_sleep(10); /* inside a function */
return query select * from utiltest; /* inside a function */
end; /* inside a function */
$$ stable language plpgsql;
-- Launch one backend to query the table with a delay. It will acquire the
-- snapshot first, but scan the table only later. (A serializable snapshot
-- would achieve the same thing.)
1&: select * from myfunc();
-- Update normally via QD. The delayed query does not see this update, as
-- it acquired the snapshot before this.
2: UPDATE utiltest SET t = 'updated' WHERE a <= 3;
-- Update again in utility mode. Does the delayed query see this or not?
0U: UPDATE utiltest SET t = 'updated in utility mode' WHERE a <= 5;
-- Get the delayed query's results. It returns 12 rows. It sees the
-- initial version of each row, but it *also* sees the new tuple
-- versions from the utility-mode update. The utility-updated rows have
-- an xmin that's visible to the local snapshot of the delayed query.
1<:
--------------------
In summary, this test case creates a table with 10 rows, and updates
some rows twice. One of the updates is made in utility mode. No rows are
deleted or inserted after the initial load, so any query on the table
should see 10 rows. But the query that's performed in the function sees
12 rows. It sees two versions of the rows that were updated twice.
This is clearly not ideal, but we can perhaps shrug it off as "don't do
that". If you update rows in utility mode, weird things can happen. But
it poses a problem for the idea of using local transactions for AO
vacuum. We can't let that anomaly happen as a result of a normal VACUUM.
Solution:
Change the PR so that it still uses distributed transactions for the
updates on pg_aoseg during AO vacuum. That's annoying because it brings
back some of the complexity in the vacuum code that I wanted to
eliminate, but I think it's doable.
- Heikki