I spent some more time on PR 790, nowadays known as PR 9258, to refactor
AO table segment management. I read through the old discussions and the
problems that came up the last time this was looked at.
There are currently three issues that I'm aware of. The first two were
discussed in an earlier thread on gpdb-dev. The third one I just ran
into myself, while playing with the PR and poking holes into the first
two issues.
I believe I have a path forward. See proposed Solutions below for each
of the three issues.
Problem 1: VACUUM may not be able to remove old segments quickly
----------------------------------------------------------------
This is the problem that Asim explained at the start of this thread:
https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/FzL3ascmAgAJ
Essentially, with the proposed PR, a VACUUM on an appendonly table is
not able to remove old AO segments after scanning the table, as long
as there are any transactions still running in the system that had
started before the VACUUM.
Solution:
I think we can live with that. Sure, dropping the old segments sooner
would be nice, but I don't see any safe way to achieve that. Currently,
we have a system in place that tracks serializable transactions, to
allow us to be more aggressive, but that is not quite waterproof either;
see problem 2.
So I propose that we accept that. If you run VACUUM on an AO table, and
there are any older transactions still running when it finishes, the
VACUUM will leave behind the old AO segment. The next VACUUM will drop
it, if the transaction has finished by then. (autovacuum would be nice
here...)
Problem 2: Existing anomaly with UDFs or with a lucky race
----------------------------------------------------------
This is the problem I mentioned in reply to Asim's initial post:
https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ
Session 1: Start transaction, acquire snapshot
Session 2: VACUUM. Removes old tuple versions, and moves everything
that's left to a new segment. It then drops the old segment.
Session 1: Run a query against the table, using the snapshot that was
acquired in first step.
This is easy to demonstrate with a user-defined function with a sleep in
it, but in theory you could run into it with any transaction, as there's
always a small window between acquiring a snapshot and acquiring locks
on tables.
This can be demonstrated on 6X_STABLE (and master) with this
isolation2test script:
--------------------
CREATE TABLE ao (a INT, b INT) WITH (appendonly=true);
INSERT INTO ao SELECT i as a, i as b FROM generate_series(1, 100) AS i;
DELETE FROM ao WHERE a <= 30;
create or replace function myfunc() returns bigint as $$
begin /* inside a function */
perform pg_sleep(10); /* inside a function */
return (select count(*) from ao); /* inside a function */
end; /* inside a function */
$$ stable language plpgsql;
-- Start a query. This will acquire the snapshot quickly, but then pause in
-- the pg_sleep inside the function.
1&: select myfunc();
-- Vacuum the table.
2: VACUUM ao;
-- Wait for the query to finish.
1<:
--------------------
This fails with an error:
ERROR: read beyond eof in table "ao" file "base/162197/92796.1", read
position 0 (small offset 0), actual read length 0 (large read length
720) (cdbbufferedread.c:211)
CONTEXT: SQL statement "SELECT (select count(*) from ao)"
PL/pgSQL function myfunc() line 4 at RETURN
Solution:
This will be fixed by the PR we're discussing. The delayed segment
removal, described above as Problem 1, fixes this.
Problem 3: Confused visibility with mix of distributed vs local transactions
----------------------------------------------------------------------------
To determine the visibility of AO segments, we rely on the auxiliary
pg_aoseg_* heap table. The visibility of the rows in the pg_aoseg table
determines which segments are visible to a snapshot.
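For concreteness, here's one way to inspect that auxiliary table. (This
is just an illustration; the OID suffix in the aoseg relation name
varies per table, and the exact column set depends on the server
version.)

```sql
-- Look up the auxiliary aoseg relation for our example table "ao".
SELECT segrelid::regclass FROM pg_appendonly
WHERE relid = 'ao'::regclass;

-- Suppose that returned pg_aoseg.pg_aoseg_16385. Each row in it
-- describes one AO segment file; whether a snapshot sees a given
-- segment is decided by normal heap MVCC visibility of these rows.
SELECT segno, eof, tupcount, modcount, state
FROM pg_aoseg.pg_aoseg_16385;
```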
That works great currently, but if we switch to updating pg_aoseg in a
local transaction as this PR does, some anomalies become possible. A
distributed transaction might see an inconsistent view of the segments,
because one row version in pg_aoseg is visible according to the
distributed snapshot, while another version of the same row is visible
to its local snapshot.
In fact, you can observe this anomaly even on master without appendonly
tables, if you modify a heap table in utility mode. Here's an example
(as an isolation2test test schedule):
--------------------
DROP TABLE IF EXISTS utiltest;
CREATE TABLE utiltest (a int, t text);
INSERT INTO utiltest SELECT i as a, 'initial' FROM generate_series(1, 10) AS i;
create or replace function myfunc() returns setof utiltest as $$
begin /* inside a function */
perform pg_sleep(10); /* inside a function */
return query select * from utiltest; /* inside a function */
end; /* inside a function */
$$ stable language plpgsql;
-- Launch one backend to query the table with a delay. It will acquire the
-- snapshot first, but scan the table only later. (A serializable snapshot
-- would achieve the same thing.)
1&: select * from myfunc();
-- Update normally via QD. The delayed query does not see this update, as
-- it acquired the snapshot before this.
2: UPDATE utiltest SET t = 'updated' WHERE a <= 3;
-- Update again in utility mode. Does the delayed query see this or not?
0U: UPDATE utiltest SET t = 'updated in utility mode' WHERE a <= 5;
-- Get the delayed query's results. It returns 12 rows. It sees the
-- initial version of each row, but it *also* sees the new tuple
-- versions from the utility-mode update. The utility-updated rows have
-- an xmin that's visible to the local snapshot of the delayed query.
1<:
--------------------
In summary, this test case creates a table with 10 rows, and updates
some rows twice. One of the updates is made in utility mode. No rows are
deleted or inserted after the initial load, so any query on the table
should see 10 rows. But the query that's performed in the function sees
12 rows. It sees two versions of the rows that were updated twice.
This is clearly not ideal, but we can perhaps shrug it off as "don't do
that". If you update rows in utility mode, weird things can happen. But
it poses a problem for the idea of using local transactions for AO
vacuum. We can't let that anomaly happen as a result of a normal VACUUM.
Solution:
Change the PR so that it still uses distributed transactions for the
updates on pg_aoseg during AO vacuum. That's annoying because it brings
back some of the complexity in the vacuum code that I wanted to
eliminate, but I think it's doable.
- Heikki