Appendonly vacuum in PR 790


Asim Praveen

Apr 7, 2017, 7:41:35 PM
to Greenplum Developers
PR 790 proposes to change vacuum behavior for appendonly tables so that it can be performed in one step on QEs without QD coordinating individual vacuum phases of prepare, compact and drop.  While this is a highly desirable simplification, it comes with a problem.  It creates an exception to the norm that locks are acquired on QD before a statement is dispatched to QE.  Let's first summarize the proposed change for lazy vacuum:  

1. QD acquires AccessShareLock on the AO table and dispatches the lazy vacuum to QEs.
2. QE runs the prepare and compaction phases as local transactions.
3. Upon compaction, state of the appendonly segment file in pg_aoseg is updated to AWAITING_DROP.
4. QE then begins drop phase as a new local transaction and tries to upgrade the lock to AccessExclusiveLock.  Note that this lock upgrade happens only on QE while QD is still holding AccessShareLock.
5. A compacted appendonly segment file is dropped only if (a) the lock on the appendonly table can be upgraded to AccessExclusiveLock and (b) the segment file's tuple in pg_aoseg that indicates the AWAITING_DROP state can be frozen with heap_freeze_tuple(); freezing it makes that state visible to all running transactions (a rough sketch follows this list).
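
To make steps 4 and 5 concrete, here is a rough sketch of the drop-phase gate on a QE.  This is illustrative only, not the PR 790 code: AppendOnlyDrop_TrySegfile, SegfileDrop() and the cutoffXid parameter (which would come from something like GetOldestXmin()) are made-up names for this sketch.

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* hypothetical helper that actually unlinks the compacted segment file */
extern void SegfileDrop(Relation aorel, int segno);

/*
 * Illustrative sketch, not the PR 790 code: the two conditions (a) and (b)
 * that gate dropping a compacted segment file on a QE.
 */
static bool
AppendOnlyDrop_TrySegfile(Relation aorel, int segno,
                          HeapTuple aosegTuple, TransactionId cutoffXid)
{
    /* (a) upgrade to AccessExclusiveLock locally on the QE, without blocking */
    if (!ConditionalLockRelationOid(RelationGetRelid(aorel), AccessExclusiveLock))
        return false;       /* a concurrent reader holds a conflicting lock */

    /*
     * (b) the AWAITING_DROP tuple in pg_aoseg must be freezable: its xmin
     * (the compaction transaction) must precede every running transaction.
     * A single concurrent transaction older than the compaction transaction
     * fails this test and leaves the segment file in AWAITING_DROP.
     */
    if (!TransactionIdPrecedes(HeapTupleHeaderGetXmin(aosegTuple->t_data), cutoffXid))
    {
        UnlockRelationOid(RelationGetRelid(aorel), AccessExclusiveLock);
        return false;
    }

    /* freeze the pg_aoseg tuple (heap_freeze_tuple) and unlink the file */
    SegfileDrop(aorel, segno);
    return true;
}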

The problem is step 4, where QE upgrades the lock independently of QD.  This is dangerous because vacuum may drop an appendonly segment file that is still needed by an old snapshot.  The check 5(b) guards against this case.  However, it is too strict a protection: if there is at least one transaction older than the compaction transaction, check 5(b) fails and the segment file is left in AWAITING_DROP state without being dropped.  So 5(b) effectively translates to: if vacuum is not the oldest running transaction in the system, give up the drop phase.  In all likelihood, there will be a concurrent transaction older than vacuum.  We found the uao_catalog_tables ICW test failing in the parallel schedule because of this.

Vacuum full also faces the same problem of giving up in drop phase due to a concurrent older transaction because it uses the same check in 5(b).

To fix this, the check 5(b) must be weakened/removed and locks must be acquired on QD before dispatch.  We propose a solution as follows:

1. Break vacuum into two steps and let QD dispatch each step (a rough sketch of the QD-side flow follows this list).
2. Step 1 runs the prepare and compact phases.  Step 2 runs the drop phase.
3. QD acquires AccessShareLock before dispatching step 1.
4. QD upgrades the AccessShareLock to AccessExclusiveLock and then dispatches step 2.
5. QEs do not use the heap_freeze_tuple() check in the drop phase.  The fact that QD was able to acquire AccessExclusiveLock on the AO table means that no older snapshot can ever need the appendonly segment file about to be dropped.  Remember the rule that locks are acquired on QD and then the statement is dispatched to QE.
6. The question is then: how do we guard against serializable transactions that started before vacuum, or cursors?  This is already taken care of today by walking the ProcArray and giving up if PGPROC->serializableIsoLevel is found to be set for at least one backend.
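
For concreteness, here is a rough sketch of the proposed QD-side flow.  dispatchVacuumCompactStep(), dispatchVacuumDropStep() and existsOlderSnapshotHolder() are hypothetical stand-ins for the two dispatch calls and the ProcArray walk, not existing GPDB functions.

#include "postgres.h"
#include "storage/lmgr.h"

/* hypothetical stand-ins for the dispatch calls and the ProcArray walk */
extern void dispatchVacuumCompactStep(Oid aoRelId);
extern void dispatchVacuumDropStep(Oid aoRelId);
extern bool existsOlderSnapshotHolder(void);

/* Illustrative sketch of the proposed flow, not existing code. */
static void
vacuumAppendOnlyRel(Oid aoRelId)
{
    /* Step 1: prepare + compact, dispatched under AccessShareLock (points 2-3) */
    LockRelationOid(aoRelId, AccessShareLock);
    dispatchVacuumCompactStep(aoRelId);

    /*
     * Step 2: drop.  Upgrade the lock on the QD *before* dispatching, so no
     * newly dispatched reader can need the compacted segment files (point 4).
     * Backends that took their snapshot earlier (serializable transactions,
     * cursors) are detected by the ProcArray walk mentioned in point 6.
     */
    LockRelationOid(aoRelId, AccessExclusiveLock);
    if (existsOlderSnapshotHolder())
        return;                 /* give up the drop phase */
    dispatchVacuumDropStep(aoRelId);
}

The point of the sketch is only the ordering: both locks are taken on the QD before anything is dispatched, so a QE never upgrades a lock on its own.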

This solution will penalize utility-mode reads on the AO table being vacuumed, by introducing a window where vacuum may remove an appendonly segment file that is still needed by an old snapshot.

Comments, suggestions?

Abhijit and Asim

Heikki Linnakangas

Apr 8, 2017, 7:05:18 AM
to Asim Praveen, Greenplum Developers
On 04/08/2017 02:41 AM, Asim Praveen wrote:
> 5. QEs do not use heap_freeze_tuple() check in drop phase. The fact that
> QD was able to acquire AccessExclusiveLock on the AO table means that no
> older snapshot can ever need the appendonly file about to be dropped.
> Remember the rule that locks are acquired on QD and then the statement is
> dispatched to QE.
> 6. The question is then how do we guard against serializable transactions
> that started before vacuum, or cursors? This is already taken care of
> today by walking ProcArray and giving up if PGPROC->serializableIsoLevel is
> found to be set for at least one backend.

This doesn't sound right. A serializable transaction just makes it
easier to reproduce the problem, but the exact same problems exist with
read committed transactions. You mentioned cursors yourself, and
serializableIsoLevel will help with them. Even without cursors, you
could have a query like:

SELECT pg_sleep(10) FROM tableA
UNION ALL
SELECT myfunction();

where myfunction() runs a query:

SELECT * FROM <table-being-vacuumed>


If vacuum runs after the query has acquired its snapshot, but before
myfunction() runs, you have a problem.

- Heikki

Robert Eckhardt

Apr 8, 2017, 4:11:35 PM
to Heikki Linnakangas, Asim Praveen, Greenplum Developers
Heikki, 

Let me take a bit of a step back. PR 790 was cool in my opinion because it did some work towards offloading control from the master to the segments.  Furthermore, it simplified the code.

It seems like we ran into some seemingly insurmountable hurdles with that particular implementation.

The question I have is, does the approach Asim and Abhijit put forth create a clearer path for moving towards the above goals?

-- Rob

Asim Praveen

Apr 8, 2017, 4:13:42 PM
to Heikki Linnakangas, Greenplum Developers

On Sat, Apr 8, 2017 at 4:05 AM, Heikki Linnakangas <hlinna...@pivotal.io> wrote:
>
>
> SELECT pg_sleep(10) FROM tableA
> UNION ALL
> SELECT myfunction();
>
> where myfunction() runs a query:
>
> SELECT * FROM <table-being-vacuumed>
>
>
> If vacuum runs after acquiring the snapshot, but before myfunction() runs, you have a problem.

I must admit that we didn't think about the UDF case.  But the stale snapshot does not seem to be a problem.  Note that PR 790 changes vacuum behavior to use local transactions on QEs.  Any tuples in pg_aoseg touched by vacuum will be evaluated against QE local snapshot.  QE acquires a new local snapshot upon receiving the dispatched statement.  If vacuum committed before a snapshot is acquired by QE, changes made by vacuum will be visible through this snapshot.

The problem arises if a QE acquires a local snapshot and vacuum then runs on the QE.  As long as locks are acquired on QD before they are acquired on QE, we should be covered.  In the case of the above query, QD acquires a lock on any table accessed by myfunction() before dispatch.  Here is a stack trace on QD showing where a relation is locked through SPI:

  * #0: LockAcquire + 30 at /Users/apraveen/gpdb/src/backend/storage/lmgr/lock.c:484
    #1: LockRelationOid + 57 at /Users/apraveen/gpdb/src/backend/storage/lmgr/lmgr.c:84
    #2: AcquireExecutorLocks + 346 at /Users/apraveen/gpdb/src/backend/utils/cache/plancache.c:763
    #3: RevalidateCachedPlanWithParams + 341 at /Users/apraveen/gpdb/src/backend/utils/cache/plancache.c:443
    #4: RevalidateCachedPlan + 44 at /Users/apraveen/gpdb/src/backend/utils/cache/plancache.c:656
    #5: _SPI_execute_plan + 358 at /Users/apraveen/gpdb/src/backend/executor/spi.c:1791
    #6: SPI_execute_plan + 267 at /Users/apraveen/gpdb/src/backend/executor/spi.c:415

Asim

Heikki Linnakangas

Apr 10, 2017, 2:43:45 PM
to Asim Praveen, Greenplum Developers
On 04/08/2017 11:13 PM, Asim Praveen wrote:
> On Sat, Apr 8, 2017 at 4:05 AM, Heikki Linnakangas <hlinna...@pivotal.io>
> wrote:
>>
>> SELECT pg_sleep(10) FROM tableA
>> UNION ALL
>> SELECT myfunction();
>>
>> where myfunction() runs a query:
>>
>> SELECT * FROM <table-being-vacuumed>
>>
>> If vacuum runs after acquiring the snapshot, but before myfunction()
> runs, you have a problem.
>
> I must admit that we didn't think about the UDF case. But the stale
> snapshot does not seem to be a problem. Note that PR 790 changes vacuum
> behavior to use local transactions on QEs. Any tuples in pg_aoseg touched
> by vacuum will be evaluated against QE local snapshot. QE acquires a new
> local snapshot upon receiving the dispatched statement. If vacuum
> committed before a snapshot is acquired by QE, changes made by vacuum will
> be visible through this snapshot.

Hmm. Do we have a different problem, then? If the QE backend reads
pg_aoseg using a local snapshot, and the local snapshot says that the
vacuum transaction committed, it will read the new segment. But the new
segment might not contain all the rows that are still visible to the
backend's *distributed* snapshot.

- Heikki

Asim Praveen

Apr 11, 2017, 1:39:44 AM
to Heikki Linnakangas, Greenplum Developers
Yes, that is a different problem from the one being addressed by PR 790.  It arises from the fact that the visimap table is scanned by vacuum using SnapshotNow.  A single visimap tuple records the visibility status of multiple rows.  Under SnapshotNow, vacuum uses only the latest visimap entry to run compaction, but the latest visimap entry may not be visible to an old snapshot.  To fix this, compaction should reconcile / merge the visimaps from all tuples in the pg_aovisimap table declared "live" by HeapTupleSatisfiesVacuum().  We are discussing how to do this and will post a solution soon.
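
To illustrate the direction we have in mind (this is only a sketch: VisimapEntry, visimap_entry_from_tuple() and VisimapMergeKeepVisible() are hypothetical, and buffer locking is omitted for brevity), compaction could merge the live entries roughly like this:

#include "postgres.h"
#include "access/heapam.h"
#include "utils/tqual.h"

/* hypothetical representation of one pg_aovisimap entry and its helpers */
typedef struct VisimapEntry VisimapEntry;
extern VisimapEntry *visimap_entry_from_tuple(HeapTuple tuple);
extern VisimapEntry *VisimapMergeKeepVisible(VisimapEntry *a, VisimapEntry *b);

/*
 * Illustrative sketch, not a worked-out patch: instead of trusting the one
 * visimap entry SnapshotNow returns, look at every pg_aovisimap tuple that
 * some running transaction might still see, and keep a row during compaction
 * if ANY of those entries still considers it visible.
 */
static VisimapEntry *
mergeLiveVisimapEntries(Relation visimapRel, TransactionId oldestXmin)
{
    HeapScanDesc scan;
    HeapTuple    tuple;
    VisimapEntry *merged = NULL;

    /* scan all tuple versions, not only the ones SnapshotNow would return */
    scan = heap_beginscan(visimapRel, SnapshotAny, 0, NULL);
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        /* ignore entries that no running transaction can see anymore */
        if (HeapTupleSatisfiesVacuum(tuple->t_data, oldestXmin,
                                     scan->rs_cbuf) == HEAPTUPLE_DEAD)
            continue;

        /* a row visible according to any live entry must survive compaction */
        merged = VisimapMergeKeepVisible(merged, visimap_entry_from_tuple(tuple));
    }
    heap_endscan(scan);

    return merged;
}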

Note that the problem exists only in cases where locks are acquired *after* the snapshot is taken on QD.  In addition to serializable transactions, one such case is a query involving a UDF, where locks may be acquired while executing statements within the UDF, after a snapshot has already been taken.

Heikki Linnakangas

Apr 11, 2017, 3:49:37 AM
to Asim Praveen, Greenplum Developers
On 04/11/2017 08:39 AM, Asim Praveen wrote:
> Note that the problem exists only in cases where locks are acquired *after*
> snapshot is taken on QD. In addition to serializable transactions, one
> such case happens to be a query involving a UDF, where locks may be
> acquired at the time of executing statements within the UDF when a snapshot
> is already taken.

Even without UDFs, there's always a small window between acquiring the
snapshot and acquiring locks.

- Heikki
