
Re: [BUGS] BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)


Robert Haas

Apr 27, 2015, 10:46:55 AM
On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgr...@ymail.com> wrote:
> I think I see why I was seeing this and nobody else was

Thomas said he reproduced it. No?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-bugs mailing list (pgsql...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Kevin Grittner

Apr 27, 2015, 10:57:20 AM
Robert Haas <rober...@gmail.com> wrote:
> On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgr...@ymail.com> wrote:
>> I think I see why I was seeing this and nobody else was
>
> Thomas said he reproduced it. No?

I should have been more clear about what I meant by "this". Thomas
said he reproduced the immediate errors with Álvaro's patch, but if
he said anything about files in the members subdirectory not going
away with VACUUM followed by CHECKPOINT, regardless of
configuration, I missed it. It turns out that these steps only
"prime the pump" for the files to be deleted on subsequent access
to the members SLRU. That doesn't contribute to database
corruption, but it sure can be confusing for someone trying to
clean things up.

--
Kevin Grittner
EDB: http://www.enterprisedb.com

Alvaro Herrera

Apr 27, 2015, 10:58:56 AM
Robert Haas wrote:
> On Thu, Apr 23, 2015 at 9:59 PM, Alvaro Herrera
> <alvh...@2ndquadrant.com> wrote:
> > Thomas Munro wrote:
> >> That's why I proposed not using xid-like logic, and instead using a
> >> type of three-way comparison that allows you to see when nextOffset
> >> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
> >> that considers half the number-space to be in the past and half in the
> >> future, in my earlier message.
> >
> > Yeah, that bit made sense to me.
>
> In addition to preventing the corruption, I think we also need a
> back-patchable fix for AV to try to keep this situation from happening
> in the first place.

Let me push a patch to fix the corruption, and then we can think of ways
to teach autovacuum about the problem. I'm not optimistic about that,
honestly: like all GUC settings, these are individual to each process,
and there's no way for one process to affect the values that are seen by
other processes (autovac workers). The only idea that comes to mind is
to publish values in shared memory, and autovac workers would read them
from there instead of using normal GUC values.

> What I think we should do is notice when members utilization exceeds
> offset utilization and progressively ramp back the effective value of
> autovacuum_multixact_freeze_max_age (and maybe also
> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
> so that autovacuum (and maybe also manual vacuums) get progressively
> more aggressive about trying to advance relminmxid. Suppose we decide
> that when the "members" space is 75% used, we've got a big problem and
> want to treat autovacuum_multixact_freeze_max_age to effectively be
> zero.

I think we can easily determine the rate of multixact member space
consumption and compare to the rate of multixact ID consumption;
considering the historical multixact size (number of members per
multixact) it would be possible to change the freeze ages by the same
fraction, so that autovac effectively behaves as if the members
consumption rate is what is driving the freezing instead of ID
consumption rate. That way, we don't have to jump suddenly from
"normal" to "emergency" behavior at some fixed threshold.

> This may not be the right proposal in detail, but I think we should do
> something.

No disagreement on that.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Alvaro Herrera

Apr 27, 2015, 11:24:20 AM
Kevin Grittner wrote:
> Robert Haas <rober...@gmail.com> wrote:
> > On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgr...@ymail.com> wrote:
> >> I think I see why I was seeing this and nobody else was
> >
> > Thomas said he reproduced it. No?
>
> I should have been more clear about what I meant by "this". Thomas
> said he reproduced the immediate errors with Álvaro's patch, but if
> he said anything about files in the members subdirectory not going
> away with VACUUM followed by CHECKPOINT, regardless of
> configuration, I missed it. It turns out that these steps only
> "prime the pump" for the files to be deleted on subsequent access
> to the members SLRU. That doesn't contribute to database
> corruption, but it sure can be confusing for someone trying to
> clean things up.

The whole matter of truncating multixact is a longish trip. It starts
when autovacuum completes a round or VACUUM finishes processing a table;
these things call vac_update_datfrozenxid. That routine scans pg_class
and sees whether datfrozenxid or datminmxid can be advanced from their
current points; only if one of them can is vac_truncate_clog called.
That routine calls SetMultiXactIdLimit(), which determines a new
MultiXactState->oldestMultiXactId (saved in shared memory). The
involvement of vacuum stops here; following steps happen at checkpoint.

At checkpoint, oldestMultiXactId is saved to pg_control as part of a
checkpoint (MultiXactGetCheckptMulti); the checkpointed value is passed
back to multixact by MultiXactSetSafeTruncate, which saves it in shmem
as lastCheckpointedOldest. The same checkpoint later calls
TruncateMultiXact which can remove files.

Note that if vac_update_datfrozenxid finds that the pg_database values
cannot be changed (during the vacuum phase), the multixact truncation
point is not advanced and checkpoint has nothing to do. But note that
the clog counter advancing enough will also trigger multixact
truncation.
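The vacuum-to-checkpoint handoff described above can be condensed into a toy model (the struct and function names below are simplified stand-ins, not the real symbols in multixact.c):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the handoff: vacuum advances the truncation point in
 * "shared memory"; a later checkpoint records it and only then may
 * files be removed, based on the checkpointed copy.
 */
typedef struct
{
    uint32_t oldestMultiXactId;      /* advanced by vacuum (SetMultiXactIdLimit) */
    uint32_t lastCheckpointedOldest; /* set when a checkpoint records the value */
} ToyMultiXactState;

static ToyMultiXactState toy_state = {1, 1};

/* Vacuum side: vac_truncate_clog -> SetMultiXactIdLimit. */
static void
toy_set_multixact_limit(uint32_t newOldest)
{
    toy_state.oldestMultiXactId = newOldest;
}

/*
 * Checkpoint side: record the vacuum-determined value, then truncate
 * based on the checkpointed copy, never the live one.
 */
static uint32_t
toy_checkpoint_and_truncate(void)
{
    toy_state.lastCheckpointedOldest = toy_state.oldestMultiXactId;
    return toy_state.lastCheckpointedOldest; /* files before this can go */
}
```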

Robert Haas

Apr 27, 2015, 12:41:01 PM
On Mon, Apr 27, 2015 at 10:59 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Let me push a patch to fix the corruption, and then we can think of ways
> to teach autovacuum about the problem.

Sounds good to me. Are you going to do that today?

> I'm not optimistic about that,
> honestly: like all GUC settings, these are individual to each process,
> and there's no way for one process to affect the values that are seen by
> other processes (autovac workers). The only idea that comes to mind is
> to publish values in shared memory, and autovac workers would read them
> from there instead of using normal GUC values.

I don't think we could store values for the parameters directly in
shared memory, because I think that at least some of those GUCs are
per-session changeable. But we might be able to store weighting
factors in shared memory that get applied to whatever the values in
the current session are. Or else maybe each backend can just
recompute the information for itself when it needs it.

>> What I think we should do is notice when members utilization exceeds
>> offset utilization and progressively ramp back the effective value of
>> autovacuum_multixact_freeze_max_age (and maybe also
>> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
>> so that autovacuum (and maybe also manual vacuums) get progressively
>> more aggressive about trying to advance relminmxid. Suppose we decide
>> that when the "members" space is 75% used, we've got a big problem and
>> want to treat autovacuum_multixact_freeze_max_age to effectively be
>> zero.
>
> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate. That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior at some fixed threshold.

Right. I think that not jumping from normal mode to emergency mode is
quite important, and was trying to describe a system that would
gradually ramp up the pressure rather than a system that would do
nothing for a while and then suddenly go ballistic.

With regard to what you've outlined here, we need to make sure that if
the multixact rate varies widely, we still clean up before we hit
autovac wraparound. That's why I think it should be driven off of the
fraction of the available address space which is currently consumed,
not some kind of short term measure of mxact size or generation rate.
I'm not sure exactly what you have in mind here.
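A fraction-of-address-space measure of the kind suggested here is cheap to compute in a wrap-aware way (an illustrative helper, not an API from any patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative: fraction of the 2^32 member address space in use,
 * computed from the oldest surviving offset and the next offset to
 * assign.  Unsigned subtraction handles wraparound for free.
 */
static double
member_space_fraction_used(uint32_t oldest_offset, uint32_t next_offset)
{
    uint32_t used = next_offset - oldest_offset; /* wraps modulo 2^32 */

    return (double) used / 4294967296.0;
}
```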

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

David Gould

Apr 27, 2015, 4:12:10 PM
On Mon, 27 Apr 2015 11:59:10 -0300
Alvaro Herrera <alvh...@2ndquadrant.com> wrote:

> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate. That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior at some fixed threshold.

I would like to add a data point: one of my clients has a plpgsql function
that manages to use ten to thirty thousand multixact IDs per invocation. It
interacts with a remote resource and sets an exception handler on a per-item
basis to catch errors on the remote call.

-dg


--
David Gould 510 282 0869 da...@sonic.net
If simplicity worked, the world would be overrun with insects.

Alvaro Herrera

Apr 28, 2015, 2:23:34 AM
Thomas Munro wrote:

> One thing I noticed about your patch is that it effectively halves the
> amount of multixact members you can have on disk. Sure, I'd rather
> hit an error at 2^31 members than a corrupt database at 2^32 members,
> but I wondered if we should try to allow the full range to be used.

Ah, yeah, we do want the full range; that's already built in the code
elsewhere.

In this version, I used your WouldWrap function, but there was a bug in
your formulation of the call site: after the WARNING has been issued
once, it is never issued again for that wraparound cycle, because the
second time around the nextOffset has already crossed the boundary and
your routine returns false. IMO this is wrong and the warning should be
issued every time. To fix that problem I removed the offsetWarnLimit
altogether, and instead do WouldWrap() of the value against
offsetStopLimit minus the 20 segments. That way, the warning is issued
continuously until the offsetStopLimit is reached (once there,
obviously, only the error is thrown, not the warning, which is correct.)
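For reference, the wrap-aware test being discussed can be written as a small standalone function (modeled on the WouldWrap routine mentioned above; treat it as a sketch, not the committed code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Does advancing "distance" members from "start" reach or cross
 * "boundary", in a 32-bit offset space that wraps around?  This is the
 * three-way comparison idea from the thread, not the committed code.
 */
static bool
offset_would_wrap(uint32_t boundary, uint32_t start, uint32_t distance)
{
    uint32_t finish = start + distance; /* wraps modulo 2^32 */

    if (start < boundary)
        return finish >= boundary || finish < start;
    else
        return finish >= boundary && finish < start;
}
```

Issuing the WARNING whenever this returns true against a boundary of offsetStopLimit minus 20 segments gives the continuous warning behavior described above.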

I also added a call to DetermineSafeOldestOffset() in TrimMultiXact:
as far as I can tell, this is necessary for the time when a standby
exits recovery, because when InRecovery we return early from
DetermineSafeOldestOffset() so the safe point would never get set.
memberswrap-2.patch

Thomas Munro

Apr 28, 2015, 2:30:43 AM
On Tue, Apr 28, 2015 at 6:23 PM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> One thing I noticed about your patch is that it effectively halves the
>> amount of multixact members you can have on disk. Sure, I'd rather
>> hit an error at 2^31 members than a corrupt database at 2^32 members,
>> but I wondered if we should try to allow the full range to be used.
>
> Ah, yeah, we do want the full range; that's already built in the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false. IMO this is wrong and the warning should be
> issued every time. To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments. That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)

+1

Tomorrow I will send a separate patch for the autovacuum changes that
I sent earlier. Let's discuss and hopefully eventually commit that
separately.

--
Thomas Munro
http://www.enterprisedb.com

Robert Haas

Apr 28, 2015, 10:33:24 AM
On Tue, Apr 28, 2015 at 2:23 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Ah, yeah, we do want the full range; that's already built in the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false. IMO this is wrong and the warning should be
> issued every time. To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments. That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)
>
> I also added a call to DetermineSafeOldestOffset() in TrimMultiXact:
> as far as I can tell, this is necessary for the time when a standby
> exits recovery, because when InRecovery we return early from
> DetermineSafeOldestOffset() so the safe point would never get set.

Putting the period inside the parentheses here looks weird?

+ "This command would create a multixact with %u members, which exceeds remaining space (%u members.)",

Maybe rephrase as: "This command would create a multixact with %u
members, but the remaining space is only enough for %u members."

I don't think this should have a comma:

+ errhint("Execute a database-wide VACUUM in that
database, with reduced vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.")));

This looks like excess brace-ification:

+ if (start < boundary)
+ {
+ return finish >= boundary || finish < start;
+ }
+ else
+ {
+ return finish >= boundary && finish < start;
+ }

I think this is confusing:


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas

Apr 28, 2015, 10:34:51 AM
On Tue, Apr 28, 2015 at 10:33 AM, Robert Haas <rober...@gmail.com> wrote:
> I think this is confusing:

Oops, hit send too soon.

+/*
+ * Read the offset of the first member of the given multixact.
+ */

This is confusing to me because the two subdirectories of pg_multixact
are called "members" and "offsets". Here you are talking about the
offset of the first member. Maybe I'm just slow, but that seems like
conflating terminology. You end up with a function called
read_offset_for_multi() that is actually looking up information about
members. Ick.

Alvaro Herrera

Apr 28, 2015, 10:34:59 AM
Thomas Munro wrote:

> > In this version, I used your WouldWrap function, [...]
>
> +1

Pushed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

Apr 28, 2015, 10:55:58 AM

I sure wish this had arrived two minutes earlier ...

Robert Haas wrote:

> Putting the period inside the parentheses here looks weird?
>
> + "This command would create a multixact with %u members, which exceeds remaining space (%u members.)",
>
> Maybe rephrase as: "This command would create a multixact with %u
> members, but the remaining space is only enough for %u members."

WFM, will change.

> I don't think this should have a comma:
>
> + errhint("Execute a database-wide VACUUM in that
> database, with reduced vacuum_multixact_freeze_min_age and
> vacuum_multixact_freeze_table_age settings.")));

Ditto.

> This looks like excess brace-ification:
>
> + if (start < boundary)
> + {
> + return finish >= boundary || finish < start;
> + }
> + else
> + {
> + return finish >= boundary && finish < start;
> + }

Yeah, agreed. Will undo that change. (I disliked the comment above the
indented single-statement, so added braces, but then moved the comment.
I should have removed the braces at that point.)

> I think this is confusing:
>
> +/*
> + * Read the offset of the first member of the given multixact.
> + */
>
> This is confusing to me because the two subdirectories of pg_multixact
> are called "members" and "offsets". Here you are talking about the
> offset of the first member. Maybe I'm just slow, but that seems like
> conflating terminology. You end up with a function called
> read_offset_for_multi() that is actually looking up information about
> members. Ick.

Yeah, I introduced the confusing terminology while inventing multixacts
initially and have regretted it many times. I will think about a better
name for this. (Meanwhile, on IM Robert suggested
find_start_of_first_multi_member)

Robert Haas

Apr 28, 2015, 12:05:10 PM
On Tue, Apr 28, 2015 at 10:56 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> I sure wish this had arrived two minutes earlier ...

Sorry about that. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Alvaro Herrera

Apr 28, 2015, 1:53:53 PM
Alvaro Herrera wrote:

> > I think this is confusing:
> >
> > +/*
> > + * Read the offset of the first member of the given multixact.
> > + */
> >
> > This is confusing to me because the two subdirectories of pg_multixact
> > are called "members" and "offsets". Here you are talking about the
> > offset of the first member. Maybe I'm just slow, but that seems like
> > conflating terminology. You end up with a function called
> > read_offset_for_multi() that is actually looking up information about
> > members. Ick.
>
> Yeah, I introduced the confusing terminology while inventing multixacts
> initially and have regretted it many times. I will think about a better
> name for this. (Meanwhile, on IM Robert suggested
> find_start_of_first_multi_member)

Pushed. I chose find_multixact_start() as a name for this function.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Jeff Janes

Apr 28, 2015, 2:14:02 PM
On Tue, Apr 28, 2015 at 10:54 AM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
Alvaro Herrera wrote:

> > I think this is confusing:
> >
> > +/*
> > + * Read the offset of the first member of the given multixact.
> > + */
> >
> > This is confusing to me because the two subdirectories of pg_multixact
> > are called "members" and "offsets".  Here you are talking about the
> > offset of the first member.  Maybe I'm just slow, but that seems like
> > conflating terminology.  You end up with a function called
> > read_offset_for_multi() that is actually looking up information about
> > members.  Ick.
>
> Yeah, I introduced the confusing terminology while inventing multixacts
> initially and have regretted it many times.  I will think about a better
> name for this.  (Meanwhile, on IM Robert suggested
> find_start_of_first_multi_member)

Pushed.  I chose find_multixact_start() as a name for this function.


Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this just-described commit, I can no longer upgrade from a 9.2.10 database using pg_upgrade.

I can reproduce it from a clean 9.2 install which has never even been started up.

Deleting files from new pg_multixact/offsets                ok
Setting oldest multixact ID on new cluster                  ok
Resetting WAL archives                                      ok

*failure*
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

The last few lines are:

command: "../bisect/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "../data2/" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off  -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/jjanes/pgsql/git'" start >> "pg_upgrade_server.log" 2>&1
waiting for server to start....LOG:  database system was shut down at 2015-04-28 11:08:18 PDT
FATAL:  could not access status of transaction 1
DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.
LOG:  startup process (PID 3977) exited with exit code 1
LOG:  aborting startup due to startup process failure


Cheers,

Jeff

Alvaro Herrera

Apr 28, 2015, 2:52:32 PM
Jeff Janes wrote:

> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

How annoying, thanks for the report. I reproduced it here. The problem
is that the upgrade process removes the files from pg_multixact/offsets,
which is what we now want to read on startup. Not yet sure how to fix
it.

Alvaro Herrera

Apr 28, 2015, 7:12:41 PM
Jeff Janes wrote:
> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

Here's a patch, but I don't like it too much. Will think more about it,
probably going to push something tomorrow.
memberswrap-3.patch

Thomas Munro

Apr 29, 2015, 2:10:39 AM
On Tue, Apr 21, 2015 at 5:12 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> On Tue, Apr 21, 2015 at 12:34 AM, Alvaro Herrera <alvh...@2ndquadrant.com>
> wrote:
>>
>> Alvaro Herrera wrote:
>>
>> > The fix is to raise an ERROR when generating a new multixact, if we
>> > detect that doing so would get close to the oldest multixact that the
>> > system knows about. If that happens, the solution is to vacuum so that
>> > the "oldest" point is advanced a bit more and you have room to generate
>> > more multixacts. In production, you would typically adjust the
>> > multixact freeze parameters so that "oldest multixact" is advanced more
>> > aggressively and you don't hit the ERROR.
>>
>> Here's a patch. I have tested locally and it closes the issue for me.
>> If those affected can confirm that it stops the file removal from
>> happening, I'd appreciate it.
>>
>
> 1. Do you think it makes sense to give warning in SetMultiXactIdLimit()
> if we have already reached offsetWarnLimit as we give for multiWarnLimit?

Amit and I discussed this offline. Yes, we could include a warning
message here, for consistency with the warnings you get about xid
wraparound. Concretely I think it means that you would also get
warnings about being near the member space limit from vacuums,
rather than just from attempts to allocate new multixact IDs. The
test to detect an impending members-would-wrap ERROR would be similar
to what we do in GetNewMultiXactId, so something like:

MultiXactOffsetWouldWrap(offsetStopLimit,
nextOffset,
MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT *
OFFSET_WARN_SEGMENTS)

I'm not sure whether it's worth writing an extra patch for this
though, because if you're in this situation, your logs are already
overflowing with warnings from the regular backends that are
generating multixacts. Thoughts anyone?
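For a sense of scale, the warning window in that expression works out to roughly a million members. The constants below are assumptions based on an 8 kB block size (1636 members per SLRU page, 32 pages per segment, 20 warning segments); check the real definitions in multixact.c and slru.h before relying on them:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against multixact.c / slru.h. */
#define MULTIXACT_MEMBERS_PER_PAGE 1636
#define SLRU_PAGES_PER_SEGMENT     32
#define OFFSET_WARN_SEGMENTS       20

/* Number of member slots inside the warning window before the stop limit. */
static uint32_t
warn_window_members(void)
{
    return (uint32_t) MULTIXACT_MEMBERS_PER_PAGE
         * SLRU_PAGES_PER_SEGMENT
         * OFFSET_WARN_SEGMENTS;
}
```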

--
Thomas Munro
http://www.enterprisedb.com


Amit Kapila

Apr 29, 2015, 7:42:04 AM
On Wed, Apr 29, 2015 at 5:44 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Tue, Apr 28, 2015 at 6:30 PM, Thomas Munro
> <thomas...@enterprisedb.com> wrote:
> > Tomorrow I will send a separate patch for the autovacuum changes that
> > I sent earlier.  Let's discuss and hopefully eventually commit that
> > separately.
>
> Here is a separate patch which makes autovacuum start a wrap-around
> vacuum sooner if the member space is running out, by adjusting
> autovacuum_multixact_freeze_max_age using a progressive scaling
> factor.  This version includes a clearer implementation of
> autovacuum_multixact_freeze_max_age_adjusted provided by Kevin
> Grittner off-list.
>

Some comments:

1. It seems that you are using autovacuum_multixact_freeze_max_age_adjusted()
in only a couple of places; for example, it is not used in the calculation below:

vacuum_set_xid_limits()
{
..
mxid_freezemin = Min(mxid_freezemin,
autovacuum_multixact_freeze_max_age / 2); 
..
}

What is the reason for using the adjusted value in some places and
not in others?

2.
@@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid,
  : autovacuum_freeze_max_age;
 
  multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age >= 0)
- ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
- : autovacuum_multixact_freeze_max_age;
+ ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age_adjusted())
+ : autovacuum_multixact_freeze_max_age_adjusted();


It seems that this will read from the offsets file for each
relation, which might or might not be good; shall we try to
cache oldestMultiXactMemberOffset?

3. Currently there is a minimum limit on autovacuum_multixact_freeze_max_age
(10000000) which might not be honored by this calculation, so I am not sure
whether that could impact system performance in cases that currently work sanely.

4. Can you please share results showing the improvement with the
current patch versus unpatched master?

5.
+ /*
+  * TODO: In future, could oldestMultiXactMemberOffset be stored in shmem,
+  * pg_controdata, alongside oldestMultiXactId?
+  */

You might want to write the comment as:
XXX: We can store oldestMultiXactMemberOffset in shmem, pg_controldata
alongside oldestMultiXactId?

6.
+ * Returns vacuum_multixact_freeze_max_age, adjusted down to prevent
+ * excessive use of addressable multixact member space if required.

I think here you mean autovacuum_multixact_freeze_max_age?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Jeff Janes

Apr 29, 2015, 1:07:34 PM
On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
Jeff Janes wrote:
> Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

Here's a patch, but I don't like it too much.  Will think more about it,
probably going to push something tomorrow.

It looks like that patch is targeted to 9.4 branch.  I couldn't readily get it to apply on HEAD.  I tested it on 9.4, and it solved the problem there.

Thanks,

Jeff

Amit Kapila

Apr 29, 2015, 11:53:16 PM
On Tue, Apr 28, 2015 at 11:24 PM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:

>
> Alvaro Herrera wrote:
>
>
> Pushed.  I chose find_multixact_start() as a name for this function.
>

I have done a test to ensure that the latest change fixes the
reported problem; below are the results. To me it looks like the
reported problem is fixed.

I used the test program (explode_mxact_members) developed by Thomas
to reproduce the problem, with one transaction left open in a session.
After running the test for 3~4 hours with the parameters
explode_mxact_members 500 35000, I could see warning messages like the
ones below (before the fix there were no such messages; the test
completed but corrupted the database):

WARNING:  database with OID 1 must be vacuumed before 358 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 310 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 261 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 211 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 160 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
explode_mxact_members: explode_mxact_members.c:38: main: Assertion `PQresultStatus(res) == PGRES_TUPLES_OK' 
failed.
 
After this I set vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age to zero and then performed
VACUUM FREEZE on template1 and postgres, followed by a
manual CHECKPOINT.  I then saw the following values in pg_database:

postgres=# select oid,datname,datminmxid from pg_database;
  oid  |  datname  | datminmxid 
-------+-----------+------------
     1 | template1 |   17111262
 13369 | template0 |   17111262
 13374 | postgres  |   17111262
(3 rows)

Again I started the test as ./explode_mxact_members 500 35000, but it
immediately failed:
500 sessions connected...
Loop 0...
WARNING:  database with OID 13369 must be vacuumed before 12 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 13369 must be vacuumed before 11 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 13369 must be vacuumed before 9 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
explode_mxact_members: explode_mxact_members.c:38: main: Assertion `PQresultStatus(res) == PGRES_TUPLES_OK' 
failed.

It was confusing to me why it failed again even though I had run
VACUUM FREEZE and CHECKPOINT, but then I waited for a minute or two
and ran VACUUM FREEZE with the command below:
./vacuumdb -a -F
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

Here I have verified that all files except one were deleted.

After that, when I restarted the test, it went perfectly fine and never
led to any warning messages, probably because the values for
vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age
were zero.

I am still not sure why it took some time to clean the members directory
and resume the test after running Vacuum Freeze and Checkpoint.

Robert Haas

Apr 30, 2015, 9:08:44 AM
On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
>> just-described commit, I can no longer upgrade from a 9.2.10 database using
>> pg_upgrade.
>
> Here's a patch, but I don't like it too much. Will think more about it,
> probably going to push something tomorrow.

What don't you like about it? We should get something committed here;
it's not good for the back-branches to be in a state where pg_upgrade
will break.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Alvaro Herrera

Apr 30, 2015, 12:50:46 PM
Robert Haas wrote:
> On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera
> <alvh...@2ndquadrant.com> wrote:
> > Jeff Janes wrote:
> >> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
> >> just-described commit, I can no longer upgrade from a 9.2.10 database using
> >> pg_upgrade.
> >
> > Here's a patch, but I don't like it too much. Will think more about it,
> > probably going to push something tomorrow.
>
> What don't you like about it? We should get something committed here;
> it's not good for the back-branches to be in a state where pg_upgrade
> will break.

Yeah, I managed to find a real fix which I will push shortly.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

Apr 30, 2015, 1:03:12 PM
Jeff Janes wrote:
> On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvh...@2ndquadrant.com>
> wrote:
>
> > Jeff Janes wrote:
> > > Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to
> > > this just-described commit, I can no longer upgrade from a 9.2.10
> > > database using pg_upgrade.
> >
> > Here's a patch, but I don't like it too much. Will think more about it,
> > probably going to push something tomorrow.
>
> It looks like that patch is targeted to 9.4 branch. I couldn't readily get
> it to apply on HEAD. I tested it on 9.4, and it solved the problem there.

Yeah, I wrote it in 9.3. However, it was wrong; or at least there's a
better way to formulate it, and the new formulation applies without
conflict from 9.3 to master. So I pushed that instead.

Thanks!

Robert Haas

May 1, 2015, 3:08:53 PM
On Fri, May 1, 2015 at 6:51 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Those other places are for capping the effective table and tuple
> multixact freeze ages for manual vacuums, so that manual vacuums (say
> in nightly cronjobs) get a chance to run wraparound scans before
> autovacuum kicks in at a less convenient time. So, yeah, I think we
> want to incorporate member wraparound prevention into that logic, and
> I will add that in the next version of the patch.

+1. On a quick read-through of the patch, the biggest thing that
jumped out at me was that it only touches the autovacuum logic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Amit Kapila

May 2, 2015, 7:22:42 AM
On Thu, Apr 30, 2015 at 10:47 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:

>
> On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.k...@gmail.com> wrote:
>
> > 3. currently there is some minimum limit of autovacuum_multixact_freeze_age
> > (10000000)
> > which might not be honored by this calculation, so not sure if that can
> > impact the
> > system performance in some cases where it is currently working sane.
>
> The reason why we need to be able to set the effective freeze age
> below that minimum in cases of high member data consumption rates is
> that you could hit the new member space wraparound prevention error
> before you consume anywhere near that many multixact IDs.  That
> minimum may well be entirely reasonable if the only thing you're
> worried about is multixact ID wraparound prevention.
>
> For example, my test program eats an average of 250 members per
> multixact ID when run with 500 sessions (each loop creates 500
> multixact IDs having 1, 2, 3, ..., 500 members).  At that rate, you'll
> run out of addressable member space after 2^32 / 250 = 17,179,869
> multixact IDs.  To prevent an error condition using only the existing
> multixact ID wraparound prevention machinery, we need to have an
> effective max table age (so that autovacuum wakes up and scans all
> tables) and min freeze age (so that it actually freezes the tuples)
> below that number.  So we have to ignore the GUC minimum in this
> situation.
>

I understand that point, but I mentioned it because, if there is some
specific reason for keeping the current minimum value, we should verify
that we have not broken it by not honouring the GUC's minimum.  As far
as I can see from the code, there is one place (see below) where that
value is used to calculate the warning limit for multixacts, and the
current patch doesn't seem to have any impact on it.

SetMultiXactIdLimit()
{
..
multiWarnLimit = multiStopLimit - 10000000;
}


> ...
>
> Observations:
>
> 1.  Sometimes the values don't change from minute to minute,
> presumably because there hasn't been a checkpoint to update
> pg_controldata on disk, but hopefully we can still see what's going on
> here despite the slight lag in the data.
>

Yeah, and I think this means that there will be no advancement of the
oldest multixact ID and no deletion of files until the next checkpoint
(e.g. when checkpoints happen only on a timeout).  I think there is no
harm in stating this in the documentation if it is not already there.

> 2.  We get to somewhere in the 73-75% SLRU used range before
> wraparound vacuums are triggered.  We probably need to spread things
> out more than that.
>
> 3.  When the autovacuum runs, it advances oldest_mxid by different
> amounts each time; that's because I'm using the adjusted freeze max
> age (the max age of a table before it gets a wraparound vacuum) as our
> freeze min age (the max age for individual tuples before they're
> frozen) here:
>
> @@ -1931,7 +1964,9 @@ do_autovacuum(void)
>   {
>   default_freeze_min_age = vacuum_freeze_min_age;
>   default_freeze_table_age = vacuum_freeze_table_age;
> - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
> + default_multixact_freeze_min_age =
> + Min(vacuum_multixact_freeze_min_age,
> + autovacuum_multixact_freeze_max_age_adjusted());
>   default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
>   }
>
> Without that change, autovacuum would trigger repeatedly as we got
> near 75% SLRU usage but not freeze anything, because
> default_multixact_freeze_min_age was higher than the age of any tuples
> (which had only made it to an age of around ~12 million; actually it's
> not exactly the tuple age per se... I don't fully understand the
> treatment of locker and updater multixact IDs in the vacuum code,
> HeapTupleSatisfiesVacuum and heap_freeze_tuple etc yet so I'm not sure
> exactly how that value translates into vacuum work, but I can see
> experimentally that a low multixact freeze min age is needed to get
> relminxmid moved forward).
>
> It's good that freeze table age ramps down so that the autovacuum
> launcher trigger point jumps around a bit and we spread the autovacuum
> launches over time, but it's not great that we finish up truncating
> different amounts of multixacts and associated SLRU each time.  We
> could instead use a freeze min age of 0 to force freezing of *all*
> tuples if this is a member-space-wraparound-prevention vacuum (that
> is, if autovacuum_multixact_freeze_max_age !=
> autovacuum_multixact_freeze_max_age_adjusted()).

We already cap vacuum_multixact_freeze_min_age at half of
autovacuum_multixact_freeze_max_age so that autovacuums to
prevent MultiXact wraparound won't occur too frequently, as per the
code below:

vacuum_set_xid_limits()
{
..
    mxid_freezemin = Min(mxid_freezemin,
                         autovacuum_multixact_freeze_max_age / 2);
    Assert(mxid_freezemin >= 0);
..
}

Now if we set it to zero, then I think it might lead to excessive
freezing and in turn more I/O without an actual need (i.e., without
needing more space for multixact members).

>
> There is less to say about the results with an unpatched server: it
> drives in a straight line for a while, and then crashes into a wall
> (ie the new error preventing member wraparound), which I see you have
> also reproduced.  It's used up all of the circular member space, but
> only has around 17 million multixacts so autovacuum can't help you
> (it's not even possible to set autovacuum_multixact_freeze_max_age
> below 100 million), so to get things moving again you need to manually
> VACUUM FREEZE all databases including template databases.
>

In my tests, on setting the vacuum multixact parameters
(vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
to zero, the test finished successfully (no warnings, and I could
see truncation of files in the members directory), so one might argue
that in many cases one could reclaim member space just by setting
appropriate values for the vacuum_multixact_* parameters.  However, I feel
it is better to have an auto-adjustment algorithm like the one this patch
implements, so that even if those values are not set appropriately, it can
avoid the wraparound error.  I think the only thing we need to be
cautious about is that the new calculation should not make things worse
(less aggressive) in the case of lower vacuum_multixact_* settings.

Amit Kapila

May 3, 2015, 12:40:49 AM
On Sat, May 2, 2015 at 11:46 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Sat, May 2, 2015 at 7:08 AM, Robert Haas <rober...@gmail.com> wrote:

> > On Fri, May 1, 2015 at 6:51 AM, Thomas Munro
> > <thomas...@enterprisedb.com> wrote:
> >> Those other places are for capping the effective table and tuple
> >> multixact freeze ages for manual vacuums, so that manual vacuums (say
> >> in nightly cronjobs) get a chance to run wraparound scans before
> >> autovacuum kicks in at a less convenient time.  So, yeah, I think we
> >> want to incorporate member wraparound prevention into that logic, and
> >> I will add that in the next version of the patch.
> >
> > +1.  On a quick read-through of the patch, the biggest thing that
> > jumped out at me was that it only touches the autovacuum logic.
>
>
> Also attached is the output of the monitor.sh script posted upthread,
> while running explode_mxact_members.c.  It looks better than the last
> results to me: whenever usage reaches 50%, autovacuum advances things
> such that usage drops right back to 0% (because it now uses
> multixact_freeze_min_age = 0) , and the system will happily chug on
> forever.  What this test doesn't really show adequately is that if you
> had a lot of different tables and databases with different relminmxid
> values, they'd be vacuumed at different times.  I should probably come
> up with a way to demonstrate that...
>

Regarding the data, I have extracted the parts where there is a change in
oldest_mxid and segments:

time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset

13:48:36 1 0 16 1 1 0
13:49:36 369 .0044 94752 1 1 0
..
14:44:04 41703 .5083 10713400 1 8528909 2140755909

14:45:05 1374 .0167 352960 8573819 8722521 2189352521
..
15:37:16 41001 .4997 10529528 8573819 17060811 4282263311
..
15:38:16 709 .0086 182056 17132168 17254423 35892627
..
16:57:15 41440 .5051 10644712 17132168 25592713 2128803417
..
16:58:16 1120 .0136 287416 25695507 25786824 2177525278

Based on this data, it seems that truncation of member space,
as well as advancement of the oldest multixact ID, happens once
usage reaches 50%, at which point the segment count drops to almost
zero.  This repeats roughly every hour, with no progress in between,
which indicates that all the work happens in one go rather than being
spread out.  Won't the resulting I/O choke the system when this
happens?  Isn't it better to design it such that the work is spread
over a period of time rather than done all at once?

-- 
+int
+compute_max_multixact_age_to_avoid_member_wrap(bool manual)
{
..
+ if (members <= safe_member_count)
+ {
+     /*
+      * There is no danger of member wrap, so return a number that is not
+      * lower than autovacuum_multixact_freeze_max_age.
+      */
+     return -1;
+ }
..

The above code doesn't seem to match its comment.
The comment says "..not lower than autovacuum_multixact_freeze_max_age",
but the code then returns -1.  It seems to me we should return
autovacuum_multixact_freeze_max_age unchanged, as it was coded in the
initial version of the patch.  Do you have any specific reason to change it?

Amit Kapila

May 4, 2015, 2:26:11 AM
On Mon, May 4, 2015 at 5:19 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:

>
> On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> > --
> > +int
> > +compute_max_multixact_age_to_avoid_member_wrap(bool manual)
> > {
> > ..
> > + if (members <= safe_member_count)
> > + {
> > + /*
> > + * There is no danger of member wrap, so return a number that is not
> > + * lower than autovacuum_multixact_freeze_max_age.
> > + */
> > + return -1;
> > + }
> > ..
> >
> > The above code doesn't seem to match its comment.
> > The comment says "..not lower than autovacuum_multixact_freeze_max_age",
> > but the code then returns -1.  It seems to me we should return
> > autovacuum_multixact_freeze_max_age unchanged, as it was coded in the
> > initial version of the patch.  Do you have any specific reason to change it?
>
> Oops, the comment is fixed in the attached patch.
>
> In an earlier version, I was only dealing with the autovacuum case.
> Now that the VACUUM command also calls it, I didn't want this
> compute_max_multixact_age_to_avoid_member_wrap function to assume that
> it was being called by autovacuum code and return the
> autovacuum-specific GUC in the case that no special action is needed.
> Also, the function no longer computes a value by scaling
> autovacuum_multixact_freeze_max_age, it now scales the current number
> of active multixacts, so that we can begin selecting a small non-zero
> number of tables to vacuum as soon as we exceed safe_member_count as
> described above 

I am slightly worried that if, for scaling, we don't consider the
multixact_*_age values configured by the user, VACUUM/autovacuum might
behave totally differently from what the user expects.  Basically,
behavior will be dominated by member space usage and will ignore
the values the user set for the multixact_*_age parameters.  One way
to handle this could be to use the minimum of the value calculated from
member space and the value the user specified for the multixact-related
parameters, as suggested in points 1 and 2 (below in this mail).

One more thing: I think the current calculation considers member
usage; shouldn't we try to consider offset usage as well?


> (whereas when we used a scaled down
> autovaccum_multixact_freeze_max_age, we usually didn't select any
> tables at all until we scaled it down a lot, ie until we got close to
> dangerous_member_count).  Finally, I wanted a special value like -1
> for 'none' so that table_recheck_autovac and ExecVacuum could use a
> simple test >= 0 to know that they also need to set
> multixact_freeze_min_age to zero in the case of a
> member-space-triggered vacuum, so that we get maximum benefit from our
> table scans by freezing all relevant tuples, not just some older ones
>

I think setting multixact_freeze_min_age to zero could be too aggressive
for I/O.  Yes, with this you can get maximum benefit, but at the cost of
increased I/O.  How would you justify setting it to zero as appropriate
w.r.t. the increased I/O?

Few more observations:

1.
@@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid,
  ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
  : autovacuum_multixact_freeze_max_age;

+ /* Special settings if we are running out of member address space. */
+ if (max_multixact_age_to_avoid_member_wrap >= 0)
+     multixact_freeze_max_age = max_multixact_age_to_avoid_member_wrap;
+

Isn't it better to use the minimum of the already-computed
multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?

multixact_freeze_max_age = Min(multixact_freeze_max_age, max_multixact_age_to_avoid_member_wrap);

A similar change needs to be made in table_recheck_autovac().

2.
@@ -1118,7 +1197,12 @@ do_start_worker(void)
 
  /* Also determine the oldest datminmxid we will consider. */
  recentMulti = ReadNextMultiXactId();
- multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
+ max_multixact_age_to_avoid_member_wrap =
+ compute_max_multixact_age_to_avoid_member_wrap(false);
+ if (max_multixact_age_to_avoid_member_wrap >= 0)
+ multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap;
+ else
+ multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;

Here also, isn't it better to use the minimum of
autovacuum_multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?

3. 
+int
+compute_max_multixact_age_to_avoid_member_wrap(bool manual)
+{
+ MultiXactOffset members;
+ uint32 multixacts;
+ double fraction;
+ MultiXactOffset safe_member_count = MaxMultiXactOffset / 2;

It is not completely clear which value is more appropriate
for safe_member_count (25% or 50%).  Does anybody else have an
opinion on this value?

4. Once we settle on the final algorithm, we should update the
docs as well, probably in the description at the link below:

Thomas Munro

May 4, 2015, 3:12:49 AM
On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> [...]
> One more thing, I think the current calculation considers members
> usage, shouldn't we try to consider offset usage as well?

Offsets are indexed by multixact ID:

#define MultiXactIdToOffsetPage(xid) \
((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) \
((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)

The existing multixact wraparound prevention code is already managing
the 32 bit multixact ID space. The problem with members comes about
because each one of those multixact IDs can have arbitrary numbers of
members, and yet the members are also addressed with a 32 bit index.
So we are trying to hijack the multixact ID wraparound prevention and
make it more aggressive if member space appears to be running out.
(Perhaps in future there should be a 64 bit index for member indexes
so that this problem disappears?)

>> (whereas when we used a scaled down
>> autovaccum_multixact_freeze_max_age, we usually didn't select any
>> tables at all until we scaled it down a lot, ie until we got close to
>> dangerous_member_count). Finally, I wanted a special value like -1
>> for 'none' so that table_recheck_autovac and ExecVacuum could use a
>> simple test >= 0 to know that they also need to set
>> multixact_freeze_min_age to zero in the case of a
>> member-space-triggered vacuum, so that we get maximum benefit from our
>> table scans by freezing all relevant tuples, not just some older ones
>>
>
> I think setting multixact_freeze_min_age to zero could be too aggressive
> for I/O. Yes, with this you can get maximum benefit, but at the cost of
> increased I/O. How would you justify setting it to zero as appropriate
> w.r.t. the increased I/O?

I assumed that if you were already vacuuming all your tables to avoid
running out of member space, you would want to freeze any tuples you
possibly could to defer the next wraparound scan for as long as
possible, since wraparound scans are enormously expensive.

> Few more observations:
>
> 1.
> @@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid,
> ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
> : autovacuum_multixact_freeze_max_age;
>
> + /* Special settings if we are running out of member address space. */
> + if (max_multixact_age_to_avoid_member_wrap >= 0)
> +     multixact_freeze_max_age = max_multixact_age_to_avoid_member_wrap;
> +
>
> Isn't it better to use the minimum of the already-computed
> multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?
>
> multixact_freeze_max_age = Min(multixact_freeze_max_age,
> max_multixact_age_to_avoid_member_wrap);

Except that I am using -1 as a special value. But you're right, I
guess it should be like this:

if (max_multixact_age_to_avoid_member_wrap >= 0)
multixact_freeze_max_age = Min(multixact_freeze_max_age,
max_multixact_age_to_avoid_member_wrap);

> Similar change needs to be done in table_recheck_autovac()
>
> 2.
> @@ -1118,7 +1197,12 @@ do_start_worker(void)
>
> /* Also determine the oldest datminmxid we will consider. */
> recentMulti = ReadNextMultiXactId();
> - multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
> + max_multixact_age_to_avoid_member_wrap =
> + compute_max_multixact_age_to_avoid_member_wrap(false);
> + if (max_multixact_age_to_avoid_member_wrap >= 0)
> + multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap;
> + else
> + multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
>
> Here also, isn't it better to use the minimum of
> autovacuum_multixact_freeze_max_age
> and max_multixact_age_to_avoid_member_wrap?

Yeah, with the same proviso about -1.

> 3.
> +int
> +compute_max_multixact_age_to_avoid_member_wrap(bool manual)
> +{
> + MultiXactOffset members;
> + uint32 multixacts;
> + double fraction;
> + MultiXactOffset safe_member_count = MaxMultiXactOffset / 2;
>
> It is not completely clear what is more appropriate value
> for safe_member_count (25% or 50%). Anybody else have any
> opinion on this value?
>
> 4. Once we conclude on final algorithm, we should update the
> same in docs as well, probably in description at below link:
> http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND

Agreed.

--
Thomas Munro
http://www.enterprisedb.com


Amit Kapila

May 4, 2015, 7:49:49 AM
On Mon, May 4, 2015 at 12:42 PM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> > [...]
> > One more thing, I think the current calculation considers members
> > usage, shouldn't we try to consider offset usage as well?
>
> Offsets are indexed by multixact ID:
>
> #define MultiXactIdToOffsetPage(xid) \
>         ((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
> #define MultiXactIdToOffsetEntry(xid) \
>         ((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
>
> The existing multixact wraparound prevention code is already managing
> the 32 bit multixact ID space.  The problem with members comes about
> because each one of those multixact IDs can have arbitrary numbers of
> members, and yet the members are also addressed with a 32 bit index.
> So we are trying to hijack the multixact ID wraparound prevention and
> make it more aggressive if member space appears to be running out.
> (Perhaps in future there should be a 64 bit index for member indexes
> so that this problem disappears?)
>

Okay, that makes sense.

> >> (whereas when we used a scaled down
> >> autovaccum_multixact_freeze_max_age, we usually didn't select any
> >> tables at all until we scaled it down a lot, ie until we got close to
> >> dangerous_member_count).  Finally, I wanted a special value like -1
> >> for 'none' so that table_recheck_autovac and ExecVacuum could use a
> >> simple test >= 0 to know that they also need to set
> >> multixact_freeze_min_age to zero in the case of a
> >> member-space-triggered vacuum, so that we get maximum benefit from our
> >> table scans by freezing all relevant tuples, not just some older ones
> >>
> >
> > I think setting multixact_freeze_min_age to zero could be too aggressive
> > for I/O.  Yes, with this you can get maximum benefit, but at the cost of
> > increased I/O.  How would you justify setting it to zero as appropriate
> > w.r.t. the increased I/O?
>
> I assumed that if you were already vacuuming all your tables to avoid
> running out of member space, 

I think here you mean all tables that have relminmxid less than the
newly computed age (compute_max_multixact_age_to_avoid_member_wrap).

> you would want to freeze any tuples you
> possibly could to defer the next wraparound scan for as long as
> possible, since wraparound scans are enormously expensive.
>

The point is valid to an extent, but if we go by this logic, then we
should also currently set multixact_freeze_min_age to zero for
wraparound vacuums.

Robert Haas

May 4, 2015, 2:46:43 PM
On Sat, May 2, 2015 at 2:16 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Here's a new version which sets up the multixact parameters in
> ExecVacuum for regular VACUUM commands just like it does for
> autovacuum if needed. When computing
> max_multixact_age_to_avoid_member_wrap for a manual vacuum, it uses
> lower constants, so that any manually scheduled vacuums get a chance
> to deal with some of this problem before autovacuum has to. Here are
> the arbitrary constants currently used: at 50% member address space
> usage, autovacuum starts wraparound scan of tables with the oldest
> active multixacts, and then younger ones as the usage increases, until
> at 75% usage it vacuums with multixact_freeze_table_age = 0; for
> manual VACUUM those numbers are halved so that it has a good head
> start.

I think the 75% threshold for reducing multxact_freeze_table_age to
zero is fine, but I don't agree with the 50% cutoff. The purpose of
autovacuum_multixact_freeze_max_age is to control the fraction of the
2^32-entry offset space that can be consumed before we begin viewing
the problem as urgent. We have a setting for that because it needs to
be tunable, and the default value for that setting is 400 million,
which is roughly 10% of the members space. That is a whole lot lower
than the 50% threshold you are proposing here. Moreover, it leaves
the user with no meaningful choice: if the 50% threshold consumes too
much disk space, or doesn't leave enough room before we hit the wall,
then the user is simply hosed. This is why I initially proposed that
the member-space-consumption-percentage at which we start derating
multixact_freeze_table_age should be based on
autovacuum_multixact_freeze_max_age/2^32. That way,
autovacuum_multixact_freeze_max_age controls not only how aggressively
we try to reclaim offset space but also how aggressively we try to
reclaim member space. The user can then tune the value, and the
default is the same in both cases.

I also think that halving the numbers for manual vacuums is arbitrary
and unprecedented. The thought process isn't bad, but an autovacuum
currently behaves in most respects like a manual vacuum, and I'm
reluctant to make those more different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas

May 4, 2015, 3:01:20 PM
On Sat, May 2, 2015 at 7:22 AM, Amit Kapila <amit.k...@gmail.com> wrote:
>> 3. When the autovacuum runs, it advances oldest_mxid by different
That point is certainly worthy of some consideration. Letting the
freeze xmin get set to half of the (effective)
autovacuum_multixact_freeze_max_age would certainly be more consistent
with what we do elsewhere. The policy trade-off is not as
straightforward as you are making it out to be, though:

1. Using a min freeze age of zero will result in half as many
full-table scans, because we'll advance relminmxid twice as far each
time.

2. But each one will freeze more stuff, some of which might have been
updated again before the next freeze pass, so we might do more
freezing in total.

So either policy might win, depending on whether you care more about
reducing reads (in which case you want a very low min freeze age) or
about reducing writes (in which case you want a higher min freeze
age).

All things being equal, I'd rather stick with the existing 50% policy
in the back-branches, rather than going to zero, but I'm not sure all
things are equal. It matters what difference the higher value makes.

Robert Haas

May 4, 2015, 3:06:09 PM
On Sun, May 3, 2015 at 7:49 PM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Restricting ourselves to selecting tables to vacuum using their
> relminmxid alone makes this patch small since autovacuum already works
> that way. We *could* introduce code that would be able to spread out
> the work of vacuuming tables that happen to have identical or very
> close relminmxid (say by introducing some non-determinism or doing
> something weird based on hashing table oids and the time to explicitly
> spread the start of processing over time, or <your idea here>), but I
> didn't want to propose anything too big/complicated/clever/stupid and
> I suspect that the relminmxid values will tend to diverge over time
> (but I could be wrong about that, if they all start at 1 and then move
> forward in lockstep over long periods of time then what I propose is
> not good enough... let's see if we can find out).

So, the problem of everything moving in lockstep is one we already
have. It's actually a serious operational problem for relfrozenxid,
because you might restore your database from pg_dump or similar and
every table will have a very similar relfrozenxid and so then the
anti-wraparound logic fires for all of them at the same time. There
might be cases where MXIDs behave the same way, although I would think
it would be less common.

Anyway, solving that problem would be nice (particularly for xmin!),
but we shouldn't get into that with relation to this bug fix. It's a
problem, but one that will probably take a good deal of work to solve,
and certainly not something we would back-patch.

Alvaro Herrera

May 4, 2015, 4:36:21 PM
Thomas Munro wrote:

> FWIW, in some future release, I think we should consider getting a
> bigger multixact member address space that wraps around at 2^48 or
> 2^64 instead of 2^32, so that we can sidestep the whole business and
> go back to having just xid and mxid wraparounds to worry about.
> pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two
> per multixact), but it's not very big. pg_multiact/members would be
> no bigger for any workload that currently works without hitting the
> wraparound error, but could grow bigger if needed.

Not sure that enlarging the addressable area to 48/64 bits is feasible,
TBH. We already have many complaints that multixacts take too much disk
space; we don't want to make that 2^32 times worse, not even 2^16 times
worse. I don't understand why you say it'd become 1 byte bigger per
multixact; it would have to be 4 more bytes (2^64) or 2 more bytes
(2^48), no? If you have 150 million multixacts (the default freeze
table age) that would mean about 300 or 600 MB of additional disk space,
which is not insignificant: with the current system, in an database with
normal multixact usage of 4 members per multixact, members/ would use
about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to
raise some more complaints.

(The 2^48 suggestion might be a tad more difficult to implement, note,
because a lot of stuff relies on unsigned integer wraparound addition,
and I'm not sure we can have that with a 2^48 counter. Maybe we could
figure how to make it work, but is it worth the bother?)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

May 4, 2015, 4:37:31 PM
Robert Haas wrote:

> Anyway, solving that problem would be nice (particularly for xmin!),
> but we shouldn't get into that with relation to this bug fix. It's a
> problem, but one that will probably take a good deal of work to solve,
> and certainly not something we would back-patch.

+1

Kevin Grittner

May 4, 2015, 4:42:00 PM
Robert Haas <rober...@gmail.com> wrote:

> 1. Using a min freeze age of zero will result in half as many
> full-table scans, because we'll advance relminmxid twice as far
> each time.
>
> 2. But each one will freeze more stuff, some of which might have
> been updated again before the next freeze pass, so we might do
> more freezing in total.
>
> So either policy might win, depending on whether you care more
> about reducing reads (in which case you want a very low min
> freeze age) or about reducing writes (in which case you want a
> higher min freeze age).
>
> All things being equal, I'd rather stick with the existing 50%
> policy in the back-branches, rather than going to zero, but I'm
> not sure all things are equal. It matters what difference the
> higher value makes.

I really don't like the "honor the configured value of
vacuum_multixact_freeze_min_age until the members SLRU gets to 50%
of wraparound and then use zero" approach. It made a lot more
sense to me to honor the configured value to 25% and decrease it in
a linear fashion until it hit zero at 75%. It seems like maybe we
weren't aggressive enough in the dynamic adjustment of
autovacuum_multixact_freeze_max_age, but I'm not clear why fixing
that required the less gradual adjustment of the *_min_age setting.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Alvaro Herrera

May 4, 2015, 4:59:25 PM
So I might have understood an earlier description of the proposed
solution all wrong, or this patch was designed without consideration to
that description. What I thought would happen is that all freeze ages
would get multiplied by some factor <= 1, depending on the space used up
by members. If members space usage is low enough, factor would remain
at 1 so things would behave as today. If members space usage is larger
than X, the factor decreases smoothly and this makes freeze_min_age and
freeze_max_age decrease smoothly as well, for all vacuums equally.

For instance, we could choose a method to compute X based on considering
that a full 2^32 storage area for members is enough to store one
vacuum_multixact_freeze_table_age cycle of multixacts. The default
value of this param is 150 million, and 2^32/150000000 = 28; so if your
average multixact size = 38, you would set the multiplier at 0.736 and
your effective freeze_table_age would become 110 million and effective
freeze_min_age would become 3.68 million.
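The arithmetic in that example can be sketched as a hypothetical helper (the name and rounding are illustrative; like the text above, it truncates 2^32/150000000 to 28):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the proposed multiplier: one
 * freeze_table_age cycle of multixacts must fit in the 2^32 member
 * address space, so the sustainable average multixact size is
 * 2^32 / freeze_table_age; larger averages shrink all freeze ages
 * proportionally.
 */
static double
freeze_age_multiplier(double avg_members_per_mxact, uint32_t freeze_table_age)
{
    /* whole members-per-multixact that fit in one cycle (truncated) */
    double sustainable = (double) (UINT64_C(4294967296) / freeze_table_age);

    if (avg_members_per_mxact <= sustainable)
        return 1.0;             /* member space is not the bottleneck */
    return sustainable / avg_members_per_mxact;
}
```

With the default freeze_table_age of 150 million and an average multixact size of 38, this yields 28/38 ≈ 0.736, matching the worked numbers above.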


As a secondary point, I find variable-names-as-documentation bad
practice. Please don't use a long name such as
max_multixact_age_to_avoid_member_wrap; code becomes unwieldy. A short
name such as safe_mxact_age preceded by a comment /* this variable is
the max that avoids member wrap */ seems more palatable; side-by-side
merges and all that! I don't think long function names are as
problematic (though the name of your new function is still a bit too
long).

Please note that 9.4 and earlier do not have ExecVacuum; the
determination of freeze ages is done partly in gram.y (yuck). Not sure
what the patch will look like in those branches.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

May 4, 2015, 5:30:31 PM
Alvaro Herrera wrote:

> For instance, we could choose a method to compute X based on considering
> that a full 2^32 storage area for members is enough to store one
> vacuum_multixact_freeze_table_age cycle of multixacts. The default
> value of this param is 150 million, and 2^32/150000000 = 28; so if your
> average multixact size = 38, you would set the multiplier at 0.736 and
> your effective freeze_table_age would become 110 million and effective
> freeze_min_age would become 3.68 million.

Actually, apologies --- this is not what I was thinking at all. I got
distracted while I was writing the previous email. My thinking was that
the values would be at their normal defaults when the wraparound is
distant, and the multiplier would start to become slightly less than 1
as the counter moves towards wraparound; by the time we're at an
emergency i.e. we reach max_freeze_age, the values naturally become zero
(or perhaps just before we reach max_freeze_age, the values were 50% of
their normal values, so the drop to zero is not as dramatic). Since
this is gradual, the behavior is not as jumpy as in the proposed patch.

Anyway this is in line with what Kevin is saying elsewhere: we shouldn't
just use the normal values all the time just up to the freeze_max_age
point; there should be some gradual ramp-up.

Perhaps we can combine this with the other idea of using a multiplier
connected to average size of multixact, if it doesn't become too
complicated, surprising, or both.

Thomas Munro

May 4, 2015, 6:51:47 PM
On Tue, May 5, 2015 at 8:36 AM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> FWIW, in some future release, I think we should consider getting a
>> bigger multixact member address space that wraps around at 2^48 or
>> 2^64 instead of 2^32, so that we can sidestep the whole business and
>> go back to having just xid and mxid wraparounds to worry about.
>> pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two
>> per multixact), but it's not very big. pg_multiact/members would be
>> no bigger for any workload that currently works without hitting the
>> wraparound error, but could grow bigger if needed.
>
> Not sure that enlarging the addressable area to 48/64 bits is feasible,
> TBH. We already have many complaints that multixacts take too much disk
> space; we don't want to make that 2^32 times worse, not even 2^16 times
> worse. I don't understand why you say it'd become 1 byte bigger per
> multixact; it would have to be 4 more bytes (2^64) or 2 more bytes
> (2^48), no? If you have 150 million multixacts (the default freeze
> table age) that would mean about 300 or 600 MB of additional disk space,
> which is not insignificant: with the current system, in a database with
> normal multixact usage of 4 members per multixact, members/ would use
> about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to
> raise some more complaints.

Right, sorry, I must have been thinking of 40 bit or 48 bit indexes
when I said 1 or 2 bytes.

I can't help thinking there must be a different way to do this that
takes advantage of the fact that multixacts are often created by
copying all the members of an existing multixact and adding one new
one, so that there is a lot of duplication and churn (at least when
you have a workload that generates bigger multixacts, due to the
O(n^2) process of building them up xid by xid).

Maybe there is a way to store a pointer to some other multixact + a
new xid in a chain structure, but I don't see how to do the cleanup
when you have active multixacts with backwards references to older
multixacts.

Maybe you could find some way to leave gaps in member space (perhaps
by making member index point to member groups with space for 4 or 8
member xids), and MultiXactIdExpand could create new multixacts that
point to the same member offset but a different size so they see extra
members, but that would also waste disk space, be hard to synchronize
and you'd need to fall back to copying the members into new member
space when the spare space is filled anyway.

Alvaro Herrera

May 4, 2015, 9:03:43 PM
Thomas Munro wrote:

> I can't help thinking there must be a different way to do this that
> takes advantage of the fact that multixacts are often created by
> copying all the members of an existing multixact and adding one new
> one, so that there is a lot of duplication and churn (at least when
> you have a workload that generates bigger multixacts, due to the
> O(n^2) process of building them up xid by xid).

Yeah, Simon expressed the same thought to me some months ago, and I gave
it some think-time (but not at lot of it TBH). I didn't see any way to
make it workable.

Normally, lockers go away reasonably quickly, so some of the original
members of the multixact are disappearing all the time. Maybe one way
would be to re-use a multixact you have in your local cache, as long as
the only difference with the multixact you want is some locker
transaction(s) that have already ended. Not sure how you would manage
the cache, though.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Amit Kapila

May 4, 2015, 11:57:12 PM
On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
>
>
> Please note that 9.4 and earlier do not have ExecVacuum; the
> determination of freeze ages is done partly in gram.y (yuck).  Not sure
> what the patch will look like in those branches.
>

One way to make the fix back-patchable is to consider doing the changes
for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits())?
However, I think we might need to distinguish whether the call is from
the Vacuum or the AutoVacuum path.

Alvaro Herrera

May 5, 2015, 9:37:39 AM
Amit Kapila wrote:
> On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvh...@2ndquadrant.com>
> wrote:
> >
> >
> > Please note that 9.4 and earlier do not have ExecVacuum; the
> > determination of freeze ages is done partly in gram.y (yuck). Not sure
> > what the patch will look like in those branches.
>
> One way to make fix back-patchable is to consider doing the changes
> for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits())?
> However, I think we might need to distinguish whether the call is from
> Vacuum or AutoVacuum path.

I think it's easier if we just adjust the patch in older branches to
affect the code that now lives in ExecVacuum. Trying to make all
branches the same will probably make the whole thing more complicated,
for no real purpose.

Robert Haas

May 5, 2015, 5:26:44 PM
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Ok, the new patch uses 25% as the safe threshold, and then scales
> multixact_freeze_table_age down from the current number of active
> multixacts (ie to select the minimum number of tables) progressively
> to 0 (to select all tables) when you reach 75% usage.
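The quoted scheme is a linear ramp. A rough sketch, under the simplifying assumption that we scale from a given starting age (per the description above, the patch starts from the current number of active multixacts rather than a GUC value):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative linear ramp (not the actual patch code): below 25% of
 * member space used, keep the starting age; at 75% or more, drop to
 * zero (select all tables); interpolate linearly in between.
 */
static uint32_t
ramped_freeze_table_age(double member_fraction_used, uint32_t start_age)
{
    const double safe = 0.25;
    const double danger = 0.75;

    if (member_fraction_used <= safe)
        return start_age;
    if (member_fraction_used >= danger)
        return 0;
    return (uint32_t) (start_age *
                       (danger - member_fraction_used) / (danger - safe));
}
```

So at 50% member usage the effective table age is half the starting value, reaching zero exactly at the 75% danger mark.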

I definitely think that 25% is better than 50%. But see below.

> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
> before wraparound vacuum, which is ~10% of 2^32, we would interpret
> that to mean 400 million multixacts OR ~10% * some_constant of member
> space, in other words autovacuum_freeze_max_age * some_constant
> members, whichever comes first. But what should some_constant be?

some_constant should be all the member space there is. So we trigger
autovac if we've used more than ~10% of the offsets OR more than ~10%
of the members. Why is autovacuum_multixact_freeze_max_age
configurable in the first place? It's configurable so that you can set it
low enough that wraparound scans complete and advance the minmxid
before you hit the wall, but high enough to avoid excessive scanning.
The only problem is that it only lets you configure the amount of
headroom you need for offsets, not members. If you squint at what I'm
proposing the right way, it's essentially that that GUC should control
both of those things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Thomas Munro

May 5, 2015, 5:47:07 PM
On Wed, May 6, 2015 at 9:26 AM, Robert Haas <rober...@gmail.com> wrote:
> On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
>> before wraparound vacuum, which is ~10% of 2^32, we would interpret
>> that to mean 400 million multixacts OR ~10% * some_constant of member
>> space, in other words autovacuum_freeze_max_age * some_constant
>> members, whichever comes first. But what should some_constant be?
>
> some_constant should be all the member space there is. So we trigger
> autovac if we've used more than ~10% of the offsets OR more than ~10%
> of the members. Why is autovacuum_multixact_freeze_max_age
> configurable in the first place? It's configurable so that you can set it
> low enough that wraparound scans complete and advance the minmxid
> before you hit the wall, but high enough to avoid excessive scanning.
> The only problem is that it only lets you configure the amount of
> headroom you need for offsets, not members. If you squint at what I'm
> proposing the right way, it's essentially that that GUC should control
> both of those things.

But member space *always* grows at least twice as fast as offset space
(aka active multixact IDs), because multixacts always have at least 2
members (except in some rare cases IIUC), don't they? So if we do
what you just said, then we'll trigger wraparound vacuums twice as
soon as we do now for everybody, even people who don't have any
problem with member space management. We don't want this patch to
change anything for most people, let alone everyone. So I think that
some_constant should be at least 2, if we try to do it this way, in
other words if you set the GUC for 10% of offset space, we also start
triggering wraparounds at 20% of member space. The code in
MultiXactCheckMemberSpace would just say safe_member_count =
autovacuum_multixact_freeze_max_age * 2, where 2 is some_constant (this
number is the average number of multixact members below which your
workload will be unaffected by the new autovac behaviour).

--
Thomas Munro
http://www.enterprisedb.com


Kevin Grittner

May 5, 2015, 6:37:14 PM
That, I think, is what has been driving this patch away from just
considering the *_multixact_* settings as applying to both the
members SLRU and the offsets SLRU; that would effectively simply
change the monitored resource from one to the other. (We would
probably want to actually use the max of the two, just to be safe,
but that offsets might never actually be the trigger.) As Thomas
says, that would be a big change for everyone, and not everyone
necessarily *wants* their existing settings to have new and
different meanings.

> So I think that
> some_constant should be at least 2, if we try to do it this way, in
> other words if you set the GUC for 10% of offset space, we also start
> triggering wraparounds at 20% of member space.

But what if they configure it to start at 80% (which I *have* seen
people do)?

The early patches were a heuristic to attempt to allow current
behavior for those not getting into trouble, and gradually ramp up
aggressiveness as needed to prevent hitting the hard ERROR that now
prevents wraparound. Perhaps, rather than reducing the threshold
gradually, as the members SLRU approaches wraparound we could
gradually shift from using offsets to members as the number we
compare the thresholds to. Up to 25% of maximum members, or if
offset is somehow larger, we just use offsets; else above 75%
maximum members we use members; else we use a weighted average
based on how far we are between 25% and 75%. It's kinda weird, but
I think it gives us a reasonable way to ramp up vacuum
aggressiveness from what we currently do toward what Robert
proposed based on whether the workload is causing things to head
for trouble.
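That weighted shift could look roughly like this (a hypothetical function; the 25% and 75% thresholds and the max-of-the-two safeguard are as described above):

```c
#include <assert.h>

/*
 * Hypothetical sketch of the blending idea: below 25% member usage,
 * compare thresholds against offset consumption; above 75%, against
 * member consumption (assumed scaled to the same units); in between,
 * take a weighted average -- never dropping below the offsets figure.
 */
static double
blended_consumption(double offsets_used, double members_used,
                    double member_fraction_of_max)
{
    double w;                   /* weight given to the members figure */
    double blend;

    if (member_fraction_of_max <= 0.25)
        w = 0.0;
    else if (member_fraction_of_max >= 0.75)
        w = 1.0;
    else
        w = (member_fraction_of_max - 0.25) / 0.50;

    blend = (1.0 - w) * offsets_used + w * members_used;
    return blend > offsets_used ? blend : offsets_used;
}
```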

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Robert Haas

May 5, 2015, 9:53:18 PM
On Tue, May 5, 2015 at 5:46 PM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> But member space *always* grows at least twice as fast as offset space
> (aka active multixact IDs), because multixacts always have at least 2
> members (except in some rare cases IIUC), don't they?

Oh. *facepalm*

All right, so maybe the way you had it is best after all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com

Robert Haas

May 5, 2015, 10:08:51 PM
On Tue, May 5, 2015 at 6:36 PM, Kevin Grittner <kgr...@ymail.com> wrote:
>> So I think that
>> some_constant should be at least 2, if we try to do it this way, in
>> other words if you set the GUC for 10% of offset space, we also start
>> triggering wraparounds at 20% of member space.
>
> But what if they configure it to start at 80% (which I *have* seen
> people do)?

I might be confused here, but the upper limit for
autovacuum_multixact_freeze_max_age is 2 billion, so I don't think
this can ever be higher than 50%. Well, 46.5%, really, since 2^32 > 4
billion. autovacuum_freeze_max_age is similarly limited.

Robert Haas

May 5, 2015, 10:30:06 PM
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Here's a new patch, with responses to several reviews.

Going back to this version...

+ * Based on the assumption that there is no reasonable way for an end user to
+ * configure the thresholds for this, we define the safe member count to be
+ * half of the member address space, and the dangerous level to be

but:

+ const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4;

Those don't match. Also, we usually use #define rather than const for
constants. I suggest we do that here, too.

+ int safe_multixact_age = MultiXactCheckMemberUsage();
+ if (safe_multixact_age >= 0)

Project style is to leave a blank line between these, I think.

I think you need to update the comments for relation_needs_vacanalyze().

The documentation in section 23.1.5.1, "Multixacts and Wraparound",
also needs updating.

Robert Haas

May 6, 2015, 6:45:13 AM
On Wed, May 6, 2015 at 6:26 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> On Wed, May 6, 2015 at 2:29 PM, Robert Haas <rober...@gmail.com> wrote:
>> + * Based on the assumption that there is no reasonable way for an end user to
>> + * configure the thresholds for this, we define the safe member count to be
>> + * half of the member address space, and the dangerous level to be
>>
>> but:
>>
>> + const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4;
>>
>> Those don't match. [...]
>
> Fixed/obsoleted in the attached patch. It has a dynamic
> safe_member_count based on scaling the GUC as described in my earlier
> email with the v7 patch; the behaviour with the default GUC value
> works out to a similar safe_member_count value, but this way it can be
> changed if needed, and we don't introduce any new GUCs. Also, since
> the GUC used in determining safe_member_count is either
> autovacuum_multixact_freeze_max_age or vacuum_multixact_freeze_max_age
> depending on which kind of vacuum it is, that is now a parameter
> passed into MultiXactCheckMemberUsage, so safe_member_count is no
> longer a constant.

To be honest, now that you've pointed out that the fraction of the
multixact members space that is in use will always be larger,
generally much larger, than the fraction of the offset space that is
in use, I've kind of lost all enthusiasm for making the
safe_member_count stuff dependent on
autovacuum_multixact_freeze_max_age. I'm inclined to go back to 25%,
the way you had it before.

We could think about adding a new GUC in master, but I'm actually
leaning toward the view that we should just hard-code 25% for now and
consider revising it later if that proves inadequate.

Amit Kapila

May 6, 2015, 9:18:46 AM
On Wed, May 6, 2015 at 3:56 PM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Wed, May 6, 2015 at 2:29 PM, Robert Haas <rober...@gmail.com> wrote:

Few comments:

1.
+ /*
+ * Override the multixact freeze settings if we are running out of
+ * member address space.
+ */
+ if (safe_multixact_age >= 0)
+ {
+ multixact_freeze_table_age = Min(safe_multixact_age,
+ multixact_freeze_table_age);

+ /* Special settings if we are running out of member address space. */
+ if (safe_multixact_age >= 0)
+ multixact_freeze_max_age = Min(multixact_freeze_max_age, safe_multixact_age);
+


Some places use safe_multixact_age as the first parameter and some
places use it as the second. I think it is better to use the same
order for the sake of consistency.

2.
in the hope
+ * that different tables will be vacuumed at different times due to their
+ * varying relminmxid values.

Does the above line in the comment on top of MultiXactCheckMemberUsage()
make much sense?



3.
+ * we know the age of the oldest multixact in the system, so that's the
+ * value we want to when members is near safe_member_count.  It should

typo.
so that's the value we want to *use* when ..

Alvaro Herrera

May 6, 2015, 10:16:42 AM
I haven't read your patch, but I wonder if we should decrease the
default value of multixact_freeze_table_age (currently 150 million).
The freeze_min_age is 5 million; if freeze_table_age were a lot lower,
the problem would be less pronounced.

Additionally, I will backpatch commit 27846f02c176. The average size of
multixacts decreases with that fix in many common cases, which greatly
reduces the need for any of this in the first place.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

May 6, 2015, 10:33:54 AM
Robert Haas wrote:

> So here's a new patch, based on your latest version, which looks
> reasonably committable to me.

I think this code should also reduce the multixact_freeze_min_age value
at the same time as multixact_freeze_table_age. If the table age is
reduced but freeze_min_age remains high, old multixacts might still
remain in the table. The default value for freeze min age is 5 million,
but users may change it. Perhaps freeze min age should be set to
Min(modified freeze table age, freeze min age) so that old multixacts
are effectively frozen whenever a full table scan is requested.

> 1. Should we be installing one or more GUCs to control this behavior?
> I've gone back to hard-coding things so that at 25% we start
> triggering autovacuum and by 75% we zero out the freeze ages, because
> the logic you proposed in your last version looks insanely complicated
> to me. (I do realize that I suggested the approach, but that was
> before I realized the full complexity of the problem.) I now think
> that if we want to make this tunable, we need to create and expose
> GUCs for it. I'm hoping we can get by without that, but I'm not sure.

I think things are complicated enough; I vote for no additional GUCs at
this point.

> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
> similar logic? Otherwise, a user with autovacuum=off won't get
> emergency autovacuums for member exhaustion, even though they will get
> them for offset exhaustion.

Yeah, it looks like it does.

Kevin Grittner

May 6, 2015, 12:16:04 PM
Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
> Robert Haas wrote:

>> So here's a new patch, based on your latest version, which looks
>> reasonably committable to me.
>
> I think this code should also reduce the multixact_freeze_min_age value
> at the same time as multixact_freeze_table_age. If the table age is
> reduced but freeze_min_age remains high, old multixacts might still
> remain in the table. The default value for freeze min age is 5 million,
> but users may change it. Perhaps freeze min age should be set to
> Min(modified freeze table age, freeze min age) so that old multixacts
> are effectively frozen whenever a full table scan is requested

I would rather see min age reduced proportionally to table age, or
at least ensure that min age is some percentage below table age.

>> 1. Should we be installing one or more GUCs to control this behavior?
>> I've gone back to hard-coding things so that at 25% we start
>> triggering autovacuum and by 75% we zero out the freeze ages, because
>> the logic you proposed in your last version looks insanely complicated
>> to me. (I do realize that I suggested the approach, but that was
>> before I realized the full complexity of the problem.) I now think
>> that if we want to make this tunable, we need to create and expose
>> GUCs for it. I'm hoping we can get by without that, but I'm not sure.
>
> I think things are complicated enough; I vote for no additional GUCs at
> this point.

+1

For one thing, we should try to have something we can back-patch,
and new GUCs in a minor release seems like something to avoid, if
possible. For another thing, we've tended not to put in GUCs if
there is no reasonable way for a user to determine a good value,
and that seems to be the case here.

>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>> similar logic? Otherwise, a user with autovacuum=off won't get
>> emergency autovacuums for member exhaustion, even though they will get
>> them for offset exhaustion.
>
> Yeah, it looks like it does.

+1

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Robert Haas

May 6, 2015, 12:23:37 PM
On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
>> So here's a new patch, based on your latest version, which looks
>> reasonably committable to me.
>
> I think this code should also reduce the multixact_freeze_min_age value
> at the same time as multixact_freeze_table_age.

I think it does that. It sets the min age to half the value it sets
for the table age, which I think is consistent with what we do
elsewhere.
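Schematically, that clamping amounts to something like the following (a restatement for illustration, not the committed code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch: when member usage forces an effective table
 * age, cap the configured table age at that value and cap the min
 * age at half of it, so frozen-enough multixacts actually get
 * removed during the full-table scan.
 */
static void
clamp_freeze_ages(uint32_t effective_table_age,
                  uint32_t *freeze_table_age,
                  uint32_t *freeze_min_age)
{
    if (*freeze_table_age > effective_table_age)
        *freeze_table_age = effective_table_age;
    if (*freeze_min_age > effective_table_age / 2)
        *freeze_min_age = effective_table_age / 2;
}
```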

>> 1. Should we be installing one or more GUCs to control this behavior?
>> I've gone back to hard-coding things so that at 25% we start
>> triggering autovacuum and by 75% we zero out the freeze ages, because
>> the logic you proposed in your last version looks insanely complicated
>> to me. (I do realize that I suggested the approach, but that was
>> before I realized the full complexity of the problem.) I now think
>> that if we want to make this tunable, we need to create and expose
>> GUCs for it. I'm hoping we can get by without that, but I'm not sure.
>
> I think things are complicated enough; I vote for no additional GUCs at
> this point.

That's fine with me for now.

>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>> similar logic? Otherwise, a user with autovacuum=off won't get
>> emergency autovacuums for member exhaustion, even though they will get
>> them for offset exhaustion.
>
> Yeah, it looks like it does.

OK, I'm not clear how to do that correctly, exactly, but hopefully one
of us can figure that out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com