
Re: [BUGS] BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)


Robert Haas

Apr 27, 2015, 10:46:55 AM
On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgr...@ymail.com> wrote:
> I think I see why I was seeing this and nobody else was

Thomas said he reproduced it. No?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-bugs mailing list (pgsql...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Kevin Grittner

Apr 27, 2015, 10:57:20 AM
Robert Haas <rober...@gmail.com> wrote:
> On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgr...@ymail.com> wrote:
>> I think I see why I was seeing this and nobody else was
>
> Thomas said he reproduced it. No?

I should have been more clear about what I meant by "this". Thomas
said he reproduced the immediate errors with Álvaro's patch, but if
he said anything about files in the members subdirectory not going
away with VACUUM followed by CHECKPOINT, regardless of
configuration, I missed it. It turns out that these steps only
"prime the pump" for the files to be deleted on subsequent access
to the members SLRU. That doesn't contribute to database
corruption, but it sure can be confusing for someone trying to
clean things up.

--
Kevin Grittner
EDB: http://www.enterprisedb.com

Alvaro Herrera

Apr 27, 2015, 10:58:56 AM
Robert Haas wrote:
> On Thu, Apr 23, 2015 at 9:59 PM, Alvaro Herrera
> <alvh...@2ndquadrant.com> wrote:
> > Thomas Munro wrote:
> >> That's why I proposed not using xid-like logic, and instead using a
> >> type of three-way comparison that allows you to see when nextOffset
> >> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
> >> that considers half the number-space to be in the past and half in the
> >> future, in my earlier message.
> >
> > Yeah, that bit made sense to me.
>
> In addition to preventing the corruption, I think we also need a
> back-patchable fix for AV to try to keep this situation from happening
> in the first place.

Let me push a patch to fix the corruption, and then we can think of ways
to teach autovacuum about the problem. I'm not optimistic about that,
honestly: like all GUC settings, these are individual to each process,
and there's no way for one process to affect the values that are seen by
other processes (autovac workers). The only idea that comes to mind is
to publish values in shared memory, and autovac workers would read them
from there instead of using normal GUC values.

> What I think we should do is notice when members utilization exceeds
> offset utilization and progressively ramp back the effective value of
> autovacuum_multixact_freeze_max_age (and maybe also
> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
> so that autovacuum (and maybe also manual vacuums) get progressively
> more aggressive about trying to advance relminmxid. Suppose we decide
> that when the "members" space is 75% used, we've got a big problem and
> want to treat autovacuum_multixact_freeze_max_age to effectively be
> zero.

I think we can easily determine the rate of multixact member space
consumption and compare to the rate of multixact ID consumption;
considering the historical multixact size (number of members per
multixact) it would be possible to change the freeze ages by the same
fraction, so that autovac effectively behaves as if the members
consumption rate is what is driving the freezing instead of ID
consumption rate. That way, we don't have to jump suddenly from
"normal" to "emergency" behavior at some fixed threshold.

> This may not be the right proposal in detail, but I think we should do
> something.

No disagreement on that.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Alvaro Herrera

Apr 27, 2015, 11:24:20 AM
Kevin Grittner wrote:
> Robert Haas <rober...@gmail.com> wrote:
> > On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgr...@ymail.com> wrote:
> >> I think I see why I was seeing this and nobody else was
> >
> > Thomas said he reproduced it. No?
>
> I should have been more clear about what I meant by "this". Thomas
> said he reproduced the immediate errors with Álvaro's patch, but if
> he said anything about files in the members subdirectory not going
> away with VACUUM followed by CHECKPOINT, regardless of
> configuration, I missed it. It turns out that these steps only
> "prime the pump" for the files to be deleted on subsequent access
> to the members SLRU. That doesn't contribute to database
> corruption, but it sure can be confusing for someone trying to
> clean things up.

The whole matter of truncating multixact is a longish trip. It starts
when autovacuum completes a round or VACUUM finishes processing a table;
these things call vac_update_datfrozenxid. That routine scans pg_class
and sees whether datfrozenxid or datminmxid can be advanced from their
current points; only if one of them can is vac_truncate_clog called.
That routine calls SetMultiXactIdLimit(), which determines a new
MultiXactState->oldestMultiXactId (saved in shared memory). The
involvement of vacuum stops here; following steps happen at checkpoint.

At checkpoint, oldestMultiXactId is saved to pg_control as part of a
checkpoint (MultiXactGetCheckptMulti); the checkpointed value is passed
back to multixact by MultiXactSetSafeTruncate, which saves it in shmem
as lastCheckpointedOldest. The same checkpoint later calls
TruncateMultiXact which can remove files.

Note that if vac_update_datfrozenxid finds that the pg_database values
cannot be changed (during the vacuum phase), the multixact truncation
point is not advanced and checkpoint has nothing to do. But note that
the clog counter advancing enough will also trigger multixact
truncation.
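The vacuum-to-checkpoint handoff described above can be condensed into a toy model (the struct and function names below are simplified stand-ins, not the real symbols in multixact.c):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the handoff: vacuum advances the truncation point in
 * "shared memory"; a later checkpoint records it and only then may
 * files be removed, based on the checkpointed copy.
 */
typedef struct
{
    uint32_t oldestMultiXactId;      /* advanced by vacuum (SetMultiXactIdLimit) */
    uint32_t lastCheckpointedOldest; /* set when a checkpoint records the value */
} ToyMultiXactState;

static ToyMultiXactState toy_state = {1, 1};

/* Vacuum side: vac_truncate_clog -> SetMultiXactIdLimit. */
static void
toy_set_multixact_limit(uint32_t newOldest)
{
    toy_state.oldestMultiXactId = newOldest;
}

/*
 * Checkpoint side: record the vacuum-determined value, then truncate
 * based on the checkpointed copy, never the live one.
 */
static uint32_t
toy_checkpoint_and_truncate(void)
{
    toy_state.lastCheckpointedOldest = toy_state.oldestMultiXactId;
    return toy_state.lastCheckpointedOldest; /* files before this can go */
}
```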

Robert Haas

Apr 27, 2015, 12:41:01 PM
On Mon, Apr 27, 2015 at 10:59 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Let me push a patch to fix the corruption, and then we can think of ways
> to teach autovacuum about the problem.

Sounds good to me. Are you going to do that today?

> I'm not optimistic about that,
> honestly: like all GUC settings, these are individual to each process,
> and there's no way for one process to affect the values that are seen by
> other processes (autovac workers). The only idea that comes to mind is
> to publish values in shared memory, and autovac workers would read them
> from there instead of using normal GUC values.

I don't think we could store values for the parameters directly in
shared memory, because I think that at least some of those GUCs are
per-session changeable. But we might be able to store weighting
factors in shared memory that get applied to whatever the values in
the current session are. Or else maybe each backend can just
recompute the information for itself when it needs it.

>> What I think we should do is notice when members utilization exceeds
>> offset utilization and progressively ramp back the effective value of
>> autovacuum_multixact_freeze_max_age (and maybe also
>> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
>> so that autovacuum (and maybe also manual vacuums) get progressively
>> more aggressive about trying to advance relminmxid. Suppose we decide
>> that when the "members" space is 75% used, we've got a big problem and
>> want to treat autovacuum_multixact_freeze_max_age to effectively be
>> zero.
>
> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate. That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior at some fixed threshold.

Right. I think that not jumping from normal mode to emergency mode is
quite important, and was trying to describe a system that would
gradually ramp up the pressure rather than a system that would do
nothing for a while and then suddenly go ballistic.

With regard to what you've outlined here, we need to make sure that if
the multixact rate varies widely, we still clean up before we hit
autovac wraparound. That's why I think it should be driven off of the
fraction of the available address space which is currently consumed,
not some kind of short term measure of mxact size or generation rate.
I'm not sure exactly what you have in mind here.
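A fraction-of-address-space measure of the kind suggested here is cheap to compute in a wrap-aware way (an illustrative helper, not an API from any patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative: fraction of the 2^32 member address space in use,
 * computed from the oldest surviving offset and the next offset to
 * assign.  Unsigned subtraction handles wraparound for free.
 */
static double
member_space_fraction_used(uint32_t oldest_offset, uint32_t next_offset)
{
    uint32_t used = next_offset - oldest_offset; /* wraps modulo 2^32 */

    return (double) used / 4294967296.0;
}
```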

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

David Gould

Apr 27, 2015, 4:12:10 PM
On Mon, 27 Apr 2015 11:59:10 -0300
Alvaro Herrera <alvh...@2ndquadrant.com> wrote:

> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate. That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior at some fixed threshold.

I would like to add a data point: one of my clients has a plpgsql function
that manages to use ten to thirty thousand multixact IDs per invocation. It
interacts with a remote resource and sets an exception handler on a per-item
basis to catch errors on the remote call.

-dg


--
David Gould 510 282 0869 da...@sonic.net
If simplicity worked, the world would be overrun with insects.

Alvaro Herrera

Apr 28, 2015, 2:23:34 AM
Thomas Munro wrote:

> One thing I noticed about your patch is that it effectively halves the
> amount of multixact members you can have on disk. Sure, I'd rather
> hit an error at 2^31 members than a corrupt database at 2^32 members,
> but I wondered if we should try to allow the full range to be used.

Ah, yeah, we do want the full range; that's already built in the code
elsewhere.

In this version, I used your WouldWrap function, but there was a bug in
your formulation of the call site: after the WARNING has been issued
once, it is never issued again for that wraparound cycle, because the
second time around the nextOffset has already crossed the boundary and
your routine returns false. IMO this is wrong and the warning should be
issued every time. To fix that problem I removed the offsetWarnLimit
altogether, and instead do WouldWrap() of the value against
offsetStopLimit minus the 20 segments. That way, the warning is issued
continuously until the offsetStopLimit is reached (once there,
obviously, only the error is thrown, not the warning, which is correct.)
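For reference, the wrap-aware test being discussed can be written as a small standalone function (modeled on the WouldWrap routine mentioned above; treat it as a sketch, not the committed code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Does advancing "distance" members from "start" reach or cross
 * "boundary", in a 32-bit offset space that wraps around?  This is the
 * three-way comparison idea from the thread, not the committed code.
 */
static bool
offset_would_wrap(uint32_t boundary, uint32_t start, uint32_t distance)
{
    uint32_t finish = start + distance; /* wraps modulo 2^32 */

    if (start < boundary)
        return finish >= boundary || finish < start;
    else
        return finish >= boundary && finish < start;
}
```

Issuing the WARNING whenever this returns true against a boundary of offsetStopLimit minus 20 segments gives the continuous warning behavior described above.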

I also added a call to DetermineSafeOldestOffset() in TrimMultiXact:
as far as I can tell, this is necessary for the time when a standby
exits recovery, because when InRecovery we return early from
DetermineSafeOldestOffset() so the safe point would never get set.
memberswrap-2.patch

Thomas Munro

Apr 28, 2015, 2:30:43 AM
On Tue, Apr 28, 2015 at 6:23 PM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> One thing I noticed about your patch is that it effectively halves the
>> amount of multixact members you can have on disk. Sure, I'd rather
>> hit an error at 2^31 members than a corrupt database at 2^32 members,
>> but I wondered if we should try to allow the full range to be used.
>
> Ah, yeah, we do want the full range; that's already built in the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false. IMO this is wrong and the warning should be
> issued every time. To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments. That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)

+1

Tomorrow I will send a separate patch for the autovacuum changes that
I sent earlier. Let's discuss and hopefully eventually commit that
separately.

--
Thomas Munro
http://www.enterprisedb.com

Robert Haas

Apr 28, 2015, 10:33:24 AM
On Tue, Apr 28, 2015 at 2:23 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Ah, yeah, we do want the full range; that's already built in the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false. IMO this is wrong and the warning should be
> issued every time. To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments. That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)
>
> I also added a call to DetermineSafeOldestOffset() in TrimMultiXact:
> as far as I can tell, this is necessary for the time when a standby
> exits recovery, because when InRecovery we return early from
> DetermineSafeOldestOffset() so the safe point would never get set.

Putting the period inside the parentheses here looks weird?

+ "This command would create a multixact with %u members, which exceeds remaining space (%u members.)",

Maybe rephrase as: "This command would create a multixact with %u
members, but the remaining space is only enough for %u members."

I don't think this should have a comma:

+ errhint("Execute a database-wide VACUUM in that
database, with reduced vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.")));

This looks like excess brace-ification:

+ if (start < boundary)
+ {
+ return finish >= boundary || finish < start;
+ }
+ else
+ {
+ return finish >= boundary && finish < start;
+ }

I think this is confusing:


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas

Apr 28, 2015, 10:34:51 AM
On Tue, Apr 28, 2015 at 10:33 AM, Robert Haas <rober...@gmail.com> wrote:
> I think this is confusing:

Oops, hit send too soon.

+/*
+ * Read the offset of the first member of the given multixact.
+ */

This is confusing to me because the two subdirectories of pg_multixact
are called "members" and "offsets". Here you are talking about the
offset of the first member. Maybe I'm just slow, but that seems like
conflating terminology. You end up with a function called
read_offset_for_multi() that is actually looking up information about
members. Ick.

Alvaro Herrera

Apr 28, 2015, 10:34:59 AM
Thomas Munro wrote:

> > In this version, I used your WouldWrap function, [...]
>
> +1

Pushed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

Apr 28, 2015, 10:55:58 AM

I sure wish this had arrived two minutes earlier ...

Robert Haas wrote:

> Putting the period inside the parentheses here looks weird?
>
> + "This command would create a multixact with %u members, which exceeds remaining space (%u members.)",
>
> Maybe rephrase as: "This command would create a multixact with %u
> members, but the remaining space is only enough for %u members."

WFM, will change.

> I don't think this should have a comma:
>
> + errhint("Execute a database-wide VACUUM in that
> database, with reduced vacuum_multixact_freeze_min_age and
> vacuum_multixact_freeze_table_age settings.")));

Ditto.

> This looks like excess brace-ification:
>
> + if (start < boundary)
> + {
> + return finish >= boundary || finish < start;
> + }
> + else
> + {
> + return finish >= boundary && finish < start;
> + }

Yeah, agreed. Will undo that change. (I disliked the comment above the
indented single-statement, so added braces, but then moved the comment.
I should have removed the braces at that point.)

> I think this is confusing:
>
> +/*
> + * Read the offset of the first member of the given multixact.
> + */
>
> This is confusing to me because the two subdirectories of pg_multixact
> are called "members" and "offsets". Here you are talking about the
> offset of the first member. Maybe I'm just slow, but that seems like
> conflating terminology. You end up with a function called
> read_offset_for_multi() that is actually looking up information about
> members. Ick.

Yeah, I introduced the confusing terminology while inventing multixacts
initially and have regretted it many times. I will think about a better
name for this. (Meanwhile, on IM Robert suggested
find_start_of_first_multi_member)

Robert Haas

Apr 28, 2015, 12:05:10 PM
On Tue, Apr 28, 2015 at 10:56 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> I sure wish this had arrived two minutes earlier ...

Sorry about that. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Alvaro Herrera

Apr 28, 2015, 1:53:53 PM
Alvaro Herrera wrote:

> > I think this is confusing:
> >
> > +/*
> > + * Read the offset of the first member of the given multixact.
> > + */
> >
> > This is confusing to me because the two subdirectories of pg_multixact
> > are called "members" and "offsets". Here you are talking about the
> > offset of the first member. Maybe I'm just slow, but that seems like
> > conflating terminology. You end up with a function called
> > read_offset_for_multi() that is actually looking up information about
> > members. Ick.
>
> Yeah, I introduced the confusing terminology while inventing multixacts
> initially and have regretted it many times. I will think about a better
> name for this. (Meanwhile, on IM Robert suggested
> find_start_of_first_multi_member)

Pushed. I chose find_multixact_start() as a name for this function.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Jeff Janes

Apr 28, 2015, 2:14:02 PM
On Tue, Apr 28, 2015 at 10:54 AM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
Alvaro Herrera wrote:

> > I think this is confusing:
> >
> > +/*
> > + * Read the offset of the first member of the given multixact.
> > + */
> >
> > This is confusing to me because the two subdirectories of pg_multixact
> > are called "members" and "offsets".  Here you are talking about the
> > offset of the first member.  Maybe I'm just slow, but that seems like
> > conflating terminology.  You end up with a function called
> > read_offset_for_multi() that is actually looking up information about
> > members.  Ick.
>
> Yeah, I introduced the confusing terminology while inventing multixacts
> initially and have regretted it many times.  I will think about a better
> name for this.  (Meanwhile, on IM Robert suggested
> find_start_of_first_multi_member)

Pushed.  I chose find_multixact_start() as a name for this function.


Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this just-described commit, I can no longer upgrade from a 9.2.10 database using pg_upgrade.

I can reproduce it from a clean 9.2 install which has never even been started up.

Deleting files from new pg_multixact/offsets                ok
Setting oldest multixact ID on new cluster                  ok
Resetting WAL archives                                      ok

*failure*
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

The last few lines are:

command: "../bisect/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "../data2/" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off  -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/jjanes/pgsql/git'" start >> "pg_upgrade_server.log" 2>&1
waiting for server to start....LOG:  database system was shut down at 2015-04-28 11:08:18 PDT
FATAL:  could not access status of transaction 1
DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.
LOG:  startup process (PID 3977) exited with exit code 1
LOG:  aborting startup due to startup process failure


Cheers,

Jeff

Alvaro Herrera

Apr 28, 2015, 2:52:32 PM
Jeff Janes wrote:

> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

How annoying, thanks for the report. I reproduced it here. The problem
is that the upgrade process removes the files from pg_multixact/offsets,
which is what we now want to read on startup. Not yet sure how to fix
it.

Alvaro Herrera

Apr 28, 2015, 7:12:41 PM
Jeff Janes wrote:
> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

Here's a patch, but I don't like it too much. Will think more about it,
probably going to push something tomorrow.
memberswrap-3.patch

Thomas Munro

Apr 29, 2015, 2:10:39 AM
On Tue, Apr 21, 2015 at 5:12 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> On Tue, Apr 21, 2015 at 12:34 AM, Alvaro Herrera <alvh...@2ndquadrant.com>
> wrote:
>>
>> Alvaro Herrera wrote:
>>
>> > The fix is to raise an ERROR when generating a new multixact, if we
>> > detect that doing so would get close to the oldest multixact that the
>> > system knows about. If that happens, the solution is to vacuum so that
>> > the "oldest" point is advanced a bit more and you have room to generate
>> > more multixacts. In production, you would typically adjust the
>> > multixact freeze parameters so that "oldest multixact" is advanced more
>> > aggressively and you don't hit the ERROR.
>>
>> Here's a patch. I have tested locally and it closes the issue for me.
>> If those affected can confirm that it stops the file removal from
>> happening, I'd appreciate it.
>>
>
> 1. Do you think it makes sense to give warning in SetMultiXactIdLimit()
> if we have already reached offsetWarnLimit as we give for multiWarnLimit?

Amit and I discussed this offline. Yes, we could include a warning
message here, for consistency with the warnings you get about xid
wraparound. Concretely I think it means that you would also get
warnings about being near the member space limit from vacuums,
rather than just from attempts to allocate new multixact IDs. The
test to detect an impending members-would-wrap ERROR would be similar
to what we do in GetNewMultiXactId, so something like:

MultiXactOffsetWouldWrap(offsetStopLimit,
nextOffset,
MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT *
OFFSET_WARN_SEGMENTS)

I'm not sure whether it's worth writing an extra patch for this
though, because if you're in this situation, your logs are already
overflowing with warnings from the regular backends that are
generating multixacts. Thoughts anyone?
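For a sense of scale, the warning window in that expression works out to roughly a million members. The constants below are assumptions based on an 8 kB block size (1636 members per SLRU page, 32 pages per segment, 20 warning segments); check the real definitions in multixact.c and slru.h before relying on them:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against multixact.c / slru.h. */
#define MULTIXACT_MEMBERS_PER_PAGE 1636
#define SLRU_PAGES_PER_SEGMENT     32
#define OFFSET_WARN_SEGMENTS       20

/* Number of member slots inside the warning window before the stop limit. */
static uint32_t
warn_window_members(void)
{
    return (uint32_t) MULTIXACT_MEMBERS_PER_PAGE
         * SLRU_PAGES_PER_SEGMENT
         * OFFSET_WARN_SEGMENTS;
}
```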

--
Thomas Munro
http://www.enterprisedb.com


Amit Kapila

Apr 29, 2015, 7:42:04 AM
On Wed, Apr 29, 2015 at 5:44 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Tue, Apr 28, 2015 at 6:30 PM, Thomas Munro
> <thomas...@enterprisedb.com> wrote:
> > Tomorrow I will send a separate patch for the autovacuum changes that
> > I sent earlier.  Let's discuss and hopefully eventually commit that
> > separately.
>
> Here is a separate patch which makes autovacuum start a wrap-around
> vacuum sooner if the member space is running out, by adjusting
> autovacuum_multixact_freeze_max_age using a progressive scaling
> factor.  This version includes a clearer implementation of
> autovacuum_multixact_freeze_max_age_adjusted provided by Kevin
> Grittner off-list.
>

Some comments:

1. It seems that you are using autovacuum_multixact_freeze_max_age_adjusted()
in only a couple of places; for example, it is not used in the calculation below:

vacuum_set_xid_limits()
{
..
mxid_freezemin = Min(mxid_freezemin,
autovacuum_multixact_freeze_max_age / 2); 
..
}

What is the reason for using the adjusted value in some places and
not in others?

2.
@@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid,
  : autovacuum_freeze_max_age;
 
  multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age >= 0)
- ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
- : autovacuum_multixact_freeze_max_age;
+ ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age_adjusted())
+ : autovacuum_multixact_freeze_max_age_adjusted();


It seems that this will read from the offsets file for each
relation, which might or might not be good; shall we try to
cache oldestMultiXactMemberOffset?

3. Currently there is a minimum limit on autovacuum_multixact_freeze_max_age
(10000000) which might not be honored by this calculation, so I am not sure
whether that could impact system performance in cases that currently work sanely.

4. Can you please share results showing the improvement with the
current patch versus unpatched master?

5.
+ /*
+  * TODO: In future, could oldestMultiXactMemberOffset be stored in shmem,
+  * pg_controdata, alongside oldestMultiXactId?
+  */

You might want to write the comment as:
XXX: We can store oldestMultiXactMemberOffset in shmem, pg_controldata
alongside oldestMultiXactId?

6.
+ * Returns vacuum_multixact_freeze_max_age, adjusted down to prevent
+ * excessive use of addressable multixact member space if required.

I think here you mean autovacuum_multixact_freeze_max_age?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Jeff Janes

Apr 29, 2015, 1:07:34 PM
On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
Jeff Janes wrote:
> Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

Here's a patch, but I don't like it too much.  Will think more about it,
probably going to push something tomorrow.

It looks like that patch is targeted to 9.4 branch.  I couldn't readily get it to apply on HEAD.  I tested it on 9.4, and it solved the problem there.

Thanks,

Jeff

Amit Kapila

Apr 29, 2015, 11:53:16 PM
On Tue, Apr 28, 2015 at 11:24 PM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:

>
> Alvaro Herrera wrote:
>
>
> Pushed.  I chose find_multixact_start() as a name for this function.
>

I have done a test to ensure that the latest change fixes the
reported problem; below are the results. To me it looks like the
reported problem is fixed.

I used the test program (explode_mxact_members) developed by Thomas
to reproduce the problem, with one transaction left open in a session.
After running the test for 3~4 hours with the parameters
explode_mxact_members 500 35000, I could see warning messages like the
ones below (before the fix there were no such messages; the test
completed but corrupted the database):

WARNING:  database with OID 1 must be vacuumed before 358 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 310 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 261 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 211 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 160 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
explode_mxact_members: explode_mxact_members.c:38: main: Assertion `PQresultStatus(res) == PGRES_TUPLES_OK' 
failed.
 
After this I set vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age to zero and then performed
VACUUM FREEZE on template1 and postgres, followed by a
manual CHECKPOINT.  I then saw the following values in pg_database:

postgres=# select oid,datname,datminmxid from pg_database;
  oid  |  datname  | datminmxid 
-------+-----------+------------
     1 | template1 |   17111262
 13369 | template0 |   17111262
 13374 | postgres  |   17111262
(3 rows)

Again I started the test as ./explode_mxact_members 500 35000, but it
immediately failed:
500 sessions connected...
Loop 0...
WARNING:  database with OID 13369 must be vacuumed before 12 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 13369 must be vacuumed before 11 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 13369 must be vacuumed before 9 more multixact members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and 
vacuum_multixact_freeze_table_age settings.
explode_mxact_members: explode_mxact_members.c:38: main: Assertion `PQresultStatus(res) == PGRES_TUPLES_OK' 
failed.

It was confusing to me why it failed again even though I had run
VACUUM FREEZE and CHECKPOINT, but then I waited for a minute or two
and ran VACUUM FREEZE with the command below:
./vacuumdb -a -F
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

Here I have verified that all files except one were deleted.

After that, when I restarted the test, it went perfectly fine and never
led to any warning messages, probably because the values for
vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age
were zero.

I am still not sure why it took some time to clean the members directory
and resume the test after running Vacuum Freeze and Checkpoint.

Robert Haas

Apr 30, 2015, 9:08:44 AM
On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
>> just-described commit, I can no longer upgrade from a 9.2.10 database using
>> pg_upgrade.
>
> Here's a patch, but I don't like it too much. Will think more about it,
> probably going to push something tomorrow.

What don't you like about it? We should get something committed here;
it's not good for the back-branches to be in a state where pg_upgrade
will break.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Alvaro Herrera

Apr 30, 2015, 12:50:46 PM
Robert Haas wrote:
> On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera
> <alvh...@2ndquadrant.com> wrote:
> > Jeff Janes wrote:
> >> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
> >> just-described commit, I can no longer upgrade from a 9.2.10 database using
> >> pg_upgrade.
> >
> > Here's a patch, but I don't like it too much. Will think more about it,
> > probably going to push something tomorrow.
>
> What don't you like about it? We should get something committed here;
> it's not good for the back-branches to be in a state where pg_upgrade
> will break.

Yeah, I managed to find a real fix which I will push shortly.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

Apr 30, 2015, 1:03:12 PM
Jeff Janes wrote:
> On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvh...@2ndquadrant.com>
> wrote:
>
> > Jeff Janes wrote:
> > > Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to
> > > this just-described commit, I can no longer upgrade from a 9.2.10
> > > database using pg_upgrade.
> >
> > Here's a patch, but I don't like it too much. Will think more about it,
> > probably going to push something tomorrow.
>
> It looks like that patch is targeted to 9.4 branch. I couldn't readily get
> it to apply on HEAD. I tested it on 9.4, and it solved the problem there.

Yeah, I wrote it in 9.3. However, it was wrong; or at least there's a
better way to formulate it, and the new formulation applies without
conflict from 9.3 to master. So I pushed that instead.

Thanks!

Robert Haas

May 1, 2015, 3:08:53 PM
On Fri, May 1, 2015 at 6:51 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Those other places are for capping the effective table and tuple
> multixact freeze ages for manual vacuums, so that manual vacuums (say
> in nightly cronjobs) get a chance to run wraparound scans before
> autovacuum kicks in at a less convenient time. So, yeah, I think we
> want to incorporate member wraparound prevention into that logic, and
> I will add that in the next version of the patch.

+1. On a quick read-through of the patch, the biggest thing that
jumped out at me was that it only touches the autovacuum logic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Amit Kapila

May 2, 2015, 7:22:42 AM
On Thu, Apr 30, 2015 at 10:47 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:

>
> On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.k...@gmail.com> wrote:
>
> > 3. currently there is some minimum limit of autovacuum_multixact_freeze_age
> > (10000000)
> > which might not be honored by this calculation, so not sure if that can
> > impact the
> > system performance in some cases where it is currently working sane.
>
> The reason why we need to be able to set the effective freeze age
> below that minimum in cases of high member data consumption rates is
> that you could hit the new member space wraparound prevention error
> before you consume anywhere near that many multixact IDs.  That
> minimum may well be entirely reasonable if the only thing you're
> worried about is multixact ID wraparound prevention.
>
> For example, my test program eats an average of 250 members per
> multixact ID when run with 500 sessions (each loop creates 500
> multixact IDs having 1, 2, 3, ..., 500 members).  At that rate, you'll
> run out of addressable member space after 2^32 / 250 = 17,179,869
> multixact IDs.  To prevent an error condition using only the existing
> multixact ID wraparound prevention machinery, we need to have an
> effective max table age (so that autovacuum wakes up and scans all
> tables) and min freeze age (so that it actually freezes the tuples)
> below that number.  So we have to ignore the GUC minimum in this
> situation.
>

I understand that point, but I mentioned it because, if there is some
specific reason for keeping the current minimum value, we should verify
that we have not broken it by not honouring the GUC's minimum.  As far
as I can see from the code, there is one place (see below) where that
value is used to calculate the warning limit for multixacts, and the
current patch doesn't seem to have any impact on it.

SetMultiXactIdLimit()
{
..
multiWarnLimit = multiStopLimit - 10000000;
}


> ...
>
> Observations:
>
> 1.  Sometimes the values don't change from minute to minute,
> presumably because there hasn't been a checkpoint to update
> pg_controldata on disk, but hopefully we can still see what's going on
> here despite the slight lag in the data.
>

Yeah, and I think this means that there will be no advancement of the
oldest multixact ID and no deletion of files until the next checkpoint
(e.g. when checkpoints happen only on a timeout).  I think there is no
harm in stating this in the documentation if it is not already there.

> 2.  We get to somewhere in the 73-75% SLRU used range before
> wraparound vacuums are triggered.  We probably need to spread things
> out more than that.
>
> 3.  When the autovacuum runs, it advances oldest_mxid by different
> amounts each time; that's because I'm using the adjusted freeze max
> age (the max age of a table before it gets a wraparound vacuum) as our
> freeze min age (the max age for individual tuples before they're
> frozen) here:
>
> @@ -1931,7 +1964,9 @@ do_autovacuum(void)
>   {
>   default_freeze_min_age = vacuum_freeze_min_age;
>   default_freeze_table_age = vacuum_freeze_table_age;
> - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
> + default_multixact_freeze_min_age =
> + Min(vacuum_multixact_freeze_min_age,
> + autovacuum_multixact_freeze_max_age_adjusted());
>   default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
>   }
>
> Without that change, autovacuum would trigger repeatedly as we got
> near 75% SLRU usage but not freeze anything, because
> default_multixact_freeze_min_age was higher than the age of any tuples
> (which had only made it to an age of around ~12 million; actually it's
> not exactly the tuple age per se... I don't fully understand the
> treatment of locker and updater multixact IDs in the vacuum code,
> HeapTupleSatisfiesVacuum and heap_freeze_tuple etc yet so I'm not sure
> exactly how that value translates into vacuum work, but I can see
> experimentally that a low multixact freeze min age is needed to get
> relminxmid moved forward).
>
> It's good that freeze table age ramps down so that the autovacuum
> launcher trigger point jumps around a bit and we spread the autovacuum
> launches over time, but it's not great that we finish up truncating
> different amounts of multixacts and associated SLRU each time.  We
> could instead use a freeze min age of 0 to force freezing of *all*
> tuples if this is a member-space-wraparound-prevention vacuum (that
> is, if autovacuum_multixact_freeze_max_age !=
> autovacuum_multixact_freeze_max_age_adjusted()).

We already cap vacuum_multixact_freeze_min_age at half of
autovacuum_multixact_freeze_max_age so that autovacuums to
prevent MultiXact wraparound won't occur too frequently, as per the
code below:

vacuum_set_xid_limits()
{
..
    mxid_freezemin = Min(mxid_freezemin,
                         autovacuum_multixact_freeze_max_age / 2);
    Assert(mxid_freezemin >= 0);
..
}

Now if we set it to zero, then I think it might lead to excessive
freezing and in turn more I/O without an actual need (i.e., without
needing more space for multixact members).

>
> There is less to say about the results with an unpatched server: it
> drives in a straight line for a while, and then crashes into a wall
> (ie the new error preventing member wraparound), which I see you have
> also reproduced.  It's used up all of the circular member space, but
> only has around 17 million multixacts so autovacuum can't help you
> (it's not even possible to set autovacuum_multixact_freeze_max_age
> below 100 million), so to get things moving again you need to manually
> VACUUM FREEZE all databases including template databases.
>

In my tests, on setting the vacuum multixact parameters
(vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
to zero, the test finished successfully (no warnings, and I could
see truncation of files in the members directory), so one might argue
that in many cases one could reclaim member space just by setting
appropriate values for the vacuum_multixact_* parameters.  However, I feel
it is better to have an auto-adjustment algorithm like the one this patch
implements, so that even if those values are not set appropriately, it can
avoid the wraparound error.  I think the only thing we need to be
cautious about is that the new calculation should not make things worse
(less aggressive) in the case of lower vacuum_multixact_* settings.

Amit Kapila

May 3, 2015, 12:40:49 AM
On Sat, May 2, 2015 at 11:46 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Sat, May 2, 2015 at 7:08 AM, Robert Haas <rober...@gmail.com> wrote:

> > On Fri, May 1, 2015 at 6:51 AM, Thomas Munro
> > <thomas...@enterprisedb.com> wrote:
> >> Those other places are for capping the effective table and tuple
> >> multixact freeze ages for manual vacuums, so that manual vacuums (say
> >> in nightly cronjobs) get a chance to run wraparound scans before
> >> autovacuum kicks in at a less convenient time.  So, yeah, I think we
> >> want to incorporate member wraparound prevention into that logic, and
> >> I will add that in the next version of the patch.
> >
> > +1.  On a quick read-through of the patch, the biggest thing that
> > jumped out at me was that it only touches the autovacuum logic.
>
>
> Also attached is the output of the monitor.sh script posted upthread,
> while running explode_mxact_members.c.  It looks better than the last
> results to me: whenever usage reaches 50%, autovacuum advances things
> such that usage drops right back to 0% (because it now uses
> multixact_freeze_min_age = 0) , and the system will happily chug on
> forever.  What this test doesn't really show adequately is that if you
> had a lot of different tables and databases with different relminmxid
> values, they'd be vacuumed at different times.  I should probably come
> up with a way to demonstrate that...
>

Regarding the data, I have extracted the parts where there is a change in
oldest_mxid and segments:

time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset

13:48:36 1 0 16 1 1 0
13:49:36 369 .0044 94752 1 1 0
..
14:44:04 41703 .5083 10713400 1 8528909 2140755909

14:45:05 1374 .0167 352960 8573819 8722521 2189352521
..
15:37:16 41001 .4997 10529528 8573819 17060811 4282263311
..
15:38:16 709 .0086 182056 17132168 17254423 35892627
..
16:57:15 41440 .5051 10644712 17132168 25592713 2128803417
..
16:58:16 1120 .0136 287416 25695507 25786824 2177525278

Based on this data, it seems that truncation of member space,
as well as advancement of the oldest multixact ID, happens once
usage reaches 50%, at which point the segment count drops to almost
zero.  This repeats roughly every hour, with no progress in between,
which indicates that all the work happens in one go rather than being
spread out.  Won't the resulting I/O choke the system when this
happens?  Isn't it better to design it such that the work is spread
over a period of time rather than done all at once?

-- 
+int
+compute_max_multixact_age_to_avoid_member_wrap(bool manual)
{
..
+ if (members <= safe_member_count)
+ {
+     /*
+      * There is no danger of member wrap, so return a number that is not
+      * lower than autovacuum_multixact_freeze_max_age.
+      */
+     return -1;
+ }
..

The above code doesn't seem to match its comment.
The comment says "..not lower than autovacuum_multixact_freeze_max_age",
but the code then returns -1.  It seems to me we should return
autovacuum_multixact_freeze_max_age unchanged, as it was coded in the
initial version of the patch.  Do you have any specific reason to change it?

Amit Kapila

May 4, 2015, 2:26:11 AM
On Mon, May 4, 2015 at 5:19 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:

>
> On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> > --
> > +int
> > +compute_max_multixact_age_to_avoid_member_wrap(bool manual)
> > {
> > ..
> > + if (members <= safe_member_count)
> > + {
> > + /*
> > + * There is no danger of member wrap, so return a number that is not
> > + * lower than autovacuum_multixact_freeze_max_age.
> > + */
> > + return -1;
> > + }
> > ..
> >
> > The above code doesn't seem to match its comment.
> > The comment says "..not lower than autovacuum_multixact_freeze_max_age",
> > but the code then returns -1.  It seems to me we should return
> > autovacuum_multixact_freeze_max_age unchanged, as it was coded in the
> > initial version of the patch.  Do you have any specific reason to change it?
>
> Oops, the comment is fixed in the attached patch.
>
> In an earlier version, I was only dealing with the autovacuum case.
> Now that the VACUUM command also calls it, I didn't want this
> compute_max_multixact_age_to_avoid_member_wrap function to assume that
> it was being called by autovacuum code and return the
> autovacuum-specific GUC in the case that no special action is needed.
> Also, the function no longer computes a value by scaling
> autovacuum_multixact_freeze_max_age, it now scales the current number
> of active multixacts, so that we can begin selecting a small non-zero
> number of tables to vacuum as soon as we exceed safe_member_count as
> described above 

I am slightly worried that if, for scaling, we don't consider the
multixact_*_age values configured by the user, VACUUM/autovacuum might
behave totally differently from what the user expects.  Basically,
behavior will be dominated by member space usage and will ignore
the values the user set for the multixact_*_age parameters.  One way
to handle this could be to use the minimum of the value calculated from
member space and the value the user specified for the multixact-related
parameters, as suggested in points 1 and 2 (below in this mail).

One more thing: I think the current calculation considers member
usage; shouldn't we try to consider offset usage as well?


> (whereas when we used a scaled down
> autovaccum_multixact_freeze_max_age, we usually didn't select any
> tables at all until we scaled it down a lot, ie until we got close to
> dangerous_member_count).  Finally, I wanted a special value like -1
> for 'none' so that table_recheck_autovac and ExecVacuum could use a
> simple test >= 0 to know that they also need to set
> multixact_freeze_min_age to zero in the case of a
> member-space-triggered vacuum, so that we get maximum benefit from our
> table scans by freezing all relevant tuples, not just some older ones
>

I think setting multixact_freeze_min_age to zero could be too aggressive
for I/O.  Yes, with this you can get maximum benefit, but at the cost of
increased I/O.  How would you justify setting it to zero as appropriate
w.r.t. the increased I/O?

Few more observations:

1.
@@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid,
  ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
  : autovacuum_multixact_freeze_max_age;

+ /* Special settings if we are running out of member address space. */
+ if (max_multixact_age_to_avoid_member_wrap >= 0)
+     multixact_freeze_max_age = max_multixact_age_to_avoid_member_wrap;
+

Isn't it better to use the minimum of the already-computed
multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?

multixact_freeze_max_age = Min(multixact_freeze_max_age, max_multixact_age_to_avoid_member_wrap);

A similar change needs to be made in table_recheck_autovac().

2.
@@ -1118,7 +1197,12 @@ do_start_worker(void)
 
  /* Also determine the oldest datminmxid we will consider. */
  recentMulti = ReadNextMultiXactId();
- multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
+ max_multixact_age_to_avoid_member_wrap =
+ compute_max_multixact_age_to_avoid_member_wrap(false);
+ if (max_multixact_age_to_avoid_member_wrap >= 0)
+ multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap;
+ else
+ multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;

Here also, isn't it better to use the minimum of
autovacuum_multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?

3. 
+int
+compute_max_multixact_age_to_avoid_member_wrap(bool manual)
+{
+ MultiXactOffset members;
+ uint32 multixacts;
+ double fraction;
+ MultiXactOffset safe_member_count = MaxMultiXactOffset / 2;

It is not completely clear which value is more appropriate
for safe_member_count (25% or 50%).  Does anybody else have an
opinion on this value?

4. Once we settle on the final algorithm, we should update the
docs as well, probably in the description at the link below:

Thomas Munro

May 4, 2015, 3:12:49 AM
On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> [...]
> One more thing, I think the current calculation considers members
> usage, shouldn't we try to consider offset usage as well?

Offsets are indexed by multixact ID:

#define MultiXactIdToOffsetPage(xid) \
((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) \
((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)

The existing multixact wraparound prevention code is already managing
the 32 bit multixact ID space. The problem with members comes about
because each one of those multixact IDs can have arbitrary numbers of
members, and yet the members are also addressed with a 32 bit index.
So we are trying to hijack the multixact ID wraparound prevention and
make it more aggressive if member space appears to be running out.
(Perhaps in future there should be a 64 bit index for member indexes
so that this problem disappears?)

>> (whereas when we used a scaled down
>> autovaccum_multixact_freeze_max_age, we usually didn't select any
>> tables at all until we scaled it down a lot, ie until we got close to
>> dangerous_member_count). Finally, I wanted a special value like -1
>> for 'none' so that table_recheck_autovac and ExecVacuum could use a
>> simple test >= 0 to know that they also need to set
>> multixact_freeze_min_age to zero in the case of a
>> member-space-triggered vacuum, so that we get maximum benefit from our
>> table scans by freezing all relevant tuples, not just some older ones
>>
>
> I think setting multixact_freeze_min_age to zero could be too aggressive
> for I/O. Yes, with this you can get maximum benefit, but at the cost of
> increased I/O. How would you justify setting it to zero as appropriate
> w.r.t. the increased I/O?

I assumed that if you were already vacuuming all your tables to avoid
running out of member space, you would want to freeze any tuples you
possibly could to defer the next wraparound scan for as long as
possible, since wraparound scans are enormously expensive.

> Few more observations:
>
> 1.
> @@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid,
> ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
> : autovacuum_multixact_freeze_max_age;
>
> + /* Special settings if we are running out of member address space. */
> + if (max_multixact_age_to_avoid_member_wrap >= 0)
> +     multixact_freeze_max_age = max_multixact_age_to_avoid_member_wrap;
> +
>
> Isn't it better to use the minimum of the already-computed
> multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?
>
> multixact_freeze_max_age = Min(multixact_freeze_max_age,
> max_multixact_age_to_avoid_member_wrap);

Except that I am using -1 as a special value. But you're right, I
guess it should be like this:

if (max_multixact_age_to_avoid_member_wrap >= 0)
multixact_freeze_max_age = Min(multixact_freeze_max_age,
max_multixact_age_to_avoid_member_wrap);

> Similar change needs to be done in table_recheck_autovac()
>
> 2.
> @@ -1118,7 +1197,12 @@ do_start_worker(void)
>
> /* Also determine the oldest datminmxid we will consider. */
> recentMulti = ReadNextMultiXactId();
> - multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
> + max_multixact_age_to_avoid_member_wrap =
> + compute_max_multixact_age_to_avoid_member_wrap(false);
> + if (max_multixact_age_to_avoid_member_wrap >= 0)
> + multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap;
> + else
> + multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
>
> Here also, isn't it better to use the minimum of
> autovacuum_multixact_freeze_max_age
> and max_multixact_age_to_avoid_member_wrap?

Yeah, with the same proviso about -1.

> 3.
> +int
> +compute_max_multixact_age_to_avoid_member_wrap(bool manual)
> +{
> + MultiXactOffset members;
> + uint32 multixacts;
> + double fraction;
> + MultiXactOffset safe_member_count = MaxMultiXactOffset / 2;
>
> It is not completely clear what is more appropriate value
> for safe_member_count (25% or 50%). Anybody else have any
> opinion on this value?
>
> 4. Once we conclude on final algorithm, we should update the
> same in docs as well, probably in description at below link:
> http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND

Agreed.

--
Thomas Munro
http://www.enterprisedb.com


Amit Kapila

May 4, 2015, 7:49:49 AM
On Mon, May 4, 2015 at 12:42 PM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.k...@gmail.com> wrote:
> > [...]
> > One more thing, I think the current calculation considers members
> > usage, shouldn't we try to consider offset usage as well?
>
> Offsets are indexed by multixact ID:
>
> #define MultiXactIdToOffsetPage(xid) \
>         ((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
> #define MultiXactIdToOffsetEntry(xid) \
>         ((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
>
> The existing multixact wraparound prevention code is already managing
> the 32 bit multixact ID space.  The problem with members comes about
> because each one of those multixact IDs can have arbitrary numbers of
> members, and yet the members are also addressed with a 32 bit index.
> So we are trying to hijack the multixact ID wraparound prevention and
> make it more aggressive if member space appears to be running out.
> (Perhaps in future there should be a 64 bit index for member indexes
> so that this problem disappears?)
>

Okay, that makes sense.

> >> (whereas when we used a scaled down
> >> autovaccum_multixact_freeze_max_age, we usually didn't select any
> >> tables at all until we scaled it down a lot, ie until we got close to
> >> dangerous_member_count).  Finally, I wanted a special value like -1
> >> for 'none' so that table_recheck_autovac and ExecVacuum could use a
> >> simple test >= 0 to know that they also need to set
> >> multixact_freeze_min_age to zero in the case of a
> >> member-space-triggered vacuum, so that we get maximum benefit from our
> >> table scans by freezing all relevant tuples, not just some older ones
> >>
> >
> > I think setting multixact_freeze_min_age to zero could be too aggressive
> > for I/O.  Yes, with this you can get maximum benefit, but at the cost of
> > increased I/O.  How would you justify setting it to zero as appropriate
> > w.r.t. the increased I/O?
>
> I assumed that if you were already vacuuming all your tables to avoid
> running out of member space, 

I think here you mean all tables that have relminmxid less than the
newly computed age (compute_max_multixact_age_to_avoid_member_wrap).

> you would want to freeze any tuples you
> possibly could to defer the next wraparound scan for as long as
> possible, since wraparound scans are enormously expensive.
>

The point is valid to an extent, but if we go by this logic, then we
should also currently set multixact_freeze_min_age to zero for
wraparound vacuums.

Robert Haas

May 4, 2015, 2:46:43 PM
On Sat, May 2, 2015 at 2:16 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Here's a new version which sets up the multixact parameters in
> ExecVacuum for regular VACUUM commands just like it does for
> autovacuum if needed. When computing
> max_multixact_age_to_avoid_member_wrap for a manual vacuum, it uses
> lower constants, so that any manually scheduled vacuums get a chance
> to deal with some of this problem before autovacuum has to. Here are
> the arbitrary constants currently used: at 50% member address space
> usage, autovacuum starts wraparound scan of tables with the oldest
> active multixacts, and then younger ones as the usage increases, until
> at 75% usage it vacuums with multixact_freeze_table_age = 0; for
> manual VACUUM those numbers are halved so that it has a good head
> start.

I think the 75% threshold for reducing multxact_freeze_table_age to
zero is fine, but I don't agree with the 50% cutoff. The purpose of
autovacuum_multixact_freeze_max_age is to control the fraction of the
2^32-entry offset space that can be consumed before we begin viewing
the problem as urgent. We have a setting for that because it needs to
be tunable, and the default value for that setting is 400 million,
which is roughly 10% of the members space. That is a whole lot lower
than the 50% threshold you are proposing here. Moreover, it leaves
the user with no meaningful choice: if the 50% threshold consumes too
much disk space, or doesn't leave enough room before we hit the wall,
then the user is simply hosed. This is why I initially proposed that
the member-space-consumption-percentage at which we start derating
multixact_freeze_table_age should be based on
autovacuum_multixact_freeze_max_age/2^32. That way,
autovacuum_multixact_freeze_max_age controls not only how aggressively
we try to reclaim offset space but also how aggressively we try to
reclaim member space. The user can then tune the value, and the
default is the same in both cases.

I also think that halving the numbers for manual vacuums is arbitrary
and unprecedented. The thought process isn't bad, but an autovacuum
currently behaves in most respects like a manual vacuum, and I'm
reluctant to make those more different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas

May 4, 2015, 3:01:20 PM
On Sat, May 2, 2015 at 7:22 AM, Amit Kapila <amit.k...@gmail.com> wrote:
>> 3. When the autovacuum runs, it advances oldest_mxid by different
That point is certainly worthy of some consideration. Letting the
freeze xmin get set to half of the (effective)
autovacuum_multixact_freeze_max_age would certainly be more consistent
with what we do elsewhere. The policy trade-off is not as
straightforward as you are making it out to be, though:

1. Using a min freeze age of zero will result in half as many
full-table scans, because we'll advance relminmxid twice as far each
time.

2. But each one will freeze more stuff, some of which might have been
updated again before the next freeze pass, so we might do more
freezing in total.

So either policy might win, depending on whether you care more about
reducing reads (in which case you want a very low min freeze age) or
about reducing writes (in which case you want a higher min freeze
age).

All things being equal, I'd rather stick with the existing 50% policy
in the back-branches, rather than going to zero, but I'm not sure all
things are equal. It matters what difference the higher value makes.

Robert Haas

May 4, 2015, 3:06:09 PM
On Sun, May 3, 2015 at 7:49 PM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Restricting ourselves to selecting tables to vacuum using their
> relminmxid alone makes this patch small since autovacuum already works
> that way. We *could* introduce code that would be able to spread out
> the work of vacuuming tables that happen to have identical or very
> close relminmxid (say by introducing some non-determinism or doing
> something weird based on hashing table oids and the time to explicitly
> spread the start of processing over time, or <your idea here>), but I
> didn't want to propose anything too big/complicated/clever/stupid and
> I suspect that the relminmxid values will tend to diverge over time
> (but I could be wrong about that, if they all start at 1 and then move
> forward in lockstep over long periods of time then what I propose is
> not good enough... let's see if we can find out).

So, the problem of everything moving in lockstep is one we already
have. It's actually a serious operational problem for relfrozenxid,
because you might restore your database from pg_dump or similar and
every table will have a very similar relfrozenxid and so then the
anti-wraparound logic fires for all of them at the same time. There
might be cases where MXIDs behave the same way, although I would think
it would be less common.

Anyway, solving that problem would be nice (particularly for xmin!),
but we shouldn't get into that with relation to this bug fix. It's a
problem, but one that will probably take a good deal of work to solve,
and certainly not something we would back-patch.

Alvaro Herrera

May 4, 2015, 4:36:21 PM
Thomas Munro wrote:

> FWIW, in some future release, I think we should consider getting a
> bigger multixact member address space that wraps around at 2^48 or
> 2^64 instead of 2^32, so that we can sidestep the whole business and
> go back to having just xid and mxid wraparounds to worry about.
> pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two
> per multixact), but it's not very big. pg_multiact/members would be
> no bigger for any workload that currently works without hitting the
> wraparound error, but could grow bigger if needed.

Not sure that enlarging the addressable area to 48/64 bits is feasible,
TBH. We already have many complaints that multixacts take too much disk
space; we don't want to make that 2^32 times worse, not even 2^16 times
worse. I don't understand why you say it'd become 1 byte bigger per
multixact; it would have to be 4 more bytes (2^64) or 2 more bytes
(2^48), no? If you have 150 million multixacts (the default freeze
table age) that would mean about 300 or 600 MB of additional disk space,
which is not insignificant: with the current system, in an database with
normal multixact usage of 4 members per multixact, members/ would use
about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to
raise some more complaints.

(The 2^48 suggestion might be a tad more difficult to implement, note,
because a lot of stuff relies on unsigned integer wraparound addition,
and I'm not sure we can have that with a 2^48 counter. Maybe we could
figure how to make it work, but is it worth the bother?)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

May 4, 2015, 4:37:31 PM
Robert Haas wrote:

> Anyway, solving that problem would be nice (particularly for xmin!),
> but we shouldn't get into that with relation to this bug fix. It's a
> problem, but one that will probably take a good deal of work to solve,
> and certainly not something we would back-patch.

+1

Kevin Grittner

May 4, 2015, 4:42:00 PM
Robert Haas <rober...@gmail.com> wrote:

> 1. Using a min freeze age of zero will result in half as many
> full-table scans, because we'll advance relminmxid twice as far
> each time.
>
> 2. But each one will freeze more stuff, some of which might have
> been updated again before the next freeze pass, so we might do
> more freezing in total.
>
> So either policy might win, depending on whether you care more
> about reducing reads (in which case you want a very low min
> freeze age) or about reducing writes (in which case you want a
> higher min freeze age).
>
> All things being equal, I'd rather stick with the existing 50%
> policy in the back-branches, rather than going to zero, but I'm
> not sure all things are equal. It matters what difference the
> higher value makes.

I really don't like the "honor the configured value of
vacuum_multixact_freeze_min_age until the members SLRU gets to 50%
of wraparound and then use zero" approach. It made a lot more
sense to me to honor the configured value to 25% and decrease it in
a linear fashion until it hit zero at 75%. It seems like maybe we
weren't aggressive enough in the dynamic adjustment of
autovacuum_multixact_freeze_max_age, but I'm not clear why fixing
that required the less gradual adjustment of the *_min_age setting.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Alvaro Herrera

May 4, 2015, 4:59:25 PM
So I might have understood an earlier description of the proposed
solution all wrong, or this patch was designed without consideration to
that description. What I thought would happen is that all freeze ages
would get multiplied by some factor <= 1, depending on the space used up
by members. If members space usage is low enough, factor would remain
at 1 so things would behave as today. If members space usage is larger
than X, the factor decreases smoothly and this makes freeze_min_age and
freeze_max_age decrease smoothly as well, for all vacuums equally.

For instance, we could choose a method to compute X based on considering
that a full 2^32 storage area for members is enough to store one
vacuum_multixact_freeze_table_age cycle of multixacts. The default
value of this param is 150 million, and 2^32/150000000 = 28; so if your
average multixact size = 38, you would set the multiplier at 0.736 and
your effective freeze_table_age would become 110 million and effective
freeze_min_age would become 3.68 million.
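The arithmetic in that example can be sketched as a hypothetical helper (the name and rounding are illustrative; like the text above, it truncates 2^32/150000000 to 28):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the proposed multiplier: one
 * freeze_table_age cycle of multixacts must fit in the 2^32 member
 * address space, so the sustainable average multixact size is
 * 2^32 / freeze_table_age; larger averages shrink all freeze ages
 * proportionally.
 */
static double
freeze_age_multiplier(double avg_members_per_mxact, uint32_t freeze_table_age)
{
    /* whole members-per-multixact that fit in one cycle (truncated) */
    double sustainable = (double) (UINT64_C(4294967296) / freeze_table_age);

    if (avg_members_per_mxact <= sustainable)
        return 1.0;             /* member space is not the bottleneck */
    return sustainable / avg_members_per_mxact;
}
```

With the default freeze_table_age of 150 million and an average multixact size of 38, this yields 28/38 ≈ 0.736, matching the worked numbers above.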


As a secondary point, I find variable-names-as-documentation bad
practice. Please don't use a long name such as
max_multixact_age_to_avoid_member_wrap; code becomes unwieldy. A short
name such as safe_mxact_age preceded by a comment /* this variable is
the max that avoids member wrap */ seems more palatable; side-by-side
merges and all that! I don't think long function names are as
problematic (though the name of your new function is still a bit too
long).

Please note that 9.4 and earlier do not have ExecVacuum; the
determination of freeze ages is done partly in gram.y (yuck). Not sure
what the patch will look like in those branches.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

May 4, 2015, 5:30:31 PM
Alvaro Herrera wrote:

> For instance, we could choose a method to compute X based on considering
> that a full 2^32 storage area for members is enough to store one
> vacuum_multixact_freeze_table_age cycle of multixacts. The default
> value of this param is 150 million, and 2^32/150000000 = 28; so if your
> average multixact size = 38, you would set the multiplier at 0.736 and
> your effective freeze_table_age would become 110 million and effective
> freeze_min_age would become 3.68 million.

Actually, apologies --- this is not what I was thinking at all. I got
distracted while I was writing the previous email. My thinking was that
the values would be at their normal defaults when the wraparound is
distant, and the multiplier would start to become slightly less than 1
as the counter moves towards wraparound; by the time we're at an
emergency i.e. we reach max_freeze_age, the values naturally become zero
(or perhaps just before we reach max_freeze_age, the values were 50% of
their normal values, so the drop to zero is not as dramatic). Since
this is gradual, the behavior is not as jumpy as in the proposed patch.

Anyway this is in line with what Kevin is saying elsewhere: we shouldn't
just use the normal values all the time just up to the freeze_max_age
point; there should be some gradual ramp-up.

Perhaps we can combine this with the other idea of using a multiplier
connected to average size of multixact, if it doesn't become too
complicated, surprising, or both.

Thomas Munro

May 4, 2015, 6:51:47 PM
On Tue, May 5, 2015 at 8:36 AM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> FWIW, in some future release, I think we should consider getting a
>> bigger multixact member address space that wraps around at 2^48 or
>> 2^64 instead of 2^32, so that we can sidestep the whole business and
>> go back to having just xid and mxid wraparounds to worry about.
>> pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two
>> per multixact), but it's not very big. pg_multiact/members would be
>> no bigger for any workload that currently works without hitting the
>> wraparound error, but could grow bigger if needed.
>
> Not sure that enlarging the addressable area to 48/64 bits is feasible,
> TBH. We already have many complaints that multixacts take too much disk
> space; we don't want to make that 2^32 times worse, not even 2^16 times
> worse. I don't understand why you say it'd become 1 byte bigger per
> multixact; it would have to be 4 more bytes (2^64) or 2 more bytes
> (2^48), no? If you have 150 million multixacts (the default freeze
> table age) that would mean about 300 or 600 MB of additional disk space,
> which is not insignificant: with the current system, in a database with
> normal multixact usage of 4 members per multixact, members/ would use
> about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to
> raise some more complaints.

Right, sorry, I must have been thinking of 40 bit or 48 bit indexes
when I said 1 or 2 bytes.

I can't help thinking there must be a different way to do this that
takes advantage of the fact that multixacts are often created by
copying all the members of an existing multixact and adding one new
one, so that there is a lot of duplication and churn (at least when
you have a workload that generates bigger multixacts, due to the
O(n^2) process of building them up xid by xid).

Maybe there is a way to store a pointer to some other multixact + a
new xid in a chain structure, but I don't see how to do the cleanup
when you have active multixacts with backwards references to older
multixacts.

Maybe you could find some way to leave gaps in member space (perhaps
by making member index point to member groups with space for 4 or 8
member xids), and MultiXactIdExpand could create new multixacts that
point to the same member offset but a different size so they see extra
members, but that would also waste disk space, be hard to synchronize
and you'd need to fall back to copying the members into new member
space when the spare space is filled anyway.

Alvaro Herrera

May 4, 2015, 9:03:43 PM
Thomas Munro wrote:

> I can't help thinking there must be a different way to do this that
> takes advantage of the fact that multixacts are often created by
> copying all the members of an existing multixact and adding one new
> one, so that there is a lot of duplication and churn (at least when
> you have a workload that generates bigger multixacts, due to the
> O(n^2) process of building them up xid by xid).

Yeah, Simon expressed the same thought to me some months ago, and I gave
it some think-time (but not at lot of it TBH). I didn't see any way to
make it workable.

Normally, lockers go away reasonably quickly, so some of the original
members of the multixact are disappearing all the time. Maybe one way
would be to re-use a multixact you have in your local cache, as long as
the only difference with the multixact you want is some locker
transaction(s) that have already ended. Not sure how you would manage
the cache, though.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Amit Kapila

May 4, 2015, 11:57:12 PM
On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
>
>
> Please note that 9.4 and earlier do not have ExecVacuum; the
> determination of freeze ages is done partly in gram.y (yuck).  Not sure
> what the patch will look like in those branches.
>

One way to make the fix back-patchable is to consider doing the changes
for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits())?
However, I think we might need to distinguish whether the call is from
the Vacuum or the AutoVacuum path.

Alvaro Herrera

May 5, 2015, 9:37:39 AM
Amit Kapila wrote:
> On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvh...@2ndquadrant.com>
> wrote:
> >
> >
> > Please note that 9.4 and earlier do not have ExecVacuum; the
> > determination of freeze ages is done partly in gram.y (yuck). Not sure
> > what the patch will look like in those branches.
>
> One way to make fix back-patchable is to consider doing the changes
> for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits())?
> However, I think we might need to distinguish whether the call is from
> Vacuum or AutoVacuum path.

I think it's easier if we just adjust the patch in older branches to
affect the code that now lives in ExecVacuum. Trying to make all
branches the same will probably make the whole thing more complicated,
for no real purpose.

Robert Haas

May 5, 2015, 5:26:44 PM
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Ok, the new patch uses 25% as the safe threshold, and then scales
> multixact_freeze_table_age down from the current number of active
> multixacts (ie to select the minimum number of tables) progressively
> to 0 (to select all tables) when you reach 75% usage.
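The quoted scheme is a linear ramp. A rough sketch, under the simplifying assumption that we scale from a given starting age (per the description above, the patch starts from the current number of active multixacts rather than a GUC value):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative linear ramp (not the actual patch code): below 25% of
 * member space used, keep the starting age; at 75% or more, drop to
 * zero (select all tables); interpolate linearly in between.
 */
static uint32_t
ramped_freeze_table_age(double member_fraction_used, uint32_t start_age)
{
    const double safe = 0.25;
    const double danger = 0.75;

    if (member_fraction_used <= safe)
        return start_age;
    if (member_fraction_used >= danger)
        return 0;
    return (uint32_t) (start_age *
                       (danger - member_fraction_used) / (danger - safe));
}
```

So at 50% member usage the effective table age is half the starting value, reaching zero exactly at the 75% danger mark.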

I definitely think that 25% is better than 50%. But see below.

> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
> before wraparound vacuum, which is ~10% of 2^32, we would interpret
> that to mean 400 million multixacts OR ~10% * some_constant of member
> space, in other words autovacuum_freeze_max_age * some_constant
> members, whichever comes first. But what should some_constant be?

some_constant should be all the member space there is. So we trigger
autovac if we've used more than ~10% of the offsets OR more than ~10%
of the members. Why is autovacuum_multixact_freeze_max_age
configurable in the first place? It's configurable so that you can set it
low enough that wraparound scans complete and advance the minmxid
before you hit the wall, but high enough to avoid excessive scanning.
The only problem is that it only lets you configure the amount of
headroom you need for offsets, not members. If you squint at what I'm
proposing the right way, it's essentially that that GUC should control
both of those things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Thomas Munro

May 5, 2015, 5:47:07 PM
On Wed, May 6, 2015 at 9:26 AM, Robert Haas <rober...@gmail.com> wrote:
> On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
>> before wraparound vacuum, which is ~10% of 2^32, we would interpret
>> that to mean 400 million multixacts OR ~10% * some_constant of member
>> space, in other words autovacuum_freeze_max_age * some_constant
>> members, whichever comes first. But what should some_constant be?
>
> some_constant should be all the member space there is. So we trigger
> autovac if we've used more than ~10% of the offsets OR more than ~10%
> of the members. Why is autovacuum_multixact_freeze_max_age
> configurable in the first place? It's configurable so that you can set it
> low enough that wraparound scans complete and advance the minmxid
> before you hit the wall, but high enough to avoid excessive scanning.
> The only problem is that it only lets you configure the amount of
> headroom you need for offsets, not members. If you squint at what I'm
> proposing the right way, it's essentially that that GUC should control
> both of those things.

But member space *always* grows at least twice as fast as offset space
(aka active multixact IDs), because multixacts always have at least 2
members (except in some rare cases IIUC), don't they? So if we do
what you just said, then we'll trigger wraparound vacuums twice as
soon as we do now for everybody, even people who don't have any
problem with member space management. We don't want this patch to
change anything for most people, let alone everyone. So I think that
some_constant should be at least 2, if we try to do it this way, in
other words if you set the GUC for 10% of offset space, we also start
triggering wraparounds at 20% of member space. The code in
MultiXactCheckMemberSpace would just say safe_member_count =
autovacuum_multixact_freeze_max_age * 2, where 2 is some_constant (this
number is the average number of multixact members below which your
workload will be unaffected by the new autovac behaviour).

--
Thomas Munro
http://www.enterprisedb.com


Kevin Grittner

May 5, 2015, 6:37:14 PM
That, I think, is what has been driving this patch away from just
considering the *_multixact_* settings as applying to both the
members SLRU and the offsets SLRU; that would effectively simply
change the monitored resource from one to the other. (We would
probably want to actually use the max of the two, just to be safe,
but that offsets might never actually be the trigger.) As Thomas
says, that would be a big change for everyone, and not everyone
necessarily *wants* their existing settings to have new and
different meanings.

> So I think that
> some_constant should be at least 2, if we try to do it this way, in
> other words if you set the GUC for 10% of offset space, we also start
> triggering wraparounds at 20% of member space.

But what if they configure it to start at 80% (which I *have* seen
people do)?

The early patches were a heuristic to attempt to allow current
behavior for those not getting into trouble, and gradually ramp up
aggressiveness as needed to prevent hitting the hard ERROR that now
prevents wraparound. Perhaps, rather than reducing the threshold
gradually, as the members SLRU approaches wraparound we could
gradually shift from using offsets to members as the number we
compare the thresholds to. Up to 25% of maximum members, or if
offset is somehow larger, we just use offsets; else above 75%
maximum members we use members; else we use a weighted average
based on how far we are between 25% and 75%. It's kinda weird, but
I think it gives us a reasonable way to ramp up vacuum
aggressiveness from what we currently do toward what Robert
proposed based on whether the workload is causing things to head
for trouble.
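That weighted shift could look roughly like this (a hypothetical function; the 25% and 75% thresholds and the max-of-the-two safeguard are as described above):

```c
#include <assert.h>

/*
 * Hypothetical sketch of the blending idea: below 25% member usage,
 * compare thresholds against offset consumption; above 75%, against
 * member consumption (assumed scaled to the same units); in between,
 * take a weighted average -- never dropping below the offsets figure.
 */
static double
blended_consumption(double offsets_used, double members_used,
                    double member_fraction_of_max)
{
    double w;                   /* weight given to the members figure */
    double blend;

    if (member_fraction_of_max <= 0.25)
        w = 0.0;
    else if (member_fraction_of_max >= 0.75)
        w = 1.0;
    else
        w = (member_fraction_of_max - 0.25) / 0.50;

    blend = (1.0 - w) * offsets_used + w * members_used;
    return blend > offsets_used ? blend : offsets_used;
}
```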

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Robert Haas

May 5, 2015, 9:53:18 PM
On Tue, May 5, 2015 at 5:46 PM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> But member space *always* grows at least twice as fast as offset space
> (aka active multixact IDs), because multixacts always have at least 2
> members (except in some rare cases IIUC), don't they?

Oh. *facepalm*

All right, so maybe the way you had it is best after all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com

Robert Haas

May 5, 2015, 10:08:51 PM
On Tue, May 5, 2015 at 6:36 PM, Kevin Grittner <kgr...@ymail.com> wrote:
>> So I think that
>> some_constant should be at least 2, if we try to do it this way, in
>> other words if you set the GUC for 10% of offset space, we also start
>> triggering wraparounds at 20% of member space.
>
> But what if they configure it to start at 80% (which I *have* seen
> people do)?

I might be confused here, but the upper limit for
autovacuum_multixact_freeze_max_age is 2 billion, so I don't think
this can ever be higher than 50%. Well, 46.5%, really, since 2^32 > 4
billion. autovacuum_freeze_max_age is similarly limited.

Robert Haas

May 5, 2015, 10:30:06 PM
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> Here's a new patch, with responses to several reviews.

Going back to this version...

+ * Based on the assumption that there is no reasonable way for an end user to
+ * configure the thresholds for this, we define the safe member count to be
+ * half of the member address space, and the dangerous level to be

but:

+ const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4;

Those don't match. Also, we usually use #define rather than const for
constants. I suggest we do that here, too.

+ int safe_multixact_age = MultiXactCheckMemberUsage();
+ if (safe_multixact_age >= 0)

Project style is to leave a blank line between these, I think.

I think you need to update the comments for relation_needs_vacanalyze().

The documentation in section 23.1.5.1, "Multixacts and Wraparound",
also needs updating.

Robert Haas

May 6, 2015, 6:45:13 AM
On Wed, May 6, 2015 at 6:26 AM, Thomas Munro
<thomas...@enterprisedb.com> wrote:
> On Wed, May 6, 2015 at 2:29 PM, Robert Haas <rober...@gmail.com> wrote:
>> + * Based on the assumption that there is no reasonable way for an end user to
>> + * configure the thresholds for this, we define the safe member count to be
>> + * half of the member address space, and the dangerous level to be
>>
>> but:
>>
>> + const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4;
>>
>> Those don't match. [...]
>
> Fixed/obsoleted in the attached patch. It has a dynamic
> safe_member_count based on scaling the GUC as described in my earlier
> email with the v7 patch; the behaviour with the default GUC value
> works out to a similar safe_member_count value, but this way it can be
> changed if needed, and we don't introduce any new GUCs. Also, since
> the GUC used in determining safe_member_count is either
> autovacuum_multixact_freeze_max_age or vacuum_multixact_freeze_max_age
> depending on which kind of vacuum it is, that is now a parameter
> passed into MultiXactCheckMemberUsage, so safe_member_count is no
> longer a constant.

To be honest, now that you've pointed out that the fraction of the
multixact members space that is in use will always be larger,
generally much larger, than the fraction of the offset space that is
in use, I've kind of lost all enthusiasm for making the
safe_member_count stuff dependent on
autovacuum_multixact_freeze_max_age. I'm inclined to go back to 25%,
the way you had it before.

We could think about adding a new GUC in master, but I'm actually
leaning toward the view that we should just hard-code 25% for now and
consider revising it later if that proves inadequate.

Amit Kapila

May 6, 2015, 9:18:46 AM
On Wed, May 6, 2015 at 3:56 PM, Thomas Munro <thomas...@enterprisedb.com> wrote:
>
> On Wed, May 6, 2015 at 2:29 PM, Robert Haas <rober...@gmail.com> wrote:

Few comments:

1.
+ /*
+ * Override the multixact freeze settings if we are running out of
+ * member address space.
+ */
+ if (safe_multixact_age >= 0)
+ {
+ multixact_freeze_table_age = Min(safe_multixact_age,
+ multixact_freeze_table_age);

+ /* Special settings if we are running out of member address space. */
+ if (safe_multixact_age >= 0)
+ multixact_freeze_max_age = Min(multixact_freeze_max_age, safe_multixact_age);
+


Some places use safe_multixact_age as the first parameter and some
places use it as the second. I think it is better to use the same
order for the sake of consistency.

2.
in the hope
+ * that different tables will be vacuumed at different times due to their
+ * varying relminmxid values.

Does the above line in the comment on top of MultiXactCheckMemberUsage()
make much sense?



3.
+ * we know the age of the oldest multixact in the system, so that's the
+ * value we want to when members is near safe_member_count.  It should

typo.
so that's the value we want to *use* when ..

Alvaro Herrera

May 6, 2015, 10:16:42 AM
I haven't read your patch, but I wonder if we should decrease the
default value of multixact_freeze_table_age (currently 150 million).
The freeze_min_age is 5 million; if freeze_table_age were a lot lower,
the problem would be less pronounced.

Additionally, I will backpatch commit 27846f02c176. The average size of
multixacts decreases with that fix in many common cases, which greatly
reduces the need for any of this in the first place.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Alvaro Herrera

May 6, 2015, 10:33:54 AM
Robert Haas wrote:

> So here's a new patch, based on your latest version, which looks
> reasonably committable to me.

I think this code should also reduce the multixact_freeze_min_age value
at the same time as multixact_freeze_table_age. If the table age is
reduced but freeze_min_age remains high, old multixacts might still
remain in the table. The default value for freeze min age is 5 million,
but users may change it. Perhaps freeze min age should be set to
Min(modified freeze table age, freeze min age) so that old multixacts
are effectively frozen whenever a full table scan is requested.

> 1. Should we be installing one or more GUCs to control this behavior?
> I've gone back to hard-coding things so that at 25% we start
> triggering autovacuum and by 75% we zero out the freeze ages, because
> the logic you proposed in your last version looks insanely complicated
> to me. (I do realize that I suggested the approach, but that was
> before I realized the full complexity of the problem.) I now think
> that if we want to make this tunable, we need to create and expose
> GUCs for it. I'm hoping we can get by without that, but I'm not sure.

I think things are complicated enough; I vote for no additional GUCs at
this point.

> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
> similar logic? Otherwise, a user with autovacuum=off won't get
> emergency autovacuums for member exhaustion, even though they will get
> them for offset exhaustion.

Yeah, it looks like it does.

Kevin Grittner

May 6, 2015, 12:16:04 PM
Alvaro Herrera <alvh...@2ndquadrant.com> wrote:
> Robert Haas wrote:

>> So here's a new patch, based on your latest version, which looks
>> reasonably committable to me.
>
> I think this code should also reduce the multixact_freeze_min_age value
> at the same time as multixact_freeze_table_age. If the table age is
> reduced but freeze_min_age remains high, old multixacts might still
> remain in the table. The default value for freeze min age is 5 million,
> but users may change it. Perhaps freeze min age should be set to
> Min(modified freeze table age, freeze min age) so that old multixacts
> are effectively frozen whenever a full table scan is requested

I would rather see min age reduced proportionally to table age, or
at least ensure that min age is some percentage below table age.

>> 1. Should we be installing one or more GUCs to control this behavior?
>> I've gone back to hard-coding things so that at 25% we start
>> triggering autovacuum and by 75% we zero out the freeze ages, because
>> the logic you proposed in your last version looks insanely complicated
>> to me. (I do realize that I suggested the approach, but that was
>> before I realized the full complexity of the problem.) I now think
>> that if we want to make this tunable, we need to create and expose
>> GUCs for it. I'm hoping we can get by without that, but I'm not sure.
>
> I think things are complicated enough; I vote for no additional GUCs at
> this point.

+1

For one thing, we should try to have something we can back-patch,
and new GUCs in a minor release seems like something to avoid, if
possible. For another thing, we've tended not to put in GUCs if
there is no reasonable way for a user to determine a good value,
and that seems to be the case here.

>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>> similar logic? Otherwise, a user with autovacuum=off won't get
>> emergency autovacuums for member exhaustion, even though they will get
>> them for offset exhaustion.
>
> Yeah, it looks like it does.

+1

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Robert Haas

May 6, 2015, 12:23:37 PM
On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera
<alvh...@2ndquadrant.com> wrote:
>> So here's a new patch, based on your latest version, which looks
>> reasonably committable to me.
>
> I think this code should also reduce the multixact_freeze_min_age value
> at the same time as multixact_freeze_table_age.

I think it does that. It sets the min age to half the value it sets
for the table age, which I think is consistent with what we do
elsewhere.
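Schematically, that clamping amounts to something like the following (a restatement for illustration, not the committed code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch: when member usage forces an effective table
 * age, cap the configured table age at that value and cap the min
 * age at half of it, so frozen-enough multixacts actually get
 * removed during the full-table scan.
 */
static void
clamp_freeze_ages(uint32_t effective_table_age,
                  uint32_t *freeze_table_age,
                  uint32_t *freeze_min_age)
{
    if (*freeze_table_age > effective_table_age)
        *freeze_table_age = effective_table_age;
    if (*freeze_min_age > effective_table_age / 2)
        *freeze_min_age = effective_table_age / 2;
}
```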

>> 1. Should we be installing one or more GUCs to control this behavior?
>> I've gone back to hard-coding things so that at 25% we start
>> triggering autovacuum and by 75% we zero out the freeze ages, because
>> the logic you proposed in your last version looks insanely complicated
>> to me. (I do realize that I suggested the approach, but that was
>> before I realized the full complexity of the problem.) I now think
>> that if we want to make this tunable, we need to create and expose
>> GUCs for it. I'm hoping we can get by without that, but I'm not sure.
>
> I think things are complicated enough; I vote for no additional GUCs at
> this point.

That's fine with me for now.

>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>> similar logic? Otherwise, a user with autovacuum=off won't get
>> emergency autovacuums for member exhaustion, even though they will get
>> them for offset exhaustion.
>
> Yeah, it looks like it does.

OK, I'm not clear how to do that correctly, exactly, but hopefully one
of us can figure that out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com