
Plans for solving the VACUUM problem


Vadim Mikheev
May 20, 2001, 2:42:50 AM
> Hm. On the other hand, relying on WAL for undo means you cannot drop
> old WAL segments that contain records for any open transaction. We've
> already seen several complaints that the WAL logs grow unmanageably huge
> when there is a long-running transaction, and I think we'll see a lot
> more.
>
> It would be nicer if we could drop WAL records after a checkpoint or two,
> even in the presence of long-running transactions. We could do that if
> we were only relying on them for crash recovery and not for UNDO.

As you understand, this is an old, well-known problem in database practice,
described in the books. There are two ways - either abort transactions that
run too long, or (and/or) compact old log segments: fetch and save (for use
in undo) the records of long-running transactions and remove the other
records. Neither way is perfect, but nothing is perfect at all -:)
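
To make the compaction idea concrete, here is a minimal sketch in C. The
record layout and the open-transaction test are hypothetical illustrations,
not the real PostgreSQL WAL format; the point is only that records of
still-open transactions get copied into a new, smaller segment so the old
one can be dropped.

    /* Sketch only: XLogRec and xid_is_open() are hypothetical, not the
     * real PostgreSQL WAL structures. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint32_t xid; const char *payload; } XLogRec;

    static bool xid_is_open(uint32_t xid)
    {
        /* pretend xid 42 is a long-running, still-open transaction */
        return xid == 42;
    }

    /* Copy only the records still needed for UNDO of open transactions;
     * return how many records survived compaction. */
    static int compact_segment(const XLogRec *in, int n, XLogRec *out)
    {
        int kept = 0;
        for (int i = 0; i < n; i++)
            if (xid_is_open(in[i].xid))
                out[kept++] = in[i];
        return kept;
    }

    int main(void)
    {
        XLogRec seg[] = { {41, "committed"}, {42, "open-1"},
                          {43, "aborted"},   {42, "open-2"} };
        XLogRec compacted[4];
        printf("kept %d of 4 records for undo\n",
               compact_segment(seg, 4, compacted));
        return 0;
    }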

> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> make lightweight VACUUM work well. (I definitely don't like the idea

Sorry, but I'm going to consider background vacuum as a temporary solution
only. As I've already pointed out, the original PG authors finally became
disillusioned with the same approach. What is good about using UNDO for 1.
is the fact that WAL records give you *direct* physical access to the
changes which should be rolled back.

> that after a very long transaction fails and aborts, I'd have to wait
> another very long time for UNDO to do its thing before I could get on
> with my work. Would much rather have the space reclamation happen in
> background.)

Understandable, but why should other transactions read dirty data again
and again while waiting for background vacuum? I think aborted transactions
should take some responsibility for the mess made by them -:)
And keeping 2. in mind, very long transactions could be continued -:)

> 2. SAVEPOINTs would be awfully nice to have, I agree.
>
> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
> impact on the system: go to 8-byte xact IDs. Having to shut down the
> postmaster when you approach the 4Gb transaction mark isn't going to
> impress people who want a 24x7 commitment, anyway.

+8 bytes in the tuple header is not such a tiny thing.

> 4. Recycling pg_log would be nice too, but we've already discussed other
> hacks that might allow pg_log to be kept finite without depending on
> UNDO (or requiring postmaster restarts, IIRC).

We did... and didn't get agreement.

> I'm sort of thinking that undoing back to a savepoint is the only real
> usefulness of WAL-based UNDO. Is it practical to preserve the WAL log
> just back to the last savepoint in each xact, not the whole xact?

No, it's not. It's not possible in overwriting systems at all - all of a
transaction's records are required.

> Another thought: do we need WAL UNDO at all to implement savepoints?
> Is there some way we could do them like nested transactions, wherein
> each savepoint-to-savepoint segment is given its own transaction number?
> Committing multiple xact IDs at once might be a little tricky, but it
> seems like a narrow, soluble problem.

Implicit savepoints wouldn't be possible - and that is a very convenient
feature I've found in Oracle.
And the additional code in tqual.c wouldn't be a good addition.

> Implementing UNDO without creating lots of performance issues looks
> a lot harder.

What *performance* issues?!
The only issue is additional disk requirements.

Vadim


Tom Lane
May 20, 2001, 1:15:48 PM
"Vadim Mikheev" <vmik...@sectorbase.com> writes:
>> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
>> make lightweight VACUUM work well.

> Sorry, but I'm going to consider background vacuum as temporary solution
> only. As I've already pointed, original PG authors finally became
> disillusioned with the same approach.

How could they become disillusioned with it, when they never tried it?
I know of no evidence that any version of PG has had backgroundable
(non-blocking-to-other-transactions) VACUUM, still less within-relation
space recycling. They may have become disillusioned with the form of
VACUUM that they actually had (ie, the same one we've inherited) --- but
please don't call that "the same approach" I'm proposing.

Certainly, doing VACUUM this way is an experiment that may fail, or may
require further work before it really works well. But I'd appreciate it
if you wouldn't prejudge the results of the experiment.

>> Would much rather have the space reclamation happen in
>> background.)

> Understandable, but why other transactions should read dirty data again
> and again waiting for background vacuum? I think aborted transaction
> should take some responsibility for mess made by them -:)

They might read it again and again before the failed xact gets around to
removing the data, too. You cannot rely on UNDO for correctness; at
most it can be a speed/space optimization. I see no reason to assume
that it's a more effective optimization than a background vacuum
process.

>> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
>> impact on the system: go to 8-byte xact IDs.

> +8 bytes in tuple header is not so tiny thing.

Agreed, but the people who need 8-byte IDs are not running small
installations. I think they'd sooner pay a little more in disk space
than risk costs in performance or reliability.

>> Another thought: do we need WAL UNDO at all to implement savepoints?
>> Is there some way we could do them like nested transactions, wherein
>> each savepoint-to-savepoint segment is given its own transaction number?

> Implicit savepoints wouldn't be possible - this is very convenient
> feature I've found in Oracle.

Why not? Seems to me that establishing implicit savepoints is just a
user-interface issue; you can do it, or not do it, regardless of the
underlying mechanism.

>> Implementing UNDO without creating lots of performance issues looks
>> a lot harder.

> What *performance* issues?!
> The only issue is additional disk requirements.

Not so. UNDO does failed-transaction cleanup work in the interactive
backends, where it necessarily delays clients who might otherwise be
issuing their next command. A VACUUM-based approach does the cleanup
work in the background. Same work, more or less, but it's not in the
clients' critical path.

BTW, UNDO for failed transactions alone will not eliminate the need for
VACUUM. Will you also make successful transactions go back and
physically remove the tuples they deleted?

regards, tom lane


Vadim Mikheev
May 20, 2001, 5:06:29 PM
> >> 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
> >> make lightweight VACUUM work well.
>
> > Sorry, but I'm going to consider background vacuum as temporary solution
> > only. As I've already pointed, original PG authors finally became
> > disillusioned with the same approach.
>
> How could they become disillusioned with it, when they never tried it?
> I know of no evidence that any version of PG has had backgroundable
> (non-blocking-to-other-transactions) VACUUM, still less within-relation
> space recycling. They may have become disillusioned with the form of
> VACUUM that they actually had (ie, the same one we've inherited) --- but
> please don't call that "the same approach" I'm proposing.

Pre-Postgres'95 (original) versions had a vacuum daemon running in the
background. I don't know if that vacuum shrank relations or not
(there was no shrinking in the '95 version); I know that the daemon had to
do some extra work moving old tuples to archival storage, but
anyway, as you can read in the old papers, under consistent heavy
load the daemon was not able to clean up storage fast enough. And the
reason is obvious - no matter how optimized your daemon is
(in regard to blocking other transactions etc.), it will have to
perform a huge amount of IO just to find the space available for reclaiming.

> Certainly, doing VACUUM this way is an experiment that may fail, or may
> require further work before it really works well. But I'd appreciate it
> if you wouldn't prejudge the results of the experiment.

Why not, Tom? Why shouldn't I state my opinion?
Last summer your comment about WAL, my experiment at that time, was that
it would save just a few fsyncs. It was your right to make that prejudgment,
so what's wrong with my rights? And you appealed to old papers as well, BTW.

> > Understandable, but why other transactions should read dirty data again
> > and again waiting for background vacuum? I think aborted transaction
> > should take some responsibility for mess made by them -:)
>
> They might read it again and again before the failed xact gets around to
> removing the data, too. You cannot rely on UNDO for correctness; at
> most it can be a speed/space optimization. I see no reason to assume
> that it's a more effective optimization than a background vacuum
> process.

Really?! Once again: WAL records give you the *physical* addresses of the
tuples (both heap and index ones!) to be removed, and the size of the log
to read the records from is not comparable with the size of the data files.

> >> Another thought: do we need WAL UNDO at all to implement savepoints?
> >> Is there some way we could do them like nested transactions, wherein
> >> each savepoint-to-savepoint segment is given its own transaction number?
>
> > Implicit savepoints wouldn't be possible - this is very convenient
> > feature I've found in Oracle.
>
> Why not? Seems to me that establishing implicit savepoints is just a
> user-interface issue; you can do it, or not do it, regardless of the
> underlying mechanism.

Implicit savepoints are set by the server automatically before each
query execution - you wouldn't use transaction IDs for this.

> >> Implementing UNDO without creating lots of performance issues looks
> >> a lot harder.
>
> > What *performance* issues?!
> > The only issue is additional disk requirements.
>
> Not so. UNDO does failed-transaction cleanup work in the interactive
> backends, where it necessarily delays clients who might otherwise be
> issuing their next command. A VACUUM-based approach does the cleanup
> work in the background. Same work, more or less, but it's not in the
> clients' critical path.

Not the same work but much more, and in the critical paths of all clients.
And - is the overall performance of Oracle or Informix worse than in PG?
It seems that delays in clients for rollback don't affect performance so
much. But dirty storage does.

> BTW, UNDO for failed transactions alone will not eliminate the need for
> VACUUM. Will you also make successful transactions go back and
> physically remove the tuples they deleted?

They can't do this, as you know pretty well. But using WAL to get the TIDs
to be deleted is worth considering, no?

Vadim


Vadim Mikheev
May 20, 2001, 5:19:23 PM
> Were you going to use WAL to get free space from old copies too?

An approach worth considering.

> Vadim, I think I am missing something. You mentioned UNDO would be used
> for these cases and I don't understand the purpose of adding what would
> seem to be a pretty complex capability:

Yeah, we already won the title of most advanced among the simple databases -:)
Yes, looking up the list of IDs assigned to a single transaction in tqual.c is
much easier to do than UNDO. Just as a couple of fsyncs is easier than WAL.

> > 1. Reclaim space allocated by aborted transactions.
>
> Is there really a lot to be saved here vs. old tuples of committed
> transactions?

Are you able to protect COPY FROM from abort/crash?

Vadim


Tom Lane
May 20, 2001, 5:31:34 PM
"Vadim Mikheev" <vmik...@sectorbase.com> writes:
> Really?! Once again: WAL records give you *physical* address of tuples
> (both heap and index ones!) to be removed and size of log to read
> records from is not comparable with size of data files.

You sure? With our current approach of dumping data pages into the WAL
on first change since checkpoint (and doing so again after each
checkpoint) it's not too difficult to devise scenarios where the WAL log
is *larger* than the affected datafiles ... and can't be truncated until
someone commits.

The copied-data-page traffic is the worst problem with our current
WAL implementation. I did some measurements last week on VACUUM of a
test table (the accounts table from a "pg_bench -s 10" setup, which
contains 1000000 rows; I updated 20000 rows and then vacuumed). This
generated about 34400 8k blocks of WAL traffic, of which about 33300
represented copied pages and the other 1100 blocks were actual WAL
entries. That's a pretty massive I/O overhead, considering the table
itself was under 20000 8k blocks. It was also interesting to note that
a large fraction of the CPU time was spent calculating CRCs on the WAL
data.
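
To put those numbers in perspective, a quick back-of-the-envelope
calculation (plain arithmetic on the figures quoted above, 8k blocks):

    /* Back-of-the-envelope check of the WAL volume quoted above. */
    #include <stdio.h>

    int main(void)
    {
        const double blk_kb       = 8.0;
        const double wal_blocks   = 34400;   /* total WAL blocks written    */
        const double page_images  = 33300;   /* blocks holding copied pages */
        const double table_blocks = 20000;   /* table size, upper bound     */

        printf("WAL written : %.0f MB\n", wal_blocks * blk_kb / 1024);   /* ~269 MB */
        printf("page images : %.0f MB\n", page_images * blk_kb / 1024);  /* ~260 MB */
        printf("table size  : under %.0f MB\n",
               table_blocks * blk_kb / 1024);                            /* ~156 MB */
        return 0;
    }

Roughly 97% of the WAL volume was page images, and the WAL written for one
VACUUM exceeded the size of the table itself.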

Would it be possible to split the WAL traffic into two sets of files,
one for WAL log records proper and one for copied pages? Seems like
we could recycle the pages after each checkpoint rather than hanging
onto them until the associated transactions commit.

>> Why not? Seems to me that establishing implicit savepoints is just a
>> user-interface issue; you can do it, or not do it, regardless of the
>> underlying mechanism.

> Implicit savepoints are setted by server automatically before each
> query execution - you wouldn't use transaction IDs for this.

If the user asked you to, I don't see why not.

regards, tom lane


The Hermit Hacker
May 20, 2001, 5:35:09 PM

If it's an "experiment", shouldn't it be done outside of the main source
tree, with adequate testing in a high load situation, and with a patch
released to the community for further testing/comments, before it is added
to the source tree? From reading Vadim's comment above (re:
pre-Postgres95), this daemonized approach would cause a high I/O load on
the server in a situation where there are *a lot* of UPDATEs/DELETEs
happening to the database, which should be easily recreatable, no? Or,
Vadim, am I misunderstanding?

Tom Lane
May 20, 2001, 6:03:05 PM
The Hermit Hacker <scr...@hub.org> writes:
> If its an "experiment", shouldn't it be done outside of the main source
> tree, with adequate testing in a high load situation, with a patch
> released to the community for further testing/comments, before it is added
> to the source tree?

Mebbe we should've handled WAL that way too ;-)

Seriously, I don't think that my proposed changes need be treated with
quite that much suspicion. The only part that is really intrusive is
the shared-memory free-heap-space-management change. But AFAICT that
will be a necessary component of *any* approach to getting rid of
VACUUM. We've been arguing here, in essence, about whether a background
or on-line approach to finding free space will be more useful; but that
still leaves you with the question of what you do with the free space
after you've found it. Without some kind of shared free space map,
there's not anything you can do except have the process that found the
space do tuple moving and file truncation --- ie, VACUUM. So even if
I'm quite wrong about the effectiveness of a background VACUUM, the FSM
code will still be needed: an UNDO-style approach is also going to need
an FSM to do anything with the free space it finds. It's equally clear
that the index AMs have to support index tuple deletion without
exclusive lock, or we'll still have blocking problems during free-space
cleanup, no matter what drives that cleanup. The only part of what
I've proposed that might end up getting relegated to the scrap heap is
the "lazy vacuum" command itself, which will be a self-contained and
relatively small module (smaller than the present commands/vacuum.c,
for sure).

Besides which, Vadim has already said that he won't have time to do
anything about space reclamation before 7.2. So even if background
vacuum does end up getting superseded by something better, we're going
to need it for a release or two ...

regards, tom lane


Lincoln Yeoh
May 20, 2001, 9:35:09 PM
At 01:09 PM 20-05-2001 -0400, Tom Lane wrote:
>>> 3. Reusing xact IDs would be nice, but there's an answer with a lot less
>>> impact on the system: go to 8-byte xact IDs.
>
>> +8 bytes in tuple header is not so tiny thing.
>
>Agreed, but the people who need 8-byte IDs are not running small
>installations. I think they'd sooner pay a little more in disk space
>than risk costs in performance or reliability.

An additional 4 (8?) bytes per tuple to increase the "mean time before
problem " 4 billion times sounds good to me.

Cheerio,
Link.

Vadim Mikheev
May 20, 2001, 10:32:47 PM
> > Really?! Once again: WAL records give you *physical* address of tuples
> > (both heap and index ones!) to be removed and size of log to read
> > records from is not comparable with size of data files.
>
> You sure? With our current approach of dumping data pages into the WAL
> on first change since checkpoint (and doing so again after each
> checkpoint) it's not too difficult to devise scenarios where the WAL log
> is *larger* than the affected datafiles ... and can't be truncated until
> someone commits.

Yes, but note my words "size of log to read records from" - each log record
has a pointer to the previous record made by the same transaction: rollback
does not have to read the entire log file to get all records of a specific
transaction.
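
A minimal sketch of that back-chaining (hypothetical record layout, not the
actual WAL format): each record stores a link to the same transaction's
previous record, so rollback touches only that transaction's records instead
of scanning the whole log.

    /* Sketch: walking one transaction's records backwards via per-record
     * "previous record of same xid" links. Hypothetical, for illustration. */
    #include <stdio.h>

    #define NONE (-1)

    typedef struct {
        int prev;            /* index of previous record of the same xact */
        const char *change;  /* what would have to be undone */
    } UndoRec;

    /* Undo (here: just print) one transaction's changes, newest first. */
    static void rollback(const UndoRec *log, int last)
    {
        for (int i = last; i != NONE; i = log[i].prev)
            printf("undo: %s\n", log[i].change);
    }

    int main(void)
    {
        /* records of several transactions interleaved in one log */
        UndoRec log[] = {
            {NONE, "xact A: insert heap tuple (0,1)"},   /* 0 */
            {NONE, "xact B: insert heap tuple (0,2)"},   /* 1 */
            {0,    "xact A: update heap tuple (0,1)"},   /* 2 */
            {2,    "xact A: insert index entry"},        /* 3 */
        };
        rollback(log, 3);   /* rolls back xact A only: records 3, 2, 0 */
        return 0;
    }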

> >> Why not? Seems to me that establishing implicit savepoints is just a
> >> user-interface issue; you can do it, or not do it, regardless of the
> >> underlying mechanism.
>
> > Implicit savepoints are setted by server automatically before each
> > query execution - you wouldn't use transaction IDs for this.
>
> If the user asked you to, I don't see why not.

An example of one use of implicit savepoints: skipping duplicate-key
insertions. Would you use transaction IDs when someone wants to insert a
few thousand records?
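
To see the objection against the nested-transaction scheme, a rough sketch
(hypothetical names and logic, not PostgreSQL code): if every statement is
an implicit savepoint backed by its own transaction ID, a bulk insert
consumes one ID per row inserted, and only the failed ones are marked
aborted.

    /* Sketch: statement-level implicit savepoints implemented by giving
     * each statement its own transaction ID. Hypothetical illustration. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { IN_PROGRESS, COMMITTED, ABORTED } XidStatus;

    #define MAX_XIDS 64
    static XidStatus status[MAX_XIDS];
    static uint32_t next_xid = 1;

    static uint32_t begin_statement(void)   /* the implicit savepoint */
    {
        status[next_xid] = IN_PROGRESS;
        return next_xid++;
    }

    static bool insert_row(int key)          /* pretend key 7 is a duplicate */
    {
        return key != 7;
    }

    int main(void)
    {
        int keys[] = {5, 6, 7, 8};
        for (int i = 0; i < 4; i++)
        {
            uint32_t xid = begin_statement();
            status[xid] = insert_row(keys[i]) ? COMMITTED : ABORTED;
        }
        /* four inserted rows consumed four transaction IDs */
        printf("IDs consumed: %u\n", next_xid - 1);
        return 0;
    }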

Vadim

Vadim Mikheev
May 20, 2001, 11:04:43 PM
> If its an "experiment", shouldn't it be done outside of the main source
> tree, with adequate testing in a high load situation, with a patch
> released to the community for further testing/comments, before it is added
> to the source tree? From reading Vadim's comment above (re:
> pre-Postgres95), this daemonized approach would cause a high I/O load on
> the server in a situation where there are *alot* of UPDATE/DELETEs
> happening to the database, which should be easily recreatable, no? Or,
> Vadim, am I misundertanding?

It probably will not cause more IO than vacuum does right now.
But unfortunately it will not reduce that IO. Cleanup work will be spread
out over time and users will not experience long lockouts, but the average
impact on overall system throughput will be the same (or maybe higher).
My point is that we'll need dynamic cleanup anyway, and UNDO is
what should be implemented for dynamic cleanup of aborted changes.
Plus, UNDO gives us a natural implementation of savepoints and some
abilities in transaction ID management, which we may use or not
(though 4. - pg_log size management - is a really good thing).

Vadim


Vadim Mikheev
May 20, 2001, 11:11:54 PM
> Seriously, I don't think that my proposed changes need be treated with
> quite that much suspicion. The only part that is really intrusive is

Agreed. I fight for UNDO, not against background vacuum -:)

> the shared-memory free-heap-space-management change. But AFAICT that
> will be a necessary component of *any* approach to getting rid of
> VACUUM. We've been arguing here, in essence, about whether a background
> or on-line approach to finding free space will be more useful; but that
> still leaves you with the question of what you do with the free space
> after you've found it. Without some kind of shared free space map,
> there's not anything you can do except have the process that found the
> space do tuple moving and file truncation --- ie, VACUUM. So even if
> I'm quite wrong about the effectiveness of a background VACUUM, the FSM
> code will still be needed: an UNDO-style approach is also going to need
> an FSM to do anything with the free space it finds. It's equally clear

Unfortunately, I think that we'll need an on-disk FSM, and that FSM is
actually the most complex thing to do in the "space reclamation" project.

> Besides which, Vadim has already said that he won't have time to do
> anything about space reclamation before 7.2. So even if background
> vacuum does end up getting superseded by something better, we're going
> to need it for a release or two ...

Yes.

Vadim


Tom Lane
May 21, 2001, 12:39:57 AM
"Vadim Mikheev" <vmik...@sectorbase.com> writes:
> Unfortunately, I think that we'll need in on-disk FSM and that FSM is
> actually the most complex thing to do in "space reclamation" project.

I hope we can avoid on-disk FSM. Seems to me that that would create
problems both for performance (lots of extra disk I/O) and reliability
(what happens if FSM is corrupted? A restart won't fix it).

But, if we do need it, most of the work needed to install FSM APIs
should carry over. So I still don't see an objection to doing
in-memory FSM as a first step.


BTW, I was digging through the old Postgres papers this afternoon,
to refresh my memory about what they actually said about VACUUM.
I was interested to discover that at one time the tuple-insertion
algorithm went as follows:
1. Pick a page at random in the relation, read it in, and see if it
has enough free space. Repeat up to three times.
2. If #1 fails to find space, append tuple at end.
When they got around to doing some performance measurement, they
discovered that step #1 was a serious loser, and dropped it in favor
of pure #2 (which is what we still have today). Food for thought.
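
A sketch of that old insertion heuristic, for clarity (the page and
relation types here are hypothetical, not the original Postgres code):

    /* Sketch of the old tuple-insertion heuristic described above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>

    #define NPAGES 16

    typedef struct { int free_bytes; } Page;
    typedef struct { Page pages[NPAGES]; int npages; } Relation;

    static bool page_has_room(const Page *p, int need)
    {
        return p->free_bytes >= need;
    }

    /* Try up to three random pages; if none has room, append at the end. */
    static int choose_insert_page(const Relation *rel, int tuple_size)
    {
        for (int attempt = 0; attempt < 3; attempt++)
        {
            int blk = rand() % rel->npages;
            if (page_has_room(&rel->pages[blk], tuple_size))
                return blk;
        }
        return rel->npages;        /* extend the relation: new last page */
    }

    int main(void)
    {
        Relation rel = { .npages = NPAGES };
        rel.pages[5].free_bytes = 400;     /* only page 5 has free space */
        printf("insert goes to block %d\n", choose_insert_page(&rel, 100));
        return 0;
    }

When free space is scarce, most insertions pay for the extra page reads and
then append anyway - presumably why step #1 measured as a loser.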

regards, tom lane

Henshall, Stuart - WCP
May 21, 2001, 8:01:25 AM
Apologies if I've missed something, but isn't that the same xmin that ODBC
uses for row versioning?
- Stuart

<Snip>
> Currently, the XMIN/XMAX command counters are used only by the current
> transaction, and they are useless once the transaction finishes and take
> up 8 bytes on disk.
<Snip>

Tom Lane
May 21, 2001, 10:14:16 AM
"Vadim Mikheev" <vmik...@sectorbase.com> writes:
> It probably will not cause more IO than vacuum does right now.
> But unfortunately it will not reduce that IO.

Uh ... what? Certainly it will reduce the total cost of vacuum,
because it won't bother to try to move tuples to fill holes.
The index cleanup method I've proposed should be substantially
more efficient than the existing code, as well.

> My point is that we'll need in dynamic cleanup anyway and UNDO is
> what should be implemented for dynamic cleanup of aborted changes.

UNDO might offer some other benefits, but I doubt that it will allow
us to eliminate VACUUM completely. To do that, you would need to
keep track of free space using exact, persistent (on-disk) bookkeeping
data structures. The overhead of that will be very substantial: more,
I predict, than the approximate approach I proposed.

regards, tom lane

Zeugswetter Andreas SB
May 21, 2001, 11:03:23 AM

> > Vadim, can you remind me what UNDO is used for?
> 4. Split pg_log into small files with ability to remove old ones (which
> do not hold statuses for any running transactions).

They are already small (16Mb). Or do you mean even smaller?
This imposes one huge risk that is already a pain in other db's: you need
all the logs of one transaction online. For a gigabyte transaction like a
bulk insert this can be very inconvenient.
Imho there should be some limit where you can choose whether you want
to continue without the feature (no savepoint) or be automatically aborted.

In any case, imho some thought should be put into this :-)

Another case where this is a problem is a client that starts a tx, does one
little insert or update on his private table, and then sits and waits for a
day.

Both cases currently pose no problem whatsoever.

Andreas


Hannu Krosing
May 21, 2001, 11:11:13 AM
"Henshall, Stuart - WCP" wrote:
>
> Apologises if I've missed something, but isn't that the same xmin that ODBC
> uses for row versioning?
> - Stuart
>
> <Snip>
> > Currently, the XMIN/XMAX command counters are used only by the current
> > transaction, and they are useless once the transaction finishes and take
> > up 8 bytes on disk.
> <Snip>

BTW, is there some place where I could read about the exact semantics of
the system fields?

--------------------
Hannu

Zeugswetter Andreas SB
May 21, 2001, 11:42:03 AM

> > > Vadim, can you remind me what UNDO is used for?
> > 4. Split pg_log into small files with ability to remove old ones (which
> > do not hold statuses for any running transactions).

and I wrote:
> They are already small (16Mb). Or do you mean even smaller ?

Sorry for my little confusion above of pg_log with WAL :-(

Andreas


Zeugswetter Andreas SB
May 21, 2001, 12:26:54 PM

> Really?! Once again: WAL records give you *physical* address of tuples
> (both heap and index ones!) to be removed and size of log to read
> records from is not comparable with size of data files.

So how about a background "vacuum-like" process that reads the WAL
and does the cleanup? Seems that would be great, since it then does not
need to scan, and does not make foreground cleanup necessary.

The problem is when cleanup cannot keep up with WAL files that already
want to be removed. I would envision a config setting that says how many
MB of WAL are allowed to queue up before clients are blocked.

Andreas


Zeugswetter Andreas SB
May 21, 2001, 12:27:49 PM

> Would it be possible to split the WAL traffic into two sets of files,

Sure, the downside is two fsyncs :-( When I first suggested the physical log
I had a separate file in mind, but that is imho only a small issue.

Of course people with more than 3 disks could benefit from a split.

Tom: If your ratio of physical pages vs. WAL records is so bad, the config
should simply be changed to do fewer checkpoints (say every 20 min, like a
typical Informix setup).

Andreas


Tom Lane
May 21, 2001, 1:02:13 PM
Zeugswetter Andreas SB <Zeugsw...@wien.spardat.at> writes:
> Tom: If your ratio of physical pages vs WAL records is so bad, the config
> should simply be changes to do fewer checkpoints (say every 20 min like a
> typical Informix setup).

I was using the default configuration. What caused the problem was
probably not so much the standard 5-minute time-interval-driven
checkpoints, as it was the standard every-3-WAL-segments checkpoints.
Possibly we ought to increase that number?

regards, tom lane


Zeugswetter Andreas SB
May 21, 2001, 1:47:35 PM

> My point is that we'll need in dynamic cleanup anyway and UNDO is
> what should be implemented for dynamic cleanup of aborted changes.

I do not yet understand why you want to handle aborts differently than
outdated tuples. The ratio in a well-tuned system should well favor outdated
tuples. If someone ever adds "dirty read" it is also not guaranteed that
nobody accesses the tuple you currently want to undo. So I really fail to
see the big difference.

Andreas

Mikheev, Vadim
May 21, 2001, 1:59:51 PM
> > It probably will not cause more IO than vacuum does right now.
> > But unfortunately it will not reduce that IO.
>
> Uh ... what? Certainly it will reduce the total cost of vacuum,
> because it won't bother to try to move tuples to fill holes.

Oh, you're right here, but the daemon will most likely read data files
again and again with an in-memory FSM. Also, if we do partial table
scans then we'll probably re-read indices more than once.

> The index cleanup method I've proposed should be substantially
> more efficient than the existing code, as well.

Not in IO area.

> > My point is that we'll need in dynamic cleanup anyway and UNDO is
> > what should be implemented for dynamic cleanup of aborted changes.
>

> UNDO might offer some other benefits, but I doubt that it will allow
> us to eliminate VACUUM completely. To do that, you would need to

I never said that -:)

> keep track of free space using exact, persistent (on-disk) bookkeeping
> data structures. The overhead of that will be very substantial: more,
> I predict, than the approximate approach I proposed.

I doubt that "big guys" use in-memory FSM. If they were able to do this...

Vadim

Zeugswetter Andreas SB
May 21, 2001, 2:08:56 PM

> > Tom: If your ratio of physical pages vs WAL records is so bad, the config
> > should simply be changes to do fewer checkpoints (say every 20 min like a
> > typical Informix setup).
>
> I was using the default configuration. What caused the problem was
> probably not so much the standard 5-minute time-interval-driven

I am quite sure, that I would increase the default to at least 15 min here.

> checkpoints, as it was the standard every-3-WAL-segments checkpoints.
> Possibly we ought to increase that number?

Here I am unfortunately not so sure with the current logic (that you can only free
them after the checkpoint). I think the admin has to choose this. Maybe increase to 4,
but 64 Mb is quite a lot for a small installation :-(

Andreas

Mikheev, Vadim
May 21, 2001, 2:13:50 PM
> I hope we can avoid on-disk FSM. Seems to me that that would create
> problems both for performance (lots of extra disk I/O) and reliability
> (what happens if FSM is corrupted? A restart won't fix it).

We can use WAL for FSM.

Vadim


Mikheev, Vadim
May 21, 2001, 2:18:02 PM
> > Really?! Once again: WAL records give you *physical*
> > address of tuples (both heap and index ones!) to be
> > removed and size of log to read records from is not
> > comparable with size of data files.
>
> So how about a background "vacuum like" process, that reads
> the WAL and does the cleanup ? Seems that would be great,
> since it then does not need to scan, and does not make
> forground cleanup necessary.
>
> Problem is when cleanup can not keep up with cleaning WAL
> files, that already want to be removed. I would envision a
> config, that sais how many Mb of WAL are allowed to queue
> up before clients are blocked.

Yes, some daemon could read the logs and gather cleanup info.
We could activate it when switching to a new log file segment,
and synchronization with the checkpointer is not a big deal. That
daemon would also archive log files for WAL-based BAR,
if archiving is ON.

But this will be useful only with an on-disk FSM.

Vadim


Mikheev, Vadim
May 21, 2001, 2:52:01 PM
> > We could keep share buffer lock (or add some other kind of lock)
> > untill tuple projected - after projection we need not to read data
> > for fetched tuple from shared buffer and time between fetching
> > tuple and projection is very short, so keeping lock on buffer will
> > not impact concurrency significantly.
>
> Or drop the pin on the buffer to show we no longer have a pointer to it.

This is not good for seqscans which will return to that buffer anyway.

> > Or we could register callback cleanup function with buffer so bufmgr
> > would call it when refcnt drops to 0.
>
> Hmm ... might work. There's no guarantee that the refcnt
> would drop to zero before the current backend exits, however.
> Perhaps set a flag in the shared buffer header, and the last guy
> to drop his pin is supposed to do the cleanup?

This is what I meant - set (register) some pointer in the buffer header
to a cleanup function.
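
As a rough illustration of that register-a-cleanup-callback idea
(hypothetical structures, not the actual bufmgr interfaces): the buffer
header carries a function pointer, and whoever drops the last pin runs it.

    /* Sketch: a cleanup callback registered in a buffer header and run by
     * the backend that drops the last pin. Hypothetical types only. */
    #include <stdio.h>
    #include <stddef.h>

    typedef struct Buffer Buffer;
    typedef void (*CleanupFunc)(Buffer *);

    struct Buffer {
        int         blkno;
        int         refcount;   /* number of pins currently held */
        CleanupFunc cleanup;    /* set by the space-reclaiming process */
    };

    static void prune_dead_tuples(Buffer *buf)
    {
        printf("pruning dead tuples in block %d\n", buf->blkno);
    }

    static void release_buffer(Buffer *buf)
    {
        if (--buf->refcount == 0 && buf->cleanup != NULL)
        {
            buf->cleanup(buf);   /* last unpinner does the deferred cleanup */
            buf->cleanup = NULL;
        }
    }

    int main(void)
    {
        Buffer buf = { .blkno = 7, .refcount = 2, .cleanup = NULL };

        /* a vacuum-like process finds the page pinned, so it defers the work */
        buf.cleanup = prune_dead_tuples;

        release_buffer(&buf);    /* still pinned elsewhere: nothing happens */
        release_buffer(&buf);    /* last pin dropped: cleanup runs here */
        return 0;
    }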

> But then you'd be pushing VACUUM's work into productive transactions,
> which is probably not the way to go.

Not big work - I wouldn't worry about it.

> > Two ways: hold index page lock untill heap tuple is checked
> > or (rough schema) store info in shmem (just IndexTupleData.t_tid
> > and flag) that an index tuple is used by some scan so cleaner could
> > change stored TID (get one from prev index tuple) and set flag to
> > help scan restore its current position on return.
>
> Another way is to mark the index tuple "gone but not forgotten", so to
> speak --- mark it dead without removing it. (We could know that we need
> to do that if we see someone else has a buffer pin on the index page.)

Register cleanup function just like with heap above.

> None of these seem real clean though. Needs more thought.

Vadim


Tom Lane
May 21, 2001, 2:54:11 PM
Jan Wieck <JanW...@yahoo.com> writes:
> I think the in-shared-mem FSM could have some max-per-table
> limit and the background VACUUM just skips the entire table
> as long as nobody reused any space.

I was toying with the notion of trying to use Vadim's "MNMB" idea
(see his description of the work he did for Perlstein last year);
that is, keep track of the lowest block number of any modified block
within each relation since the last VACUUM. Then VACUUM would only
have to scan from there to the end. This covers the totally-untouched-
relation case nicely, and also helps a lot for large rels that you're
mostly just adding to or perhaps updating recent additions.
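
A tiny sketch of that bookkeeping (hypothetical, not the FSM code under
discussion): remember the lowest block number modified since the last
VACUUM, and let the next VACUUM start its scan there.

    /* Sketch: per-relation "minimum modified block number" tracking, so a
     * later vacuum can scan only [min_modified_blkno, end). Illustration only. */
    #include <stdio.h>

    typedef struct {
        int nblocks;              /* current relation length in blocks */
        int min_modified_blkno;   /* lowest block touched since last vacuum */
    } RelVacState;

    static void note_modified(RelVacState *rel, int blkno)
    {
        if (blkno < rel->min_modified_blkno)
            rel->min_modified_blkno = blkno;
    }

    static void vacuum(RelVacState *rel)
    {
        printf("vacuum scans blocks %d..%d\n",
               rel->min_modified_blkno, rel->nblocks - 1);
        rel->min_modified_blkno = rel->nblocks;  /* nothing dirty below end now */
    }

    int main(void)
    {
        RelVacState rel = { .nblocks = 10000, .min_modified_blkno = 10000 };

        note_modified(&rel, 9990);   /* appends near the end of the relation */
        note_modified(&rel, 9800);   /* an update of a recent addition */

        vacuum(&rel);        /* scans only 9800..9999, not the whole table */
        return 0;
    }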

The FSM could probably keep track of such info fairly easily, since
it will already be aware of which blocks it's told backends to try
to insert into. But it would have to be told about deletes too,
which would mean more FSM access traffic and more lock contention.
Another problem (given my current view of how FSM should work) is that
rels not being used at all would not be in FSM, or would age out of it,
and so you wouldn't know that you didn't need to vacuum them.
So I'm not sure yet if it's a good idea.

regards, tom lane


Mikheev, Vadim
May 21, 2001, 3:01:27 PM
> > My point is that we'll need in dynamic cleanup anyway and UNDO is
> > what should be implemented for dynamic cleanup of aborted changes.
>
> I do not yet understand why you want to handle aborts different than
> outdated tuples.

Maybe because aborted tuples have a shorter Time-To-Live.
And the probability of finding pages for them in the buffer pool is higher.

> The ratio in a well tuned system should well favor outdated tuples.
> If someone ever adds "dirty read" it is also not the case that it
> is guaranteed, that nobody accesses the tuple you currently want
> to undo. So I really miss to see the big difference.

It will not be guaranteed anyway as soon as we start removing
tuples without exclusive access to the relation.

And I cannot say that I would implement UNDO because of
1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management),
but because of ALL of 1., 2., 4.

Vadim


Mikheev, Vadim
May 21, 2001, 3:09:38 PM
> From: Mikheev, Vadim
> Sent: Monday, May 21, 2001 10:23 AM
> To: 'Jan Wieck'; Tom Lane
> Cc: The Hermit Hacker; 'Bruce Momjian';
> pgsql-...@postgresql.orgrg.us.greatbridge.com

Strange address, Jan?

> Subject: RE: [HACKERS] Plans for solving the VACUUM problem


>
>
> > I think the in-shared-mem FSM could have some max-per-table
> > limit and the background VACUUM just skips the entire table

> > as long as nobody reused any space. Also it might only
> > compact pages that lead to 25 or more percent of freespace in
> > the first place. That makes it more likely that if someone
> > looks for a place to store a tuple that it'll fit into that
> > block (remember that the toaster tries to keep main tuples
> > below BLKSZ/4).
>
> This should be configurable parameter like PCFREE (or something
> like that) in Oracle: consider page for insertion only if it's
> PCFREE % empty.
>
> Vadim
>


Mikheev, Vadim
May 21, 2001, 4:10:10 PM
> > We could keep share buffer lock (or add some other kind of lock)
> > untill tuple projected - after projection we need not to read data
> > for fetched tuple from shared buffer and time between fetching
> > tuple and projection is very short, so keeping lock on buffer will
> > not impact concurrency significantly.
>
> Or drop the pin on the buffer to show we no longer have a pointer
> to it. I'm not sure that the time to do projection is short though
> --- what if there are arbitrary user-defined functions in the quals
> or the projection targetlist?

Well, while we are on this subject I finally should mention an issue that
has bothered me for a long time: only "simple" functions should be allowed
to deal with data in shared buffers directly. "Simple" means: no SQL
queries there. Why? One reason: we hold a shlock on the buffer while
evaluating a seqscan qual - what if the qual's SQL queries try to acquire
an exclock on the same buffer? Another reason - concurrency. I think that
such "heavy" functions should be provided with a copy of the data.

Vadim


Barry Lind
May 21, 2001, 4:23:49 PM

Zeugswetter Andreas SB wrote:

>>> Vadim, can you remind me what UNDO is used for?
>>
>> 4. Split pg_log into small files with ability to remove old ones (which
>> do not hold statuses for any running transactions).
>
>

> They are already small (16Mb). Or do you mean even smaller ?

> This imposes one huge risk, that is already a pain in other db's. You need
> all logs of one transaction online. For a GigaByte transaction like a bulk
> insert this can be very inconvenient.
> Imho there should be some limit where you can choose whether you want
> to continue without the feature (no savepoint) or are automatically aborted.
>
> In any case, imho some thought should be put into this :-)
>
> Another case where this is a problem is a client that starts a tx, does one little
> insert or update on his private table, and then sits and waits for a day.
>
> Both cases currently impose no problem whatsoever.

Correct me if I am wrong, but both cases do present a problem currently
in 7.1. The WAL log will not remove any WAL files for transactions that
are still open (even after a checkpoint occurs). Thus if you do a bulk
insert of gigabyte size you will require a gigabyte sized WAL
directory. Also if you have a simple OLTP transaction that the user
started and walked away from for his one week vacation, then no WAL log
files can be deleted until that user returns from his vacation and ends
his transaction.

--Barry


Mikheev, Vadim
May 21, 2001, 4:41:18 PM
> Correct me if I am wrong, but both cases do present a problem
> currently in 7.1. The WAL log will not remove any WAL files
> for transactions that are still open (even after a checkpoint
> occurs). Thus if you do a bulk insert of gigabyte size you will
> require a gigabyte sized WAL directory. Also if you have a simple
> OLTP transaction that the user started and walked away from for
> his one week vacation, then no WAL log files can be deleted until
> that user returns from his vacation and ends his transaction.

Todo:

1. Compact log files after checkpoint (save records of uncommitted
transactions and remove/archive others).
2. Abort long running transactions.

Vadim

Jan Wieck
May 21, 2001, 4:54:04 PM
Barry Lind wrote:
>
>
> Zeugswetter Andreas SB wrote:
>
> >>> Vadim, can you remind me what UNDO is used for?
> >>
> >> 4. Split pg_log into small files with ability to remove old ones (which
> >> do not hold statuses for any running transactions).
> >
> >
> > They are already small (16Mb). Or do you mean even smaller ?
> > This imposes one huge risk, that is already a pain in other db's. You need
> > all logs of one transaction online. For a GigaByte transaction like a bulk
> > insert this can be very inconvenient.
> > Imho there should be some limit where you can choose whether you want
> > to continue without the feature (no savepoint) or are automatically aborted.
> >
> > In any case, imho some thought should be put into this :-)
> >
> > Another case where this is a problem is a client that starts a tx, does one little
> > insert or update on his private table, and then sits and waits for a day.
> >
> > Both cases currently impose no problem whatsoever.
>
> Correct me if I am wrong, but both cases do present a problem currently
> in 7.1. The WAL log will not remove any WAL files for transactions that
> are still open (even after a checkpoint occurs). Thus if you do a bulk
> insert of gigabyte size you will require a gigabyte sized WAL
> directory. Also if you have a simple OLTP transaction that the user
> started and walked away from for his one week vacation, then no WAL log
> files can be deleted until that user returns from his vacation and ends
> his transaction.

As a rule of thumb, online applications that hold open
transactions during user interaction are considered to be
Broken By Design (tm). So I'd slap the programmer/design
team with - let's use the server box since it doesn't contain
anything useful.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanW...@Yahoo.com #


Tom Lane
May 21, 2001, 5:24:52 PM
"Mikheev, Vadim" <vmik...@SECTORBASE.COM> writes:
>> I'm not sure that the time to do projection is short though
>> --- what if there are arbitrary user-defined functions in the quals
>> or the projection targetlist?

> Well, while we are on this subject I finally should say about issue
> bothered me for long time: only "simple" functions should be allowed
> to deal with data in shared buffers directly. "Simple" means: no SQL
> queries there. Why? One reason: we hold shlock on buffer while doing
> seqscan qual - what if qual' SQL queries will try to acquire exclock
> on the same buffer?

I think we're there already: AFAICT, user-specified quals and
projections are done after dropping the buffer shlock. (Yes, I know
there's a HeapKeyTest inside heapgettup, but user quals don't get
done there.) We do still hold a pin, but that seems OK to me.

regards, tom lane

Barry Lind
May 21, 2001, 6:54:24 PM

Tom Lane wrote:

>
> Hm. On the other hand, relying on WAL for undo means you cannot drop
> old WAL segments that contain records for any open transaction. We've
> already seen several complaints that the WAL logs grow unmanageably huge
> when there is a long-running transaction, and I think we'll see a lot
> more.
>
> It would be nicer if we could drop WAL records after a checkpoint or two,
> even in the presence of long-running transactions. We could do that if
> we were only relying on them for crash recovery and not for UNDO.

In Oracle the REDO and UNDO are separate. REDO in oracle is limited in
size to x log files y big (where x and y are parameters at database
creation). These x log files are reused in a round robin way with
checkpoints forced when wrapping around.

UNDO in oracle is done by something known as a 'rollback segment'.
There are many rollback segments. A transaction is assigned one
rollback segment to write its UNDO records to. Transactions are
assigned to a rollback segment randomly (or you can explicitly assign a
transaction to a particular rollback segment if you want). The size of
rollback segments are limited and transactions are aborted if they
exceed that size. This is why oracle has the option to assign a
specific rollback segment to a transaction, so if you know you are going
to update/insert tons of stuff you want to use an appropriately sized
rollback segment for that transaction.

Things I like about the Oracle approach vs current WAL REDO and proposed
UNDO:

1 Sizes allocated to REDO and UNDO in Oracle are known and
configurable. WAL sizes are unknown and not constrained.
2 Oracle allows for big transactions by allowing them to use
specifically sized rollback segments to handle the large transaction.
While WAL could be enhanced to fix 1, it appears difficult to have
different limits for different types of transactions as Oracle supports
in 2.

Things I don't like about the Oracle approach:

1 Not only updates, but also long-running queries can be aborted if the
rollback segment size is too small, as the undo is necessary to create
a snapshot of the state of the database at the time the query started.

thanks,
--Barry

Barry Lind
May 21, 2001, 8:23:04 PM

Mikheev, Vadim wrote:

>
> Ok, last reminder -:))
>
> On transaction abort, read WAL records and undo (rollback)
> changes made in storage. Would allow:
>
> 1. Reclaim space allocated by aborted transactions.
> 2. Implement SAVEPOINTs.
> Just to remind -:) - in the event of error discovered by server
> - duplicate key, deadlock, command mistyping, etc, - transaction
> will be rolled back to the nearest implicit savepoint setted
> just before query execution; - or transaction can be aborted by
> ROLLBACK TO <savepoint_name> command to some explicit savepoint
> setted by user. Transaction rolled back to savepoint may be continued.
> 3. Reuse transaction IDs on postmaster restart.


> 4. Split pg_log into small files with ability to remove old ones (which
> do not hold statuses for any running transactions).
>

> Vadim

This is probably not a good thread to add my two cents worth, but here
goes anyway.

The biggest issue I see with the proposed UNDO using WAL is the issue of
large/long-lasting transactions. It might be possible to solve
this problem with some extra work; however, keep in mind that different
types of transactions (i.e. normal vs bulk loads) require different
amounts of time and/or UNDO. To solve this problem, you really need per-
transaction limits, which seems difficult to implement.

I have no doubt that UNDO with WAL can be done. But is there some other
way of doing UNDO that might be just as good or better?

Part of what I see in this thread reading between the lines is that some
believe the solution to many problems in the long term is to implement
an overwriting storage manager. Implementing UNDO via WAL is a
necessary step in that direction. While others seem to believe that the
non-overwriting storage manager has some life in it yet, and may even be
the storage manager for many releases to come. I don't know enough
about the internals to have any say in that discussion, however the
grass isn't always greener on the other side of the fence (i.e. an
overwriting storage manager will come with its own set of problems/issues).

So let me throw out an idea for UNDO using the current storage manager.
First let me state that UNDO is a bit of a misnomer, since undo for
transactions is already implemented. That is what pg_log is all about.
The part of UNDO that is missing is savepoints (either explicit or
implicit), because pg_log doesn't capture the information for each
command in a transaction. So the question really becomes, how to
implement savepoints with current storage manager?

I am going to lay out one assumption that I am making:
1) Most transactions are either completely successful or completely
rolled back
(If this weren't true, i.e. you really needed savepoints to partially
rollback changes, you couldn't be using the existing version of
postgresql at all)

My proposal is:
1) Create a new relation to store 'failed commands' for transactions.
This is similar to pg_log for transactions, but takes it to the
command level. And since it records only failed commands (or ranges of
failed commands), most transactions will not have any information
in this relation, per the assumption above.
2) Use the unused pg_log status (3 = unused, 2 = commit, 1 = abort, 0
= in process) to mean that the transaction was committed but some commands
were rolled back (i.e. partial commit).
Again, for the majority of transactions nothing will need to change,
since they will still be marked as committed or aborted.
3) Code that determines whether or not a tuple is committed
needs to be aware of this new pg_log status, and look in the new
relation to see if the particular command was rolled back or not, to
determine the committed status of the tuple. This subtly changes the
meaning of HEAP_XMIN_COMMITTED and related flags to reflect the
transaction and command status instead of just the transaction status.
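
A rough sketch of how such a visibility test might look (hypothetical names
and lookups; this is not the actual tqual.c logic): the new "partial" status
forces an extra lookup of the failed-command ranges, while the common
fully-committed and fully-aborted cases are unchanged.

    /* Sketch of the proposal: a fourth pg_log status meaning "committed,
     * but some commands were rolled back". Hypothetical types and lookups. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum {
        XLOG_IN_PROGRESS = 0,
        XLOG_ABORTED     = 1,
        XLOG_COMMITTED   = 2,
        XLOG_PARTIAL     = 3    /* committed, but some commands rolled back */
    } XactStatus;

    /* Hypothetical lookup in the proposed "failed commands" relation. */
    static bool command_was_rolled_back(uint32_t xid, uint32_t cid)
    {
        return xid == 100 && cid >= 5 && cid <= 7; /* pretend commands 5..7 failed */
    }

    /* Is a tuple written by (xid, cid) visible as committed? */
    static bool tuple_committed(XactStatus xact, uint32_t xid, uint32_t cid)
    {
        switch (xact)
        {
            case XLOG_COMMITTED: return true;
            case XLOG_PARTIAL:   return !command_was_rolled_back(xid, cid);
            default:             return false;
        }
    }

    int main(void)
    {
        printf("cid 3: %s\n",
               tuple_committed(XLOG_PARTIAL, 100, 3) ? "visible" : "dead");
        printf("cid 6: %s\n",
               tuple_committed(XLOG_PARTIAL, 100, 6) ? "visible" : "dead");
        return 0;
    }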

The runtime cost of this shouldn't be too high: since the committed
state is cached in HEAP_XMIN_COMMITTED et al, the only added cost
is for the pass that needs to set these flags, and then only in the
case that the transaction wasn't completely successful (again,
my assumption above).

Now, I have no idea if what I am proposing is really doable or not. I
am just throwing this out as an alternative to WAL-based
UNDO/savepoints. The reason I am doing this is that to me it seems to
leverage much of the existing infrastructure already in place that
performs undo for rolled-back transactions (all the tmin, tmax, cmin,
cmax stuff as well as vacuum). Also, it doesn't come with the large WAL
log file problem for large transactions.

Now having said all of this I realize that this doesn't solve the 4
billion transaction id limit problem, or the large size of the pg_log
file with large numbers of transactions.

thanks,
--Barry



Lincoln Yeoh
May 21, 2001, 9:44:22 PM
At 04:41 PM 21-05-2001 -0400, Jan Wieck wrote:
>
> As a rule of thumb, online applications that hold open
> transactions during user interaction are considered to be
> Broken By Design (tm). So I'd slap the programmer/design
> team with - let's use the server box since it doesn't contain
> anything useful.
>

Many web applications use persistent database connections for performance
reasons.

I suppose it's unlikely for webapps to update a row and then sit and wait a
long time for a hit, so it shouldn't affect most of them.

However if long running transactions are to be aborted automatically, it
could possibly cause problems with some apps out there.

Worse if long running transactions are _disconnected_ (not just aborted).

Regards,
Link.

Denis Perchine
May 22, 2001, 4:09:32 AM
> > As a rule of thumb, online applications that hold open
> > transactions during user interaction are considered to be
> > Broken By Design (tm). So I'd slap the programmer/design
> > team with - let's use the server box since it doesn't contain
> > anything useful.
>
> Many web applications use persistent database connections for performance
> reasons.

Persistent connection is not the same as an OPEN transaction BTW.

> I suppose it's unlikely for webapps to update a row and then sit and wait a
> long time for a hit, so it shouldn't affect most of them.
>
> However if long running transactions are to be aborted automatically, it
> could possibly cause problems with some apps out there.
>
> Worse if long running transactions are _disconnected_ (not just aborted).

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: d...@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

Zeugswetter Andreas SB
May 22, 2001, 4:10:32 AM

> As a rule of thumb, online applications that hold open
> transactions during user interaction are considered to be
> Broken By Design (tm). So I'd slap the programmer/design
> team with - let's use the server box since it doesn't contain
> anything useful.

We have a database system here, not an OLTP helper app.
A database system must support all sorts of mixed usage, from simple
OLTP to OLAP. Imho the usual separation onto different servers gives more
headaches than are necessary.

Thus the above statement can imho be true for one OLTP application, but not
for all applications on one db server.

Andreas


Zeugswetter Andreas SB
May 22, 2001, 4:22:02 AM

> Correct me if I am wrong, but both cases do present a problem currently
> in 7.1. The WAL log will not remove any WAL files for transactions that
> are still open (even after a checkpoint occurs). Thus if you do a bulk
> insert of gigabyte size you will require a gigabyte sized WAL
> directory. Also if you have a simple OLTP transaction that the user
> started and walked away from for his one week vacation, then no WAL log
> files can be deleted until that user returns from his vacation and ends
> his transaction.

I am not sure; it might be implemented that way. But there is no technical
reason to keep them beyond a checkpoint without UNDO.

Zeugswetter Andreas SB
May 22, 2001, 4:28:45 AM

> REDO in oracle is done by something known as a 'rollback segment'.

You are not seriously saying that you like the "rollback segments" in Oracle.
They only cause trouble:
1. configuration (for every different workload you need a different config)
2. snapshot too old
3. tx abort because rollback segments are full
4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)

If I read the papers correctly Version 9 gets rid of Point 1 but the rest ...

Zeugswetter Andreas SB
May 22, 2001, 5:33:28 AM
> Todo:
>
> 1. Compact log files after checkpoint (save records of uncommitted
> transactions and remove/archive others).

On the grounds that undo is not guaranteed anyway (concurrent heap access),
why not simply forget it, since the above sounds rather expensive?
The only downside would be that long-running txn's cannot [easily] roll
back to a savepoint.

> 2. Abort long running transactions.

This is imho "the" big downside of UNDO, and should not simply be put on
the TODO list without thorough research. I think it would be better to
forget UNDO for long-running transactions than to abort them.

Andreas


Jan Wieck
May 22, 2001, 9:01:57 AM
Lincoln Yeoh wrote:
> At 04:41 PM 21-05-2001 -0400, Jan Wieck wrote:
> >
> > As a rule of thumb, online applications that hold open
> > transactions during user interaction are considered to be
> > Broken By Design (tm). So I'd slap the programmer/design
> > team with - let's use the server box since it doesn't contain
> > anything useful.
> >
>
> Many web applications use persistent database connections for performance
> reasons.
>
> I suppose it's unlikely for webapps to update a row and then sit and wait a
> long time for a hit, so it shouldn't affect most of them.
>
> However if long running transactions are to be aborted automatically, it
> could possibly cause problems with some apps out there.
>
> Worse if long running transactions are _disconnected_ (not just aborted).

All true, but unrelated. He was talking about open
transactions holding locks while the user is off to recycle
some coffee or so. A persistent database connection doesn't
mean that you're holding a transaction while waiting for the
next hit.

And Postgres doesn't abort transactions or disconnect sessions because
of their runtime. Then again, it would take care of half-done
work abandoned by the httpd, because a connection loss inside of
a transaction causes an implicit rollback.

Barry Lind

unread,
May 22, 2001, 1:23:05 PM5/22/01
to
Actually, I don't like the problems with rollback segments in Oracle at
all. I am just concerned that using WAL for UNDO will have all of the
same problems if it isn't designed carefully. At least with Oracle there
are multiple rollback segments; with WAL there is just one log, so you
could potentially have that 20 Gb all in your single log directory.
People are already reporting the log directory growing to a gigabyte or
more when running vacuum in 7.1.

Of the points you raised about oracle's rollback segment problems:

1. configuration (for every different workload you need a different config)

Postgres should be able to do a better job here.


2. snapshot too old

This shouldn't be a problem as long as Postgres continues to use a non-overwriting
storage manager. Under an overwriting storage manager, however, you need to keep
all of the changes in the UNDO records to satisfy the query snapshot, so if you
want to limit the size of UNDO you may need to kill long-running queries.

3. tx abort because rollback segments are full

If you want to limit the size of the UNDO, then this is a corresponding
byproduct. I believe a mail note was sent out yesterday suggesting that
limits like this be added to the todo list.

4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)

You need to store the UNDO information somewhere, and on active
databases that can amount to a lot of information, especially for bulk
loads or massive updates.

thanks,
--Barry


Zeugswetter Andreas SB wrote:

>
>
>> REDO in oracle is done by something known as a 'rollback segment'.
>
>
> You are not seriously saying that you like the "rollback segments" in Oracle.
> They only cause trouble:
> 1. configuration (for every different workload you need a different config)
> 2. snapshot too old
> 3. tx abort because rollback segments are full
> 4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)
>
> If I read the papers correctly Version 9 gets rid of Point 1 but the rest ...
>

> Andreas
>
>



Bruce Momjian

unread,
May 22, 2001, 1:58:25 PM5/22/01
to

> > > My point is that we'll need in dynamic cleanup anyway and UNDO is
> > > what should be implemented for dynamic cleanup of aborted changes.
> >
> > I do not yet understand why you want to handle aborts different than
> > outdated tuples.
>
> Maybe because aborted tuples have a shorter time-to-live,
> and the probability of finding their pages in the buffer pool is higher.

This brings up an idea I had about auto-vacuum. I wonder if autovacuum
could do most of its work by looking at the buffer cache pages and
commit xids. It seems it would be quite easy to record free space in pages
already in the buffer and collect that information for other backends to
use. It could also move tuples between cache pages with little
overhead.

There wouldn't be any I/O overhead, and frequently used tables are
already in the cache.
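To make the idea concrete, here is a minimal Python sketch of a sweep that only
looks at pages already in the buffer cache; every structure and name in it is
invented for illustration and is not PostgreSQL code:

from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class HeapTuple:
    xmin: int                 # xid that created the tuple
    xmax: Optional[int]       # xid that deleted/updated it, or None
    size: int                 # bytes

@dataclass
class BufferPage:
    relation: str
    block: int
    tuples: List[HeapTuple] = field(default_factory=list)
    page_size: int = 8192

def sweep_buffer_cache(buffer_cache, committed_xids, oldest_active_xid, fsm):
    """Record reusable space for pages already in memory; issues no reads."""
    for page in buffer_cache:
        live = sum(
            t.size for t in page.tuples
            if not (t.xmax is not None
                    and t.xmax in committed_xids      # deleter committed...
                    and t.xmax < oldest_active_xid)   # ...and nobody can still see it
        )
        fsm[(page.relation, page.block)] = page.page_size - live

fsm = {}
cache = [BufferPage("orders", 0, [HeapTuple(100, 105, 60), HeapTuple(120, None, 60)])]
sweep_buffer_cache(cache, committed_xids={100, 105, 120}, oldest_active_xid=130, fsm=fsm)
print(fsm)   # {('orders', 0): 8132} -- the dead tuple's 60 bytes count as free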


Bruce Momjian

unread,
May 22, 2001, 2:39:28 PM5/22/01
to
> > The ratio in a well tuned system should well favor outdated tuples.
> > If someone ever adds "dirty read" it is also not the case that it
> > is guaranteed, that nobody accesses the tuple you currently want
> > to undo. So I really miss to see the big difference.
>
> It will not be guaranteed anyway as soon as we start removing
> tuples without exclusive access to relation.
>
> And, I cannot say that I would implement UNDO because of
> 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> but because of ALL of 1., 2., 4.

OK, I understand your reasoning here, but I want to make a comment.

Looking at the previous features you added, like subqueries, MVCC, or
WAL, these were major features that greatly enhanced the system's
capabilities.

Now, looking at UNDO, I just don't see it in the same league as those
other additions. Of course, you can work on whatever you want, but I
was hoping to see another major feature addition for 7.2. We know we
badly need auto-vacuum, improved replication, and point-in-time recovery.

I can see UNDO improving row reuse, and making subtransactions and
pg_log compression easier, but these items do not require UNDO.

In fact, I am unsure why we would want an UNDO way of reusing rows of
aborted transactions and an autovacuum way of reusing rows from
committed transactions, especially because aborted transactions account
for <5% of all transactions. It would be better to put work into one
mechanism that would reuse all tuples.

If UNDO came with no limitations, it may be a good option, but the need
to carry tuples until transaction commit does add an extra burden on
programmers and administrators, and I just don't see what we are getting
for it.


Mikheev, Vadim

unread,
May 22, 2001, 2:54:07 PM5/22/01
to
> > 1. Compact log files after checkpoint (save records of uncommitted
> > transactions and remove/archive others).
>
> On the grounds that undo is not guaranteed anyway (concurrent
> heap access), why not simply forget it,

We can set a flag in ItemData and register a callback function in the
buffer header regardless of concurrent heap/index access. Then the buffer
will be cleaned before it is thrown out of the buffer pool
(a little optimization: if the buffer is not dirty at the moment its pin
count drops to 0, we shouldn't actually clean it up, to avoid an unnecessary
write - we can record in the FSM that space is available and clean the page
up on the first pin/read).
So only the ability to reuse space *immediately* is not guaranteed. But that
is a general problem of space reuse as long as we assume that a buffer pin is
enough to access the data.
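A minimal sketch of that "clean on last unpin" idea, using invented Buffer/FSM
structures rather than anything from the backend:

class Buffer:
    def __init__(self, page_id):
        self.page_id = page_id
        self.pins = 0
        self.dirty = False
        self.dead_items = set()     # item ids flagged for removal (e.g. by UNDO)
        self.on_unpin = None        # registered callback

    def pin(self):
        self.pins += 1

    def unpin(self):
        self.pins -= 1
        if self.pins == 0 and self.on_unpin:
            self.on_unpin(self)

FSM = {}   # page_id -> "space available" hint

def cleanup_callback(buf):
    if buf.dirty:
        # page must be written out anyway, so compact it now
        buf.dead_items.clear()
        print(f"page {buf.page_id}: compacted before write-out")
    else:
        # avoid an unnecessary write: just remember the space for later
        FSM[buf.page_id] = True
        print(f"page {buf.page_id}: cleanup deferred, FSM updated")

buf = Buffer(page_id=42)
buf.pin()
buf.dead_items.add(7)          # an aborted tuple flagged for cleanup
buf.on_unpin = cleanup_callback
buf.unpin()                    # pin count reaches zero -> callback runs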

> since above sounds rather expensive ?

I'm not sure. For a non-overwriting smgr the required UNDO info is pretty
small, because we're not required to keep the tuple data, unlike
Oracle & Co. We could even store UNDO info in a non-WAL format to avoid
the log record header overhead. UNDO files would be a kind of Oracle rollback
segment, but much smaller. But yes - additional log reads.

> The downside would only be, that long running txn's cannot
> [easily] rollback to savepoint.

We should implement savepoints for all transactions or none, no?

> > 2. Abort long running transactions.
>
> This is imho "the" big downside of UNDO, and should not
> simply be put on the TODO without thorow research. I think it
> would be better to forget UNDO for long running transactions
> before aborting them.

Abort could be configurable.

Vadim

Mikheev, Vadim

unread,
May 22, 2001, 6:01:35 PM5/22/01
to
> > And, I cannot say that I would implement UNDO because of
> > 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> > but because of ALL of 1., 2., 4.
>
> OK, I understand your reasoning here, but I want to make a comment.
>
> Looking at the previous features you added, like subqueries, MVCC, or
> WAL, these were major features that greatly enhanced the system's
> capabilities.
>
> Now, looking at UNDO, I just don't see it in the same league as those
> other additions. Of course, you can work on whatever you want, but I
> was hoping to see another major feature addition for 7.2. We know we
> badly need auto-vacuum, improved replication, and point-in-time recover.

I don't like the auto-vacuum approach in the long term; WAL-based BAR is too easy
to do -:) (and you know that there is a man who will do it, probably);
bidirectional sync replication is good to work on, but I'm more
interested in storage/transaction management now. And I'm not sure
I'll have enough time for "another major feature in 7.2" anyway.

> It would be better to put work into one mechanism that would
> reuse all tuples.

This is what we're discussing now -:)
If the community doesn't like UNDO then I'll probably try to implement a
dead space collector which will read log files and so on. It would be easy to
#ifdef it in 7.2 and use it in 7.3 (or so) with an on-disk FSM. Also, I have
to implement logging for non-btree indices (required anyway for UNDO,
WAL-based BAR, and WAL-based space reuse).

Vadim


Hiroshi Inoue

unread,
May 22, 2001, 8:57:17 PM5/22/01
to
Bruce Momjian wrote:
>
> > > The ratio in a well tuned system should well favor outdated tuples.
> > > If someone ever adds "dirty read" it is also not the case that it
> > > is guaranteed, that nobody accesses the tuple you currently want
> > > to undo. So I really miss to see the big difference.
> >
> > It will not be guaranteed anyway as soon as we start removing
> > tuples without exclusive access to relation.
> >
> > And, I cannot say that I would implement UNDO because of
> > 1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
> > but because of ALL of 1., 2., 4.
>
> OK, I understand your reasoning here, but I want to make a comment.
>
> Looking at the previous features you added, like subqueries, MVCC, or
> WAL, these were major features that greatly enhanced the system's
> capabilities.
>
> Now, looking at UNDO, I just don't see it in the same league as those
> other additions.

Hmm, hasn't this already been agreed upon? I know UNDO was planned
for 7.0 and I've never heard objections about it until
recently. People have also referred to an overwriting smgr
rather casually. Please tell me how to introduce an overwriting smgr
without UNDO.

regards,
Hiroshi Inoue

Bruce Momjian

unread,
May 22, 2001, 9:09:06 PM5/22/01
to
> > Looking at the previous features you added, like subqueries, MVCC, or
> > WAL, these were major features that greatly enhanced the system's
> > capabilities.
> >
> > Now, looking at UNDO, I just don't see it in the same league as those
> > other additions.
>
> Hmm hasn't it been an agreement ? I know UNDO was planned
> for 7.0 and I've never heard objections about it until
> recently. People also have referred to an overwriting smgr
> easily. Please tell me how to introduce an overwriting smgr
> without UNDO.

I guess that is the question. Are we heading for an overwriting storage
manager? I didn't see that in Vadim's list of UNDO advantages, but
maybe that is his final goal. If so UNDO may make sense, but then the
question is how do we keep MVCC with an overwriting storage manager?

The only way I can see doing it is to throw the old tuples into the WAL
and have backends read through that for MVCC info.


Hiroshi Inoue

unread,
May 22, 2001, 11:13:41 PM5/22/01
to

Bruce Momjian wrote:
>
> > > Looking at the previous features you added, like subqueries, MVCC, or
> > > WAL, these were major features that greatly enhanced the system's
> > > capabilities.
> > >
> > > Now, looking at UNDO, I just don't see it in the same league as those
> > > other additions.
> >
> > Hmm hasn't it been an agreement ? I know UNDO was planned
> > for 7.0 and I've never heard objections about it until
> > recently. People also have referred to an overwriting smgr
> > easily. Please tell me how to introduce an overwriting smgr
> > without UNDO.
>
> I guess that is the question. Are we heading for an overwriting storage
> manager?

I've never heard that it was given up, so there seems to be
at least a possibility of introducing it in the future.
PostgreSQL has been able to live without UNDO thanks to its
non-overwriting smgr. I don't know whether partial rollback
can be implemented while avoiding UNDO (I don't think it's
easy in any case). However, introducing such a workaround
seems harmful to a future implementation of an overwriting
smgr.

> I didn't see that in Vadim's list of UNDO advantages, but
> maybe that is his final goal.
> If so UNDO may make sense, but then the
> question is how do we keep MVCC with an overwriting storage manager?
>

It doesn't seem easy. ISTM it's one of the main reasons we
couldn't introduce an overwriting smgr in 7.2.

regards,
Hiroshi Inoue


Zeugswetter Andreas SB

unread,
May 23, 2001, 4:04:41 AM5/23/01
to

> If community will not like UNDO then I'll probably try to implement

Imho UNDO would be great under the following circumstances:
1. The undo is only registered for some background work process
   and not done in the client's backend (or only if it is a small txn).
2. The same mechanism should also be used for outdated tuples
   (the only difference being that some tuples need to wait longer
   because of an active xid).

The reasoning for not doing it in the client's backend is not only that the client
does not need to wait, but that a nervous dba tends to kill the backend if, after one
hour of forward work, it seemingly does not respond anymore (because it is
busy with undo).

> dead space collector which will read log files and so on.

Which would then only be one possible implementation variant of the above :-)
The first step would probably be to separate the physical log to reduce WAL size.

> to implement logging for non-btree indices (anyway required for UNDO,
> WAL-based BAR, WAL-based space reusing).

Imho it would be great to implement a generic (albeit more expensive)
redo for all possible index types, which would be used in the absence of a physical
redo for that particular index type (currently available only for btree).

The prerequisite would be a physical log that saves the page image before
modification. The redo could then be done (with the info from the heap tuple log record)
through the same index interface that is used during normal operation.

Imho implementing a new index type is difficult enough as it is without the need
to also write redo and undo.

Andreas


Zeugswetter Andreas SB

unread,
May 23, 2001, 4:32:32 AM5/23/01
to
> > People also have referred to an overwriting smgr
> > easily. Please tell me how to introduce an overwriting smgr
> > without UNDO.

There is no way, although undo for an overwriting smgr would involve a
very different approach than for a non-overwriting one. See Vadim's post about what
info suffices for undo in a non-overwriting smgr (file and ctid).

> I guess that is the question. Are we heading for an overwriting storage

> manager? I didn't see that in Vadim's list of UNDO advantages, but


> maybe that is his final goal. If so UNDO may make sense, but then the
> question is how do we keep MVCC with an overwriting storage manager?
>

> The only way I can see doing it is to throw the old tuples into the WAL
> and have backends read through that for MVCC info.

If PostgreSQL wants to stay MVCC, then we should imho forget "overwriting smgr"
very fast.

Let me try to list the pros and cons that I can think of:
Pro:
  - no index modification if the key stays the same
  - no search for free space on update (if the tuple still fits into the page)
  - no pg_log
Con:
  - additional IO to write the "before image" to the rollback segment
    (every before image, not only the first after a checkpoint)
    (also a before image of every index page that is updated!)
  - need for a rollback segment, which imposes all sorts of contention problems
  - active rollback, which needs to do a lot of undo work

Andreas


Zeugswetter Andreas SB

unread,
May 23, 2001, 4:51:07 AM5/23/01
to

> > The downside would only be, that long running txn's cannot
> > [easily] rollback to savepoint.
>
> We should implement savepoints for all or none transactions, no?

We should not limit transaction size to the disk space available online for WAL.
Imho that is much more important. With guaranteed undo we would need
disk space for more than 2x the new data size (+ at least space for 1x all modified
pages, unless the physical log is separated from WAL).

Imho a good design should require only a little more than 1x the new data size.
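To make the arithmetic concrete (a rough illustration only): bulk-loading 10 GB of
new rows would then mean roughly 10 GB written to the data files, more than another
10 GB of WAL that must stay online until the transaction ends in case it has to be
undone, plus the page images written after each checkpoint - well over 20 GB of
transient disk space for 10 GB of new data.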

>
> > > 2. Abort long running transactions.
> >
> > This is imho "the" big downside of UNDO, and should not
> > simply be put on the TODO without thorow research. I think it
> > would be better to forget UNDO for long running transactions
> > before aborting them.
>
> Abort could be configurable.

The point is that you need to abort before WAL runs out of disk space,
regardless of configuration.

Philip Warner

unread,
May 23, 2001, 5:09:18 AM5/23/01
to
At 14:33 22/05/01 -0700, Mikheev, Vadim wrote:
>
>If community will not like UNDO then I'll probably try to implement
>dead space collector which will read log files and so on.

I'd vote for UNDO; in terms of usability & friendliness it's a big win.
Tom's plans for FSM etc. are, at least, going to get us some useful data,
and at best will mean we can hold off on a WAL-based FSM for a few versions.



Zeugswetter Andreas SB

unread,
May 23, 2001, 5:31:00 AM5/23/01
to

> >If community will not like UNDO then I'll probably try to implement
> >dead space collector which will read log files and so on.
>
> I'd vote for UNDO; in terms of usability & friendliness it's a big win.

Could you please be a little more verbose? I am very interested in
the advantages you see in "UNDO for rollback only".

pg_log is a very big argument, but couldn't we try to change its format
to something that only stores ranges of aborted txn's in a btree-like format?
Now that we have WAL, that should be possible.
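A small sketch of what such a range-based lookup might look like (a sorted-list
stand-in for the btree-like structure; all names here are invented, nothing is
taken from the actual pg_log code):

import bisect

class AbortedRanges:
    def __init__(self):
        self.starts = []     # sorted range start xids
        self.ends = []       # matching inclusive range end xids

    def add(self, xid):
        """Record an aborted xid, merging with an adjacent range if possible."""
        i = bisect.bisect_right(self.starts, xid) - 1
        if i >= 0 and self.ends[i] + 1 == xid:
            self.ends[i] = xid                     # extend the previous range
        else:
            j = bisect.bisect_left(self.starts, xid)
            self.starts.insert(j, xid)
            self.ends.insert(j, xid)

    def aborted(self, xid):
        i = bisect.bisect_right(self.starts, xid) - 1
        return i >= 0 and self.starts[i] <= xid <= self.ends[i]

# Everything below the current horizon that is not in an aborted range
# would be treated as committed, so only the (rare) aborted xids need storage.
log = AbortedRanges()
for xid in (1005, 1006, 1007, 2200):
    log.add(xid)
print(log.aborted(1006), log.aborted(1500))   # True False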

Andreas


Hannu Krosing

unread,
May 23, 2001, 6:15:30 AM5/23/01
to
Hiroshi Inoue wrote:
>
> People also have referred to an overwriting smgr easily.

I am all for an overwriting smgr, but as a feature that can be selected
on a table-by-table basis (or at least at compile time), not as an
overall change.

> Please tell me how to introduce an overwriting smgr
> without UNDO.

I would much rather have a dead-space-reusing smgr on top of MVCC which
does not touch live transactions.

------------------
Hannu

Philip Warner

unread,
May 23, 2001, 6:52:21 AM5/23/01
to
At 11:25 23/05/01 +0200, Zeugswetter Andreas SB wrote:
>
>> >If community will not like UNDO then I'll probably try to implement
>> >dead space collector which will read log files and so on.
>>
>> I'd vote for UNDO; in terms of usability & friendliness it's a big win.
>
>Could you please try it a little more verbose ? I am very interested in
>the advantages you see in "UNDO for rollback only".

I have not been paying strict attention to this thread, so it may have
wandered into a narrower band than I think we are in, but my understanding
is that UNDO is required for partial rollback in the case of failed
commands within a single TX. Specifically,

- A simple typo in psql can currently cause a forced rollback of the entire
TX. UNDO should avoid this.

- It is not uncommon for applications in other databases to handle errors
from the database (e.g. missing FKs) and continue a TX.

- Similarly, when we get a new error reporting system, general constraint
(or other) failures should be able to be handled in one TX.



Zeugswetter Andreas SB

unread,
May 23, 2001, 7:14:28 AM5/23/01
to

> - A simple typo in psql can currently cause a forced rollback of the entire
> TX. UNDO should avoid this.

Yes, I forgot to mention this very big advantage, but undo is not the only possible way
to implement savepoints. Solutions using the CommandCounter have been discussed.
Although the pg_log mechanism would become more complex, a background
"vacuum-like" process could give highest priority to removing such rolled-back parts
of transactions.

Bruce Momjian

unread,
May 23, 2001, 11:31:22 AM5/23/01
to
> >> I'd vote for UNDO; in terms of usability & friendliness it's a big win.
> >
> >Could you please try it a little more verbose ? I am very interested in
> >the advantages you see in "UNDO for rollback only".

The confusion here is that you say you want UNDO, but then say you want
it to happen in the background, which sounds more like autovacuum than
UNDO.

> I have not been paying strict attention to this thread, so it may have
> wandered into a narrower band than I think we are in, but my understanding
> is that UNDO is required for partial rollback in the case of failed
> commands withing a single TX. Specifically,
>

> - A simple typo in psql can currently cause a forced rollback of the entire
> TX. UNDO should avoid this.
>

> - It is not uncommon for application in other databases to handle errors
> from the database (eg. missing FKs), and continue a TX.
>
> - Similarly, when we get a new error reporting system, general constraint
> (or other) failures should be able to be handled in one TX.

I think what you are asking for here is subtransactions, which can be
done without UNDO if we assign a unique transaction id to each
subtransaction. Not pretty, but possible.

UNDO makes subtransactions easier because you can UNDO just the part that
failed. However, someone mentioned that you may have to wait for that part
to be undone. I think you have to wait because the current transaction
would otherwise have no way to know which rows are visible to it,
unless you mark them right away or grope through the WAL
every time you visit a tuple.
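A rough sketch of the "one xid per savepoint" idea; every name here
(TransactionLog, Transaction, etc.) is hypothetical and only illustrates how
marking xids committed or aborted could stand in for UNDO:

IN_PROGRESS, COMMITTED, ABORTED = "in progress", "committed", "aborted"

class TransactionLog:
    def __init__(self):
        self.status = {}          # xid -> status
        self.next_xid = 1000

    def assign_xid(self):
        xid = self.next_xid
        self.next_xid += 1
        self.status[xid] = IN_PROGRESS
        return xid

class Transaction:
    def __init__(self, log):
        self.log = log
        self.xids = [log.assign_xid()]       # top-level xid first

    def savepoint(self):
        self.xids.append(self.log.assign_xid())
        return self.xids[-1]

    def rollback_to(self, savepoint_xid):
        # abort the savepoint's xid and everything started after it
        while self.xids and self.xids[-1] != savepoint_xid:
            self.log.status[self.xids.pop()] = ABORTED
        self.log.status[self.xids.pop()] = ABORTED

    def commit(self):
        for xid in self.xids:                # "commit multiple xids at once"
            self.log.status[xid] = COMMITTED

log = TransactionLog()
tx = Transaction(log)
sp = tx.savepoint()      # work after this point is stamped with its own xid
tx.rollback_to(sp)       # tuples stamped with that xid now read as aborted
tx.commit()
print(log.status)        # {1000: 'committed', 1001: 'aborted'}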


Tom Lane

unread,
May 23, 2001, 4:13:41 PM5/23/01
to
Hiroshi Inoue <In...@tpf.co.jp> writes:
>> I guess that is the question. Are we heading for an overwriting storage
>> manager?

> I've never heard that it was given up. So there seems to be
> at least a possibility to introduce it in the future.

Unless we want to abandon MVCC (which I don't), I think an overwriting
smgr is impractical. We need a more complex space-reuse scheme than
that.

regards, tom lane

Hiroshi Inoue

unread,
May 23, 2001, 7:17:29 PM5/23/01
to
Tom Lane wrote:
>
> Hiroshi Inoue <In...@tpf.co.jp> writes:
> >> I guess that is the question. Are we heading for an overwriting storage
> >> manager?
>
> > I've never heard that it was given up. So there seems to be
> > at least a possibility to introduce it in the future.
>
> Unless we want to abandon MVCC (which I don't), I think an overwriting
> smgr is impractical.

Impractical ? Oracle does it.

> We need a more complex space-reuse scheme than
> that.
>

IMHO we have to decide which way to go now.
As I already mentioned, changing the current handling
of transactionId/CommandId to avoid UNDO is not
only useless but also harmful for an overwriting
smgr.

regards,
Hiroshi Inoue

Don Baccus

unread,
May 23, 2001, 7:30:41 PM5/23/01
to
At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:

>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>> smgr is impractical.
>
>Impractical ? Oracle does it.

It's not easy, though ... the current PG scheme has the advantage of being
relatively simple and probably more efficient than scanning logs like
Oracle has to do (assuming your datafiles aren't thoroughly clogged with
old dead tuples).

Has anyone looked at InterBase for hints for space-reusing strategies?

As I understand it, they have a tuple-versioning scheme similar to PG's.

If nothing else, something might be learned as to the efficiency and
effectiveness of one particular approach to solving the problem.


Hiroshi Inoue

unread,
May 23, 2001, 10:25:06 PM5/23/01
to
Don Baccus wrote:
>
> At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:
>
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> >
> >Impractical ? Oracle does it.
>
> It's not easy, though ... the current PG scheme has the advantage of being
> relatively simple and probably more efficient than scanning logs like
> Oracle has to do (assuming your datafiles aren't thoroughly clogged with
> old dead tuples).
>

I think so too. I've never said that an overwriting smgr
is easy, and I don't particularly love it.

What I'm objecting to is avoiding UNDO without giving up
an overwriting smgr. We shouldn't be noncommittal now.

regards,
Hiroshi Inoue


Tom Lane

unread,
May 23, 2001, 11:06:49 PM5/23/01
to
Hiroshi Inoue <In...@tpf.co.jp> writes:

> Tom Lane wrote:
>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>> smgr is impractical.

> Impractical ? Oracle does it.

Oracle has MVCC?

regards, tom lane

Hiroshi Inoue

unread,
May 23, 2001, 11:27:06 PM5/23/01
to
Tom Lane wrote:
>
> Hiroshi Inoue <In...@tpf.co.jp> writes:
> > Tom Lane wrote:
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
>
> > Impractical ? Oracle does it.
>
> Oracle has MVCC?
>

Yes.

regards,
Hiroshi Inoue


Don Baccus

unread,
May 24, 2001, 9:03:25 AM5/24/01
to
At 11:02 PM 5/23/01 -0400, Tom Lane wrote:
>Hiroshi Inoue <In...@tpf.co.jp> writes:
>> Tom Lane wrote:
>>> Unless we want to abandon MVCC (which I don't), I think an overwriting
>>> smgr is impractical.
>
>> Impractical ? Oracle does it.
>
>Oracle has MVCC?

With restrictions, yes. You didn't know that? Vadim did ...


Mikheev, Vadim

unread,
May 24, 2001, 1:18:14 PM5/24/01
to
> >> Impractical ? Oracle does it.
> >
> >Oracle has MVCC?
>
> With restrictions, yes.

What restrictions? Rollback segment size?
A non-overwriting smgr can eat all disk space...

> You didn't know that? Vadim did ...

Didn't I mention a few times that I was
inspired by Oracle? -:)

Vadim

Mikheev, Vadim

unread,
May 24, 2001, 1:45:44 PM5/24/01
to
> If PostgreSQL wants to stay MVCC, then we should imho forget
> "overwriting smgr" very fast.
>
> Let me try to list the pros and cons that I can think of:
> Pro:
> no index modification if key stays same
> no search for free space for update (if tuple still
> fits into page)
> no pg_log
> Con:
> additional IO to write "before image" to rollback segment
> (every before image, not only first after checkpoint)
> (also before image of every index page that is updated !)

I don't think that Oracle writes the entire page as the before image - just
the tuple data and some control info. As for additional IO - we do that
anyway to remove the "before image" (deleted tuple data) from the data files.

Vadim


Mikheev, Vadim

unread,
May 24, 2001, 2:29:10 PM5/24/01
to
> > - A simple typo in psql can currently cause a forced
> > rollback of the entire TX. UNDO should avoid this.
>
> Yes, I forgot to mention this very big advantage, but undo is
> not the only possible way to implement savepoints. Solutions
> using CommandCounter have been discussed.

This would be hell.

Vadim


Hiroshi Inoue

unread,
May 24, 2001, 7:54:54 PM5/24/01
to
"Mikheev, Vadim" wrote:
>
> > I think so too. I've never said that an overwriting smgr
> > is easy and I don't love it particularily.
> >
> > What I'm objecting is to avoid UNDO without giving up
> > an overwriting smgr. We shouldn't be noncommittal now.
>
> Why not? We could decide to do overwriting smgr later
> and implement UNDO then.

What I'm referring to is the discussion about the handling
of subtransactions in order to introduce savepoint
functionality. Or do we postpone *savepoints* again?

I realize now that few people have an idea of how to switch
to an overwriting smgr. I don't think an overwriting smgr
could be achieved all at once; we have to prepare for it
step by step. AFAIK there's no idea of how to introduce an
overwriting smgr without UNDO. If we avoid UNDO now, when
would an overwriting smgr appear? I also think that the
problems Andreas has specified are pretty serious. I have
known about those problems and expected that people had
ideas to solve them, but ... I'm inclined to give up on an
overwriting smgr now.

regards,
Hiroshi Inoue

Hannu Krosing

unread,
May 24, 2001, 7:58:27 PM5/24/01
to
"Mikheev, Vadim" wrote:
>
> > >> Impractical ? Oracle does it.
> > >
> > >Oracle has MVCC?
> >
> > With restrictions, yes.
>
> What restrictions? Rollback segments size?
> Non-overwriting smgr can eat all disk space...

Isn't the same true for an overwriting smgr ? ;)

> > You didn't know that? Vadim did ...
>
> Didn't I mention a few times that I was
> inspired by Oracle? -:)

How does it do MVCC with an overwriting storage manager ?

Could it possibly be a Postgres-inspired bolted-on hack
needed for better concurrency ?


BTW, are you aware of how Interbase does its MVCC - is it more
like Oracle's way or like PostgreSQL's ?

----------------
Hannu

Hannu Krosing

unread,
May 24, 2001, 8:11:43 PM5/24/01
to
Don Baccus wrote:

>
> At 08:15 AM 5/24/01 +0900, Hiroshi Inoue wrote:
>
> >> Unless we want to abandon MVCC (which I don't), I think an overwriting
> >> smgr is impractical.
> >
> >Impractical ? Oracle does it.
>
> It's not easy, though ... the current PG scheme has the advantage of being
> relatively simple and probably more efficient than scanning logs like
> Oracle has to do (assuming your datafiles aren't thoroughly clogged with
> old dead tuples).
>
> Has anyone looked at InterBase for hints for space-reusing strategies?
>
> As I understand it, they have a tuple-versioning scheme similar to PG's.
>
> If nothing else, something might be learned as to the efficiency and
> effectiveness of one particular approach to solving the problem.

It may also be beneficial to study SapDB (which is IIRC a branch-off of
Adabas), although in the features section at http://www.sapdb.org/ they
list these as NOT supported features:

- Collations

- Result sets that are created within a stored procedure and fetched
  outside. This feature is planned to be offered in one of the coming
  releases; meanwhile, use temporary tables.
  (see Reference Manual: SAP DB 7.2 and 7.3 -> Data definition ->
  CREATE TABLE statement: Owner of a table)

- Multi version concurrency for OLTP
  (available with the object extension of SAPDB only)

- Hot stand by
  (planned to be offered in one of the coming releases)

So MVCC seems to be a bolt-on feature there.

---------------------
Hannu


Mikheev, Vadim

unread,
May 24, 2001, 8:27:46 PM5/24/01
to
> > > >Oracle has MVCC?
> > >
> > > With restrictions, yes.
> >
> > What restrictions? Rollback segments size?
> > Non-overwriting smgr can eat all disk space...
>
> Is'nt the same true for an overwriting smgr ? ;)

Removing dead records from rollback segments should
be faster than from datafiles.

> > > You didn't know that? Vadim did ...
> >
> > Didn't I mention a few times that I was
> > inspired by Oracle? -:)
>
> How does it do MVCC with an overwriting storage manager ?

1. A System Change Number (SCN) is used: the system increments it
on each transaction commit.
2. When a scan meets a data block whose SCN is greater than the SCN
at the time the query/transaction started, the old block image is
restored using the rollback segments.
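A minimal sketch of that two-step scheme, assuming a simple in-memory version
chain (the Block class and its fields are invented for illustration; this is
neither Oracle's nor PostgreSQL's actual structure):

class Block:
    def __init__(self, data, scn, undo=None):
        self.data = data     # current contents
        self.scn = scn       # SCN of the last transaction that modified it
        self.undo = undo     # before-image (an older Block) kept in the rollback segment

def read_consistent(block, snapshot_scn):
    """Return the block contents as of snapshot_scn."""
    version = block
    while version is not None and version.scn > snapshot_scn:
        version = version.undo           # walk back to an older image
    if version is None:
        raise RuntimeError("snapshot too old: before-image already recycled")
    return version.data

# The block was modified at SCN 50 and again at SCN 90; the reader started at SCN 60.
v1 = Block("balance=100", scn=50)
v2 = Block("balance=80", scn=90, undo=v1)
print(read_consistent(v2, snapshot_scn=60))   # -> balance=100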

> Could it possibly be a Postgres-inspired bolted-on hack
> needed for better concurrency ?

-:)) Oracle has had MVCC for years, probably from the beginning,
and certainly before Postgres.

> BTW, are you aware how Interbase does its MVCC - is it more
> like Oracle's way or like PostgreSQL's ?

Like ours.

Vadim


Philip Warner

unread,
May 24, 2001, 8:29:35 PM5/24/01
to
At 01:51 25/05/01 +0500, Hannu Krosing wrote:
>
>How does it do MVCC with an overwriting storage manager ?
>

I don't know about Oracle, but Dec/RDB also does overwriting and MVCC. It
does this by taking a snapshot of pages that are participating in both RW
and RO transactions (Dec/RDB has options on SET TRANSACTION that specify
whether the TX will do updates or not). This has the disadvantage that the
snapshot grows quite large for bulk loads; typically snapshots are about
10-20% of DB size. Pages are freed from the snapshot as active TXs finish.

Note that the snapshots are separate from the journalling (WAL) and
rollback files.



Bruce Momjian

unread,
May 24, 2001, 9:10:55 PM5/24/01
to
> What I'm refering to is the discussion about the handling
> of subtransactions in order to introduce the savepoints
> functionality. Or do we postpone *savepoints* again ?
>
> I realize now few people have had the idea how to switch
> to an overwriting smgr. I don't think an overwriting smgr
> could be achived at once and we have to prepare one by one
> for it. AFAIK there's no idea how to introduce an overwriting
> smgr without UNDO. If we avoid UNDO now when overwriting smgr
> would appear ? I also think that the problems Andreas has
> specified are pretty serious. I also have known the problems
> and I've expected that people have the idea to solve it but
> ... I'm inclined to give up an overwriting smgr now.

Now that everyone has commented on the UNDO issue, I thought I would try
to summarize the comments so we can come to some kind of conclusion.

Here are the issues as I see them:

---------------------------------------------------------------------------

Do we want to keep MVCC?

Yes. No one has said otherwise.

---------------------------------------------------------------------------

Do we want to head for an overwriting storage manager?

Not sure.

Advantages: UPDATE has easy space reuse because usually done in-place,
no index change on UPDATE unless key is changed.

Disadvantages: Old records have to be stored somewhere for MVCC use.
Could limit transaction size.

---------------------------------------------------------------------------

Do we want UNDO _if_ we are heading for an overwriting storage manager?

Everyone seems to say yes.

---------------------------------------------------------------------------

Do we want UNDO if we are _not_ heading for an overwriting storage
manager?

This is the tough one. UNDO advantages are:

Makes subtransactions easier by rolling back an aborted subtransaction.
The workaround is using a new transaction id for each subtransaction.

Easy space reuse for aborted transactions.

Reduced size of pg_log.

UNDO disadvantages are:

Limits the size of transactions to the log storage size.

---------------------------------------------------------------------------

If we are heading for an overwriting storage manager, we may as well get
UNDO now. If we are not, then we have to decide if we can solve the
problems that UNDO would fix. Basically, can we solve those problems
more easily without UNDO, or are the disadvantages of UNDO too great?


Lincoln Yeoh

unread,
May 24, 2001, 10:35:18 PM5/24/01
to
At 10:00 AM 24-05-2001 -0700, Mikheev, Vadim wrote:
>> >> Impractical ? Oracle does it.
>> >
>> >Oracle has MVCC?
>>
>> With restrictions, yes.
>
>What restrictions? Rollback segments size?
>Non-overwriting smgr can eat all disk space...
>
>> You didn't know that? Vadim did ...
>
>Didn't I mention a few times that I was
>inspired by Oracle? -:)

Is there yet another way to do it, that could be better? Has Oracle
actually done it the best way for once? ;).

But as long as Postgresql doesn't get painted into a corner, it doesn't
matter so much to me - I believe you guys can do a good job (as long as
"it's ready when it's ready", not "when Marketing says so").

My worry is if suddenly it is better to do savepoints another way, but it
changes _usage_ and thus breaks apps. Or it makes Postgresql look really
ugly.

Right now Postgresql is fairly neat/clean with only a few exceptions
(BLOBs, VACUUM). Whereas Oracle is full of so many cases where things were
done wrong but had to be kept that way (and erm why VARCHAR2?). And full of
bits slapped on. So I sure hope postgresql doesn't end up like Oracle.
Because if I want a Frankenstein database I'll go for Oracle. Sure it's
powerful and all that, but it's damn ugly...

Take all the time you want to do it right, coz once Postgresql gets really
popular, your hands will be even more tied. When that happens it's better
to be tied to a nice spot eh?

Cheerio,
Link.

Zeugswetter Andreas SB

unread,
May 25, 2001, 4:07:09 AM5/25/01
to

> > >> Impractical ? Oracle does it.
> > >
> > >Oracle has MVCC?
> >
> > With restrictions, yes.
>
> What restrictions? Rollback segments size?

No, that is not the whole story. The problem with their "rollback segment" approach is
that they do not guard against overwriting a tuple version in the rollback segment.
They simply recycle each segment in a wrap-around manner.
Thus there can be an open transaction that still wants to see a tuple version
that was already overwritten, leading to the feared "snapshot too old" error.

Copying their "rollback segment" approach is imho the last thing we want to do.
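A tiny sketch of the failure mode described above, with a fixed number of
hypothetical before-image slots recycled in a ring (nothing here reflects
Oracle's actual on-disk layout):

from collections import OrderedDict

class RollbackSegment:
    def __init__(self, slots):
        self.slots = slots
        self.images = OrderedDict()        # scn -> before-image

    def save(self, scn, before_image):
        if len(self.images) >= self.slots:
            evicted_scn, _ = self.images.popitem(last=False)   # overwrite the oldest
            print(f"recycled before-image for SCN {evicted_scn}")
        self.images[scn] = before_image

    def fetch(self, scn):
        if scn not in self.images:
            raise RuntimeError("snapshot too old: before-image was recycled")
        return self.images[scn]

rbs = RollbackSegment(slots=2)
rbs.save(10, "row as of SCN 10")
rbs.save(20, "row as of SCN 20")
rbs.save(30, "row as of SCN 30")   # recycles the SCN 10 image
try:
    rbs.fetch(10)                  # a reader with an old snapshot loses
except RuntimeError as e:
    print(e)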

> Non-overwriting smgr can eat all disk space...
>
> > You didn't know that? Vadim did ...
>
> Didn't I mention a few times that I was inspired by Oracle? -:)

Looking at what they supply in the feature area is imho good.
Copying their technical architecture is not so good in general.

Andreas

Hannu Krosing

unread,
May 25, 2001, 5:12:49 AM5/25/01
to
"Mikheev, Vadim" wrote:
>
> > > > >Oracle has MVCC?
> > > >
> > > > With restrictions, yes.
> > >
> > > What restrictions? Rollback segments size?
> > > Non-overwriting smgr can eat all disk space...
> >
> > Is'nt the same true for an overwriting smgr ? ;)
>
> Removing dead records from rollback segments should
> be faster than from datafiles.

Is it for better locality or are they stored in a different way ?

Do you think that there is some fundamental performance advantage
in making a copy to rollback segment and then deleting it from
there vs. reusing space in datafiles ?

One thing (not having to updata non-changing index entries) can be
quite substantial under some scenarios, but there are probably ways
to at least speed up part of this by doing other compromizes, perhaps
by saving more info in index leaf (trading lookup speed for space
and insert speed) or chaining data pages (trading insert speed for
(some) space and lookup speed)

> > > > You didn't know that? Vadim did ...
> > >
> > > Didn't I mention a few times that I was
> > > inspired by Oracle? -:)
> >

> > How does it do MVCC with an overwriting storage manager ?
>

> 1. System Change Number (SCN) is used: system increments it
> on each transaction commit.
> 2. When scan meets data block with SCN > SCN as it was when
> query/transaction started, old block image is restored
> using rollback segments.

You mean it is restored in session that is running the transaction ?

I guess thet it could be slower than our current way of doing it.

> > Could it possibly be a Postgres-inspired bolted-on hack
> > needed for better concurrency ?
>
> -:)) Oracle has MVCC for years, probably from the beginning
> and for sure before Postgres.

In that case we can claim thet their way is more primitive ;) ;)

-----------------
Hannu


Mikheev, Vadim

unread,
May 25, 2001, 1:12:27 PM5/25/01
to
> > > >Oracle has MVCC?
> > >
> > > With restrictions, yes.
> >
> > What restrictions? Rollback segments size?
>
> No, that is not the whole story. The problem with their
> "rollback segment approach" is, that they do not guard against
> overwriting a tuple version in the rollback segment.
> They simply recycle each segment in a wrap around manner.
> Thus there could be an open transaction that still wanted to
> see a tuple version that was already overwritten, leading to the
> feared "snapshot too old" error.
>
> Copying their "rollback segment" approach is imho the last
> thing we want to do.

So, they limit the size of their rollback segments, while we don't limit
how big our datafiles may grow if there is some long-running
transaction in serializable mode. We could allow our rollback
segments to grow without limits as well.

> > Non-overwriting smgr can eat all disk space...
> >

> > > You didn't know that? Vadim did ...
> >
> > Didn't I mention a few times that I was inspired by Oracle? -:)
>

> Looking at what they supply in the feature area is imho good.
> Copying their technical architecture is not so good in general.

Copying is not inspiration -:)

Vadim

Mikheev, Vadim

unread,
May 25, 2001, 1:09:49 PM5/25/01
to
> Do we want to head for an overwriting storage manager?
>
> Not sure.
>
> Advantages: UPDATE has easy space reuse because usually done
> in-place, no index change on UPDATE unless key is changed.
>
> Disadvantages: Old records have to be stored somewhere for MVCC use.
> Could limit transaction size.

Really? Why is it assumed that we *must* limit the size of rollback segments?
We can let them grow without bounds, just as we now keep old records in
the datafiles and let them eat all of the disk space.

> UNDO disadvantages are:
>
> Limit size of transactions to log storage size.

Don't be kidding - in any system transaction size is limited
by the available storage. So we should just say that more disk space
is required for UNDO. From my POV, spending $100 to buy a 30 Gb
disk is not a big deal, keeping in mind that PGSQL costs
$ZERO to use.

Vadim


Mikheev, Vadim

unread,
May 25, 2001, 1:59:29 PM5/25/01
to
> > Removing dead records from rollback segments should
> > be faster than from datafiles.
>
> Is it for better locality or are they stored in a different way ?

Locality - all dead data would be localized in one place.

> Do you think that there is some fundamental performance advantage
> in making a copy to rollback segment and then deleting it from
> there vs. reusing space in datafiles ?

As WAL has shown, additional writes don't necessarily mean worse performance.
As for deleting from the RS (rollback segment) - we could remove or reuse
RS files as a whole.

> > > How does it do MVCC with an overwriting storage manager ?
> >
> > 1. System Change Number (SCN) is used: system increments it
> > on each transaction commit.
> > 2. When scan meets data block with SCN > SCN as it was when
> > query/transaction started, old block image is restored
> > using rollback segments.
>
> You mean it is restored in session that is running the transaction ?
>
> I guess thet it could be slower than our current way of doing it.

Yes, for older transactions which *really* need that *particular*
old data, but not for newer ones. Look - right now transactions have to read
dead data again and again, even if some of them (the newer ones) don't need to see
that data at all, and we keep dead data around as long as required for other
old transactions *just in case* they look there.
But who knows?! Maybe those old transactions will never read from the table
with the big amount of dead data at all! So - why keep dead data in the datafiles
for a long time? This obviously affects overall system performance.

Vadim


Hiroshi Inoue

unread,
May 27, 2001, 4:38:40 AM5/27/01
to
> -----Original Message-----
> From: Mikheev, Vadim [mailto:vmik...@SECTORBASE.COM]
>
> > Do we want to head for an overwriting storage manager?
> >
> > Not sure.
> >
> > Advantages: UPDATE has easy space reuse because usually done
> > in-place, no index change on UPDATE unless key is changed.
> >
> > Disadvantages: Old records have to be stored somewhere for MVCC use.
> > Could limit transaction size.
>
> Really? Why is it assumed that we *must* limit size of rollback segments?
> We can let them grow without bounds, as we do now keeping old records in
> datafiles and letting them eat all of disk space.
>

Is it proper/safe for a DBMS to let the system eat all disk
space? For example, could we expect to recover the database
even when no disk space is available?

1) even before WAL
Is 'deleting records and vacuum' always possible?
I have seen cases where indexes grow during vacuum.

2) under WAL (current)
If a DROP or VACUUM is done after a checkpoint, wouldn't
REDO recovery re-add the pages dropped/truncated by the
DROP/VACUUM?

3) with rollback data
Shouldn't WAL log UNDO operations too?
If so, UNDO requires extra disk space which could
be unboundedly large.

There's another serious problem. Once UNDO is required
for a very big rollback, it would take a very long time
to undo. That's quite different from the current behavior. And
even if people want to cancel the UNDO, unfortunately there's
no way to do so (under an overwriting smgr).

regards,
Hiroshi Inoue

Zeugswetter Andreas SB

unread,
May 28, 2001, 4:09:38 AM5/28/01
to

> > You mean it is restored in session that is running the transaction ?

Depends on what you mean by "restored". It first reads the heap page,
sees that it needs an older version, and thus reads that version from the "rollback segment".

> >
> > I guess thet it could be slower than our current way of doing it.
>
> Yes, for older transactions which *really* need in *particular*
> old data, but not for newer ones. Look - now transactions have to read
> dead data again and again, even if some of them (newer) need not to see
> those data at all, and we keep dead data as long as required for other
> old transactions *just for the case* they will look there.
> But who knows?! Maybe those old transactions will not read from table
> with big amount of dead data at all! So - why keep dead data in datafiles
> for long time? This obviously affects overall system performance.

Yes, that is a good description. And an old version is only required in the following
two cases:

1. the txn that modified this tuple is still open (reader in the default read committed mode)
2. the reader is in serializable transaction isolation and has an earlier xid

So it seems an overwriting smgr mainly has advantages in terms of speed for operations
other than rollback.

Andreas


Hannu Krosing

unread,
May 28, 2001, 5:11:52 AM5/28/01
to
Zeugswetter Andreas SB wrote:
>
> > > You mean it is restored in session that is running the transaction ?
>
> Depends on what you mean with restored. It first reads the heap page,
> sees that it needs an older version and thus reads it from the "rollback segment".

So are whole pages stored in rollback segments, or just the modified data?

Storing whole pages could be very wasteful for tables with small records
that are often modified.

---------------
Hannu


Don Baccus

unread,
May 28, 2001, 11:47:16 AM5/28/01
to
At 10:00 AM 5/24/01 -0700, Mikheev, Vadim wrote:
>> >> Impractical ? Oracle does it.
>> >
>> >Oracle has MVCC?
>>
>> With restrictions, yes.
>
>What restrictions? Rollback segments size?
>Non-overwriting smgr can eat all disk space...

Actually, the restriction I'm thinking about isn't MVCC related, per
se, but a within-transaction restriction. The infamous "mutating table"
error.

>> You didn't know that? Vadim did ...
>
>Didn't I mention a few times that I was
>inspired by Oracle? -:)

Yes, you most certainly have!


Vadim Mikheev

unread,
May 28, 2001, 2:05:25 PM5/28/01
to
> Yes, that is a good description. And old version is only required in the following
> two cases:
>
> 1. the txn that modified this tuple is still open (reader in default committed read)
> 2. reader is in serializable transaction isolation and has earlier xtid
>
> Seems overwrite smgr has mainly advantages in terms of speed for operations
> other than rollback.

... And rollback is required for < 5% of transactions ...

Vadim


Vadim Mikheev

unread,
May 28, 2001, 2:15:36 PM5/28/01
to
> > > > You mean it is restored in session that is running the transaction ?
> >
> > Depends on what you mean with restored. It first reads the heap page,
> > sees that it needs an older version and thus reads it from the "rollback segment".
>
> So are whole pages stored in rollback segments or just the modified data?

This is implementation dependent. Storing whole pages is much easier to do,
but obviously it's better to store just the modified data.

Vadim


Hannu Krosing

unread,
May 28, 2001, 2:06:54 PM5/28/01
to
Vadim Mikheev wrote:
>
> > Yes, that is a good description. And old version is only required in the following
> > two cases:
> >
> > 1. the txn that modified this tuple is still open (reader in default committed read)
> > 2. reader is in serializable transaction isolation and has earlier xtid
> >
> > Seems overwrite smgr has mainly advantages in terms of speed for operations
> > other than rollback.
>
> ... And rollback is required for < 5% transactions ...

This obviously depends on the application.

I know people who roll back most of their transactions (actually they
use rollback to emulate temp tables when reporting).

OTOH it is possible to do without rolling back at all, as the MySQL folks
have shown us ;)

Also, IIRC, pgbench does no rollbacks. I think we have no
performance test that does.

-----------------
Hannu

Zeugswetter Andreas SB

unread,
May 29, 2001, 4:16:46 AM5/29/01
to

> > > > > You mean it is restored in session that is running the transaction ?
> > >
> > > Depends on what you mean with restored. It first reads the heap page,
> > > sees that it needs an older version and thus reads it from the "rollback segment".
> >
> > So are whole pages stored in rollback segments or just the modified data?
>
> This is implementation dependent. Storing whole pages is much easy to do,
> but obviously it's better to store just modified data.

I am not sure it is necessarily better. There seems to be a tradeoff here.
Pros of whole pages:
  - a possible merge with the physical log (for the first modification of a page after a
    checkpoint there would be no overhead compared to now, since that page image is
    already written today)
  - in a clever implementation a page already in the "rollback segment" might satisfy
    the modification of another row on that page, and thus would not need any
    additional IO
Andreas

Mikheev, Vadim

unread,
May 29, 2001, 1:47:08 PM5/29/01
to
> > > So are whole pages stored in rollback segments or just
> > > the modified data?
> >
> > This is implementation dependent. Storing whole pages is
> > much easy to do, but obviously it's better to store just
> > modified data.
>
> I am not sure it is necessarily better. Seems to be a tradeoff here.
> pros of whole pages:
> a possible merge with physical log (for first
> modification of a page after checkpoint
> there would be no overhead compared to current
> since it is already written now)

Using WAL as RS data storage is questionable.

> in a clever implementation a page already in the
> "rollback segment" might satisfy the
> modification of another row on that page, and
> thus would not need any additional io.

This would be possible only if there was no commit (same SCN)
between two modifications.

But aren't we getting too deep into the overwriting smgr (O-smgr) implementation?
It's doable. It has advantages in terms of the IO active transactions
must do to follow MVCC. It has drawbacks in terms of the required
disk space (and, oh yes, it's not easy to implement -:)).
So, any other opinions about the value of an O-smgr?

Vadim

---------------------------(end of broadcast)---------------------------

Mikheev, Vadim

unread,
May 29, 2001, 1:58:02 PM5/29/01
to
> > > Seems overwrite smgr has mainly advantages in terms of
> > > speed for operations other than rollback.
> >
> > ... And rollback is required for < 5% transactions ...
>
> This obviously depends on application.

The small number of aborted transactions was used to argue the
uselessness of UNDO for space cleanup - that's why
I use the same argument to show the usefulness of an O-smgr -:)

> I know people who rollback most of their transactions
> (actually they use it to emulate temp tables when reporting).

Shouldn't they use TEMP tables? -:)

> OTOH it is possible to do without rolling back at all as
> MySQL folks have shown us ;)

Not with SDB tables which support transactions.

Vadim


Don Baccus

unread,
May 29, 2001, 2:09:22 PM5/29/01
to
At 10:49 AM 5/29/01 -0700, Mikheev, Vadim wrote:

>> I know people who rollback most of their transactions
>> (actually they use it to emulate temp tables when reporting).
>
>Shouldn't they use TEMP tables? -:)

Which is a very good point. Pandering to poor practice at the
expense of good performance for better-designed applications
isn't a good idea.


Hannu Krosing

unread,
May 29, 2001, 4:35:45 PM5/29/01
to
"Mikheev, Vadim" wrote:
>
> > I know people who rollback most of their transactions
> > (actually they use it to emulate temp tables when reporting).
>
> Shouldn't they use TEMP tables? -:)

They probably should.

Actually they did it on Oracle, so it shows that it can be done
even with O-smgr ;)

> > OTOH it is possible to do without rolling back at all as
> > MySQL folks have shown us ;)
>
> Not with SDB tables which support transactions.

My point was that MySQL was used for quite a long time without it,
and still quite a few useful applications were produced.

BTW, do you know what strategy is used by BSDDB/SDB for
rollback/undo?

---------------
Hannu

Mikheev, Vadim

unread,
May 29, 2001, 5:11:10 PM5/29/01
to
> > > OTOH it is possible to do without rolling back at all as
> > > MySQL folks have shown us ;)
> >
> > Not with SDB tables which support transactions.
>
> My point was that MySQL was used quite a long time without it
> and still quite many useful applications were produced.

And my point was that it's needless to talk about rollbacks in a
non-transactional system, and in a transactional system one has to
implement rollback somehow.

> BTW, do you know what strategy is used by BSDDB/SDB for
> rollback/undo ?

AFAIR, they use O-smgr => UNDO is required.

Vadim

