
[HACKERS] DROP TABLE inside a transaction block


Tatsuo Ishii

Mar 5, 2000, 3:00:00 AM
I see following in HISTORY:

Disallow DROP TABLE/DROP INDEX inside a transaction block

However, it seems that this is not done with current?

test=# create table t1(i int);
CREATE
test=# begin;
BEGIN
test=# drop table t1;
NOTICE: Caution: DROP TABLE cannot be rolled back, so don't abort now
DROP
test=# end;
COMMIT
test=# \d
No relations found.
--
Tatsuo Ishii

************


Peter Eisentraut

Mar 5, 2000, 3:00:00 AM
Tatsuo Ishii writes:

> I see following in HISTORY:
>
> Disallow DROP TABLE/DROP INDEX inside a transaction block
>
> However, it seems that this is not done with current?
>
> test=# create table t1(i int);
> CREATE
> test=# begin;
> BEGIN
> test=# drop table t1;
> NOTICE: Caution: DROP TABLE cannot be rolled back, so don't abort now

Wow, with all due respect, that's pretty sh^H^Hpoor. That's like saying
"Haha, either you commit your transaction or your database is fried." Any
reason that's not an ERROR before anything destructive is done?

> DROP
> test=# end;
> COMMIT
> test=# \d
> No relations found.
> --
> Tatsuo Ishii
>
> ************
>
>

--
Peter Eisentraut Sernanders väg 10:115
pet...@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden


************


Mike Mascari

Mar 6, 2000, 3:00:00 AM
Peter Eisentraut wrote:
>
> Tatsuo Ishii writes:
>
> > I see following in HISTORY:
> >
> > Disallow DROP TABLE/DROP INDEX inside a transaction block
> >
> > However, it seems that this is not done with current?
> >
> > test=# create table t1(i int);
> > CREATE
> > test=# begin;
> > BEGIN
> > test=# drop table t1;
> > NOTICE: Caution: DROP TABLE cannot be rolled back, so don't abort now
>
> Wow, with all due respect, that's pretty sh^H^Hpoor. That's like saying
> "Haha, either you commit your transaction or your database is fried." Any
> reason that's not an ERROR before anything destructive is done?
>
> > DROP
> > test=# end;
> > COMMIT
> > test=# \d
> > No relations found.

We had an elaborate discussion on this very topic several months
ago. What it comes down to is three possible options:

1) Allow DDL statements in transactions. If the transaction
aborts, currently, corruption can result. Some DDL statements
(such as TRUNCATE) make no sense with respect to ROLLBACK. So, I
guess, the idea is that SOME DDL statements will be ROLLBACK-able
and some won't - yuck.

2) Disallow DDL statements in transactions. This would break code
that works for people today, but only because their transactions
happen to be committed between the time they issue the DDL statement
and the COMMIT (or END), instead of aborting and causing their
database to become corrupt, or requiring manual removal of files
when the catalogue gets out of sync with the filesystem.

3) Implicitly commit the running transaction and begin a new one.
Only Vadim and I support this notion, although this is precisely
what Oracle does (not that that should define PostgreSQL's
behavior, of course). Everyone else, it seems, wants to try to
implement #1 successfully... (I don't see it happening any time
soon).

So, as some sort of compromise, a NOTICE was issued.

Mike Mascari

************


Peter Eisentraut

Mar 6, 2000, 3:00:00 AM
On Sun, 5 Mar 2000, Mike Mascari wrote:

> 1) Allow DDL statements in transactions. If the transaction
> aborts, currently, corruption can result. Some DDL statements

^^^^^^^^^^^^^^^^^^^^^
I think those are the key words.

> (such as TRUNCATE) make no sense with respect to ROLLBACK. So, I
> guess, the idea is that SOME DDL statements will be ROLLBACK-able
> and some won't - yuck.

I don't see a problem with disallowing some DDL commands in a transaction
as long as they throw an error and the transaction aborts. Users see this
and don't do it next time. Sure it's inconsistent but the current state is
plain bad, sorry.

> 3) Implicitly commit the running transaction and begin a new one.
> Only Vadim and I support this notion, although this is precisely
> what Oracle does (not that that should define PostgreSQL's
> behavior, of course). Everyone else, it seems wants to try to
> implement #1 successfully...(I don't see it happening any time
> soon).

I support that too since it also happens to be SQL's idea more or less.
One of these days we'll have to offer this as an option. At least for
commands for which #1 doesn't work yet.

Philip Warner

Mar 6, 2000, 3:00:00 AM
At 07:59 6/03/00 +0100, Peter Eisentraut wrote:
>> (such as TRUNCATE) make no sense with respect to ROLLBACK. So, I
>> guess, the idea is that SOME DDL statements will be ROLLBACK-able
>> and some won't - yuck.
>
>I don't see a problem with disallowing some DDL commands in a transaction
>as long as they throw an error and the transaction aborts.

Is it really necessary to abort the TX? Seems a little antisocial - can't
you just return an error, and let the user/application decide if it needs
to abort?


>> 3) Implicitly commit the running transaction and begin a new one.
>> Only Vadim and I support this notion, although this is precisely
>> what Oracle does (not that that should define PostgreSQL's
>> behavior, of course). Everyone else, it seems wants to try to
>> implement #1 successfully...(I don't see it happening any time
>> soon).
>
>I support that too since it also happens to be SQL's idea more or less.
>One of these days we'll have to offer this as an option. At least for
>commands for which #1 doesn't work yet.

Do you really mean it when you say 'Implicitly commit the running
transaction'? If so, I would be deeply opposed to this philosophically. No
TX should ever be committed unless the user requests it.

Just my 0.02c

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: +61-03-5367 7422 | _________ \
Fax: +61-03-5367 7430 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

************


Mike Mascari

Mar 6, 2000, 3:00:00 AM
Philip Warner wrote:
>
> At 07:59 6/03/00 +0100, Peter Eisentraut wrote:
> >> (such as TRUNCATE) make no sense with respect to ROLLBACK. So, I
> >> guess, the idea is that SOME DDL statements will be ROLLBACK-able
> >> and some won't - yuck.
> >
> >I don't see a problem with disallowing some DDL commands in a transaction
> >as long as they throw an error and the transaction aborts.
>
> Is it really necessary to abort the TX? Seems a little antisocial - can't
> you just return an error, and let the user/application decide if it needs
> to abort?
>
> >> 3) Implicitly commit the running transaction and begin a new one.
> >> Only Vadim and I support this notion, although this is precisely
> >> what Oracle does (not that that should define PostgreSQL's
> >> behavior, of course). Everyone else, it seems wants to try to
> >> implement #1 successfully...(I don't see it happening any time
> >> soon).
> >
> >I support that too since it also happens to be SQL's idea more or less.
> >One of these days we'll have to offer this as an option. At least for
> >commands for which #1 doesn't work yet.
>
> Do you really mean it when you say 'Implicitly commit the running
> transaction'? If so, I would be deeply opposed to this philosophically. No
> TX should ever be committed unless the user requests it.
>
> Just my 0.02c

Philosophically, I agree with you 100%. And apparently, from the
previous discussion on this issue, databases like Informix are
completely capable of rolling back DDL statements like DROP
TABLE, ALTER TABLE RENAME, etc. However, the complexity involved
apparently was too much for Oracle:

"ORACLE implicitly commits the current transaction before and
after every Data Definition Language statement."

It's just my feeling that robustness is the number one priority
and that the current state is kind of "riding the fence" between
ORACLE and Informix. On either side of the fence, it is safe, but
in the middle, you risk corruption.

Naturally, I'd like to see PostgreSQL on the Informix side of the
fence, but I don't see it happening any time soon. And the ORACLE
side of the fence is far easier to implement. Or, of course, you
could choose Peter's suggestion, and disallow the DDL statement
entirely. But as soon as that happened, all those people that
begin their .cgi programs with BEGIN and end it with END will
say, "Hey, if we can't use DDL statements in transactions, can't
we at least do what Oracle does so we don't have to change our
code?"

Mike Mascari

************


Zeugswetter Andreas SB

Mar 6, 2000, 3:00:00 AM

> >> 3) Implicitly commit the running transaction and begin a new one.
> >> Only Vadim and I support this notion, although this is precisely
> >> what Oracle does (not that that should define PostgreSQL's
> >> behavior, of course). Everyone else, it seems wants to try to
> >> implement #1 successfully...(I don't see it happening any time
> >> soon).
> >
> >I support that too since it also happens to be SQL's idea more or less.
> >One of these days we'll have to offer this as an option. At least for
> >commands for which #1 doesn't work yet.
>
> Do you really mean it when you say 'Implicitly commit the running
> transaction'? If so, I would be deeply opposed to this philosophically. No
> TX should ever be committed unless the user requests it.

Yes, that was also the general consensus on the list.
No statement is ever going to do an implicit commit of
previous statements.

Andreas

************


Peter Eisentraut

Mar 6, 2000, 3:00:00 AM
On Mon, 6 Mar 2000, Philip Warner wrote:

> >I don't see a problem with disallowing some DDL commands in a transaction
> >as long as they throw an error and the transaction aborts.
>
> Is it really necessary to abort the TX? Seems a little antisocial - can't
> you just return an error, and let the user/application decide if it needs
> to abort?

I'm afraid yes, it is necessary. Either the whole transaction or none of
it. Anything else is opening a can of worms that you can't control unless
you have a Ph.D. in fancy databases or something. (Incidentally, I know
that a non-zero number of people around here have one of those, but that
won't help the rest of us much. :{ )

Peter Eisentraut

Mar 6, 2000, 3:00:00 AM
On Mon, 6 Mar 2000, Zeugswetter Andreas SB wrote:

> Yes, that was also the general consensus on the list. No statement is
> ever going to do an implicit commit of previous statements.

I can understand that, but one of these days I hope we can offer the SQL
semantics of transactions where you don't require a BEGIN. (*Optional*,
people.) In that case you have to do *something* about non-rollbackable
DDL (face it, there's always going to be one). Doing what Oracle does is
certainly not the *worst* one could do. Again, optional.

That still doesn't excuse the current behavior though.

Zeugswetter Andreas SB

Mar 6, 2000, 3:00:00 AM

> > Yes, that was also the general consensus on the list. No statement is
> > ever going to do an implicit commit of previous statements.
>
> I can understand that, but one of these days I hope we can offer the SQL
> semantics of transactions where you don't require a BEGIN.
> (*Optional*,people.) In that case you have to do *something* about
> non-rollbackable DDL (face it, there's always going to be one). Doing what

> Oracle does is certainly not the *worst* one could do. Again, optional.

Imho it *is* the worst one can do.
The only solutions that are also bad, but acceptable to me, would be:

1. disallow this DDL if there is any open DML in this tx
(allow it, and do the implicit commit, if only SELECT or DDL
statements have run since the tx opened)
2. handle this DDL outside any transaction scope, even if a tx is open

Implicitly committing previous DML with a DDL statement is imho out of
the question.

Not in the scope of this discussion is imho the "truncate" command,
since it is 1. not SQL92, 2. by definition a non-rollbackable statement,
and 3. probably rather a DML statement.

> That still doesn't excuse the current behavior though.

Agreed

Andreas

************


Mike Mascari

Mar 6, 2000, 3:00:00 AM
Peter Eisentraut wrote:
>
> On Mon, 6 Mar 2000, Zeugswetter Andreas SB wrote:
>
> > Yes, that was also the general consensus on the list. No statement is
> > ever going to do an implicit commit of previous statements.
>
> I can understand that, but one of these days I hope we can offer the SQL
> semantics of transactions where you don't require a BEGIN. (*Optional*,
> people.)

I often think that the current behavior with respect to BEGIN
hurts PostgreSQL's reputation with respect to speed. If the
default behavior were to begin a transaction at the first
non-SELECT DML statement, PostgreSQL wouldn't fare so poorly in
tests of:

INSERT INTO testspeed VALUES (1);
INSERT INTO testspeed VALUES (2);
INSERT INTO testspeed VALUES (3);
...
INSERT INTO testspeed VALUES (100000);

where the same .sql script, submitted against other databases, runs
inside a transaction and, as such, is not being committed
immediately to disk. Fortunately, the Ziff-Davis reviewer ran his
tests with fsync() off. But your run-of-the-mill enterprise
application developer is probably going to just install the
software via rpms and run their sql scripts against it.
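A common workaround today is to wrap the whole script in an explicit transaction, so all rows share a single commit; a sketch, reusing the hypothetical testspeed table from the test above:

```sql
-- One commit (and one flush to disk) for the whole batch,
-- instead of one per INSERT statement:
BEGIN;
INSERT INTO testspeed VALUES (1);
INSERT INTO testspeed VALUES (2);
-- ... remaining rows ...
INSERT INTO testspeed VALUES (100000);
COMMIT;
```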

> In that case you have to do *something* about non-rollbackable
> DDL (face it, there's always going to be one). Doing what Oracle does is
> certainly not the *worst* one could do. Again, optional.
>

> That still doesn't excuse the current behavior though.

I can certainly understand Andreas' viewpoint. If, however, no DDL
were allowed inside a transaction, or implicit commit could
optionally be turned on, imagine how much easier life
becomes in implementing ALTER TABLE DROP COLUMN, DROP TABLE, DROP
INDEX, etc., not having to worry about restoring filesystem files,
or deleting them in aborted CREATE TABLE/CREATE INDEX statements,
etc. A far-reaching idea would be to make use of foreign keys in
the system catalogue, with triggers used to add/rename/remove
relation files. That could be done if DDL statements could not be
executed in transactions. With AccessExclusive locks on the
appropriate relations, a host of race-condition related bugs
would disappear. And the complexity involved in dropping (or
perhaps disallowing the dropping of) related objects, such as
triggers and indexes, would be handled automatically.

Mike Mascari

************


Tom Lane

Mar 7, 2000, 3:00:00 AM
Mike Mascari <mas...@mascari.com> writes:
> So, as some sort of compromise, a NOTICE was issued.

It seems everybody but Mike has forgotten the previous go-round on
this issue. I had in fact put in an ERROR for DROP TABLE inside a
transaction block, and was beat up for it --- on the very reasonable
grounds that it's useful to be able to drop a table and do some other
things inside a transaction. Although we can't support rollback-ability
for such a transaction right now, we *do* support the atomic nature of
such a transaction. It's not reasonable to take away a capability that
was available in prior releases just because it's got deficiencies.
So the compromise was to issue a NOTICE instead of an ERROR.

BTW, we are not *that* far from being able to roll back a DROP TABLE.
The only thing that's really needed is for everyone to take a deep
breath and let go of the notion that table files ought to be named
after the tables. If we named table files after the OIDs of their
tables, then rollback-able DROP or RENAME TABLE would be pretty
straightforward. If you don't recall why this is, consult the
pghackers archives...

regards, tom lane

************


Tatsuo Ishii

Mar 7, 2000, 3:00:00 AM
> It seems everybody but Mike has forgotten the previous go-round on
> this issue. I had in fact put in an ERROR for DROP TABLE inside a
> transaction block, and was beat up for it --- on the very reasonable
> grounds that it's useful to be able to drop a table and do some other
> things inside a transaction. Although we can't support rollback-ability
> for such a transaction right now, we *do* support the atomic nature of
> such a transaction. It's not reasonable to take away a capability that
> was available in prior releases just because it's got deficiencies.
> So the compromise was to issue a NOTICE instead of an ERROR.
>
> BTW, we are not *that* far from being able to roll back a DROP TABLE.
> The only thing that's really needed is for everyone to take a deep
> breath and let go of the notion that table files ought to be named
> after the tables. If we named table files after the OIDs of their
> tables, then rollback-able DROP or RENAME TABLE would be pretty
> straightforward. If you don't recall why this is, consult the
> pghackers archives...

So what was the conclusion for 7.0?

> Disallow DROP TABLE/DROP INDEX inside a transaction block

We should remove above from HISTORY, no?
--
Tatsuo Ishii

************


Tom Lane

Mar 7, 2000, 3:00:00 AM
Tatsuo Ishii <t-i...@sra.co.jp> writes:
>> BTW, we are not *that* far from being able to roll back a DROP TABLE.
>> The only thing that's really needed is for everyone to take a deep
>> breath and let go of the notion that table files ought to be named
>> after the tables. If we named table files after the OIDs of their
>> tables, then rollback-able DROP or RENAME TABLE would be pretty
>> straightforward. If you don't recall why this is, consult the
>> pghackers archives...

> So what was the conclusion for 7.0?

Too late to consider it for 7.0, I think. I'd like to see it happen in
7.1 or 7.2 or so.

>> Disallow DROP TABLE/DROP INDEX inside a transaction block

> We should remove above from HISTORY, no?

Yes, it's not correct.

regards, tom lane

************


Ross J. Reedstrom

Mar 7, 2000, 3:00:00 AM
On Tue, Mar 07, 2000 at 02:53:49AM -0500, Tom Lane wrote:
> Mike Mascari <mas...@mascari.com> writes:
> > So, as some sort of compromise, a NOTICE was issued.
>
>
> BTW, we are not *that* far from being able to roll back a DROP TABLE.
> The only thing that's really needed is for everyone to take a deep
> breath and let go of the notion that table files ought to be named
> after the tables. If we named table files after the OIDs of their
> tables, then rollback-able DROP or RENAME TABLE would be pretty
> straightforward. If you don't recall why this is, consult the
> pghackers archives...

Another data point regarding table filenames: I've been looking into
what's needed to support SQL92 schemas.

The standard defines a hierarchy of what are essentially scopes for
database objects. They are (in order from high to low):

Cluster of catalogs -> catalog -> schema -> (DB objects)

"Cluster of catalogs" is defined as (section 4.13):

Exactly one cluster is associated with an SQL-session and it defines
the totality of the SQL-data that is available to that SQL-session.

A catalog is (section 4.12):

Catalogs are named collections of schemas in an SQL-environment. An
SQL-environment contains zero or more catalogs. A catalog contains
one or more schemas, but always contains a schema named
INFORMATION_SCHEMA that contains the views and domains of the
Information Schema.


catalog and schema names show up in the syntax, e.g. the BNF for table
names (section 5.4):

<table name> ::= <qualified name>

<qualified name> ::=
    [ <schema name> <period> ] <qualified identifier>

<qualified identifier> ::= <identifier>

<schema name> ::=
    [ <catalog name> <period> ] <unqualified schema name>

Which collapses to (using unique names for the various identifiers):

<table name> ::= [ <catalog identifier> <period> ]
    [ <schema identifier> <period> ] <table identifier>

and makes the BNF for a fully qualified column name:

[ <catalog identifier> <period> ] [ <schema identifier> <period> ]
[ <table identifier> <period> ] <column identifier>

so:
foo.bar.baz.bongo

is a well formed column identifier for column bongo of table baz in
schema bar in catalog foo.


What does all this mean for pgsql? Well, schemas are an Entry SQL
requirement, so the syntax schema.table needs to be supported. Both
schema and catalog define persistent visibility scopes, and we need to
support identical table names in multiple schemas.

I see two possibilities:

1) Map a pgsql database to a SQL schema.

Since we need to support identical table names in multiple schema,
it might be tempting to map a pgsql database to a schema. In fact,
since Entry SQL requires the syntax:

CREATE SCHEMA <schema authorization identifier>

And, in practice, the SCHEMA name seems to be equal to the database user
name, the pgsql default of creating (and accessing) a DB matching the
username implies this mapping.

However, that means we need to solve the one backend accessing multiple
DBs problem. I have a feeling that there may be 'gotchas' in the current
backend code that presume that all the tuples are coming from one DB.

2) Map pgsql DB -> SQL catalog

If we do this, the multiDB access problem can be pushed down the road,
since cross catalog access (<catalog name> in identifiers) is not
even required by Intermediate SQL, only Full SQL. In addition, almost
everything about catalogs is 'implementation defined' so we get to claim
them as done. ;-)

2a) However, if a single pgsql database is a catalog, then each DB needs
to be able to contain tables in multiple schema, potentially with the
identical table names. One solution would be to do what we do for DBs:
create separate subdirs for each schema, and put the table files in there.
Changes are probably isolated to the storage manager code, but I haven't
looked in detail.

2b) Another possibility is what Tom has suggested to solve the
DDL-statements-in-a-transaction problem: use some other unique identifier
for table filenames, perhaps based on OID. Then, supporting schemas
means supporting the syntax in the parser, and that's it, I think. This
would seem to minimize the changes needed to implement this Entry SQL92
requirement.

So, what do y'all think?

Ross
--
Ross J. Reedstrom, Ph.D., <reed...@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005


************


Bruce Momjian

Mar 8, 2000, 3:00:00 AM
> I see following in HISTORY:
>
> Disallow DROP TABLE/DROP INDEX inside a transaction block
>
> However, it seems that this is not done with current?
>
> test=# create table t1(i int);
> CREATE
> test=# begin;
> BEGIN
> test=# drop table t1;
> NOTICE: Caution: DROP TABLE cannot be rolled back, so don't abort now
> DROP
> test=# end;
> COMMIT
> test=# \d
> No relations found.
> --
> Tatsuo Ishii

OK, seems it is fixed. I will remove the item.


--
Bruce Momjian | http://www.op.net/~candle
pg...@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

************


Bruce Momjian

Mar 8, 2000, 3:00:00 AM

> Tatsuo Ishii writes:
>
> > I see following in HISTORY:
> >
> > Disallow DROP TABLE/DROP INDEX inside a transaction block
> >
> > However, it seems that this is not done with current?
> >
> > test=# create table t1(i int);
> > CREATE
> > test=# begin;
> > BEGIN
> > test=# drop table t1;
> > NOTICE: Caution: DROP TABLE cannot be rolled back, so don't abort now
>
> Wow, with all due respect, that's pretty sh^H^Hpoor. That's like saying
> "Haha, either you commit your transaction or your database is fried." Any
> reason that's not an ERROR before anything destructive is done?

I tried it and the ABORT worked, so I have no idea now what is happening
here.

Bruce Momjian

Mar 8, 2000, 3:00:00 AM
> So what was the conclusion for 7.0?
>
> > Disallow DROP TABLE/DROP INDEX inside a transaction block
>
> We should remove above from HISTORY, no?

Yes removed.

Bruce Momjian

Mar 8, 2000, 3:00:00 AM
> BTW, we are not *that* far from being able to roll back a DROP TABLE.
> The only thing that's really needed is for everyone to take a deep
> breath and let go of the notion that table files ought to be named
> after the tables. If we named table files after the OIDs of their
> tables, then rollback-able DROP or RENAME TABLE would be pretty
> straightforward. If you don't recall why this is, consult the
> pghackers archives...

The oid will be appended to the base file name.

Tom Lane

Mar 8, 2000, 3:00:00 AM
Bruce Momjian <pg...@candle.pha.pa.us> writes:
> I tried it and the ABORT worked, so I have no idea now what is happening
> here.

Is the table file still there after the ABORT? If not, it won't work
for long...

regards, tom lane

************


Tom Lane

Mar 8, 2000, 3:00:00 AM
Bruce Momjian <pg...@candle.pha.pa.us> writes:
>> BTW, we are not *that* far from being able to roll back a DROP TABLE.
>> The only thing that's really needed is for everyone to take a deep
>> breath and let go of the notion that table files ought to be named
>> after the tables. If we named table files after the OIDs of their
>> tables, then rollback-able DROP or RENAME TABLE would be pretty
>> straightforward. If you don't recall why this is, consult the
>> pghackers archives...

> The oid will be appended to the base file name.

If we do it that way, then RENAME TABLE will be kinda complicated...
not impossible, but is it worth it?

regards, tom lane

************


Bruce Momjian

Mar 8, 2000, 3:00:00 AM
> Bruce Momjian <pg...@candle.pha.pa.us> writes:
> > I tried it and the ABORT worked, so I have no idea now what is happening
> > here.
>
> Is the table file still there after the ABORT? If not, it won't work
> for long...

Oh, well.

Bruce Momjian

Mar 8, 2000, 3:00:00 AM

100% worth it. Ingres doesn't use table names in the file name, and
administration is a mess.

Lamar Owen

Mar 8, 2000, 3:00:00 AM
On Tue, 07 Mar 2000, Tom Lane wrote:
> Bruce Momjian <pg...@candle.pha.pa.us> writes:
> >> BTW, we are not *that* far from being able to roll back a DROP TABLE.
> >> The only thing that's really needed is for everyone to take a deep
> >> breath and let go of the notion that table files ought to be named
> >> after the tables. If we named table files after the OIDs of their
> >> tables, then rollback-able DROP or RENAME TABLE would be pretty
> >> straightforward. If you don't recall why this is, consult the
> >> pghackers archives...

> > The oid will be appended to the base file name.

> If we do it that way, then RENAME TABLE will be kinda complicated...
> not impossible, but is it worth it?

You know, I really hate to disagree with Bruce, but, Tom, you have a point.
IMHO, the benefits of having tables named by OID are going to be numerous --
the schema idea included. Of course, Bruce's concerns are good concerns as
well. What to do......

Why not do this:

Let the tables be named by OID, and only OID. Then, for admins' convenience,
put in a flat file that is updated periodically, similarly to pg_pwd being a
flat text dump of pg_shadow. Since there's going to have to be a system
table mapping table names to OID's anyway, a flat dump of said system table
should be similarly done as pg_pwd. Call it pg_realnames or something. Have
it have two columns: OID, and pathname (relative to PGDATA) of table.

This would help admins who might want to restore single tables: as long as
they have a snapshot of pg_realnames from the same time each table was dumped,
they can restore pg_realnames into a temp dir, and then the actual table can be
restored by its OID (of course, its OID might have changed in the interim,
but you would simply restore it on top of the OID that that table now maps to).

Besides, the OID for the table itself is not likely to change often.

Bruce, would this allay some of your (entirely valid) concerns? (I read the
thread about this the first time around, when Vadim said the table names
_would_ go to OID's.)

Just my two cents.

--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

************


The Hermit Hacker

Mar 8, 2000, 3:00:00 AM
On Tue, 7 Mar 2000, Lamar Owen wrote:

> Let the tables be named by OID, and only OID. Then, for admins' convenience,
> put in a flat file that is updated periodically, similarly to pg_pwd being a
> flat text dump of pg_shadow. Since there's going to have to be a system
> table mapping table names to OID's anyway, a flat dump of said system table
> should be similarly done as pg_pwd. Call it pg_realnames or something. Have
> it have two columns: OID, and pathname (relative to PGDATA) of table.

This I would be against ... I personally hate the whole pg_hba.conf,
pg_pwd, etc 'flatfiles' ...

But, could there not be some way of 'extracting extended data' from the
backend? ie. some sort of \d command that would provide you with
tablename+path+disk size+??

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scr...@hub.org secondary: scrappy@{freebsd|postgresql}.org


************


Bruce Momjian

Mar 8, 2000, 3:00:00 AM
> > I will fight this to my death. :-)
> > I have cursed Ingres every time I needed to look at the Ingres data
> > directory to find out which tables match which files. Even a lookup
> > file is a pain. Right now, I can do ls -l to see which tables are
> > taking disk space.
>
> I had Ingres also, and found their scheme to be a royal pain. But that
> was really only because they had such a *bad* schema that I'd have to
> poke around forever to reconstruct a query which would give me file
> names and table names. And then I'd have to print that and compare
> that to the directories which were buried way down in a directory
> tree.
>
> But with Postgres, we can write a utility to do this for us, so I
> think that it isn't so much of an issue. In fact, perhaps we could
> have a backend function which could do this, so we could query the
> sizes directly.

Does not work if the table was accidentally deleted. Also requires the
backend to be running.

Thomas Lockhart

Mar 8, 2000, 3:00:00 AM
> I will fight this to my death. :-)
> I have cursed Ingres every time I needed to look at the Ingres data
> directory to find out which tables match which files. Even a lookup
> file is a pain. Right now, I can do ls -l to see which tables are
> taking disk space.

I had Ingres also, and found their scheme to be a royal pain. But that
was really only because they had such a *bad* schema that I'd have to
poke around forever to reconstruct a query which would give me file
names and table names. And then I'd have to print that and compare
that to the directories which were buried way down in a directory
tree.

But with Postgres, we can write a utility to do this for us, so I
think that it isn't so much of an issue. In fact, perhaps we could
have a backend function which could do this, so we could query the
sizes directly.
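The name-to-file mapping half of such a utility could be little more than a catalog query; a rough sketch, assuming files were named by the table's OID (hypothetical, since files are still named after tables today):

```sql
-- List each user table with the OID that would serve as its on-disk file name:
SELECT relname, oid AS file_name
FROM pg_class
WHERE relkind = 'r'
ORDER BY relname;
```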

- Thomas

--
Thomas Lockhart lock...@alumni.caltech.edu
South Pasadena, California

************


Philip Warner

Mar 8, 2000, 3:00:00 AM
At 21:57 7/03/00 -0500, Lamar Owen wrote:
>Let the tables be named by OID, and only OID. Then, for admins' convenience,
>put in a flat file that is updated periodically, similarly to pg_pwd being a
>flat text dump of pg_shadow. Since there's going to have to be a system
>table mapping table names to OID's anyway, a flat dump of said system table
>should be similarly done as pg_pwd. Call it pg_realnames or something. Have
>it have two columns: OID, and pathname (relative to PGDATA) of table.

For the ignorant, are you able to explain why naming files
'<table_name>_<OID>' is not acceptable? This seems to satisfy both
requirements (and seemed to be the conclusion of the previous discussion).

I presume I have missed something, and assume there is a good reason for
the '<OID>' naming convention, so if that is the final choice, would it be
hard to have a file header containing details about the table/index/thing
the file contains and its OID? In this way, a future pg_dumpfile
command could tell us what our backed-up file '1FA12347.dat' is supposed to
contain.

Chris Bitmead

Mar 8, 2000, 3:00:00 AM
Bruce Momjian wrote:

> I will fight this to my death. :-)
>
> I have cursed Ingres every time I needed to look at the Ingres data
> directory to find out which tables match which files. Even a lookup
> file is a pain. Right now, I can do ls -l to see which tables are
> taking disk space.

Assuming a script "tableoid", is..

ls -l `tableoid foobar`

or

tableoid | xargs ls -l

so bad?
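Such a wrapper is easy to sketch. Here is a minimal illustration in Python, assuming the relname-to-OID mapping would come from querying pg_class (the `lookup` dict below stands in for that query, and the data-directory layout shown is invented):

```python
import os

def table_file(data_dir, relname, lookup):
    """Resolve a logical table name to its OID-named storage file.

    `lookup` stands in for a pg_class query (relname -> oid); the
    directory layout used here is purely illustrative.
    """
    return os.path.join(data_dir, str(lookup[relname]))

# Hypothetical pg_class contents for the example.
lookup = {"foobar": 27914}
path = table_file("/usr/local/pgsql/data/base/test", "foobar", lookup)
print(path)
```

With something like that in hand, `ls -l` on the returned path gives the same per-table disk-usage view as before.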

************


Tom Lane

Mar 8, 2000, 3:00:00 AM
Philip Warner <p...@rhyme.com.au> writes:
> For the ignorant, are you able to explain why naming files
> '<table_name>_<OID>' is not acceptable? This seems to satisfy both
> requirements (and seemed to be the conclusion of the previous discussion).

Well, it's pretty simple: consider what has to happen to make RENAME
TABLE be rollback-able.

You clearly have to update the pg_class tuple whose relname field
contains the table name. That's no problem, because the normal
tuple commit mechanics will take care of making that tuple update
visible or not.

But, in the current implementation, renaming a table also requires
renaming the physical files that hold the table's data --- and last
I checked, Unix filesystems don't know anything about Postgres
transactions. Our current code renames the files instantly when
the table rename command is done, and there isn't any code for
undoing that rename. Thus, aborting the xact afterwards fails, because
the pg_class entries revert to their pre-xact values, but the physical
files don't revert to their prior names.

If we change the implementation so that the files are named after
the (fixed, never-changed-after-creation) table OID, then RENAME
TABLE is no problem: it affects *nothing* except the relname field
of the table's pg_class row, and either that row update is committed
or it ain't.

But if the physical file names contain the logical table name, we
have to be prepared to rename those files in sync with the transaction
commit that makes the pg_class update valid. Quite aside from any
implementation effort involved, the critical point is this: it is
*not possible* to ensure that that collection of changes is atomic.
At best, we can make the window for failure small.

Bruce seems to be willing to accept a window of failure for RENAME
TABLE in order to make database admin easier. That is very possibly
the right tradeoff --- but it is *not* an open-and-shut decision.
We need to talk about it.

regards, tom lane

************


Bruce Momjian

Mar 8, 2000, 3:00:00 AM

Give me a reason why we shouldn't put the table name in the file name.

Philip Warner

Mar 8, 2000, 3:00:00 AM
At 01:54 8/03/00 -0500, Tom Lane wrote:
>Philip Warner <p...@rhyme.com.au> writes:
>> For the ignorant, are you able to explain why naming files
>> '<table_name>_<OID>' is not acceptable? This seems to satisfy both
>> requirements (and seemed to be the conclusion of the previous discussion).
>
>Well, it's pretty simple: consider what has to happen to make RENAME
>TABLE be rollback-able.
...etc

Sorry for the stupid question. I was confusing the previous discussions
over 'DROP COLUMN' with this one, without actually engaging my brain.

Your response was admirably patient.

FWIW, without a 'storage area' or 'table space' concept, I agree that file
names based on OIDs are TWTG (the way to go).

Bruce Momjian

Mar 8, 2000, 3:00:00 AM
> If we change the implementation so that the files are named after
> the (fixed, never-changed-after-creation) table OID, then RENAME
> TABLE is no problem: it affects *nothing* except the relname field
> of the table's pg_class row, and either that row update is committed
> or it ain't.
>
> But if the physical file names contain the logical table name, we
> have to be prepared to rename those files in sync with the transaction
> commit that makes the pg_class update valid. Quite aside from any
> implementation effort involved, the critical point is this: it is
> *not possible* to ensure that that collection of changes is atomic.
> At best, we can make the window for failure small.
>
> Bruce seems to be willing to accept a window of failure for RENAME
> TABLE in order to make database admin easier. That is very possibly
> the right tradeoff --- but it is *not* an open-and-shut decision.
> We need to talk about it.

How about creating a hard link during RENAME, and you can just remove
the old link on commit or remove the new link on transaction rollback?

We can register this in the at_exit processing too if you think it is
necessary to clean it up on a backend crash that never gets to an abort,
though I think abort is always called.
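A sketch of that hard-link idea with plain files (the names are invented; os.link creates a second directory entry for the same file):

```python
import os
import tempfile

def begin_rename(dirname, old, new):
    """Start a rename: make the new name a hard link to the same file."""
    os.link(os.path.join(dirname, old), os.path.join(dirname, new))

def commit_rename(dirname, old):
    os.unlink(os.path.join(dirname, old))   # keep new name, drop old link

def rollback_rename(dirname, new):
    os.unlink(os.path.join(dirname, new))   # drop new link, keep old name

d = tempfile.mkdtemp()
open(os.path.join(d, "t1"), "w").close()
begin_rename(d, "t1", "t2")
rollback_rename(d, "t2")                    # abort: only "t1" survives
print(os.listdir(d))
```

The atomicity objection still applies, of course: the unlink and the pg_class commit remain two separate operations, so a crash between them leaves both names on disk.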

Mike Mascari

Mar 8, 2000, 3:00:00 AM
Can I throw one more question out there on this subject?

There's something that I view as inconsistent behavior with
respect to DDL statements and MVCC and was wondering if this
would have any impact on the discussion (the following is with
6.5.3):

Session #1:

emptoris=> begin;
BEGIN
emptoris=> select * from test;
value
-----
1
(1 row)

Session #2:

emptoris=> begin;
BEGIN
emptoris=> select * from test;
value
-----
1
(1 row)

Session #1:

emptoris=> drop table test;
DROP

Session #2:

emptoris=> select * from test;
ERROR: mdopen: couldn't open test: No such file or directory

Now it would seem to me that if DROP TABLE is going to be
ROLLBACK-able, then Session #2, in a MVCC environment should
never see:

ERROR: mdopen: couldn't open test: No such file or directory

but it does, because the "effect" of the drop table is an action
that is seen by all sessions, as though it were "committed". So I
am now wondering, are there any
Multi-Versioning/Multi-Generational RDBMS that support
ROLLBACK-able DDL statements in transactions...

Just curious,

Mike Mascari

************


Tom Lane

Mar 8, 2000, 3:00:00 AM
Mike Mascari <mas...@mascari.com> writes:
> Now it would seem to me that if DROP TABLE is going to be
> ROLLBACK-able, then Session #2, in a MVCC environment should
> never see:

> ERROR: mdopen: couldn't open test: No such file or directory

Check. We didn't say this worked yet ;-)

regards, tom lane

************


Philip Warner

Mar 8, 2000, 3:00:00 AM
>but it does, because the "effect" of the drop table is an action
>that is seen by all sessions, as though it were "committed". So I
>am now wondering, are there any
>Multi-Versioning/Multi-Generational RDBMS that support
>ROLLBACK-able DDL statements in transactions...
>

Dec/Rdb for one. They do, however, make their lives easier by 'locking the
metadata' when a user does a select. This means the 'drop table' would hang
until the first user commits. I think it even hangs until the first user
exits - basically if they have referenced the table, you can't touch it
until they exit. But they do allow rollback on all DDL statements.

They do not allow rollback on 'management' functions like moving storage
areas (where one or more tables are stored) across disks, doing vacuum-like
functions, etc.

Tom Lane

Mar 8, 2000, 3:00:00 AM
Bruce Momjian <pg...@candle.pha.pa.us> writes:
>> Bruce seems to be willing to accept a window of failure for RENAME
>> TABLE in order to make database admin easier. That is very possibly
>> the right tradeoff --- but it is *not* an open-and-shut decision.
>> We need to talk about it.

> How about creating a hard link during RENAME, and you can just remove
> the old link on commit or remove the new link on transaction rollback?

Still non-atomic as far as I can see...

regards, tom lane

************


Hiroshi Inoue

Mar 8, 2000, 3:00:00 AM
> -----Original Message-----
> From: owner-pgs...@postgreSQL.org
> [mailto:owner-pgs...@postgreSQL.org]On Behalf Of Tom Lane
>
> Philip Warner <p...@rhyme.com.au> writes:
> > For the ignorant, are you able to explain why naming files
> > '<table_name>_<OID>' is not acceptable? This seems to satisfy both
> > requirements (and seemed to be the conclusion of the previous
> discussion).
>
> Well, it's pretty simple: consider what has to happen to make RENAME
> TABLE be rollback-able.
>

Is it necessary to get the relation path name from the relation name/oid etc.
each time?
Is it bad to keep the relation path name in pg_class (or another relation)?
If a new vessel is needed for copying existent tuples, we have to allocate
another unique path name; otherwise we can use the already-allocated file name.
And is it good to decide the unique path name from the oid/relname etc.?

Regards.

Hiroshi Inoue
In...@tpf.co.jp


************


Tom Lane

Mar 8, 2000, 3:00:00 AM
"Hiroshi Inoue" <In...@tpf.co.jp> writes:
> Is it necessary to get the relation path name from the relation name/oid
> etc. each time?
> Is it bad to keep the relation path name in pg_class (or another relation)?

Hmm, we could maybe do that for user relations, but it obviously would
not work for pg_class itself. I'm a little worried about trying to do
it for the other critical system relations, too. We'd want to keep the
relation's pathname in its relcache entry, so any system relation that
is read while setting up a relcache entry has to have a fixed path that
can be determined without a relcache entry.

Perhaps it would be good enough to say that all system relations live in
the database's primary directory, and only user relations have pathnames
specified in their pg_class entries. Renaming a system table would be
a Really Bad Idea anyway ;-)

regards, tom lane

************


Mark Hollomon

Mar 8, 2000, 3:00:00 AM
Tom Lane wrote:
>
>
> If we change the implementation so that the files are named after
> the (fixed, never-changed-after-creation) table OID, then RENAME
> TABLE is no problem: it affects *nothing* except the relname field
> of the table's pg_class row, and either that row update is committed
> or it ain't.
>
> But if the physical file names contain the logical table name, we
> have to be prepared to rename those files in sync with the transaction
> commit that makes the pg_class update valid. Quite aside from any
> implementation effort involved, the critical point is this: it is
> *not possible* to ensure that that collection of changes is atomic.
> At best, we can make the window for failure small.
>

How about using hard-links? The transaction that created the change
would see the new link along with the new tuple. other transactions
would see the old directory and the old tuple. rollback drops the new
tuple and the new directory entry. Commit does the obvious.

Does WinNT have something similar to a hard link?
--

Mark Hollomon
m...@nortelnetworks.com
ESN 451-9008 (302)454-9008

************


The Hermit Hacker

Mar 8, 2000, 3:00:00 AM
On Wed, 8 Mar 2000, Bruce Momjian wrote:

> > But with Postgres, we can write a utility to do this for us, so I
> > think that it isn't so much of an issue. In fact, perhaps we could
> > have a backend function which could do this, so we could query the
> > sizes directly.
>

> Does not work if the table was accidentally deleted. Also requires the
> backend to be running.

For ppl who aim to provide data integrity, we sure have a
lot of "if the table was accidentally deleted" problems with poor
solutions, no? :)

IMHO, we are basically supporting ppl *not* doing regular backups of their
data ... most, if not all, of the problems that ppl feel require
the use of 'flat files', IMHO, aren't big problems if proper
backup procedures are followed ...

Lamar Owen

Mar 8, 2000, 3:00:00 AM
The Hermit Hacker wrote:
> IMHO, we are basically supporting ppl *not* doing regular backups of their
> data ... most, if not all, of the problems that ppl feel require
> the use of 'flat files', IMHO, aren't big problems if proper
> backup procedures are followed ...

I suggested the 'flat-file' more as a compromise than anything else.
(Although it kind of backfired :-( ). Technically speaking, my
'flat-file' trades the flat-file in the OS's filesystem (the
directory) for a separate flat-file. Little to no admin difference
from my point of view.

The problem that Bruce is talking about occurs when you try to restore
(from a properly built off-line binary backup) a single table or small
set of tables. It doesn't have anything to do with supporting people
who won't do proper backups, IMO.

Of course, I personally use on-line pg_dump backups and feed into psql
for on-line restore -- which doesn't require knowing anything about the
underlying filesystem structures.

So, the dichotomy is between those who want to admin at the OS file
level versus those who feel the backend should hide all those details
from the admin. At least that's how istm.

Ross J. Reedstrom

Mar 8, 2000, 3:00:00 AM

And there doesn't seem to be an obvious way to extend it to the DROP
TABLE case. Hmm, on second thought, to rollback DROP TABLE we'll need to
'hide' the table from the current transaction: one way would be to rename
it, then do the drop at commit time.
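That hide-then-drop scheme can be sketched with plain files (the hidden-name convention here is invented): rename the file out of the way at DROP time, unlink it at commit, or rename it back on abort.

```python
import os
import tempfile

def drop_table(dirname, name):
    """Hide the file instead of unlinking it, so the drop can be undone."""
    hidden = ".dropped_" + name            # invented naming convention
    os.rename(os.path.join(dirname, name), os.path.join(dirname, hidden))
    return hidden

def commit_drop(dirname, hidden):
    os.unlink(os.path.join(dirname, hidden))

def abort_drop(dirname, hidden, name):
    os.rename(os.path.join(dirname, hidden), os.path.join(dirname, name))

d = tempfile.mkdtemp()
open(os.path.join(d, "t1"), "w").close()
h = drop_table(d, "t1")
abort_drop(d, h, "t1")                     # ROLLBACK: the table is back
print(os.listdir(d))
```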

Regardless, since I think there are other, SQL92 standard driven reasons to
break the relname == filename link, I decided to go ahead and see how hard
coding it would be, and how much code might be depending on that behavior.

Looked like it was going to be very simple: the RelationGetRelationName
and RelationGetPhysicalRelationName macros encapsulate access to the
(relation)->rd_rel->relname structure member pretty effectively (thanks
to Bruce's temp. relation work, I presume)

As a first crack, I decided to use the oid for the filename, just because
it simplified the changes to the Macro, and there was already an oidout()
builtin that'd do the palloc for me ;-)

<some time later...>

Well, ... it is, as they say, a Small Matter of Programming. I now know
a lot more about the bootstrap process, and the relcache, I can tell you!

Most problems were code that used RelationGetPhysicalRelationName
when it should have used RelationGetRelationName. In several cases,
the code assumed RelationGetPhysicalRelationName handed them a
pointer to rd_rel->relname, which they copy into! I substituted
RelationGetRelationName for all these cases.

There's some ugliness with SharedSystemRelations, as well. I just hacked
in hard-coded numbers wherever I found hard-coded relation names, for
an initial test of principle.

I've got a version running, and I can type at a standalone backend:
still some problems with initdb: the pg_log, pg_shadow and pg_user
relations don't get created: I cheated and copied the first two from my
'current' install. That got the backend up, either standalone, or as
postmaster. It'll even accept connections from psql: it just errors a lot,
since pg_user isn't there!

However, typing at the backend, I can create tables, insert, delete,
start transactions, rollback, etc. Basically, everything works.

Suffice to say, altering the physical storage name is not too difficult,
if _I_ can get this far in just a few hours. Whatever we decide for
'policy' on the name issue (and now that I've generated it, I can tell
you that a directory full of numbers is _really_ ugly) implementation
should go easily.


Ross
--
Ross J. Reedstrom, Ph.D., <reed...@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

************


Bruce Momjian

Mar 9, 2000, 3:00:00 AM
> Looked like it was going to be very simple: the RelationGetRelationName
> and RelationGetPhysicalRelationName macros encapsulate access to the
> (relation)->rd_rel->relname structure member pretty effectively (thanks
> to Bruce's temp. relation work, I presume)

Yes.

> As a first crack, I decided to use the oid for the filename, just because
> it simplified the changes to the Macro, and there was already an oidout()
> builtin that'd do the palloc for me ;-)
>
> <some time later...>
>
> Well, ... it is, as they say, a Small Matter of Programming. I now know
> a lot more about the bootstrap process, and the relcache, I can tell you!
>
> Most problems were code that used RelationGetPhysicalRelationName
> when it should have used RelationGetRelationName. In several cases,
> the code assumed RelationGetPhysicalRelationName handed them a
> pointer to rd_rel->relname, which they copy into! I substituted
> RelationGetRelationName for all these cases.

Please send in a patch on those if they need to be corrected, OK?

Ross J. Reedstrom

Mar 9, 2000, 3:00:00 AM
On Wed, Mar 08, 2000 at 06:24:35PM -0500, Bruce Momjian wrote:
> > Looked like it was going to be very simple: the RelationGetRelationName
> > and RelationGetPhysicalRelationName macros encapsulate access to the
> > (relation)->rd_rel->relname structure member pretty effectively (thanks
> > to Bruce's temp. relation work, I presume)
>
> Yes.

Well, thank you, then ;-)


>
> >
> > Most problems were code that used RelationGetPhysicalRelationName
> > when it should have used RelationGetRelationName. In several cases,
> > the code assumed RelationGetPhysicalRelationName handed them a
> > pointer to rd_rel->relname, which they copy into! I substituted
> > RelationGetRelationName for all these cases.
>
> Please send in a patch on those if they need to be corrected, OK?
>

Once I'm sure it's the Right Thing To Do, I will. That's probably
the only clean part of the ugly hack I've done so far.

I've got a complete system up, now. For some reason, the bootstrapping
in initdb doesn't create the pg_log (or pg_shadow!) relations, even
though the same step on a clean CVS tree does. Can't quite find why. So,
the non-bootstrap connections in initdb (creating all the system views)
then fail. If I manually copy the pg_log and pg_shadow files over from
'current' to 'hacked', I can then run the code from initdb by hand,
and get a fully functional system.

I went ahead and ifdefed out the rename() in renamerel(). Lo and behold,
I can rollback an ALTER TABLE RENAME, and have a concurrent session
see the right thing. The concurrent session hangs, though, because of
the exclusive lock. Based on comments in that function, I think the
lock is still needed to handle the buffer cache, which is indexed by
relname. Should probably make that indexed by PhysicalName, since they're
disk buffers, after all. Haven't touched DROP TABLE yet, though.

My real goal with all this is to now look at the parser, and see how
hard it will be to do something with schemas. I think that's going to
require another field in the relation structure, to indicate which
schema a relation belongs to, then control access based on what the
current default schema is. All the relname stuff was just so different
schemas can have different tables with the same name. ;-)

Hiroshi Inoue

Mar 9, 2000, 3:00:00 AM
> -----Original Message-----
> From: owner-pgs...@postgreSQL.org
> [mailto:owner-pgs...@postgreSQL.org]On Behalf Of Bruce Momjian
>
> > Looked like it was going to be very simple: the RelationGetRelationName
> > and RelationGetPhysicalRelationName macros encapsulate access to the
> > (relation)->rd_rel->relname structure member pretty effectively (thanks
> > to Bruce's temp. relation work, I presume)
>
> Yes.
>
> > As a first crack, I decided to use the oid for the filename, just because
> > it simplified the changes to the Macro, and there was already an oidout()
> > builtin that'd do the palloc for me ;-)
> >

I object to this proposal.

I have been suspicious about why a mapping algorithm from relations
to the relation file names is needed for existent relations.
This should be changed first.

And plural relation file names are needed for a relation oid/relname.
Why do you prefer a fixed mapping oid/relname --> relation file name?

Peter Eisentraut

Mar 9, 2000, 3:00:00 AM
Tom Lane writes:

> The only thing that's really needed is for everyone to take a deep
> breath and let go of the notion that table files ought to be named
> after the tables. If we named table files after the OIDs of their
> tables, then rollback-able DROP or RENAME TABLE would be pretty
> straightforward. If you don't recall why this is, consult the
> pghackers archives...

I think most of us are with you on that. We just need to beat up Bruce. ;)
Perhaps we could let go of database name-like subdirectories too and just
have one big directory of number-named files. Great ...

--
Peter Eisentraut Sernanders väg 10:115
pet...@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden


Zeugswetter Andreas SB

Mar 9, 2000, 3:00:00 AM

> > > The oid will be appended to the base file name.
>
> > If we do it that way, then RENAME TABLE will be kinda complicated...
> > not impossible, but is it worth it?

There is a simple solution to this problem:

If the table file tabname1_oid.dat is not found, then scan the directory
for the file *_oid.dat and rename that to the now correct name.
This can be done automatically.

The file will only not be found if something crashed during rename table.

Andreas


Zeugswetter Andreas SB

Mar 9, 2000, 3:00:00 AM

> My real goal with all this is to now look at the parser, and see how
> hard it will be to do something with schemas. I think that's going to
> require another field in the relation structure, to indicate which
> schema a relation belongs to, then control access based on what the
> current default schema is. All the relname stuff was just so different
> schemas can have different tables with the same name. ;-)

The schema in ANSI speak is imho the owner of the table in pg speak.
That is also how Informix, Oracle and DB/2 handle that part of the Standard.

imho the standard would require table names in pg as follows:

dbname.ownername.tabname

Andreas


Ross J. Reedstrom

Mar 9, 2000, 3:00:00 AM
On Thu, Mar 09, 2000 at 09:43:35AM +0900, Hiroshi Inoue wrote:
> > -----Original Message-----
> > From: owner-pgs...@postgreSQL.org
> > [mailto:owner-pgs...@postgreSQL.org]On Behalf Of Bruce Momjian
> >
> > > Looked like it was going to be very simple: the RelationGetRelationName
> > > and RelationGetPhysicalRelationName macros encapsulate access to the
> > > (relation)->rd_rel->relname structure member pretty effectively (thanks
> > > to Bruce's temp. relation work, I presume)
> >
> > Yes.
> >
> > > As a first crack, I decided to use the oid for the filename, just because
> > > it simplified the changes to the Macro, and there was already an oidout()
> > > builtin that'd do the palloc for me ;-)
> > >
>
> I object to this proposal.

It's not really a proposal, it's a 'hacking report' of what I found
trying to implement (one version of) the changes we'd all been arguing
(with Bruce, mostly ;-) about.

(As an aside, I'd like to mention that I have come to agree with him
_more_ than I did before trying to implement this. Error messages like:

ELOG FATAL: can't open file: /usr/local/pgsql7.0/data/base/1269
vs.
ELOG FATAL: can't open file: /usr/local/pgsql7.0/data/base/pg_log

Clearly, the numeric version is much harder to read, easier to get wrong
(Was that 1269, or 1296?) and much less meaningful for both the user
and the developer.)


>
> I have been suspicious about why a mapping algorithm from relations
> to the relation file names is needed for existent relations.
> This should be changed first.

Ah, now I understand: this has started to bother me as well. This relates
to your idea earlier to store the storage file name in the reldesc, right?

>
> And plural relation file names are needed for a relation oid/relname.

? Plural? More than one file name per relation? Why?

> Why do you prefer fixed mapping oid/relname --> relation file name ?
>

Well, the oid _is_ the fixed identifier that we already take care of as
part of the transactional machinery. The file _should_ have the same
lifetime as the relation oid: if the relation is dropped, the file goes
away. I think of the oid like a unix filesystem inode: it's the thing
actually tied to the tuple/relation. The name is just a convenience for
us humans. (Or IP number/DNS name, if you want a more up to date analogy.)

It's clear that the DBMS doesn't care what the filename is, as long as it
can find the right blocks to load into the cache. I was wondering about
setting the filename to be relname.oid at relation creation time, and
storing it in the reldesc. Then, just ignoring it for rename (that way
the cached blocks don't have to be invalidated), and checking at vacuum
time whether the filename still matches the storage name. Hmm, that strategy
could be applied to the _current_ names, as well: Don't rename the file
in the transaction, and let vacuum clean it up.
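The vacuum-time variant might look like this sketch (the `relname.oid` file naming is the scheme proposed above; everything else is invented):

```python
import os
import tempfile

def vacuum_fixup(dirname, relname, oid):
    """At VACUUM time, if the on-disk name no longer matches the
    (possibly renamed) relation, quietly rename the file to match."""
    want = "%s.%d" % (relname, oid)
    for f in os.listdir(dirname):
        if f.endswith(".%d" % oid) and f != want:
            os.rename(os.path.join(dirname, f),
                      os.path.join(dirname, want))
    return want

d = tempfile.mkdtemp()
open(os.path.join(d, "old.77"), "w").close()  # created as "old", renamed since
print(vacuum_fixup(d, "new", 77))
```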

Magnus Hagander

Mar 9, 2000, 3:00:00 AM

No. :-(

Windows 2000 introduces hard links, but not before that.

//Magnus


Bruce Momjian

Mar 9, 2000, 3:00:00 AM

This sounds like a plan.

Bruce Momjian

Mar 9, 2000, 3:00:00 AM
> It's not really a proposal, it's a 'hacking report' of what I found
> trying to implement (one version of) the changes we'd all been arguing
> (with Bruce, mostly ;-) about.

I am arguing for all the unsung administrators out there... :-)

> Well, the oid _is_ the fixed identifer that we already take care of as
> part of the transactional machinery. The file _should_ have the same
> lifetime as the relation oid: if the relation is dropped, the file goes
> away. I think of the oid like a unix filesystem inode: it's the thing
> actually tied to the tuple/relation. The name is just a convenience for
> us humans. (Or IP number/DNS name, if you want a more up to date analogy.)
>
> It's clear that the DBMS doesn't care what the filename is, as long as it
> can find the right blocks to load into the cache. I was wondering about
> setting the filename to be relname.oid at relation creation time, and
> storing it in the reldesc. Then, just ignoring it for rename (that way
> the cached blocks don't have to be invalidated), and checking at vacuum
> if the filename still matches the storage name. Hmm, that strategy
> could be applied to the _current_ names, as well: Don't rename the file
> in the transaction, and let vacuum clean it up.

I don't have a problem if ALTER TABLE RENAME keeps the same filename,
and doesn't rename the file. If that is the only issue, I can live with
it.

One idea would be to create a symlink at transaction time, and have
vacuum do the actual renaming and removal of the symlink.
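A sketch of that symlink idea (names invented): RENAME leaves the real file alone and just plants a symlink; VACUUM later performs the real rename and removes the link.

```python
import os
import tempfile

def rename_table(dirname, old, new):
    """RENAME TABLE: point a symlink at the untouched storage file."""
    os.symlink(old, os.path.join(dirname, new))

def vacuum(dirname, old, new):
    """VACUUM finishes the job: drop the symlink, rename for real."""
    os.unlink(os.path.join(dirname, new))
    os.rename(os.path.join(dirname, old), os.path.join(dirname, new))

d = tempfile.mkdtemp()
open(os.path.join(d, "t1"), "w").close()
rename_table(d, "t1", "t2")                # both names now reach the data
vacuum(d, "t1", "t2")
print(sorted(os.listdir(d)))
```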

Peter Eisentraut

Mar 9, 2000, 3:00:00 AM
Philip Warner writes:

> Dec/Rdb for one. They do, however, make their lives easier by 'locking the
> metadata' when a user does a select. This means the 'drop table' would hang
> until the first user commits. I think it even hangs until the first user
> exits - basically if they have referenced the table, you can't touch it
> until they exit. But they do allow rollback on all DDL statements.

This is BAD. We have something like this for databases: you can't drop
them if someone is connected to it. This is surely an improvement over the
previous state but it's like saying "You can't delete this file, someone
is using it. Why don't you give him a call and ask him to stop."

Peter Eisentraut

Mar 9, 2000, 3:00:00 AM
Tom Lane writes:

> Philip Warner <p...@rhyme.com.au> writes:
> > For the ignorant, are you able to explain why naming files
> > '<table_name>_<OID>' is not acceptable? This seems to satisfy both
> > requirements (and seemed to be the conclusion of the previous discussion).
>
> Well, it's pretty simple: consider what has to happen to make RENAME
> TABLE be rollback-able.

How about this: it has to happen to make RENAME TABLE work right in the
first place. If you have a multi-GB table, the renames (whichever
way) cannot be done atomically. If the filesystem fails after the third of
seven segments, what do you do then? Start saying "oops" and change the
names of the first three back? You'll never be able to finish this.

Peter Eisentraut

Mar 9, 2000, 3:00:00 AM
Bruce Momjian writes:

> I will fight this to my death. :-)
>
> I have cursed Ingres every time I needed to look at the Ingres data
> directory to find out which tables match which files. Even a lookup
> file is a pain. Right now, I can do ls -l to see which tables are
> taking disk space.

Okay, your premise is that file names need to be readable because
administrators might want to muck around with the files. Well,

1) Backing up and restoring through copying the actual relation files is a
recipe for death and destruction. This might have worked rather well in
the past but with WAL and MVCC and more fancy features coming along this
has to stop anyway.

2) Checking the disk space required by tables can just as easily be done
via a backend function or better yet integrated into some tablespace
scheme.

Regarding your argument that the backend needs to be running, that's true
but I don't buy it because after all your kernel needs to be running as
well to associate file names to inodes. The notion that you should have
access to the data stored by some application without using that
application is inherently bogus.

It wouldn't be all that hard to create marginally up to date
readable-named symlinks to the oid files created by the backend upon
request or by some external utility.

It is quite evident to me that the advantages of oid named files would be
tremendous. You are asking us to sacrifice safe and elegant
implementations because you want to have convenient access to what amounts
to internal storage details. Why stop there, why not make one subdirectory
per table with files named after columns so I can check that data without
a running backend? If there is information you need access to (storage
size) then an interface to them should be implemented using official
channels.

Ross J. Reedstrom

Mar 9, 2000, 3:00:00 AM
On Thu, Mar 09, 2000 at 12:44:19PM -0500, Bruce Momjian wrote:
> > It's not really a proposal, it's a 'hacking report' of what I found
> > trying to implement (one version of) the changes we'd all been arguing
> > (with Bruce, mostly ;-) about.
>
> I am arguing for all the unsung administrators out there... :-)

Hey, no fair clipping the part where I agree with you, just to make more
martyr points ;-)

>
>
> One idea would be to create a symlink at transaction time, and have
> vacuum do the actual renaming and removal of the symlink.
>

But then we're still storing information about the relation -> filename
mapping in the filesystem, rather than in the database. BTW, there seems
to be _no_ checking if filenames generated are acceptable to the underlying
OS filesystem. We may have a problem on NTFS, which is case preserving,
but not case sensitive: i.e., two relations, named "test" and "Test"
would probably collide on NT: trying to retrieve one will get the other.

Aha, it _does_ in fact collide: here is a log from a 6.5.3 install,
running under NT 4.0 (under VMWare on linux):

test=> create table testit (a int, b text);
CREATE
test=> insert into testit values (1,'some text');
INSERT 27914 1
test=> select * from testit;
a|b
-+---------
1|some text
(1 row)

test=> create table "TestIt" (a int, b text);
ERROR: cannot create TestIt
test=>

As you would imagine, the above test works fine under Linux (or any
other Unix). Well, I think that puts the nail in the coffin: we _have_
to do something, since this is no longer SQL92 compliant at all, on that
platform. Admittedly, having different tables that differ only by case
isn't the greatest idea (BTW, what's the deal with functions that differ
only by the case of the first letter!?!) but it is required behavior.

Say, does that mean this is a _bug_ and a fix might qualify for inclusion
in 7.0 ? ;->

Bruce Momjian

unread,
Mar 9, 2000, 3:00:00 AM3/9/00
to
> But then we're still storing information about the relation -> filename
> mapping in the filesystem, rather than in the database. BTW, there seems
> to be _no_ checking of whether the generated filenames are acceptable to
> the underlying OS filesystem. We may have a problem on NTFS, which is
> case preserving, but not case sensitive: i.e., two relations named "test"
> and "Test" would probably collide on NT: trying to retrieve one would get
> the other.
>
> As you would imagine, the above test works fine under Linux (or any
> other Unix). Well, I think that puts the nail in the coffin: we _have_
> to do something, since this is no longer SQL92 compliant at all, on that
> platform. Admittedly, having different tables that differ only by case
> isn't the greatest idea (BTW, what's the deal with functions that differ
> only by the case of the first letter!?!) but it is required behavior.
>
> Say, does that mean this is a _bug_ and a fix might qualify for inclusion
> in 7.0 ? ;->

Good try.

Oliver Mueschke

unread,
Mar 9, 2000, 3:00:00 AM3/9/00
to
On Thu, Mar 09, 2000 at 04:57:35PM +0100, Magnus Hagander wrote:
> > Does WinNT have something similar to a hard link?
>
> No. :-(
>
> Windows 2000 introduces hard links, but not before that.
>

ntfs has always had hard links. see

<http://support.microsoft.com/support/kb/articles/q100/1/08.asp?LNG=ENG&SA=ALLKB&FR=0>

for an overview of fat, hpfs and ntfs.

oliver

Bruce Momjian

unread,
Mar 9, 2000, 3:00:00 AM3/9/00
to
[Charset ISO-8859-1 unsupported, filtering to ASCII...]

> Tom Lane writes:
>
> > Philip Warner <p...@rhyme.com.au> writes:
> > > For the ignorant, are you able to explain why naming files
> > > '<table_name>_<IOD>' is not acceptable? This seems to satisfy both
> > > requirements (and seemed to be the conclusion of the previous discussion).
> >
> > Well, it's pretty simple: consider what has to happen to make RENAME
> > TABLE be rollback-able.
>
> How about this: It has to happen to make RENAME TABLE work right in the
> first place. If you have a multi-GB table, the renames (whichever way)
> cannot be done atomically. If the filesystem fails after the third of
> seven segments, what do you do then? Start saying "oops" and change the
> names of the first three back? You'll never be able to finish this.

You would need to do the renames, but have some vacuum process that
makes sure all oid file names match the system catalog for that oid.
Seems there is no way around that type of cleanup, assuming we change
the table name as part of the rename.

Pretty ugly, but not as ugly as having a directory of numbers.

Bruce Momjian

unread,
Mar 10, 2000, 3:00:00 AM3/10/00
to
> > 2) Checking the disk space required by tables can just as easily be done
> > via a backend function or better yet integrated into some tablespace
> > scheme.
> >
> > Regarding your argument that the backend needs to be running, that's true
> > but I don't buy it because after all your kernel needs to be
> > running as well to associate file names to inodes. The notion that you
> > should have access to the data stored by some application without using
> > that application is inherently bogus.
>
> Exactly! More than that: after WAL I would like to re-implement the
> storage manager; having a separate file per table is bad, bad, bad!

Yes, if we have to go that way, then we will lose file names. I just
don't want to lose them because RENAME is a problem.

Mikheev, Vadim

unread,
Mar 10, 2000, 3:00:00 AM3/10/00
to
Hi, All -:)

> Okay, your premise is that file names need to be readable because
> administrators might want to muck around with the files. Well,
>
> 1) Backing up and restoring through copying the actual relation files is a
> recipe for death and destruction. This might have worked
> rather well in the past but with WAL and MVCC and more fancy features
> coming along this has to stop anyway.
>

> 2) Checking the disk space required by tables can just as easily be done
> via a backend function or better yet integrated into some tablespace
> scheme.
>
> Regarding your argument that the backend needs to be running, that's true
> but I don't buy it because after all your kernel needs to be
> running as well to associate file names to inodes. The notion that you
> should have access to the data stored by some application without using
> that application is inherently bogus.

Exactly! More than that: after WAL I would like to re-implement the
storage manager; having a separate file per table is bad, bad, bad!

Vadim


Philip Warner

unread,
Mar 10, 2000, 3:00:00 AM3/10/00
to
At 20:08 9/03/00 +0100, Peter Eisentraut wrote:
>Philip Warner writes:
>
>> Dec/Rdb for one. They do, however, make their lives easier by 'locking the
>> metadata' when a user does a select. This means the 'drop table' would hang
>> until the first user commits. I think it even hangs until the first user
>> exits - basically if they have referenced that table, you can't touch it
>> until they exit. But they do allow rollback on all DDL statements.
>
>This is BAD. We have something like this for databases: you can't drop
>them if someone is connected to it. This is surely an improvement over the
>previous state but it's like saying "You can't delete this file, someone
>is using it. Why don't you give him a call and ask him to stop."

Because I'm a paranoid sort of person, I actually don't see any problem
with "You can't delete this file, someone is using it". Especially if it
says "...because user abc is using it".

Besides, if you really want to get brutal and drop the table despite the
fact someone is using it, then you can close the database (yes there is a
function to make the database unavailable). You can even just disconnect
user 'abc'.

I agree it's not as nice as it could be, but if it's the price to pay for
guaranteed integrity, and rollback with DDL, I'd be happy.


----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: +61-03-5367 7422 | _________ \
Fax: +61-03-5367 7430 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/


Magnus Hagander

unread,
Mar 10, 2000, 3:00:00 AM3/10/00
to

Ok. NTFS has had it, but there has been no API for it. At least not from
Win32 (see API documentation for CreateHardLink() -
http://msdn.microsoft.com/library/psdk/winbase/fsys_21yj.htm).

The only way in NT4 to do it through Win32 is using the BackupWrite()
function, and that's not good :-)
(http://msdn.microsoft.com/library/psdk/winbase/tapeback_3bl1.htm). A good
example of why this is not good is on
http://support.microsoft.com/support/kb/articles/Q234/7/27.ASP - it's way
more code to write.

You can also do it using the Posix libraries in the SDK, but then it's no
longer a Win32 application. (Somewhat like Cygwin32 is not Win32 either,
and I don't think Cygwin32 uses the Posix subsystem, because it's supposed
to work on non-NT Windows too, right?)

//Magnus


Oliver Mueschke

unread,
Mar 10, 2000, 3:00:00 AM3/10/00
to
On Fri, Mar 10, 2000 at 09:50:57AM +0100, Magnus Hagander wrote:
> Ok. NTFS has had it, but there has been no API for it. At least not from
...

> You can also do it using the Posix libraries in the SDK, but then it's no
> longer a Win32 application. (Somewhat like Cygwin32 is not Win32 either. And
> I don't think Cygwin32 runs on Posix, because it's supposed to work on
> non-NT, right?)

i thought postgres uses cygwin anyway. and cygwin definitely supports hard
links on ntfs. so, your initial statement that only win2k introduces the
concept of hard links was not correct.

oliver


Peter Eisentraut

unread,
Mar 10, 2000, 3:00:00 AM3/10/00
to
Ross J. Reedstrom writes:

> (As an aside, I'd like to mention that I have come to agree with him
> _more_ than I did before trying to implement this. Error messages like:
>
> ELOG FATAL: can't open file: /usr/local/pgsql7.0/data/base/1269
> vs.
> ELOG FATAL: can't open file: /usr/local/pgsql7.0/data/base/pg_log
>
> Clearly, the numeric version is much harder to read, easier to get wrong
> (Was that 1269, or 1296?) and much less meaningful for both the user
> and the developer.)

How about: "The table doesn't exist on disk." The user will be hosed
anyway. Of course this error message will appear much less often when this
goes through.

Bruce Momjian

unread,
Mar 11, 2000, 3:00:00 AM3/11/00
to
[Charset ISO-8859-1 unsupported, filtering to ASCII...]
> Ross J. Reedstrom writes:
>
> > (As an aside, I'd like to mention that I have come to agree with him
> > _more_ than I did before try to implement this. Error messages like:
> >
> > ELOG FATAL: can't open file: /usr/local/pgsql7.0/data/base/1269
> > vs.
> > ELOG FATAL: can't open file: /usr/local/pgsql7.0/data/base/pg_log
> >
> > Clearly, the numeric version is much harder to read, easier to get wrong
> > (Was that 1269, or 1296?) and much less meaningful for both the user
> > and the developer.)
>
> How about: "The table doesn't exist on disk." The user will be hosed
> anyway. Of course this error message will appear much less often when this
> goes through.

We already have the problem of keeping the file system in sync with the
database in create table and drop table. Rename table is just another
case of this. Using oid names doesn't fix create or drop table cases,
so it seems we need a general solution for all these cases.

Create table and drop table don't contain meaningful data, while rename
table does.
