[PostgreSQL] - SCHEMA.SQL - CREATE SEQUENCE Improvements

76 views
Skip to first unread message

cga

unread,
Jun 22, 2011, 3:13:24 PM6/22/11
to habar...@googlegroups.com
Hi all,

<PREMISE>
you might forgive me if I don't use Trac, but I'd like to explain and discuss
what I'm doing with PGSQL schema.sql for Habari before submitting (if my
contrib will be accepted code wise) anything.

Not to mention that in these very days things are getting moved on Github. If
so, I would clone and request for pull once there, since I have an active
account on Github and it seems pointless to get a Trac account now.

BUT: since this very patch isn't going (at least theoretically) to have any
bad effect on Habari , I'm attaching it to this very email for now. For you to
test. Forgive me on this last thing but I haven't set up a test env just yet.
I'll do it after sending this patch.

<little INTRO and WHY>
I'm cga. I'm a Linux SysAdmin and I'm all for "best practices" as much as I
can. Not to mention I'm really paranoid and analytical when it concerns
coding/installing/configuring/deploying. Especially for things I care/use/
administer.

I basically started this little contribution because I really love my stack:
debian + postgresql + nginx and I'd like to use it for all my needs. Habari
included. Ref: http://drunkenmonkey.org/irc/habari/2011-06-19#T17-04-24
I think it'sfair enough for a small intro.

<CONTRIB AND REASONING>
As you can read in the chat log I linked above , I'm not expert in neither
PostgreSQL nor SQL but I like to contribute for as much as I can. Especially
if this can be the only way to keep Habari's support for PostgreSQL alive.

This first little but important contribution it is all about the "CREATE
SEQUENCE" used in PGSQL actual schema.sql.

All the reasoning I do it's to be documented w/ links to official docs
and (attacched) chat logs from freenode.#postgresql , where (beside docs and
wiki) I learn and ask my stuff. And to my great pleasure I always find very
competent and very helpful people.

Please forgive my English errors, I'm Italian. Please *really* forgive
my verbosity but I'm sure you'd understand that people who didn't follow in
freenode.#habari don't know all the reasoning from the room. If I'm to be
accepted I promise I'll be as concise as possible in next communications.

Here we go:

I noticed that you use a lot of serialization in schema.sql , w/o questioning
why (which I'm not in a position to, anyway) I noticed that you (being whoever
did that schema.sql) do it wrong. Why?

Because: CREATE SEQUENCE done in the present form from schema.sql:

>CREATE SEQUENCE {$prefix}posts_pkey_seq;
>CREATE TABLE {$prefix}posts (
> id BIGINT NOT NULL DEFAULT nextval('{$prefix}posts_pkey_seq'),
>.....
>.....
> PRIMARY KEY (id),
>.....
>);
>

1 - It's *almost* correct, but no cigar. It basically miss the most crucial
part of the query to make the SEQUENCE itself have any sense *and* work
properly:

> ALTER SEQUENCE {$prefix}posts_pkey_seq OWNED BY{$prefix}posts.id;

w/o this ALTER TABLE the SEQUENCE is to behave weird more sooner than later,
resulting in duplicate numbers (see chat log; using BIGINT won't save you from
this, more on this later) and errors that are more than diffcult to find when
they happen.

2 - it determines in a recursive way the name of the SEUQNECE itself , which
is wrong because you cannot be 100% sure that will actually work as intended.
(see chat log).

3 - it uses BIGINT. My reasoning is that whoever created this schema.sql
didn't wan't to finish the SEQUENCE too soon. Good idea , bad implementation. I
strongly believe that BIGINT is not needed indeed, and that it's just a waste
of cylces and space. Not to mention that BIGINT behaves differently on 32bit
and 64bit, causing slowness (and perhaps some more side effects) on 32bit.
(this is chat logs from the office, if you need them I'll attach them tomorrow).
And as explained just above, it is not going to be used as a BIGINT anyway
because of the wrong query CREATE SEQUENCE. But the main point why it is not
needed it's because who's to write a post every second for # quintillions
years? I think that a post every second for 68yrs is more than enough
(INTEGER/SERIAL). (see chat log too).

4 - when all the above are to be combined w/ the use of NOT NULL DEFAULT
nextval()" (and I mean all of them are compulsory clauses, see chat log), you
can save a lot of hassles, code and cycles with just the improvement I'm
submitting for every CREATE SEQUENCE query, leaving intact the final result for
what it concerns Habari's functioning *and* improves things a whole lotta
love:

>CREATE TABLE {$prefix}posts (
> id SERIAL,
>.....
>.....
> PRIMARY KEY (id),
>.....
>);

<IMPROVEMENTS ACHIEVED>
Improvements and optimizations achieved with this little patch: (see Ref. [2])

0 - it will leave intact the need of Habari to use the numerical field for
internal functioning.
1 - (id) will be unique and auto incremental by properties inherited from
query.
2 - You won't skip a number in a SEQUENCE.
3 - It won't fail because of wrong creation query.
4 - It won't determine an unexpected SEQUENCE name/relation to the column you
want it to.
5 - OWNED BY is to be used implicitly , this will drop anything related to the
elected talbe and/or column. This sounds harsh, but what will you do w/ data
that isn't associated w/ anything anymore?

<FINAL THOUGHTS>
1 - For the impact on Habari's code (which I didn't read because I don't do
php and couldn't afford to get it wrong) on insert the next value of the
sequence into the serial column please read Ref. [2] just below the PostgreSQL
7.3 Note.

2 - This new schema.sql would work (in theory) for new installations but I
don't know how to implement and test a system/schema/pgsql/upgrades/post/*.sql
yet. Help and advice from you is very much apprecciated.

3 - Let me know if anyone is going to test this before I do. If you want.
Anyway I'm going to test it when my test env will be up and running.

4 - I'm already planning the next batch of fixes but that can wait for now. I
anticipate that it will be about optimizing queries and about keyvil issues as
much as possible and w/o breaking Habari Ref. [3].

My pleasure to meet you.

<REFERENCES>
Ref. chat log attacched.

Ref [1]. http://www.postgresql.org/docs/8.4/interactive/sql-
createsequence.html

Ref [2]. http://www.postgresql.org/docs/8.4/interactive/datatype-
numeric.html#DATATYPE-SERIAL

Ref. [3] http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327
http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-ii-7345
http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-iii-7365

--
knowledge has no owner, only means to reveal itself

CREATE_SEQUENCE_#postgresql.log
postgresql.1stbatch.optimize.sequences.sql
pgsql.1stbatch.sql_schema.sql.diff
signature.asc

rick c

unread,
Jun 22, 2011, 4:05:30 PM6/22/11
to habari-dev
cga,

From what I see reading the docs you referenced, this it seems to a
reasonable change. It decreases the verbosity of the sql, while at the
same time ensuring the sequence generated is private to the table,
which, if I understand correctly, isn't the case with Habari's current
sql.

How would this change affect people upgrading their databases? Would
the tables need to be totally rebuilt?

Rick
> Ref [2].http://www.postgresql.org/docs/8.4/interactive/datatype-
> numeric.html#DATATYPE-SERIAL
>
> Ref. [3]http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-ii-7345http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-iii-7365
>
> --
> knowledge has no owner, only means to reveal itself
>
>  CREATE_SEQUENCE_#postgresql.log
> 34KViewDownload
>
>  postgresql.1stbatch.optimize.sequences.sql
> 6KViewDownload
>
>  pgsql.1stbatch.sql_schema.sql.diff
> 3KViewDownload
>
>  signature.asc
> < 1KViewDownload

Chris Meller

unread,
Jun 22, 2011, 6:12:43 PM6/22/11
to habar...@googlegroups.com
I've skimmed over the chat logs you referenced, but I don't see the vast majority of these points discussed... Primarily all I see is a discussion about using ID alone as the primary key vs. other fields. So I'm going to go through (below) and ask several questions to clarify.



cga
June 22, 2011 3:13 PM

The OWNED BY argument was only added in 8.2. The last time I tested the PG support (admittedly a long time ago) Debian only had 8.1 in its repos, so that was a major deal. I'm not sure what is commonly available now (though I suspect 8.3 is widely used), but breaking support for 8.1 users is something to be considered, particularly since I see no real advantage to adding the statement in our use case.

Per the Postgres documentation on CREATE SEQUENCE, OWNED BY merely associates the sequence with the table (and field) specified so that if the table (or field) is dropped the sequence will be automatically dropped as well. This sounds like a good idea at first, but I can see instances in which this might cause problems.

In the past there have been times that we've needed to modify a table, but one RDBMS or another won't support the right ALTER syntax, so we've created a new field, migrated data, dropped the original field, and then reversed the process. If we used OWNED BY on the sequence as you suggest the sequence would be automatically dropped and likely cause a ton of problems. Not only would we have to remember to re-create the sequence (manually and specifically for Postgres), but we would have to guess at what the starting value should be, otherwise we would generate duplicate IDs - a Very Bad Thing (tm).

So the only advantage we'd get is that if the table were to be dropped the sequence would be automatically deleted with it. That's nice for a user who is no longer using Habari, but hardly a major deal - it's an extra query per table to drop its relevant sequence or a few extra clicks in your GUI of choice. I can't see any instance in which we would be dropping the Posts table (or most of the others) programmatically, so it really provides no extra value for developers.


w/o this ALTER TABLE the SEQUENCE is to behave weird more sooner than later, 
resulting in duplicate numbers (see chat log; using BIGINT won't save you from 
this, more on this later) and errors that are more than diffcult to find when 
they happen. 
I don't understand this. Can you explain how it's ever going to behave in an unexpected way? How would a sequence ever result in duplicate numbers, unless it has been manually ALTERed with the RESTART argument? Even if we used OWNED BY, you could still screw the process up by manually changing the sequence, so that wouldn't solve the problem either.

The use of the sequence, whether OWNED BY or not, should have nothing to do with whether the destination field is an INT or a BIGINT.

2 - it determines in a recursive way the name of the SEUQNECE itself , which 
is wrong because you cannot be 100% sure that will actually work as intended. 
(see chat log).

I'm not sure I understand your point here either. The sequence name is dynamically created during install based on the DB prefix you specify, yes. Again, how is that going to cause a problem and (aside from using a SERIAL, as mentioned below) what other alternative is there?


3 - it uses BIGINT. My reasoning is that whoever created this schema.sql 
didn't wan't to finish the SEQUENCE too soon. Good idea , bad implementation. I 
strongly believe that BIGINT is not needed indeed, and that it's just a waste 
of cylces and space. Not to mention that BIGINT behaves differently on 32bit 
and 64bit, causing slowness (and perhaps some more side effects) on 32bit. 
(this is chat logs from the office, if you need them I'll attach them tomorrow). 
And as explained just above, it is not going to be used as a BIGINT anyway 
because of the wrong query CREATE SEQUENCE. But the main point why it is not 
needed it's because who's to write a post every second for # quintillions 
years? I think that a post every second for 68yrs is more than enough 
(INTEGER/SERIAL). (see chat log too).
I cannot speak towards any performance differences between 32bit and 64bit versions of Postgres, though I'm admittedly skeptical. We use BIGINT everywhere we can for ID columns (at least in MySQL and SQLite, looks like Postgres might have been missed) for one simple reason: we can. The storage difference between INT and BIGINT varies between platform but is so negligible we've never worried about it.

On the other hand there are instances in which it may be of value to have the added capacity. A few off the top of my head:

1. Logs. Particularly if a plugin starts generating errors, you could conceivably start generating a TON of IDs. Over the course of many years running Habari it's not entirely outside the realm of possibility that you could exhaust an INT.

2. Posts that aren't posts. While most of us use Habari only for blogging, there are several instances where there have been a lot of other things built on top of it. Think product catalogs / online stores, forums, etc. It may seem crazy to create a new post every second for 68 years, but it seems a lot less crazy if you stop thinking about each entry as a time-consuming piece of written content. When you also include multiple types of content (like revisions, tweets, links, etc.) and consider that there might be a lot of turnover (or "churn") there's all the more reason to go ahead and use a BIGINT.

If an INT column were to be maxed out it's a simple matter for a DBA to go in and change it (after figuring out what the problem actually was, of course), but recreating the indexes could be a very time consuming process as well. I believe this is actually one of the primary reasons WordPress decided to switch to BIGINT fields many years ago.

So basically there are several reasons to use BIGINTs, even if they are long-shots, and no real reason not to (aside from your suggestion that there's a performance difference, which we've never heard of before). Think of it as a rather hopeful best practice.


4 - when all the above are to be combined w/ the use of NOT NULL DEFAULT 
nextval()" (and I mean all of them are compulsory clauses, see chat log), you 
can save a lot of hassles, code and cycles with just the improvement I'm 
submitting for every CREATE SEQUENCE query, leaving intact the final result for 
what it concerns Habari's functioning *and* improves things a whole lotta 
love:

CREATE TABLE {$prefix}posts (
 id SERIAL,
.....
.....
 PRIMARY KEY (id),
.....
);

Ignoring everything in the last 3 points related to sequences, I'm really not sure why we never used SERIAL fields. From an implementation standpoint they are identical to our approach (the Postgres documentation even shows you the equivalent syntax using a regular sequence), but they would have been more analogous to the AUTO INCREMENT fields we use in MySQL and SQLite.

On the flip side, since they're the same there's likewise no real compelling reason to use SERIAL over a sequence. We had previously (and hopefully still do) wanted to also support Oracle, which is very similar to Postgres. Oracle lacks a SERIAL data type, so it would require the sequence method to be used anyway. Continuing to use sequences in Postgres would mean less variation between the two and make it a good bit easier to manage them.

cga

unread,
Jun 22, 2011, 6:17:47 PM6/22/11
to habar...@googlegroups.com
Hi Rick,

> How would this change affect people upgrading their databases? Would
> the tables need to be totally rebuilt?

ANSWER:

Right now I already have a compromise that will keep existing tables intact.
Alas, using old style by adding the missing part to SQL w/ upgrade/2265.sql:

> ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;

to all sequences in the upgrade. This won't affect the data, only add the
missing part to the original schema.

IDEAS AND TESTS:

From here on are things I want to try:

While for new db it's not to going be a problem and would make a good
schema.sql , I really want to understand how to migrate to the proposed schema
also for those who have an existing db. I already have the raw idea and I'm
already planning some steps and tests to "gradually" move to "new style" and
keep the old db/tables intact.

Raw Idea: (implies that we used the above mentioned compromise already).

When we issue ALTER SEQUENCE, it won't affect the sequence and will allow us to
use "pg_get_serial_sequence()" to find the name of the sequence associated with
a given column.

Then with the use of currval() function, which returns the most recent value
generated by a sequence for the current session, combined with
"pg_get_serial_sequence()" that we are now able to use:

> SELECT currval(pg_get_serial_sequence('tablename', 'colname'));

we retrieve the most recent value generated by that specific sequence.

This value (for each sequence) could be stored in a variable_seq# or tmp table
(either for each sequence) and set as "START WITH start" for each table that
is to be recreated at a certain given point in the SEQUENCE.

My educated guess it's that this (and perhaps some more code in upgrade and
perhaps php?) will allow us to extract the exact point of a given sequence at
the moment of upgrade and migrate the existing to new schema too.

Basically the step could be:

0 - backup. backup. backup. and do not put anything more into db.

1 - ALTER SEQUENCE .. OWNED BY ... ;

2 - SELECT currval(pg_get_serial_sequence('tablename', 'colname')); and store
that in a variable or tmp_table (either one of those, but one for each
sequence);

3 - dump the db, drop the db.

4 - apply new schema w/ use of SERIAL ;

5 - import the previusly dump for old db w/ use of tmp_table/var and the
"START WITH var#" parameter.

6 - Check the resutls, prolly Frankenstein..... by Mel Brooks.

But this will require a test env, which I'm going to install later or at max
tomorrow.

cga

unread,
Jun 22, 2011, 8:24:03 PM6/22/11
to habar...@googlegroups.com
Hi Chris,

I appreciate your comments, but keep in mind that I'm not a DBA nor a SQLNinja
nor a PostgreSQL, expert. I'm learning and trying to help. That's all, and
that was specified too. Any contribution that can make me grow is more than
welcome (of course while I do something hoepfully useful for Habari too).

The chat logs are very informative about the things I mentioned. I attached
them on puropose so people could read what they exactly told me when asking
help and seeking info/trying to understand what I was going to do. I might
have misunderstood a point, even if I asked until I got it, so their written
and logged words are to be a better source if you please.

One thing is for sure for me and my personal knwoledge: they are certainly
expert on the matter (more than me) and I trust their word as much as I trust
yours. (the PKEY issue was briefly chatted in perhaps 4 or 5 lines, but you got
it right, it is something that I would implement if i was to plan a schema
from scratch and still would like to if it doesn't break a thing in Habari)

Now,

<BIGINT>
I could agree on the use of BIGINT if BIGINT keeps you safer for the reasons
you mentioned, fair enough. After "deciding" to use only INTEGER (2 billions
of post for a blog? eh!) I made a joke:

>[Wednesday 22 June 2011] [00:22:55] * cga 30yrs later: damn.... this is
ipv4 vs ipv6 all over again..... i should have used BIGSERIAL!!! :P

That reflects my first intention to stick w/ BIGINT/BIGSERIAL.

But still: is this a ecommerce platform or a blog engine? Do you post w/
scripts ever second? Wouldn't be better to use twitter widgets instead of
spamming the db? Shouldn't plugins log to filesystem?

Anyway if BIGINT makes you folks feel safer, I agree to use it.

<ALTER SEQUENCE ... OWNED BY ... ;>
Allow me to say that if you used ALTER SEQUENCE ... OWNED BY ... since the
beginning you could extract the *exact* sequence point, not guess it, during
migration prior to drop, and used it for anything related to migration. Hence

>we would generate duplicate IDs - a Very Bad Thing (tm).

wouldn't to be the case. At all. That's what I meant with:

> 1 - It's *almost* correct, but no cigar. It basically miss the most crucial
> part of the query to make the SEQUENCE itself have any sense *and* work
> properly

Anyway this is something I want to test, like I replied to Rick in the other
email. Tomorrow I will test.

RhodiumToad already told me that they will not look the same across "old
style" and "new style", but still I want to try. At least to check that the
use "old style" w/ ALTER SEQUENCE ... OWNED BY ... ; could be good enough and
prove you wrong on the migrate issue. (speaking of old style vs old style w/
ALTER OWNED).

As for legacy support (pgsql 8.1 in debian >etch?) I can see why you did it,
but as you guessed correctly that the most wide used should be 8.3 and 8.4
(which is my case on lenny w/ bpo repo) and marginally 9.x on cutting edge
distro like Arch (my distro of choice for desktop use).

So applying ALTER SEQUENCE ... OWNED BY .... ; it is not as much of a problem
unless you use EOL version of PGSQL . If they introduced such a feature it
might have been for a good reason. The above mentioned could be it, for one.

http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy

Unless of course you want to support all of the EOL products, good luck.

You also aid that:

> So the only advantage we'd get is that if the table were to be dropped
> the sequence would be automatically deleted with it.

Not only this....

> I can't see any instance in which we would be dropping the Posts table (or
most of the others) programmatically, so it really provides no extra value for
developers.

....in fact it allows you a key feature that is the main reason why you could
have migrated w/o guessing. The use of:

SELECT currval(pg_get_serial_sequence('tablename', 'colname'));

pg_get_serial_sequence() is available only if you either use SERIAL or the
"old method" correctly (w/ ALTER OWNED).

>I don't understand this. Can you explain how it's ever going to behave
> in an unexpected way? How would a sequence ever result in duplicate
> numbers, unless it has been manually ALTERed with the RESTART argument?

>.....


> The use of the sequence, whether OWNED BY or not, should have nothing to
> do with whether the destination field is an INT or a BIGINT.

My bad, I misuderstood the problem of misbehaviour, I mistook numbers with
names... so I'll let you read what RhodiumToad told me about it, the very first
minutes after I asked for help.

From:
> [Tuesday 21 June 2011] [22:00:10] <RhodiumToad> you should avoid assuming
you know the sequence name, of course - use insert ... returning or
pg_get_serial_sequence

To at least:
>[Tuesday 21 June 2011] [22:11:50] <RhodiumToad> we're talking about the NAME
of the sequence, not the value
>[Tuesday 21 June 2011] [22:11:59] <cga> thanks a lot RhodiumToad

BUT *IF* Habari also relies on sequences' names too, you are to be
experiencing issues and headaches. == RPITA.

<SERIAL>


>I'm not sure I understand your point here either. The sequence name is
>dynamically created during install based on the DB prefix you specify,
>yes.

Consider this (from the very first lines of logs):

> [Tuesday 21 June 2011] [21:59:28] <RhodiumToad> those are equivalent, yes
(the first one is of course impossible due to the circular reference)

> [Tuesday 21 June 2011] [22:00:10] <RhodiumToad> you should avoid assuming
you know the sequence name, of course - use insert ... returning or
pg_get_serial_sequence

He's talking about 1st == your schema.sql w/ use of ALTER SEQUENCE ; 2nd my
patches.

> Again, how is that going to cause a problem and (aside from using a
>SERIAL, as mentioned below) what other alternative is there?

Using SERIAL you lets pgsql determine things sequence related, while
preserving your ${prefix}foo thing intacted for the table, and have the
advantages mentioned in docs for DATATYPE-SERIAL and my first email.

Anyway,

If you don't want to implement SERIAL it's another matter on its own. Yet the
pros as more than the cons you've mentioned (most of them are obsolete anyway
or non issues to my way of seeing this if we don't consider existing db for a
day or two until i test).

> On the flip side, since they're the same there's likewise no real
>compelling reason to use SERIAL over a sequence. We had previously (and
>hopefully still do) wanted to also support Oracle, which is very similar
> to Postgres. Oracle lacks a SERIAL data type, so it would require the
> sequence method to be used anyway. Continuing to use sequences in
> Postgres would mean less variation between the two and make it a good
> bit easier to manage them.

I agree that if you want to support more DB you are to keep things as similar
as possible. Yet I believe you should try using "specific"
paramenters/features if they don't cost too much or break anything or are
called differently across DB but still do the exact same things..

In fact you already use nextval() which is somewhat pgsql only.

http://www.postgresql.org/docs/8.4/interactive/sql-createsequence.html
(bottom of the page)

Then if you are already using a "specific" parameter to pgsql way of doing
this, why don't just use a proper way to do it? Which is basically the whole
point?

good night and thanks for your feedback =)

On Thursday 23 June 2011 00:12:43 Chris Meller wrote:
> I've skimmed over the chat logs you referenced, but I don't see the vast
> majority of these points discussed... Primarily all I see is a
> discussion about using ID alone as the primary key vs. other fields. So
> I'm going to go through (below) and ask several questions to clarify.
>

> > ------------------------------------------------------------------------
> >
> > cga <mailto:c...@cga.cx>

> <http://www.postgresql.org/docs/8.3/static/sql-createsequence.html> on

> <http://www.postgresql.org/docs/8.1/static/datatype.html#DATATYPE-SERIAL>

--

cga

unread,
Jun 25, 2011, 7:35:57 AM6/25/11
to habar...@googlegroups.com
Hi all,

BIGINT: if it makes you feel safer and you think it's not an issue, fine.

SERIAL vs verbose query:

*provided* *that* as of the feature >= PGSQL 8.2, the use of "old style" vs
"new style" is basically the same:

- 1 there's no point in breaking things by insinsting on the use of SERIAL
when with the use of "old style" + ALTER TABLE .. OWNED BY you achieve the
same results without breaking existing dbs if you use this form:

> CREATE SEQUENCE tablename_colname_seq;
> CREATE TABLE tablename (
> colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
> );


> ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;

- 2 since PGSQL < 8.2 are EOL, you could update existing db by adding the
following to system/schema/pgsql/upgrades/post/*.sql

> ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;

IMHO there's no point in not doing that , unless you want to support EOL
products.

+1 anyone?

Last but not least:

Please excuse my previous errors , I made it perfectly clear I'm not expert
but just willing to help.

Reply all
Reply to author
Forward
0 new messages