<PREMISE>
you might forgive me if I don't use Trac, but I'd like to explain and discuss
what I'm doing with PGSQL schema.sql for Habari before submitting (if my
contrib will be accepted code wise) anything.
Not to mention that in these very days things are getting moved on Github. If
so, I would clone and request for pull once there, since I have an active
account on Github and it seems pointless to get a Trac account now.
BUT: since this very patch isn't going (at least theoretically) to have any
bad effect on Habari , I'm attaching it to this very email for now. For you to
test. Forgive me on this last thing but I haven't set up a test env just yet.
I'll do it after sending this patch.
<little INTRO and WHY>
I'm cga. I'm a Linux SysAdmin and I'm all for "best practices" as much as I
can. Not to mention I'm really paranoid and analytical when it concerns
coding/installing/configuring/deploying. Especially for things I care/use/
administer.
I basically started this little contribution because I really love my stack:
debian + postgresql + nginx and I'd like to use it for all my needs. Habari
included. Ref: http://drunkenmonkey.org/irc/habari/2011-06-19#T17-04-24
I think it'sfair enough for a small intro.
<CONTRIB AND REASONING>
As you can read in the chat log I linked above , I'm not expert in neither
PostgreSQL nor SQL but I like to contribute for as much as I can. Especially
if this can be the only way to keep Habari's support for PostgreSQL alive.
This first little but important contribution it is all about the "CREATE
SEQUENCE" used in PGSQL actual schema.sql.
All the reasoning I do it's to be documented w/ links to official docs
and (attacched) chat logs from freenode.#postgresql , where (beside docs and
wiki) I learn and ask my stuff. And to my great pleasure I always find very
competent and very helpful people.
Please forgive my English errors, I'm Italian. Please *really* forgive
my verbosity but I'm sure you'd understand that people who didn't follow in
freenode.#habari don't know all the reasoning from the room. If I'm to be
accepted I promise I'll be as concise as possible in next communications.
Here we go:
I noticed that you use a lot of serialization in schema.sql , w/o questioning
why (which I'm not in a position to, anyway) I noticed that you (being whoever
did that schema.sql) do it wrong. Why?
Because: CREATE SEQUENCE done in the present form from schema.sql:
>CREATE SEQUENCE {$prefix}posts_pkey_seq;
>CREATE TABLE {$prefix}posts (
> id BIGINT NOT NULL DEFAULT nextval('{$prefix}posts_pkey_seq'),
>.....
>.....
> PRIMARY KEY (id),
>.....
>);
>
1 - It's *almost* correct, but no cigar. It basically miss the most crucial
part of the query to make the SEQUENCE itself have any sense *and* work
properly:
> ALTER SEQUENCE {$prefix}posts_pkey_seq OWNED BY{$prefix}posts.id;
w/o this ALTER TABLE the SEQUENCE is to behave weird more sooner than later,
resulting in duplicate numbers (see chat log; using BIGINT won't save you from
this, more on this later) and errors that are more than diffcult to find when
they happen.
2 - it determines in a recursive way the name of the SEUQNECE itself , which
is wrong because you cannot be 100% sure that will actually work as intended.
(see chat log).
3 - it uses BIGINT. My reasoning is that whoever created this schema.sql
didn't wan't to finish the SEQUENCE too soon. Good idea , bad implementation. I
strongly believe that BIGINT is not needed indeed, and that it's just a waste
of cylces and space. Not to mention that BIGINT behaves differently on 32bit
and 64bit, causing slowness (and perhaps some more side effects) on 32bit.
(this is chat logs from the office, if you need them I'll attach them tomorrow).
And as explained just above, it is not going to be used as a BIGINT anyway
because of the wrong query CREATE SEQUENCE. But the main point why it is not
needed it's because who's to write a post every second for # quintillions
years? I think that a post every second for 68yrs is more than enough
(INTEGER/SERIAL). (see chat log too).
4 - when all the above are to be combined w/ the use of NOT NULL DEFAULT
nextval()" (and I mean all of them are compulsory clauses, see chat log), you
can save a lot of hassles, code and cycles with just the improvement I'm
submitting for every CREATE SEQUENCE query, leaving intact the final result for
what it concerns Habari's functioning *and* improves things a whole lotta
love:
>CREATE TABLE {$prefix}posts (
> id SERIAL,
>.....
>.....
> PRIMARY KEY (id),
>.....
>);
<IMPROVEMENTS ACHIEVED>
Improvements and optimizations achieved with this little patch: (see Ref. [2])
0 - it will leave intact the need of Habari to use the numerical field for
internal functioning.
1 - (id) will be unique and auto incremental by properties inherited from
query.
2 - You won't skip a number in a SEQUENCE.
3 - It won't fail because of wrong creation query.
4 - It won't determine an unexpected SEQUENCE name/relation to the column you
want it to.
5 - OWNED BY is to be used implicitly , this will drop anything related to the
elected talbe and/or column. This sounds harsh, but what will you do w/ data
that isn't associated w/ anything anymore?
<FINAL THOUGHTS>
1 - For the impact on Habari's code (which I didn't read because I don't do
php and couldn't afford to get it wrong) on insert the next value of the
sequence into the serial column please read Ref. [2] just below the PostgreSQL
7.3 Note.
2 - This new schema.sql would work (in theory) for new installations but I
don't know how to implement and test a system/schema/pgsql/upgrades/post/*.sql
yet. Help and advice from you is very much apprecciated.
3 - Let me know if anyone is going to test this before I do. If you want.
Anyway I'm going to test it when my test env will be up and running.
4 - I'm already planning the next batch of fixes but that can wait for now. I
anticipate that it will be about optimizing queries and about keyvil issues as
much as possible and w/o breaking Habari Ref. [3].
My pleasure to meet you.
<REFERENCES>
Ref. chat log attacched.
Ref [1]. http://www.postgresql.org/docs/8.4/interactive/sql-
createsequence.html
Ref [2]. http://www.postgresql.org/docs/8.4/interactive/datatype-
numeric.html#DATATYPE-SERIAL
Ref. [3] http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-i-7327
http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-ii-7345
http://it.toolbox.com/blogs/database-soup/primary-keyvil-part-iii-7365
--
knowledge has no owner, only means to reveal itself
cga
June 22, 2011 3:13 PM
w/o this ALTER TABLE the SEQUENCE is to behave weird more sooner than later, resulting in duplicate numbers (see chat log; using BIGINT won't save you from this, more on this later) and errors that are more than diffcult to find when they happen.
2 - it determines in a recursive way the name of the SEUQNECE itself , which is wrong because you cannot be 100% sure that will actually work as intended. (see chat log).
3 - it uses BIGINT. My reasoning is that whoever created this schema.sql didn't wan't to finish the SEQUENCE too soon. Good idea , bad implementation. I strongly believe that BIGINT is not needed indeed, and that it's just a waste of cylces and space. Not to mention that BIGINT behaves differently on 32bit and 64bit, causing slowness (and perhaps some more side effects) on 32bit. (this is chat logs from the office, if you need them I'll attach them tomorrow). And as explained just above, it is not going to be used as a BIGINT anyway because of the wrong query CREATE SEQUENCE. But the main point why it is not needed it's because who's to write a post every second for # quintillions years? I think that a post every second for 68yrs is more than enough (INTEGER/SERIAL). (see chat log too).
4 - when all the above are to be combined w/ the use of NOT NULL DEFAULT nextval()" (and I mean all of them are compulsory clauses, see chat log), you can save a lot of hassles, code and cycles with just the improvement I'm submitting for every CREATE SEQUENCE query, leaving intact the final result for what it concerns Habari's functioning *and* improves things a whole lotta love:CREATE TABLE {$prefix}posts ( id SERIAL, ..... ..... PRIMARY KEY (id), ..... );
> How would this change affect people upgrading their databases? Would
> the tables need to be totally rebuilt?
ANSWER:
Right now I already have a compromise that will keep existing tables intact.
Alas, using old style by adding the missing part to SQL w/ upgrade/2265.sql:
> ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
to all sequences in the upgrade. This won't affect the data, only add the
missing part to the original schema.
IDEAS AND TESTS:
From here on are things I want to try:
While for new db it's not to going be a problem and would make a good
schema.sql , I really want to understand how to migrate to the proposed schema
also for those who have an existing db. I already have the raw idea and I'm
already planning some steps and tests to "gradually" move to "new style" and
keep the old db/tables intact.
Raw Idea: (implies that we used the above mentioned compromise already).
When we issue ALTER SEQUENCE, it won't affect the sequence and will allow us to
use "pg_get_serial_sequence()" to find the name of the sequence associated with
a given column.
Then with the use of currval() function, which returns the most recent value
generated by a sequence for the current session, combined with
"pg_get_serial_sequence()" that we are now able to use:
> SELECT currval(pg_get_serial_sequence('tablename', 'colname'));
we retrieve the most recent value generated by that specific sequence.
This value (for each sequence) could be stored in a variable_seq# or tmp table
(either for each sequence) and set as "START WITH start" for each table that
is to be recreated at a certain given point in the SEQUENCE.
My educated guess it's that this (and perhaps some more code in upgrade and
perhaps php?) will allow us to extract the exact point of a given sequence at
the moment of upgrade and migrate the existing to new schema too.
Basically the step could be:
0 - backup. backup. backup. and do not put anything more into db.
1 - ALTER SEQUENCE .. OWNED BY ... ;
2 - SELECT currval(pg_get_serial_sequence('tablename', 'colname')); and store
that in a variable or tmp_table (either one of those, but one for each
sequence);
3 - dump the db, drop the db.
4 - apply new schema w/ use of SERIAL ;
5 - import the previusly dump for old db w/ use of tmp_table/var and the
"START WITH var#" parameter.
6 - Check the resutls, prolly Frankenstein..... by Mel Brooks.
But this will require a test env, which I'm going to install later or at max
tomorrow.
I appreciate your comments, but keep in mind that I'm not a DBA nor a SQLNinja
nor a PostgreSQL, expert. I'm learning and trying to help. That's all, and
that was specified too. Any contribution that can make me grow is more than
welcome (of course while I do something hoepfully useful for Habari too).
The chat logs are very informative about the things I mentioned. I attached
them on puropose so people could read what they exactly told me when asking
help and seeking info/trying to understand what I was going to do. I might
have misunderstood a point, even if I asked until I got it, so their written
and logged words are to be a better source if you please.
One thing is for sure for me and my personal knwoledge: they are certainly
expert on the matter (more than me) and I trust their word as much as I trust
yours. (the PKEY issue was briefly chatted in perhaps 4 or 5 lines, but you got
it right, it is something that I would implement if i was to plan a schema
from scratch and still would like to if it doesn't break a thing in Habari)
Now,
<BIGINT>
I could agree on the use of BIGINT if BIGINT keeps you safer for the reasons
you mentioned, fair enough. After "deciding" to use only INTEGER (2 billions
of post for a blog? eh!) I made a joke:
>[Wednesday 22 June 2011] [00:22:55] * cga 30yrs later: damn.... this is
ipv4 vs ipv6 all over again..... i should have used BIGSERIAL!!! :P
That reflects my first intention to stick w/ BIGINT/BIGSERIAL.
But still: is this a ecommerce platform or a blog engine? Do you post w/
scripts ever second? Wouldn't be better to use twitter widgets instead of
spamming the db? Shouldn't plugins log to filesystem?
Anyway if BIGINT makes you folks feel safer, I agree to use it.
<ALTER SEQUENCE ... OWNED BY ... ;>
Allow me to say that if you used ALTER SEQUENCE ... OWNED BY ... since the
beginning you could extract the *exact* sequence point, not guess it, during
migration prior to drop, and used it for anything related to migration. Hence
>we would generate duplicate IDs - a Very Bad Thing (tm).
wouldn't to be the case. At all. That's what I meant with:
> 1 - It's *almost* correct, but no cigar. It basically miss the most crucial
> part of the query to make the SEQUENCE itself have any sense *and* work
> properly
Anyway this is something I want to test, like I replied to Rick in the other
email. Tomorrow I will test.
RhodiumToad already told me that they will not look the same across "old
style" and "new style", but still I want to try. At least to check that the
use "old style" w/ ALTER SEQUENCE ... OWNED BY ... ; could be good enough and
prove you wrong on the migrate issue. (speaking of old style vs old style w/
ALTER OWNED).
As for legacy support (pgsql 8.1 in debian >etch?) I can see why you did it,
but as you guessed correctly that the most wide used should be 8.3 and 8.4
(which is my case on lenny w/ bpo repo) and marginally 9.x on cutting edge
distro like Arch (my distro of choice for desktop use).
So applying ALTER SEQUENCE ... OWNED BY .... ; it is not as much of a problem
unless you use EOL version of PGSQL . If they introduced such a feature it
might have been for a good reason. The above mentioned could be it, for one.
http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy
Unless of course you want to support all of the EOL products, good luck.
You also aid that:
> So the only advantage we'd get is that if the table were to be dropped
> the sequence would be automatically deleted with it.
Not only this....
> I can't see any instance in which we would be dropping the Posts table (or
most of the others) programmatically, so it really provides no extra value for
developers.
....in fact it allows you a key feature that is the main reason why you could
have migrated w/o guessing. The use of:
SELECT currval(pg_get_serial_sequence('tablename', 'colname'));
pg_get_serial_sequence() is available only if you either use SERIAL or the
"old method" correctly (w/ ALTER OWNED).
>I don't understand this. Can you explain how it's ever going to behave
> in an unexpected way? How would a sequence ever result in duplicate
> numbers, unless it has been manually ALTERed with the RESTART argument?
>.....
> The use of the sequence, whether OWNED BY or not, should have nothing to
> do with whether the destination field is an INT or a BIGINT.
My bad, I misuderstood the problem of misbehaviour, I mistook numbers with
names... so I'll let you read what RhodiumToad told me about it, the very first
minutes after I asked for help.
From:
> [Tuesday 21 June 2011] [22:00:10] <RhodiumToad> you should avoid assuming
you know the sequence name, of course - use insert ... returning or
pg_get_serial_sequence
To at least:
>[Tuesday 21 June 2011] [22:11:50] <RhodiumToad> we're talking about the NAME
of the sequence, not the value
>[Tuesday 21 June 2011] [22:11:59] <cga> thanks a lot RhodiumToad
BUT *IF* Habari also relies on sequences' names too, you are to be
experiencing issues and headaches. == RPITA.
<SERIAL>
>I'm not sure I understand your point here either. The sequence name is
>dynamically created during install based on the DB prefix you specify,
>yes.
Consider this (from the very first lines of logs):
> [Tuesday 21 June 2011] [21:59:28] <RhodiumToad> those are equivalent, yes
(the first one is of course impossible due to the circular reference)
> [Tuesday 21 June 2011] [22:00:10] <RhodiumToad> you should avoid assuming
you know the sequence name, of course - use insert ... returning or
pg_get_serial_sequence
He's talking about 1st == your schema.sql w/ use of ALTER SEQUENCE ; 2nd my
patches.
> Again, how is that going to cause a problem and (aside from using a
>SERIAL, as mentioned below) what other alternative is there?
Using SERIAL you lets pgsql determine things sequence related, while
preserving your ${prefix}foo thing intacted for the table, and have the
advantages mentioned in docs for DATATYPE-SERIAL and my first email.
Anyway,
If you don't want to implement SERIAL it's another matter on its own. Yet the
pros as more than the cons you've mentioned (most of them are obsolete anyway
or non issues to my way of seeing this if we don't consider existing db for a
day or two until i test).
> On the flip side, since they're the same there's likewise no real
>compelling reason to use SERIAL over a sequence. We had previously (and
>hopefully still do) wanted to also support Oracle, which is very similar
> to Postgres. Oracle lacks a SERIAL data type, so it would require the
> sequence method to be used anyway. Continuing to use sequences in
> Postgres would mean less variation between the two and make it a good
> bit easier to manage them.
I agree that if you want to support more DB you are to keep things as similar
as possible. Yet I believe you should try using "specific"
paramenters/features if they don't cost too much or break anything or are
called differently across DB but still do the exact same things..
In fact you already use nextval() which is somewhat pgsql only.
http://www.postgresql.org/docs/8.4/interactive/sql-createsequence.html
(bottom of the page)
Then if you are already using a "specific" parameter to pgsql way of doing
this, why don't just use a proper way to do it? Which is basically the whole
point?
good night and thanks for your feedback =)
On Thursday 23 June 2011 00:12:43 Chris Meller wrote:
> I've skimmed over the chat logs you referenced, but I don't see the vast
> majority of these points discussed... Primarily all I see is a
> discussion about using ID alone as the primary key vs. other fields. So
> I'm going to go through (below) and ask several questions to clarify.
>
> > ------------------------------------------------------------------------
> >
> > cga <mailto:c...@cga.cx>
> <http://www.postgresql.org/docs/8.3/static/sql-createsequence.html> on
> <http://www.postgresql.org/docs/8.1/static/datatype.html#DATATYPE-SERIAL>
--
BIGINT: if it makes you feel safer and you think it's not an issue, fine.
SERIAL vs verbose query:
*provided* *that* as of the feature >= PGSQL 8.2, the use of "old style" vs
"new style" is basically the same:
- 1 there's no point in breaking things by insinsting on the use of SERIAL
when with the use of "old style" + ALTER TABLE .. OWNED BY you achieve the
same results without breaking existing dbs if you use this form:
> CREATE SEQUENCE tablename_colname_seq;
> CREATE TABLE tablename (
> colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
> );
> ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
- 2 since PGSQL < 8.2 are EOL, you could update existing db by adding the
following to system/schema/pgsql/upgrades/post/*.sql
> ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
IMHO there's no point in not doing that , unless you want to support EOL
products.
+1 anyone?
Last but not least:
Please excuse my previous errors , I made it perfectly clear I'm not expert
but just willing to help.