we are facing the task of migrating a large (600k LOC) oracle backend to
postgresql. As i figured out by now is that we theoretically have two
options:
1. teach Postgres to understand PL/SQL by using 3rd-party-software
2. convert oracle PL/SQL to PG PL/SQL
Has anyone of you already experience with such a job?
Where are problems to be expected?
Are there some tools for automated translating oracle PL/SQL to PG
PL/SQL? How good are they?
Are there other options except the two above?
What option is to be preferred?
....and many, many more...
Thank you in advance.
Thomas
I've had such a pilot project but was forced to give it up. There are
several major snags with Postgres:
1) Postgres is the only major database without hints. That makes
keeping any deadlines impossible and makes it necessary to rewrite
each and every SQL. Tuning SQL in Postgres is done by rewriting it
and fixing the model, which takes time, in my case more time than
I was alloted. They do have "set mode" statements for the session,
but that's in effect for the entire session. Granularity is wrong.
2) Postgres partitioning is not very robust. It's even weaker than
Oracle7 partition views because partition elimination only happens
during the parse phase. Once the statement is parsed, that's it.
Did I mention that there are no global indexes? Also, optimizer seems
to be confused with partitioning and is frequently producing bad plans,
based on the full table scan of the entire table.
3) The most important problem are the people leading the community. They
stubbornly refuse to even talk about hints. Also, there is a closed
source version of Postgres, called EnterpriseDB which has hints. One
of the pillars of Postgres community is a guy who works for
EnterpriseDB and is opposed to hints in the open source version but is
merrily selling them in the closed source version. Here is a pearl of
wisdom, contributed to this valley of tears by another shining pillar
of the Postgres community: http://tinyurl.com/68gu822
To dispel any doubts, I am the DBA that this genius is talking about.
4) Postgres optimizer is bad. Statistics anomalies are frequent, there is
an email list devoted to Postgres performance and every other question
is "how to speed up this query". There is no tuning methodology, no
event interface, nothing.
5) There is no parallelism. Oracle can parallelize queries, Postgres can
not. It's as simple as that. There is no cure in sight.
There are some other problems, like not caching parsed SQL (no shared
pool) but these were sufficient to cast serious doubt on Postgres as a
viable option. I am an Oracle DBA for more than 20 years and recent
changes in the pricing policy have motivated the company that I work for
to look for alternatives, starting with OSS software. Sadly, there are no
open source alternatives. The spirit in the Postgres community ensures
that things will remain as they are, for the foreseeable future. The next
step is to look for the commercial alternatives, like DB/2. However, if
you are building a data warehouse, take a look at MongoDB. It's a NoSQL
database which can be used to create a very decent data warehouse.
If you want to rewrite Oracle code and use it on Postgres, forget it.
It's going to be a frustrating and humiliating experience and will
produce an underperforming application system. You can contact me
directly, should you need any more information.
thank you for your detailed response.
> I've had such a pilot project but was forced to give it up.
Hmm. That's just *not* the answer was hoping to hear - but the one i
expected.
> 1) Postgres is the only major database without hints.
The "hint-discussion" is not new to me. I am already following it for
some time.
> 2) Postgres partitioning is not very robust.
OK. Good to know. But that would not be the show stopper in my case.
> 3) The most important problem are the people leading the community. They
> stubbornly refuse to even talk about hints. Also, there is a closed
> source version of Postgres, called EnterpriseDB which has hints.
Would EnterpriseDB be a better destination to migrate to from oracle?
> 4) Postgres optimizer is bad.
Let's assume that this would not be a problem to me.
> 5) There is no parallelism. Oracle can parallelize queries, Postgres can
> not. It's as simple as that. There is no cure in sight.
Also good to know. But not critically to my szenario.
> I am an Oracle DBA for more than 20 years and recent
> changes in the pricing policy have motivated the company that I work for
> to look for alternatives, starting with OSS software.
That's exactly the situation that i am facing now.
Most of your critics on postgres refer to large amounts of data and
perfomance critical applications. In my scenario the oracle-db neither
uses hints or partitioned tables nor does performance matter that much.
I "only" have the problem of porting the 600k LOC PL/SQL application
logic (triggers, strored procedures etc.) to postgres.
Leaving your legitimate criticism on performance issues aside, and only
focussing on the pl/sql code, would you then still stay with your
introductory statement that it is impossible to migrate the code?
Thomas
> Most of your critics on postgres refer to large amounts of data and
> perfomance critical applications. In my scenario the oracle-db neither
> uses hints or partitioned tables nor does performance matter that much.
> I "only" have the problem of porting the 600k LOC PL/SQL application
> logic (triggers, strored procedures etc.) to postgres.
If that is the case, I would try mixing PlPgSQ with PLPerl. PlPgSQL is
far inferior to PL/SQL. No packages, no procedures, but with Perl you can
get it all. With the untrusted version, you can call external modules.
Before you start: RAISE NOTICE is the PLPgSQL equivalent of
dbms_output.put_line.
> Would EnterpriseDB be a better destination to migrate to from oracle?
That depends. It definitely has more useful goodies than the normal
Postgres, but I am not sure how stable the company is. They didn't make a
splash, EnterpriseDB is not becoming wildly popular, which is rather
strange, given how unhappy people are with the Oracle pricing. I don't
really know what seems to be the problem, but I am a DBA, not a business
person.
I think that option 1 won't work.
You can look at EnterpriseDB, which is a closed source database based
on PostgreSQL that tries to be as similar to PostgreSQL as possible in
order to facilitate migration, but don't expect that you can just
go on using your PL/SQL procedures without modification.
Ask them, they'll be happy to tell you more.
Option 2 might or might not be feasible.
The syntax of PL/pgSQL is similar enough to PL/SQL that the code
itself won't be hard to port. You'll run into problems as soon as
you make heavy use of Oracle's goodies like the extensive PL/SQL
library, the LOB interface, autonomous transactions, bulk operations,
the object oriented cruft -- to name just a few of the more common
things that I can think of right away.
If your code uses these a lot, porting will be pretty difficult and
will probably end up in a rewrite. Note also that, different from PL/SQL,
PL/pgSQL is not designed to be an efficient all-purpose programming
language, it's designed for database manipulations where SQL won't suffice.
If you need to access the file system or other things outside the database,
it's better to use other languages like PL/Perl.
On the other hand, if your code is e.g. mostly triggers that modify
tables, porting might not be too difficult.
Yours,
Laurenz Albe
PS: Don't let people discourage you. PostgreSQL is a good and stable
database system.
> PS: Don't let people discourage you. PostgreSQL is a good and stable
> database system.
It is stable, no contest there. Is it good or not is a matter of opinion.
I could list several grave faults which make it less than good in my eyes.
The first and foremost is, of course, the lack of hints. Postgres is the
only major database system which lacks hints. The second is ludicrous
need to "vacuum" database. The third is the fact that Postgres doesn't do
multi-block reads. Fixed size archive logs (16M) would be the next issue.
The lack of shared pool makes it impossible to see the plan of the
statement being executed. And there is more.
You begin to sound like a broken record.
There are pros and cons to hints, and you can like them or not, but
from what you write it seems to be more of an emotional issue than anything
else, so I won't enter a discussion here.
A liberal use of pejoratives is no substitute for a technical argument.
I understand that with your Oracle background it is easy to think
that there is only one way to skin the cat, and other approaches are
"ludicrous".
Please don't turn into a troll.
Yours,
Laurenz Albe
PS: I have a question concerning Oracle's superior query optimizer.
I am dinking around with a "Foreign Data Wrapper" for the PostgreSQL 9.1
implementation of SQL/MED to access Oracle. Now it would be nice
to get optimizer information from Oracle and feed it back to PostgreSQL
so that it can be used for planing a query with a foreign table.
Is there any way to get a decent estimate how expensive Oracle thinks
a query might be? Something that can be expressed in units like blocks read
from disk or time spent? All I can find is the TIME column of PLAN_TABLE
which has a granularity of seconds (!) and is never less than 1.
He's been in my killfile since about 1 year...
> You begin to sound like a broken record.
Of course. There is no change in the situation.
> There are pros and cons to
> hints, and you can like them or not,
All other major databases have them. That should be enough for now.
> but from what you write it seems to
> be more of an emotional issue than anything else, so I won't enter a
> discussion here.
So what is this post all about, if you don't want to enter discussion?
In my understanding it is about asking you not to go completely
berserk.
/Str
> In my understanding it is about asking you not to go completely berserk.
I am not going "berserk" at all, quite contrary. I am remaining polite,
even in the face of posts like yours. It is not me who is refusing the
debate. I do have a voice in the Oracle community and I will continue to
warn people considering Postgres as a possible alternative to Oracle what
are they up against, should they decide to try going down that road.
There are many deficiencies of PostgreSQL and I'll gladly point them out
to any Oraclite who is considering PgSQL as a possible replacement. The
phrase about Postgres being an open source Oracle is a misleading lie.
If you want to try debating the possible reasons for not providing hints
to Postgres users, I'll gladly debate it with you. Frankly, I don't think
it makes much sense, the spirit of Postgres community is far from
instilling confidence. Maybe that's why PGEast was such a flop? By the
way, what do you think, why is Postgres so far behind in popularity
behind MySQL? I just want to be the voice of reason, even at the risk of
sounding like a broken record.
Lastly, this forum is not moderated and the only way I can be dealt with
is the force of arguments. From what I've seen so far, there is no danger
of that ever happening. I would even suggest that rude posts like
Laurenz's or yours, in which you accuse a complete stranger of "sounding
like a broken record" or "going berserk", are more akin to trolling than
polite and rational posts of the kind I am attempting to promote here.
I don't intend to get into a shouting match with either of you, but if
somebody asks about conversion from Oracle, you can count on me being
here. Again, the only way to make me go away is the force of arguments.
Unfortunately people often forget that "good" in itself is meaningless
and do not provide a list of criteria as you do.
> I could list several grave faults which make it less than good in my eyes.
> The first and foremost is, of course, the lack of hints. Postgres is the
> only major database system which lacks hints. The second is ludicrous
> need to "vacuum" database.
Why do you consider this bad? I agree that it sounds silly at first
sight that the user needs to take care of cleaning up deadwood in the
database. On the other hand, on an Oracle instance you'll likely also
have periodic jobs doing statistics updates which is what VACUUMing also
does. Reducing the work a transaction has to do can be beneficial for
the client, too.
> The third is the fact that Postgres doesn't do
> multi-block reads.
Isn't this alleviated by the OS doing readahead on many platforms?
> Fixed size archive logs (16M) would be the next issue.
This can be changed during compilation though. But I agree, you'd
rather want this to be tuneable at runtime - at least without
recompiling the software.
> The lack of shared pool makes it impossible to see the plan of the
> statement being executed. And there is more.
That's unfortunate. At least you can see SQL currently executed by
querying pg_stat_activity.
Kind regards
robert
> Why do you consider this bad? I agree that it sounds silly at first
> sight that the user needs to take care of cleaning up deadwood in the
> database. On the other hand, on an Oracle instance you'll likely also
> have periodic jobs doing statistics updates which is what VACUUMing also
> does. Reducing the work a transaction has to do can be beneficial for
> the client, too.
No, what I consider bad is the architecture of putting dead rows in the
table. That prevents Postgres from assigning a fixed row id to every row
and from creating global indexes on partitioned tables. I believe that
they should consider putting previous versions of the row in a separate
table and have the vacuum process, let's call it "smon" for the sake of
argument, do a periodical cleanup. That way rows in the table are not a
moving target, there is a possibility of assigning a permanent row id,
which would also allow creating global indexes.
>
>> The third is the fact that Postgres doesn't do multi-block reads.
>
> Isn't this alleviated by the OS doing readahead on many platforms?
Weeellllll....it is. You are right, my objection is mostly of aesthetic
nature. When I come to think of it, it does simplify the optimizer
because there is only one type of I/O to think about.
>
>> Fixed size archive logs (16M) would be the next issue.
>
> This can be changed during compilation though. But I agree, you'd
> rather want this to be tuneable at runtime - at least without
> recompiling the software.
>
>> The lack of shared pool makes it impossible to see the plan of the
>> statement being executed. And there is more.
>
> That's unfortunate. At least you can see SQL currently executed by
> querying pg_stat_activity.
>
> Kind regards
Oh yes, and there is also auto explain module which will dump the plan of
any statement running longer than a predefined period of time (tunable).
What I really miss is the ability to click on the session in PgAdmin3 and
see the plan of a running statement. Also, sorting by the CPU time or the
number of I/O requests issued would be nice.
> I believe that
> they should consider putting previous versions of the row in a separate
> table
What I meant is "a separate file".
I have no problem to discuss hints and other perceived or real shortcomings
of PostgreSQL, only before I do this I wanted to ascertain that you are still
able to post a civil reply. I wanted to bring the heat level down.
I think that founded criticism of PostgreSQL is not off-topic here, and
I have no problem with it. Yet lately you have spread a great deal of
FUD around here.
So here I go.
About hints:
============
Let me say that I personally do not feel very strongly
either way - I have used hints in Oracle and get by without them in PostgreSQL.
To begin, I searched the archives to verify your (insulting) claims about
what the PGDG thinks about query hints. Here is a good statement I've found:
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00436.php
And there is of course the entry in the TODO list that states that optimizer
hints are not wanted. It points to three threads:
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00506.php
http://archives.postgresql.org/pgsql-hackers/2006-10/msg00517.php
http://archives.postgresql.org/pgsql-hackers/2006-10/msg00663.php
Sifting through some of the e-mails, here is a list of some arguments against
query hints that I found:
- They create a maintenance burden whenever there is a change in the optimizer.
- They are a hack that does not address the real problem (a deficiency in the
planer) and may not work well in the next version or when the data change.
- They will cause people to work around deficiencies in the planer, rather than
report their problem so it can be fixed.
The first point is something that could be handled if more able people were
willing to put resources into it. The third point is probably more idealistic
than a commercial vendor would be (fewer bug reports mean less work for support).
Still I think all three are valid points.
The only recognizable argument you made so far is "everybody else has them,
so they must be the right thing". PostgreSQL hackers will reply that they are
out to do things *better* than everybody else, so while it is certainly a good
idea to seriously consider something that everybody else does, that does not
dictate that PostgreSQL has to do it the same way.
If you really want to influence the PostgreSQL planer, there is a number of
configuration parameters that allow you to do this (only for a whole query though).
Also, a query rewrite can often do that same thing that a hint would do
(didn't you post an example recently?).
Interestingly, Oracle also seems to think that hints are not so great.
I quote from
http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/sql_elements006.htm#SQLRF00219
Hints were introduced in Oracle7, when users had little recourse if the
optimizer generated suboptimal plans. Now Oracle provides a number of tools,
including the SQL Tuning Advisor, SQL plan management, and SQL Performance
Analyzer, to help you address performance problems that are not solved by
the optimizer. Oracle strongly recommends that you use those tools rather
than hints. The tools are far superior to hints, because when used on an
ongoing basis, they provide fresh solutions as your data and database
environment change.
Hints should be used sparingly, and only after you have collected statistics
on the relevant tables and evaluated the optimizer plan without hints using
the EXPLAIN PLAN statement. Changing database conditions as well as query
performance enhancements in subsequent releases can have significant impact
on how hints in your code affect performance.
So it seems that Oracle considers hints a legacy from when their optimizer was
even worse and they had nothing better to offer.
About vacuum:
=============
Of course vacuum is a pain, but every database system has to pay a similar
price somewhere. Either you reduce concurrency (DB2), or (if you use some kind
of MVCC) you can move dead rows somewhere else (Oracle) or leave them where
they are and clean up later.
Each approach has its disadvantages.
Oracle has to burn extra cycles (and disk access?) to move rows away (and maybe
back) right when the performance might be needed elsewhere. Moreover, there's
the problem of having to maintain an extra data structure (undo space) which
may run out, causing an expensive long-running operation to fail randomly.
PostgreSQL's VACUUM is of course also unpleasant, but the pain has become much
less with the greatly improved autovacuum of recent versions. Normally, you don't
have to worry about it any more.
The design principle behind VACUUM is that the maintenance burden is delayed
to a time when the system is not totally busy, so that performance is not affected
by it.
ROWID:
======
You make the point that because of VACUUM there is nothing compared to
Oracle's ROWID. But do you really need that? Shouldn't a row be referenced
by its primary key?
Oracle itself says in
http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/pseudocolumns008.htm#SQLRF00254
You should not use ROWID as the primary key of a table. If you delete and
reinsert a row with the Import and Export utilities, for example, then its
rowid may change. If you delete a row, then Oracle may reassign its rowid
to a new row inserted later.
Other maintenance commands like ALTER TABLE ... SHRINK SPACE will also change
ROWID.
Oracle pays a price to keep ROWID as stable as it is. Why? Because reference via
ROWID is faster than anything else? Why do you need that outside of the scope of
one transaction? And inside one transaction, PostgreSQL's ctid should do the trick
for you.
I do not understand how lack of ROWID should affect global indexes on partitioned
tables. Can you explain?
Fixed size of WAL segments:
===========================
Maybe I am dense, but what is the problem there?
Yours,
Laurenz Albe
PG's ctid is very similar as Oracle's ROWID (as similar as things like that can be between completely different architectures)
I mentioned that. I am afraid that my posting was too long and maybe not
organized enough so that it got lost in the noise. Sorry.
Yours,
Laurenz Albe
> Mladen Gogala wrote:
>> So what is this post all about, if you don't want to enter discussion?
>
> I have no problem to discuss hints and other perceived or real
> shortcomings of PostgreSQL, only before I do this I wanted to ascertain
> that you are still able to post a civil reply. I wanted to bring the
> heat level down.
You wanted to bring the level down by telling me that I sound like a
broken record? I do find that interesting.
>
> I think that founded criticism of PostgreSQL is not off-topic here, and
> I have no problem with it. Yet lately you have spread a great deal of
> FUD around here.
>
> So here I go.
>
> About hints:
> ============
>
> Let me say that I personally do not feel very strongly either way - I
> have used hints in Oracle and get by without them in PostgreSQL.
Unfortunately, I was forced to kill a porting project because I couldn't
have make the SQL perform properly within the allotted time frame. The
goal was to make an oracle application written in Java work against
Postgres. The tables and indexes were converted using ora2pg. There were
few triggers which I recoded manually. However, the performance was
abysmal because of bad plans and the only chance to make the deadline
was to have a mechanism to force specific plans. I didn't have that. I
was even reasonable enough to suggest that such mechanism should be only
put on TODO list, but nobody wanted to even discuss that. I was
steamrolled, so I killed the project and stopped all the plans for
porting applications. In the end, we both lost: Postgres lost a user, I
lost an opportunity to save some money for the company.
> Sifting through some of the e-mails, here is a list of some arguments
> against query hints that I found:
> - They create a maintenance burden whenever there is a change in the
> optimizer. -
They do. The optimizer itself is a maintenance burden. Every new feature
is a maintenance burden.
> They are a hack that does not address the real problem (a
> deficiency in the planer)
This argument is ludicrous. Hints are the way around the deficiencies in
the planner. Achieving the "perfect optimizer" is impossible at this
state of technology. Oracle has excellent optimizer and vast resources
and yet it provides hints. Hints are a solution of the last resort.
Undesired and unloved, but it has to be there when necessary. If they are
not there, the database will not get used.
> and may not work well in the next version or when the data
> change.
Well, that's up to the maintainers. However, the type of decisions
influenced by hints is unlikely to change between versions.
> - They will cause people to work around deficiencies in the planer,
> rather than
> report their problem so it can be fixed.
Unfortunately, the real developers and project managers work on projects
that have deadlines. Users must have tuning methodology. Postgres will
have to trust its users to report problems. On the other hand, if the
community doesn't trust its users, it is very likely that it will not
have many of them. Personally, I don't think that posting the problem on
the mailing list is a satisfactory tuning methodology. I'd rather kill
the project than do it that way. So I did.
>
> The first point is something that could be handled if more able people
> were willing to put resources into it. The third point is probably more
> idealistic than a commercial vendor would be (fewer bug reports mean
> less work for support). Still I think all three are valid points.
On the other hand, I interpret those points as a deep mistrust of the
Postgres community to its users and the reluctance to provide the
necessary tools for the users to have chance to help themselves.
> The only recognizable argument you made so far is "everybody else has
> them, so they must be the right thing".
No. The argument that I have made is not about everybody else. That
"argument" was thrown when I was faced with the "everybody else is
against" argument. The real argument is that hints are needed and that
the users must have a way of helping themselves. What you and the rest of
the Postgres community are arguing against is allowing users to help
themselves. Essentially, Postgres is the only database that doesn't allow
its users to help themselves. That is the argument.
> PostgreSQL hackers will reply
> that they are out to do things *better* than everybody else, so while it
> is certainly a good idea to seriously consider something that everybody
> else does, that does not dictate that PostgreSQL has to do it the same
> way.
Let's be realistic here. Postgres is one of the available RDBMS systems
on the market, far behind other systems in popularity, even behind MySQL.
Nobody uses PgSQL for benchmarks (see www.tpc.org) because of the
performance problems. Such grandeur is not justified. The goal must be to
give the users something that they can use as an alternative to the much
better performing commercial databases like Oracle or DB2. In other
words, Postgres must be an alternative to Oracle. At present, it isn't. I
tried, I failed and looked elsewhere. I am not the only one, I am just
the most vocal one.
When NY PUG was revived in 2010, there were people from BoA, Goldman
Sachs, Verizon and other major NYC companies. They stopped coming to the
meetings and PGEast in NYC in March this year was a flop. I haven't
noticed any attempts to figure out what was the problem.
>
> If you really want to influence the PostgreSQL planer, there is a number
> of configuration parameters that allow you to do this (only for a whole
> query though). Also, a query rewrite can often do that same thing that a
> hint would do (didn't you post an example recently?).
Yes, I did post an example. There are limited cases, but that is not
enough. There must be a way of making SQL perform adequately in minutes.
If a VP is looking over your shoulder and asking "is it working yet",
there is no time to wait for Tom Lane to respond.
>
> Interestingly, Oracle also seems to think that hints are not so great. I
> quote from
> http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/
sql_elements006.htm#SQLRF00219
Don't get me wrong. Hints are the last resort, the solution used only in
desperation. However, they must be there and they are there, at least in
Oracle. Real life situations, with real life problems also makes
desperate situations highly probable. As DBAs, we are frequently in dire
straits, and have to be the sultans of swing. And we can't swing if don't
have hints.
>
> Hints were introduced in Oracle7, when users had little recourse if
> the optimizer generated suboptimal plans. Now Oracle provides a number
> of tools, including the SQL Tuning Advisor, SQL plan management, and
> SQL Performance Analyzer, to help you address performance problems
> that are not solved by the optimizer. Oracle strongly recommends that
> you use those tools rather than hints. The tools are far superior to
> hints, because when used on an ongoing basis, they provide fresh
> solutions as your data and database environment change.
Of course that Oracle recommends those tools. Performance and diagnostic
packs are $5000 per CPU thread, each. Therein lies the problem. If it
wasn't for Oracle pricing, I wouldn't even look twice at Postgres, would
try the pilot project and discard it straight away. This is sales pitch.
Oracle, however, knows better than to take hints away. Users must have
the self help methodology. The main problem of Postgres is that it
doesn't provide that. We've both read Cary Millsap's book. This book
outlines the tuning methodology for Oracle. The tuning methodology has
two parts to it:
1) Figure out where the time is spent
2) Shorten the time.
What is so groundbreaking in Millsap's book is the recommendation to
always start from the application instead of the database and to stop
looking at "performance indicators" like the infamous "buffer cache hit
ratio" or "BCHR", as it was known before Connor McDonald's method of
adjusting it according to your heart's desires.
In order to aid users with that methodology, Oracle has developed an
extensive set of V$ tables which help users dissect the performance
problems and en extensive system of hints which can be used to solve the
performance problems. Those "profiles" mentioned in the Oracle manual are
nothing more than the complete sets of hints. See the recent book by
Christian Antognini. In other words, OEM will write your hints for you
and store them as "profile".
Speaking of the tuning methodology, that's precisely the problem with
Postgres: it has none. When faced with performance problems in Oracle,
there are things I can do, short of rewriting the application and
rethinking the data model. In Postgres, I can only post to the list and
hope for the best.
>
> Hints should be used sparingly, and only after you have collected
> statistics on the relevant tables and evaluated the optimizer plan
> without hints using the EXPLAIN PLAN statement. Changing database
> conditions as well as query performance enhancements in subsequent
> releases can have significant impact on how hints in your code affect
> performance.
Hints in the code are needed mainly for two purposes:
1) To ensure/prevent the use of the particular index.
2) To select a join method (hash/merge/nl).
I don't see the need for those decisions being eliminated anytime soon.
All versions of the optimizer will have to make such decisions in the
foreseeable future. In the likely situation that the decision will
sometimes be wrong, the users must be allowed to make the choice for the
optimizer. The main problem with Postgres is that it has no self help
methodology for users. I once asked Joshua Drake what is user to do when
faced with a performance problem. The answer was something that sounds
like a bad joke: "post it in the mailing list and we'll tell you what to
do". Unfortunately, I cannot accept such "tuning methodology". Postgres
will have to come up with a better one or sink into obscurity.
>
> So it seems that Oracle considers hints a legacy from when their
> optimizer was even worse and they had nothing better to offer.
I don't see it that way, as explained above.
> Oracle has to burn extra cycles (and disk access?) to move rows away
> (and maybe back) right when the performance might be needed elsewhere.
Well, Oracle still defeats Postgres in performance, and by a wide margin.
That means that the cost is not as great as it may seem at the first
glance.
> Moreover, there's the problem of having to maintain an extra data
> structure (undo space) which may run out, causing an expensive
> long-running operation to fail randomly.
Well, there is a whole philosophy about "snapshot too old". That, I
suppose, would be open to vote.
> ROWID:
> ======
>
> You make the point that because of VACUUM there is nothing compared to
> Oracle's ROWID. But do you really need that? Shouldn't a row be
> referenced by its primary key?
Rowid is needed for the global indexes. I am frustrated with Postgres
partitioning, which leaves a lot to be desired. I can build a global
index using Sphinx, and yes, it does use the primary key. Unfortunately,
the problem with building the global index on the partitioned table is
precisely the lack of the universal primary key, like ROWID. Somebody has
mentioned CTID, but here is what the manual (
http://www.postgresql.org/docs/9.0/static/ddl-system-columns.html)
says:
************************************************************************
ctid
The physical location of the row version within its table. Note that
although the ctid can be used to locate the row version very quickly, a
row's ctid will change if it is updated or moved by VACUUM FULL.
Therefore ctid is useless as a long-term row identifier. The OID, or even
better a user-defined serial number, should be used to identify logical
rows.
************************************************************************
That is the reason for nagging against vacuum and CTID.
>
> Oracle itself says in
> http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/
pseudocolumns008.htm#SQLRF00254
>
> You should not use ROWID as the primary key of a table. If you delete
> and reinsert a row with the Import and Export utilities, for example,
> then its rowid may change. If you delete a row, then Oracle may
> reassign its rowid to a new row inserted later.
Oracle, however, does have partitioning with global indexes, which is
what I was really after. Let me clarify: I do not advocate using ROWID as
the primary key, nor do I even allow that in the databases entrusted to
me.
>
> Other maintenance commands like ALTER TABLE ... SHRINK SPACE will also
> change ROWID.
Only if you allow it. There is a command "alter table <name> enable row
movement". If this command is not executed, rowid remains constant during
the lifetime of the row. That also makes index maintenance much more
expensive, which is why people tend to not do that lightly.
> I do not understand how lack of ROWID should affect global indexes on
> partitioned tables. Can you explain?
Index is a mathematical structure called B*Tree. It locates the key and
with the key, there is the information how to locate the row(s)
containing this key. If you want to have a global index, spanning several
tables, you must have a global identifier for rows, something like
file.block.offset, which is precisely what Oracle ROWID is. Oracle's
solution to partitioning was to disassociate segment from an object,
which used to be the same before Oracle8. Object in Oracle can have many
segments, which is a sort of dirty solution. I like the Postgres idea of
inheritance, but I really need global indexes. I don't think that having
them is possible without a permanent global row identifier.
>
> Fixed size of WAL segments:
> ===========================
>
> Maybe I am dense, but what is the problem there?
Sheer number of files is annoying. I think I can live with that, it's
just an aesthetic complaint. However, it is indicative of the fact that
the creators of Postgres have never been DBA's working on the real life
databases and do not think high of us, "the DBA types". Josh Berkus
expressed his disdain for Dan Tow and people like him. In return, I can
express my contempt for Josh and his ilk. Unfortunately for Josh, it is
people like me who decide which database will get to live in the
corporate server room. I can only advise an attitude re-adjustment. If
Postgres is meant for the corporate users, then it should provide us with
what we need, in order to be able to do our jobs. Only after that can we
work on the perfect optimizer, the final frontier and the mission to
boldly go where no RDBMS system has gone before.
> PG's ctid is very similar as Oracle's ROWID (
Unfortunately, it's only valid within a transaction. There is no
permanent rowid in Postgres. This wouldn't be a problem in itself because
one should never use it as a primary key. This only presents a problem
because the lack of the permanent rowid is what prevents Postgres from
allowing global indexes on the partitioned tables.
If you do "another level of redirection" as the old joke says, you have
to access block headers and possibly do another I/O to read the row
itself, which makes things slower. In a large partitioned table, with
hundreds of millions of records, every row counts and you need all the
speed you can get.
I created global indexes on a partitioned table using Sphinx. Sphinx
requires a bigint primary key, which was available here and the index
works well. However, I cannot enforce the primary key or have a globally
unique key without a ROWID. If there an implementation of global indexes
without the unique row identifier, I don't need it any more. What I am
talking about is a table like this:
news=# select count(*) from moreover_documents;
count
-----------
181491106
(1 row)
Table "moreover.moreover_documents"
Column | Type | Modifiers
----------------------+-----------------------------+-----------
document_id | bigint | not null
dre_reference | bigint | not null
headline | character varying(4000) |
author | character varying(200) |
url | character varying(1000) |
rank | bigint |
content | text |
stories_like_this | character varying(1000) |
internet_web_site_id | bigint | not null
harvest_time | timestamp without time zone |
valid_time | timestamp without time zone |
keyword | character varying(200) |
article_id | bigint | not null
media_type | character varying(20) |
source_type | character varying(20) |
created_at | timestamp without time zone |
autonomy_fed_at | timestamp without time zone |
language | character varying(150) |
Indexes:
"moreover_documents_pkey" PRIMARY KEY, btree (document_id)
Triggers:
insert_moreover_trigger BEFORE INSERT ON moreover_documents FOR EACH
ROW EXECUTE PROCEDURE moreover_insert_trgfn()
Number of child tables: 10 (Use \d+ to list them.)
I need, global index and global stats, as well as the optimizer capable
of dealing with that.
Aha, thanks for clarifying! I think I need to read up a bit on
PostgreSQL architecture. Generally both approaches (copy on write,
backup old version) work and PostgreSQL simply did choose another way to
do it than Oracle. I am not sure yet which one is better. When I think
about infamous ORA-01555 I would at least say that Oracle's approach has
caused its share of grief for DB users. :-) Which approach is best
certainly depends on usage patterns (how much concurrency, ratio of read
to write activity, inserts vs. deletes vs. updates etc.).
>>> The third is the fact that Postgres doesn't do multi-block reads.
>>
>> Isn't this alleviated by the OS doing readahead on many platforms?
>
> Weeellllll....it is. You are right, my objection is mostly of aesthetic
> nature. When I come to think of it, it does simplify the optimizer
> because there is only one type of I/O to think about.
:-)
>>> Fixed size archive logs (16M) would be the next issue.
>>
>> This can be changed during compilation though. But I agree, you'd
>> rather want this to be tuneable at runtime - at least without
>> recompiling the software.
>>
>>> The lack of shared pool makes it impossible to see the plan of the
>>> statement being executed. And there is more.
>>
>> That's unfortunate. At least you can see SQL currently executed by
>> querying pg_stat_activity.
>
> Oh yes, and there is also auto explain module which will dump the plan of
> any statement running longer than a predefined period of time (tunable).
> What I really miss is the ability to click on the session in PgAdmin3 and
> see the plan of a running statement. Also, sorting by the CPU time or the
> number of I/O requests issued would be nice.
So it's rather an issue of the administration UI and not of the database
itself. It wasn't that long ago that even Oracle did not have a
graphical UI for administration - IIRC it was Oracle 10 which packaged
Enterprise Manager with the DB by default the first time. At that time
I still considered Microsoft SQL Server's tools better.
Kind regards
robert
> IIRC it was Oracle 10 which packaged
> Enterprise Manager with the DB by default the first time.
Yes, but there were both commercial and freeware monitors like Karma,
Tora, Oracletool and others. There is nothing like that for Postgres. The
reason for that is that there cannot be because Postgres doesn't store
the execution plan in a shared location where it can be picked up and
displayed. There is no way of seeing the execution plan of a running SQL
statement.
> At that time I still considered Microsoft SQL Server's tools better.
I still do.
I think we need to separate the ROWID issue from having global indexes
for partitioned tables. I can think of ways to make a global index
work without the existence of ROWID. All it would need would be a PK
per partition and then the global index can use the partition key and
the local PK to find a row. Granted, that would add logical read
operations, but: when partitioning tables you typically do it to gain
efficiency during querying by partition pruning and to make management
of the data easier (e.g. dropping an obsolete partition). So a lookup
which does not use partition pruning should be avoided anyway (or:
done rarely) so the overall price paid might not be too high.
For completeness reasons, there's also HOT to deal more efficiently
with updates:
http://pgsql.tapoueh.org/site/html/misc/hot.html
Kind regards
robert
I apologize for the quip -- I tried to stay calm in the face of all your
insults against PostgreSQL and its developers, but I couldn't hold this
one back.
I try to sum up your arguments why hints are necessary:
- You missed them in a time constrained porting project where they might
have saved the day.
I think that this is a valid concern (remember, I'm not dead set against hints).
However, a porting project from a database system you know well to one
that you don't know well -- with project managers tapping fingers and
breathing down your neck -- is something that in my experience has a very
good chance of failing, no matter what.
There probably went some tuning effort (and some hinting) into getting the
original system to perform as it did, and that was done by somebody who
knows Oracle.
That's not your fault -- whoever planed that migration project should have
known better than to assume that you just have to rewrite a couple of triggers
and the system will perform well.
I trust that you made sure that it were really bad query plans that brought
the performance down, that you had current statistics and default_statistics_target
set high enough.
[...]
> Hints are a solution of the last resort.
[...]
I think we agree on that one. Good.
Another thing where I agree with you is that Oracle has more
options for the end user to analyze what a long running query is currently
doing and why. In PostgreSQL you'll typically have to analyze these things
later, and you'll need tools that are not part of core PostgreSQL.
> If they are not there, the database will not get used.
That's a bold statement. How do you explain that there are people who
successfully use PostgreSQL for big performance-critical applications?
Writing to the mailing list to solve a performance problem might work
better than you think. But you are not limited to that. You can either
hire people who can give you the answers, or you can build up skills to
answer the question yourself. Writing to the list is only the low-cost,
low-end option.
Free open source software works differently in many ways. Don't consider it
as something to save money primarily. If you want to do it right, you will
have to invest money, either into building skills or into hiring them.
You can read the code and modify the optimizer, you can even build a hint
system into it. That's what I call "providing the necessary tools for
the users to have chance to help themselves".
But by then you probably understand PostgreSQL well enough to get away
without hints :^)
Another misunderstanding on your side is to separate "the community" and
"the users". In free open source, the users are an integral part of the
community. I guess you sense a split here because you "got rejected" on
a mailing list.
I did not witness the exchange, but could it be that you entered there with
the deprecatory language and the absolute certainty that you are right
that you exhibited here? I normally get good reasons if my ideas get rejected.
Your hapless Oracle Analyst will have to remain polite even if you call
Oracle's optimizer a crock of shit, but there is no such constraint on
people who talk to you on PostgreSQL mailing lists.
Anyway, I digressed from a technical discussion. Sorry.
But before I return, let me complain about some unfounded accusations that you make:
"Nobody uses PgSQL for benchmarks (see www.tpc.org) because of the
performance problems."
Can you justify this statement?
"... the much better performing commercial databases like Oracle or DB2."
Can you justify that?
"Oracle still defeats Postgres in performance, and by a wide margin."
You use this to swipe any concern that Oracle's implementation may be
suboptimal in some respects. How can we make reasonable comparisons in the
face of arguments like this?
As you as an engineer must know, performance is not a magic number that
drops out of a benchmark program and has a linear order.
That's what management would like, and commercial databases go a long way
to "tune" for commercial benchmarks just for that reason.
I would never claim that PostgreSQL performs better than Oracle nor vice versa.
You need a lot of additional constraints to make a meaningful statement,
and that statement could not easily be used outside its context.
I think we have pretty much exhausted what we have to say concerning hints.
It seems that we don't disagree that much, except that I don't think that
PostgreSQL's lack of query hints renders it useless in real-life "enterprise"
scenarios.
About ROWID:
Upon further meditation, I think that Oracle did not originally intend
to provide ROWID as a goodie for the users to quickly locate rows.
The reason to keep it stable must be to make the implementation of indexes
easier. I don't think that "rows as moving targets" are the real problem,
that is easily solved with unique keys. The problem is that a changing
ROWID will force the index to change whenever the row is updated.
This is a price that PostgreSQL has to pay (extra work on update,
indexes must get vacuumed), but Oracle does not get this for free.
In order to keep ROWIDs stable, it must leave a trail of pointers from
the original location of a row to where the row is actually stored
("row chaining" is the technical term).
This will lead to extra disk accesses and a performance penalty in the
critical path of statement execution, and if things get too bad one might
have to reorganize the table (thereby changing the ROWIDs of the affected rows).
So I'd say that ROWID is at least a mixed blessing.
I still don't see how "rows as a moving target" should prevent global
indexes on inheritance hierarchies. Normal indexes have to deal with the
problem as well. What should make it harder for a global index?
In fact, there has been at least one attempt to implement such a solution:
http://archives.postgresql.org/pgsql-hackers/2009-03/msg01046.php
So I don't think that it is impossible by design.
Finally:
> Unfortunately for Josh, it is
> people like me who decide which database will get to live in the
> corporate server room.
That is interesting. You seem to operate in a different kind of business
environment than most of us.
Usually management (where purchase decisions of that magnitude are made)
will resort to glossy brochures and hired consultants rather than asking their
DBA. To some extent, that is because they know that the DBA will always advise
them to use the system he or she knows best. Right?
Yours,
Laurenz Albe
> I apologize for the quip -- I tried to stay calm in the face of all your
> insults against PostgreSQL and its developers, but I couldn't hold this
> one back.
Accepted. As for my insults, I will have to refer you back to this:
http://tinyurl.com/68gu822
That, in my humble opinion, is a work of an idiot, period. That isn't an
insult, that is just a statement of fact.
>
> I try to sum up your arguments why hints are necessary: - You missed
> them in a time constrained porting project where they might
> have saved the day.
They would have saved the day. I needed the way to make certain that the
existing indexes are used. In other words, I believe that I am smarter
than the optimizer and that I should be able to override its decisions,
at my own peril, of course.
>
> I think that this is a valid concern (remember, I'm not dead set against
> hints). However, a porting project from a database system you know well
> to one that you don't know well -- with project managers tapping fingers
> and breathing down your neck -- is something that in my experience has a
> very good chance of failing, no matter what.
Well, my knowledge of PostgreSQL was, in my opinion, quite satisfactory.
I did read the literature, I did read several articles, I did read Greg
Smith's tuning book and I did engage in the learning process for several
months. Given my experience with other databases (Oracle, MySQL), I'd say
that my knowledge was far above the knowledge of an average Postgres user.
> There probably went some tuning effort (and some hinting) into getting
> the original system to perform as it did, and that was done by somebody
> who knows Oracle.
True. The original system was tuned by the people who know oracle.
> That's not your fault -- whoever planed that migration project should
> have known better than to assume that you just have to rewrite a couple
> of triggers and the system will perform well.
Actually, that is precisely the punch line of the Postgres promotors.
EnterpriseDB even sells a product that is supposed to make it easier
still. They are saying that similarities are such that it is fairly easy
to do just that, change few things, recompile and, voila. I am not going
to name names, but I have attended several of the NY PUG meetings and
that's precisely the claim that has been made. I am glad that we agree on
that. You can trust me, however, that hints would have saved the day.
> I trust that you made sure that it were really bad query plans that
> brought the performance down, that you had current statistics and
> default_statistics_target set high enough.
I did. The size of the histograms is 1024. That ought to be enough.
>
> [...]
>> Hints are a solution of the last resort.
> [...]
>
> I think we agree on that one. Good.
But they have to be there, as a solution of the last resort. The main
problem with Postgres is that there is no tuning methodology, there is
not much one can do to help himself. Postgres doesn't allow its users to
control the query plan. It reflects a deep mistrust toward its own users.
>
> Another thing where I agree with you is that Oracle has more options for
> the end user to analyze what a long running query is currently doing and
> why. In PostgreSQL you'll typically have to analyze these things later,
> and you'll need tools that are not part of core PostgreSQL
It's not just tools, it's the architecture. Plans of the running
statement are not visible because they are not shared. You cannot see
them. There is no way to tell whether the executing process has just
executed "set enable_seqscan=off". You don't see the running plan and you
don't see the execution environment for that process. The only way you
could see that would be to have a shared memory area to read it from. In
Oracle terminology, that would mean a shared pool.
>
>> If they are not there, the database will not get used.
>
> That's a bold statement. How do you explain that there are people who
> successfully use PostgreSQL for big performance-critical applications?
I don't. I can only see that PostgreSQL is much less popular than MySQL
and I can see that it isn't getting better. I can see that the meeting in
NYC was a flop, despite people like me desperately seeking for an
alternative to Oracle. Sorry, PostgreSQL is not it. As I've stated above,
the main problem is a complete lack of tuning methodology. There is not
much user can do, short of posting the output of explain analyze,
preferably formatted using www.depesz.com. User or a DBA doesn't have a
methodology that could lead to a better performing plan. And yes, the
fact that Postgres designers do not trust their users enough to provide
them with means to override the optimizer decisions, should all else
fail, is also deeply disturbing. Of course, now I can pull my final
objection: PostgreSQL is the ONLY major database system which does so.
Even Firebird designers have accepted the inevitable and will provide
basic hints in the version 3.0, despite the similar objections as are the
one made by PostgreSQL developers.
>
> Writing to the mailing list to solve a performance problem might work
> better than you think.
It is not a tuning methodology. Sorry. This is simply unacceptable. I
will not do so, under any circumstances.
> But you are not limited to that. You can either
> hire people who can give you the answers, or you can build up skills to
> answer the question yourself. Writing to the list is only the low-cost,
> low-end option.
>
> Free open source software works differently in many ways. Don't consider
> it as something to save money primarily.
Unfortunately, if I have to spend money for the pilot project to succeed,
it will not be on PostgreSQL. If we are talking about spending money,
there are other databases which are commercial from scratch. That is
probably the reason for such a low popularity of PostgreSQL.
> If you want to do it right, you
> will have to invest money, either into building skills or into hiring
> them. You can read the code and modify the optimizer, you can even build
> a hint system into it. That's what I call "providing the necessary tools
> for the users to have chance to help themselves". But by then you
> probably understand PostgreSQL well enough to get away without hints :^)
The trouble is that I need to have hints as an insurance policy. Should
all else fail, I want to be able to force my decision upon the optimizer.
Humans are still much smarter than computer programs and know the data
better.
> Another misunderstanding on your side is to separate "the community" and
> "the users". In free open source, the users are an integral part of the
> community. I guess you sense a split here because you "got rejected" on
> a mailing list.
I didn't just get rejected, I got steamrolled. The argument that hints
are unnecessary is indefensible, unless you can prove that the optimizer
is smarter than humans. On the other hand, such a claim would be not only
preposterous, it would be downright rude.
I've seen many bad optimizer decisions by the Postgres optimizer which
I've influenced through the set of "set" commands. Unfortunately, when
application is written in Hibernate, it's easy to rewrite a SQL command
or two, but to enter countless "set" commands is daunting at best. Of
course, those "set" commands will stay in effect even after the
problematic command has been executed, so you'll have to do it all over
again, for the next command, plus keeping in mind the switches for the
previous executions.
> I did not witness the exchange, but could it be that you entered there
> with the deprecatory language and the absolute certainty that you are
> right that you exhibited here?
Because nobody has proven me wrong. The pearls of wisdom, like the URL I
quoted at the beginning of this post, just reinforce me in my argument.
In essence, the argument boils down to the following points:
- Hints are needed. I know that for a fact from my personal experience.
That is beyond debate. I do trust my own eyes.
- Humans are smarter than computer programs. To claim otherwise is rude,
to say the least.
- Refusal to provide hints reflects deep mistrust toward the users.
- PostgreSQL has no usable tuning methodology that user could follow to
speed up the SQL execution. None whatsoever.
> I normally get good reasons if my ideas
> get rejected. Your hapless Oracle Analyst will have to remain polite
> even if you call Oracle's optimizer a crock of shit, but there is no
> such constraint on people who talk to you on PostgreSQL mailing lists.
On the other hand, many people who have tried talking sense into
PostgreSQL people and have failed, might have stopped using PostgreSQL
altogether. Maybe that's why PGEast in NYC was such a flop? There is a
reason for so much larger degree of popularity of MySQL, even after its
purchase by Oracle. Has anyone ever asked what is PostgreSQL doing wrong?
>
> Anyway, I digressed from a technical discussion. Sorry.
> But before I return, let me complain about some unfounded accusations
> that you make:
>
> "Nobody uses PgSQL for benchmarks (see www.tpc.org) because of the
> performance problems."
>
> Can you justify this statement?
Yes, of course I can. I was unable to find even a single TPC benchmark on
the TPC site, done by using Postgres. TPC (stands for "Transaction
Processing Council") is the place where people place TPC benchmarks.
Databases used for benchmarks are the usual suspects: Oracle, MS SQL,
DB2, Sybase and some more exotic versions EXASQL. I didn't find any
Postgres benchmark there, feel free to correct me if I'm wrong.
> "... the much better performing commercial databases like Oracle or
> DB2."
>
> Can you justify that?
Yes, I can. The reasons why both Oracle and DB2 are, generally speaking,
much faster than Postgres are
- proper partitioning
- the ability to parallelize queries
Both of those abilities are missing from Postgres. In particular, Oracle
can do many things that Postgres cannot. Here I have in mind things like
the star schema, bitmap indexes are still somewhat painful topic, index
organized tables, index clusters, hash clusters, full index scan, fast
index scan or a skip scan. And I didn't even mention RAC, "AS OF" queries
or flashback.
> "Oracle still defeats Postgres in performance, and by a wide margin."
>
> You use this to swipe any concern that Oracle's implementation may be
> suboptimal in some respects. How can we make reasonable comparisons in
> the face of arguments like this?
The "reasonable comparison" must start with the fact that Oracle is the
2nd largest software company in the world and that its resources are many
times larger than the resources at the disposal to the Postgres
developers. If Postgres wants to compete with Oracle, it can do so only
based on price, definitely not on the features or performance. Even so,
if you want people to consider Postgres being alternative at all, which
wast majority of companies is not doing, you have to ask corporate users
what do they want. The topic of hints have popped up again and again. And
will not go away. As a matter of fact, I once stated that I will write a
hint system myself. The project that I am currently working on and is
consuming the most of my spare time will be done by August the 1st. I
will then renew my C programming, I used to be pretty good at that, and
write it, even if it takes me a full year to do so.
> As you as an engineer must know, performance is not a magic number that
> drops out of a benchmark program and has a linear order. That's what
> management would like, and commercial databases go a long way to "tune"
> for commercial benchmarks just for that reason. I would never claim that
> PostgreSQL performs better than Oracle nor vice versa. You need a lot of
> additional constraints to make a meaningful statement, and that
> statement could not easily be used outside its context.
What allows me to make such a broad comparison is the fact that PostgreSQL
has no tuning methodology. If, for instance, the two of us are given the
same SQL statement to make faster, you on Postgres and me on Oracle, I
will have a method to follow and will, most likely, end up with a
solution, much, much before you. If all else fails, I will be able to
just force the execution plan that I like. You will still be waiting for
an answer from the mailing list.
> I think we have pretty much exhausted what we have to say concerning
> hints. It seems that we don't disagree that much, except that I don't
> think that PostgreSQL's lack of query hints renders it useless in
> real-life "enterprise" scenarios.
Well, it's like cruising on a passenger ship which doesn't have life
boats. Most likely, you will not need them but the passengers on the
Titanic could have used few more of those.
> I still don't see how "rows as a moving target" should prevent global
> indexes on inheritance hierarchies. Normal indexes have to deal with the
> problem as well. What should make it harder for a global index? In fact,
> there has been at least one attempt to implement such a solution:
> http://archives.postgresql.org/pgsql-hackers/2009-03/msg01046.php So I
> don't think that it is impossible by design.
Well, I will refrain from judgment here, but I am still waiting for an
implementation. I've mentioned Sphinx, which requires globally unique
bigint as the key. Maybe, there is a way to do it without. Let's wait and
see.
>
> Finally:
>
>> Unfortunately for Josh, it is
>> people like me who decide which database will get to live in the
>> corporate server room.
>
> That is interesting. You seem to operate in a different kind of business
> environment than most of us.
> Usually management (where purchase decisions of that magnitude are made)
> will resort to glossy brochures and hired consultants rather than asking
> their DBA. To some extent, that is because they know that the DBA will
> always advise them to use the system he or she knows best. Right?
It seems that we do operate in different environments. In my world, DBA
acts as goalkeeper who evaluates and filters out acceptable solutions.
I've been working for several CIO's and have been advising on a vide
range of software, from monitoring software, backup software to
alternative DB software. I cannot recollect an instance of my advice not
being followed. The decision, of course, is made by the CIO, but the
evaluation is my responsibility. That is, after all, how I ended up with
Postgres. Now, I have a problem: CIO has made a decision, based on my
recommendation, and I have to tell him that I was wrong. I was able to
kill one porting project, but that doesn't mean that I am off the hook.
Finally, let me explain what am I doing here. What I've set out to do is
to warn anybody who wants to replace Oracle by Postgres that it will not
be as easy as it looks and that there are terminal deficiencies in the
mentality of PostgreSQL developer which will never allow it to gain much
popularity or become a real alternative to Oracle. Had I been properly
warned, I wouldn't have ended with such an egg on my face. I will only
react if someone mentions conversion from Oracle. People ought to be
warned.
I guess you feel insulted by his comments about "older DBAs" and their
habits. It is probably exaggerated, but I think he is refering to people
who say "let's use hints" before they even start to think properly.
>> That's not your fault -- whoever planed that migration project should
>> have known better than to assume that you just have to rewrite a couple
>> of triggers and the system will perform well.
>
> Actually, that is precisely the punch line of the Postgres promotors.
> EnterpriseDB even sells a product that is supposed to make it easier
> still. They are saying that similarities are such that it is fairly easy
> to do just that, change few things, recompile and, voila.
You have to be able to tell marketing talk from reality.
And for simple applications it might work well enough.
> I can only see that PostgreSQL is much less popular than MySQL
> and I can see that it isn't getting better.
First, be patient. And second, popularity is not always the measure
of quality (ever heard of Microsoft?).
>> Free open source software works differently in many ways. Don't consider
>> it as something to save money primarily.
>
> Unfortunately, if I have to spend money for the pilot project to succeed,
> it will not be on PostgreSQL. If we are talking about spending money,
> there are other databases which are commercial from scratch. That is
> probably the reason for such a low popularity of PostgreSQL.
Well, you said that you spent months learning PostgreSQL. That's an
investment that can be measured in money, right?
Commercial software has a different and enticing pricing model. You'll
get the thing for a moderate price at first (or you get a crippled free
version), then you realize that without (expensive) support you can't
even get patches for the showstopping bugs, and the next time you'll have
to negotiate license fees you'll have to pay more.
With open source software you may have a higher initial cost: you will
have to spend more money on integrating it into your environment, you
might have to invest more into education, you might have to develop
add-ons that you need and that are not supplied. But once this is done,
costs will drop rapidly, and in the long run you'll be cheaper.
Furthermore, you'll get added values with open source: the ability to
influence decisions (the only enhancement request I ever made for
Oracle years ago did not even receive an answer), the ability to help
yourself if you run into a bug or need a feature.
>> "Nobody uses PgSQL for benchmarks (see www.tpc.org) because of the
>> performance problems."
>>
>> Can you justify this statement?
>
> Yes, of course I can. I was unable to find even a single TPC benchmark on
> the TPC site, done by using Postgres. TPC (stands for "Transaction
> Processing Council") is the place where people place TPC benchmarks.
> Databases used for benchmarks are the usual suspects: Oracle, MS SQL,
> DB2, Sybase and some more exotic versions EXASQL. I didn't find any
> Postgres benchmark there, feel free to correct me if I'm wrong.
Maybe there are other reasons besides terrible performance?
Do you have to pay for TPC? Is it expensive?
>> "... the much better performing commercial databases like Oracle or
>> DB2."
>>
>> Can you justify that?
>
> Yes, I can. The reasons why both Oracle and DB2 are, generally speaking,
> much faster than Postgres are
> - proper partitioning
> - the ability to parallelize queries
> Both of those abilities are missing from Postgres. In particular, Oracle
> can do many things that Postgres cannot. Here I have in mind things like
> the star schema, bitmap indexes are still somewhat painful topic, index
> organized tables, index clusters, hash clusters, full index scan, fast
> index scan or a skip scan. And I didn't even mention RAC, "AS OF" queries
> or flashback.
PostgreSQL has some of these things.
Even if Oracle has some performance related features that PostgreSQL does not,
that does not prove that it will always perform better. I'm sure that Oracle
will sometimes be faster than PostgreSQL, but I don't accept a general statement
like that.
> The "reasonable comparison" must start with the fact that Oracle is the
> 2nd largest software company in the world and that its resources are many
> times larger than the resources at the disposal to the Postgres
> developers. If Postgres wants to compete with Oracle, it can do so only
> based on price, definitely not on the features or performance. Even so,
> if you want people to consider Postgres being alternative at all, which
> wast majority of companies is not doing, you have to ask corporate users
> what do they want.
There you go. PostgreSQL with its limited resources has to concentrate
on the core system. It is not out to provide everything that Oracle does,
but it is out to do the things it does better than everybody else.
I believe that PostgreSQL is quite interested in what corporate users want.
Think of in-core replication. But that does not imply that PostgreSQL can
and wants to provide everything that corporate users ask for.
> The topic of hints have popped up again and again. And
> will not go away. As a matter of fact, I once stated that I will write a
> hint system myself. The project that I am currently working on and is
> consuming the most of my spare time will be done by August the 1st. I
> will then renew my C programming, I used to be pretty good at that, and
> write it, even if it takes me a full year to do so.
I am sure that you'll find the community helpful if you have questions.
However, it will be pretty difficult to get that into core.
You'd have to be patient at least.
But as long as you can scratch your itch, that's a minor inconvenience.
> What allows me to make such a broad comparison is the fact that PostgreSQL
> has no tuning methodology. If, for instance, the two of us are given the
> same SQL statement to make faster, you on Postgres and me on Oracle, I
> will have a method to follow and will, most likely, end up with a
> solution, much, much before you. If all else fails, I will be able to
> just force the execution plan that I like. You will still be waiting for
> an answer from the mailing list.
As I mentioned, there *are* other things you can do.
- You can examine the output of EXPLAIN ANALYZE yourself and figure out
where and why things go wrong.
- You can influence statistics levels on a per-column basis.
- You can influence query planning with a number of parameters.
And they don't have to "stay around" after the query is done.
There is SET LOCAL which takes effect only in the current transaction.
- Often a simple change in an SQL statement will get the planner on the
right track.
> Finally, let me explain what am I doing here. What I've set out to do is
> to warn anybody who wants to replace Oracle by Postgres that it will not
> be as easy as it looks and that there are terminal deficiencies in the
> mentality of PostgreSQL developer which will never allow it to gain much
> popularity or become a real alternative to Oracle. Had I been properly
> warned, I wouldn't have ended with such an egg on my face. I will only
> react if someone mentions conversion from Oracle. People ought to be
> warned.
That's ok, share your experiences.
But if you want your opinion to be heard, refrain from insult and sweeping
generalizations with no more proof than your authority.
Yours,
Laurenz Albe
>> I can only see that PostgreSQL is much less popular than MySQL and I
>> can see that it isn't getting better.
>
> First, be patient. And second, popularity is not always the measure of
> quality (ever heard of Microsoft?).
Patient? Postgres came about in 1995. It's 2011 now. How much more
patience is needed? As for Microsoft, it is very good at what it does. MS-
Word is a class in itself, head and shoulders above the competition.
Microsoft still owns 95% or so of the desktop and office software market,
despite repeated assaults by both Linux and Apple and the former
opponents like OS/2, WordPerfect, Borland, Lotus and all other names that
are now but waning memories. Other than that, I have very good
experiences with Windows7. I wouldn't exactly take MS as an example for
the lack of quality. Look at it this way: users are voting with their
wallets. So far, MS is winning the contest by a wide margin, despite the
existence of the free alternatives. That says something, doesn't it?
> Well, you said that you spent months learning PostgreSQL. That's an
> investment that can be measured in money, right?
Yes, it is. In this case, it was wasted money. I've wasted a year of my
life.
>
> Commercial software has a different and enticing pricing model. You'll
> get the thing for a moderate price at first (or you get a crippled free
> version), then you realize that without (expensive) support you can't
> even get patches for the showstopping bugs, and the next time you'll
> have to negotiate license fees you'll have to pay more.
Depends on the vendor. There are many business models, ranging from Red
Hat which allows you to download for free and pay for the support if you
need it to Oracle.
> With open source software you may have a higher initial cost: you will
> have to spend more money on integrating it into your environment, you
> might have to invest more into education, you might have to develop
> add-ons that you need and that are not supplied. But once this is done,
> costs will drop rapidly, and in the long run you'll be cheaper.
Unless the software itself doesn't fulfill your needs. And that is what
I've been trying to convey to Postgres community all along: hints are not
optional. They are essential. That is why I have drawn the line in the
sand.
> Furthermore, you'll get added values with open source: the ability to
> influence decisions (the only enhancement request I ever made for Oracle
> years ago did not even receive an answer), the ability to help yourself
> if you run into a bug or need a feature.
You mean the ability to help yourself if you need something like hints?
What happens if the product is heavily influenced by bozos saying
something "we don't need to do that because we're not for profit"?
> Maybe there are other reasons besides terrible performance? Do you have
> to pay for TPC? Is it expensive?
TPC is mainly paid for by the equipment manufacturers. Having the meanest
database machine on the market allegedly sells boxes. Database vendors
are usually not too thrilled about benchmarks because of the fear that in
a few months the vendor will release a better benchmark by some other DB
and leave them with a huge marketing problem. If Postgres was so fast, I
assure you, someone would have selected it for a TPC benchmark. The fact
that it didn't happen puts things in perspective.
> PostgreSQL has some of these things.
> Even if Oracle has some performance related features that PostgreSQL
> does not, that does not prove that it will always perform better. I'm
> sure that Oracle will sometimes be faster than PostgreSQL, but I don't
> accept a general statement like that.
Look no further than the Postgresql performance mailing list. At least
once a month, there is a question why does Oracle do this or that faster
than Postgres. If you asked users with the experience in both worlds,
what do you think that the answer would be, which one is faster?
> There you go. PostgreSQL with its limited resources has to concentrate
> on the core system. It is not out to provide everything that Oracle
> does, but it is out to do the things it does better than everybody else.
> I believe that PostgreSQL is quite interested in what corporate users
> want. Think of in-core replication. But that does not imply that
> PostgreSQL can and wants to provide everything that corporate users ask
> for.
The case with hints shows a fatal flaw in the mentality of Postgres
community. Quite frankly, if Postgres didn't make it to the prime time in
16 years, I don't think it ever will. There are other options, from NoSQL
databases like MongoDB to Firebird. I am waiting for Firebird 3.0 to
assess how usable it will be. It's a major effort and from what I am
reading about it, it definitely looks interesting.
> I am sure that you'll find the community helpful if you have questions.
> However, it will be pretty difficult to get that into core. You'd have
> to be patient at least.
Initially, I've asked for hints to only be put on the TODO list. At that
time, that would have satisfied me. I was steamrolled. The "community"
doesn't even want to talk about hints. It's a carnival of vanities,
they've drawn the line in the sand and now there is no more discussion
about that. Fine with me.
> But as long as you can scratch your itch, that's a minor inconvenience.
I can, but not with Postgres. The absence of hints disqualifies Postgres
for any further projects. And I wished for Postgres to succeed. I was
hosting meetings and I was very enthusiastic about it. Until I touched
the topic of hints.
> As I mentioned, there *are* other things you can do. - You can examine
> the output of EXPLAIN ANALYZE yourself and figure out
> where and why things go wrong.
> - You can influence statistics levels on a per-column basis. - You can
> influence query planning with a number of parameters.
Laurenz, you skipped all my explanations that humans are smarter than
computer programs and that there must be the way for human to decide and
override the decisions made by the optimizer. For a database to be
viable, it must offer a tuning methodology, a finite set of steps that
user can make to improve performance. I find posting to the mailing an
unacceptable methodology. This was supposed to be a discussion on the
technical merits of hints, yet there is no answer to the main points I've
made. Let me repeat them once again:
- Humans are smarter than the optimizer and have better knowledge of
the data. Therefore, they should have the possibility to override the
decisions made by the optimizer.
- Hints are needed, as the last resort, but they are. When there is no
other way of getting the desired performance, hints must be available
as the method of the last resort.
- Database must offer a tuning methodology, things that user can do to
improve performance. You start with a SQL, you run explain analyze and
then there must be a series of steps to take. In PostgreSQL, there
isn't such a methodology.
- Such refusal to provide a method to override the optimizer decisions
reflects deep mistrust of the database developers toward their users.
The only reason for so stubbornly refusing hints is what I call the
carnival of vanities, the line drawn in the sand by a group of
developers, who have never been DBA's, and who will rather see their
product to sink into obscurity than to give in to reasonable demands. At
this point, I would like to hear one reason against hints. All I've heard
so far is to be patient and that it's "deep in core", without any
reasoning against having the option. For patience to make sense, someone
must acknowledge that hints are necessary and put them on the TODO list.
With pearls of wisdom like the one previously mentioned, patience makes
no sense. A change in attitude is needed and I find that unlikely to
happen.
>
> That's ok, share your experiences.
> But if you want your opinion to be heard, refrain from insult and
> sweeping generalizations with no more proof than your authority.
I don't think that I am getting my message across here. I don't care
about being heard by the Postgres community, that's way behind me. I've
said what I've had, those who wanted to listen did so. Now, my goal is to
try warning the Oracle users of the pitfalls and let them know that
Postgres is essentially inadequate for the task. In other words, my
message is not intended for the Postgres community at all, it is intended
for the Oracle users. I've abandoned my hopes for Postgres. With Josh
Berkus, Kevin Grittner,Joshua Drake, Tom Lane and Bruce Momjian, there
isn't much hope for the project. Sure, it will live on, maybe for even
another decade, but it will eventually sink into obscurity. Firebird
looks promising, NoSQL databases seem like a good solution for the DW
type use, and there are some other ideas, too.
> I think we need to separate the ROWID issue from having global indexes
> for partitioned tables. I can think of ways to make a global index work
> without the existence of ROWID.
Well, it would be presumptuous of me to claim otherwise. Having a global
ROWID does look like a simple solution to the problem, but there probably
are other solutions. Unfortunately, none of those exist in the world of
Postgres yet. Neither does this: http://tinyurl.com/38fgjwg
That's parallel query in MySQL, isn't it? This looks good. I think I
agree with you: for really huge datasets there should be some built in
parallelism in order to be able to truly use all IO bandwidth available
and not get limited by a single core.
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
> Look no further than the Postgresql performance mailing list. At least
> once a month, there is a question why does Oracle do this or that faster
> than Postgres. If you asked users with the experience in both worlds,
> what do you think that the answer would be, which one is faster?
That argument is flawed.
Only people who have a problem will write a question to the list.
> This was supposed to be a discussion on the
> technical merits of hints, yet there is no answer to the main points I've
> made. Let me repeat them once again:
> - Humans are smarter than the optimizer and have better knowledge of
> the data. Therefore, they should have the possibility to override the
> decisions made by the optimizer.
> - Hints are needed, as the last resort, but they are. When there is no
> other way of getting the desired performance, hints must be available
> as the method of the last resort.
> - Database must offer a tuning methodology, things that user can do to
> improve performance. You start with a SQL, you run explain analyze and
> then there must be a series of steps to take. In PostgreSQL, there
> isn't such a methodology.
> - Such refusal to provide a method to override the optimizer decisions
> reflects deep mistrust of the database developers toward their users.
All but the third of these points are not about technical merits.
#1 is philosophy. I'd agree that humans are smarter than machines in
principle, but I wouldn't enter a number crunching contest with one.
Nor do I personally want to enter a query optimizing contest with
PostgreSQL's optimizer.
#2 is an opinion. What should I answer?
#4 sounds like a conspiracy theory.
I agree with #3 as far that a database must offer tools for tuning.
I have named some that PostgreSQL has. Query tuning is always a bit of
an art, don't pretend that there's a step-by-step foolproof method in
Oracle. I may look like that to you because you know it well.
>> That's ok, share your experiences.
>> But if you want your opinion to be heard, refrain from insult and
>> sweeping generalizations with no more proof than your authority.
> I don't think that I am getting my message across here. I don't care
> about being heard by the Postgres community, that's way behind me.
No, I meant be heard by people who post here to get advice and opinions
on migrating from Oracle to PostgreSQL.
Yours,
Laurenz Albe
> All but the third of these points are not about technical merits.
>
> #1 is philosophy. I'd agree that humans are smarter than machines in
> principle, but I wouldn't enter a number crunching contest with one. Nor
> do I personally want to enter a query optimizing contest with
> PostgreSQL's optimizer.
Well, I have no compunction about doing the latter. However, no choice is
given. That's the problem.
>
> #2 is an opinion. What should I answer?
You should let me know whether you think that there are situations in
which hints are required and if not, why not? The designers of all other
major RDBMS systems seem to disagree here.
>
> #4 sounds like a conspiracy theory.
No, it doesn't. It's a rightful impression. Refusal to give users a
choice and requirement to post to the list is completely ludicrous. Now,
had I said that PostgreSQL is intentionally kept relatively useless, so
that business users have to either buy a commercial product or hire
consultants, that would be a conspiracy theory. However, I didn't say
that. All I said is that refusal to provide hints based on arguments that
the optimizer is smarter than the users is rude and reflects deep
mistrust toward the users of the software.
>
> I agree with #3 as far that a database must offer tools for tuning. I
> have named some that PostgreSQL has. Query tuning is always a bit of an
> art, don't pretend that there's a step-by-step foolproof method in
> Oracle. I may look like that to you because you know it well.
Actually, there is a foolproof method in Oracle. You see where the time
is spent and you address the problem. That is what the wait event
interface is all about. And it's by no means an art form.
> No, I meant be heard by people who post here to get advice and opinions
> on migrating from Oracle to PostgreSQL.
Oh, I'll take care of that myself. I'll make myself heard, don't worry.
I am not sure if there are situations where they are needed.
So far, I have not felt the need in PostgreSQL.
>> I agree with #3 as far that a database must offer tools for tuning. I
>> have named some that PostgreSQL has. Query tuning is always a bit of an
>> art, don't pretend that there's a step-by-step foolproof method in
>> Oracle. I may look like that to you because you know it well.
>
> Actually, there is a foolproof method in Oracle. You see where the time
> is spent and you address the problem. That is what the wait event
> interface is all about. And it's by no means an art form.
EXPLAIN ANALYZE will show me where the time is spent.
The "address the problem" is the nontrivial point here, in PostgreSQL
as well as in Oracle. PostgreSQL is lacking one problem addressing tool
that Oracle provides, but that dows not imply that there are none.
Yours,
Laurenz Albe
> am not sure if there are situations where they are needed.
> So far, I have not felt the need in PostgreSQL.
And that's essentially what the usual argument against providing hints
boils down to. Plus, of course, a fair dose of irritation over my being
stubborn and refusing to go away.
> EXPLAIN ANALYZE will show me where the time is spent.
Not in sufficient detail. There is nothing like oracle wait event
interface, but that was not the point here. We're not discussing a wait
event interface here.
> The "address the
> problem" is the nontrivial point here, in PostgreSQL as well as in
> Oracle. PostgreSQL is lacking one problem addressing tool that Oracle
> provides, but that dows not imply that there are none.
The problem is that Postgres doesn't allow me to override its optimizer
plan, even if I deem it necessary. In other words, Postgres tells me, as
some of its creators have written, that its optimizer is smarter than
humans. And here make a loop toward philosophy again. This is not a minor
detail, as I have explained before. Optimizer hints are a crucial problem.
And I will not go away and stop pointing at it.
[apologies for not having the technical depth to understand these
issues better -- I just use PostgreSQL as a "means to an end"...
I could use any similar "means" but this seems to work well enough
for me]
I write a lot of code (in C, etc.) and *deliberately* let the compiler
do as much work as possible, for me -- concentrating on picking good
*algorithms* instead of micromanaging individual lines of code. If
performance is a *real* issue (like processing video frames in
real-time) and I *can't* change the hardware (because this isn't
a handful of servers, but, rather, thousands of "retail devices"),
then I will drop into assembly language to code *very tiny* portions
of the algorithm where the gains are the most substantial. I.e.,
my philosophy is to let the machine do the thinking/work, wherever
possible.
Can someone please, *gently*, explain *quantitatively* the differences
that these issues make in performance?
(see below)
On 6/24/2011 3:29 AM, Laurenz Albe wrote:
> About hints:
> ============
>
> Let me say that I personally do not feel very strongly
> either way - I have used hints in Oracle and get by without them in PostgreSQL.
Are hints on a par with the 'register' keyword in C? I.e.,
something you *hope* gives the compiler an insight into your
algorithm -- but, that the compiler is free to *ignore*?
Are there consequences to using a "hint" (e.g., using 'register'
places constraints on what you can do)?
Is the "hint" concept essentially a workaround to compensate for
capabilities *missing* in the optimizer? I.e., as C compilers
get smarter, they deliberately *choose* to ignore the 'register'
keyword because they actually have *more* insight than the
programmer!
Are there ways (without using a "hint") to coerce the database
to perform in the way your hint would *hope* to have it behave?
(e.g., can you rewrite the SQL to cause the behavior you seek?)
[these all seem to be parallels to the register/compiler
example I posed]
What sort of *quantitative* difference can this make (example?)
that can't, otherwise, be attained?
> About vacuum:
> =============
>
> Of course vacuum is a pain, but every database system has to pay a similar
> price somewhere. Either you reduce concurrency (DB2), or (if you use some kind
> of MVCC) you can move dead rows somewhere else (Oracle) or leave them where
> they are and clean up later.
Is this just the difference between automatic garbage collection
and "manual" garbage collection (again, relying on parallels to
other "languages")?
Does the parallel maintain throughout operation or does it fall
apart at the extremes? E.g., if garbage collection must be
manually initiated (in a language), then failing to do so
can eventually cause you to run out of heap, etc. (which
causes the application to "fail"). Is the same true when
it comes to vacuuming -- or, does performance just suffer?
(i.e., will you ever "fill the disk" with dead rows if you
fail to initiate a VACUUM operation?)
> ROWID:
> ======
>
> You make the point that because of VACUUM there is nothing compared to
> Oracle's ROWID. But do you really need that? Shouldn't a row be referenced
> by its primary key?
The last question seems to be my understanding of how things should
work. (?) (actually, one of the harder ideas for me to get accustomed
to in this field :< )
How does a ROWID differ from using an OID? (which I presume is
not a good thing)
> Fixed size of WAL segments:
> ===========================
>
> Maybe I am dense, but what is the problem there?
<grin> I don't even know what they *are*! ;-)
> Are hints on a par with the 'register' keyword in C?
Yes.
> I.e., something
> you *hope* gives the compiler an insight into your algorithm -- but,
> that the compiler is free to *ignore*?
C compiler is not free to ignore "register" keyword and neither is the
database free to ignore hints. Compiler has to place your variable in a
register, if there is one available. Database also must obey hints, if
possible. Also, with C there is one more keyword, namely "asm", which may
be used to coerce the code to behave precisely as you want.
This analogy is not good for several reasons, the main one being the
difference between the directly executable vs. interpreted paradigms.
If C was the only compiler to disallow "hints", it wouldn't be so
popular. Now, if PostgreSQL was the only major database to disallow
hints...?
>> Are hints on a par with the 'register' keyword in C?
>
> Yes.
>
>> I.e., something
>> you *hope* gives the compiler an insight into your algorithm -- but,
>> that the compiler is free to *ignore*?
>
> C compiler is not free to ignore "register" keyword and neither is the
> database free to ignore hints. Compiler has to place your variable in a
> register, if there is one available.
The relevant text is:
A declaration of an identifier for an object with storage-class
specifier register suggests that access to the object be as fast as
possible. The extent to which such suggestions are effective is
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
implementation-defined.
^^^^^^^^^^^^^^^^^^^^^^
i.e. the compiler is free to ignore it.
> Database also must obey hints, if possible. Also, with C there is one
> more keyword, namely "asm", which may be used to coerce the code to
> behave precisely as you want. This analogy is not good for several
> reasons, the main one being the difference between the directly
> executable vs. interpreted paradigms. If C was the only compiler to
> disallow "hints", it wouldn't be so popular. Now, if PostgreSQL was
> the only major database to disallow hints...?
> The relevant text is:
>
> A declaration of an identifier for an object with storage-class
> specifier register suggests that access to the object be as fast as
> possible. The extent to which such suggestions are effective is
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> implementation-defined.
> ^^^^^^^^^^^^^^^^^^^^^^
This is what I found:
http://tigcc.ticalc.org/doc/keywords.html#register
**************************************************************************
register
Tells the compiler to store the variable being declared in a CPU register.
In standard C dialects, keyword auto uses the following syntax:
register data-definition;
The register type modifier tells the compiler to store the variable being
declared in a CPU register (if possible), to optimize access. For example,
register int i;
Note that TIGCC will automatically store often used variables in CPU
registers when the optimization is turned on, but the keyword register
will force storing in registers even if the optimization is turned off.
However, the request for storing data in registers may be denied, if the
compiler concludes that there is not enough free registers for use at
this place.
************************************************************************
While there are implementation differences, I would hardly say that the
compiler is "free to ignore the register keyword". It might end up
ignoring it, but it's definitely not the default behavior. However, this
group is about PostgreSQL and this thread was devoted to some particular
aspects missing in PostgreSQL, not to C language. Analogy with the C
language is actually a bad analogy to consider when discussing databases,
primarily because C language isn't a database. I don't think we should
waste any more time discussing C language.
On 6/30/2011 8:41 AM, Mladen Gogala wrote:
> On Thu, 30 Jun 2011 07:05:54 -0700, Don Y wrote:
>
>> Are hints on a par with the 'register' keyword in C?
>
> Yes.
>
>> I.e., something
>> you *hope* gives the compiler an insight into your algorithm -- but,
>> that the compiler is free to *ignore*?
>
> C compiler is not free to ignore "register" keyword and neither is the
> database free to ignore hints. Compiler has to place your variable in a
> register, if there is one available. Database also must obey hints, if
No, the compiler can (and usually *does*) ignore the 'register'
keyword. It (was originally) intended as a "hint" (in the generic
sense of the word) to the compiler for a possible optimization
"in the opinion of the programmer".
This is unlike, for example, the contract that "volatile" imposes
on the compiler (which can NOT be ignored).
[I obviously can't commeent on what the DBMS "must" do regarding
*its* "hints"]
I deliberately chose this example as it seems (to me) intuitively
related to what a hint *might* be in database-parlance.
> possible. Also, with C there is one more keyword, namely "asm", which may
> be used to coerce the code to behave precisely as you want.
Are you inferring that database hints impose the same sort of
constraints on the DBMS?
> This analogy is not good for several reasons, the main one being the
> difference between the directly executable vs. interpreted paradigms.
> If C was the only compiler to disallow "hints", it wouldn't be so
> popular. Now, if PostgreSQL was the only major database to disallow
> hints...?
This depends on the role "hints" (in their respective domains)
are expected to play. E.g., I can *rarely* give a hint to a decent
C compiler that is *better* than what the compiler can deduce on
its own. That's not a bad reflection on me but, rather, praise
for how advanced C optimizers have become.
You might want to scroll down a bit instead of just taking the
first hit google gives you -- which references *an* implementation
of *a* C compiler (and can't authoritatively speak for The Standard
nor any of the other "conforming compilers" out there)
> While there are implementation differences, I would hardly say that the
> compiler is "free to ignore the register keyword". It might end up
> ignoring it, but it's definitely not the default behavior. However, this
> group is about PostgreSQL and this thread was devoted to some particular
> aspects missing in PostgreSQL, not to C language. Analogy with the C
> language is actually a bad analogy to consider when discussing databases,
> primarily because C language isn't a database. I don't think we should
> waste any more time discussing C language.
Gee, someone expresses an interest in understanding why you have
such a hard-on regarding "hints" and you dismiss it as "irrelevant".
(sigh) Sure makes me want to try to understand your point of
view better -- NOT!
> Gee, someone expresses an interest in understanding why you have such a
> hard-on regarding "hints" and you dismiss it as "irrelevant".
What I consider irrelevant is the analogy with the C language.
Furthermore, this debate is, at least in my humble opinion, relevant to
DBA's, project managers and people who will write SQL against a Postgres
database. I haven't envisioned this debate as a talk show with the
viewers taking sides, Oprah already covers that market.
As far as me having hard-on for hints, I needed them in one of my
projects. Hints are a welcome option, offered by virtually all other RDBMS
systems, yet there seems to be a problem with the designers of Postgres.
I have yet to see a logical argument against providing that option and
analogies with the C programming language don't seem to provide it,
either. So, those analogies are irrelevant, at least in my point of view.
You don't seem to be a database professional either, I am quite confused
by the zeal with which you're trying to participate in this debate. Once
again, this debate is about a specific feature of Postgres, or the lack
of it.
FYI, hints are a method to force the optimizer to do what the programmer
wants, rather than what the RDBMS thinks is right. Here are two
implementations of query hints:
http://dev.mysql.com/doc/refman/5.5/en/index-hints.html
http://msdn.microsoft.com/en-us/library/ms181714.aspx
There are, of course, other implementations, in DB2 and Oracle. As a
matter of fact, the only other relatively well known database that didn't
have hints was Firebird. Fortunately, its designers have seen the light
and the hints are planned for the version 3.0:
http://www.slideshare.net/mindthebird/initial-review-of-firebird-3
The argument is whether the users should have power to override the
optimizer decisions. I haven't seen a single coherent argument why they
shouldn't have that option and I definitely did have a need for hints.
> That's not a bad reflection on me but, rather, praise for how advanced C
> optimizers have become.
Sadly, RDBMS optimizers haven't yet become so advanced.
No worries, but don't take it as rudeness if I give you links for
some of these question instead of answering everything in detail.
> Can someone please, *gently*, explain *quantitatively* the differences
> that these issues make in performance?
Quantifying performance is only possible in a narrowly defined setting
(in my opinion).
> Are hints on a par with the 'register' keyword in C? I.e.,
> something you *hope* gives the compiler an insight into your
> algorithm -- but, that the compiler is free to *ignore*?
> Are there consequences to using a "hint" (e.g., using 'register'
> places constraints on what you can do)?
Roughly yes. Except that (at least in Oracle) a hint will only
be ignored if it is impossible to use (you cannot force an index
that cannot be used for this query).
> Is the "hint" concept essentially a workaround to compensate for
> capabilities *missing* in the optimizer? I.e., as C compilers
> get smarter, they deliberately *choose* to ignore the 'register'
> keyword because they actually have *more* insight than the
> programmer!
Yes, it is a workaround for optimizer deficiencies.
The problem here is that optimizing SQL statements is quite difficult
because SQL is a declarative language, i.e. you tell the system what
you want to get and not how to get it. The optimizer will have to
figure out an execution path.
> Are there ways (without using a "hint") to coerce the database
> to perform in the way your hint would *hope* to have it behave?
> (e.g., can you rewrite the SQL to cause the behavior you seek?)
Sometimes, yes. Often a simplified statement will be easier
to optimize. If you know the optimizer well, you can also trick
it into not considering certain optimizations.
The optimizer uses "statistics" which are data about the number
and distribution of data in table columns. A lot of times the
optimizer will perform better if you collect more data.
You can also set database parameters that influence the optimizer.
> What sort of *quantitative* difference can this make (example?)
> that can't, otherwise, be attained?
The effect of choosing a better plan can improve things *a lot*.
If you fetch 10 million rows, only to discard them later because
you join with an empty result set, that's *much* worse than not
fetching anything at all.
>> About vacuum:
>> =============
>>
>> Of course vacuum is a pain, but every database system has to pay a similar
>> price somewhere. Either you reduce concurrency (DB2), or (if you use some kind
>> of MVCC) you can move dead rows somewhere else (Oracle) or leave them where
>> they are and clean up later.
>
> Is this just the difference between automatic garbage collection
> and "manual" garbage collection (again, relying on parallels to
> other "languages")?
No. DB2 does not create garbage, because there is only one version
of each row at any given time. To get that, a writer has to wait
until a reader is done (i.e., the reader will take a "read lock"
on the row he or she is reading).
Oracle and PostgreSQL both create garbage, because they both keep
old versions of rows around to avoid readers blocking writers.
Oracle collects the garbage automatically *while* it executes
SQL statements.
PostgreSQL collects garbage at some later time, either manually
by an administrator running the VACUUM command, or automatically
if the autovacuum daemon is turned on.
> Does the parallel maintain throughout operation or does it fall
> apart at the extremes? E.g., if garbage collection must be
> manually initiated (in a language), then failing to do so
> can eventually cause you to run out of heap, etc. (which
> causes the application to "fail"). Is the same true when
> it comes to vacuuming -- or, does performance just suffer?
> (i.e., will you ever "fill the disk" with dead rows if you
> fail to initiate a VACUUM operation?)
All of this is only relevant if you have autovacuum turned off
(autovacuum is the recommended way these days).
Without autovacuum and if VACUUM is never run, two things will
happen:
1) Your tables and indexes will grow on every insert and update,
and deletes and updates will create garbage that cannot be
reused. That can fill the disk and drastically increase the
duration of table and index scans.
2) You will at some point run out of transaction IDs, and the
system will shutdown to prevent data loss.
See
http://www.postgresql.org/docs/current/static/routine-vacuuming.html
for details.
>> ROWID:
>> ======
>>
>> You make the point that because of VACUUM there is nothing compared to
>> Oracle's ROWID. But do you really need that? Shouldn't a row be referenced
>> by its primary key?
>
> The last question seems to be my understanding of how things should
> work. (?) (actually, one of the harder ideas for me to get accustomed
> to in this field :< )
Right. It was a rethorical question.
To digress, I always find a library a good analogy for a database.
You have a lot of books there, each has a title and an author.
A primary key is a) an index (like an index in a library) and
b) a rule that says that there may only be one book for every
possible index entry.
Without a primary key, one author could have more than one book with
the same title, and you won't be able to tell them apart (even though
the books themselves may be different).
> How does a ROWID differ from using an OID? (which I presume is
> not a good thing)
The ROWID is the physical location of the row, basically the offset
in the file. You can sort of just "lseek(3)" there and get it. This
is the fastest way to access a row.
An OID is a number from a system counter that *can* also be used
as a primary key. To some extent, it's just a more old-fashioned
way of what can be done better with a sequence these days.
If used as a primary key, it is no faster than other primary keys.
See the final paragraphs of
http://www.postgresql.org/docs/current/static/ddl-system-columns.html
for details.
>> Fixed size of WAL segments:
>> ===========================
>>
>> Maybe I am dense, but what is the problem there?
>
> <grin> I don't even know what they *are*! ;-)
WAL stands for "write ahead log" and is a transaction log.
It is a way to provide durability for committed transactions in case
of a system failure.
See
http://www.postgresql.org/docs/9.0/static/continuous-archiving.html
for details.
Yours,
Laurenz Albe
>> The relevant text is:
>>
>> A declaration of an identifier for an object with storage-class
>> specifier register suggests that access to the object be as fast as
>> possible. The extent to which such suggestions are effective is
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> implementation-defined.
>> ^^^^^^^^^^^^^^^^^^^^^^
>
>
> This is what I found:
> http://tigcc.ticalc.org/doc/keywords.html#register
[...]
That's the manual for a particular compiler. The text I quoted is from
the language specification.
> That's the manual for a particular compiler. The text I quoted is from
> the language specification.
With all due respect, I don't think that the analogy with the C compiler
is particularly relevant for this discussion.
> No. DB2 does not create garbage, because there is only one version of
> each row at any given time. To get that, a writer has to wait until a
> reader is done (i.e., the reader will take a "read lock" on the row he
> or she is reading).
Actually, DB2 also has MVCC, since the version 9.7 which was released, if
I remember correctly, in 2009. One should set the isolation level to
"currently committed" and voila, there is MVCC.
Thanks for the correction. I have not worked with DB2 since 2001.
Yours,
Laurenz Albe
> Thanks for the correction. I have not worked with DB2 since 2001.
Unfortunately, the company that work for has turned its last AIX box off
in January :). I will not be able to work with it again, for a
foreseeable future.
On 7/1/2011 1:13 AM, Laurenz Albe wrote:
> Don Y wrote:
>> [apologies for not having the technical depth to understand these
>> issues better -- I just use PostgreSQL as a "means to an end"...
>> I could use any similar "means" but this seems to work well enough
>> for me]
>
> No worries, but don't take it as rudeness if I give you links for
> some of these question instead of answering everything in detail.
Understood.
>> Can someone please, *gently*, explain *quantitatively* the differences
>> that these issues make in performance?
>
> Quantifying performance is only possible in a narrowly defined setting
> (in my opinion).
Of course. I was trying to get an idea as to what was "at
stake".
E.g., a peephole optimizer (in a compiled language) makes very
small improvements -- changing, for example:
store foo
load foo
into the equivalent:
store foo
(since foo is already loaded, no need to REload it). The
savings are typically insignificant unless in a very *tight*
loop where the cost of that *single* saved instruction represents
a large portion of the cost of a single iteration.
OTOH, rewriting:
if ((count_all_people_in_the_world() && (today != weekday)) ...
as:
if ((today != weekday) && (count_all_people_in_the_world()) ...
provides *big* savings as count_all_people_in_the_world() isn't
done *most* days (because today is more often a weekday than a
NONweekday!).
>> Are hints on a par with the 'register' keyword in C? I.e.,
>> something you *hope* gives the compiler an insight into your
>> algorithm -- but, that the compiler is free to *ignore*?
>> Are there consequences to using a "hint" (e.g., using 'register'
>> places constraints on what you can do)?
>
> Roughly yes. Except that (at least in Oracle) a hint will only
> be ignored if it is impossible to use (you cannot force an index
> that cannot be used for this query).
So, if the hint is "a bad idea", the DBMS won't ignore it, even
if it "knows better"? I.e., the DBA can *worsen* performance
by hinting incorrectly?
>> Is the "hint" concept essentially a workaround to compensate for
>> capabilities *missing* in the optimizer? I.e., as C compilers
>> get smarter, they deliberately *choose* to ignore the 'register'
>> keyword because they actually have *more* insight than the
>> programmer!
>
> Yes, it is a workaround for optimizer deficiencies.
> The problem here is that optimizing SQL statements is quite difficult
> because SQL is a declarative language, i.e. you tell the system what
> you want to get and not how to get it. The optimizer will have to
> figure out an execution path.
Understood. The "hint" plays the same intended role as "register"
in C -- "I, the human, know more about this than you, the computer,
so take my advice and do *this*...". But, in C, the compiler
can effectively say, "Silly Rabbit, no you *don't*!"
>> Are there ways (without using a "hint") to coerce the database
>> to perform in the way your hint would *hope* to have it behave?
>> (e.g., can you rewrite the SQL to cause the behavior you seek?)
>
> Sometimes, yes. Often a simplified statement will be easier
> to optimize. If you know the optimizer well, you can also trick
> it into not considering certain optimizations.
> The optimizer uses "statistics" which are data about the number
> and distribution of data in table columns. A lot of times the
> optimizer will perform better if you collect more data.
> You can also set database parameters that influence the optimizer.
Are those parameters effectively "hints of a different color"?
I.e., hints regarding the makeup of the *data* themselves
instead of a particular *query*?
>> What sort of *quantitative* difference can this make (example?)
>> that can't, otherwise, be attained?
>
> The effect of choosing a better plan can improve things *a lot*.
Oh. Are these situations likely to creep in often and *not*
be obvious to the DBA ahead of time? I.e., why would he
know to "hint" at them yet the optimizer not be able to
foresee the same issue(s)?
> If you fetch 10 million rows, only to discard them later because
> you join with an empty result set, that's *much* worse than not
> fetching anything at all.
Again, why would the DBA know this is likely to be the case
but the optimizer (planner) *doesn't*?
>>> About vacuum:
>>> =============
>>>
>>> Of course vacuum is a pain, but every database system has to pay a similar
>>> price somewhere. Either you reduce concurrency (DB2), or (if you use some kind
>>> of MVCC) you can move dead rows somewhere else (Oracle) or leave them where
>>> they are and clean up later.
>>
>> Is this just the difference between automatic garbage collection
>> and "manual" garbage collection (again, relying on parallels to
>> other "languages")?
>
> No. DB2 does not create garbage, because there is only one version
> of each row at any given time. To get that, a writer has to wait
> until a reader is done (i.e., the reader will take a "read lock"
> on the row he or she is reading).
Understood. There is, therefore, no need to vacuum.
> Oracle and PostgreSQL both create garbage, because they both keep
> old versions of rows around to avoid readers blocking writers.
> Oracle collects the garbage automatically *while* it executes
> SQL statements.
>
> PostgreSQL collects garbage at some later time, either manually
> by an administrator running the VACUUM command, or automatically
> if the autovacuum daemon is turned on.
That was what I *thought* (intuitively) the role of VACUUM was.
So, *I*, knowing what I have done and am likely to do *next*,
can decide (if auto is off) whether it behooves me to vacuum
*now* or "wait until after this next -- which will be my
LAST -- query, at which time the DB will be idle"
>> Does the parallel maintain throughout operation or does it fall
>> apart at the extremes? E.g., if garbage collection must be
>> manually initiated (in a language), then failing to do so
>> can eventually cause you to run out of heap, etc. (which
>> causes the application to "fail"). Is the same true when
>> it comes to vacuuming -- or, does performance just suffer?
>> (i.e., will you ever "fill the disk" with dead rows if you
>> fail to initiate a VACUUM operation?)
>
> All of this is only relevant if you have autovacuum turned off
> (autovacuum is the recommended way these days).
Understood.
> Without autovacuum and if VACUUM is never run, two things will
> happen:
> 1) Your tables and indexes will grow on every insert and update,
> and deletes and updates will create garbage that cannot be
> reused. That can fill the disk and drastically increase the
> duration of table and index scans.
> 2) You will at some point run out of transaction IDs, and the
> system will shutdown to prevent data loss.
So, it's just like turning off the GC and waiting for the
free store to be exhausted. Sooner or later, there's no
more "memory" available and a memory allocation request
fails -- causing the application to fail at that point
(though the "recovery" routine could simply manually
initiate GC at that point and rerun the request).
> See
> http://www.postgresql.org/docs/current/static/routine-vacuuming.html
> for details.
>
>>> ROWID:
>>> ======
>>>
>>> You make the point that because of VACUUM there is nothing compared to
>>> Oracle's ROWID. But do you really need that? Shouldn't a row be referenced
>>> by its primary key?
>>
>> The last question seems to be my understanding of how things should
>> work. (?) (actually, one of the harder ideas for me to get accustomed
>> to in this field :< )
>
> Right. It was a rethorical question.
>
> To digress, I always find a library a good analogy for a database.
> You have a lot of books there, each has a title and an author.
>
> A primary key is a) an index (like an index in a library) and
> b) a rule that says that there may only be one book for every
> possible index entry.
>
> Without a primary key, one author could have more than one book with
> the same title, and you won't be able to tell them apart (even though
> the books themselves may be different).
Understood. I tend to create an "ID" field in each table to
act as a primary key (because there often is no other way
to differentiate entries/rows -- and, because this maps
nicely to how I think of "pointers" in my code :> )
>> How does a ROWID differ from using an OID? (which I presume is
>> not a good thing)
>
> The ROWID is the physical location of the row, basically the offset
> in the file. You can sort of just "lseek(3)" there and get it. This
> is the fastest way to access a row.
>
> An OID is a number from a system counter that *can* also be used
> as a primary key. To some extent, it's just a more old-fashioned
> way of what can be done better with a sequence these days.
> If used as a primary key, it is no faster than other primary keys.
Ah! Excellent! That makes the performance aspect very clear.
The OID can be changed, arbitrarily, while changing the ROWID
requires "physically" moving the record.
> See the final paragraphs of
> http://www.postgresql.org/docs/current/static/ddl-system-columns.html
> for details.
>
>>> Fixed size of WAL segments:
>>> ===========================
>>>
>>> Maybe I am dense, but what is the problem there?
>>
>> <grin> I don't even know what they *are*! ;-)
>
> WAL stands for "write ahead log" and is a transaction log.
> It is a way to provide durability for committed transactions in case
> of a system failure.
OK. So, it's just another "missing feature" -- like buying a car
without air conditioning (not necessary if you live in a cold
climate)
> See
> http://www.postgresql.org/docs/9.0/static/continuous-archiving.html
> for details.
Thanks for taking the time (and pointers) to shed some light
on this. Eventually, the WAL might be significant for me but
I doubt hints would ever be as, at a minimum, they would require
me to know a lot more about how the planner worked and how to
convince it to "do something else".
--don
On 6/30/2011 3:42 PM, Mladen Gogala wrote:
> On Thu, 30 Jun 2011 13:21:17 -0700, Don Y wrote:
>
>> Gee, someone expresses an interest in understanding why you have such a
>> hard-on regarding "hints" and you dismiss it as "irrelevant".
>
> What I consider irrelevant is the analogy with the C language.
> Furthermore, this debate is, at least in my humble opinion, relevant to
> DBA's, project managers and people who will write SQL against a Postgres
> database. I haven't envisioned this debate as a talk show with the
> viewers taking sides, Oprah already covers that market.
Again, you *presume* to understand my reason for participating.
I'm curious as to whether you are using a Ouija Board or a
Crystal Ball to get your information...?
Regardless, what makes "The Participants'" opinions more valuable
than the Viewers'? *Unlike* Oprah, the Viewers aren't prevented
from *being* Participants -- now, or at some future date!
> As far as me having hard-on for hints, I needed them in one of my
> projects. Hints are a welcome option, offered by virtually all other RDBMS
> systems, yet there seems to be a problem with the designers of Postgres.
> I have yet to see a logical argument against providing that option and
> analogies with the C programming language don't seem to provide it,
> either. So, those analogies are irrelevant, at least in my point of view.
> You don't seem to be a database professional either, I am quite confused
> by the zeal with which you're trying to participate in this debate. Once
> again, this debate is about a specific feature of Postgres, or the lack
> of it.
How do you define a "database professional"? Someone who is paid to
perform the duties of a DBA? Someone who went to school to be a
DBA? Someone who *calls* himself a DBA?
I am *acting* as a DBA on two projects currently. I suspect I
will represent a greater "user base" than most folks here. And,
probably, be responsible for more "managed data" than the typical
"viewer/participant".
I suspect I am using the DBMS in ways that most viewers/participants
here have never even considered ("Huh? You can't do *that*...").
Do I want to make a career out of this? Nope. There are many
more interesting application domains to explore... this is just a
tool to facilitate work in a few of them!
Will the code I write and the tables I build stand up to "expert
scrutiny"? Probably no more than *your* C code would stand up
to my scrutiny! Yet, it will run as intended within the constraints
of the projects.
More to the point, I suspect I will represent more of a (financial)
contribution to the PostgreSQL project than, at the very least, *you*.
So, it behooves me to understand what I might like to put on my wish
list to *underwrite* (as you appear to have deep pockets -- but short
arms!).
Show me the courtesy of letting me make my own informed decisions
as any other participant/viewer here -- instead of arbitrarily
and unilaterally deciding on the criteria for "admission to the
debate". I'm sorry if I have to bother you by *asking* questions
instead of just being "infused" with the right answer/opinion.
Unfortunately, I have neither a Ouija Board *nor* a Crystal Ball
to rely upon!
> How do you define a "database professional"? Someone who is paid to
> perform the duties of a DBA? Someone who went to school to be a DBA?
> Someone who *calls* himself a DBA?
>
> I am *acting* as a DBA on two projects currently. I suspect I will
> represent a greater "user base" than most folks here. And, probably, be
> responsible for more "managed data" than the typical
> "viewer/participant".
>
> I suspect I am using the DBMS in ways that most viewers/participants
> here have never even considered ("Huh? You can't do *that*..."). Do I
> want to make a career out of this? Nope. There are many more
> interesting application domains to explore... this is just a tool to
> facilitate work in a few of them
And good luck to you.
>> Roughly yes. Except that (at least in Oracle) a hint will only
>> be ignored if it is impossible to use (you cannot force an index
>> that cannot be used for this query).
> So, if the hint is "a bad idea", the DBMS won't ignore it, even
> if it "knows better"? I.e., the DBA can *worsen* performance
> by hinting incorrectly?
Yes.
>>> Are there ways (without using a "hint") to coerce the database
>>> to perform in the way your hint would *hope* to have it behave?
>>> (e.g., can you rewrite the SQL to cause the behavior you seek?)
>> Sometimes, yes. Often a simplified statement will be easier
>> to optimize. If you know the optimizer well, you can also trick
>> it into not considering certain optimizations.
>> The optimizer uses "statistics" which are data about the number
>> and distribution of data in table columns. A lot of times the
>> optimizer will perform better if you collect more data.
>> You can also set database parameters that influence the optimizer.
> Are those parameters effectively "hints of a different color"?
> I.e., hints regarding the makeup of the *data* themselves
> instead of a particular *query*?
No, they are a coarser version of hints.
They tell the database things like "do not use a nested loop join",
while a hint would say "use a merge join for this specific join
operation".
Or "it's really a good idea to use an index even if you think it
is not optimal" compared to "use this index".
>> The effect of choosing a better plan can improve things *a lot*.
> Oh. Are these situations likely to creep in often and *not*
> be obvious to the DBA ahead of time? I.e., why would he
> know to "hint" at them yet the optimizer not be able to
> foresee the same issue(s)?
These situations creep up and are not obvious.
Whether they creep up often or not depends on the quality of the
optimizer and the complexity of your queries and data.
Usually hints and other tuning facilities are used after a problem
has been discovered.
They are hard to discover in advance, because the problems usually
only manifest with realistict amounts of data.
>> If you fetch 10 million rows, only to discard them later because
>> you join with an empty result set, that's *much* worse than not
>> fetching anything at all.
> Again, why would the DBA know this is likely to be the case
> but the optimizer (planner) *doesn't*?
First, the planner is not perfect.
Moreover, it is not always possible to know these things in advance.
In the described simple case, the planner might go wrong because
he does not know that the second result set will be empty before he
actually tries it.
The DBA (or more likely the programmer), who has knowledge about
the data that is not evident from the database layout (like "no male
can ever have an abortion") can sometimes know more than the planner.
> That was what I *thought* (intuitively) the role of VACUUM was.
> So, *I*, knowing what I have done and am likely to do *next*,
> can decide (if auto is off) whether it behooves me to vacuum
> *now* or "wait until after this next -- which will be my
> LAST -- query, at which time the DB will be idle"
Not quite, because routine vacuuming is not done by the end user
who executes statements, but by an administrator. The end user will
not have the necessary permissions to vacuum all tables in the database.
With non-automatic vacuum, the DBA can for example schedule a daily
VACUUM of all tables at 11 p.m. because he or she knows that nothing
much is going on at that time.
>> WAL stands for "write ahead log" and is a transaction log.
>> It is a way to provide durability for committed transactions in case
>> of a system failure.
> OK. So, it's just another "missing feature" -- like buying a car
> without air conditioning (not necessary if you live in a cold
> climate)
[...]
> Thanks for taking the time (and pointers) to shed some light
> on this. Eventually, the WAL might be significant for me but
> I doubt hints would ever be as, at a minimum, they would require
> me to know a lot more about how the planner worked and how to
> convince it to "do something else".
WAL is unimportant for you only then if data loss is not a
problem for you, and yesterday's backup is good enough.
Yours,
Laurenz Albe
[snips throughout]
On 7/4/2011 7:39 AM, Laurenz Albe wrote:
>>> Roughly yes. Except that (at least in Oracle) a hint will only
>>> be ignored if it is impossible to use (you cannot force an index
>>> that cannot be used for this query).
>
>> So, if the hint is "a bad idea", the DBMS won't ignore it, even
>> if it "knows better"? I.e., the DBA can *worsen* performance
>> by hinting incorrectly?
>
> Yes.
Ah! *But*, it can only "worsen (or improve) performance" -- not
affect the "correctness" of the result?
>>> The effect of choosing a better plan can improve things *a lot*.
>
>> Oh. Are these situations likely to creep in often and *not*
>> be obvious to the DBA ahead of time? I.e., why would he
>> know to "hint" at them yet the optimizer not be able to
>> foresee the same issue(s)?
>
> These situations creep up and are not obvious.
> Whether they creep up often or not depends on the quality of the
> optimizer and the complexity of your queries and data.
So, you stumble over them when a particular query (etc.)
unexpectedly proves to be more expensive than you had
"expected"?
> Usually hints and other tuning facilities are used after a problem
> has been discovered.
>
> They are hard to discover in advance, because the problems usually
> only manifest with realistict amounts of data.
And, presumably, hints only benefit *specific* dataset-query
combinations. I.e., change the query and the hint's value may
change?
>>> If you fetch 10 million rows, only to discard them later because
>>> you join with an empty result set, that's *much* worse than not
>>> fetching anything at all.
>
>> Again, why would the DBA know this is likely to be the case
>> but the optimizer (planner) *doesn't*?
>
> First, the planner is not perfect.
Understood. Neither is the programmer/DBA! :>
> Moreover, it is not always possible to know these things in advance.
> In the described simple case, the planner might go wrong because
> he does not know that the second result set will be empty before he
> actually tries it.
> The DBA (or more likely the programmer), who has knowledge about
> the data that is not evident from the database layout (like "no male
> can ever have an abortion") can sometimes know more than the planner.
OK. That would be very application-specific (and dataset-specific)
knowledge, though. (?)
>> That was what I *thought* (intuitively) the role of VACUUM was.
>> So, *I*, knowing what I have done and am likely to do *next*,
>> can decide (if auto is off) whether it behooves me to vacuum
>> *now* or "wait until after this next -- which will be my
>> LAST -- query, at which time the DB will be idle"
>
> Not quite, because routine vacuuming is not done by the end user
> who executes statements, but by an administrator. The end user will
> not have the necessary permissions to vacuum all tables in the database.
Yes, understood. I'm thinking about *my* applications where there
is an agent interposed between the end user and the DBMS acting
as if the administrator. So, it can run some commands on behalf of
the user and *then* decide to VACUUM (having inherent knowledge
of the consequences of those commands).
> With non-automatic vacuum, the DBA can for example schedule a daily
> VACUUM of all tables at 11 p.m. because he or she knows that nothing
> much is going on at that time.
That assumes you have sufficient free store to let a day's worth
of "garbage" accumulate.
>> Thanks for taking the time (and pointers) to shed some light
>> on this. Eventually, the WAL might be significant for me but
>> I doubt hints would ever be as, at a minimum, they would require
>> me to know a lot more about how the planner worked and how to
>> convince it to "do something else".
>
> WAL is unimportant for you only then if data loss is not a
> problem for you, and yesterday's backup is good enough.
In my case, the opposite is true -- the application needs to
avoid "stumbling", if at all possible. Going back to
yesterday's dataset would be very painful.
Thanks for the insights!
--don
As Laurenz has explained, they're not.
But as I understand it, the PostgreSQL developers who are opposed to
adding 'traditional' hints think that the kind of hint you're
describing here is worth considering.
-M-
Because that information could be exploited by "smarter"
incarnations of the planner?
I.e., their philosophy being to let the planner do the thinking
and give it whatever information might be *helpful* to facilitate
its coming to a "better plan" and then just invest (development)
effort in *making* that "smarter planner"?
Presumably, the planner wouldn't be *obligated* to do as those
"non-hint hints" would otherwise suggest?
--don
> Ah! *But*, it can only "worsen (or improve) performance" -- not
> affect the "correctness" of the result?
Correct.
[about bad execution plans]
> So, you stumble over them when a particular query (etc.)
> unexpectedly proves to be more expensive than you had
> "expected"?
Usually, that's how it is. Unless you Do It Right and test with
a realistic set of data beforehand.
> And, presumably, hints only benefit *specific* dataset-query
> combinations. I.e., change the query and the hint's value may
> change?
Yes. A query hint is part of the query and refers to certain
parts of it.
>> The DBA (or more likely the programmer), who has knowledge about
>> the data that is not evident from the database layout (like "no male
>> can ever have an abortion") can sometimes know more than the planner.
> OK. That would be very application-specific (and dataset-specific)
> knowledge, though. (?)
Right. That's what I am talking about.
>> Not quite, because routine vacuuming is not done by the end user
>> who executes statements, but by an administrator. The end user will
>> not have the necessary permissions to vacuum all tables in the database.
> Yes, understood. I'm thinking about *my* applications where there
> is an agent interposed between the end user and the DBMS acting
> as if the administrator. So, it can run some commands on behalf of
> the user and *then* decide to VACUUM (having inherent knowledge
> of the consequences of those commands).
That sounds like a really bad idea.
If the database user that performs SQL on behalf of the end user
has superuser privileges, that constitutes an unnecessary risk.
Security holes or bugs in your software can cause much more damage
that way.
>> With non-automatic vacuum, the DBA can for example schedule a daily
>> VACUUM of all tables at 11 p.m. because he or she knows that nothing
>> much is going on at that time.
> That assumes you have sufficient free store to let a day's worth
> of "garbage" accumulate.
Of course.
>> WAL is unimportant for you only then if data loss is not a
>> problem for you, and yesterday's backup is good enough.
>
> In my case, the opposite is true -- the application needs to
> avoid "stumbling", if at all possible. Going back to
> yesterday's dataset would be very painful.
Then you need to learn about WAL and point-in-time recovery.
Yours,
Laurenz Albe
On 7/4/2011 11:58 PM, Laurenz Albe wrote:
>> And, presumably, hints only benefit *specific* dataset-query
>> combinations. I.e., change the query and the hint's value may
>> change?
>
> Yes. A query hint is part of the query and refers to certain
> parts of it.
So, unless you have accurate, prior knowledge of the dataset's
actual complexion, you stand a chance of shooting yourself in the
foot *if* you hint (based on assumptions you make that may not
turn out to be true)
>>> Not quite, because routine vacuuming is not done by the end user
>>> who executes statements, but by an administrator. The end user will
>>> not have the necessary permissions to vacuum all tables in the database.
>
>> Yes, understood. I'm thinking about *my* applications where there
>> is an agent interposed between the end user and the DBMS acting
>> as if the administrator. So, it can run some commands on behalf of
>> the user and *then* decide to VACUUM (having inherent knowledge
>> of the consequences of those commands).
>
> That sounds like a really bad idea.
>
> If the database user that performs SQL on behalf of the end user
> has superuser privileges, that constitutes an unnecessary risk.
> Security holes or bugs in your software can cause much more damage
> that way.
No choice. There is no "organic" DBA involved. So, even if I
let autovacuum handle that aspect, there will always be other
aspects that have to be "coded" (and "Just Work")
>>> With non-automatic vacuum, the DBA can for example schedule a daily
>>> VACUUM of all tables at 11 p.m. because he or she knows that nothing
>>> much is going on at that time.
>
>> That assumes you have sufficient free store to let a day's worth
>> of "garbage" accumulate.
>
> Of course.
OK. I can't guarantee that so it is something I will have to
be aware of.
>>> WAL is unimportant for you only then if data loss is not a
>>> problem for you, and yesterday's backup is good enough.
>>
>> In my case, the opposite is true -- the application needs to
>> avoid "stumbling", if at all possible. Going back to
>> yesterday's dataset would be very painful.
>
> Then you need to learn about WAL and point-in-time recovery.
<grin> I'll add it to the List of Stuff I Must Learn.
Thanks!
--don
>> That sounds like a really bad idea.
>>
>> If the database user that performs SQL on behalf of the end user
>> has superuser privileges, that constitutes an unnecessary risk.
>> Security holes or bugs in your software can cause much more damage
>> that way.
> No choice. There is no "organic" DBA involved. So, even if I
> let autovacuum handle that aspect, there will always be other
> aspects that have to be "coded" (and "Just Work")
You can automatize administrative tasks, that's a good thing,
but that does not necessitate that the end user gets served by a
superuser account. Those things should be done separately.
You won't get away without some sort of human DBA.
You must at least ascertain that backups complete successfully
and space doesn't run out. And you need somebody who is able to
restore a backup. So at least when things go wrong, you need a DBA.
Yours,
Laurenz Albe
The intend is to deploy in *devices* -- not "servers". For a
(crude) parallel, imagine the "contacts", etc. in your cell phone
were being managed by PostgreSQL. I.e., a set of relations
defining people, their phone numbers, log of recent calls, etc.
[recall, I am using a cell phone solely as an analogy as the
usage patterns are similar -- largely "fixed"]
The sorts of relations are all predefined. A (cell phone) "user"
could opt to add a name to his "contact list". Or, delete one.
Or "edit" an existing contact. Or, purge the log of the least
recent calls. etc.
He might even want to add "another" contact list (named "business
contacts"), etc.
But, he's not creating any arbitrary relations. And, the queries
he'll run are predefined -- with "fill in the blank" parameters, etc.
I.e., if PostgreSQL can't keep the relations intact, then it has
fundamental bugs (in which case, why would *anyone* be using it?).
If PostgreSQL isn't "ready for prime time", in that regard, I
have taken pains to define all my relations in such a way that I
could fall back to cruder tools (e.g., dbm-style where I do most
of the "work") and, even, a sparse VM implementation (linking
relations with actual pointers).
I have a trial, "server based" application running "unattended"
that has performed well so far (despite abuses to the server
itself). And, one of the two projects in the works currently
takes that a step further (more traffic, users, etc.). So,
I've got "an out" for the other project if I discover PG can't
"keep a clean house" without someone babysitting it... :<
<shrug> We'll see. AFAIK, it hasn't been tried on this large
a scale previously (?) As I said, the things I'm expecting from
the DBMS are probably different than what most users/DBA's would
expect.
--don
This is pretty vague, but it sounds like nothing you need a
superuser account for.
Keep administrative stuff (VACUUM, backups) separate from the rest.
> But, he's not creating any arbitrary relations. And, the queries
> he'll run are predefined -- with "fill in the blank" parameters, etc.
Yup, that's exactly where the user can use SQL injection to break
into your database.
> I.e., if PostgreSQL can't keep the relations intact, then it has
> fundamental bugs (in which case, why would *anyone* be using it?).
You misunderstood me. PostgreSQL will keep its stuff consistent.
If somebody breaks into your database with a superuser account,
he or she can very consistently read and change everything in the
database, access the file system, and theoretically do anything with
your machine that the OS user has permissions to do.
> <shrug> We'll see. AFAIK, it hasn't been tried on this large
> a scale previously (?) As I said, the things I'm expecting from
> the DBMS are probably different than what most users/DBA's would
> expect.
I don't want to play Kassandra here, but most of the people who
want to use a software for something else than the intended use
become quite unhappy in the end.
Yours,
Laurenz Albe
On 7/7/2011 12:05 AM, Laurenz Albe wrote:
> Don Y wrote:
>> The sorts of relations are all predefined. A (cell phone) "user"
>> could opt to add a name to his "contact list". Or, delete one.
>> Or "edit" an existing contact. Or, purge the log of the least
>> recent calls. etc.
>>
>> He might even want to add "another" contact list (named "business
>> contacts"), etc.
>
> This is pretty vague, but it sounds like nothing you need a
> superuser account for.
I'm not claiming to need a "privileged" account. Rather, there is
an agent (a piece of software) that sits between the "user" and
the DBMS that interacts with the DB on behalf of the user and
reports back to the user.
So, it can run a query -- and then manually vacuum, etc.
Resources are scarce/fixed so you can't just arbitrarily do
something and let the pieces fall where they may. Instead,
you have to interact with the DB in ways that are more
"economical" (in space/time/power/etc.).
> Keep administrative stuff (VACUUM, backups) separate from the rest.
>
>> But, he's not creating any arbitrary relations. And, the queries
>> he'll run are predefined -- with "fill in the blank" parameters, etc.
>
> Yup, that's exactly where the user can use SQL injection to break
> into your database.
That's why there's an agent in the middle to "sanitize" all
interactions with the DBMS. The interface to the DBMS is
never "raw".
>> I.e., if PostgreSQL can't keep the relations intact, then it has
>> fundamental bugs (in which case, why would *anyone* be using it?).
>
> You misunderstood me. PostgreSQL will keep its stuff consistent.
Then there won't be any problems. :>
> If somebody breaks into your database with a superuser account,
> he or she can very consistently read and change everything in the
> database, access the file system, and theoretically do anything with
> your machine that the OS user has permissions to do.
Understood. But, the same vulnerabilities exist in any PostgreSQL
instance. Less so, here, because the device isn't sitting on a
network, "exposed".
>> <shrug> We'll see. AFAIK, it hasn't been tried on this large
>> a scale previously (?) As I said, the things I'm expecting from
>> the DBMS are probably different than what most users/DBA's would
>> expect.
>
> I don't want to play Kassandra here, but most of the people who
> want to use a software for something else than the intended use
> become quite unhappy in the end.
I don't see that what we are doing is in any way different from
what a PostgreSQL user could expect from a DB. The differences
are that our "users" are applications -- not "organic beings".
And, that there is typically only a single "organic" user
associated with each PG instance.