[GENERAL] CLUSTER command

Jean-Luc Lachance

unread,

Dec 12, 2002, 4:34:42 PM12/12/02

to

Hi all,

I just read about the cluster command and was a little (very)
disapointed.
Clustered tables do not remain clustered after inserts.
Clustered tables are usefull when the table is very large and there are
few different keys.

Because the table file is already extended (2G limit) using different
files extension (.N)
how complicated (modifying the code) would it be to have the table files
split according to the cluster key?

This would:

Greatly improve performance when the cluster key in included in search
criteria.
Allow for a much larger table before a file has to be split (.N).
Simplify the management of symblinks (that's something else we need to
look at).
The index file for that field would no longer be required.

Of course, there should be only one cluster key per table.
The length the "key" should be short and the number of unique key should
be low as well.

SO... ?

JLL

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majo...@postgresql.org

Stephan Szabo

unread,

Dec 12, 2002, 5:15:03 PM12/12/02

to

On Thu, 12 Dec 2002, Jean-Luc Lachance wrote:

> Hi all,
>
> I just read about the cluster command and was a little (very)
> disapointed.
> Clustered tables do not remain clustered after inserts.
> Clustered tables are usefull when the table is very large and there are
> few different keys.
>
>
> Because the table file is already extended (2G limit) using different
> files extension (.N)
> how complicated (modifying the code) would it be to have the table files
> split according to the cluster key?

I'd vote against changing the existing CLUSTER since the existing CLUSTER
while not great does handle many different key values fairly well as well
and this solution wouldn't. Many different key values are still
useful to cluster if you're doing searches over ranges since it lowers the
number of heap file reads necessary. If done this should probably be
separate from the existing cluster or at least both versions should be
possible.

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majo...@postgresql.org)

Jean-Luc Lachance

unread,

Dec 12, 2002, 5:24:43 PM12/12/02

to

The current cluster command is equivalant to:

create b as select * from a order by i;

So you would not be loosing anything.

Stephan Szabo wrote:
>
> On Thu, 12 Dec 2002, Jean-Luc Lachance wrote:
>
> > Hi all,
> >
> > I just read about the cluster command and was a little (very)
> > disapointed.
> > Clustered tables do not remain clustered after inserts.
> > Clustered tables are usefull when the table is very large and there are
> > few different keys.
> >
> >
> > Because the table file is already extended (2G limit) using different
> > files extension (.N)
> > how complicated (modifying the code) would it be to have the table files
> > split according to the cluster key?
>
> I'd vote against changing the existing CLUSTER since the existing CLUSTER
> while not great does handle many different key values fairly well as well
> and this solution wouldn't. Many different key values are still
> useful to cluster if you're doing searches over ranges since it lowers the
> number of heap file reads necessary. If done this should probably be
> separate from the existing cluster or at least both versions should be
> possible.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

Stephan Szabo

unread,

Dec 12, 2002, 5:30:33 PM12/12/02

to

On Thu, 12 Dec 2002, Jean-Luc Lachance wrote:

> The current cluster command is equivalant to:
>
> create b as select * from a order by i;
>
> So you would not be loosing anything.

Except for the fact that the CLUSTER is intended (although
I don't know if it does yet) to retain things like constraints
and other indexes which the above doesn't.

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Lincoln Yeoh

unread,

Dec 12, 2002, 5:36:23 PM12/12/02

to

Splitting table files by indexed value may not help if the operating system
doesn't manage to keep the tables unfragmented on disk. I suppose the O/S
should know how to do it though.

Cheerio,
Link.

At 04:31 PM 12/12/02 -0500, Jean-Luc Lachance wrote:

>Hi all,
>
>I just read about the cluster command and was a little (very)
>disapointed.
>Clustered tables do not remain clustered after inserts.
>Clustered tables are usefull when the table is very large and there are
>few different keys.
>
>
>Because the table file is already extended (2G limit) using different
>files extension (.N)
>how complicated (modifying the code) would it be to have the table files
>split according to the cluster key?

---------------------------(end of broadcast)---------------------------

Jean-Luc Lachance

unread,

Dec 12, 2002, 5:42:33 PM12/12/02

to

OK fine,

Let's create a new command:

PARTITION <table> ON <attribute>

I did not want to start a fight. You can keep the CLUSTER command as it
is.

I still think clustering/partitioning would be a great idea.
This is what I want to talk about. Look at the original post for the
reasons.

JLL

johnnnnnn wrote:

>
> On Thu, Dec 12, 2002 at 02:03:56PM -0800, Stephan Szabo wrote:
> > I'd vote against changing the existing CLUSTER since the existing
> > CLUSTER while not great does handle many different key values fairly
> > well as well and this solution wouldn't.
>

> I would agree. What's being proposed sounds much more like table
> partitioning than clustering.
>
> That's not to say that the existing CLUSTER couldn't be improved, at
> the very least to the point where it allows inserts to respect the
> clustered structure. That's a post for another thread, though.
>
> -johnnnnnnnnnnn
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majo...@postgresql.org so that your
> message can get through to the mailing list cleanly

Stephan Szabo

unread,

Dec 12, 2002, 7:07:44 PM12/12/02

to

On Thu, 12 Dec 2002, johnnnnnn wrote:

> On Thu, Dec 12, 2002 at 05:39:44PM -0500, Jean-Luc Lachance wrote:
> > Let's create a new command:
> >
> > PARTITION <table> ON <attribute>

> <snip>

> > Because the table file is already extended (2G limit) using
> > different files extension (.N)
> > how complicated (modifying the code) would it be to have the table
> > files split according to the cluster key?
>

> I think the code changes would be complicated. Just at a 30-second
> consideration, this would need to touch:
> - all sql (selects, inserts, updates, deletes)
> - vacuuming
> - indexing
> - statistics gathering
> - existing clustering

I think his idea was to treat it similarly to the way that the
system treats tables >2G with .N files. The only thing is that
I believe the code that deals with that wouldn't be particularly
easy to change to do it though, but I've only taken a cursory look at
what I think is the place that does that(storage/smgr/md.c). Some sort of
good partitioning system would be nice though.

> create table u1 (...);
> create table u2 (...);
> create table u3 (...);
>
> create view uv as (select "A" as partition_key, ... from u1
> union all
> select "B" as partition_key, ... from u2
> union all
> select "C" as partition_key, ... from u3);
>
> That keeps the tables in different files on-disk while still allowing
> you to query against all of them. You need to index them separately
> and logic is necessary when changing data.

Unfortunately, I think that the optimizer isn't going to do what you'd
hope here and scan only the appropriate table if you were to say
partition_key='A' and foo='bar'. I'd love to be shown that I'm wrong, but
the best I could see hoping for would be that if partition_key was part of
u1-u3 and there was an index on partition_key,foo that it could use that
and do minimal work on the other tables.

In addition, doing something like the above is a nightmare if you don't
know beforehand what the partitions should be (for example if you know
there aren't alot of distinct values, but you don't know what they are) or
for that matter even with 10-15 partitions, writing the rules and such
would probably be really error prone.

Stephan Szabo

unread,

Dec 12, 2002, 9:14:33 PM12/12/02

to

On Thu, 12 Dec 2002, Alvaro Herrera wrote:

> On Thu, Dec 12, 2002 at 04:03:47PM -0800, Stephan Szabo wrote:
> > On Thu, 12 Dec 2002, johnnnnnn wrote:
> >
> > > I think the code changes would be complicated. Just at a 30-second
> > > consideration, this would need to touch:
> > > - all sql (selects, inserts, updates, deletes)
> > > - vacuuming
> > > - indexing
> > > - statistics gathering
> > > - existing clustering
> >
> > I think his idea was to treat it similarly to the way that the
> > system treats tables >2G with .N files. The only thing is that
> > I believe the code that deals with that wouldn't be particularly
> > easy to change to do it though, but I've only taken a cursory look at
> > what I think is the place that does that(storage/smgr/md.c). Some sort of
> > good partitioning system would be nice though.
>

> I don't think this is doable without a huge amount of work. The storage
> manager doesn't know anything about what is in a page, let alone a
> tuple. And it shouldn't, IMHO. Upper levels don't know how are pages
> organized in disk; they don't know about .1 segments and so on, and they
> shouldn't.

Which is part of why I said it wouldn't be easy to change to do that,
there's no good way to communicate that information. Like I said, I
didn't look deeply, but I had to look though, because you can never tell
with bits of old university code to do mostly what you want that haven't
been exercised in years floating around.

> I think this kind of partition doesn't buy too much. I would really
> like to have some kind of auto-clustering, but it should be implemented
> in some upper level; e.g., by leaving some empty space in pages for
> future tuples, and arranging the whole heap again when it runs out of
> free space somewhere. Note that this is very far from the storage
> manager.

Auto clustering would be nice.

I think Jean-Luc's suggested partitioning mechanism has certain usage
patterns that it's a win for and most others that it's not. Since the
usage pattern I can think of (very large table with a small number of
breakdowns where your conditions are primarily on those breakdowns) aren't
even remotely in the domain of things I've worked with, I can't say
whether it'd end up really being a win to avoid the index reads for the
table.