loading data, creating indexes, clustering, vacuum...


Angva

Dec 7, 2006, 11:38:06 AM

Hi everyone,

Looking for a small bit of advice...

I have a script that updates several tables with large amounts of data.
Before running the updates, it drops all indexes for optimal
performance. When the updates have finished, I run the following
procedure:

recreate the indexes
cluster the tables
vacuum full analyze on the tables
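
In SQL terms, that order is roughly this (table and index names here
are placeholders):

  CREATE INDEX my_table_idx ON my_table (some_col);  -- recreate each index
  CLUSTER my_table_idx ON my_table;                  -- 8.1-era syntax: CLUSTER indexname ON tablename
  VACUUM FULL ANALYZE my_table;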

I was hoping an expert could comment on the optimal way to order these
three commands. For instance I have a hunch that creating the indexes
first (as I do now) could slow down the clustering - perhaps the row
locations in the indexes all have to be updated as the cluster command
shifts their locations? And perhaps vacuuming should be done before
clustering so that dead tuples aren't "in the way"?

Of course I could just test every combination until I get it right, but
I'd like to have a good understanding as well.

Any insight would be much appreciated.

Thank you,
Mark

Alan Hodgson

Dec 7, 2006, 5:52:11 PM

On Thursday 07 December 2006 08:38, "Angva" <ang...@gmail.com> wrote:
> three commands. For instance I have a hunch that creating the indexes
> first (as I do now) could slow down the clustering - perhaps the row
> locations in the indexes all have to be updated as the cluster command
> shifts their locations? And perhaps vacuuming should be done before
> clustering so that dead tuples aren't "in the way"?

clustering also removes the dead tuples.

I would just:

- create one index, the one to be clustered
- cluster the table
- create the remaining indexes
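
In SQL, roughly (placeholder names again):

  CREATE INDEX my_table_cluster_idx ON my_table (some_col);
  CLUSTER my_table_cluster_idx ON my_table;   -- 8.1-era syntax
  CREATE INDEX my_table_other_idx ON my_table (other_col);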

--
Eat right. Exercise regularly. Die anyway.



Martijn van Oosterhout

Dec 8, 2006, 5:51:15 AM

On Thu, Dec 07, 2006 at 02:52:11PM -0800, Alan Hodgson wrote:
> On Thursday 07 December 2006 08:38, "Angva" <ang...@gmail.com> wrote:
> > three commands. For instance I have a hunch that creating the indexes
> > first (as I do now) could slow down the clustering - perhaps the row
> > locations in the indexes all have to be updated as the cluster command
> > shifts their locations? And perhaps vacuuming should be done before
> > clustering so that dead tuples aren't "in the way"?
>
> clustering also removes the dead tuples.
>
> I would just:
>
> - create one index, the one to be clustered
> - cluster the table
> - create the remaining indexes

And then run ANALYZE. No need to vacuum, because the cluster did that
already.
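
That is, finish the sequence above with:

  ANALYZE my_table;   -- placeholder name; statistics only, no vacuum needed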

Have a nice day,
--
Martijn van Oosterhout <kle...@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


Angva

Dec 8, 2006, 9:47:37 AM

Thank you very much, Alan and Martijn, for the advice!

Mark


Angva

Dec 8, 2006, 12:33:31 PM

> clustering also removes the dead tuples.

I have a followup question. What if the set of dead tuples is too big
and I need a VACUUM FULL, as opposed to a plain VACUUM? (The update size
varies greatly from day to day.) Will the clustering effectively do a
VACUUM FULL, or just a VACUUM?

Thanks again for your help,
Mark

Glen Parker

Dec 8, 2006, 3:49:30 PM

Angva wrote:
> Looking for a small bit of advice...
>
> I have a script that updates several tables with large amounts of data.
> Before running the updates, it drops all indexes for optimal
> performance. When the updates have finished, I run the following
> procedure:
>
> recreate the indexes
> cluster the tables
> vacuum full analyze on the tables

Hi all,

I'd like to see a general way to take indexes offline without actually
losing their definitions. For example, something like "ALTER TABLE
[EN|DIS]ABLE INDEXES", "ALTER INDEX [EN|DIS]ABLE", etc. This could
also be used internally when a backend encounters an error reading or
writing an index. Rather than refusing to execute queries, it could
just ignore indexes it knows are disabled or bad in some way and
re-plan as needed.

This would have two benefits. First, the above scenario would be much
simpler. Rather than dropping and re-creating new indexes, you could
just disable and then re-enable them without having any knowledge of
their structure. Secondly, it would allow us to put indexes in an
alternate table space on a non-redundant volume and, in the case of a
drive failure, be able to limp along, and get the system back to normal
simply by replacing the disk and issuing a REINDEX command.
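
A session might then look like this (entirely hypothetical syntax; none
of this exists in PostgreSQL today):

  ALTER TABLE my_table DISABLE INDEXES;  -- hypothetical
  -- ... bulk load here ...
  ALTER TABLE my_table ENABLE INDEXES;   -- hypothetical; rebuilds whatever is stale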

I realize there are a couple of gotchas with this. For example, what to
do with unique indexes? Perhaps a backend would still need to refuse to
do updates/inserts on a table with degraded unique indexes, unless the
index was disabled explicitly? And then, refuse to rebuild/re-enable the
index as normal if non-unique values are found?


Thx for considering :-)

-Glen Parker


Glen Parker

Dec 8, 2006, 3:49:29 PM

Hi all,

Since PITR works well, my use of pg_dump has shifted. Rather than using
it as a backup tool, I now use it as a snapshotting tool. At the end of
each month we do an ASCII dump to keep around, so if we ever need to,
we can see the data as it was any number of months or years ago. Not a
backup at all, just a raw data archive.

These archives do not need to hold all our data, for example, system
logs would be useless later. There also is no reason to include
indexes. Ignoring extraneous tables and indexes is a great way to keep
the archive small and keep the time to restore as low as possible.

So, it would be great if pg_dump could accept some sort of argument to
make it simply not dump certain types of objects: indexes, views,
functions, etc.
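
For example (the --omit option here is purely hypothetical; -a is the
closest thing that exists today, since a data-only dump carries no
indexes, views, or functions):

  # hypothetical option, does not exist:
  pg_dump --omit=indexes,views,functions mydb > archive.sql

  # existing approximation: a data-only dump
  pg_dump -a mydb > archive.sql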


Thx for considering :-)

-Glen Parker


Martijn van Oosterhout

Dec 8, 2006, 5:04:51 PM

On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
> I'd like to see a general way to take indexes off line without actually
> losing their definitions. For example, something like "ALTER TABLE [EN
> | DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc. This could
> also be used internally when a backend encounters an error
> reading/writing an index. Rather than refusing to execute queries, it
> could just ignore indexes it knows are disabled or bad in some way and
> re-plan as needed.

One issue would be that even disabled indexes would need to be updated
when there are new rows. If you don't update the index when it's
disabled, then re-enabling will essentially need to rebuild the index.


Glen Parker

Dec 8, 2006, 5:12:18 PM

Martijn van Oosterhout wrote:
> On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
>> I'd like to see a general way to take indexes off line without actually
>> losing their definitions. For example, something like "ALTER TABLE [EN
>> | DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc. This could
>> also be used internally when a backend encounters an error
>> reading/writing an index. Rather than refusing to execute queries, it
>> could just ignore indexes it knows are disabled or bad in some way and
>> re-plan as needed.
>
> One issue would be that even disabled indexes would need to be updated
> when there are new rows. If you don't update the index when it's
> disabled, then re-enabling will essentially need to rebuild the index.


That's what I had in mind. You could just as easily blow away the index
file(s). It's just that I don't want it to toss the index *definition*.

To continue updating such an index would completely negate the benefit
of disabling it!

-Glen


Tom Lane

Dec 8, 2006, 6:59:54 PM

Martijn van Oosterhout <kle...@svana.org> writes:
> One issue would be that even disabled indexes would need to be updated
> when there are new rows. If you don't update the index when it's
> disabled, then re-enabling will essentially need to rebuild the index.

I assume that's what he wants. However, it's not immediately clear that
there's a sufficient use case for this to justify the extra apparatus
compared to just DROP INDEX (and recreate it later).

regards, tom lane

Bill Moran

Dec 8, 2006, 9:48:48 PM

"Angva" <ang...@gmail.com> wrote:
>
> > clustering also removes the dead tuples.
>
> I have a followup question. What if the set of dead tuples is too big
> and I need to VACUUM FULL, as opposed to VACUUM. (The update size
> varies greatly from day to day.) Will the clustering effectively do a
> VACUUM FULL, or just a VACUUM?

CLUSTER is the equivalent of VACUUM FULL, with the addition that it
orders the data in the table in the order of the index you specify.

VACUUM FULL doesn't follow any particular order for the data.
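
So for a big batch of dead tuples, a single CLUSTER does the whole job
(8.1-era syntax, placeholder names):

  CLUSTER my_table_idx ON my_table;  -- rewrites the table: dead tuples
                                     -- gone, rows ordered by the index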

-Bill

Shoaib Mir

Dec 9, 2006, 3:25:21 AM

If you have max_fsm_pages set correctly, that makes VACUUM faster and
removes the need for VACUUM FULL or REINDEX. It should be slightly more
than the total number of data pages that will be touched by updates and
deletes between vacuums.
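
For example (the number is purely illustrative; size it from your own
update/delete volume):

  # postgresql.conf
  max_fsm_pages = 200000

You can check the current value with SHOW max_fsm_pages, and the tail of
a database-wide VACUUM VERBOSE reports how much of the map is in use.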

Thanks,
--------
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)

Glen Parker

Dec 11, 2006, 7:51:06 PM

Gurus,

I hope I can make this clear somehow... Anyway... This all involves PG
8.1.4 on a 64-bit FC5 box.

Select version() says "PostgreSQL 8.1.4 on x86_64-redhat-linux-gnu,
compiled by GCC x86_64-redhat-linux-gcc (GCC) 4.1.0 20060304 (Red Hat
4.1.0-3)".

I guess the best question I can see is, under what circumstances is the
directory name in pg_tablespace actually used?

I have a scenario where I want to restore from a PITR backup, into an
alternate location on the same machine it came from, while the original
database is still up and running. I have one alternate table space.

It goes like this. First I expand the base archive into an alternate
location, then expand the tablespace archive(s) into alternate
location(s). Then I recreate the links under pg_tblspc. I then fiddle
a little bit with the config files and run postgres on the new alternate
location. Everything goes fine, the database rolls forward, and then
postgres quits (because I give it an SQL file for stdin). Great.
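
In outline, something like this (all paths and the 16385 tablespace OID
are made up for illustration):

  tar -xf base_backup.tar -C /restore/pgdata
  tar -xf tblspc_backup.tar -C /restore/tblspc
  rm /restore/pgdata/pg_tblspc/16385
  ln -s /restore/tblspc /restore/pgdata/pg_tblspc/16385
  # point the config files (and recovery.conf) at the alternate
  # locations, then start postgres against /restore/pgdata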

But now I have a problem. What if I move objects from the main
tablespace to the alternate one, such as indexes, between the time of
the backup and the restore? During the restore/recovery, the
pg_tablespace table is out of date. If the tablespace directory listed
there was used in copying files, I'd have a big fat mess involving a
badly broken production database.

Hopefully that all makes sense...

-Glen


Tom Lane

Dec 11, 2006, 8:37:04 PM

Glen Parker <glen...@nwlink.com> writes:
> I guess the best question I can see is, under what circumstances is the
> directory name in pg_tablespace actually used?

It isn't used (except by pg_dumpall) ... what counts is where the
symlink in $PGDATA/pg_tblspc points.
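
E.g. you can compare the catalog row against the symlink directly
(spclocation being the 8.1-era catalog column):

  SELECT spcname, spclocation FROM pg_tablespace;

and then ls -l $PGDATA/pg_tblspc/ shows what the server actually follows.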

> I have a scenario where I want to restore from a PITR backup, into an
> alternate location on the same machine it came from, while the original
> database is still up and running. I have one alternate table space.

You can probably make this work if you don't issue any CREATE TABLESPACE
commands while PITR logging is active, but you'll want to test your
procedures pretty carefully.

regards, tom lane


Glen Parker

Dec 12, 2006, 6:49:06 PM

> You can probably make this work if you don't issue any CREATE TABLESPACE
> commands while PITR logging is active, but you'll want to test your
> procedures pretty carefully.

That's what I thought, and after your message, I went ahead with it and
had no problems. Thx, Tom.

Bruce Momjian

Dec 13, 2006, 6:06:07 PM

Glen Parker wrote:
> Martijn van Oosterhout wrote:
> > On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
> >> I'd like to see a general way to take indexes off line without actually
> >> losing their definitions. For example, something like "ALTER TABLE [EN
> >> | DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc. This could
> >> also be used internally when a backend encounters an error
> >> reading/writing an index. Rather than refusing to execute queries, it
> >> could just ignore indexes it knows are disabled or bad in some way and
> >> re-plan as needed.
> >
> > One issue would be that even disabled indexes would need to be updated
> > when there are new rows. If you don't update the index when it's
> > disabled, then re-enabling will essentially need to rebuild the index.
>
>
> That's what I had in mind. You could just as easily blow away the index
> file(s). It's just that I don't want it to toss the index *definition*.

Well, you can just pg_dump the index definition to a file. What more
would someone want?
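
Or grab it straight from SQL (placeholder index name):

  SELECT indexdef FROM pg_indexes WHERE indexname = 'my_table_idx';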

--
Bruce Momjian br...@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

