scan into many tables

Greg Hennessy

unread,

Jul 13, 2012, 12:11:03 PM7/13/12

to

I have a rather large table (3 TB) that I need to split up.
I can do a bunch of
select yada into tmp_00 from centroids where value=0;
select yada into tmp_01 from centroids where value=1;
select yada into tmp_02 from centroids where value=2;

but since a scan on the database takes about 40 hours,
and I have values between 0 and 48, I'm looking at multiple
weeks just to copy things into the smaller tables.

Is there a way to do this while going through the
database once?

Hans Castorp

unread,

Jul 13, 2012, 12:37:47 PM7/13/12

to

Don't know if this is really going to be faster:

with base_data as (
select yada, value
from centroids
where value in (0,1,2)
),
first_insert as (
insert into tmp_00 (yada)
select yada
from base_data
where value = 0
returning yada
),
second_insert as (
insert into tmp_01 (yada)
select yada
from base_data
where value = 1
returning yada
),
third_insert as (
insert into tmp_02 (yada)
select yada
from base_data
where value = 2
returning yada
)
select count(*)
from third_insert;

If the result for the "base_data" is huge, this might not perform well.

> but since a scan on the database takes about 40 hours

That sounds like something is seriously broken in your environment. How many rows are we talking?

Another option might be a partitioned table where you simply insert everything into and the trigger takes care of distributing the rows into the individual tables.

Matthew Woodcraft

unread,

Jul 13, 2012, 12:51:09 PM7/13/12

to

So long as you don't have an ORDER BY in the queries, if you run several
of them in parallel (ie, from different database connections) then
PostgreSQL is supposed to be smart enough to keep the scans in step so
that it only has to read the disk once.

(The synchronize_seqscans configuration parameter controls this, but it
defaults to 'on' anyway.)

But I can't tell you from experience how well to expect this to work, or
how many queries you can reasonably expect to issue at once. The people
on the pgsql-performance mailing list could probably tell you
(especially if you tell them more about your hardware).

Also, 40 hours seems a long time to me to scan a 3TB table; they might
be able to help you improve that too.

-M-

Greg Hennessy

unread,

Jul 13, 2012, 2:54:18 PM7/13/12

to

On 2012-07-13, Hans Castorp <REWYRL...@spammotel.com> wrote:
> Don't know if this is really going to be faster:
>
> with base_data as (
> select yada, value
> from centroids
> where value in (0,1,2)
> ),

Thanks. I'm not familiar with the WITH statement, time for some
reading.

> If the result for the "base_data" is huge, this might not perform well.
>
>> but since a scan on the database takes about 40 hours
>
> That sounds like something is seriously broken in your
>environment. How many rows are we talking?

Not sure, I've never counted them. I can tell you in about 40
hours. :)

> Another option might be a partitioned table where you simply insert
>everything into and the trigger takes care of distributing the rows
>into the individual tables.

That's my goal, but I wanted to try to subdivide the honking huge
table into 48 smaller tables, then rebuild the new partitioned table .

Greg Hennessy

unread,

Jul 13, 2012, 3:00:43 PM7/13/12

to

On 2012-07-13, Matthew Woodcraft <matt...@chiark.greenend.org.uk> wrote:
> So long as you don't have an ORDER BY in the queries, if you run several
> of them in parallel (ie, from different database connections) then
> PostgreSQL is supposed to be smart enough to keep the scans in step so
> that it only has to read the disk once.

That is nice to know.

> (The synchronize_seqscans configuration parameter controls this, but it
> defaults to 'on' anyway.)

I don't see a parameter with this name in my postgresql.conf file.
I'm running 8.1.23, on Redhat 5.

Greg Hennessy

unread,

Jul 13, 2012, 4:48:24 PM7/13/12

to

On 2012-07-13, Hans Castorp <REWYRL...@spammotel.com> wrote:

> That sounds like something is seriously broken in your
> environment. How many rows are we talking?

28 billion rows, if I understand things correctly.

PS1=# explaPS1=# explain select count(*) from centroids;
QUERY PLAN
---------------------------------------------------------------------------------
Aggregate (cost=688839322.40..688839322.41 rows=1 width=0)
-> Seq Scan on centroids (cost=0.00..616676434.72 rows=28865155072 width=0)
(2 rows)

Time: 3468.272 ms
PS1=#

Matthew Woodcraft

unread,

Jul 13, 2012, 5:15:26 PM7/13/12

to

Gosh. Then you won't get the behaviour I described, because it was
introduced in 8.3.

-M-

Jasen Betts

unread,

Jul 13, 2012, 10:55:17 PM7/13/12

to

set up table partitioning (or something like it)
and have your insert trigger function do the
demultiplexing,

on the other hand that's a lot of data to handle in a single
transaction. dumping it and spltting it externally may work
better.

--
⚂⚃ 100% natural

--- Posted via news://freenews.netfront.net/ - Complaints to ne...@netfront.net ---

Graham Murray

unread,

Jul 22, 2012, 6:46:40 AM7/22/12

to

Greg Hennessy <greg.h...@cox.net> writes:

> I don't see a parameter with this name in my postgresql.conf file.
> I'm running 8.1.23, on Redhat 5.

You should seriously consider upgrading as postgresql 8.1 went EOL in
November 2010.

Matthew Woodcraft

unread,

Jul 22, 2012, 8:09:19 AM7/22/12

to

In article <87eho48...@einstein.gmurray.org.uk>,

8.1 is what shipped with RHEL 5, so I think Red Hat will be providing
full support for some time yet.

(Of course there are other good reasons to upgrade, but I imagine if it
was easy to do he'd have done it already.)

-M-

Robert Klemme

unread,

Jul 22, 2012, 12:55:29 PM7/22/12

to

On 22.07.2012 14:09, Matthew Woodcraft wrote:
> In article <87eho48...@einstein.gmurray.org.uk>,
> Graham Murray <news...@gmurray.org.uk> wrote:
>> Greg Hennessy <greg.h...@cox.net> writes:
>>
>>> I don't see a parameter with this name in my postgresql.conf file.
>>> I'm running 8.1.23, on Redhat 5.
>>
>> You should seriously consider upgrading as postgresql 8.1 went EOL in
>> November 2010.
>
> 8.1 is what shipped with RHEL 5, so I think Red Hat will be providing
> full support for some time yet.

Does that mean they will implement security fixes and general bug fixes
in their version of PostgreSQL?

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Don Y

unread,

Jul 22, 2012, 1:30:21 PM7/22/12

to

Hi Mathew,

"Easy" doesn't mean that it doesn't "take time". As with most
projects, *that* is usually the limiting factor (only so many
hours in a day!)

I've found it is reasonably painless to move (forward) along the
pg upgrade path. It's a bit of a chore dumping and reloading
an entire database (depending on size and secondary storage you have
available to you) but (so far), so far, the process seems to go without
a hitch.

OTOH, I always build from sources so if you're stuck waiting for
a prepackaged binary, your mileage *will* vary :-/

Hans Castorp

unread,

Jul 22, 2012, 2:24:32 PM7/22/12

to

Don Y wrote on 22.07.2012 19:30:
> It's a bit of a chore dumping and reloading
> an entire database (depending on size and secondary storage you have
> available to you) but (so far), so far, the process seems to go without
> a hitch.

Since 8.4 and pg_upgrade, upgrades have become much less painful.

Don Y

unread,

Jul 22, 2012, 3:54:07 PM7/22/12

to

Hi Hans,

I'm not sure I "trust" pg_upgrade, entirely, when it comes to
the "universe of potential databases/sets" it might encounter.

E.g., I have several custom base types that I use. So, as part of
each upgrade, I dig through the sources to see what, if anything,
in the "type interface" may have changed to be sure my existing
types will continue to work when imported to the new implementation.

I find the dump + reload option is the most reassuring one (to me)
as I *know* that my code will rexamine the dumped data as it is
being reloaded. I don't have to worry that some subtle change
will bite me *after* it's in place (without a way of returning to
the known, working configuration).

And, of course, pg_upgrade doesn't help you *build* the new
binaries (which Matthew might require)

Matthew Woodcraft

unread,

Jul 22, 2012, 4:07:02 PM7/22/12

to

Robert Klemme <short...@googlemail.com> wrote:
>On 22.07.2012 14:09, Matthew Woodcraft wrote:
>> 8.1 is what shipped with RHEL 5, so I think Red Hat will be providing
>> full support for some time yet.
>
> Does that mean they will implement security fixes and general bug fixes
> in their version of PostgreSQL?

I expect it's something along the lines of security fixes and dataloss
bug fixes, and I think it's unlikely there'll be many of the latter
turning up now.

I'm not speaking from experience; I use Debian and maybe there's some
fine print I haven't seen.

But certainly Red Hat employ sufficient expertese, and RHEL5 isn't even
out of the highest-level-support part of its 'lifecycle' yet.

-M-

Greg Hennessy

unread,

Jul 22, 2012, 6:57:06 PM7/22/12

to

I'll upgrade as soon as redhat 5 upgrades. It is a work requirement
that I track redhat 5. Well, I could track redhat 6, but I don't
really want to upgrade my entire cluster at this point.