nohup time pg_dump -f database.dmp -Z9 database
             Table "database.bigtable"
    Column    |       Type        |     Modifiers
--------------+-------------------+--------------------
 headerid     | integer           | not null
 member       | numeric(10,0)     | not null
 postcode     | character varying |
 bsp          | character varying |
 details      | bytea             | not null
 membertypeid | integer           | not null default 0
Indexes:
    "bigtable_pkey" PRIMARY KEY, btree (headerid, member)
    "bigtable_member" btree (member)
Foreign-key constraints:
    "bigtable_headerid_fkey" FOREIGN KEY (headerid) REFERENCES header(headerid)
Did you look at "vmstat 1" output to see whether the system was under
any large I/O load?
Dumping large bytea data is known to be slow for a couple of reasons:
1. The traditional text output format for bytea is a bit poorly chosen.
It's not especially cheap to generate and it interacts very badly with
COPY processing, since it tends to contain lots of backslashes which
then have to be escaped by COPY.
2. Pulling the data from the out-of-line "toast" table can be expensive
if it ends up seeking all over the disk to do it. This will show up as
a lot of seeking and I/O wait, rather than CPU expense.
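A rough illustration of reason #1, in Python rather than anything from the server source; `bytea_escape` here is a simplification of the pre-9.0 "escape" output format, and the second pass is COPY's own backslash escaping:

```python
def bytea_escape(data: bytes) -> str:
    # Simplified pre-9.0 "escape" bytea output: printable ASCII passes
    # through, a backslash becomes \\, everything else becomes \nnn (octal).
    out = []
    for b in data:
        if b == 0x5C:                  # backslash
            out.append('\\\\')
        elif 0x20 <= b <= 0x7E:        # printable ASCII
            out.append(chr(b))
        else:
            out.append('\\%03o' % b)
    return ''.join(out)

def copy_escape(text: str) -> str:
    # COPY's text format must then escape every backslash again.
    return text.replace('\\', '\\\\')

raw = bytes(range(256)) * 256          # stand-in for binary (e.g. PDF) data
stage1 = bytea_escape(raw)
stage2 = copy_escape(stage1)
print(len(raw), len(stage1), len(stage2))
```

For evenly distributed binary data the dump text ends up more than three times the size of the raw bytes, and every one of those added backslashes costs CPU twice: once to generate, once to escape.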
Since you mention having recently recopied the table into a new table,
I would guess that the toast table is reasonably well-ordered and so
effect #2 shouldn't be a big issue. But it's a good idea to check.
PG 9.0 is changing the default bytea output format to hex, in part
to solve problem #1. That doesn't help you in an 8.3 installation
of course. If you're desperate you could consider excluding this
table from your pg_dumps and backing it up separately via COPY BINARY.
The PITA factor of that might be more than you can stand though.
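A sketch of that split-backup approach, written out as the command lines involved (the table, database, and file names are assumptions taken from the \d output earlier in the thread; pg_dump's -T/--exclude-table switch does exist in 8.3):

```python
# Hypothetical split backup: everything except the bytea-heavy table goes
# through pg_dump as before, and the excluded table is saved separately in
# its compact binary form via COPY ... WITH BINARY.
dump_cmd = [
    "pg_dump",
    "-Z9",                                # compressed plain-format dump
    "-T", "bigtable",                     # exclude the bytea table
    "-f", "database_sans_bigtable.dmp",
    "database",
]
copy_cmd = [
    "psql", "-d", "database",
    "-c", "COPY bigtable TO STDOUT WITH BINARY",
]
print(" ".join(dump_cmd))
print(" ".join(copy_cmd))
```

The COPY output would be redirected to a file of its own; restoring means replaying the dump and then a matching COPY ... FROM ... WITH BINARY, which is the PITA factor being referred to.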
Offhand I can't think of any other way to ameliorate the problem
in 8.3.
regards, tom lane
--
Sent via pgsql-performance mailing list (pgsql-pe...@postgresql.org)
Can you connect a few times with gdb and do "bt" to get a backtrace?
That might shed some light on where it's spending all of its time.
...Robert
Evening all,
Maiden post to this list. I've a performance problem for which I'm uncharacteristically in need of good advice.
I have a read-mostly database using 51GB on an ext3 filesystem on a server running Ubuntu 9.04 and PG 8.3. Forty hours ago I started a plain-format dump, compressed with -Z9, and it is still running, having produced 32GB of an expected 40 - 45GB of compressed output.
The first interesting thing was that piping (uncompressed) pg_dump into
gzip, instead of using pg_dump's internal compressor, does bring a lot
of extra parallelism into play. (Thank you, Matthew Wakeling.) I
observed gzip using 100% CPU, as expected, and also two, count them, two
postgres processes collecting data, each consuming a further 80% CPU.
It seemed to me that Postgres was starting and stopping these to match
the capacity of the consumer (i.e. pg_dump and gzip). Very nice.
Unfortunately one of these processes dropped eventually, and, according
to top, the only non-idle process running was gzip (100%). Obviously
there were postgres and pg_dump processes, too, but they were throttled
by gzip's rate of output and effectively idle (less than 1% CPU). That
is also interesting. The final output from gzip was being produced at
the rate of about 0.5MB/second, which seems almost unbelievably slow.
I next tried Tom Lane's suggestion, COPY WITH BINARY, which produced the
complete 34GB file in 30 minutes (a good result). I then compressed
that with gzip, which took an hour and reduced the file to 32GB (hardly
worth the effort) for a total run time of 90 minutes. In that instance,
gzip produced output at the rate of 10MB/second, so I tried pg_dump -Z0
to see how quickly that would dump the file. I had the idea that I'd go
on to see how quickly gzip would compress it, but unfortunately it
filled my disk before finishing (87GB at that point), so there's
something worth knowing: pg_dump's output for binary data is very much
less compact than COPY WITH BINARY; all those backslashes, as Tom
pointed out. For the aforementioned reason, I didn't get to see how
gzip would perform. For the record, pg_dump with no compression
produced output at the rate of 26MB/second; a rather meaningless number
given the 200%+ expansion of final output.
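A quick sanity check of those reported rates, as plain arithmetic (nothing PostgreSQL-specific, just the figures above):

```python
# COPY WITH BINARY: 34GB produced in 30 minutes.
copy_rate = 34 * 1024 / (30 * 60)       # MB/s
print(round(copy_rate, 1))              # ≈ 19.3 MB/s

# gzip afterwards, observed at 10MB/s over the same 34GB file.
gzip_secs = 34 * 1024 / 10
print(round(gzip_secs / 60))            # ≈ 58 minutes, matching "took an hour"
```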
I am now confident the performance problem is from gzip, not Postgres,
and wonder if I should read up on gzip to find why it would work so
slowly on a pure text stream, albeit a representation of PDF which
intrinsically is fairly compressed. Given the spectacular job that
postgres did in adjusting its rate of output to match the consumer
process, I did wonder if there might have been a tragic interaction
between postgres and gzip; perhaps postgres limits its rate of output to
match gzip; and gzip tries to compress what's available, that being only
a few bytes; and perhaps that might be so inefficient that it hogs the
CPU; but I don't think that likely. I had a peek at gzip's source
(surprisingly readable) and on first blush it does seem that unfortunate
input could result in only a few bytes being written each time through
the loop, meaning only a few more bytes could be read in.
Just to complete the report, I created a child table to hold the PDFs,
which are static, took a dump of just that table, and adjusted my
backup command to exclude it. Total size of the compressed backup sans
PDFs is circa 7MB, taking around 30 seconds.
> Unfortunately one of these processes dropped eventually, and, according
> to top, the only non-idle process running was gzip (100%). Obviously
> there were postgres and pg_dump processes, too, but they were throttled
> by gzip's rate of output and effectively idle (less than 1% CPU). That
> is also interesting. The final output from gzip was being produced at
> the rate of about 0.5MB/second, which seems almost unbelievably slow.
CPU isn't the only measure of interest here.
If pg_dump and the postgres backend it's using are doing simple work
such as reading linear data from disk, they won't show much CPU activity
even though they might be running full-tilt. They'll be limited by disk
I/O or other non-CPU resources.
> and wonder if I should read up on gzip to find why it would work so
> slowly on a pure text stream, albeit a representation of PDF which
> intrinsically is fairly compressed.
In fact, PDF uses deflate compression, the same algorithm used for gzip.
Gzip-compressing PDF is almost completely pointless - all you're doing
is compressing some of the document structure, not the actual content
streams. With PDF 1.5 and above using object and xref streams, you might
not even be doing that, instead only compressing the header and trailer
dictionary, which are probably in the order of a few hundred bytes.
Compressing PDF documents is generally a waste of time.
--
Craig Ringer
> In fact, PDF uses deflate compression, the same algorithm used for gzip.
> Gzip-compressing PDF is almost completely pointless -
Yeah. I would bet that the reason for the slow throughput is that gzip
is fruitlessly searching for compressible sequences. It won't find many.
regards, tom lane
Indeed, I didn't expect much reduction in size, but I also didn't expect
a twenty-fold increase in run-time (i.e. output at 10MB/second going
down to 500KB/second), particularly as my estimate was based on gzipping
a previously gzipped file. I think it's probably
pathological data, as it were. Might even be of interest to gzip's
maintainers.
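The hypothesized few-bytes-per-loop effect can be imitated with zlib by forcing the compressor to emit output after every tiny chunk of input. Note this is only an illustration of the worst case being speculated about; gzip itself buffers its reads and does not normally flush like this:

```python
import zlib

data = b"x" * 100000

# One-shot compression for reference.
whole = zlib.compress(data, 9)

# Feed a few bytes at a time and force output after each tiny chunk.
c = zlib.compressobj(9)
out = b""
for i in range(0, len(data), 8):
    out += c.compress(data[i:i + 8])
    out += c.flush(zlib.Z_SYNC_FLUSH)   # emit a block for just these bytes
out += c.flush()

print(len(whole), len(out))
```

Each forced flush emits a block marker for only a handful of bytes, so the chunked output is orders of magnitude larger than the one-shot result, and the per-block overhead dominates the work done.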