Working pg_dump!

159 views
Skip to first unread message

Micah Yoder

unread,
Jun 9, 2010, 9:13:07 AM6/9/10
to Holland Backup
Ok, I have a working pg_dump plugin and here's something to prove
it! :)

micah@localhost:~/env/var/spool/holland/pgdump$ ll 20100609_075021/
data/
total 44848
-rw-rw----. 1 micah micah 45861237 2010-06-09 07:50 ftsearch.dump.bz2
-rw-rw----. 1 micah micah 349 2010-06-09 07:50 global.sql.bz2
-rw-rw----. 1 micah micah 1821 2010-06-09 07:50 micah.dump.bz2
-rw-rw----. 1 micah micah 539 2010-06-09 07:50 postgres.dump.bz2

micah@localhost:~/env/var/spool/holland/pgdump$ ll 20100609_074643/
data/
total 44644
-rw-rw----. 1 micah micah 45650372 2010-06-09 07:47 ftsearch.dump.gz
-rw-rw----. 1 micah micah 305 2010-06-09 07:46 global.sql.gz
-rw-rw----. 1 micah micah 1859 2010-06-09 07:47 micah.dump.gz
-rw-rw----. 1 micah micah 513 2010-06-09 07:46 postgres.dump.gz

It's committed to my yodermk/holland repo on Github. I'm sure it will
need a good bit of improvement -- like, it totally ignores most
options!

Some notes:

* Uses pg_database_size to estimate the size. The backup size for
both gzip and bzip2 is only about 30% of the estimate.

* Reads auth options only out of a .pgpass file, that is specified in
the config. Is that good enough or do we have to also support putting
all the parameters in the config? (Actually, just now looking back at
it, that's how it connects for DB information. I don't think it used
that info for the actual dump. Something to fix.)

* I re-worked a couple things from the base provided by Andy.
Hopefully not too big a problem.
-- It now grabs a DB connection on init and keeps it through the
lifetime of the backup
-- When getting the DB list, it does:
SELECT datname FROM pg_database WHERE datistemplate='f'
instead of selecting everything and excluding DBs beginning with
'template'.
-- The main PgDump object now keeps a list of databases instead of
calling pg_databases() every time one is needed.

Any comments, let me know. I'll plan to fix some other stuff up as
time permits, hopefully tonight.

Andrew Garner

unread,
Jun 9, 2010, 11:55:35 AM6/9/10
to hollan...@googlegroups.com

I've briefly looked over this and I'll do a more thorough vetting
later today. It looks pretty good so far. The reworked items you
noted are great improvements. Good work.

~Andy

Andrew Garner

unread,
Jun 9, 2010, 5:26:13 PM6/9/10
to hollan...@googlegroups.com

I've looked over the plugin more in-depth and there are a few items to
comment on:

1) Python compatibility issue

There is a compatibility issue in holland.backup.pgdump.base in the
get_connection(config) method. This is using the 'with' construct
that is not supported until python2.5+. We're still supporting back
to python2.3 (CentOS/RHEL4) for holland1.0 core - external plugins can
violate this if they want but it's probably not all that useful to do
so in this case. This could be replaced with the more verbose:

try:
f = open(...)
except IOError, exc:
raise PgError()

try:
passline = f.readline()
finally:
f.close()


However, you shouldn't have to even do that because you shouldn't need
to open the passfile directly.

2) PGPASS handling

There are multiple problems with just reading the first line of a
pgpass file and passing in the values literally to connect():
* The first four fields of a standard pgpassfile may be wildcards and
not real user names
* There may be multiple lines and the first line may not necessarily
match the database being connected to
* fields may be escaped (if they contain a : or \) and need demunging
* If a password (or other field..) contains a : a naive split(:) won't work

This behavior has been the same for quite some time:

http://www.postgresql.org/docs/7.4/interactive/libpq-pgpass.html

If you're using the standard pgpass mechanism I would just setup the
environment correctly and you can implicitly use the pgpass file with
no parsing in python-land. I would pull the user,host etc from a
section in the config and either use a predefined pgpass or maybe
generate a pgpass file in the backup directory. To make use of this
you can do the following:

a) setup the environment for psycopg2 (or any other libpq-based connector)
import os
os.environ['PGPASSFILE'] = path/to/my/pgpassfile (e.g
os.path.join(target_directory, 'holland.pgpass')
# optionally generate a pgpass from the holland .conf
open(os.environ['PGPASSFILE'], 'w').write('*:*:*:*:' + password)
os.chmod(pgpassfile, 0600) # shouldn't be readable by group or world

# password will be read from pgpass per the documented rules, although
you could specify it explicitly
connection = psycopg2.connect(user=user,host=...,dbname=...,port=...)

b) Setup the environment when calling the postgres dump commands
subprocess.check_call(['pg_dump', '-U', username, '-h', host, ...,
'-Fc', dbname],
env={ 'PGPASSFILE' : pgpassfile },
# this will setup the environment in the subprocess before execv*
...)

This allows for consistent handling of pgpass between the connector
and the pg_dump* commands.


3) Persistent connections
Long-term, I would look at not maintaining a persistent connection as
well. All the metadata info (database names, sizes, etc) can be
calculated up front and statically referenced thereafter. This is
done in the mysqldump module as well (largely because mysql metadata
is ridiculously expensive to calculate and needs caching).

At any rate, if you do maintain a persistent connection I recommend
stashing it in the backup plugin instance state, rather than as a
global module variable.

~Andy

Micah Yoder

unread,
Jun 11, 2010, 8:18:57 AM6/11/10
to Holland Backup
Andy, thanks for the input! I'll get to fixing that as soon as I can
get away with it. Maybe before I leave this morning since we got the
queue down good.

I agree that the "control connection" used by Python can be released
before the pg_dump starts. I just needed it in more than one function
and didn't want to make the connection twice.

Micah Yoder

unread,
Jun 16, 2010, 7:01:55 AM6/16/10
to Holland Backup
Ok, this change is now committed to my repo.

It enforces entering either a pgpassfile or password in the
configuration. If no pgpassfile, it attempts to create it.
Originally I tried to create it in the target directory, but that was
failing -- apparently it is not yet created at the point of the plugin
constructor. In any case, with dry_run the directory isn't created
anyway. So that's a problem. I am for the time being creating /
tmp/.pgpass, which is probably not ideal but it works for now. Any
better suggestions? The plugin class deletes /tmp/.pgpass with its
destructor if it was created.

os.environ seems to be passing the PGPASSFILE variable just fine to
the pg_dump command so I didn't see the need to set it in the
environment with the call.

I guess next is the work of integrating all the options that make
sense. For starters, why is there a role and username? I'm not sure
what the role option would be used for since in Postgres users are
just roles with login permissions.

Then there's the bit about excluding schemas and tables. Do you see
that as necessary? I tend to believe that DBs should be backed up as
a whole. Of course if you have a log table that no foreign key
references, maybe it's a candidate for exclusion. Implementing that
bit might be somewhat complex -- I take it I would need to implement a
PgSchema class of some kind, like you have for MySQL?

Thanks,
Micah

Tim Soderstrom

unread,
Jun 16, 2010, 9:09:13 AM6/16/10
to hollan...@googlegroups.com
Yo Micah!

Andy is out and he's our PostgreSQL expert :) But the changes sound good. As far as the scheam/table exclusions, while I agree with you, there will likely be cases where someone has a clear need for inclusion/exclusions so I would say that is worth implementing. The mysqldump uses GLOB based filtering which gets pushed down to a mysqldump config file. Not sure how you would handle that in PostgreSQL.

Andy will probably have more when he gets back but I thought I'd provide my $0.02 for now :)

Tim

> --
> WIKI: http://wiki.hollandbackup.org
> CODE: http://github.com/holland-backup
> UNSUBSCRIBE: holland-deve...@googlegroups.com
>

Micah Yoder

unread,
Jun 16, 2010, 9:59:02 AM6/16/10
to Holland Backup
Thanks. There's more to do before I get there, and I'll study the
mysql implementation some more.

Andrew Garner

unread,
Jun 16, 2010, 1:58:22 PM6/16/10
to hollan...@googlegroups.com
On Wed, Jun 16, 2010 at 6:01 AM, Micah Yoder <yod...@gmail.com> wrote:
> Ok, this change is now committed to my repo.
>
> It enforces entering either a pgpassfile or password in the
> configuration.  If no pgpassfile, it attempts to create it.
> Originally I tried to create it in the target directory, but that was
> failing -- apparently it is not yet created at the point of the plugin
> constructor.  In any case, with dry_run the directory isn't created
> anyway.  So that's a problem.  I am for the time being creating /
> tmp/.pgpass, which is probably not ideal but it works for now.  Any
> better suggestions?  The plugin class deletes /tmp/.pgpass with its
> destructor if it was created.

I typically use a tempfile.NamedTemporaryFile() in these cases. This
gets unlinked as soon as the file is closed (which will happen
implicitly once the object goes out of scope) and will honor TMPDIR
settings so you don't have to hardcode /tmp.

See:

http://docs.python.org/library/tempfile.html


> I guess next is the work of integrating all the options that make
> sense.  For starters, why is there a role and username?  I'm not sure
> what the role option would be used for since in Postgres users are
> just roles with login permissions.

Some of those options can definitely be clipped. I don't quite recall
what I had in mind with role. Probably something like supporting
8.4's pg_dump --role option. Some environments require connecting to
the database with a low privilege user and then using SET ROLE to 'su'
with sufficient privileges (for logging/auditing purposes). This
isn't supported by pg_dump < 8.4 and not very critical to support out
of the gate.

> Then there's the bit about excluding schemas and tables.  Do you see
> that as necessary?  I tend to believe that DBs should be backed up as
> a whole.  Of course if you have a log table that no foreign key
> references, maybe it's a candidate for exclusion.  Implementing that
> bit might be somewhat complex -- I take it I would need to implement a
> PgSchema class of some kind, like you have for MySQL?

Excluding schemas is useful, for example, in a slony environment. At
least with slony 2.x you can exclude all the bookeeping junk by just
excluding the slony schema. Excluding tables is a little less common,
but I often see people wanting to excluding some large log table that
can either be reconstructed or they can easily deal with the loss and
so back it up less frequently.

I don't think --exclude-[table|schema] options existed until 8.2 and
before that '--table' options could only be specified once, I believe.
pg8.1 is still pretty common as it is the default on CentOS/RHEL5
:-\ So this can be a little tricky to implement if you want to support
multiple Postgres versions.

As far as implementation, a simple one might just pass down the
exclude/include lists to pg_dump - constructing multiple
--exclude-schema=X arguments for each specified pattern. To adjust
the backup size estimate is more complex as you likely need to fake
pg_dump's matching mechanism which might have some impedance mismatch
with python's fnmatch. I don't think having a "perfect" backup size
estimate is necessary out of the gate and just using the simple
pg_database_size estimate is probably good enough.

Andrew Garner

unread,
Jun 23, 2010, 10:30:23 AM6/23/10
to hollan...@googlegroups.com
On Wed, Jun 16, 2010 at 12:58 PM, Andrew Garner
<andrew....@gmail.com> wrote:
> On Wed, Jun 16, 2010 at 6:01 AM, Micah Yoder <yod...@gmail.com> wrote:
>> I guess next is the work of integrating all the options that make
>> sense.  For starters, why is there a role and username?  I'm not sure
>> what the role option would be used for since in Postgres users are
>> just roles with login permissions.

As a short followup and to clarify, you really aren't obligated to
implement any of the features in the configspec. A basic pg_dump that
generate a custom format dump is already quite useful with no other
options. With the custom format 90% of those options are better
handled by pg_restore anyway. This include/exclude filters don't add
much, since pg_dump supports globs (unlike mysql) and this could be
covered by 'additional-options' or something similar. ISTM, the only
benefit of having explicit settings would be to potentially adjust the
backup size estimate which is a very minor concern at this point.

With just a little cleanup I think this could be merged into the
mainline. I know a few people that would be happy to use it.

~Andy

Micah Yoder

unread,
Jun 30, 2010, 9:19:52 AM6/30/10
to Holland Backup
Great! The fix for the temporary file name is now committed to my
branch. (Was hoping to get to it last week but again the queue was
too heavy ... :/ )

I agree that the role thing is probably worth supporting. I wasn't
aware of that option.

Any other specific thing you think needs to be cleaned up before a
merge? I'll still plan on getting to schema exclusions and the role
feature as time permits.

On Jun 23, 9:30 am, Andrew Garner <andrew.b.gar...@gmail.com> wrote:
> On Wed, Jun 16, 2010 at 12:58 PM, Andrew Garner
>

Micah Yoder

unread,
Jul 2, 2010, 9:27:29 AM7/2/10
to Holland Backup
--role option now supported!

Tim Soderstrom

unread,
Jul 2, 2010, 10:02:54 AM7/2/10
to hollan...@googlegroups.com
Woohoo! Nice work Micah!

On Jul 2, 2010, at 8:27 AM, Micah Yoder wrote:

> --role option now supported!

Andrew Garner

unread,
Jul 12, 2010, 2:31:58 PM7/12/10
to hollan...@googlegroups.com
Hey Micah,

Sorry, I've been really busy the past couple weeks moving and work
related issues. I've finally swung around to reviewing your pgdump
changes.

1) in PgDump.__init__ in interface.py (line 104)

except IOError as e:

This exception syntax is only supported on python2.6+ (mostly in
preparation for py3k) so it is not supported on older versions.
You'll want to change this to

except IOError, e:

2) pgauth defaults should probably all be None. In the example config
we might set a few example settings, but I wouldn't put these in the
configspec

[pgauth]
username = string(default="postgres")
role = string(default=None)
password = string(default=None)
hostname = string(default="127.0.0.1")
port = integer(default=5432)
pgpass = string(default=None)

I would changes these to all be None - except for 'port' which makes
sense as 5432 and maybe user = 'postgres'. I suspect hostname will
often be a unix socket rather than 127.0.0.1.

3) Hostname can't be None :)

In mysqldump we basically loop through the auth parameters
(user,host,port,socket,etc.) and just skip any that aren't set. This
will use the default behavior, which is often sufficient.

So rather than specifying all the parameters, just specify the ones
defined. So probably one of the simplest (aside from '') would be:

>>> conn = dbapi.connect(user='postgres')
>>> curs = conn.cursor()
>>> curs.execute('select datname from pg_database')
>>> dbs = [db for db, in curs]
>>> dbs
['template1', 'template0', 'agarner', 'postgres', 'foo']

And if host is specified we would have connect(user='...', host='...')
and so on.

4) password or pgauth is required.

I would allow the possibility of running with trust or ident
privileges - I think this can be degraded to a warning and be less
opinionated.

5) Need to clip the unused options to avoid confusion

I don't think any of the pg_dump options are supported currently. It
might be nice to support at least pgdump --compression and/or
extra-options (for direct passthrough to pg_dump) today. I think
[compression] can completely go, unless you support non-custom formats
(where you might pipe to gzip or another external compression program)
- I'm not sure that pg_dumpall --globals-only will ever generate
enough output to justify compression (although it doesn't hurt I
suppose :). The default compression for a real database somewhat
defeats the purpose of --format=custom, as the backup will need to be
uncompressed to access each of the (already compressed by default)
objects.

extra-options (to pass parameters directly to pg_dump) would be nice
to make the plugin flexible.

6) Be careful to catch database errors

There are a couple cases where errors can leak through. For instance
get_db_size (called from estimate_backup_size) might fail for various
reasons (postgres crashed, database dropped, etc.) and this doesn't
get caught until it hits the general error handler in
holland.core.backup which will report the error rather loudly. You
probably want to handle that gracefully and report a nicer error
message about what was going on when the error occurred.

7) pgauth should probably make hostname and other parameters a
wildcard ("*") if they are not specified.

8) generating a pgauth should probably quote the fields, in case they
contain an embedded ':' (either intentionally or otherwise).

From the docs: "If an entry needs to contain : or \, escape this
character with \."

I'm still playing around with it, so I'll probably have more feedback tomorrow.

~Andy

Micah Yoder

unread,
Jul 21, 2010, 8:04:30 AM7/21/10
to Holland Backup
Thanks for the input. I just got back from vacation, and will try to
address these as soon as possible.

Andrew Garner

unread,
Jul 22, 2010, 11:46:16 AM7/22/10
to hollan...@googlegroups.com
On Wed, Jul 21, 2010 at 7:04 AM, Micah Yoder <yod...@gmail.com> wrote:
> Thanks for the input.  I just got back from vacation, and will try to
> address these as soon as possible.

Cool. I'm looking forward to it. I should be much more responsive
with feedback now. If you would prefer I can open individual github
issues if that makes it easier to track.

~Andy

Micah Yoder

unread,
Jul 24, 2010, 8:38:46 AM7/24/10
to Holland Backup
Ok, additional thoughts ...

On Jul 12, 1:31 pm, Andrew Garner <andrew.b.gar...@gmail.com> wrote:

> I would changes these to all be None - except for 'port' which makes
> sense as 5432 and maybe user = 'postgres'. I suspect hostname will
> often be a unix socket rather than 127.0.0.1.

I had been thinking the default user makes sense as postgres, but if
users typically set PGUSER and other environment variables in their
root environment, I guess it can be None. For that matter, it may
indeed be best to default Port to None since defaulting it to 5432
would cause an override of PGPORT.

The psycopg2 connect() function takes all those environment variables
into account right? That way they would be the same between that and
the pg_dump calls.

> 5) Need to clip the unused options to avoid confusion

Will do that as soon as I figure out exactly what I'll implement.

> I don't think any of the pg_dump options are supported currently. It
> might be nice to support at least pgdump --compression and/or
> extra-options (for direct passthrough to pg_dump) today. I think
> [compression] can completely go, unless you support non-custom formats
> (where you might pipe to gzip or another external compression program)
> - I'm not sure that pg_dumpall --globals-only will ever generate
> enough output to justify compression (although it doesn't hurt I
> suppose :). The default compression for a real database somewhat
> defeats the purpose of --format=custom, as the backup will need to be
> uncompressed to access each of the (already compressed by default)
> objects.

Might as well leave it in for similarity with MySQL, but defaulting to
off would make some sense.


> extra-options (to pass parameters directly to pg_dump) would be nice
> to make the plugin flexible.

good idea.

> 6) Be careful to catch database errors

Right. I can get a bit lazy about exceptions when doing the first
code run. Bad practice I suppose! :/

About exceptions though, I wonder if it would be better instead of
having plugin specific exceptions like MySQLError and PgError, to have
cross-plugin error-specific exceptions defined in a Holland lib. For
example HollandDBConnectError would be thrown by either the MySQL or
PG plugin. That would let the plugins behave in a similar way with a
similar error, and what happens (specific error message or other
exception behavior) could be defined centrally.

> 7) pgauth should probably make hostname and other parameters a
> wildcard ("*") if they are not specified.

You mean the generated pgpass I assume. Yeah I guess that will be
necessary.

Andrew Garner

unread,
Jul 24, 2010, 11:14:55 AM7/24/10
to hollan...@googlegroups.com
On Sat, Jul 24, 2010 at 7:38 AM, Micah Yoder <yod...@gmail.com> wrote:
> Ok, additional thoughts ...
>
> On Jul 12, 1:31 pm, Andrew Garner <andrew.b.gar...@gmail.com> wrote:
>
>> I would changes these to all be None - except for 'port' which makes
>> sense as 5432 and maybe user = 'postgres'.  I suspect hostname will
>> often be a unix socket rather than 127.0.0.1.
>
> I had been thinking the default user makes sense as postgres, but if
> users typically set PGUSER and other environment variables in their
> root environment, I guess it can be None.  For that matter, it may
> indeed be best to default Port to None since defaulting it to 5432
> would cause an override of PGPORT.
>
> The psycopg2 connect() function takes all those environment variables
> into account right?  That way they would be the same between that and
> the pg_dump calls.

Yeah. This is all handled by Postgres' underlying libpq. Since libpq
defaults to the current system user and you either have a pgpassfile
configured, environment or using an appropriate auth type in pg_hba
you could be so lazy as to do the minimal:
psycopg2.connect('')

Default user of 'postgres' is probably fine. I think for the most
parts the other defaults will either be fine or need to be manually
customized by a user anyway.

In python, you can build up a dictionary of { 'param' : 'value ' }
pairs and pass this to the connect method directly:
params = dict()
for param in config['pgauth']:
if param not in acceptable_params: # skip pgpass, role etc.
continue
value = config['pgauth'][param]
if value is not None:
params[param] = value
psycopg2.connect(**params)

Or alternatively you can build up the traditional postgres DSN string.

>> 5) Need to clip the unused options to avoid confusion
>
> Will do that as soon as I figure out exactly what I'll implement.
>
>> I don't think any of the pg_dump options are supported currently.  It
>> might be nice to support at least pgdump --compression and/or
>> extra-options (for direct passthrough to pg_dump) today.  I think
>> [compression] can completely go, unless you support non-custom formats
>> (where you might pipe to gzip or another external compression program)
>> - I'm not sure that pg_dumpall --globals-only will ever generate
>> enough output to justify compression (although it doesn't hurt I
>> suppose :).  The default compression for a real database somewhat
>> defeats the purpose of --format=custom, as the backup will need to be
>> uncompressed to access each of the (already compressed by default)
>> objects.
>
> Might as well leave it in for similarity with MySQL, but defaulting to
> off would make some sense.

The original idea was to use compression for those people who (for
whatever reason) didn't want to do dumps with --format=custom (that
is, plain sql dumps). Most of the interesting restore behavior
(including parallel restore) will require custom, so that's a good
default.

For custom, it problem makes sense to convert the compression level to
a pg_dump -Z # option, I suppose and warn that the compression method
is not configurable with the custom format since it always uses zlib
today.

>> 6) Be careful to catch database errors
>
> Right.  I can get a bit lazy about exceptions when doing the first
> code run. Bad practice I suppose! :/
>
> About exceptions though, I wonder if it would be better instead of
> having plugin specific exceptions like MySQLError and PgError, to have
> cross-plugin error-specific exceptions defined in a Holland lib.  For
> example HollandDBConnectError would be thrown by either the MySQL or
> PG plugin.  That would let the plugins behave in a similar way with a
> similar error, and what happens (specific error message or other
> exception behavior) could be defined centrally.

We could do this, but I'm not sure what we would want special for this
case vs. any other error a plugin might run into. Right now we have a
basic 'BackupError' core catches and otherwise has a catchall for
other excpetions and loudly logs if a plugin raises something
otherwise unexpected. If we have useful actions or reporting that
makes sense for more specialized errors, I'm all for making a richer
holland exception hierarchy.

>> 7) pgauth should probably make hostname and other parameters a
>> wildcard ("*") if they are not specified.
>
> You mean the generated pgpass I assume.  Yeah I guess that will be
> necessary.

By pgauth I mean the [pgauth] section we have today. I do roughly the
same with mysql, looping through those auth parameters and skipping
those that are `None` (not defined) when generating a my.cnf. Here,
we're generating a pgpassfile, so rather than skipping a parameter, it
should probably be converted to '*' for all but password.

Andrew Garner

unread,
Jul 28, 2010, 1:51:19 AM7/28/10
to hollan...@googlegroups.com
I thought it might be useful to see some of the changes I had
recommended to move things along a little faster. Not all of these
necessarily have to be in the final plugin, of course, but I've
implemented most of the changes I mentioned previously here:

http://github.com/abg/holland/tree/pgdump

The last 17 commits or so cleaned up the pgdump plugin with what I had
in mind. Not all changes are necessarily a good idea :) but I wanted
to see how they'll work out.

As a summary:

* dropped pgpass config parameter entirely. a pgpassfile is still
generated, but only if a password is set. We warn loudly if this
overrides a pgpassfile in the environment.

* no longer set a global PGPASSFILE in the environment, but pass an
environment directly to the individual commands. This means other
potential postgres plugins won't be futzing with the same environment
variable.

* all pgauth parameters are optional and uses libpq default behavior.
this means username = os.getuid() effectively, and a unix socket
connection (probably /tmp or /var/run/postgresql/ on debian) rather
than tcp over 127.0.0.1 as before.

* I noticed there was still the old stderr=open('pgdump.err') when
running pg commands, which writes a pgdump.err file in os.getcwd()
which wasn't intended. I output to a temporary file and relog that to
the plugin logger as necessary.

* pg_dump --format is configurable and --format=custom uses the
pg_dump builtin --compress option rather than holland's streaming
compression facilities

* database names are encoded with holland.lib.safefilename (to make
things like 'createdb foo/bar'; holland backup pgdump work)

* reorganized 'role' into the main pgdump section

* drop all other unimplemented configspec parameters.

* fixed holland.spec to make pgdump build correctly in rpms (in my
personal tree only, not our holland-backup 'trunk' and currently not
in the debian packages)

Some sample output from a test run via holland v1.0.4 (latest merge
from upstream):

# holland backup pgdump
Holland 1.0.4 started with pid 3527
--- Starting backup run ---
Acquired lock /etc/holland/backupsets/pgdump.conf :
'/etc/holland/backupsets/pgdump.conf'
Got databases: ['agarner', 'postgres', 'foo', 'foo/bar']
Creating backup path /var/spool/holland/pgdump/20100728_014757
DB agarner size 4468332
DB postgres size 4468332
DB foo size 4468332
DB foo/bar size 4468332
Estimated Backup Size: 17.05MB
Starting backup[pgdump/20100728_014757] via plugin pgdump
pg_dumpall -g --username postgres
pg_dump --username postgres --compress 1 --format custom agarner
pg_dump --username postgres --compress 1 --format custom postgres
pg_dump --username postgres --compress 1 --format custom foo
Encoded database foo/bar as filename foo(2f)bar
pg_dump --username postgres --compress 1 --format custom foo/bar
Final on-disk backup size 4.91KB
0.03% of estimated size 17.05MB
Backup completed in 0.24 seconds
Purged pgdump/20100728_013333
1 backups purged
Released lock /etc/holland/backupsets/pgdump.conf
--- Ending backup run ---

# ls -1 /var/spool/holland/pgdump/20100728_014757/data/
agarner.dump
foo(2f)bar.dump
foo.dump
global.sql.gz
pgpass
postgres.dump

~Andy

Andrew Garner

unread,
Jul 28, 2010, 1:25:16 PM7/28/10
to hollan...@googlegroups.com
I also added a few more minor things to my pgdump tree this morning:

* a debian package for pgdump
* handle unicode database names
* generate a manifest file (in case database names are path encoded to
something unrecognizable)
* minor formatting changes to make logging slightly prettier

~Andy

Micah Yoder

unread,
Jul 29, 2010, 9:37:01 AM7/29/10
to Holland Backup
Wow that's great!

Anything left to do on it now? :)

Andrew Garner

unread,
Jul 29, 2010, 10:20:30 AM7/29/10
to hollan...@googlegroups.com
On Thu, Jul 29, 2010 at 8:37 AM, Micah Yoder <yod...@gmail.com> wrote:
> Wow that's great!
>
> Anything left to do on it now? :)

I'm pretty sure there are a few places where errors aren't being
trapped properly. The changes could definitely do with code
review/testing. I've only done pretty basic testing against 8.3/8.4.
Some discussion on whether the (lack of) defaults are reasonable.

For flexibility implementing some "additional-options" config
parameter would be nice. Otherwise, I think the basic functionality
to cover most of the pgdump use cases is in place and just needs
vetted.

~Andy

Micah Yoder

unread,
Jul 30, 2010, 8:57:01 AM7/30/10
to Holland Backup
Andy,

Ok you broke something I had working! :o

--compress is not supported in pg_dumpall. So that crashed it. I
fixed it by adding this to backup_globals. Hopefully it is not too
nasty:

# --compress not supported in pg_dumpall!
if '--compress' in connection_params:
idx = connection_params.index('--compress')
connection_params[idx:idx+2] = []

It's in my master branch, which is now based on your pgdump.

> I'm pretty sure there are a few places where errors aren't being
> trapped properly.   The changes could definitely do with code
> review/testing.  I've only done pretty basic testing against 8.3/8.4.
> Some discussion on whether the (lack of) defaults are reasonable.

Sure I'll plan on banging it around over the next couple weeks and
look into more exception handling.

> For flexibility implementing some "additional-options" config
> parameter would be nice.  Otherwise, I think the basic functionality
> to cover most of the pgdump use cases is in place and just needs
> vetted.

should be doable as well.

Need to get some VMs working to test with older Postgres versions and
distros.

Andrew Garner

unread,
Jul 30, 2010, 11:05:49 AM7/30/10
to hollan...@googlegroups.com
On Fri, Jul 30, 2010 at 7:57 AM, Micah Yoder <yod...@gmail.com> wrote:
> Andy,
>
> Ok you broke something I had working! :o
>
> --compress is not supported in pg_dumpall.  So that crashed it.  I
> fixed it by adding this to backup_globals.  Hopefully it is not too
> nasty:
>
>    # --compress not supported in pg_dumpall!
>    if '--compress' in connection_params:
>        idx = connection_params.index('--compress')
>        connection_params[idx:idx+2] = []
>
> It's in my master branch, which is now based on your pgdump.

Good catch. This snuck in yesterday and I didn't get back around to
testing it.

A better approach here is probably to maintain a separate args list
for pg_dump. This would also fold nicely into the
'additional-options' feature. I originally was doing this, but
unnecessarily in the "for each database" loop so I had moved it into
the pgauth2args method to clean that up.

I pushed a small fix for that into my branch, which may be a little prettier.

>> For flexibility implementing some "additional-options" config
>> parameter would be nice.  Otherwise, I think the basic functionality
>> to cover most of the pgdump use cases is in place and just needs
>> vetted.
>
> should be doable as well.

Since it was trivial (~5 line patch) based on the --compress change I
just made, I pushed an initial implementation to my tree for this if
you want to evaluate that approach.

This works alright in most cases, but this impl uses shlex.split which
does not handle unicode so I think some additional fencing is needed
at least. I think this would only be an issue with non-ascii
table/schema exclusions (which are very rare, I think), and probably
can be solved by just encoding to utf8 first.

> Need to get some VMs working to test with older Postgres versions and
> distros.

If you know what else you want to test on, we should be able to
provide temporary testing servers. You can email me off-list.

~Andy

Andrew Garner

unread,
Jul 30, 2010, 2:23:38 PM7/30/10
to hollan...@googlegroups.com
On Fri, Jul 30, 2010 at 7:57 AM, Micah Yoder <yod...@gmail.com> wrote:
> Need to get some VMs working to test with older Postgres versions and
> distros.

As a small data point in the other direction, I ran the pgdump plugin
against Postgres-9.0beta3 on RHEL 5.4 without any problems. Of
course, you have to have the compat-libs to make rhel's psycopg2
package's libpq dependency happy.

On the other end of the spectrum, pgdump doesn't work on rhel4 with
the default postgres-7.4.

There error here is:

Traceback (most recent call last):
File "/usr/lib/python2.3/site-packages/holland/core/command/command.py",
line 201, in dispatch
return self.run(self.optparser.prog, opts, *args)
File "/usr/lib/python2.3/site-packages/holland/commands/backup.py",
line 95, in run
runner.backup(name, config, opts.dry_run)
File "/usr/lib/python2.3/site-packages/holland/core/backup/base.py",
line 102, in backup
estimated_size = self.check_available_space(plugin)
File "/usr/lib/python2.3/site-packages/holland/core/backup/base.py",
line 152, in check_available_space
estimated_bytes_required = plugin.estimate_backup_size()
File "/usr/lib/python2.3/site-packages/holland/backup/pgdump/interface.py",
line 79, in estimate_backup_size
totalestimate += get_db_size(db, self.connection)
File "/usr/lib/python2.3/site-packages/holland/backup/pgdump/base.py",
line 46, in get_db_size
cursor.execute("SELECT pg_database_size('%s')" % dbname)
ProgrammingError: function pg_database_size("unknown") does not exist
HINT: No function matches the given name and argument types. You may
need to add explicit type casts.

Obviously this is just due to the lack of the pg_database_size
function (which affects any version <8.1). This is easy to detect by
looking for exception.pgcode = 42883. I put in a workaround to
fallback to just setting size = 0 in this case and otherwise the
plugin worked for my test data. I think to support this otherwise we
would have to reconnect to each database and query pg_class directly
which isn't very pretty. This is probably not worth it as 7.4,8.0
were recently deprecated anyway but if it's not painful providing some
workaround might be useful. Although I doubt anyone conservative
enough to not have upgraded Postgres will want to run holland. :)

~Andy

Micah Yoder

unread,
Jul 31, 2010, 3:20:22 AM7/31/10
to Holland Backup
>   Although I doubt anyone conservative
> enough to not have upgraded Postgres will want to run holland. :)

I agree, and am not generally a fan of supporting old versions of
things if it hinders progress for newer, cooler stuff.

What about RHEL 4 with a newer Postgres? I'd be inclined to not worry
too much about that either. (Especially because I failed installing
CentOS 4 in VirtualBox ...)

> A better approach here is probably to maintain a separate args list
for pg_dump.

Perhaps so. Or start with a base argument list, which would include
nearly all arguments, then have a variable to hold the final list for
pg_dump and pg_dumpall.

Andrew Garner

unread,
Jul 31, 2010, 1:58:38 PM7/31/10
to hollan...@googlegroups.com
On Sat, Jul 31, 2010 at 2:20 AM, Micah Yoder <yod...@gmail.com> wrote:
>>   Although I doubt anyone conservative
>> enough to not have upgraded Postgres will want to run holland. :)
>
> I agree, and am not generally a fan of supporting old versions of
> things if it hinders progress for newer, cooler stuff.

Still, the error here is uncaught and needs to be handled at any rate.
It's not that much extra work to support these old versions. I
pushed a fix for this in my pgdump branch to fallback to querying
pg_class so this should work on old versions now.

> What about RHEL 4 with a newer Postgres?  I'd be inclined to not worry
> too much about that either.  (Especially because I failed installing
> CentOS 4 in VirtualBox ...)

python-psycopg2 works fine, although EPEL is required (which is a
required dependency to provide python-setuptools on this platform
anyway). pgrpms.org provides rhel4 packages all the way up to the
latest 8.4.4 so there are other solutions.

>> A better approach here is probably to maintain a separate args list
> for pg_dump.
>
> Perhaps so.  Or start with a base argument list, which would include
> nearly all arguments, then have a variable to hold the final list for
> pg_dump and pg_dumpall.

That's what pgdump is doing per the fix I pushed yesterday. There are
two lists, connection_params and extra_options - the latter gets
passed only to pgdump and I slip --compress in there as needed.

~Andy

BJ Dierkes

unread,
Aug 2, 2010, 5:58:00 PM8/2/10
to hollan...@googlegroups.com

On Jul 31, 2010, at 2:20 AM, Micah Yoder wrote:

>> Although I doubt anyone conservative
>> enough to not have upgraded Postgres will want to run holland. :)
>
> I agree, and am not generally a fan of supporting old versions of
> things if it hinders progress for newer, cooler stuff.

Here here... I am on the side of providing support on RHEL4 for the minimal set of features... i.e. mysqldump. I wouldn't be apposed to adding a PG version check in the plugin and failing with 'unsupported version of postgres' type error. Anything beyond the basics we shouldn't be too concerned with older os's.

As far as packages go, we can optionally not build in any plugins that won't work on rhel4.

---
derks

Andrew Garner

unread,
Aug 2, 2010, 6:36:17 PM8/2/10
to hollan...@googlegroups.com
On Mon, Aug 2, 2010 at 4:58 PM, BJ Dierkes <wdie...@5dollarwhitebox.org> wrote:
>
> On Jul 31, 2010, at 2:20 AM, Micah Yoder wrote:
>
>>>   Although I doubt anyone conservative
>>> enough to not have upgraded Postgres will want to run holland. :)
>>
>> I agree, and am not generally a fan of supporting old versions of
>> things if it hinders progress for newer, cooler stuff.
>
> Here here...  I am on the side of providing support on RHEL4 for the minimal set of features... i.e. mysqldump.  I wouldn't be apposed to adding a PG version check in the plugin and failing with 'unsupported version of postgres' type error.  Anything beyond the basics we shouldn't be too concerned with older os's.

Checking the version is essentially the same amount of work as just
working around the problem (and orders of magnitude less effort than
even the simple case in mysqldump) - I already committed a fix. The
main point here though is to handle the problem gracefully - if we run
into further issues we can just punt early and declare these versions
unsupported. I don't expect this will be the case, however.

~Andy

Reply all
Reply to author
Forward
0 new messages