I've briefly looked over this and I'll do a more thorough vetting
later today. It looks pretty good so far. The reworked items you
noted are great improvements. Good work.
~Andy
I've looked over the plugin more in-depth and there are a few items to
comment on:
1) Python compatibility issue
There is a compatibility issue in holland.backup.pgdump.base in the
get_connection(config) method. This is using the 'with' construct
that is not supported until python2.5+. We're still supporting back
to python2.3 (CentOS/RHEL4) for holland1.0 core - external plugins can
violate this if they want but it's probably not all that useful to do
so in this case. This could be replaced with the more verbose:
try:
f = open(...)
except IOError, exc:
raise PgError()
try:
passline = f.readline()
finally:
f.close()
However, you shouldn't have to even do that because you shouldn't need
to open the passfile directly.
2) PGPASS handling
There are multiple problems with just reading the first line of a
pgpass file and passing in the values literally to connect():
* The first four fields of a standard pgpassfile may be wildcards and
not real user names
* There may be multiple lines and the first line may not necessarily
match the database being connected to
* fields may be escaped (if they contain a : or \) and need demunging
* If a password (or other field..) contains a : a naive split(:) won't work
This behavior has been the same for quite some time:
http://www.postgresql.org/docs/7.4/interactive/libpq-pgpass.html
If you're using the standard pgpass mechanism I would just setup the
environment correctly and you can implicitly use the pgpass file with
no parsing in python-land. I would pull the user,host etc from a
section in the config and either use a predefined pgpass or maybe
generate a pgpass file in the backup directory. To make use of this
you can do the following:
a) setup the environment for psycopg2 (or any other libpq-based connector)
import os
os.environ['PGPASSFILE'] = path/to/my/pgpassfile (e.g
os.path.join(target_directory, 'holland.pgpass')
# optionally generate a pgpass from the holland .conf
open(os.environ['PGPASSFILE'], 'w').write('*:*:*:*:' + password)
os.chmod(pgpassfile, 0600) # shouldn't be readable by group or world
# password will be read from pgpass per the documented rules, although
you could specify it explicitly
connection = psycopg2.connect(user=user,host=...,dbname=...,port=...)
b) Setup the environment when calling the postgres dump commands
subprocess.check_call(['pg_dump', '-U', username, '-h', host, ...,
'-Fc', dbname],
env={ 'PGPASSFILE' : pgpassfile },
# this will setup the environment in the subprocess before execv*
...)
This allows for consistent handling of pgpass between the connector
and the pg_dump* commands.
3) Persistent connections
Long-term, I would look at not maintaining a persistent connection as
well. All the metadata info (database names, sizes, etc) can be
calculated up front and statically referenced thereafter. This is
done in the mysqldump module as well (largely because mysql metadata
is ridiculously expensive to calculate and needs caching).
At any rate, if you do maintain a persistent connection I recommend
stashing it in the backup plugin instance state, rather than as a
global module variable.
~Andy
Andy is out and he's our PostgreSQL expert :) But the changes sound good. As far as the scheam/table exclusions, while I agree with you, there will likely be cases where someone has a clear need for inclusion/exclusions so I would say that is worth implementing. The mysqldump uses GLOB based filtering which gets pushed down to a mysqldump config file. Not sure how you would handle that in PostgreSQL.
Andy will probably have more when he gets back but I thought I'd provide my $0.02 for now :)
Tim
> --
> WIKI: http://wiki.hollandbackup.org
> CODE: http://github.com/holland-backup
> UNSUBSCRIBE: holland-deve...@googlegroups.com
>
I typically use a tempfile.NamedTemporaryFile() in these cases. This
gets unlinked as soon as the file is closed (which will happen
implicitly once the object goes out of scope) and will honor TMPDIR
settings so you don't have to hardcode /tmp.
See:
http://docs.python.org/library/tempfile.html
> I guess next is the work of integrating all the options that make
> sense. For starters, why is there a role and username? I'm not sure
> what the role option would be used for since in Postgres users are
> just roles with login permissions.
Some of those options can definitely be clipped. I don't quite recall
what I had in mind with role. Probably something like supporting
8.4's pg_dump --role option. Some environments require connecting to
the database with a low privilege user and then using SET ROLE to 'su'
with sufficient privileges (for logging/auditing purposes). This
isn't supported by pg_dump < 8.4 and not very critical to support out
of the gate.
> Then there's the bit about excluding schemas and tables. Do you see
> that as necessary? I tend to believe that DBs should be backed up as
> a whole. Of course if you have a log table that no foreign key
> references, maybe it's a candidate for exclusion. Implementing that
> bit might be somewhat complex -- I take it I would need to implement a
> PgSchema class of some kind, like you have for MySQL?
Excluding schemas is useful, for example, in a slony environment. At
least with slony 2.x you can exclude all the bookeeping junk by just
excluding the slony schema. Excluding tables is a little less common,
but I often see people wanting to excluding some large log table that
can either be reconstructed or they can easily deal with the loss and
so back it up less frequently.
I don't think --exclude-[table|schema] options existed until 8.2 and
before that '--table' options could only be specified once, I believe.
pg8.1 is still pretty common as it is the default on CentOS/RHEL5
:-\ So this can be a little tricky to implement if you want to support
multiple Postgres versions.
As far as implementation, a simple one might just pass down the
exclude/include lists to pg_dump - constructing multiple
--exclude-schema=X arguments for each specified pattern. To adjust
the backup size estimate is more complex as you likely need to fake
pg_dump's matching mechanism which might have some impedance mismatch
with python's fnmatch. I don't think having a "perfect" backup size
estimate is necessary out of the gate and just using the simple
pg_database_size estimate is probably good enough.
As a short followup and to clarify, you really aren't obligated to
implement any of the features in the configspec. A basic pg_dump that
generate a custom format dump is already quite useful with no other
options. With the custom format 90% of those options are better
handled by pg_restore anyway. This include/exclude filters don't add
much, since pg_dump supports globs (unlike mysql) and this could be
covered by 'additional-options' or something similar. ISTM, the only
benefit of having explicit settings would be to potentially adjust the
backup size estimate which is a very minor concern at this point.
With just a little cleanup I think this could be merged into the
mainline. I know a few people that would be happy to use it.
~Andy
On Jul 2, 2010, at 8:27 AM, Micah Yoder wrote:
> --role option now supported!
Sorry, I've been really busy the past couple weeks moving and work
related issues. I've finally swung around to reviewing your pgdump
changes.
1) in PgDump.__init__ in interface.py (line 104)
except IOError as e:
This exception syntax is only supported on python2.6+ (mostly in
preparation for py3k) so it is not supported on older versions.
You'll want to change this to
except IOError, e:
2) pgauth defaults should probably all be None. In the example config
we might set a few example settings, but I wouldn't put these in the
configspec
[pgauth]
username = string(default="postgres")
role = string(default=None)
password = string(default=None)
hostname = string(default="127.0.0.1")
port = integer(default=5432)
pgpass = string(default=None)
I would changes these to all be None - except for 'port' which makes
sense as 5432 and maybe user = 'postgres'. I suspect hostname will
often be a unix socket rather than 127.0.0.1.
3) Hostname can't be None :)
In mysqldump we basically loop through the auth parameters
(user,host,port,socket,etc.) and just skip any that aren't set. This
will use the default behavior, which is often sufficient.
So rather than specifying all the parameters, just specify the ones
defined. So probably one of the simplest (aside from '') would be:
>>> conn = dbapi.connect(user='postgres')
>>> curs = conn.cursor()
>>> curs.execute('select datname from pg_database')
>>> dbs = [db for db, in curs]
>>> dbs
['template1', 'template0', 'agarner', 'postgres', 'foo']
And if host is specified we would have connect(user='...', host='...')
and so on.
4) password or pgauth is required.
I would allow the possibility of running with trust or ident
privileges - I think this can be degraded to a warning and be less
opinionated.
5) Need to clip the unused options to avoid confusion
I don't think any of the pg_dump options are supported currently. It
might be nice to support at least pgdump --compression and/or
extra-options (for direct passthrough to pg_dump) today. I think
[compression] can completely go, unless you support non-custom formats
(where you might pipe to gzip or another external compression program)
- I'm not sure that pg_dumpall --globals-only will ever generate
enough output to justify compression (although it doesn't hurt I
suppose :). The default compression for a real database somewhat
defeats the purpose of --format=custom, as the backup will need to be
uncompressed to access each of the (already compressed by default)
objects.
extra-options (to pass parameters directly to pg_dump) would be nice
to make the plugin flexible.
6) Be careful to catch database errors
There are a couple cases where errors can leak through. For instance
get_db_size (called from estimate_backup_size) might fail for various
reasons (postgres crashed, database dropped, etc.) and this doesn't
get caught until it hits the general error handler in
holland.core.backup which will report the error rather loudly. You
probably want to handle that gracefully and report a nicer error
message about what was going on when the error occurred.
7) pgauth should probably make hostname and other parameters a
wildcard ("*") if they are not specified.
8) generating a pgauth should probably quote the fields, in case they
contain an embedded ':' (either intentionally or otherwise).
From the docs: "If an entry needs to contain : or \, escape this
character with \."
I'm still playing around with it, so I'll probably have more feedback tomorrow.
~Andy
Cool. I'm looking forward to it. I should be much more responsive
with feedback now. If you would prefer I can open individual github
issues if that makes it easier to track.
~Andy
Yeah. This is all handled by Postgres' underlying libpq. Since libpq
defaults to the current system user and you either have a pgpassfile
configured, environment or using an appropriate auth type in pg_hba
you could be so lazy as to do the minimal:
psycopg2.connect('')
Default user of 'postgres' is probably fine. I think for the most
parts the other defaults will either be fine or need to be manually
customized by a user anyway.
In python, you can build up a dictionary of { 'param' : 'value ' }
pairs and pass this to the connect method directly:
params = dict()
for param in config['pgauth']:
if param not in acceptable_params: # skip pgpass, role etc.
continue
value = config['pgauth'][param]
if value is not None:
params[param] = value
psycopg2.connect(**params)
Or alternatively you can build up the traditional postgres DSN string.
>> 5) Need to clip the unused options to avoid confusion
>
> Will do that as soon as I figure out exactly what I'll implement.
>
>> I don't think any of the pg_dump options are supported currently. It
>> might be nice to support at least pgdump --compression and/or
>> extra-options (for direct passthrough to pg_dump) today. I think
>> [compression] can completely go, unless you support non-custom formats
>> (where you might pipe to gzip or another external compression program)
>> - I'm not sure that pg_dumpall --globals-only will ever generate
>> enough output to justify compression (although it doesn't hurt I
>> suppose :). The default compression for a real database somewhat
>> defeats the purpose of --format=custom, as the backup will need to be
>> uncompressed to access each of the (already compressed by default)
>> objects.
>
> Might as well leave it in for similarity with MySQL, but defaulting to
> off would make some sense.
The original idea was to use compression for those people who (for
whatever reason) didn't want to do dumps with --format=custom (that
is, plain sql dumps). Most of the interesting restore behavior
(including parallel restore) will require custom, so that's a good
default.
For custom, it problem makes sense to convert the compression level to
a pg_dump -Z # option, I suppose and warn that the compression method
is not configurable with the custom format since it always uses zlib
today.
>> 6) Be careful to catch database errors
>
> Right. I can get a bit lazy about exceptions when doing the first
> code run. Bad practice I suppose! :/
>
> About exceptions though, I wonder if it would be better instead of
> having plugin specific exceptions like MySQLError and PgError, to have
> cross-plugin error-specific exceptions defined in a Holland lib. For
> example HollandDBConnectError would be thrown by either the MySQL or
> PG plugin. That would let the plugins behave in a similar way with a
> similar error, and what happens (specific error message or other
> exception behavior) could be defined centrally.
We could do this, but I'm not sure what we would want special for this
case vs. any other error a plugin might run into. Right now we have a
basic 'BackupError' core catches and otherwise has a catchall for
other excpetions and loudly logs if a plugin raises something
otherwise unexpected. If we have useful actions or reporting that
makes sense for more specialized errors, I'm all for making a richer
holland exception hierarchy.
>> 7) pgauth should probably make hostname and other parameters a
>> wildcard ("*") if they are not specified.
>
> You mean the generated pgpass I assume. Yeah I guess that will be
> necessary.
By pgauth I mean the [pgauth] section we have today. I do roughly the
same with mysql, looping through those auth parameters and skipping
those that are `None` (not defined) when generating a my.cnf. Here,
we're generating a pgpassfile, so rather than skipping a parameter, it
should probably be converted to '*' for all but password.
http://github.com/abg/holland/tree/pgdump
The last 17 commits or so cleaned up the pgdump plugin with what I had
in mind. Not all changes are necessarily a good idea :) but I wanted
to see how they'll work out.
As a summary:
* dropped pgpass config parameter entirely. a pgpassfile is still
generated, but only if a password is set. We warn loudly if this
overrides a pgpassfile in the environment.
* no longer set a global PGPASSFILE in the environment, but pass an
environment directly to the individual commands. This means other
potential postgres plugins won't be futzing with the same environment
variable.
* all pgauth parameters are optional and uses libpq default behavior.
this means username = os.getuid() effectively, and a unix socket
connection (probably /tmp or /var/run/postgresql/ on debian) rather
than tcp over 127.0.0.1 as before.
* I noticed there was still the old stderr=open('pgdump.err') when
running pg commands, which writes a pgdump.err file in os.getcwd()
which wasn't intended. I output to a temporary file and relog that to
the plugin logger as necessary.
* pg_dump --format is configurable and --format=custom uses the
pg_dump builtin --compress option rather than holland's streaming
compression facilities
* database names are encoded with holland.lib.safefilename (to make
things like 'createdb foo/bar'; holland backup pgdump work)
* reorganized 'role' into the main pgdump section
* drop all other unimplemented configspec parameters.
* fixed holland.spec to make pgdump build correctly in rpms (in my
personal tree only, not our holland-backup 'trunk' and currently not
in the debian packages)
Some sample output from a test run via holland v1.0.4 (latest merge
from upstream):
# holland backup pgdump
Holland 1.0.4 started with pid 3527
--- Starting backup run ---
Acquired lock /etc/holland/backupsets/pgdump.conf :
'/etc/holland/backupsets/pgdump.conf'
Got databases: ['agarner', 'postgres', 'foo', 'foo/bar']
Creating backup path /var/spool/holland/pgdump/20100728_014757
DB agarner size 4468332
DB postgres size 4468332
DB foo size 4468332
DB foo/bar size 4468332
Estimated Backup Size: 17.05MB
Starting backup[pgdump/20100728_014757] via plugin pgdump
pg_dumpall -g --username postgres
pg_dump --username postgres --compress 1 --format custom agarner
pg_dump --username postgres --compress 1 --format custom postgres
pg_dump --username postgres --compress 1 --format custom foo
Encoded database foo/bar as filename foo(2f)bar
pg_dump --username postgres --compress 1 --format custom foo/bar
Final on-disk backup size 4.91KB
0.03% of estimated size 17.05MB
Backup completed in 0.24 seconds
Purged pgdump/20100728_013333
1 backups purged
Released lock /etc/holland/backupsets/pgdump.conf
--- Ending backup run ---
# ls -1 /var/spool/holland/pgdump/20100728_014757/data/
agarner.dump
foo(2f)bar.dump
foo.dump
global.sql.gz
pgpass
postgres.dump
~Andy
* a debian package for pgdump
* handle unicode database names
* generate a manifest file (in case database names are path encoded to
something unrecognizable)
* minor formatting changes to make logging slightly prettier
~Andy
I'm pretty sure there are a few places where errors aren't being
trapped properly. The changes could definitely do with code
review/testing. I've only done pretty basic testing against 8.3/8.4.
Some discussion on whether the (lack of) defaults are reasonable.
For flexibility implementing some "additional-options" config
parameter would be nice. Otherwise, I think the basic functionality
to cover most of the pgdump use cases is in place and just needs
vetted.
~Andy
Good catch. This snuck in yesterday and I didn't get back around to
testing it.
A better approach here is probably to maintain a separate args list
for pg_dump. This would also fold nicely into the
'additional-options' feature. I originally was doing this, but
unnecessarily in the "for each database" loop so I had moved it into
the pgauth2args method to clean that up.
I pushed a small fix for that into my branch, which may be a little prettier.
>> For flexibility implementing some "additional-options" config
>> parameter would be nice. Otherwise, I think the basic functionality
>> to cover most of the pgdump use cases is in place and just needs
>> vetted.
>
> should be doable as well.
Since it was trivial (~5 line patch) based on the --compress change I
just made, I pushed an initial implementation to my tree for this if
you want to evaluate that approach.
This works alright in most cases, but this impl uses shlex.split which
does not handle unicode so I think some additional fencing is needed
at least. I think this would only be an issue with non-ascii
table/schema exclusions (which are very rare, I think), and probably
can be solved by just encoding to utf8 first.
> Need to get some VMs working to test with older Postgres versions and
> distros.
If you know what else you want to test on, we should be able to
provide temporary testing servers. You can email me off-list.
~Andy
As a small data point in the other direction, I ran the pgdump plugin
against Postgres-9.0beta3 on RHEL 5.4 without any problems. Of
course, you have to have the compat-libs to make rhel's psycopg2
package's libpq dependency happy.
On the other end of the spectrum, pgdump doesn't work on rhel4 with
the default postgres-7.4.
There error here is:
Traceback (most recent call last):
File "/usr/lib/python2.3/site-packages/holland/core/command/command.py",
line 201, in dispatch
return self.run(self.optparser.prog, opts, *args)
File "/usr/lib/python2.3/site-packages/holland/commands/backup.py",
line 95, in run
runner.backup(name, config, opts.dry_run)
File "/usr/lib/python2.3/site-packages/holland/core/backup/base.py",
line 102, in backup
estimated_size = self.check_available_space(plugin)
File "/usr/lib/python2.3/site-packages/holland/core/backup/base.py",
line 152, in check_available_space
estimated_bytes_required = plugin.estimate_backup_size()
File "/usr/lib/python2.3/site-packages/holland/backup/pgdump/interface.py",
line 79, in estimate_backup_size
totalestimate += get_db_size(db, self.connection)
File "/usr/lib/python2.3/site-packages/holland/backup/pgdump/base.py",
line 46, in get_db_size
cursor.execute("SELECT pg_database_size('%s')" % dbname)
ProgrammingError: function pg_database_size("unknown") does not exist
HINT: No function matches the given name and argument types. You may
need to add explicit type casts.
Obviously this is just due to the lack of the pg_database_size
function (which affects any version <8.1). This is easy to detect by
looking for exception.pgcode = 42883. I put in a workaround to
fallback to just setting size = 0 in this case and otherwise the
plugin worked for my test data. I think to support this otherwise we
would have to reconnect to each database and query pg_class directly
which isn't very pretty. This is probably not worth it as 7.4,8.0
were recently deprecated anyway but if it's not painful providing some
workaround might be useful. Although I doubt anyone conservative
enough to not have upgraded Postgres will want to run holland. :)
~Andy
Still, the error here is uncaught and needs to be handled at any rate.
It's not that much extra work to support these old versions. I
pushed a fix for this in my pgdump branch to fallback to querying
pg_class so this should work on old versions now.
> What about RHEL 4 with a newer Postgres? I'd be inclined to not worry
> too much about that either. (Especially because I failed installing
> CentOS 4 in VirtualBox ...)
python-psycopg2 works fine, although EPEL is required (which is a
required dependency to provide python-setuptools on this platform
anyway). pgrpms.org provides rhel4 packages all the way up to the
latest 8.4.4 so there are other solutions.
>> A better approach here is probably to maintain a separate args list
> for pg_dump.
>
> Perhaps so. Or start with a base argument list, which would include
> nearly all arguments, then have a variable to hold the final list for
> pg_dump and pg_dumpall.
That's what pgdump is doing per the fix I pushed yesterday. There are
two lists, connection_params and extra_options - the latter gets
passed only to pgdump and I slip --compress in there as needed.
~Andy
>> Although I doubt anyone conservative
>> enough to not have upgraded Postgres will want to run holland. :)
>
> I agree, and am not generally a fan of supporting old versions of
> things if it hinders progress for newer, cooler stuff.
Here here... I am on the side of providing support on RHEL4 for the minimal set of features... i.e. mysqldump. I wouldn't be apposed to adding a PG version check in the plugin and failing with 'unsupported version of postgres' type error. Anything beyond the basics we shouldn't be too concerned with older os's.
As far as packages go, we can optionally not build in any plugins that won't work on rhel4.
---
derks
Checking the version is essentially the same amount of work as just
working around the problem (and orders of magnitude less effort than
even the simple case in mysqldump) - I already committed a fix. The
main point here though is to handle the problem gracefully - if we run
into further issues we can just punt early and declare these versions
unsupported. I don't expect this will be the case, however.
~Andy