Multi-DB Support

227 views
Skip to first unread message

David Gouldin

unread,
Nov 11, 2011, 12:33:42 AM11/11/11
to South Users
Hi, all.

I'm intent on implementing true mult-db support and submitting a patch
to ticket #370:

http://south.aeracode.org/ticket/370

I'd like to start a discussion thread to come to a consensus on how
the feature should work. I'm -1 on migration sets as many migrations
will be applied to more than 1 database, and simply making a copy of
the same migration to store in different sets violates DRY. Instead,
I'd like to see a property on Migration, let's call it "applies_to",
which is a tuple specifying which database aliases the migration
should be applied to. For backwards compatibility, a missing
applies_to will be construed as (DEFAULT_DB_ALIAS,). So, a migration
class definition would look like:

class Migration(SchemaMigration):
applies_to = ('default', 'foo')

Since currently, forwards and backwards in Migration rely on the
global db object, their signature would need to change to include a db
argument whose value is the database object for the current database
the migration is being applied to. So, a forwards function definition
would look like:

def forwards(self, orm, db):
db.add_column(...)

South would iterate through a migration's applies_to aliases,
performing the migration on each database. I can't think of a way to
mak this backwards compatible, but I'm open to suggestions.

An issue to consider is dependency checking and the addition of
migration history records to different databases. I haven't come up
with a solution I'm satisfied with here. I feel that all databases
should include their own migration history table, but I'm uncertain if
each should contain records for all migrations or only those which
apply to that particular database. If the latter, there could
potentially be gaps in migration sequences, and south would need to
load all migration classes to determine which migrations should have
been applied to compare against entries in that database's migration
history table. So the former is certainly easier to deal with but is
a less accurate representation of which migrations were actually
applied to the database in question.

Another issue is transaction management. Currently, a migration is an
atomic unit, and a transaction can safely be committed once the
migration succeeds on a single database. However, if a migration is
being applied to multiple databases and succeeds on one but fails on
another, what should be done? Should we live the databases in
different states, or should we wait to commit all transactions until
we know the migration has succeeded on all databases it applies to?

I'll be on freendoe #django-south tomorrow (Friday 11/11/11) beginning
around 9am PST tomorrow morning. I'd like to get something worked up
by the end of the weekend if possible.

Jeremy Dunck

unread,
Nov 11, 2011, 1:26:15 PM11/11/11
to south...@googlegroups.com
On Thu, Nov 10, 2011 at 9:33 PM, David Gouldin <dgou...@gmail.com> wrote:

> Since currently, forwards and backwards in Migration rely on the
> global db object, their signature would need to change to include a db
> argument whose value is the database object for the current database
> the migration is being applied to.  So, a forwards function definition
> would look like:
>
> def forwards(self, orm, db):
>    db.add_column(...)

This would be backwards-compatible if south detected old migrations
(which don't take the db parameter) and call them "correctly" with the
default DB.

> An issue to consider is dependency checking and the addition of
> migration history records to different databases.  I haven't come up
> with a solution I'm satisfied with here.  I feel that all databases
> should include their own migration history table, but I'm uncertain if
> each should contain records for all migrations or only those which
> apply to that particular database.

I think for the purposes of depends_on, it'd be better to have a
record in each migration history table -- in my mind, these migrations
were "applied", even if that application was a no-op.

To keep this distinction clear, perhaps applies_to is a bad naming -
perhaps something more like "mutates" or "alters".

> Another issue is transaction management.  Currently, a migration is an
> atomic unit, and a transaction can safely be committed once the
> migration succeeds on a single database.  However, if a migration is
> being applied to multiple databases and succeeds on one but fails on
> another, what should be done?  Should we live the databases in
> different states, or should we wait to commit all transactions until
> we know the migration has succeeded on all databases it applies to?

Just to be clear, this issue only applies to DBs that allow
transactional DDL at all, and of the DBs south supports, only pg and,
debatably, SQL Server, support it. If a DB doesn't support
transactional DDL, all bets are off for a safe multi-db migration (or
any migration, really).

I think holding open the transactions until all are completed would
only be good if a migration were applied to all aliases in parallel -
otherwise, throughput on all shards (for example) would be hobbled for
an O(N=# shards) period. I get the issue of code being unaligned
with some (but not all) DBs.

There's also that chance that even after all migration work has been
done, one or more of the DBs fail the transaction upon attempting
commit - commit on A succeeds, B goes down before commit can succeed.
This is admittedly less of a problem than the chance that a given
migration has a bug or hits deadlock or ...

Would it be useful to think through different reasons people use multi-db?

1) write sharding
2) read slave
3) multi-master
4) unit-of-control (data from different sources/departments/cultures)
5) unit-of-access (some data is random-access, some is localized)

Are there other scenarios? Are any of these badly served by
applies_to + held-open transactions, possibly parallelized?

For what it's worth, our (Votizen) immediate need is # 4, and we're on
MySQL, so this idea of transactional DDL is aside from our immediate
need. :-/

Andrew Godwin

unread,
Nov 12, 2011, 11:55:46 AM11/12/11
to south...@googlegroups.com
On 11/11/11 18:26, Jeremy Dunck wrote:
> On Thu, Nov 10, 2011 at 9:33 PM, David Gouldin <dgou...@gmail.com> wrote:
>
>> Since currently, forwards and backwards in Migration rely on the
>> global db object, their signature would need to change to include a db
>> argument whose value is the database object for the current database
>> the migration is being applied to. So, a forwards function definition
>> would look like:
>>
>> def forwards(self, orm, db):
>> db.add_column(...)
>
> This would be backwards-compatible if south detected old migrations
> (which don't take the db parameter) and call them "correctly" with the
> default DB.

This is why the import at the top of migrations is "from south.v2 import
..." - new migrations can be v3, and thus the change can be detected
very easily.

>
>> An issue to consider is dependency checking and the addition of
>> migration history records to different databases. I haven't come up
>> with a solution I'm satisfied with here. I feel that all databases
>> should include their own migration history table, but I'm uncertain if
>> each should contain records for all migrations or only those which
>> apply to that particular database.
>
> I think for the purposes of depends_on, it'd be better to have a
> record in each migration history table -- in my mind, these migrations
> were "applied", even if that application was a no-op.
>
> To keep this distinction clear, perhaps applies_to is a bad naming -
> perhaps something more like "mutates" or "alters".

This is a very, very tricky thing. Really, it should be the case that if
you swap some database X for a fresh one, and re-run South, it applies
the correct migrations to that database.

To that end, I think the best direction is to record all migrations in
all databases (and also write the running code that way) - migrations
that are not destined for the currently-running database are simply
no-opped and put in the record table anyway.

>
>> Another issue is transaction management. Currently, a migration is an
>> atomic unit, and a transaction can safely be committed once the
>> migration succeeds on a single database. However, if a migration is
>> being applied to multiple databases and succeeds on one but fails on
>> another, what should be done? Should we live the databases in
>> different states, or should we wait to commit all transactions until
>> we know the migration has succeeded on all databases it applies to?
>
> Just to be clear, this issue only applies to DBs that allow
> transactional DDL at all, and of the DBs south supports, only pg and,
> debatably, SQL Server, support it. If a DB doesn't support
> transactional DDL, all bets are off for a safe multi-db migration (or
> any migration, really).
>
> I think holding open the transactions until all are completed would
> only be good if a migration were applied to all aliases in parallel -
> otherwise, throughput on all shards (for example) would be hobbled for
> an O(N=# shards) period. I get the issue of code being unaligned
> with some (but not all) DBs.
>
> There's also that chance that even after all migration work has been
> done, one or more of the DBs fail the transaction upon attempting
> commit - commit on A succeeds, B goes down before commit can succeed.
> This is admittedly less of a problem than the chance that a given
> migration has a bug or hits deadlock or ...

I say the migrations should be applied in series to each database; i.e.

for migration in migrations:
for database in databases:
migration.run(database)

If a migration fails, roll it back on that database it failed on, and
abort there. Then, when South re-runs, database A has it applied (and a
record to that affect), B doesn't, and so it only gets run on B.

There might be some work needed to the "migrate" command to allow
migrating forwards/backwards on single databases, should they get out of
sync.

> Would it be useful to think through different reasons people use multi-db?
>
> 1) write sharding
> 2) read slave
> 3) multi-master
> 4) unit-of-control (data from different sources/departments/cultures)
> 5) unit-of-access (some data is random-access, some is localized)
>
> Are there other scenarios? Are any of these badly served by
> applies_to + held-open transactions, possibly parallelized?
>
> For what it's worth, our (Votizen) immediate need is # 4, and we're on
> MySQL, so this idea of transactional DDL is aside from our immediate
> need. :-/

All of those I've heard of being used, but 2) certainly doesn't matter
with South - the read slaves should propagate the schema changes.

I'm still in favour of not making this too automatic, as we'll never be
everything to everyone. However, I think the idea as a whole is a decent
improvement on the current (and sorry) state of affairs.

Andrew

David Gouldin

unread,
Nov 12, 2011, 1:20:31 PM11/12/11
to south...@googlegroups.com
There's already a --database switch. That should work for migrating against a single database, though after this change it will only run migrations which apply to the database specified instead of all migrations.

All of this sounds reasonable. I think I have enough direction to get started.

> --
> You received this message because you are subscribed to the Google Groups "South Users" group.
> To post to this group, send email to south...@googlegroups.com.
> To unsubscribe from this group, send email to south-users...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/south-users?hl=en.
>

Shai Berger

unread,
Nov 14, 2011, 4:29:19 AM11/14/11
to south...@googlegroups.com
On Saturday 12 November 2011 18:55:46 Andrew Godwin wrote:
> On 11/11/11 18:26, Jeremy Dunck wrote:
> > On Thu, Nov 10, 2011 at 9:33 PM, David Gouldin <dgou...@gmail.com> wrote:
>
> >> Another issue is transaction management. Currently, a migration is an
> >> atomic unit, and a transaction can safely be committed once the
> >> migration succeeds on a single database. However, if a migration is
> >> being applied to multiple databases and succeeds on one but fails on
> >> another, what should be done? Should we live the databases in
> >> different states, or should we wait to commit all transactions until
> >> we know the migration has succeeded on all databases it applies to?
> >
> > [...]

> >
> > I think holding open the transactions until all are completed would
> > only be good if a migration were applied to all aliases in parallel -
> > otherwise, throughput on all shards (for example) would be hobbled for
> > an O(N=# shards) period. I get the issue of code being unaligned
> > with some (but not all) DBs.
> >
> > There's also that chance that even after all migration work has been
> > done, one or more of the DBs fail the transaction upon attempting
> > commit - commit on A succeeds, B goes down before commit can succeed.
> > This is admittedly less of a problem than the chance that a given
> > migration has a bug or hits deadlock or ...
>

Two points:

1) A migration -- in particular, a data migration -- may well fail only on
commit, because Foreign Key constraints are deferred; which means they only
get checked on commit.

2) People who want to have unified transactions across databases use 2-phase
commit. It doesn't quite ensure the transactions' state consistency, but it
makes things a lot better. When multiple databases were introduced into
Django, the idea of dealing with 2-phase commit was rejected; developers
preferred the attitude of documentation and prayers. While you can guess I
disapprove, I see no sense in making South holier than Django-core in that
regard.

> I say the migrations should be applied in series to each database; i.e.
>
> for migration in migrations:
> for database in databases:
> migration.run(database)
>
> If a migration fails, roll it back on that database it failed on, and
> abort there. Then, when South re-runs, database A has it applied (and a
> record to that affect), B doesn't, and so it only gets run on B.
>

I agree.

> > Would it be useful to think through different reasons people use
> > multi-db?
> >
> > 1) write sharding
> > 2) read slave
> > 3) multi-master
> > 4) unit-of-control (data from different sources/departments/cultures)
> > 5) unit-of-access (some data is random-access, some is localized)
> >
> > Are there other scenarios?

See https://code.djangoproject.com/wiki/MultipleDatabaseSupport and (linked
from it) http://groups.google.com/group/django-
developers/browse_thread/thread/9f0353fe0682b73?pli=1

I think that whole thread should be required reading for anyone wishing to
implement multi-db support. It includes discussions and arguments and
reasonings for decisions taken and whatnot.

Also, I'm not sure if my own favorite use case (separate transaction control:
writing a log to the database, which gets committed whether or not the "main"
transaction is committed) is covered by 4) above.

Have fun,
Shai.

Rosen Diankov

unread,
Dec 9, 2011, 1:23:36 AM12/9/11
to South Users
By playing around with the model._meta.managed attribute before
schemamigration is called, it is possible to select which models get
migrated in which databases

Rosen Diankov

unread,
Dec 9, 2011, 12:01:46 AM12/9/11
to South Users
hi all,

Unfortunately this doesn't answer how to manage different models in
different databases. For example, model.A only goes in database 'A'
and model.B only in database 'B'. looking through the south.changes
source code, the following statement pops out:

# Don't do anything for unmanaged, abstract or proxy models
if model._meta.abstract or getattr(model._meta, "proxy", False) or not
getattr(model._meta, "managed", True):
continue


This says that if the model.A._meta.managed variable is False, then
the model will be ignored.
Before south migration is called, by:

1. playing around with the "_managed" attributes,
2. changing the south.db.db database, and
3. managing migrations separately for each database,

it might be possible to get all multi-db working as it. should.what do
you think?

rosen,

David Gouldin

unread,
Dec 9, 2011, 12:02:52 PM12/9/11
to south...@googlegroups.com
Honestly, it sounds more like a hack than a solution to me. I do agree, though, that models of the same app which exist in different sets of databases should be split from each other when auto-generating schema migrations.

Jeremy Dunck

unread,
Dec 10, 2011, 9:41:51 PM12/10/11
to south...@googlegroups.com
On Fri, Dec 9, 2011 at 9:02 AM, David Gouldin <dgou...@gmail.com> wrote:
> Honestly, it sounds more like a hack than a solution to me. I do agree, though, that models of the same app which exist in different sets of databases should be split from each other when auto-generating schema migrations.

Just to be clear, it happens fairly often that allow_syncdb will
return True for multiple DBs for any given model. This is useful in
sharding, and is the reason allow_relation exists - any 2 model
instances might be sharded to the same replica (along some
app-convenient partition). It's not correct to assume that any model
would go to one (and only one) DB.

Reply all
Reply to author
Forward
0 new messages