Unintuitive behaviour of management commands with multiple databases

Vinay Sajip

Feb 2, 2016, 10:11:31 AM
to Django users
I've set up a simple project with two databases, foo and bar, in settings.py:

DATABASES = {
    'foo': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'foo.sqlite'),
    },
    'bar': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'bar.sqlite'),
    }
}
DATABASES['default'] = DATABASES['foo']


I have a simple model:

class Thing(models.Model):
    name = models.CharField(max_length=20, unique=True)

I have a simple management command:

from django.core.management.base import BaseCommand
from django.db import IntegrityError, transaction

from hello.models import Thing  # app label inferred from the hello_thing table in the error below


class Command(BaseCommand):
    help = 'Add all the things.'

    def add_arguments(self, parser):
        parser.add_argument('--database', nargs='?', metavar='DATABASE',
                            default='default',
                            help='Database alias to use')
        parser.add_argument('things', nargs='+', metavar='THING',
                            help='Things to add')

    def handle(self, *args, **options):
        alias = options['database']
        self.stdout.write('using alias %s' % alias)
        with transaction.atomic(using=alias):
            for name in options['things']:
                try:
                    Thing.objects.create(name=name)
                    self.stdout.write('Added %s.\n' % name)
                except IntegrityError as e:
                    self.stderr.write('Failed to add thing %r: %s' % (name, e))
                    break

After running migrations to set up the two databases, using python manage.py migrate --data foo and python manage.py migrate --data bar, I then run python manage.py add_things fizz buzz, which results in two records being added to foo.sqlite, as expected.

If I then run python manage.py add_things fizz buzz --data bar, it seems reasonable to expect it to add the records to bar.sqlite. However, this is not what happens: it tries to add them to foo.sqlite, even though I've specified using=alias with the alias set to bar. So I get a constraint violation:

using alias bar
Failed to add thing 'fizz': UNIQUE constraint failed: hello_thing.name

What have I overlooked? In a real case the atomic block might be manipulating lots of models in nested code, and I can't see that it's practical to call using() for every model. Somewhere, it looks like Django code is caching a connection based on what settings.DATABASES['default'] was when settings was imported, even though it is being overridden on the command line. If this is not actually a bug, the behaviour doesn't seem particularly intuitive, so any advice would be gratefully received.

Regards,

Vinay Sajip

Carl Meyer

Feb 2, 2016, 10:27:10 AM
to django...@googlegroups.com
Hi Vinay,

On 02/02/2016 08:11 AM, 'Vinay Sajip' via Django users wrote:
> [snip]

Nothing in your code ever "overrides" settings.DATABASES['default'].
Opening an atomic block on a certain database has no effect on the
default database within that block; it just opens a transaction on the
requested database. "Which database is the default" and "which
database(s) have transactions open on them" are completely orthogonal
questions; there is no implicit link between them. Nothing prevents you
from having simultaneous transactions open on multiple databases, and
interspersing queries to different databases while those transactions
are open.
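
Concretely, with the Thing model from your message, the explicit form
would be something like this (a sketch, untested):

from django.db import transaction

from hello.models import Thing  # app label inferred from the thread

# A transaction is open on 'bar', but unqualified queries still go to
# 'default'; routing the query to 'bar' has to be explicit:
with transaction.atomic(using='bar'):
    Thing.objects.using('bar').create(name='fizz')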

If calling .using() explicitly to use a non-default database is
burdensome (and the situation can't be handled automatically with a
custom database router class), you could look at a third-party solution
like django-dynamic-db-router [1], which provides an `in_database`
context manager that does what you were expecting `transaction.atomic`
to do for you.
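
A rough usage sketch, based on my reading of that quickstart (it assumes
the package's router has been added to DATABASE_ROUTERS as the
quickstart describes; write=True is needed because, as far as I can
tell, it routes only reads by default):

from dynamic_db_router import in_database

with in_database('bar', write=True):
    # Reads and writes inside this block are routed to the 'bar' alias.
    Thing.objects.create(name='fizz')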

Carl

[1]
https://github.com/ambitioninc/django-dynamic-db-router/blob/master/docs/quickstart.rst

Vinay Sajip

Feb 2, 2016, 11:52:06 AM
to Django users

On Tuesday, February 2, 2016 at 3:27:10 PM UTC, Carl Meyer wrote:
> Nothing in your code ever "overrides" settings.DATABASES['default'].

Dear Carl,

Thanks for the quick response. I had thought of that, and tried adding the statement settings.DATABASES['default'] = settings.DATABASES[alias] just before the with block, and it had no effect - the result was the same - so I took it out. How else are you supposed to override the default database?

Regards,

Vinay Sajip

Carl Meyer

Feb 2, 2016, 12:08:40 PM
to django...@googlegroups.com
Hi Vinay,

On 02/02/2016 09:52 AM, 'Vinay Sajip' via Django users wrote:
>
> On Tuesday, February 2, 2016 at 3:27:10 PM UTC, Carl Meyer wrote:
>
> Nothing in your code ever "overrides" settings.DATABASES['default'].
>
>
> Dear Carl,
>
> Thanks for the quick response. I /had/ thought of that, and tried adding
> the statement settings.DATABASES['default'] = settings.DATABASES[alias]
> just before the with block, and it had no effect - the result was the
> same - so I took it out. How else are you supposed to override the
> default database?

You can't (well, you might be able to by poking around in the internals
of the django.db.connections object, but I'd strongly advise against
that). The proper (and thread-safe) way to achieve the equivalent is to
do what the third-party project I linked (django-dynamic-db-router)
does: write a db router that references some thread-local state to
determine which database to route queries to, and write a context
manager or whatever to modify that thread-local state. Or just use
django-dynamic-db-router, since it already does what you want :-)
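
To make that concrete, here's a minimal sketch of the idea
(ThreadLocalRouter and in_database are names I've made up for
illustration; django-dynamic-db-router's real implementation is more
complete):

import threading
from contextlib import contextmanager

_local = threading.local()


class ThreadLocalRouter(object):
    """Route reads and writes to the alias held in thread-local state.

    Returning None falls through to Django's default routing. Needs
    DATABASE_ROUTERS = ['path.to.ThreadLocalRouter'] in settings.
    """
    def db_for_read(self, model, **hints):
        return getattr(_local, 'alias', None)

    def db_for_write(self, model, **hints):
        return getattr(_local, 'alias', None)


@contextmanager
def in_database(alias):
    """Temporarily route ORM queries on this thread to `alias`."""
    previous = getattr(_local, 'alias', None)
    _local.alias = alias
    try:
        yield
    finally:
        _local.alias = previous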

Carl

Vinay Sajip

Feb 2, 2016, 12:56:03 PM
to Django users
Dear Carl,


On Tuesday, February 2, 2016 at 5:08:40 PM UTC, Carl Meyer wrote:
> You can't (well, you might be able to by poking around in the internals
> of the django.db.connections object, but I'd strongly advise against
> that). The proper (and thread-safe) way to achieve the equivalent is to
> [...]

Well, for any given run of the application (on which the example I posted is based), only one database is ever used. The idea of multiple databases was to allow us to select which one is used for any particular Python process run, and we expected to be able to run management commands which would (based on a settings configuration) determine both which database to use and e.g. how to populate parts of it ready for deployment to particular customers.

Oddly, setting django.db.utils.DEFAULT_DB_ALIAS = alias does the trick. Yes, I know that's a no-no, but it seems to me that for this kind of case, where you don't need run-time routing, something like django-dynamic-db-router shouldn't really be needed, useful though it might be for the run-time routing case. Everything else works off the settings module, and the fact that Django caches the default database to use, such that a management command can't change it, seems like a design flaw.

I suppose I was expecting transaction.atomic(using=XXX) not just to open a transaction on that database, but also to make it the default for the scope of the block (except if explicitly overridden by using() for particular models in code called from that block). The current behaviour seems to violate the principle of least surprise (this is my first encounter with multiple databases). Is there really no case for an in_database() context manager in Django itself? I would still expect that behaviour from transaction.atomic(), of course, but backwards compatibility is a constraint I'm aware of :-)

Regards,

Vinay Sajip

Carl Meyer

Feb 2, 2016, 1:23:34 PM
to django...@googlegroups.com
Hi Vinay,

On 02/02/2016 10:56 AM, 'Vinay Sajip' via Django users wrote:
> On Tuesday, February 2, 2016 at 5:08:40 PM UTC, Carl Meyer wrote:
>
> You can't (well, you might be able to by poking around in the internals
> of the django.db.connections object, but I'd strongly advise against
> that). The proper (and thread-safe) way to achieve the equivalent is to
>
> Well, for any given run of the application (on which the example I
> posted is based), only one database is ever used. The idea of multiple
> databases was to allow us to select which one is used for any particular
> Python process run, and we expected to be able to run management
> commands which would (based on a settings configuration) determine both
> which database to use and e.g. how to populate parts of it ready for
> deployment to particular customers.

For this scenario, I would use a different settings file for each
"customer" (they can all "inherit" the same common settings via the
`from common_settings import *` trick), with a different default
database defined in each. In fact, is there any reason for one process
to have access to another process' database at all, in your scenario? I
don't think you even have a use case for Django's multi-db support
here, just a use case for multiple settings file variants. Then you can
use the --settings option with management commands to choose which
settings to use (or the DJANGO_SETTINGS_MODULE env var where that's
convenient); you don't need to define your own custom option.
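
For example, something like this (a sketch; common_settings and the
per-customer module name are placeholders, and BASE_DIR is assumed to
come from the common settings):

# settings_customer_foo.py
import os

from common_settings import *  # shared settings; assumed to define BASE_DIR

# The only per-customer difference: which database is the default.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'foo.sqlite'),
    },
}

Then:

python manage.py add_things fizz buzz --settings=settings_customer_foo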

Django's multi-db support is designed for what you're calling "run-time
routing," where a single process needs to access multiple different
databases at different times. That's not your situation; I'm not sure
you even need or want multi-db.

> Oddly, setting
> django.db.utils.DEFAULT_DB_ALIAS = alias does the trick. Yes, I know
> that's a no-no, but it seems to me that for this kind of case, where you
> don't need run-time routing, something like django-dynamic-db-router
> shouldn't really be needed, useful though it might be for the run-time
> routing case. Everything else works off the settings module, and the
> fact that Django caches the default database to use, such that a
> management command can't change it, seems like a design flaw.

The "design flaw" you are observing here is not specific to databases,
but to Django settings. Django settings in general are not designed to
be modified at runtime; they define a static configuration for a given
process. Some settings can in practice safely be modified on-the-fly
(but still probably shouldn't be), some cannot, because their value is
cached for efficiency or other reasons: in general modifying settings at
runtime is not a supported technique. And it's not a _necessary_
technique either, in your case, when you could just instead use the
right settings for each process from the beginning.

> I suppose I
> was expecting transaction.atomic(using=XXX) not just to open a
> transaction on that database, but also to make it the default for the
> scope of the block (except if explicitly overridden by using() for
> particular models in code called from that block). The current behaviour
> seems to violate the principle of least surprise (this is my first
> encounter with multiple databases).

I think Django's design in this case is superior to the one you propose,
because it keeps two concepts orthogonal that don't need to be linked
(transactions and multi-db-routing) and avoids implicit global or
thread-local state changes in operations that don't obviously imply such
changes, making the overall system more flexible and predictable.

I suppose we'd need a more scientific survey to establish which behavior
is more surprising to more people :-)

> Is there really no case for an in_database()
> context manager in Django itself?

Perhaps. In my experience the more typical uses for multi-db are
amenable to other types of routing (e.g. models X, Y are always routed
to database A, model Z is always routed to database B, or more complex
schemes), and dynamically-scoped routing (such as that provided by
`in_database`) isn't as commonly needed.

Django provides the database router abstraction, which is a very
flexible system for implementing whatever kind of multi-db routing you
need. The burden of proof is heavy on any particular routing scheme to
demonstrate that it is so frequently needed that it should be bundled
with Django itself, rather than being a separate reusable package for
those who need it.

Carl

Vinay Sajip

Feb 2, 2016, 2:14:46 PM
to Django users
Dear Carl,


On Tuesday, February 2, 2016 at 6:23:34 PM UTC, Carl Meyer wrote:
> For this scenario, I would use a different settings file for each
> "customer" (they can all "inherit" the same common settings via the
> `from common_settings import *` trick), with a different default
> database defined in each. In fact, is there any reason for one process
> to have access to another process' database at all, in your scenario? I
> don't think you even have a use case for Django's multi-db support
> here, just a use case for multiple settings file variants. Then you can
> use the --settings option with management commands to choose which
> settings to use (or the DJANGO_SETTINGS_MODULE env var where that's
> convenient); you don't need to define your own custom option.

You're right, of course, this is a viable option.

> Django's multi-db support is designed for what you're calling "run-time
> routing," where a single process needs to access multiple different
> databases at different times. That's not your situation; I'm not sure
> you even need or want multi-db.

No, we were looking at it as a possible convenient way of configuring things, but then again, perhaps not ;-)
 
The "design flaw" you are observing here is not specific to databases,
but to Django settings. Django settings in general are not designed to
be modified at runtime; they define a static configuration for a given
process. Some settings can in practice safely be modified on-the-fly
(but still probably shouldn't be), some cannot, because their value is
cached for efficiency or other reasons: in general modifying settings at
runtime is not a supported technique. And it's not a _necessary_
technique either, in your case, when you could just instead use the
right settings for each process from the beginning.

Yes, but a management command run is a process run, and having read documentation and examples saying that you can pass --data to commands to operate on a particular database, I find it doesn't work as expected/advertised. Obviously there are many ways to skin this particular cat, including using multiple settings files, which we of course do for things like dev vs. test vs. production etc. The idea was that we could use a special settings file just for populating customer databases, with a common set of settings ... of course we can use the usual technique of "inheriting" settings from other modules, and so on.

> I suppose I
> was expecting transaction.atomic(using=XXX) not just to open a
> transaction on that database, but also to make it the default for the
> scope of the block (except if explicitly overridden by using() for
> particular models in code called from that block). The current behaviour
> seems to violate the principle of least surprise (this is my first
> encounter with multiple databases).

> I think Django's design in this case is superior to the one you propose,
> because it keeps two concepts orthogonal that don't need to be linked
> (transactions and multi-db-routing) and avoids implicit global or
> thread-local state changes in operations that don't obviously imply such
> changes, making the overall system more flexible and predictable.
>
> I suppose we'd need a more scientific survey to establish which behavior
> is more surprising to more people :-)

Note that my post is about management commands, specifically. In a multi-DB situation, there could well be times when you want specific management commands to run against different databases each time they're invoked. What you are saying is that this can't be done using a --data argument to the command, where there are typically no issues of thread-locality or the need to dynamically route to DBs based on HTTP requests or models. Not a very common use case, to be sure, but not especially esoteric, either.

It seems there is no way to do this in the most natural-seeming way (--data) with Django out of the box. If I understand you correctly, you would have to set up multiple settings files, one for each database, where the only difference between one settings file and the others created for this purpose would be the value of DATABASES['default'], and then invoke the command with DJANGO_SETTINGS_MODULE set to the particular settings file for the database you wanted to operate on for that run. Possible to do, of course, but it feels like a workaround rather than a natural approach.

> Is there really no case for an in_database()
> context manager in Django itself?

> Perhaps. In my experience the more typical uses for multi-db are
> amenable to other types of routing (e.g. models X, Y are always routed
> to database A, model Z is always routed to database B, or more complex
> schemes), and dynamically-scoped routing (such as that provided by
> `in_database`) isn't as commonly needed.

Agreed, this is not a very common use case.

> Django provides the database router abstraction, which is a very
> flexible system for implementing whatever kind of multi-db routing you
> need. The burden of proof is heavy on any particular routing scheme to
> demonstrate that it is so frequently needed that it should be bundled
> with Django itself, rather than being a separate reusable package for
> those who need it.

I'm not arguing for any particular different routing scheme to be included - only for management commands to be able to respect --data arguments passed to them, treating the passed value as the default database to use just for that command invocation, when that value is different from whatever settings.DATABASES['default'] is. A quick look at the built-in management commands shows there's a lot of usage of connections[options.get('database')] going on, and yet externally written management commands aren't encouraged to use the same approach, and multiple settings files are the suggested alternative?

Regards,

Vinay Sajip

Carl Meyer

Feb 2, 2016, 2:51:20 PM
to django...@googlegroups.com
Hi Vinay,

On 02/02/2016 12:14 PM, 'Vinay Sajip' via Django users wrote:
> I'm not arguing for any particular different routing scheme to be
> included - only for management commands to be able to respect --data
> arguments passed to them, treating the passed value as the default
> database to use just for that command invocation, when that value is
> different from whatever settings.DATABASES['default'] is. A quick look
> at the built-in management commands shows there's a lot of usage of
> connections[options.get('database')] going on, and yet externally
> written management commands aren't encouraged to use the same approach,
> and multiple settings files are the suggested alternative?

There's nothing wrong with connections[options.get('database')] if you
want to get a specific database connection object; that's public API and
you can use it. In the case of the built-in commands, that's useful
because they need to do low-level things with the connection; I don't
think it helps you for regular ORM use. It doesn't allow changing the
default connection for ORM usage. Where the built-in management commands
do use the higher-level ORM, they use the public `using` APIs.
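
For illustration, here's a hypothetical snippet of that kind of
low-level use (the hello_thing table name comes from the error earlier
in the thread):

from django.core.management.base import BaseCommand
from django.db import connections


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument('--database', default='default')

    def handle(self, *args, **options):
        # Public API: look up a specific connection object by alias.
        connection = connections[options['database']]
        # Raw SQL runs against whichever database the alias names.
        with connection.cursor() as cursor:
            cursor.execute('SELECT COUNT(*) FROM hello_thing')
            (count,) = cursor.fetchone()
        self.stdout.write('%d things in %s' % (count, options['database']))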

In other words, the built-in management commands aren't doing anything
different from what I originally recommended you do (just use `using`).

If you want to write a management command that has a --database option
like the built-in ones and makes heavy use of the ORM, you can a)
explicitly provide `using` where needed, b) use
django-dynamic-db-router, or c) hold your nose and monkeypatch
DEFAULT_DB_ALIAS, since it's just a one-shot management command and you
don't need to be as concerned about cleaning up after yourself.

Personally I would go for (a) or (b), depending on just how much I was
using the ORM in the management command. I'm not convinced that a fourth
(or third supported public) option is necessary here. If it's just a few
queries, you use `using`; if it's more than that and you want a broader
policy applied, you use a database router.
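
For completeness, option (c) would look something like this - a sketch
relying on your earlier observation that rebinding
django.db.utils.DEFAULT_DB_ALIAS does the trick, and only defensible in
a short-lived management command:

from django.core.management.base import BaseCommand
from django.db import utils


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument('--database', default='default')

    def handle(self, *args, **options):
        # One-shot hack: rebind the module global that default routing
        # falls back to. Not thread-safe; never do this in server code.
        utils.DEFAULT_DB_ALIAS = options['database']
        # ... ORM calls from here on treat that alias as the default.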

Carl

Vinay Sajip

Feb 2, 2016, 3:01:04 PM
to Django users
Dear Carl,


On Tuesday, February 2, 2016 at 7:51:20 PM UTC, Carl Meyer wrote:
> There's nothing wrong with connections[options.get('database')] if you
> want to get a specific database connection object; that's public API and
> you can use it. In the case of the built-in commands, that's useful

OK, perhaps I misunderstood what you said in an earlier comment - I see you were advising against using the internals of the connection rather than the connection itself.

> because they need to do low-level things with the connection; I don't
> think it helps you for regular ORM use. It doesn't allow changing the
> default connection for ORM usage. Where the built-in management commands
> do use the higher-level ORM, they use the public `using` APIs.
 
That's fine for simple use, but it's not good to have lots of using() calls for models when it's not a "real" multi-DB scenario.
 
> If you want to write a management command that has a --database option
> like the built-in ones and makes heavy use of the ORM, you can a)
> explicitly provide `using` where needed, b) use
> django-dynamic-db-router, or c) hold your nose and monkeypatch
> DEFAULT_DB_ALIAS, since it's just a one-shot management command and you
> don't need to be as concerned about cleaning up after yourself.

> Personally I would go for (a) or (b), depending on just how much I was
> using the ORM in the management command. I'm not convinced that a fourth
> (or third supported public) option is necessary here. If it's just a few
> queries, you use `using`; if it's more than that and you want a broader
> policy applied, you use a database router.
 
OK, thanks for the suggestions - you've been very helpful, as always. I'll mull things over and see what's best for the specifics of the project.

Regards,

Vinay Sajip