Decoupling the ORM

Samuel Bishop

Dec 15, 2015, 10:43:55 AM
to Django developers (Contributions to Django itself)
Having worked through the code of several Django nosql/alternative database backend libraries, forks, etc... 

I've noticed that one of the biggest challenges they run into is 'conforming' to many of the things Django expects these lowest layers to do.

I opened this ticket https://code.djangoproject.com/ticket/25265 to begin getting feedback on an initial idea for how to 'fix' the problem.
Since then I've had further time to ponder the problem. It still seems to me that the best mechanism is to draw a line between the 'upper' and 'lower' layers of Django, but I'm no longer 100% sure that the correct place to enable this is the queryset via an additional method. I've realized that this is not just an opportunity to get NoSQL databases into Django, but also an opportunity to finally provide support for alternative Python ORMs, such as SQLAlchemy.

I've been digging around the code for this, so I don't mind writing it up, but there is the big question of 'where to decouple' things. Initial feedback in the thread https://code.djangoproject.com/ticket/25265#comment:4 has raised the suggestion that moving one layer further up may be the right place to go. It would be very helpful for me to get extra input from Django developers familiar with the QuerySet and Query before I start writing, so I would love to hear feedback on the idea.

Anssi Kääriäinen

Dec 16, 2015, 2:33:08 AM
to Django developers (Contributions to Django itself)

Assume the goal is perfect admin integration with a MongoDB backend. The approach can be either:
1) Use Django's standard models, create a QuerySet compatible MongoDBQuerySet.
2) Use completely different models, which respond to the APIs needed by Admin. This includes implementing a QuerySet compatible MongoDBQuerySet.

There is a lot more work to 2), but the benefit is that you get to use models actually meant to be used with a non-relational backend. For example, Django's User, Permission and Group models are implemented in a way that makes sense for a relational backend. If you use relational schema on non-relational database you are going to face big problems if you try to run the site with any non-trivial amount of data. For this reason I believe 2) to be the right approach.

But, to get there, a QuerySet compatible MongoDBQuerySet is needed anyways. Here the choices are those mentioned in https://code.djangoproject.com/ticket/25265#comment:4. That is, you can go with Django's QuerySet and Query, and just implement a MongoDBCompiler. Or, you can use QuerySet with MongoDBQuery class. Or, finally, you can implement MongoDBQuerySet directly from scratch.

If you implement Compiler or Query, you are targeting internal APIs which we *will* change in the future, maybe even in dramatic ways. If you target QuerySet, you are targeting a public API that doesn't change often. And, even if it changes, you will get a nice deprecation period for the changes.

It might seem a lot of work to start from QuerySet, but for a non-relational backend there isn't actually *that* much work involved. Most of Django's Query and Compiler classes deal with joins, SQL's NULL peculiarities or SQL's way of doing aggregations. All of these are non-issues for non-relational backends.

So, I think you should start with implementing a custom QuerySet for the backend you want. You can also try to make it work with all Django models, but that approach is very likely to fail. For starters, Django's models use an autoincrementing integer field for primary key, whereas most (if not all) nonrelational databases use something different. Another interesting case is ManyToManyField, which assumes a relational data model.
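As a rough illustration of targeting the public QuerySet API rather than Compiler or Query, here is a minimal sketch of a from-scratch QuerySet-like class. `DocumentQuerySet` and its in-memory list of dicts are hypothetical stand-ins for a real non-relational driver. Only the method names (`filter`, `order_by`, `first`, `count`) mirror Django's public API, and only exact-match lookups are handled.

```python
class DocumentQuerySet:
    """Hypothetical QuerySet-like class over a list of dicts (stand-in for a document store)."""

    def __init__(self, documents, filters=None, ordering=None):
        self._documents = documents
        self._filters = dict(filters or {})
        self._ordering = ordering

    def _clone(self, **changes):
        # Each chained call returns a new object, as Django's QuerySet does.
        state = {"filters": self._filters, "ordering": self._ordering}
        state.update(changes)
        return DocumentQuerySet(self._documents, **state)

    def filter(self, **lookups):
        merged = dict(self._filters)
        merged.update(lookups)
        return self._clone(filters=merged)

    def order_by(self, field):
        return self._clone(ordering=field)

    def __iter__(self):
        # Evaluation is lazy: nothing is matched or sorted until iteration.
        rows = [d for d in self._documents
                if all(d.get(k) == v for k, v in self._filters.items())]
        if self._ordering is not None:
            reverse = self._ordering.startswith("-")
            key = self._ordering.lstrip("-")
            rows.sort(key=lambda d: d[key], reverse=reverse)
        return iter(rows)

    def count(self):
        return sum(1 for _ in self)

    def first(self):
        return next(iter(self), None)


docs = [
    {"_id": "b7", "name": "beta", "active": True},
    {"_id": "a3", "name": "alpha", "active": True},
    {"_id": "c1", "name": "gamma", "active": False},
]
active = DocumentQuerySet(docs).filter(active=True).order_by("name")
```

Note that nothing here deals with joins, NULL handling or SQL aggregation, which is the point made above about how much of Query and Compiler a non-relational backend can skip.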

It is very tempting to go with an approach where you just implement a custom Compiler class for your nonrelational backend. This would, in theory, allow users to run any existing Django application on non-relational database by just using a non-relational backend. The problem with this approach is that it doesn't work well enough in practice, and the maintenance overhead in the long run is huge.

 - Anssi

Ola Sitarska

Dec 16, 2015, 3:01:44 AM
to django-d...@googlegroups.com
I'm definitely not going to be more helpful than Anssi, but feel free to take a look at Djangae, a Django database backend for the non-relational datastore on Google App Engine. We've also got a chat and a mailing list if you want to talk with us.

Here are also some implementation details.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/4e40d965-37dc-428b-b9e8-508664db6b91%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Samuel Bishop

Dec 17, 2015, 1:00:09 AM
to Django developers (Contributions to Django itself)
After many years of idly wanting the feature, and having looked at the existing "state of the art" for Django and alternative databases, it was Djangae that convinced me this is doable and got me "over the ditch" to where I now stand: Djangae on one side of me as proof, a working example that Django can exist on this 'brave new soil' so to speak, while I look ahead at several concrete options for pushing even further. So you've been a lot more helpful than you think ;-)

Samuel Bishop

Dec 17, 2015, 4:55:08 AM
to Django developers (Contributions to Django itself)
By the time I opened the issue ticket I had become convinced that the DB Compiler was effectively an impossible route. I completely agree with your sentiments about implementing Compiler. I'd go as far as to suggest that a few small documentation changes may be warranted in order to explain to future developers that they should not take this route if their database is not "relational enough".

Using different models has some advantages in that it can take full advantage of the underlying database's capabilities. But it does sacrifice compatibility with a significant number of existing Django packages, so putting aside the additional complexity, it's not the target I'm aiming for.

I'm definitely coming to the conclusion that Queryset is the correct place to start work. 
I think there are a number of issues this will expose/create, such as issues related to UUID usage especially as Primary Keys, and from just a few minutes re-reading the queryset class I also think there may be a need to clarify  

So far, I've found the following problems related to UUIDs that might get in the way of 'finishing' this work.

Existing issues: 

I've identified one new "issue".
There is an implicit assumption that primary keys are useful for ordering by the current QuerySet API methods `.first()` and `.last()`.
I'll raise an issue for this item after I give an opportunity for further discussion here since I'd like to have more of an idea regarding typical usage of these two queryset methods. I'm currently unsure how often these are used on unordered QuerySet objects. If the current behaviour of implicitly falling back to ordering by the primary key is in heavy use, I will need to take that into consideration. In the shorter term I currently have a few possible workarounds in mind to replicate the existing behaviour but the performance implications of these different methods become significantly more important if the implicit order by primary key behaviour is heavily used. Longer term, this behaviour might be good to deprecate by documenting that without an integer primary key, this behaviour cannot be relied upon, and removing any workarounds that emulate integer ordering type behaviour.
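For reference, the fallback discussed above can be emulated in a few lines. This is a simplified sketch, not Django's code: the `first` function and the list-of-dicts rows are hypothetical stand-ins. It only shows that, with no explicit ordering, the "first" row is whichever has the smallest primary key, which for UUID keys is unrelated to insertion order.

```python
import uuid


def first(rows, order_key=None):
    """Return the first row, falling back to primary-key order when unordered."""
    key = order_key if order_key is not None else "pk"  # the implicit fallback
    ordered = sorted(rows, key=lambda row: row[key])
    return ordered[0] if ordered else None


# Integer keys: the fallback happens to match insertion order.
int_rows = [
    {"pk": 1, "name": "first insert"},
    {"pk": 2, "name": "second insert"},
]

# UUID keys (fixed values so the example is reproducible): it does not.
uuid_rows = [
    {"pk": uuid.UUID("f0000000-0000-0000-0000-000000000000"), "name": "first insert"},
    {"pk": uuid.UUID("10000000-0000-0000-0000-000000000000"), "name": "second insert"},
]
```

With these rows, `first(int_rows)` returns the first inserted row, while `first(uuid_rows)` returns the second inserted one, because its key sorts lower.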

Ticket 6663 was closed quite some time ago, however in order to get the most from any attempt to support non relational databases, via QuerySet or otherwise, it will need to be revisited and either reopened or a new issue created to address the point I'm about to make that I feel is encompassed by 6663. I hope I can avoid any confusion and be clear what I feel is covered by this.

The current UUIDField that was recently added to Django is not always suitable for use as a database primary key because:
- The UUIDField generates the UUID with Python code and this is less than optimal in some circumstances. Many databases can or do generate document or row UUID 'primary keys' automatically. It should be possible to let Django defer the creation of the UUID and rely on the database for the creation of UUID primary keys just like it currently does for automatically incrementing integer primary keys. 
- Existing Django applications/libraries were not written with UUID primary keys. Supporting existing Django applications and models is one of my goals, so requiring explicit use of something like `id = UUIDField(primary_key=True)` on a model in order to make it compatible, represents an issue to me. 

Ticket 6663 was about the ability to use a UUID as the primary key. On the surface this appears solved: we can write `id = UUIDField(primary_key=True)` and we have a UUID as the primary key. What hasn't been addressed is the ability to say "I want to use UUIDs for primary keys". I feel this was the intent behind Ticket 6663, and it should be reopened with an explicit focus on fixing the following two things:
- The default AutoField that Django provides any model that doesn't explicitly create its own id field, should not "force" the use of an automatically incrementing integer based primary key.
- A mechanism for configuring what kind of primary keys should be used. The two most likely configurations are all integer primary keys and all UUID primary keys, so my initial thoughts are that this mechanism should reside at the public QuerySet API layer, probably as a boolean value set during QuerySet class `__init__`.
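
To show the shape of the second point, here is a minimal sketch of a boolean switch between the two key styles. Everything in it is hypothetical (`make_pk_factory` and `use_uuid_keys` are illustrative names, not Django API), and it generates keys in Python purely for demonstration. A real implementation would ideally defer key creation to the database, as described above.

```python
import itertools
import uuid


def make_pk_factory(use_uuid_keys=False):
    """Hypothetical switch: return a callable that produces the next primary key."""
    if use_uuid_keys:
        return uuid.uuid4
    counter = itertools.count(1)  # emulates an auto-incrementing integer column
    return lambda: next(counter)


next_int_pk = make_pk_factory()
next_uuid_pk = make_pk_factory(use_uuid_keys=True)
```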

In addition, for this to be most effective, there needs to be a way to specify that you want to use an alternative QuerySet class. There are several places where one could override this for one's own application and models very easily, but there is no convenient way to modify the 'default QuerySet' class provided by `Manager`. My first thought is to modify `BaseManager.from_queryset()` here https://github.com/django/django/blob/stable/1.9.x/django/db/models/manager.py#L143 so that the definition of `Manager` no longer has to explicitly pass QuerySet, as it does here https://github.com/django/django/blob/stable/1.9.x/django/db/models/manager.py#L238. However, the potential impact of such changes is something I'm unfamiliar with, so I would greatly appreciate any feedback on how appropriate this approach would be.
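
To make that concrete, the pattern behind `from_queryset()` is dynamic subclass creation. The sketch below is a simplified, Django-free stand-in: the `QuerySet`, `DocumentQuerySet`, `BaseManager` and `Manager` classes here are toy classes written for this example, not Django's. It shows how a manager class can be built around whichever QuerySet class is passed in.

```python
class QuerySet:
    backend = "sql"


class DocumentQuerySet(QuerySet):
    backend = "document"


class BaseManager:
    _queryset_class = None  # filled in by from_queryset()

    @classmethod
    def from_queryset(cls, queryset_class, class_name=None):
        # Build a new manager subclass bound to the given QuerySet class.
        if class_name is None:
            class_name = "%sFrom%s" % (cls.__name__, queryset_class.__name__)
        return type(class_name, (cls,), {"_queryset_class": queryset_class})

    def get_queryset(self):
        return self._queryset_class()


# The default manager names QuerySet explicitly at class definition time.
class Manager(BaseManager.from_queryset(QuerySet)):
    pass


# An alternative manager only needs a different argument.
DocumentManager = BaseManager.from_queryset(DocumentQuerySet)
```

The open question is the explicit `QuerySet` argument in the `Manager` definition: making that one argument configurable is what would change the default for every model.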

- Sam 

Curtis Maloney

Dec 17, 2015, 7:32:12 PM
to django-d...@googlegroups.com
> I've identified one new "issue".
> There is an implicit assumption that primary keys are useful for
> ordering by the current QuerySet API methods `.first()` and `.last()`.

I believe the case here is that first and last are meaningless without
an ordering, so in lieu of any programmer supplied ordering, a
"fallback" consistent ordering of 'pk' is used.

Remember the DBMS is under no obligation to return a query in consistent
order without an ORDER BY clause.

So, without an ordering, two successive calls to first() [or last()]
may, in fact, return different records even without modifications to the
table.

It's not expected people should _rely_ on the ordering of PK, and,
indeed, it's frequently recommended against inferring any meaning from
PK values [sqlite, IIRC, assigns them pseudo-randomly]

That said, the assumption that a PK is sortable and will provide a
deterministic ordering shouldn't be much to ask, surely?
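
As a small check of that assumption for UUID keys, Python's own `uuid.UUID` objects are comparable, so sorting by them gives one stable order however the rows arrive. This is a sketch with made-up key values, and it says nothing about how any particular database compares UUIDs.

```python
import random
import uuid

keys = [uuid.UUID(int=n) for n in (42, 7, 1999, 300)]

shuffled = keys[:]
random.shuffle(shuffled)  # simulate the backend returning rows in arbitrary order

# Sorting by key restores one stable order no matter how the rows arrived.
stable = sorted(shuffled)
```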

--
Curtis

Marc Tamlyn

Dec 18, 2015, 12:42:40 PM
to django-d...@googlegroups.com

I agree that the current UUIDField is a simplistic solution when used as an autofield, and having a way to specify your autofield type globally would be good, equally for something like a big integer field. Doing this properly (custom RETURNING statements, database level defaults, integration with migrations) is an order of magnitude more complex than the solution committed. I knew that was the intention of the original issue, but that complexity is also why only the "simple" case has been solved.

I did spend some time looking at the more complete solution and honestly could not work out how to approach it. I found it hit many more parts of the internals of the ORM than I expected.

Samuel Bishop

Dec 19, 2015, 10:39:13 AM
to Django developers (Contributions to Django itself)
You've raised a very good point, one that has been on my mind the entire time I've been exploring this.
Much of Django has been designed 'on the back of' the ORM, reliant on its limitations and without any need to go beyond the scope of what it can provide. It's very probable that over time this has introduced some failures to fully adhere to the best possible separation of concerns.

AutoField does seem to suffer from this a little with its assumption of integer-only keys.
Fundamentally, when it comes to letting the database do the work for us, Django doesn't need a lot of code to make a field... I've seen alternate (non primary key) autofields as short as 3 lines. The simplest possible approach would be a 'project wide setting' that could 'toggle' between integer and string/UUID behavior everywhere we need to introduce changes. However, I feel that may be too simplistic if I want to get broader NoSQL/alternate DB support via the approach I'm so determined to try and build.

My biggest worry is really just how many places check the DB parameters. A cursory reading of the base Field class turns up literally dozens of references to both the default connection wrapper object `connection` and the database connection handler object `connections`: 55 references at last check in `django.db.models.fields` alone. It worries me that I may still need to create what is essentially a fake database backend full of dozens of stub entries to satisfy the introspection being performed by various classes, despite using a queryset that never even makes a database call. If anyone with more knowledge about this wants to comment, it would be most useful. Currently, I'm operating under the assumption, educated at least in part by studying the DB connection classes, testing and debugging various alternative database backends, and debugging my way through ORM calls, that the queryset interface is the uppermost level of ORM interfacing code. In principle nothing above the queryset should rely on any particular database, or any database at all. If there are parts of Django "above" the queryset that rely on knowing things about the database, how defective should we consider this? Should I be temporarily working around such issues by commenting things out, logging bugs, etc., or implementing workarounds in my own queryset, such as the aforementioned 'fake database backend'?

Josh Smeaton

Dec 19, 2015, 6:31:48 PM
to Django developers (Contributions to Django itself)
As far as I'm concerned, anything we can do to simplify and decouple abstraction layers is a good thing and would be welcomed, with the usual caveats of backwards compatibility in public and pseudo-public APIs.

models.query is a thin layer over models.sql.query. That's a good thing. Ideally, models.query should be able to use any sql.query underneath, but there's currently no good way to swap out implementations. I think simplifying models.query so that it is *only* a thin wrapper over the backend query class would be a great first start. 

I think there's a tonne of room for improvement with sql.query also. It does way more than just store the current state of the query. I feel a lot of the functionality of sql.query should be pushed down to the Compiler class.

As far as fields themselves go, they're distinct from the queryset/query/compiler layer, so I wouldn't say they are "above" or "below" queryset itself. Perhaps fields would benefit from a similar abstraction where there is a very thin layer up top, and a "backend" implementation that does the heavy lifting. Again though, figuring out a nice way to swap out implementations would be a challenge, and preserving backwards compatibility of fields is going to be even harder than that of query and compiler, because fields are most definitely public/pseudo-public.

But going back to my original point, we'd welcome lots of smaller incremental improvements to the abstraction layers. Fields are going to be hard to improve though. They offer so many different methods that aren't well documented but are definitely relied on from all backends.

Cheers

Samuel Bishop

Dec 20, 2015, 3:46:44 AM
to Django developers (Contributions to Django itself)
I was referring to 'above' in a broader way, as the fields are constructed from the data returned by the ORM when building the model, which should also be above the ORM.

Pulling open the Django code in PyCharm and taking a look at things, with an eye towards "lots of smaller incremental improvements" has me wondering... how small is small?

As part of restructuring the layers, it would make sense to create a higher level module, stub the 'new' models and fields into the old locations for backwards compatibility, and separate the ORM related and non ORM related parts of the model and field classes. The non ORM part would go in the new module, and the rest would likely stay in `django.db.models`, since 'django database models' feels like the right place for the database specific parts of model code. Splitting it like that would allow our new 'choose your backend' logic to sit between the non ORM parts and the ORM parts, using dynamic subclassing based on configuration options much as we do inside the ORM.
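
One way to picture that 'choose your backend' step is loading a base class from a dotted path held in configuration, similar in spirit to Django's `import_string` helper, and then subclassing it at runtime. This is only a sketch under stated assumptions: `load_class`, `CONFIG` and the `MODEL_BACKEND` key are hypothetical names, and a standard library class stands in for a real backend base class.

```python
from importlib import import_module


def load_class(dotted_path):
    """Import and return the class named by a 'package.module.ClassName' path."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ImportError("%r is not a dotted path" % dotted_path)
    module = import_module(module_path)
    return getattr(module, class_name)


# Hypothetical configuration; a stdlib class stands in for a real backend base class.
CONFIG = {"MODEL_BACKEND": "collections.OrderedDict"}

BackendBase = load_class(CONFIG["MODEL_BACKEND"])

# Dynamic subclassing: the concrete class is assembled at runtime from the configured base.
ConfiguredModel = type("ConfiguredModel", (BackendBase,), {})
```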

That could happen with little new code needing to be written, since it's primarily about restructuring the existing code. It doesn't feel like a small change, however, and no matter how I cut it up, at some point there would likely need to be a fairly big "cut over" PR. Not really in the spirit of lots of small increments.

Josh Smeaton

Dec 20, 2015, 4:43:11 AM
to Django developers (Contributions to Django itself)
When I wrote small I was thinking along the lines of clearing up particular methods to ensure they're not breaking semi-formal contracts. Layering on a new abstraction to fields is quite a bit bigger, but is still small relative to the goal of "decoupling the ORM". If you can show a clear benefit for that direction and it doesn't break backwards compatibility then I think the idea would be welcomed. 