two-phase commit / distributed transaction

Mateusz Mikołajczyk

Dec 1, 2016, 6:28:26 AM
to Django developers (Contributions to Django itself)
Hello, fellow devs.

I have been googling intensively to see whether somebody has already raised this issue, but so far I have been unsuccessful. Therefore, with trembling legs, I decided to write to the devlist as suggested in the docs.

I am trying to extend the atomic decorator / context manager so that it issues 'prepare transaction \'foo\'' rather than the usual 'commit' on a successful transaction. It is not, however, the usual scenario where Django would talk to multiple databases. What I have in mind is a bunch of microservices, one of which would be a Django application. The Django app would therefore talk to an external transaction manager, which would then take care of executing the appropriate transactions inside each microservice.
I suppose that after a bunch of hacking I could implement this by monkey patching the original atomic() code, but that clearly is not the way to go.

I then started to think about how this could be done in a database-agnostic way. I know that PostgreSQL supports this with the 'prepare transaction' statement, but I suppose that other databases have a different syntax for this kind of behaviour and some don't support the feature at all. So I thought that in e.g. SQLite3 (or any other database which doesn't support this natively) the behaviour could be 'emulated', and came up with the following pseudocode:

```
# proposed keyword arguments; neither exists in Django today
with transaction.atomic(commit_hook=lambda connection: connection.prepare_transaction('foo')):
    ...

# OR

with transaction.atomic(prepare_transaction='foo'):
    ...
```

In the emulated case, CRUD operations executed inside the block would instead be serialized into a special table, and then, after issuing another command, say ..

from django.db import transaction
transaction.commit_prepared('foo')

they would be applied using a regular 'atomic' call. (Or, I don't know, Django could raise an IntegrityError if the database doesn't support distributed transactions and the code tried to execute them.)

Do you think that this is realistic or is it a wrong approach to the subject?

kind regards,

Aymeric Augustin

Dec 1, 2016, 6:52:58 AM
to django-d...@googlegroups.com
Hello,

Currently you cannot do this:

from django.db import connection
connection.xid()  # or any other TPC method

Adding implementations of TPC methods in BaseDatabaseWrapper() that simply forward to the underlying connection object is the first step for a database agnostic implementation in Django.
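
As a rough illustration of that first step, here is a minimal sketch of a wrapper that forwards the optional two-phase commit (TPC) methods of DB-API 2.0 (PEP 249) to the driver connection. `ForwardingWrapper` is a hypothetical toy class, not Django's `BaseDatabaseWrapper`, and the driver connection is a `MagicMock` so that the snippet runs without a database.

```python
from unittest.mock import MagicMock


class ForwardingWrapper:
    """Toy stand-in for a database wrapper; not Django's BaseDatabaseWrapper."""

    def __init__(self, driver_connection):
        # Assumed to implement the optional TPC extension of DB-API 2.0.
        self.connection = driver_connection

    def xid(self, format_id, global_transaction_id, branch_qualifier):
        return self.connection.xid(format_id, global_transaction_id, branch_qualifier)

    def tpc_begin(self, xid):
        return self.connection.tpc_begin(xid)

    def tpc_prepare(self):
        return self.connection.tpc_prepare()

    def tpc_commit(self, *xid):
        # The xid argument is optional in PEP 249, hence *xid.
        return self.connection.tpc_commit(*xid)

    def tpc_rollback(self, *xid):
        return self.connection.tpc_rollback(*xid)

    def tpc_recover(self):
        return self.connection.tpc_recover()


driver = MagicMock()  # fake driver connection that records every call
wrapper = ForwardingWrapper(driver)
xid = wrapper.xid(1, "order-42", "django-app")
wrapper.tpc_begin(xid)
wrapper.tpc_prepare()
wrapper.tpc_commit()
```

A real backend would additionally have to keep Django's own transaction state (autocommit, atomic blocks, savepoints) consistent with these calls, which this sketch ignores.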


For the high level API, here’s one possibility:

# entering the block calls .xid(format_id, global_transaction_id, branch_qualifier) and .tpc_begin(xid)
with transaction.atomic2(format_id, global_transaction_id, branch_qualifier) as prepare:

    # run statements

    prepare()

    # check if others are ready to commit
    # raise an exception to abort

# exiting the block calls .tpc_commit() or .tpc_rollback() depending on whether there’s an exception

I’m proposing a separate context manager because I’m worried about increasing again the complexity of transaction.atomic. There will be a significant amount of duplication between the two implementations, though.
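
The flow described above could be sketched roughly as follows, assuming a connection object that implements the DB-API 2.0 TPC methods. `two_phase` is a hypothetical name standing in for the proposed `atomic2`, and the connections are `MagicMock` objects so the snippet runs without a database.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock


@contextmanager
def two_phase(connection, format_id, global_transaction_id, branch_qualifier):
    """Hypothetical stand-in for the proposed atomic2; not a Django API."""
    xid = connection.xid(format_id, global_transaction_id, branch_qualifier)
    connection.tpc_begin(xid)
    try:
        # The caller runs statements, calls prepare(), then checks its peers.
        yield connection.tpc_prepare
    except BaseException:
        connection.tpc_rollback()
        raise
    else:
        connection.tpc_commit()


# Happy path: statements run, prepare() succeeds, the block exits normally.
ok_conn = MagicMock()
with two_phase(ok_conn, 1, "order-42", "django-app") as prepare:
    prepare()

# Abort path: an exception after prepare() rolls the prepared transaction back.
failed_conn = MagicMock()
try:
    with two_phase(failed_conn, 1, "order-42", "django-app") as prepare:
        prepare()
        raise RuntimeError("another participant voted to abort")
except RuntimeError:
    pass
```

Note that if the body never calls `prepare()`, PEP 249 specifies that `tpc_commit()` performs a single-phase commit, so the sketch degrades to an ordinary transaction in that case.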


The proposal above doesn’t account for recovery: .tpc_recover(), .tpc_commit(xid), .tpc_rollback(xid). I’m not sure what recovery of a two phase transaction is and I can’t say if it needs to be supported in the API.
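
For context, PEP 249 describes `.tpc_recover()` as returning the pending transaction IDs, which can then be passed to `.tpc_commit(xid)` or `.tpc_rollback(xid)`. A recovery pass after a crash might therefore look like the sketch below. `FakeConnection`, `recover` and the `coordinator_decisions` mapping are hypothetical names used only for illustration, and rolling back unknown transactions is just one possible policy.

```python
class FakeConnection:
    """In-memory stand-in for a driver that implements the TPC extension."""

    def __init__(self, pending):
        self.pending = list(pending)  # xids left in the prepared state
        self.committed = []
        self.rolled_back = []

    def tpc_recover(self):
        return list(self.pending)

    def tpc_commit(self, xid):
        self.pending.remove(xid)
        self.committed.append(xid)

    def tpc_rollback(self, xid):
        self.pending.remove(xid)
        self.rolled_back.append(xid)


def recover(connection, coordinator_decisions):
    """Finish every prepared transaction according to the coordinator's log."""
    for xid in connection.tpc_recover():
        if coordinator_decisions.get(xid) == "commit":
            connection.tpc_commit(xid)
        else:
            # Assumption: no recorded decision means the transaction is aborted.
            connection.tpc_rollback(xid)


conn = FakeConnection(pending=["order-42", "order-43"])
recover(conn, {"order-42": "commit"})
```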


Also I didn’t talk about savepoints. I assume they can be supported like in regular transactions.


I hope this helps,

-- 
Aymeric.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/8d358e52-591d-4b9b-8c11-882e6a2ac80d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shai Berger

Dec 1, 2016, 7:30:11 AM
to django-d...@googlegroups.com
On Thursday 01 December 2016 13:52:41 Aymeric Augustin wrote:
>
> I’m proposing a separate context manager because I’m worried about
> increasing again the complexity of transaction.atomic. There will be a
> significant amount of duplication between the two implementations, though.
>
I believe that making transaction.atomic more complex for this will be
inevitable, because the two will need to interact: If I'm in a TPC
transaction, and open an atomic block, it needs to be handled as part of the
TPC transaction. There are too many atomic blocks in the Django ecosystem,
including Django itself, to make the feature useful any other way.
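
To make the interaction concrete, here is a toy sketch (not Django's implementation) of an `atomic`-like block that notices an enclosing TPC transaction and falls back to a savepoint instead of issuing its own BEGIN/COMMIT, much as nested atomic blocks already do. `ToyConnection` and its `in_tpc` flag are hypothetical.

```python
from contextlib import contextmanager


class ToyConnection:
    """Records statements instead of talking to a database."""

    def __init__(self):
        self.in_tpc = False          # hypothetical flag set by a TPC block
        self.in_transaction = False
        self.savepoints = 0
        self.log = []

    def execute(self, sql):
        self.log.append(sql)


@contextmanager
def atomic(conn):
    """Toy atomic block that defers to an enclosing transaction of any kind."""
    if conn.in_tpc or conn.in_transaction:
        conn.savepoints += 1
        name = f"s{conn.savepoints}"
        conn.execute(f"SAVEPOINT {name}")
        try:
            yield
        except BaseException:
            conn.execute(f"ROLLBACK TO SAVEPOINT {name}")
            raise
        else:
            conn.execute(f"RELEASE SAVEPOINT {name}")
    else:
        conn.in_transaction = True
        conn.execute("BEGIN")
        try:
            yield
        except BaseException:
            conn.execute("ROLLBACK")
            raise
        else:
            conn.execute("COMMIT")
        finally:
            conn.in_transaction = False


conn = ToyConnection()
conn.in_tpc = True  # pretend tpc_begin() has already run
with atomic(conn):  # e.g. an atomic block buried in library code
    conn.execute("INSERT INTO mymodel VALUES (123)")
```

The inner block never emits COMMIT, so the outcome stays in the hands of the surrounding two-phase transaction.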

Aymeric Augustin

Dec 1, 2016, 8:04:53 AM
to django-d...@googlegroups.com
You may be right.

The person who will write and test that code has all my sympathy :-)

--
Aymeric.

Florian Apolloner

Dec 1, 2016, 8:54:27 AM
to Django developers (Contributions to Django itself)


On Thursday, December 1, 2016 at 2:04:53 PM UTC+1, Aymeric Augustin wrote:
> The person who will write and test that code has all my sympathy :-)

I'll second that, I have no idea how Aymeric managed to keep his sanity while rewriting the transaction code :D

Mateusz Mikołajczyk

Dec 2, 2016, 6:05:11 AM
to Django developers (Contributions to Django itself)
What would you say about checking which CRUD operations were executed within an atomic() call (in order to serialize them and save them into a special model for databases which don't support this functionality)? Is it realistic?

What I mean by that is that when you do:

from django.db import transaction

with transaction.atomic():
    MyModel.objects.create(field=123)

then the generated SQL is something like

BEGIN;
INSERT INTO mymodel values (123);
COMMIT;

However, if the database doesn't support the TPC functionality, the SQL would have to be slightly different, say:

BEGIN;
INSERT INTO prepared_transactions (txn_id, model, operation, params) values ('foo', 'MyModel', 'create', '{"field": 123}');
COMMIT;

But on the other hand, if the database does support that, it could be 'normal', i.e.:

BEGIN;
INSERT INTO mymodel values ( ... );
PREPARE TRANSACTION 'foo';
(no COMMIT)

If it is not possible to trace the CRUD operations, would it be easier to introduce a slightly different syntax, say ...

from django.db import prepare_distributed  # hypothetical API

with prepare_distributed('foo') as prepare:
    prepare.add_operation(MyModel.objects.create, {'field': 123})

After all, it's not like the developer doesn't know whether he's doing a distributed transaction or not.

As for making atomic() more complex, I don't think it would be particularly hard. The distributed transaction isn't really *that* different: it just calls PREPARE TRANSACTION 'foo' (without calling COMMIT). I thought that the Atomic class could simply have some kind of inner method hooks. The default class could then implement those:

class Atomic(ContextDecorator):
    def _commit_wrapper(self, connection):
        return connection.commit()

but the Two phase could do it differently:

class TwoPhaseAtomic(Atomic):
    def _commit_wrapper(self, connection):
        return connection.prepare_distributed(self.distributed_transaction_id)

Of course, the prepare_distributed call would create models in the special table if the database didn't support the functionality and call regular commit() at the end, or call the appropriate command otherwise, so this seems like the easiest thing to do. The problem I haven't figured out yet is how to trace the instances being saved / created / etc.

Patryk Zawadzki

Dec 2, 2016, 8:32:51 AM
to Django developers (Contributions to Django itself)
On Friday, December 2, 2016 at 12:05:11 PM UTC+1, Mateusz Mikołajczyk wrote:
> What would you say about checking which CRUD operations were executed within atomic() call (in order to serialize them and save into a special model for databases which don't support this functionality) ? Is it realistic?

It would likely break the promise that distributed two-step transactions give you: that once all statements are prepared the transaction is unlikely to fail during commit. In this case the commit would mean "start over and try to repeat my steps" at which point any of the recorded statements is likely to fail constraint checks. (Even more so if your code used get_or_create().)
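
A small, self-contained illustration of that failure mode, using `sqlite3` from the standard library. The `account` table and the in-memory `journal` list are made up for the example, and the "other client" is simulated on the same connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (email TEXT UNIQUE)")
conn.commit()

# Emulated "prepare": run the statement, remember it, then roll back.
journal = [("INSERT INTO account (email) VALUES (?)", ("a@example.com",))]
for sql, params in journal:
    conn.execute(sql, params)  # succeeds, so this participant votes "ready"
conn.rollback()

# Another client commits a conflicting row before the second phase.
conn.execute("INSERT INTO account (email) VALUES (?)", ("a@example.com",))
conn.commit()

# Emulated "commit": replaying the journal now violates the UNIQUE constraint.
replay_error = None
try:
    for sql, params in journal:
        conn.execute(sql, params)
    conn.commit()
except sqlite3.IntegrityError as exc:
    conn.rollback()
    replay_error = exc
```

With a real prepared transaction the database keeps the transaction's state until the second phase, so a concurrent conflicting insert could not slip in this way.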

Also how would relations work? You begin a transaction, create a Foo instance and the returned PK is 5. You assign it to child models. At this point the transaction is saved and rolled back. During replay the insert returns PK = 7, at this point there's no way to detect that some of the stored fives should now be treated as sevens while some should remain fives.
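
The primary key drift can be reproduced with a few lines of `sqlite3`. The `foo` and `child` tables and the `journal` list are invented for this sketch, and the concurrent insert is simulated on the same connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY, foo_id INTEGER)")
conn.commit()

# Emulated prepare: run the inserts, record them verbatim, then roll back.
prepared_pk = conn.execute("INSERT INTO foo (name) VALUES ('mine')").lastrowid
conn.execute("INSERT INTO child (foo_id) VALUES (?)", (prepared_pk,))
journal = [
    ("INSERT INTO foo (name) VALUES ('mine')", ()),
    ("INSERT INTO child (foo_id) VALUES (?)", (prepared_pk,)),
]
conn.rollback()

# Someone else inserts a row between prepare and commit.
conn.execute("INSERT INTO foo (name) VALUES ('theirs')")
conn.commit()

# Emulated commit: replay the journal as recorded.
replayed_pks = [conn.execute(sql, params).lastrowid for sql, params in journal]
conn.commit()
replayed_pk = replayed_pks[0]

# The child row still carries the key seen during prepare.
child_points_to = conn.execute(
    "SELECT foo.name FROM child JOIN foo ON foo.id = child.foo_id"
).fetchone()[0]
```

Here the parent row gets a different key on replay than it had during the dry run, and the child row ends up attached to the other client's row.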

Mateusz Mikołajczyk

Dec 2, 2016, 2:43:41 PM
to Django developers (Contributions to Django itself)
Well, I suppose that it would lead either to very obfuscated implementation code or to very weird syntax (client code). As for your first argument (the promise that the transaction is unlikely to fail):

from django.db import distributed  # hypothetical API

with distributed('foo') as foo:
    MyModel.objects.get_or_create(field=123)

then, before calling the emulated behavior, the db would have to:

* do all the operations (like it would normally do with a regular commit, thus checking every constraint and so on)
* then do a rollback (so that it doesn't store the actual values in the db)
* then serialize them in separate journal (the additional model I mentioned - an analogy to an actual separate journal of PostgreSQL)
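
The three steps above can be sketched with `sqlite3` as follows. This is only an illustration of the idea under simplifying assumptions: statements are passed in as raw SQL rather than traced from the ORM, and the journal table has a single hypothetical `payload` column holding JSON.

```python
import json
import sqlite3


def emulated_prepare(conn, txn_id, statements):
    """Dry-run the statements, undo them, then journal them under txn_id."""
    try:
        for sql, params in statements:
            conn.execute(sql, params)  # 1. run them so constraints are checked
    finally:
        conn.rollback()                # 2. keep none of the real changes
    conn.execute(                      # 3. store the statements for later
        "INSERT INTO prepared_transactions (txn_id, payload) VALUES (?, ?)",
        (txn_id, json.dumps(statements)),
    )
    conn.commit()


def emulated_commit(conn, txn_id):
    """Replay the journal; unlike a real COMMIT PREPARED this can still fail."""
    row = conn.execute(
        "SELECT payload FROM prepared_transactions WHERE txn_id = ?", (txn_id,)
    ).fetchone()
    for sql, params in json.loads(row[0]):
        conn.execute(sql, params)
    conn.execute("DELETE FROM prepared_transactions WHERE txn_id = ?", (txn_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (field INTEGER)")
conn.execute(
    "CREATE TABLE prepared_transactions (txn_id TEXT PRIMARY KEY, payload TEXT)"
)
conn.commit()

emulated_prepare(conn, "foo", [("INSERT INTO mymodel (field) VALUES (?)", [123])])
rows_after_prepare = conn.execute("SELECT COUNT(*) FROM mymodel").fetchone()[0]
emulated_commit(conn, "foo")
rows_after_commit = conn.execute("SELECT field FROM mymodel").fetchall()
```

If any statement fails during the dry run, the exception propagates after the rollback and nothing is journaled.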

Utterly ugly / hacky solution if you ask me, but please keep in mind that this would only be an emulation of the actual algorithm for databases which don't support this standard.

As for the relations, I have thought a lot about it and the only pseudocode I could think of was utterly ugly as well:

with distributed('foo') as foo:
    foo.add(MyModel.objects.get_or_create, {'field': 123}, namespace='mymodel')
    foo.add(MyOtherModel.objects.create, {'my_model_id': ('from-namespace', 'mymodel')})

Theoretically, both of these syntaxes could co-exist. If you didn't have any relations, you could use the cleaner syntax.

So I'd say it would technically be possible but would lead to very, very, very ugly code (at least in the second scenario with relations). And I realize that this is not an option in the Django world.

I understand that, because of all the above, a nice interface that works in a database-agnostic way is unlikely, so Django would have to throw an IntegrityError if somebody tried to do a distributed transaction on an unsupported database? But if that's the case then this code doesn't really belong in Django core, does it? Which means that I'm probably left with the monkey-patching thing :( Or .. ? I have to build this functionality either way, because I need it ;)

Thank you for all the answers !

Aymeric Augustin

Dec 2, 2016, 4:02:44 PM
to django-d...@googlegroups.com
Hello,

To be honest I’m pessimistic about the feasibility of emulating transactional behavior — pretty much the most complicated and low level thing databases do — in the application. I don’t think that would be considered suitable for Django.

Usually Django handles such cases with a database feature flag and makes the methods no-ops on databases that don’t support the corresponding features. For instance that’s how Django ignores transactions on MySQL + MyISAM.
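
A minimal sketch of that pattern, assuming a hypothetical `supports_two_phase_commit` flag (no such flag exists in Django at the time of writing). The classes are toys that record statements rather than real backend code.

```python
class Features:
    # Hypothetical flag, modelled on the style of Django's database feature flags.
    supports_two_phase_commit = False


class PostgresLikeFeatures(Features):
    supports_two_phase_commit = True


class ToyWrapper:
    """Toy wrapper; `features` and `log` are illustrative, not Django internals."""

    def __init__(self, features):
        self.features = features
        self.log = []

    def tpc_prepare(self, txn_id):
        if not self.features.supports_two_phase_commit:
            return  # silently do nothing on backends without the feature
        self.log.append(f"PREPARE TRANSACTION '{txn_id}'")


unsupported = ToyWrapper(Features())
unsupported.tpc_prepare("foo")

supported = ToyWrapper(PostgresLikeFeatures())
supported.tpc_prepare("foo")
```

The trade-off is the same as with MyISAM: code keeps running on the unsupported backend, but without the guarantees the feature is meant to provide.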

Best regards,

-- 
Aymeric.

Mateusz Mikołajczyk

Dec 2, 2016, 8:52:26 PM
to Django developers (Contributions to Django itself)
If anybody is interested, I created proof-of-concept code for PostgreSQL which extends the existing Atomic context:


the actual implementation can be found inside tpc/atomic_tpc.py and there are two commands available:

./manage.py prepare
./manage.py commit

If atomic() were extensible (i.e. if one could replace connection.commit() with something else), all this monkeypatching wouldn't be necessary.

What do you guys think? Also, has nobody ever written stuff like this? Or am I searching the wrong way?

cheers,
toudi
