gRPC & transaction support


Jiri Jetmar

Aug 5, 2015, 4:45:53 AM
to grpc.io
Hi guys,

we are (re-)designing an RPC-based approach for our backoffice services and are considering gRPC. Currently we use REST to call our services, but we have realized over time that designing a nice REST API is a really hard job, and when we look at our internal APIs they look more like RPC than REST. A shift to pure RPC is therefore a valid alternative. I'm not talking here about public APIs - they will continue to be REST-based.

Now, when a number of microservices are (or can be) distributed, one has to compensate for failures during commands (write interactions, i.e. HTTP POST, PUT, DELETE). Currently we are using the TCC (try-confirm-cancel) pattern.
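
For readers unfamiliar with TCC, the pattern can be sketched in a few lines of plain Python (service and method names here are illustrative, not from any real system): each participant *reserves* a resource in the try phase, and a coordinator either confirms all reservations or cancels them.

```python
# Try-Confirm-Cancel (TCC) sketch: "try" reserves, then the coordinator
# either confirms every reservation or cancels the ones already made.

class InventoryService:
    def __init__(self, stock):
        self.stock = stock
        self.reserved = {}

    def try_reserve(self, tx_id, qty):
        if self.stock < qty:
            raise RuntimeError("insufficient stock")
        self.stock -= qty            # tentatively taken out of stock
        self.reserved[tx_id] = qty

    def confirm(self, tx_id):
        self.reserved.pop(tx_id, None)   # reservation becomes permanent

    def cancel(self, tx_id):
        self.stock += self.reserved.pop(tx_id, 0)  # release reservation

def place_order(tx_id, services, qty):
    done = []
    try:
        for s in services:
            s.try_reserve(tx_id, qty)
            done.append(s)
    except RuntimeError:
        for s in done:               # compensate the successful "tries"
            s.cancel(tx_id)
        return False
    for s in done:
        s.confirm(tx_id)
    return True
```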

I'm curious how you guys at Google solve this. How do you handle distributed transactions on top of RPC services? Do you solve it at a more technical level (e.g. a kind of transaction monitor), or do you treat it at the functional/application level, where the calling client has to compensate for failed commands to a service?

Are there any plans to propose something for gRPC.io?

Thank you.

Cheers,
Jiri

Jorge Canizales

Sep 9, 2015, 1:47:51 PM
to grpc.io
For Google's JSON/REST APIs we use ETag headers (optimistic concurrency) to do these things. That's something easy to implement on top of gRPC, using the request and response metadata to send the equivalent headers.
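
Transport aside, the check behind an ETag could look like this minimal sketch (plain Python, no real gRPC API used; in gRPC the token would travel in request/response metadata): the server hands out a version token with each read, and rejects a write whose token has gone stale.

```python
# Optimistic concurrency sketch: a write must present the version token
# ("ETag") it read; if the stored token changed in the meantime, the write
# fails and the client must re-read and retry.

import uuid

class Store:
    def __init__(self):
        self._data = {}  # key -> (value, etag)

    def get(self, key):
        return self._data[key]  # returns (value, etag)

    def put(self, key, value, if_match=None):
        current = self._data.get(key)
        if current is not None and if_match != current[1]:
            raise ValueError("precondition failed: stale ETag")
        etag = uuid.uuid4().hex     # fresh token for the new version
        self._data[key] = (value, etag)
        return etag
```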

glert...@gmail.com

Nov 5, 2018, 4:16:10 AM
to grpc.io
Dead thread, but I would like to resurrect it because the question was never really answered.

A simple use case illustrates the problem: two different services, OrderService (with a CreateOrder method) and AuditService (with an Audit method). You want to create the order and, if everything succeeded, log an audit entry. If you log the entry beforehand, you could end up with an audit record for an order that was never created because the create-order task failed. If you (try to) log the entry afterwards, the audit task could fail and you end up not logging something that did happen, which defeats the sole purpose of having an audit log at all.

What do you guys at Google do?
* Compensate?
* Nothing more than live with it?
* In this concrete case, keep a custom audit log per service and use CDC (Change Data Capture) to replicate it to the central service?

@Jiri what did you end up doing?

Thanks,

Robert Engels

Nov 5, 2018, 8:48:50 AM
to glert...@gmail.com, grpc.io
You need a database and logger service that supports XA transactions. 

Sometimes it is easier to just log in the database under the same transaction. 
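
The "log in the database under the same transaction" idea can be sketched with SQLite (table names are illustrative): the order row and its audit row either both commit or both roll back.

```python
# Write the order and its audit entry in one local database transaction:
# a failure on either statement rolls back both.

import sqlite3

def create_order_with_audit(conn, order_id, item):
    with conn:  # begins a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)",
                     (order_id, item))
        conn.execute("INSERT INTO audit (order_id, event) VALUES (?, ?)",
                     (order_id, "order created"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE audit (order_id TEXT, event TEXT)")
```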
--
You received this message because you are subscribed to the Google Groups "grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email to grpc-io+u...@googlegroups.com.
To post to this group, send email to grp...@googlegroups.com.
Visit this group at https://groups.google.com/group/grpc-io.
To view this discussion on the web visit https://groups.google.com/d/msgid/grpc-io/c727145c-b8a8-44f3-b857-416b4491362b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

glert...@gmail.com

Nov 5, 2018, 12:13:42 PM
to grpc.io
I find it really hard to believe that Google uses XA transactions for its own services. Take this AuditLog service as an example, which is presumably consumed as middleware by their services (given the service_name/method_name properties). Not to mention that the people in charge of an AuditLog API and a Container API are probably different teams, which makes consensus on a single RPC-based XA/2PC system difficult.

Can anyone from Google shed some light on this matter?

Robert Engels

Nov 5, 2018, 12:20:51 PM
to glert...@gmail.com, grpc.io
Regardless of what Google does or doesn't do, you can't solve the problem without XA transactions - either built in or rolled by hand (lots of work and not very performant). The easiest way is to log everything that will occur in a persistent log, do the operations, and verify that the expected operations were logged successfully - but even that may not be possible, as some operations cannot be rolled back - like log statements in a non-transactional logging system. In that case a new log statement is created that logically supersedes the previous one (same ID, etc.)

It’s a pretty standard CS problem, but the work required can be relaxed if you don’t need full ACID across all resources. 
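
The "persistent log, then verify" approach above can be sketched as a write-ahead intent log (a toy in-memory version; a real one would be durable): record what is about to happen, perform it, mark it done, and let a reconciler find intents with no completion.

```python
# Write-ahead intent log sketch: intents with no matching "done" entry are
# the operations a repair job must inspect (redo, or supersede with a new
# log statement, as described above).

class IntentLog:
    def __init__(self):
        self.entries = []  # (tx_id, phase), phase in {"intent", "done"}

    def record(self, tx_id, phase):
        self.entries.append((tx_id, phase))

    def incomplete(self):
        done = {tx for tx, p in self.entries if p == "done"}
        return [tx for tx, p in self.entries if p == "intent" and tx not in done]

def run(log, tx_id, operation):
    log.record(tx_id, "intent")
    operation()              # may fail; the intent entry survives
    log.record(tx_id, "done")
```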

Christian Rivasseau

Nov 5, 2018, 12:21:36 PM
to glert...@gmail.com, grp...@googlegroups.com
The user group for an RPC framework is not the right place to ask how you should implement your transactions. This is a 100% orthogonal issue, and one you would have with REST services too.





--
Christian Rivasseau
Co-founder and CTO @ Lefty
+33 6 67 35 26 74

Sankate Sharma

Nov 5, 2018, 12:21:39 PM
to glert...@gmail.com, grpc.io
The problem you are describing is pretty common and not specific to gRPC; it will be there for any microservice-based architecture.
Typically the orchestration layer is responsible for making sure that both operations succeed, and for implementing any retry logic. Or have the order service do its job and then log it, with a retry mechanism in case of failure.
Another way you could approach this is by configuring automatic retries in the service mesh (linkerd etc.)
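
Whether it lives in the orchestration layer or the mesh, the retry idea is just a wrapper like the sketch below (note, as discussed later in the thread, that retries only narrow the failure window; they don't give atomicity):

```python
# Retry with exponential backoff for transient failures; the last failure
# is re-raised so the caller still sees a permanent error.

import time

def call_with_retry(fn, attempts=3, backoff=0.01):
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(backoff * (2 ** i))  # 0.01s, 0.02s, ...
```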


For more options, visit https://groups.google.com/d/optout.
--

Sankate Sharma
Principal Engineer
350 Convention Way, Suite 200
Redwood City, CA 94063

Payments API for Platform Businesses


robert engels

Nov 5, 2018, 1:47:14 PM
to glert...@gmail.com, grpc.io

Carl Mastrangelo

Nov 5, 2018, 2:27:36 PM
to grpc.io
<Speaking on my own behalf, rather than Google's>

I think the OP hit the nail on the head: REST is a bad fit for transactions. A previous team I worked on had pretty much the same problem. There are two ways to get transaction-like semantics.

1.  Make all RPCs compare-and-swap-like, effectively making them atomic operations. When querying an object from the service, every object carries a unique version (something like a timestamp). When making updates, the system compares the modified time on the request object with the one it has stored and makes sure there haven't been any changes. This works for objects that are updated infrequently and that don't involve other dependent objects.
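
Option 1 as a sketch (plain Python standing in for the RPC handlers): every object carries a version counter, and an update must cite the version it read or it is rejected.

```python
# Compare-and-swap update: the request carries the version the client read;
# the server applies the change only if the stored version still matches
# (a real gRPC service might map the failure to an ABORTED status).

class VersionedStore:
    def __init__(self):
        self._objects = {}  # key -> (value, version)

    def read(self, key):
        return self._objects.get(key, (None, 0))

    def compare_and_swap(self, key, new_value, expected_version):
        _, version = self._objects.get(key, (None, 0))
        if version != expected_version:
            return False                    # stale read; client must retry
        self._objects[key] = (new_value, version + 1)
        return True
```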

2.  Make a streaming RPC a transaction. One good thing about streaming RPCs is that the messages you send and receive are effectively a consistent snapshot. When you half-close the streaming RPC, it attempts to commit the transaction as a whole, or else returns a failure so you can try again. This makes multi-object updates much easier. The downside is that the API is uglier, because effectively you have a single "Transaction" RPC and all your actual calls are just submessages. It works, but things like stats, auth, interception, etc. get more complicated.
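
Option 2, stripped of the actual streaming machinery, amounts to buffering the streamed submessages server-side and applying them all at the half-close (a toy sketch; real code would validate and apply under a lock or storage transaction):

```python
# "Streaming RPC as transaction" sketch: send() buffers one submessage per
# streamed operation; half_close() is the commit point that applies all
# buffered operations atomically or rejects the whole batch.

class TransactionStream:
    def __init__(self, store):
        self._store = store
        self._pending = []

    def send(self, key, value):
        self._pending.append((key, value))

    def half_close(self):
        if any(key is None for key, _ in self._pending):
            raise ValueError("invalid operation; transaction aborted")
        for key, value in self._pending:  # applied only after all validated
            self._store[key] = value
```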


Personally, I would structure my data to prefer option one, even though it is less powerful. I *really* don't like the idea of implementing my own deadlock detection or lock ordering for an RPC service. If you know locking is not a problem, I think both are valid solutions.

robert engels

Nov 5, 2018, 2:40:21 PM
to Carl Mastrangelo, grpc.io
Not sure if this is what the OP needs, but given the order-service example cited, a fairly simple solution that isn't transactional would be:

Use a service that hands out globally unique txIDs, then:

1) use the audit service to log "handling order for txID"
2) use the order service to commit the order, attaching the txID to the Order
3) use the audit service to log "handled order for txID"

Then it is easy to determine what happened.

If #3 exists, you know the order is committed.
Else if only #1 exists, then either:
    A) the order was committed but the audit log write failed
        - so verify the order exists with that txID and re-log the audit entry
    B) the order failed, so log another audit event "handled order txID, failed"

The above actions would be performed whenever the audit log is found to be "incomplete".
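
The recovery rule above can be written down directly (a sketch with a hypothetical audit-event set and order lookup):

```python
# Given the audit events for a txID and a way to check the order store,
# decide what the repair job should do for an "incomplete" transaction.

def reconcile(tx_id, audit_events, order_exists):
    if ("handled", tx_id) in audit_events:
        return "committed"                 # step 3 exists: nothing to do
    if ("handling", tx_id) in audit_events:
        if order_exists(tx_id):
            return "re-log audit"          # case A: order ok, audit write lost
        return "log failure event"         # case B: order never committed
    return "unknown tx"
```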


glert...@gmail.com

Nov 5, 2018, 5:07:09 PM
to grpc.io
Just to finish up the discussion, at least from my side, I will answer one by one:

@Christian Rivasseau I agree that this is not the correct forum, but as I wasn't the one who opened the thread, I found it really handy to re-ask in a forum where lots of experts could share their own experiences. Sorry about that!

@Sankate Sharma I don't think ad-hoc retry logic at the application level or in the service mesh (linkerd, istio, or even pure envoy) is a solution, because you're just delaying the problem or making it less probable, not solving it.

@Robert Engels Thanks for your ideas and links.

IMHO, and after reading a lot about this, I would try the following, in order:
* Avoid distributed transactions and try to fit the entities that must change together into an existing bounded context, so they share one local transaction
* Saga pattern + eventual consistency, with compensation/reconciliation in case of failures
* CDC for data replication in order to feed other systems
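
The saga option in the list above can be sketched as follows (step names are illustrative): run each step, and if one fails, run the compensations of the already-completed steps in reverse order. That gives eventual consistency with compensation, not ACID.

```python
# Saga sketch: each step is a pair (action, compensation). On failure, the
# completed steps are compensated in reverse order.

def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for _, comp in reversed(done):
                comp()          # undo what already happened
            return False
        done.append((action, compensate))
    return True
```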

I found this article really interesting and a good summary of the current patterns: https://ebaytech.berlin/data-consistency-in-microservices-architecture-bf99ba31636f.

Regards,