Concurrency issues (yes, again)

Sebastian Brudziński

Jun 1, 2018, 12:30:12 PM
to openlm...@googlegroups.com

Hello everyone,

We have recently started working on OLMIS-4728, which tackles an issue that Malawi started experiencing after updating to OpenLMIS 3.3. It is once again related to concurrent updates.

In simple words: a requisition update can currently overwrite most of the requisition's properties (including its status) if it runs concurrently with other requests that change the requisition's state. This is especially noticeable for updates that take a long time (several seconds). Our current mechanism for preventing concurrent updates is based on a timestamp property that is manually set and validated during updates (the received timestamp must match the one in the database). This works, but only when no actual concurrency is involved. For concurrent requests, the validation can pass for multiple requests and all of the updates will be carried out successfully, since the timestamps keep matching until one of the updates commits to the database.

The main problem, therefore, is that we are validating this at the application-logic level. The proposed solution is to move the check to the database level and introduce optimistic locking in one of the following ways:

1. Use optimistic concurrency control via the @Version annotation.
A requisition would have a new property called "version". This field is automatically incremented by Hibernate whenever the entity is modified. Both incrementing numbers and timestamps are supported, but Hibernate recommends a dedicated version number property (https://docs.jboss.org/hibernate/orm/4.0/devguide/en-US/html/ch05.html#d0e2251). Hibernate essentially adds the version property to the "where" clause of the database update query, and if no rows are affected, an org.hibernate.StaleStateException is thrown, indicating the version has changed in the meantime. This ensures that no two updates with the same version will ever be committed to the database. We would still keep our timestamp validation in order to fail faster when a timestamp mismatch is detected; the version-based validation would primarily protect against concurrent updates. This also means the version wouldn't need to be exposed in the REST API. This approach is also well documented here: https://dzone.com/articles/version-based-optimistic
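To illustrate, a minimal sketch of what this could look like on the entity (the field name and mapping details are just my assumption at this point, not a final design):

    import java.util.UUID;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Version;

    @Entity
    public class Requisition {

      @Id
      private UUID id;

      // Hibernate increments this automatically on every successful update and
      // adds it to the WHERE clause of the UPDATE statement. If no rows match
      // (the version changed in the meantime), org.hibernate.StaleStateException
      // is thrown.
      @Version
      private Long version;

      // ... existing requisition fields omitted
    }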

2. Implement a custom post-update trigger.
We would implement a custom trigger that fires after each update of a requisition and validates the timestamps there. We would also remove updating the timestamp from the application-logic level and move it into the trigger (performed only if the validation passes). This would allow us to reuse the timestamp field that we already have, and we could still keep the application-level check to fail faster when a mismatch is detected early. A custom trigger gives us more flexibility, but may also require more work.

"Why do you have concurrent requests in Malawi?"
I can see this question coming, and I've been looking/asking for the reason as well. My best guess right now is a mix of a poor Internet connection and users refreshing and re-saving/re-submitting requisitions when saving takes too long. Whatever the reason, our API should be able to handle concurrent requests (if only because multiple users can work on the same requisition).

I've also done a basic proof of concept for proposed solution #1: I added the version field to the requisition, added an artificial five-second wait in the update endpoint, and was able to get the expected error. Of course, more testing and work is required if we decide to go with this solution.

I'm looking for any feedback you may have on the problem itself or the proposed solutions. Can you see any issues with either approach? Do you have any preferences? Do you see other solutions?

Thanks,
Sebastian.

--

Sebastian Brudziński
Senior Software Developer / Team Leader
sbrud...@soldevelo.com



SolDevelo
Sp. z o.o. [LLC] / www.soldevelo.com
Al. Zwycięstwa 96/98, 81-451, Gdynia, Poland
Phone: +48 58 782 45 40 / Fax: +48 58 782 45 41

Josh Zamor

Jun 1, 2018, 7:02:18 PM
to Sebastian Brudziński, openlm...@googlegroups.com
Thanks for this write-up, Sebastian,

I must be forgetting my history here, as I thought we had already implemented optimistic locking in the database. I didn't realize we attempted it in Java - which might work if someone has a very outdated Requisition (likely from an old cache on the device), but not truly for concurrent requests.

Two things come to mind:

  • I find it unlikely that we really have different users trying to update the same Requisition at the same time, nor that we need pessimistic locking.
  • That said, we should have solved the "double click" problem, which I suppose we thought we did last time; however, it looks like we left a gap by not doing what all payment APIs do. The basic idea is to have a form submission/API call send along a random number in an HTTP header which uniquely identifies the request. This request could be saving or submitting a Requisition, or whatever else it is the user is attempting to "do once" but may accidentally do twice (or which our HTTP library might retry if it doesn't hear back from the network). On the server end we look at that number, and once we've seen it we mark it down as "used" and never accept another request that uses that number. It's important that this number be remembered outside of the action's regular database transaction - you don't want to entertain two requests simply because the first request hasn't committed yet. An in-memory cache that writes this number down immediately (it could be Redis, though that's more than strictly needed) is, I think, the most effective approach; see the sketch after this list. A quick Google search brought up what Stripe does: https://stripe.com/docs/api?lang=curl#idempotent_requests
    • I think we should use UUID v4, as we do already for resource IDs.
      • We may need to use something with more context than just a random UUID (v4) if we think that the processing of the request is so long that the user could reload the Requisition view, get a new number, and send a new request all while the original request is being processed and the transaction has yet to commit.
    • We should also expire the keys/unique numbers, and 24h is probably good for us as well.
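
To make the idea concrete, here's a rough sketch of what the server-side check could look like as a servlet filter backed by Redis. The header name, key prefix, and the filter approach itself are all assumptions on my part, not a settled design:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.springframework.data.redis.core.StringRedisTemplate;
    import org.springframework.web.filter.OncePerRequestFilter;

    public class IdempotencyKeyFilter extends OncePerRequestFilter {

      private final StringRedisTemplate redis;

      public IdempotencyKeyFilter(StringRedisTemplate redis) {
        this.redis = redis;
      }

      @Override
      protected void doFilterInternal(HttpServletRequest request,
          HttpServletResponse response, FilterChain chain)
          throws ServletException, IOException {
        String key = request.getHeader("Idempotency-Key");
        if (key == null) {
          // No key sent - process the request normally.
          chain.doFilter(request, response);
          return;
        }
        // setIfAbsent is atomic, so two concurrent requests with the same key
        // cannot both claim it. Crucially, this write happens outside the
        // action's regular database transaction.
        Boolean claimed = redis.opsForValue().setIfAbsent("idempotency:" + key, "used");
        if (Boolean.TRUE.equals(claimed)) {
          // Expire the key after 24h; a production version would want to set
          // this atomically together with the claim.
          redis.expire("idempotency:" + key, 24, TimeUnit.HOURS);
          chain.doFilter(request, response);
        } else {
          response.sendError(HttpServletResponse.SC_CONFLICT,
              "Idempotency-Key has already been used");
        }
      }
    }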

I don’t think we need anything more than optimistic locking and idempotent requests. The trigger would be useful if we wanted to protect the table against non-Hibernate SQL access; however, our architecture specifically forbids any other client from using the Service’s database directly, so I think we can skip the trigger.

Best,
Josh



Sebastian Brudziński

Jun 5, 2018, 12:10:10 PM
to openlm...@googlegroups.com, Josh Zamor

Thanks for the input, Josh.

I'd agree that we don't need to consider pessimistic locking for now.

On idempotent requests - that would definitely be helpful as well. I think we need something more than a random UUID generated at page load as the idempotency-key. I'm not sure yet what would work better and possibly include some context, though... I'll follow up on that. Two immediate thoughts at this point: how would we want to handle the case of two requests with the same idempotency-key? Would we immediately return an error code (e.g. 409 Conflict) for the second and following requests? Also, would the idempotency-key be required in requests, or just supported? It looks like the Stripe API simply supports idempotency but doesn't enforce it, which sounds like a good solution for us as well. We could use it in the UI, but still give other clients the option to call the API without generating random numbers.

It sounds like we would want to implement both optimistic locking and idempotent requests as part of OLMIS-4728 - am I understanding this correctly? As for the optimistic locking solution: are we happy with the incremental version field on the requisition object?

Best regards,
Sebastian.

Josh Zamor

Jun 6, 2018, 2:50:53 AM
to Sebastian Brudziński, openlm...@googlegroups.com
Agreed that both idempotent requests and optimistic locking would be needed for OLMIS-4728.  And for optimistic locking I have no objection to the incremental version field managed via Hibernate.  It sounds like the right approach.

As to v4 UUIDs for the idempotency-key, one thought that comes to mind is that it could be enough to have this key cached/stored in the UI’s storage along with the Requisition object when making the request. In this way no additional context would need to be encoded in the key. This works in a very similar way for payment APIs - it’s important, as a client of the API, to remember the idempotency-key you’re trying to use until you know that the key is no longer useful. The key is no longer useful to the client once it hears back from the API. That’s essentially the lifecycle of this ID: the client needs to associate it with an action it’s trying to attempt, and refreshing the page shouldn’t cause the UI to forget that action.

That leads to the question of requiring the key and error codes. I think it’s fine to keep the key optional (but recommended). For the error code, the 409 sounds right so long as the server has seen the key before. And that brings up a useful detail: if the server responds to a request with a 400, 403, etc., I don’t think it’s useful to mark the key as used. If the client doesn’t hear the response and retries, sending a 409 back masks the original 400 or 403. However, if the response is a 200, mark the key as used and send back a 409 with the Location header set to the resource that would have been returned in the original 200’s body (if any). For PUT, of course, this isn’t really needed, but for POST I think it makes sense.
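
Roughly, that replayed-create response could be built like this (assuming we store the created resource's URI alongside the used key - the names here are illustrative):

    import java.net.URI;
    import org.springframework.http.HttpHeaders;
    import org.springframework.http.HttpStatus;
    import org.springframework.http.ResponseEntity;

    public class ReplayResponses {

      // Response for a request whose Idempotency-Key was already used by a
      // successful (2xx) request: a 409 whose Location header points at the
      // resource the original request created, so the client can fetch it.
      public static ResponseEntity<Void> conflictWithLocation(URI createdResource) {
        HttpHeaders headers = new HttpHeaders();
        headers.setLocation(createdResource);
        return new ResponseEntity<>(headers, HttpStatus.CONFLICT);
      }
    }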

Best,
Josh

Sebastian Brudziński

Jun 8, 2018, 9:59:16 AM
to openlm...@googlegroups.com, Josh Zamor

The cached idempotency-key associated with the requisition object would be sufficient, but we would need to consider exceptional cases. As you mention, we need to hear back from the API to know that the key has been used and is no longer valid. If the user refreshes the page or closes the website completely, we will never hear back from the API and will keep the cached key. The API request, however, will eventually get processed, causing the key to no longer be valid (provided the response is 200 OK). I think our UI client would need a way to recognize that the requisition has changed and that a new key needs to be generated.

Agreed that any validation errors don't need to mark the idempotency-key as used. 

Best regards,
Sebastian.

Sebastian Brudziński

Jun 8, 2018, 1:10:22 PM
to openlm...@googlegroups.com, Josh Zamor

Also, a follow-up thought on idempotent requests: should we mark the keys as used at the very beginning of the request and then immediately free them if the response wasn't 2xx? If we only mark keys as used once the request has finished executing, we aren't really solving anything, since an endpoint would still accept new requests with the same key while it is executing (and we would therefore still have concurrent requests). An alternative would be to check whether the key exists at the end of the endpoint execution, rather than at the beginning, and roll back the whole transaction with an exception if the key exists. Thoughts?
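
A compact sketch of that first variant - claim the key up front, free it when the response isn't 2xx. The KeyStore interface here is hypothetical and would map onto atomic storage operations (e.g. Redis SETNX/DEL):

    import java.util.function.Supplier;
    import org.springframework.http.HttpStatus;
    import org.springframework.http.ResponseEntity;

    public class IdempotentExecutor {

      // Hypothetical key store with an atomic claim and a release.
      public interface KeyStore {
        boolean claim(String key);
        void release(String key);
      }

      private final KeyStore keyStore;

      public IdempotentExecutor(KeyStore keyStore) {
        this.keyStore = keyStore;
      }

      public ResponseEntity<?> execute(String key, Supplier<ResponseEntity<?>> action) {
        if (!keyStore.claim(key)) {
          // Another request has already claimed this key - in progress or done.
          return ResponseEntity.status(HttpStatus.CONFLICT).build();
        }
        ResponseEntity<?> response = action.get();
        if (!response.getStatusCode().is2xxSuccessful()) {
          // Free the key so the client can retry after e.g. a validation error.
          keyStore.release(key);
        }
        return response;
      }
    }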

Best regards,
Sebastian.

Josh Zamor

Jun 11, 2018, 1:17:59 AM
to Sebastian Brudziński, openlm...@googlegroups.com
The scenario you described should be covered by optimistic locking using the version field. One thing I think I didn’t clarify is that the idempotency-key is more for creating new resources - typically a POST/PUT to /api/resourceName - and less for a PUT to /api/resourceName/id.

Check out this document on strong/weak ETags and If-Match: https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests

Double-checking the Requisition, I noticed that we don’t yet use ETags. A good alignment here would be to introduce Hibernate’s optimistic locking using a version field, as we discussed, and then use that version as a weak ETag for the Requisition. This has the benefit that we can then use the version field in the If-Match header, and it’s simple for a client with a network connection to send a lightweight HEAD request with that ETag to see if it has stale state.

For Requisition then I think we should:
  • Return an ETag for the Requisition which is the version field we talked about for optimistic locking. This is a "weak" ETag, so it needs a "W/" in front (this and the If-Match check are sketched after this list).
  • Use If-Match in PUT requests with the previous ETag (aka version).
  • Use an idempotency-key for create/POST calls such as initiate, approve, submit, etc. (the status changes).
  • We shouldn’t use the modifiedDate field for stale-state or concurrency checking; the ETag with If-Match is better suited. However, this doesn’t mean we have to remove it from the response - I won’t presume that no other clients are using this field for other purposes at this point.
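
Here's a hedged sketch of those first two points; the paths, the DTO, and the 412 for a failed If-Match are my assumptions rather than anything we've agreed on:

    import java.util.UUID;
    import org.springframework.http.HttpStatus;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.PutMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RequestHeader;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class RequisitionVersionController {

      // Minimal stand-in for the real requisition DTO.
      public static class RequisitionDto {
        private Long version;
        public Long getVersion() { return version; }
      }

      @GetMapping("/api/requisitions/{id}")
      public ResponseEntity<RequisitionDto> get(@PathVariable UUID id) {
        RequisitionDto requisition = find(id);
        // Expose the optimistic-locking version as a weak ETag.
        return ResponseEntity.ok()
            .eTag("W/\"" + requisition.getVersion() + "\"")
            .body(requisition);
      }

      @PutMapping("/api/requisitions/{id}")
      public ResponseEntity<RequisitionDto> update(@PathVariable UUID id,
          @RequestHeader(value = "If-Match", required = false) String ifMatch,
          @RequestBody RequisitionDto dto) {
        String currentEtag = "W/\"" + find(id).getVersion() + "\"";
        if (ifMatch != null && !ifMatch.equals(currentEtag)) {
          // The client holds a stale version; 412 is the standard If-Match failure.
          return ResponseEntity.status(HttpStatus.PRECONDITION_FAILED).build();
        }
        return ResponseEntity.ok(save(dto));
      }

      private RequisitionDto find(UUID id) {
        return new RequisitionDto(); // lookup omitted in this sketch
      }

      private RequisitionDto save(RequisitionDto dto) {
        return dto; // persistence omitted; Hibernate bumps the version on commit
      }
    }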

As to the point about error responses and the idempotency-key, I think I muddied the water with too much of an edge case. When creating a resource, if the previous request failed and now we’re getting a 409 with no Location header set, that should be enough to tell the client that the cause is the idempotency-key. A body with an error message about the idempotency-key would clarify it. Either way, the client knows to change the key and retry the request to find out why the original request failed. Not ideal, but it’s an edge case. If during a create a 409 is returned with a Location header, then we know the previous request succeeded and where to go to load the resource we did create.

I also agree with the point you raised: it’s important to check whether the key is used as soon as possible - or, more importantly, not after a transaction has been committed to the DB. While rolling the transaction back does sound plausible, I think I’d prefer to get a 409 back to the client as soon as possible. This lets the client know quickly what has happened. I.e., if the API can respond with a 409 ASAP, then as a client I know quickly that the key can’t be used again. If the client has to wait some N seconds until a bunch of logic checks and I/O occur on the server, that’s N seconds wasted with no feedback over a likely poor network.

The last thing I’d note is that the idempotency-key check should likely happen after all the authorization steps. It’d be weird to get a 409 for something when in reality I should have gotten a 403 first.

Best,
Josh

Sebastian Brudziński

Jun 14, 2018, 11:16:26 AM
to openlm...@googlegroups.com, Josh Zamor

Thanks Josh, this all makes sense.

One doubt I still have is how we should handle the case where we receive a second request (same Idempotency-Key) while the first one is still being executed. If we were to immediately return a 409 (without Location set, since the request hasn't finished yet), this may give the client the wrong impression that something went wrong and that it's safe to retry the same request with another Idempotency-Key.

It seems that we need a way to distinguish between a 409 caused by another request in progress and a 409 from a completely processed request. Would you agree? This could potentially be a custom response header or a different error message.

Best regards,
Sebastian.

Josh Zamor

Jun 18, 2018, 4:34:25 PM
to OpenLMIS Dev
Good point Sebastian,

In that case I still think we return a 409 as soon as possible to the second request, and instead of assuming that no Location header means the previous request failed, we put a more specific error message in the body of the response saying that the key has been used. A client that received a 409 and knows it's an idempotency-key issue should of course never simply change the key and retry; rather, the user should likely have to restart the process. For our POST endpoints, which tend to initiate or change the status of a Requisition, I think that "restarting the process" would be reloading the initiate screen or the view-requisition screen - essentially rechecking whether that requisition can now be initiated or has changed status. The time it takes to do that will likely (but of course not guaranteed) result in the client finding the truth. And again, the point of an idempotency key isn't to make a POST request idempotent forever, but rather long enough that a client won't accidentally retry the same request.

That seems like the most straightforward approach; however, there are at least two others:
  • Block the second request from returning a 409 quickly until the first request completes - this seems difficult and also quite undesirable for network usage and even UX, so I won't go into it further, but I'll mention it for completeness...
  • Save the response to any request which has an idempotency key, attached to the key in storage. This means that when we return a 409, we can also return what the previous request returned. For responses which completed but didn't make it back to the client, that would result in more efficient network usage. And for the specific edge case we're discussing, a 409 without the previous response would let us know quickly that the first request hasn't completed yet, and that it's safe to keep requesting with the same key and getting a 409 until the previous request's response is attached to the key. This is actually quite desirable, I think, but it'd certainly increase the complexity of the implementation - something I'm not sure is worth it at the moment. If we did go this route, I'd encourage Redis more strongly here (which, by the way - did we choose Redis already, or does our key storage tool support TTL?)
Thoughts?

Best,
Josh


Sebastian Brudziński

Jun 19, 2018, 8:32:28 AM
to openlm...@googlegroups.com

Thanks Josh,

I think we can start with the version that doesn't involve saving the response for now, and easily expand our idempotent-request handling with that feature at a later point if desired. We haven't picked our key storage tool yet, but I think it would be cool to finally introduce Redis. Perhaps we can make more use of it later as well (caching?).

Best regards,
Sebastian.



Josh Zamor

Jun 25, 2018, 4:51:58 AM
to OpenLMIS Dev
Thanks Sebastian, that sounds reasonable.

And yes, I'd like us to get back toward caching for select Resources, likely with Redis. We've stayed away from it in the past; however, some of our Resources are finally reaching a decent level of maturity, so I think we should try again. Key to this, though, is our caching capability on the front-end - in particular, I'd like to know what we think our limits are. Can we store 10k items in the UI's database, search them, and efficiently update those that are out of date yet? 100k?

We don't need to address that here, but it's the direction I think we need to move in.

Best,
Josh

