Queues, Transactions, Timeouts...


jules....@gmail.com

Jun 29, 2016, 5:45:47 AM
to Hazelcast
Guys,

I'm working on a POC for extending our use of Hazelcast from just a
distributed Map to distributed Queues as well.

I am hoping that we can use Queues as an integral part of our HA
strategy.

Unfortunately, I appear to have either misunderstood how transaction
timeouts are meant to work or I have found a bug or...

I attach a test (I wasn't sure whether I should go straight to the issue
tracker). It fails on hazelcast-3.6.3 / java-1.8.0_77-b03 / Mac OS X Yosemite.

The scenario :

- someone puts a message on a queue
- client1
 - begins a transaction
 - takes the message
 - hangs for a time longer than the transaction timeout
 - tries to commit its transaction (unsuccessfully)
- client2
 - starts a transaction
 - takes the same message
 - commits the transaction

It looks as if the transaction manager is NOT rolling the transaction
back after the timeout and NOT putting the message back onto the
queue. client2 is blocking forever as it waits for the message to
reappear on the queue.

Have I missed something ?

Are my assumptions wrong ?

Any help would be very much appreciated.


Jules

TransactionTimeoutTest.java

jules....@gmail.com

Jun 29, 2016, 6:55:52 AM
to Hazelcast, jules....@gmail.com
BTW :-)

I reread my test and noticed that I was using TransactionType.TWO_PHASE

The doc for this says:

"...by automatically copying the backlog to another member..."

and I am only running one server node in this example.

So, I threw in another server - but this did not seem to change the outcome :-(


Jules

Ali Gurbuz

Jun 30, 2016, 6:22:23 AM
to Hazelcast, jules....@gmail.com
Thank you for the report and the reproducer. I've created an issue here https://github.com/hazelcast/hazelcast/issues/8483

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/5d6168ae-2be7-4c27-8e8c-6f0df9705b59%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--

Ali Gurbuz
Distinguished Engineer

Mahir İz Cad. No:35, Altunizade, İstanbul
a...@hazelcast.com 
+90 507 857 7815
skype: isbiroglu
@aligurbuz

jules....@gmail.com

Jun 30, 2016, 10:48:30 AM
to Hazelcast, jules....@gmail.com
Ali,

Thanks for picking this up.

A couple of further questions / points.

Your comment on the issue says "...one should call rollback explicitly upon commit failure in order to release resources".

I don't know how Hazelcast implements this stuff, but I would expect the TransactionManager to set a timeout when I begin a transaction and then unilaterally cancel and rollback that transaction if it is still extant when the timer goes off.

If client2 has to wait for client1 to explicitly rollback the transaction after it fails to commit, it may have to wait a very long time as client1 has obviously got a problem which may block it indefinitely. If, by chance, client1 does come back to life and try an explicit rollback then should this not fail as well ? By now the transaction should have been picked up by the TransactionManager and rolled back. Unless rolling back is an idempotent operation, explicit rollback code will start failing when the issue is properly fixed...
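
To make the expectation above concrete, here is a minimal, stdlib-only Java sketch of a transaction manager that arms a timer when a transaction begins and rolls it back unilaterally if it is still open when the timer fires. It is a toy model, not Hazelcast's implementation: `AutoRollbackQueue`, `Tx`, and all of their methods are hypothetical names invented for illustration. Rollback is deliberately idempotent, so a late explicit rollback from a revived client1 does nothing rather than failing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy in-memory queue (hypothetical, not Hazelcast) with self-expiring transactions. */
public class AutoRollbackQueue {
    private final Deque<String> items = new ArrayDeque<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "tx-timeout");
                t.setDaemon(true); // the timer thread must not keep the JVM alive
                return t;
            });

    public synchronized void offer(String item) {
        items.addLast(item);
        notifyAll(); // wake any transaction blocked in take()
    }

    /** Begins a transaction and arms the timer that rolls it back on timeout. */
    public Tx begin(long timeout, TimeUnit unit) {
        Tx tx = new Tx();
        Runnable onTimeout = tx::rollback;
        timer.schedule(onTimeout, timeout, unit);
        return tx;
    }

    public final class Tx {
        private final Deque<String> taken = new ArrayDeque<>();
        private boolean finished; // guarded by the enclosing queue's monitor

        /** Blocks until an item is available, then reserves it for this transaction. */
        public String take() throws InterruptedException {
            synchronized (AutoRollbackQueue.this) {
                while (!finished && items.isEmpty()) {
                    AutoRollbackQueue.this.wait();
                }
                if (finished) {
                    throw new IllegalStateException("transaction is no longer active");
                }
                String item = items.removeFirst();
                taken.addLast(item);
                return item;
            }
        }

        /** Fails if the transaction has already timed out or finished. */
        public void commit() {
            synchronized (AutoRollbackQueue.this) {
                if (finished) {
                    throw new IllegalStateException("transaction timed out or already finished");
                }
                finished = true;
                taken.clear(); // reserved items are now consumed for good
            }
        }

        /** Idempotent: called by the timer, and safe to call again from the client. */
        public void rollback() {
            synchronized (AutoRollbackQueue.this) {
                if (finished) {
                    return;
                }
                finished = true;
                // Put reserved items back at the head of the queue in their original order.
                while (!taken.isEmpty()) {
                    items.addFirst(taken.removeLast());
                }
                AutoRollbackQueue.this.notifyAll();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AutoRollbackQueue queue = new AutoRollbackQueue();
        queue.offer("m1");

        Tx client1 = queue.begin(100, TimeUnit.MILLISECONDS);
        client1.take();
        Thread.sleep(300); // client1 hangs for longer than its timeout

        try {
            client1.commit();
        } catch (IllegalStateException expected) {
            System.out.println("client1 commit failed: " + expected.getMessage());
        }
        client1.rollback(); // late explicit rollback is a harmless no-op

        Tx client2 = queue.begin(1, TimeUnit.SECONDS);
        System.out.println("client2 took: " + client2.take());
        client2.commit();
    }
}
```

In this model client2 never depends on client1 coming back to life: the timer returns the message to the queue, client1's late commit is rejected, and its late rollback changes nothing.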

You have committed a change for this issue, but I'm not sure whether it was intended to be a full or partial fix. I've git-pulled and rerun my test with a second server node added, both with and without an explicit rollback by client1 after the failed commit. I still seem to be experiencing the issue as described :-(

I'm very grateful for your help with this. It's a pretty fundamental issue for me: it means that I can't promote Hazelcast Queues as an HA solution to my client, because I could lose a message in this way. I have tried shutting down client1 immediately after explicit rollback and putting a second message on the queue. Client2 receives the second message, but the first one just disappears, although client1's transaction was never successfully committed. I'm also worried now that this may affect other Hazelcast collections.

regards,


Jules

Ali Gurbuz

Jul 1, 2016, 3:26:59 AM
to Hazelcast, Jules Gosnell
Hi Jules,

My comment about `calling rollback explicitly` was about Transactional Collections (IQueue, IList, ISet). Your expectation that resources are released automatically upon timeout is valid, but for now we support it only for lock-based data structures (IMap, MultiMap). We are planning to add this capability in our next release, 3.8.

The solution above is a partial fix, aimed at the bug where resources are not released even if you call rollback explicitly.

PS: I've copied your test into my PR and it is passing
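
The explicit-rollback advice above can be sketched with a small stdlib-only Java model. It assumes, as this reply describes for transactional collections, that a timed-out transaction fails to commit but keeps its reserved item until rollback is called. `ToyQueue`, `ToyTx`, and `commitOrRollback` are hypothetical names for illustration only; in Hazelcast 3.x the corresponding client calls would be `commitTransaction()` and `rollbackTransaction()` on a `TransactionContext`. The model is single-threaded and not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model (not Hazelcast): a timed-out transaction holds its item until rollback(). */
public class ExplicitRollbackDemo {

    static final class ToyQueue {
        private final Deque<String> items = new ArrayDeque<>();

        void offer(String item) { items.addLast(item); }

        int size() { return items.size(); }

        ToyTx begin(long timeoutMillis) { return new ToyTx(this, timeoutMillis); }
    }

    static final class ToyTx {
        private final ToyQueue queue;
        private final long deadlineNanos;
        private final Deque<String> taken = new ArrayDeque<>();
        private boolean finished;

        ToyTx(ToyQueue queue, long timeoutMillis) {
            this.queue = queue;
            this.deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
        }

        /** Non-blocking: reserves and returns the head item, or null if the queue is empty. */
        String poll() {
            String item = queue.items.pollFirst();
            if (item != null) {
                taken.addLast(item);
            }
            return item;
        }

        /** Throws after the deadline, but does NOT release the reserved items. */
        void commit() {
            if (finished) {
                throw new IllegalStateException("transaction already finished");
            }
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("transaction timed out");
            }
            finished = true;
            taken.clear();
        }

        /** Returns reserved items to the head of the queue; a no-op once finished. */
        void rollback() {
            if (finished) {
                return;
            }
            finished = true;
            while (!taken.isEmpty()) {
                queue.items.addFirst(taken.removeLast());
            }
        }
    }

    /** Commits, and rolls back explicitly if the commit fails. Returns true on success. */
    static boolean commitOrRollback(ToyTx tx) {
        try {
            tx.commit();
            return true;
        } catch (IllegalStateException e) {
            tx.rollback(); // without this call the item stays reserved forever
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ToyQueue queue = new ToyQueue();
        queue.offer("m1");

        ToyTx client1 = queue.begin(50);
        client1.poll();
        Thread.sleep(150); // client1 overruns its timeout
        System.out.println("client1 committed: " + commitOrRollback(client1));

        ToyTx client2 = queue.begin(1000);
        System.out.println("client2 took: " + client2.poll());
        System.out.println("client2 committed: " + commitOrRollback(client2));
    }
}
```

The weakness raised earlier in the thread still applies to this pattern: if client1 never reaches the `catch` block, nothing returns the message to the queue.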




jules....@gmail.com

Jul 6, 2016, 8:25:03 AM
to Hazelcast, jules....@gmail.com
Ali,

Thanks for getting back to me.

I've refreshed my tree and run my test again - as far as I can see, the message is still getting lost.

What version of Java and OS are you using ?
Have you made any changes to the test ? If so, could I have a copy ?

many thanks,

Jules

jules....@gmail.com

Jul 6, 2016, 8:25:03 AM
to Hazelcast, jules....@gmail.com
Ali,

I'm sure that I sent a reply to this several days ago, but it does not seem to have made it to the list :-(

It's good news that my test is working for you. However, I tried running it against a freshened master after you posted about this and was still losing messages.

Can you confirm

- your test platform - i.e. jvm, os, versions etc.
- whether you have made any changes to the test (if so, maybe you could attach the new version)

I am keen to nail this issue and put Hazelcast queues into my project.

many thanks,


Jules

Jules Gosnell

Jul 22, 2016, 4:43:30 AM
to Hazelcast, jules....@gmail.com
Ali,

I thought that I would check the list to see if you had come back to me on this - my last post was over 2 weeks ago.

To recap:

- I found a bug - a substantial hole in your TransactionManager
- I reduced it to a testcase and submitted it
- you implied that you had fixed it
- I could not replicate your fix against master
- I told you
- silence....

This will be my last posting on this subject.


Jules

Ali Gurbuz

Jul 22, 2016, 8:28:49 AM
to Hazelcast, Jules Gosnell
Hi Jules,

Sorry for the late response; we were on a tight schedule for the 3.7 release.
You're still encountering the lost-message issue because the fix had not been merged yet.
Here is the PR for the fix: https://github.com/hazelcast/hazelcast/pull/8572
As you can see, I raised the PR when you mentioned the issue, but it was merged only 3 days ago.
Can you please re-run your test against the latest master to see whether the issue still occurs?

Sorry again for the inconvenience.



Mehmet Dogan

Jul 22, 2016, 8:53:54 AM
to haze...@googlegroups.com

Ali Gurbuz

Jul 22, 2016, 9:04:34 AM
to Hazelcast

Thank you for the correction

