Serious problem: Rollback of data on HRD

Raymond C.

Aug 14, 2011, 6:21:42 AM
to google-a...@googlegroups.com
I recently ran into a problem after migrating to HRD:

My application is a social online game that I migrated from the M/S to the HR Datastore around 3 weeks ago.  For the past 2 weeks I have been receiving reports from players that their game progress is suddenly rolled back while they are playing, with the progress made in the last few days missing.  I have checked the reports against data on other entities (in different entity groups) and they are legitimate: at least several days of progress really are rolled back, and all updates to the entities over those days are missing.

Player data is retrieved by id ( Player.get_by_id(player_id) ), and because the gap is so large (days) I do not believe the problem is in my code (nothing in my code caches player data).

This never happened in nearly 1 year before the migration, so I suspect it is related to HRD.  I remember an earlier thread here that reported data being rolled back on HRD, but I cannot find it anymore.

As you know, the distributed nature of the App Engine datastore makes this kind of problem very hard to monitor and confirm.  Has anyone else run into this problem, or suspected it, with an HRD application?

Ikai Lan (Google)

Aug 16, 2011, 3:37:33 PM
to google-a...@googlegroups.com
This doesn't sound right to me. If you can get the keys and app IDs of some objects that are being "rolled back" we can take a look to see if there's something inherently wrong with those entities. There's a known issue with objects that are too large - for instance, objects with lots of list properties.

--
Ikai Lan 
Developer Programs Engineer, Google App Engine



--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/hC-TziRgSkUJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Raymond C.

Aug 16, 2011, 9:41:39 PM
to google-a...@googlegroups.com
Since this is player data and players keep updating the entities, is it still useful if I give you the app id + keys of entities that have already been updated again after a rollback?

The entity is actually relatively large (a historical design problem). May I know what could happen with large entities, and is it only on HRD?

Greg

Aug 16, 2011, 10:11:38 PM
to Google App Engine
Please check your logs for a warning "Transaction collision.
Retrying...".

Something very similar is happening on my app, where DB put()s
silently fail (equivalent to the entity being rolled back) very
occasionally. This has only started happening after moving to HR.

In my app, I get this warning very consistently (every time) at
exactly the time the entity is supposed to be stored. I would be very
interested to hear if you find this warning too. If so, I think it
points to a bug in the transaction collision handler in put(). Please
let me know!

See my earlier post here:
http://groups.google.com/group/google-appengine/browse_thread/thread/3a1e01d2685a8f16/997c40bdb49dd132

Cheers
Greg.

Raymond C.

Aug 16, 2011, 10:29:11 PM
to google-a...@googlegroups.com
I do notice a lot of "Transaction collision. Retrying..." warnings in my app, but I thought those were "ok to ignore" warnings.  I just noticed that they were not there on M/S, though.

I have no idea whether those warnings make my db.put() fail or not.  Now that you mention it, I realize there have been quite a number of support emails from players recently saying that items are not in their inventory after they acquired them, all since migrating to HRD.

It looks like the problem is more serious than I thought. 

Robert Kluin

Aug 17, 2011, 1:14:02 AM
to google-a...@googlegroups.com
The HR datastore can handle a lower number of writes / second to the
same entity group. That's probably why you're seeing those warnings
more often.

I wonder if some of those requests are eventually failing completely?
As I recall, the transaction will retry three times by default then
fail.
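
To make the retry-then-fail behaviour concrete, here is a minimal sketch in plain Python, not the App Engine API. `CollisionError`, `TransactionFailed`, `run_with_retries` and `make_flaky` are hypothetical names invented for this illustration, and reading "three times" as three retries after the first attempt is an assumption. The only point is that a retry budget, once exhausted, should surface as an exception rather than as a silent no-op.

```python
class CollisionError(Exception):
    """Hypothetical stand-in for a write-contention error on one entity group."""


class TransactionFailed(Exception):
    """Hypothetical error raised once every attempt has collided."""


def run_with_retries(txn, retries=3):
    """Call txn(); on a collision retry up to `retries` more times, then fail loudly."""
    for _ in range(retries + 1):  # first attempt plus `retries` retries (assumed)
        try:
            return txn()
        except CollisionError:
            continue  # contention: try again
    raise TransactionFailed("gave up after %d attempts" % (retries + 1))


def make_flaky(collisions):
    """Build a fake transaction that collides `collisions` times before committing."""
    state = {"left": collisions, "calls": 0}

    def txn():
        state["calls"] += 1
        if state["left"] > 0:
            state["left"] -= 1
            raise CollisionError()
        return "committed"

    return txn, state
```

With two collisions the call commits on the third attempt; with more collisions than the budget allows, the caller sees `TransactionFailed` instead of a write that quietly never happened.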

Robert

Greg

Aug 17, 2011, 11:35:25 PM
to Google App Engine
On Aug 17, 5:14 pm, Robert Kluin <robert.kl...@gmail.com> wrote:
> I wonder if some of those requests are eventually failing completely?

But then we should be getting an exception, and we don't.

Robert Kluin

Aug 18, 2011, 12:34:47 AM
to google-a...@googlegroups.com
True.

You guys should file production issue tickets. I would very much like
to know what the cause and resolution winds up being; it would be
great if you could post the ticket numbers back here (or email me the
issue #) -- I'd like to follow it.


Robert

Tom Phillips

Aug 19, 2011, 3:42:55 PM
to Google App Engine
I'm seeing the same since moving to HR last week. It happens rarely,
and the only clue is a ConcurrentModification in the logs (Java in my
case).

Pure speculation, but it looks to me like some sort of background
transaction retry might overwrite the entity with stale data, rather
than a rollback.

Scenario for me is like:
pre) Entity bob has property height=70
1) thread 1, transaction X, height=75 -> commit() [appears to succeed]
2) Meanwhile (within a second or so) thread 2, transaction Y, height=80 -> commit() [ConcurrentModificationException] -> I pause 500ms and retry -> commit() [appears to succeed this time]
3) For a while (up to a minute or so, but possibly much longer) all get-by-key on bob show height==80 (ok)
4) Another while later all get-by-key on bob suddenly show height==75, as per transaction X (not good!)

My speculation is that the ConcurrentModification could sometimes
indicate there was disruption of BOTH transaction X and Y, even though
reported for Y. Perhaps X had gotten past commit() call but hadn't
yet reached milestone A of http://code.google.com/appengine/articles/transaction_isolation.html,
and was also (temporarily) aborted due to the contention.

Then some sort of background retry on X sometimes (rarely) re-inserts
it into the transaction queue BEHIND my explicit retry on Y, and
eventually overwrites with the whole entity state from X in 1)

And it appears that sometimes the background retry of X may not even
happen till a good while later.

Any chance something like this is happening?

/Tom

Alfred Fuller

Aug 19, 2011, 9:34:47 PM
to google-a...@googlegroups.com
Are your transactions idempotent? It is possible that the transaction is being run (and succeeding) twice in this case. What other request is colliding with the first? You are not using any non-ancestor queries or setting read_policy=EVENTUAL on any reads, correct?
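
As a rough illustration of why idempotence matters when a transaction body may be run twice, here is a sketch in plain Python rather than datastore code. The player dict, its `applied` set, the `purchase_id` argument and both function names are assumptions made up for the example.

```python
def grant_item_non_idempotent(player, item):
    """Running this body twice grants the item twice and charges twice."""
    player["inventory"].append(item)
    player["coins"] -= 10


def grant_item_idempotent(player, item, purchase_id):
    """Record which purchase ids were applied, so a repeated run is a no-op."""
    if purchase_id in player["applied"]:
        return  # this purchase was already applied by an earlier run
    player["applied"].add(purchase_id)
    player["inventory"].append(item)
    player["coins"] -= 10


def new_player():
    """Hypothetical player state used only for this illustration."""
    return {"inventory": [], "coins": 100, "applied": set()}
```

If the first function is run twice for one purchase, the player ends up with two items and 80 coins; the second leaves one item and 90 coins however many times it is re-run with the same `purchase_id`.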

Tom Phillips

Aug 20, 2011, 1:38:13 PM
to Google App Engine
X and Y used only get-by-key, and read policy was unset in
jdoconfig.xml, so using the default strong. These transactions were
idempotent.

BUT..I scoured the code again and sure enough, there is a third entry
point I missed in my HRD prep that reads the entity state via..you
guessed it..a query across entity groups.

So there is a third transaction Z that can collide with Y, and it's
the one writing the stale state it sees from the query (the state after X
is what it overwrote the entity with).

I thought I had covered all of my HR migration tweaks - but missed
this one entry point at least. Easily fixed now.
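
For anyone hitting the same thing, the failure mode can be replayed with a toy model in plain Python. Everything here is a hypothetical sketch: `ToyStore`, `z_buggy`, `z_fixed`, `replay` and the `score` property are invented names, and the model only assumes that get-by-key is strongly consistent while a cross-group query may read from an index that lags behind. It ignores transactions entirely.

```python
class ToyStore(object):
    """Toy model: get_by_key is strongly consistent, query() reads a lagging index."""

    def __init__(self):
        self._rows = {}   # authoritative entity state
        self._index = {}  # snapshot served to queries, updated only on apply_index()

    def put(self, key, entity):
        self._rows[key] = dict(entity)

    def get_by_key(self, key):
        return dict(self._rows[key])

    def apply_index(self):
        """Let the query view catch up with the authoritative state."""
        self._index = dict((k, dict(v)) for k, v in self._rows.items())

    def query(self):
        return dict((k, dict(v)) for k, v in self._index.items())


def z_buggy(store):
    """Writes back whatever entity state the (possibly stale) query returned."""
    for key, entity in store.query().items():
        entity["score"] += 1
        store.put(key, entity)


def z_fixed(store):
    """Uses the query only to find keys, then re-reads each entity by key."""
    for key in store.query():
        entity = store.get_by_key(key)
        entity["score"] += 1
        store.put(key, entity)


def replay(z):
    store = ToyStore()
    store.put("bob", {"height": 70, "score": 0})
    store.put("bob", {"height": 75, "score": 0})  # transaction X
    store.apply_index()                           # queries now see X
    store.put("bob", {"height": 80, "score": 0})  # transaction Y, index still lagging
    z(store)                                      # transaction Z
    return store.get_by_key("bob")
```

`replay(z_buggy)` ends with height back at 75, which looks exactly like a rollback to X, while `replay(z_fixed)` keeps Y's height of 80. Re-reading by key before writing is one plausible shape for the fix, not necessarily the exact change made in the app.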

Thanks, (and thanks for zigzag merge join BTW, it rocks)
Tom

Alfred Fuller

Aug 29, 2011, 9:59:34 PM
to google-a...@googlegroups.com
Glad you were able to find the problem! These types of problems can prove to be very hard to find (as they are difficult to reproduce consistently).