Database corruption: collision between multiple background and manual triggered jobs

128 views
Skip to first unread message

David Ostrovsky

unread,
Jun 13, 2014, 7:54:18 AM6/13/14
to repo-d...@googlegroups.com
After investigation of sporadic database operation failures, that leaving changes
in undefined state with projects that have cherry-pick submit strategy i think i have
an idea what going on (on master):

The suspects for non synchronized database write operations are:

1/ Merges triggered interactively by user (UI, REST endpoint or SSH command)
2/ Merges scheduled by the background merge queue reload (ChangeMergeQueue.java)
3/ MergeabilityChecker jobs (Mergeable.java)

All three places above can currently run at the same time for the same change and trigger
in the end one or multiple insert/update statements on exact the same records, with the result:

* unique index constrain viloations
* multiple identical rows for patch_sets table, that is lacking unique index (fixed in [3])
* OrmConcurrencyException, later more on it
* Broken rendering  (404-Not found) on new change screen

Race condition between /1 and /2 was (hopefully) fixed here: [1].

Race condition between /1, /2 and /3 can still happen. I tried to minimize the database operation in
MegeabilityChecker in this change [2] to only be triggered when status was changed,
but that's not enough. Consider a site with hundreds opened conflicting changes:

a) change "foo" was merged
b) mergeability checker reevaluates the whole branch and suppose that dozens changes need
to be updated. Suppose that change "bar" would be changed from non-mergeable to mergeable
(say change "foo" fixed conflicts). So in the end the following update statement is going to be executed:

  update changes set mergeable = 'Y', row_version = 43 where change = "bar" and row_version = 42

c) parallel to event b) let say simultaneosly another user decides to merge change "bar":
per ssh command, it could be approved and submitted in one shot:

  ssh host gerrit review --code-review=2 --submit "bar"

In this case the merge is triggered interactively (case /1 above) and CherryPick submit strategy is called,
and this call the CherryPick.writeCherryPickCommit() method and particularly new patch set is inserted and
change record is updated to reflect new patch set creation:

  n.change().setCurrentPatchSet(patchSetInfoFactory.get(newCommit, ps.getId()));
  args.db.changes().update(Collections.singletonList(n.change()));

And this would end with :

  update changes set current_patch_set_id = 2, row_version = 43 where change = "bar" and row_version = 42

Now we have in b) and c) two non-synchronized actions that can be executed simultaneously and clearly
only one would succeed and another would end up with OrmConcurrencyException. Why?

Because where statement is stale, after first statement was successful executed,
there is no change any more with 'where change = "bar" and row_version = 42', and we would hit
zero rows update case in JdbcAccess.execute():

  final int numberOfRowsUpdated = schema.getDialect().executeBatch(ps);
  if (numberOfRowsUpdated != cnt) {
    throw new OrmConcurrencyException();
  }

In any event we have a change in undefined state. The better question is how to fix that?


David Ostrovsky

unread,
Jun 14, 2014, 8:46:10 AM6/14/14
to repo-d...@googlegroups.com, Bruce Zu, David Pursehouse (Sony Mobile)
Gerrit is using optimistic database locking scenario, that's what Gerrit's ORM

  update foo set row_version = row_version+1,[...] where row_vesion = 42;

is doing for every update. In such scenario it is a normal case to fail, when stale row
is going to be updated. Because ORM cannot recover in such case, it is indespensiable
for this approach to have proper implemented transaction management, so that when that
happens, as in

 beginTransaction();
  try {
    insertPatchSetAncestors();
    insertPatchSets();
    updateChange(); // ==> Stale row is going to be updated
    commit();
  } catch (OrmException e) {
    rollback();
  }

(because of MergeabilityChecker, AddReviewer, RemoveReviewer, plugin(s) or whoever
update the same change in the same time)

that CherryPick submit strategy (and other places for that matter) is able to recover from
this failure and pick that change in next cycle (300 sec. later) for merging.

So the fix for all database curruption problems [1], [2], [3], [4] is the combination of: [5] and [6].


Reply all
Reply to author
Forward
0 new messages