Database corruption: collision between multiple background and manual triggered jobs

134 views

Skip to first unread message

David Ostrovsky

unread,

Jun 13, 2014, 7:54:18 AM6/13/14

to repo-d...@googlegroups.com

After investigation of sporadic database operation failures, that leaving changes

in undefined state with projects that have cherry-pick submit strategy i think i have

an idea what going on (on master):

The suspects for non synchronized database write operations are:

1/ Merges triggered interactively by user (UI, REST endpoint or SSH command)

2/ Merges scheduled by the background merge queue reload (ChangeMergeQueue.java)

3/ MergeabilityChecker jobs (Mergeable.java)

All three places above can currently run at the same time for the same change and trigger

in the end one or multiple insert/update statements on exact the same records, with the result:

* unique index constrain viloations

* multiple identical rows for patch_sets table, that is lacking unique index (fixed in [3])

* OrmConcurrencyException, later more on it

* Broken rendering (404-Not found) on new change screen

Race condition between /1 and /2 was (hopefully) fixed here: [1].

Race condition between /1, /2 and /3 can still happen. I tried to minimize the database operation in

MegeabilityChecker in this change [2] to only be triggered when status was changed,

but that's not enough. Consider a site with hundreds opened conflicting changes:

a) change "foo" was merged

b) mergeability checker reevaluates the whole branch and suppose that dozens changes need

to be updated. Suppose that change "bar" would be changed from non-mergeable to mergeable

(say change "foo" fixed conflicts). So in the end the following update statement is going to be executed:

update changes set mergeable = 'Y', row_version = 43 where change = "bar" and row_version = 42

c) parallel to event b) let say simultaneosly another user decides to merge change "bar":

per ssh command, it could be approved and submitted in one shot:

ssh host gerrit review --code-review=2 --submit "bar"

In this case the merge is triggered interactively (case /1 above) and CherryPick submit strategy is called,

and this call the CherryPick.writeCherryPickCommit() method and particularly new patch set is inserted and

change record is updated to reflect new patch set creation:

n.change().setCurrentPatchSet(patchSetInfoFactory.get(newCommit, ps.getId()));

args.db.changes().update(Collections.singletonList(n.change()));

And this would end with :

update changes set current_patch_set_id = 2, row_version = 43 where change = "bar" and row_version = 42

Now we have in b) and c) two non-synchronized actions that can be executed simultaneously and clearly

only one would succeed and another would end up with OrmConcurrencyException. Why?

Because where statement is stale, after first statement was successful executed,

there is no change any more with 'where change = "bar" and row_version = 42', and we would hit

zero rows update case in JdbcAccess.execute():

final int numberOfRowsUpdated = schema.getDialect().executeBatch(ps);

if (numberOfRowsUpdated != cnt) {

throw new OrmConcurrencyException();

}

In any event we have a change in undefined state. The better question is how to fix that?

[1] https://gerrit-review.googlesource.com/57825

[2] https://gerrit-review.googlesource.com/57826

[3] https://gerrit-review.googlesource.com/57784

David Ostrovsky

unread,

Jun 14, 2014, 8:46:10 AM6/14/14

to repo-d...@googlegroups.com, Bruce Zu, David Pursehouse (Sony Mobile)

Gerrit is using optimistic database locking scenario, that's what Gerrit's ORM

update foo set row_version = row_version+1,[...] where row_vesion = 42;

is doing for every update. In such scenario it is a normal case to fail, when stale row

is going to be updated. Because ORM cannot recover in such case, it is indespensiable

for this approach to have proper implemented transaction management, so that when that

happens, as in

beginTransaction();

try {

insertPatchSetAncestors();

insertPatchSets();

updateChange(); // ==> Stale row is going to be updated

commit();

} catch (OrmException e) {

rollback();

}

(because of MergeabilityChecker, AddReviewer, RemoveReviewer, plugin(s) or whoever

update the same change in the same time)

that CherryPick submit strategy (and other places for that matter) is able to recover from

this failure and pick that change in next cycle (300 sec. later) for merging.

So the fix for all database curruption problems [1], [2], [3], [4] is the combination of: [5] and [6].

[1] https://code.google.com/p/gerrit/issues/detail?id=2034

[2] https://code.google.com/p/gerrit/issues/detail?id=2246

[3] https://code.google.com/p/gerrit/issues/detail?id=2383

[4] https://code.google.com/p/gerrit/issues/detail?id=2702

[5] https://gerrit-review.googlesource.com/57860

[6] https://gerrit-review.googlesource.com/57851

Reply all

Reply to author

Forward

0 new messages