Duplicate records in Backup Data?


Mike

Apr 29, 2013, 10:10:34 PM
to google-a...@googlegroups.com
Hi there

I've noticed that the backup data App Engine produces may contain duplicate records.

I can verify this because I'm loading the Backups into BigQuery. When I search one of my tables, I can see the duplicates:

SELECT __key__.id as X_id, COUNT(__key__.id) as X_count, created FROM [TableId] GROUP BY X_id, created HAVING X_count > 1 ORDER BY created DESC;

This shows there are 5,807 duplicates in a table of ~2 million entries (~0.2%)

I can give Google employees access to our BigQuery and Google Storage accounts if that helps track down the issue.

Cheers
Mike

Jason Collins

May 1, 2013, 12:59:53 AM
to google-a...@googlegroups.com
We have seen the same phenomenon. 

It's likely due to some kind of race condition in the backup tool itself. It's not a problem for backup/restore, because when restoring, one of the duplicates will just overwrite the other. But it does become a problem once the data is ingested into BigQuery.
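
To illustrate the difference described above, here is a minimal sketch. It assumes a restore writes each record under its key while a BigQuery load simply appends rows. The dict and list are hypothetical stand-ins, not the actual Datastore or BigQuery APIs.

```python
# Hypothetical stand-ins: a dict models a key-addressed datastore,
# a list models an append-only table load.
backup_records = [
    {"key": 1, "name": "alice"},
    {"key": 2, "name": "bob"},
    {"key": 2, "name": "bob"},  # duplicate emitted by the backup
]


def restore_to_datastore(records):
    """Write each record under its key; a repeated key overwrites."""
    store = {}
    for rec in records:
        store[rec["key"]] = rec
    return store


def load_append_only(records):
    """Append every record as a new row; nothing is overwritten."""
    table = []
    for rec in records:
        table.append(rec)
    return table


restored = restore_to_datastore(backup_records)
loaded = load_append_only(backup_records)
print(len(restored), len(loaded))
```

The keyed restore ends up with one entry per key, while the append-only load keeps every copy.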

j

Jason Collins

May 1, 2013, 6:39:06 PM
to google-a...@googlegroups.com
On reflection, I suspect it has more to do with Map-Reduce task retries than some race condition.
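
A rough sketch of how a task retry could produce duplicates, assuming a shard appends rows to its output as it goes and a retry reprocesses the shard from the start. The function names and the failure model are hypothetical; this is not how the backup MapReduce is actually implemented.

```python
def run_shard(entities, output, fail_after=None):
    """Append entities to output; optionally fail partway through."""
    for i, entity in enumerate(entities):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated task failure")
        output.append(entity)


def run_with_retry(entities, output, fail_after):
    """Run a shard once with a simulated failure, then retry from the start."""
    try:
        run_shard(entities, output, fail_after=fail_after)
    except RuntimeError:
        # The retry reprocesses the whole shard, but rows written
        # before the failure are still in the output.
        run_shard(entities, output)


output = []
run_with_retry([{"id": 1}, {"id": 2}, {"id": 3}], output, fail_after=2)
ids = [row["id"] for row in output]
print(ids)
```

Rows written before the simulated failure appear twice, which matches the kind of duplication seen after import.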

j

Mike

May 1, 2013, 7:53:23 PM
to google-a...@googlegroups.com
I would think it would be possible for the BigQuery team to discard duplicates when running the import? That's probably going to be the easiest solution....
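
Until something like that exists, duplicates could be dropped on the client side before loading. A minimal sketch, assuming the records are available as dicts, that the id alone identifies an entity in the table (as the query in the first post does), and that duplicates are exact copies. `dedupe_by_key` and `key_field` are hypothetical names.

```python
def dedupe_by_key(records, key_field="id"):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        key = rec[key_field]
        if key in seen:
            continue  # a later copy of an already-seen key is dropped
        seen.add(key)
        unique.append(rec)
    return unique


rows = [
    {"id": 10, "created": "2013-04-01"},
    {"id": 11, "created": "2013-04-02"},
    {"id": 10, "created": "2013-04-01"},  # duplicate from the backup
]
deduped = dedupe_by_key(rows)
print([r["id"] for r in deduped])
```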

Arie Ozarov

May 2, 2013, 4:30:34 PM
to google-a...@googlegroups.com


On Wednesday, May 1, 2013 3:39:06 PM UTC-7, Jason Collins wrote:
> On reflection, I suspect it has more to do with Map-Reduce task retries than some race condition.

Correct. Not an issue for backup/restore, but it is a known issue for BigQuery imports.
We plan to eliminate duplicates at the MapReduce (MR) level.

Mike

May 6, 2013, 4:28:27 AM
to google-a...@googlegroups.com
Great - thanks Arie. Any idea when this will be ready? Even a rough estimate would be appreciated, e.g. 1 month, 6 months, 1 year?

Oliver Urs Lenz

Sep 9, 2015, 10:32:08 AM
to Google App Engine, mick...@gmail.com
I can confirm that, more than two years later, this is still an issue. :-(

Nick (Cloud Platform Support)

Sep 21, 2015, 3:09:47 PM
to Google App Engine, mick...@gmail.com
Hey Oliver, 

If you're experiencing an issue, I recommend posting to the BigQuery public issue tracker, since an old thread like this probably won't have much activity, and the public issue tracker is a more responsive way to report an issue. 

Best wishes,

Nick