Duplicate records in Redshift


ta...@boomerangcommerce.com

Mar 10, 2016, 4:17:15 AM
to Snowplow
Hi All,
I ran the EMR job once yesterday; it processed 20 records.
My Redshift table showed 20 records, which is correct.

Now I ran the EMR job once again today.
This time, the job moved records which had already been moved yesterday, thereby creating a lot of duplicates in Redshift.
i.e., the records which were pushed yesterday were also pushed today.

Is the archive bucket being used when data is moved from S3 to Redshift? Is there a way to skip that?

Thanks
Tarun

Jimit Modi

Mar 10, 2016, 4:43:25 AM
to snowpl...@googlegroups.com
Hey Tarun,

There are a few things to check.

So you ran StorageLoader on the processed data and 20 records were added to Redshift.
What was the outcome of StorageLoader? Did it complete successfully? I mean, were you able to
see something like the below:
      x snowplow-etl-emr-runner/shredded/good/run=2016-03-10-05-27-19/com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0/part-00001-00001
      x snowplow-etl-emr-runner/shredded/good/run=2016-03-10-05-27-19/com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0/part-00004-00001
      x snowplow-etl-emr-runner/shredded/good/run=2016-03-10-05-27-19/com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0/part-00003-00001
      x snowplow-etl-emr-runner/shredded/good/run=2016-03-10-05-27-19/com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0/part-00002-00001
  moving files from s3://snowplow-etl-emr-runner/shredded/good/ to s3://snowplow-etl-emr-runner/shredded/archive/
Completed successfully

What StorageLoader does, once it has successfully inserted the records into Redshift, is move the files from the good bucket to the archive bucket.
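(For context, the load step is conceptually a Redshift COPY from the good bucket in S3, roughly along the lines below. This is only a sketch: the real statements are generated by StorageLoader, and the bucket path and credentials here are placeholders.)

  -- Sketch only: StorageLoader generates the real COPY statements.
  -- The source path and credentials below are placeholders.
  COPY atomic.events
  FROM 's3://<your-good-bucket>/run=2016-03-10-05-27-19/'
  CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
  DELIMITER '\t'
  MAXERROR 1
  EMPTYASNULL FILLRECORD TRUNCATECOLUMNS;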

If the move-to-archive step was interrupted, the files would still be sitting in the good bucket and would be loaded again on the next run; that could be what caused this.




--
Jim(y || it)




Tarun Thontadarya

Mar 10, 2016, 4:47:11 AM
to snowpl...@googlegroups.com
Hi Jimit,
Thanks for the response.

This is the log snapshot:

canvas-snowplow-logs/shredded/good/run=2016-03-10-07-58-12/org.w3/PerformanceTiming/jsonschema/1-0-0/part-00003-00001
      x canvas-snowplow-logs/shredded/good/run=2016-03-10-07-58-12/com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0/part-00003-00000
      x canvas-snowplow-logs/shredded/good/run=2016-03-10-07-58-12/com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0/part-00004-00000
      x canvas-snowplow-logs/shredded/good/run=2016-03-10-07-58-12/org.w3/PerformanceTiming/jsonschema/1-0-0/part-00000-00001
      x canvas-snowplow-logs/shredded/good/run=2016-03-10-07-58-12/org.w3/PerformanceTiming/jsonschema/1-0-0/part-00002-00001
      x canvas-snowplow-logs/shredded/good/run=2016-03-10-07-58-12/org.w3/PerformanceTiming/jsonschema/1-0-0/part-00004-00001
      x canvas-snowplow-logs/shredded/good/run=2016-03-10-07-58-12/org.w3/PerformanceTiming/jsonschema/1-0-0/part-00001-00001
  moving files from s3://canvas-snowplow-logs/shredded/good/ to s3://canvas-snowplow-logs/shredded/archive/
Completed successfully

So the job completed successfully.
Do you see any other reason why the duplicates are being created?


Jimit Modi

Mar 10, 2016, 5:02:24 AM
to snowpl...@googlegroups.com
That is a successful run. 
Regarding the duplicates, which columns are you using to identify them as duplicates?
Can you share a sample?

--
Jim(y || it)



Tarun Thontadarya

Mar 10, 2016, 5:09:49 AM
to snowpl...@googlegroups.com
Snapshot of the inner join query:

etl_tstamp              | client_name | user_name | dvce_created_tstamp | page                | load_time
------------------------+-------------+-----------+---------------------+---------------------+----------
2016-03-09 13:22:48.937 | localhost   | syed      | 2016-03-09          | /price_intelligence |     4.144
2016-03-10 07:58:12.282 | localhost   | syed      | 2016-03-09          | /price_intelligence |     4.144

As you can see here, the only difference between the two records is etl_tstamp:
one ETL run was today and the other was yesterday.

There are several other duplicates like this in the table.
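
For reference, a quick way to confirm the duplication at the event level is a count by event_id (just a sketch against atomic.events):

  -- Sketch: list event_ids that appear more than once, and how many ETL runs loaded them.
  select event_id,
         count(*)                   as occurrences,
         count(distinct etl_tstamp) as etl_runs
  from atomic.events
  group by event_id
  having count(*) > 1
  order by occurrences desc
  limit 20;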


Ihor Tomilenko

Mar 10, 2016, 12:28:40 PM
to Snowplow
Hi Tarun,

I can see 2 possibilities here:
  1. Duplicate records are caused by the same ID being generated for different events
  2. The SQL query is badly formed
Please take a look at the post below and see if it answers your case:
http://snowplowanalytics.com/blog/2015/08/19/dealing-with-duplicate-event-ids/


There are also SQL queries to remove duplicate records:
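The general idea behind such queries (sketched below only as an illustration, not the actual scripts) is to keep a single row per event_id, for example the row from the earliest ETL run:

  -- Illustration only (not the linked scripts): rebuild the table keeping, for each
  -- event_id, only the row from the earliest ETL run.
  create table atomic.events_dedup (like atomic.events);

  insert into atomic.events_dedup
  select e.*
  from atomic.events e
  join (
    select event_id, min(etl_tstamp) as first_etl_tstamp
    from atomic.events
    group by event_id
  ) f
    on  e.event_id   = f.event_id
    and e.etl_tstamp = f.first_etl_tstamp;

  -- After verifying the row counts, swap the tables. Note this sketch does not
  -- handle rows duplicated within the same ETL run.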



Regards,
Ihor

Tarun Thontadarya

Mar 11, 2016, 1:59:48 AM
to snowpl...@googlegroups.com
Hi Ihor,
Thanks for your reply.
I tried running the following query:
select event,etl_tstamp,dvce_created_tstamp from atomic.events where event_id='a9e52aff-c279-4d30-820e-c06ae6b342ac';

The result was:

   event   |       etl_tstamp        |   dvce_created_tstamp
-----------+-------------------------+-------------------------
 page_view | 2016-03-09 13:22:48.937 | 2016-03-09 12:55:34.474
 page_view | 2016-03-10 07:58:12.282 | 2016-03-09 12:55:34.474

This shows that the same record got copied to Redshift twice: once when the ETL ran on 2016-03-09 and once when it ran on 2016-03-10.

Do you feel this has something to do with StorageLoader?
Also, you are right about the duplicate IDs; there are duplicate event IDs in the table.

Tarun Thontadarya

Mar 11, 2016, 2:12:55 AM
to snowpl...@googlegroups.com
Hi Ihor,
By using DISTINCT I am able to get what I want, but the table is bloating up with every ETL job.
That is, records which were already processed in a previous ETL run are being processed again.
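
For illustration, the DISTINCT workaround amounts to something like this (a simplified sketch against atomic.events; the real query joins other tables, as in the earlier snapshot):

  -- Sketch: selecting distinct over everything except etl_tstamp collapses the
  -- cross-run duplicates back to one row per event.
  select distinct event_id, event, collector_tstamp, dvce_created_tstamp, page_urlpath
  from atomic.events;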

Ihor Tomilenko

Mar 11, 2016, 10:52:49 AM
to Snowplow
Hi Tarun,

I'm just curious to see what the other timestamps are for the duplicated page_view events. Could you add the following to your query, please?
  • collector_tstamp
  • dvce_sent_tstamp
  • refr_dvce_tstamp
  • derived_tstamp
And more broadly, what are the (not null) values for the other parameters?
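
For example, the earlier query with those columns added would look something like this:

  select event, etl_tstamp, dvce_created_tstamp, collector_tstamp,
         dvce_sent_tstamp, refr_dvce_tstamp, derived_tstamp
  from atomic.events
  where event_id = 'a9e52aff-c279-4d30-820e-c06ae6b342ac';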

Regards,
Ihor

Tarun Thontadarya

Mar 13, 2016, 2:22:11 PM
to snowpl...@googlegroups.com
Hi Ihor,
Every timestamp field is the same across the duplicate records, other than etl_tstamp. The link (http://snowplowanalytics.com/blog/2015/08/19/dealing-with-duplicate-event-ids/) which you suggested fits the problem I am facing.
I implemented the below, as you suggested:
And things seem fine for now.

Ihor Tomilenko

Mar 14, 2016, 7:45:21 PM
to Snowplow
Hi Tarun,

Glad the given instructions helped.

Another suggestion: you seem to be running an older version of Snowplow. The "natural duplicates" (where all the fields are the same) are tackled automatically during the shredding process starting with release r76. Please read the following release blog post: http://snowplowanalytics.com/blog/2016/01/26/snowplow-r76-changeable-hawk-eagle-released/

Just bear in mind that if you run EmrEtlRunner with the option --skip shred, you will have to remove that option with the later releases in order to get the de-duplication applied during the EMR process.

Regards,
Ihor

Tarun Thontadarya

Mar 15, 2016, 3:12:00 AM
to snowpl...@googlegroups.com
Hi Ihor,
I checked the release article. I also checked my config.yml; I see that my Hadoop job versions are up to date and match the versions mentioned in the document.

hadoop_enrich: 1.5.1
hadoop_shred: 0.7.0 
hadoop_elasticsearch: 0.1.0

I guess my use case comes under the "If natural duplicates exist across ETL runs, these will not be de-duplicated currently. This is something we hope to explore soon" section of your document.

I will wait for this fix from Snowplow, and then get rid of the SQL way of removing duplicates.

Thanks and Regards,
Tarun

Sean Halliburton

Mar 19, 2016, 11:41:36 AM
to Snowplow
Wild guess here, but in my first Snowplow install and POC, I made the mistake of setting my 'archive' folder too close to my 'raw/in' folder, so SP was somehow rewriting every event back to the inbox for reprocessing. Not only did I end up with duplicate records in Redshift, but the EMR job took longer and longer with each run, because the number of records in both S3 and RS was snowballing. If your archive location is anywhere near your inbox, try moving it to a separate bucket altogether.

Remember, kids: Snowplow, don't Snowball.

S