Event ID not unique and Joins


Annah Brown

May 21, 2015, 10:01:09 AM5/21/15
to snowpl...@googlegroups.com
From everything I have read here I see that we cannot assume event_id will be unique. I am wondering then how I can ensure correct joins in Redshift on my custom tables where it is suggested that I use root_id=event_id?

Christophe Bogaert

May 22, 2015, 11:13:08 AM5/22/15
to snowpl...@googlegroups.com
Hi Annah,

In most cases, this won't cause many problems when doing analytics on the data. If the number of duplicated events is relatively small, we usually recommend using (approximate) count distinct on the relevant columns. An alternative is to deduplicate or exclude duplicated events before using the data.
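To illustrate the count-distinct approach, here is a minimal sketch using sqlite3 as a stand-in for Redshift (the table and column names are invented for the example; Redshift additionally offers `APPROXIMATE COUNT(DISTINCT ...)` for large tables, which sqlite lacks, so plain `COUNT(DISTINCT ...)` is used here):

```python
import sqlite3

# In-memory stand-in for a Redshift events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("e1", "/home"), ("e1", "/home"),   # same event loaded twice
     ("e2", "/about"), ("e3", "/home")],
)

# A naive row count over-counts because of the duplicate...
raw = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
# ...while counting distinct event IDs does not.
distinct = conn.execute(
    "SELECT COUNT(DISTINCT event_id) FROM events"
).fetchone()[0]
print(raw, distinct)  # 4 3
```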

I hope this helps,

Christophe

Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0) 203 589 6116
+44 (0) 7598 006 851



Sambhav Sharma

May 26, 2015, 9:26:28 AM5/26/15
to snowpl...@googlegroups.com
Well, this is an issue then. So you're suggesting we manually remove events with duplicate event IDs? Is there no way these can be made unique? If so, it's a really tedious task and something really uncalled for.

Grzegorz Ewald

May 27, 2015, 1:40:45 AM5/27/15
to snowpl...@googlegroups.com
There is no point in deduplicating manually - why not find the IDs that have a count greater than one with a SQL query, and then decide which record to remove?
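The duplicate-finding query Grzegorz describes can be sketched as follows, again with sqlite3 standing in for Redshift (table and column names are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, collector_tstamp TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("e1", "2015-05-21 10:00:00"),
     ("e1", "2015-05-21 10:00:05"),   # duplicate of e1
     ("e2", "2015-05-21 10:01:00")],
)

# GROUP BY / HAVING surfaces every event_id that appears more than once,
# so a human (or a follow-up DELETE) can decide which copy to keep.
dupes = conn.execute(
    "SELECT event_id, COUNT(*) AS n FROM events "
    "GROUP BY event_id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('e1', 2)]
```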

The uniqueness of IDs is an issue in itself, but I don't believe there is a solution while using client-side-generated type 4 UUIDs (as observed, the uniqueness of Snowplow IDs is more than acceptable). The only solution I can find is to generate type 1 UUIDs in the collector (Kinesis or Clojure only at this point), but this requires a bit of development on both sides: collector and tracker...
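For readers unfamiliar with the distinction, Python's standard `uuid` module can generate both variants being discussed:

```python
import uuid

# Type 4 (random) - what Snowplow trackers generate client-side.
# 122 random bits, no ordering guarantee.
u4 = uuid.uuid4()

# Type 1 (time + node based) - what is suggested for collector-side
# generation. Embeds a 60-bit timestamp, so IDs sort roughly by
# creation time.
u1 = uuid.uuid1()

print(u4.version, u1.version)  # 4 1
```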

Alex Dean

May 27, 2015, 8:13:31 AM5/27/15
to snowpl...@googlegroups.com
Thanks Grzegorz,

There are some interesting pros and cons to type 1 versus type 4 UUIDs, but as you say, the uniqueness of the UUIDs themselves is already acceptable - this isn't where duplicates come from.

There are two types of duplicate to distinguish between:
  1. Synthetic or exogenous duplicates - duplicates introduced by some system external to Snowplow. These include browser pre-cachers, anti-virus software, adult content screeners and web scrapers. These duplicates can be fired before or after the "real" event, and can come from the device itself or from a different IP address.
  2. Natural or endogenous duplicates - duplicates introduced within the Snowplow pipeline itself, wherever our processing guarantees are at-least-once rather than exactly-once. Examples: in the batch flow, the CloudFront Collector can duplicate events; in the Kinesis real-time flow, any application can introduce duplicates because of the KCL checkpointing approach.

An important point: the reason we have moved to generating event IDs in the trackers is so that both types of duplicate are detectable.

Dealing with natural/endogenous duplicates is not hugely difficult - a simple lookup of previously-seen event IDs will suffice. Dealing with synthetic/exogenous duplicates is much more complex - the best solution currently is, as Christophe and Grzegorz say, to use appropriate queries or de-dupe using SQL.
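The "lookup of previously-seen event IDs" for natural duplicates can be sketched with a simple seen-set (a hypothetical helper, not Snowplow code; note it would wrongly merge synthetic duplicates, which share an event_id but are distinct real-world events):

```python
def dedupe_natural(events):
    """Keep the first occurrence of each event_id, dropping replays.

    Suitable only for natural/endogenous duplicates, where the
    duplicate really is the same event processed more than once.
    """
    seen = set()
    out = []
    for ev in events:
        if ev["event_id"] not in seen:
            seen.add(ev["event_id"])
            out.append(ev)
    return out

events = [{"event_id": "e1"}, {"event_id": "e1"}, {"event_id": "e2"}]
deduped = dedupe_natural(events)
print(deduped)  # one e1 and the e2 survive
```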

Note that the Elasticsearch sink for the Kinesis flow takes a "last event wins" approach to duplicates: each event is upserted into the ES index using its event_id, so later duplicates overwrite earlier ones.
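"Last event wins" amounts to indexing by event_id so a later duplicate replaces the earlier copy - a toy sketch of the upsert semantics, not the sink's actual code:

```python
# Index events by event_id; a later duplicate overwrites the
# earlier copy, mirroring the Elasticsearch sink's upsert-by-event_id.
store = {}

def upsert(event):
    store[event["event_id"]] = event

upsert({"event_id": "e1", "seen_at": 1})
upsert({"event_id": "e1", "seen_at": 2})  # duplicate wins
upsert({"event_id": "e2", "seen_at": 3})

print(len(store), store["e1"]["seen_at"])  # 2 2
```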

Hope this helps,

Alex



Grzegorz Ewald

May 27, 2015, 4:21:19 PM5/27/15
to snowpl...@googlegroups.com
Hi,
There is one footnote worth adding here: the Amazon Kinesis Client Library is built on the assumption that every record has to be processed at least once. This was the main idea behind the checkpointing mechanism. The mechanism guarantees that no data will be missed, but does not ensure each record is processed only once. We should treat this as a feature rather than a bug, and as a property of Kinesis rather than Snowplow.

Alex Dean

May 27, 2015, 5:17:08 PM5/27/15
to snowpl...@googlegroups.com
Thanks Grzegorz - you are of course right: the KCL guarantees at least once processing through its checkpointing design.

Best,

Alex

Gabriel Awesome

May 28, 2015, 1:53:24 PM5/28/15
to snowpl...@googlegroups.com
I'm looking here at line 91, right now:

What are the cons to type 1? Also, if a compromise is needed, perhaps the UUID type used could be made configurable.

I definitely don't want to play with duplicates in Redshift, at all.

Gabriel

Alex Dean

May 28, 2015, 2:16:19 PM5/28/15
to snowpl...@googlegroups.com

Changing the UUID type won't get rid of either type of duplicate.

A

Gabriel Awesome

May 28, 2015, 3:13:24 PM5/28/15
to snowpl...@googlegroups.com
Sorry, I'm new to this, but I read that v1 is time-based and unique, so it should not create the "synthetic or exogenous duplicates".

Also, I found this similar issue where the dev responded:

Gabriel

Alex Dean

May 28, 2015, 4:37:50 PM5/28/15
to snowpl...@googlegroups.com
No worries - I've put a hopefully clearer explanation of the three duplicate event_id scenarios on this ticket: https://github.com/snowplow/snowplow/issues/24

A