connect events table with 3rd party data

tobias...@justwatch.com

Apr 5, 2016, 9:08:55 AM

to Snowplow

Hi,

We have a running Snowplow + Redshift instance. We now have a big pile of impressions that we don't want to dump into our standard events table, but the data has the same user_ids, so we want to be able to join the two on that. Does anyone have an idea for this? For now we are thinking about:

- writing a nano ETL and importing them into a table in Redshift that we can join (best solution right now, but it's a little bit dirty and a second thing to maintain)
- setting up another Snowplow instance with a separate events table (seems like too much effort)
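
As a rough illustration of the first option, the sketch below uses an in-memory SQLite database as a stand-in for Redshift. The `impressions` table, its `item_id` column and the sample rows are assumptions made up for the example; only `user_id` as the join key comes from the description above. Each table is aggregated per user before the join so that rows do not multiply.

```python
import sqlite3

# Stand-in for Redshift: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event TEXT);
    CREATE TABLE impressions (user_id TEXT, item_id TEXT);
""")

# Hypothetical sample data: a tiny slice of the events table
# and a separate side table holding the impressions.
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "page_view"), ("u1", "page_view"), ("u2", "page_view")],
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?)",
    [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u3", "a")],
)

# Aggregate each table per user first, then join on user_id.
# Joining the raw tables directly would multiply rows per user.
query = """
    SELECT e.user_id, e.n_events, COALESCE(i.n_impressions, 0)
    FROM (SELECT user_id, COUNT(*) AS n_events
          FROM events GROUP BY user_id) AS e
    LEFT JOIN (SELECT user_id, COUNT(*) AS n_impressions
               FROM impressions GROUP BY user_id) AS i
      ON i.user_id = e.user_id
    ORDER BY e.user_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The query shape should carry over to Redshift largely unchanged, though that is untested here.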

TL;DR: we want all the Snowplow features without using the big events table (I know that the events table is the core of the whole Snowplow process)

Thanks in advance

Cheers
Tobi

Alex Dean

Apr 6, 2016, 3:58:37 AM

to Snowplow

Hi Tobias,

For now I would go with option 2. We are working on functionality which will allow you to:

- Perform data modeling (such as aggregations) on events in EMR before the load into Redshift. You'll be able to write these jobs in Spark using the new Snowplow Scala Analytics SDK.
- Filter out certain event types (such as your impression data) so they are not loaded into Redshift.

In short - when these are available you should be able to drop the second pipeline, port your impression aggregation code from SQL to Spark and reduce your Redshift cluster spec, but in the meantime a second pipeline is the easiest approach (and it's what we do for e.g. analyzing global sp.js usage).
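
To make the idea concrete, here is a minimal sketch in plain Python (not Spark, and not the Snowplow Scala Analytics SDK, whose API is not shown in this thread) of a pre-load step that keeps one event type out of the load and aggregates it per user instead. The function `split_and_aggregate` and the field names `event_type` and `user_id` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def split_and_aggregate(events, excluded_type="impression"):
    """Split events into those to load and per-user counts of the rest.

    `events` is an iterable of dicts with hypothetical keys
    "user_id" and "event_type".
    """
    to_load = []
    counts = Counter()
    for ev in events:
        if ev["event_type"] == excluded_type:
            # Filtered out of the load; only the aggregate survives.
            counts[ev["user_id"]] += 1
        else:
            to_load.append(ev)
    return to_load, dict(counts)


sample = [
    {"user_id": "u1", "event_type": "page_view"},
    {"user_id": "u1", "event_type": "impression"},
    {"user_id": "u1", "event_type": "impression"},
    {"user_id": "u2", "event_type": "impression"},
]

to_load, impression_counts = split_and_aggregate(sample)
print(to_load)
print(impression_counts)
```

Only the small per-user aggregate would then need to reach the warehouse, which is the reason a smaller cluster becomes plausible.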

Cheers,

Alex

--

Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

tobias...@justwatch.com

Apr 6, 2016, 4:27:05 AM

to Snowplow

Hi Alex,

Thanks for the always fast and helpful comments :D Do you plan on releasing it this quarter, this year or later? Just to give me a feeling for how long we would have to run this approach.

Alex Dean

Apr 6, 2016, 1:29:55 PM

to Snowplow

Hi Tobias,

The Spark data modeling piece is in progress - should be 2 releases out or so. The filtering out of certain event types has not been scheduled yet.