Snowplow enrichment and Thrift/Avro

Gareth Rogers

Jun 4, 2015, 6:25:05 AM
to snowpl...@googlegroups.com
Hi

We're starting to look at using Thrift or Avro with our Cascalog enrichment stages, which we run after the Snowplow enrichment step. We've begun by writing a Thrift schema for the Snowplow enrichment step's output, starting small with a simple 104-column output ;)

One of the first problems we've encountered is that some of the integer or boolean fields can be unset and, perhaps unnaturally given that Thrift doesn't support this, we'd represent them as nullable primitive types. There are other possibilities, such as default values, and we're also moving on to experimenting with Avro.
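As a sketch of the problem, a Thrift struct with optional primitives might look like this (field names are illustrative, not the actual Snowplow columns):

```thrift
struct EnrichedEvent {
  // required fields are always present on the wire
  1: required string event_id

  // optional primitives are simply omitted when unset, but in some
  // generated languages they still deserialize to a type default
  // (0, false) rather than a true null - which is the problem above
  2: optional i64 dvce_screenwidth
  3: optional bool br_cookies
}
```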

To be honest, my real question is a more general one: has anyone been looking at using Thrift or Avro (we're planning to add Parquet as the data storage format) for their enrichment stages? It would also be useful to know how people are handling the possibility of unset primitive fields in technologies like Thrift.

At the moment, in our Cascalog enrichment we convert empty strings to nil, even for the string fields, as they're more natural to work with that way.

Thanks
Gareth

Alex Dean

Jun 4, 2015, 8:35:45 AM
to snowpl...@googlegroups.com
Hey Gareth,

This is awesome - we're also planning to move the Snowplow enriched event format from a TSV/JSON hybrid to Apache Avro (plus Parquet). Although we have had good experiences with Thrift, we have chosen Avro because the JSON<>JSON schema<>Avro interop story is much better (and we have plans to improve it further with our own tooling).

The relevant milestone is: https://github.com/snowplow/snowplow/milestones/Avro%20support%20%231

This will take some time on our side, not least because we have to decouple the enriched event format from the file format which is loaded into Redshift.

I'm not sure about the Thrift unset primitives issue. From a data modeling perspective with Avro, I would think none of the 104 fields would be optional - after all, all 104 columns are present in each TSV row. Instead, most of them would be Avro unions of their given data type plus the value null. (Think of an Option type in Scala or a Maybe in Haskell; JSON Schema is very similar.)
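For example, the union-with-null pattern looks like this in an Avro schema (a minimal sketch; the field names are illustrative rather than the actual enriched event columns):

```json
{
  "type": "record",
  "name": "EnrichedEvent",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "dvce_screenwidth", "type": ["null", "int"], "default": null},
    {"name": "br_cookies", "type": ["null", "boolean"], "default": null}
  ]
}
```

Every field is present in every record, but the union fields can legitimately hold null, exactly like an Option/Maybe.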

I hope this helps! Please keep us up-to-date with your R&D in this area, I think we can learn a lot from each other!

Cheers,

Alex






--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

Gareth Rogers

Jun 5, 2015, 10:29:24 AM
to snowpl...@googlegroups.com
Thanks Alex, that's good to know. We'll keep you up-to-date on how we get on and keep an eye on your milestone.

As you say, optional is not the answer, because all the fields are always present; really each one is a union of either a specific type or empty. It looks like Thrift now has a union type, which is something we'll look at.
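A Thrift union can model this "present or empty" idea, since at most one member of a union is set at a time (a sketch only, not tested against our schema):

```thrift
// an Option-like wrapper: either `value` is set, or nothing is
union OptionalI32 {
  1: i32 value
}

struct EnrichedEvent {
  1: required string event_id
  // structurally always present, but may be the empty union
  2: required OptionalI32 dvce_screenwidth
}
```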

We're exploring both technologies so it'll be interesting to see what's a best fit.

Thanks

P.S. I just rediscovered this tab which I wrote yesterday evening and apparently didn't post!

Christoph Bünte

Sep 10, 2015, 5:54:13 AM
to Snowplow
Hi Alex,

we just discovered the planned switch from TSV to Avro in the enrichment phase. While this might be a good design decision towards better performance and easier handling of custom contexts, it also makes it harder to debug malformed events and so on. Will the Avro format be optional, or is it intended to leave the TSV format behind completely?

Christoph

Alex Dean

Sep 10, 2015, 9:53:40 AM
to Snowplow
Hi Christoph!

Very good question. As you know, our current enriched event format has evolved organically into a TSV+embedded JSON format. It's got us a long way but it doesn't make sense as a long-term format, especially when there are great alternatives out there like Avro and JSON Schema.

We have multiple Avro milestones which document our plan, but basically:
  1. First we box our TSV in a self-describing Avro record. This will make it easier for downstream components to support multiple versions of the TSV - at the moment they don't: you can't run a new release of Hadoop Shred against old enriched events on disk.
  2. Then we move most of the remaining fields in our TSV out into JSON Schemas. For example, all the browser fields will be moved into a browser_context; the 5 structured event fields move into com.google.analytics/event; etc.
  3. Once we are down to the "stragglers" in our TSV, we will then make all of those fields first-class in the Avro and remove the TSV forever.
  4. After this we will hopefully do a new iteration of the Avro which reflects more of our event grammar thinking.
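As a rough illustration of step 1, the self-describing "box" could be as simple as a two-field Avro record (this is just a sketch, not the actual schema we'll publish):

```json
{
  "type": "record",
  "name": "SelfDescribingEnrichedEvent",
  "fields": [
    {"name": "schema", "type": "string",
     "doc": "Iglu schema URI identifying the enriched event version"},
    {"name": "data", "type": "string",
     "doc": "the enriched event TSV line, boxed as-is"}
  ]
}
```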

The most important point: through all of this migration, each enriched event version will be carefully documented in Iglu Central, and our aim is that all downstream components maintain backwards compatibility with earlier versions over time.

Another way of thinking of it: historically we have only guaranteed that raw events will still be processable by future Snowplow releases. Once we have an Avro format (even quite a basic one which just boxes the TSV), we want to start guaranteeing that enriched Snowplow events are processable by future Snowplow releases.

Hope that answers your question!

Cheers,

Alex

Christoph Bünte

Sep 11, 2015, 5:31:11 AM
to Snowplow
Hi Alex,

thanks for explaining the plans in detail. That basically means that after the switch we have to reprocess all of our raw events again to end up with enriched events in Avro format. Or will there be some kind of migration tool that can convert events?

Christoph

Alex Dean

Sep 11, 2015, 5:49:40 AM
to Snowplow
Hey Christoph,

Glad it was helpful!


> after the switch we have to reprocess all of our raw events again to end up with enriched in avro format

To be clear: if you want all of your Snowplow enriched events stored in S3 in the latest available format, you have always had to reprocess all of your raw events against the latest version of the Snowplow enrichment process.

What is changing is that in the future, when we start writing out enriched events in a self-describing Avro format, we want the components downstream of the enriched event to start supporting older versions of the enriched event. The idea is that e.g. you could apply a Spark sessionization algorithm across a Snowplow event archive which contains multiple (all self-describing) versions of the Snowplow enriched event format. So the enriched event archive in S3 becomes a first-class citizen, rather than what it is today (basically an archive of Redshift loads in different non-self-describing formats).


> will there be some kind of migration tool, that can convert events

There have been so many versions of the enriched event that this migration tool would end up like a Borges map. Better just to rerun the enrichment process on your raw archive.

A