Remember that Spark Streaming uses the KCL under the hood, so there is no ops simplification from using a Spark Streaming job over Scala Stream Enrich. Generally, the closer you stay to the reference Snowplow architecture, the more you will get "for free" as we continue to roll out more and more functionality over the coming months and years.
> Also, I'm not sure whether it would actually be better to populate `atomic.events` from JSON instead of TSVs?
Yes, our plan is to replace the EnrichedEvent format with an Avro schema over time. Our Scala Analytics SDK is the first step in providing an abstraction layer/facade over the enriched event to make these upgrades easier: https://github.com/snowplow/snowplow-scala-analytics-sdk
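To make that concrete, here is a minimal sketch of using the SDK to turn one enriched-event TSV line into JSON. Treat it as illustrative rather than canonical: I'm assuming the SDK's `EventTransformer.transform` entry point, and its exact result type (Either-like vs Validation-like) differs between SDK versions, which is why the sketch uses `fold`:

```scala
// Minimal sketch, not canonical usage: assumes EventTransformer.transform(line)
// converts one enriched-event TSV line into a JSON string, with the failure
// side carrying a list of error messages (exact types vary by SDK version).
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

object TsvToJson {
  def main(args: Array[String]): Unit = {
    // One enriched event: 100+ tab-separated fields on a single line
    val tsvLine = scala.io.Source.stdin.getLines().next()
    EventTransformer.transform(tsvLine).fold(
      errors => errors.foreach(System.err.println), // validation failures
      json => println(json)                         // downstream code sees JSON,
                                                    // not a positional TSV
    )
  }
}
```

The point of the facade is that downstream code depends on this one call rather than on the TSV column ordering, so a future move to Avro only has to change the SDK internals.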
Cheers,
Alex
Combining these two, what we found was that by extracting our core validation and enrichment logic into a core Scala library, we were able to use that library across multiple runtimes (Scalding and the KCL, with others coming in the future). So yes, I think using Snowplow components in library form has merit in support of porting enrichment to a new runtime like Spark or Flink.
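To illustrate the pattern (all names here are hypothetical stand-ins, not Snowplow's actual API): the validation and enrichment logic lives in one core function, and each runtime is a thin adapter over it.

```scala
// Sketch of the "core library, many runtimes" pattern. CoreEnrich.enrich is a
// hypothetical stand-in for Scala Common Enrich's real entry points.
object CoreEnrich {
  def enrich(rawEvent: String): Either[String, String] =
    if (rawEvent.nonEmpty) Right(rawEvent.toUpperCase) // placeholder "enrichment"
    else Left("empty event")
}

object Runtimes {
  // Batch adapter: the shape a Scalding or Spark job would take
  def processBatch(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => CoreEnrich.enrich(line).fold(_ => Nil, json => List(json)))

  // Streaming adapter: the shape a KCL record processor would take
  def processRecord(record: String): Unit =
    CoreEnrich.enrich(record).fold(System.err.println, println)
}
```

Because the core is runtime-agnostic, adding a Spark or Flink runtime is mostly a matter of writing another thin adapter.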
> We tried first by using the default KCL implementation and then adding new apps that consumed from the enriched stream, but we found it cumbersome. The main reason is that you need to define all the Kinesis streams ahead of time and configure them for all our environments (we use integration, staging and prod). It's also difficult to see the whole picture of what the system does when the code is distributed across several projects.
This is the whole microservice versus monolith debate :-) It's possible to do either with Snowplow - you can compose pipelines using our specific apps (Stream Enrich, Kinesis S3, Elasticsearch Sink etc), with streams as the glue in-between, or you can take the underlying libraries and compile them into a monolith, like a Spark Streaming job which does enrichment and, say, event data modeling. The microservice approach places more challenges on your devops team; the monolith approach is more developer-time-intensive. We are firmly in favour of the microservice approach at Snowplow - we prefer smaller composable units of work ("async microservices") which can be assembled together in flexible ways. With things like Docker, Kubernetes, Mesos, Kafka Streams and AWS Lambda, there is a lot of great tooling consolidating around the async microservices approach - the next 12 months should be very interesting!
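To make the monolith end of the spectrum concrete, here is a rough sketch of a single Spark Streaming job consuming a raw Kinesis stream and doing enrichment plus modeling in one runtime. The app/stream names and the `enrich`/`model` bodies are placeholders; a real job would compile in the Snowplow enrichment libraries and write to a proper sink:

```scala
// Sketch of the "monolith" shape: one Spark Streaming job spanning enrichment
// and event data modeling. Names and the enrich/model bodies are placeholders,
// not Snowplow code.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

object MonolithJob {
  def enrich(bytes: Array[Byte]): String = new String(bytes, "UTF-8") // placeholder
  def model(event: String): String = event                            // placeholder

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("snowplow-monolith"), Seconds(10))

    // Note: under the hood this receiver runs the KCL, checkpointing to DynamoDB
    val raw = KinesisUtils.createStream(
      ssc, "snowplow-monolith", "raw-good", // app + stream names: placeholders
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

    raw.map(enrich).map(model).print()      // a real job would write to a sink

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note how this collapses what would otherwise be two apps and an intermediate stream into one deployable - simpler devops, but every change to either stage means redeploying the whole job.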
> I go to my Spark UI and I can nicely see my 2 long-running streaming jobs running there, their event rate, etc.
You are right - it's nice to have a UI to monitor your real-time pipeline holistically - but even if you can mandate using Spark for all jobs (even when some would fit Lambda or the KCL better), you are still only seeing half the picture, because the behaviour of your Kinesis streams (shard merges/splits etc) isn't represented in that UI. There's no good solution at the moment - we have a new monitoring and autoscaling fabric (including a UI) in heavy R&D for the Snowplow Real-Time Managed Service; I hope we can share more information about this publicly later this summer.
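In the meantime, one way to recover the Kinesis half of the picture is to query the shard topology directly via the AWS SDK - a hedged sketch, with the stream name as a placeholder:

```scala
// Sketch: list a stream's shards to see the residue of merges/splits that the
// Spark UI can't show. The stream name is a placeholder.
import com.amazonaws.services.kinesis.AmazonKinesisClient
import scala.collection.JavaConverters._

object ShardInspector {
  def main(args: Array[String]): Unit = {
    val kinesis = new AmazonKinesisClient() // picks up default credentials/region
    val shards = kinesis.describeStream("enriched-good")
      .getStreamDescription.getShards.asScala

    shards.foreach { s =>
      // An open shard has no ending sequence number; closed shards with parents
      // are what past merges and splits leave behind
      val open = Option(s.getSequenceNumberRange.getEndingSequenceNumber).isEmpty
      println(s"${s.getShardId} parent=${Option(s.getParentShardId).getOrElse("-")} open=$open")
    }
  }
}
```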
Cheers,
Alex