UDF and UDAF in KSQL 5.0
KSQL is the open source streaming SQL engine that enables real-time data processing against Apache Kafka. KSQL makes it easy to read, write, and process streaming data in real time, at scale, using SQL-like semantics. KSQL already has plenty of built-in functions, like SUBSTRING, STRINGTOTIMESTAMP, or COUNT. Even so, many users need additional functions to process their data streams.
KSQL now has an official API for building your own functions. As of the release of Confluent Platform 5.0, KSQL supports creating user-defined scalar functions (UDF) and user-defined aggregate functions (UDAF).
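To give a feel for the new annotation-driven API, here is a minimal sketch of one UDF and one UDAF. The function names and logic are invented for illustration; only the annotations and the `Udaf` callback shape come from the KSQL 5.0 API.

```java
import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
import io.confluent.ksql.function.udaf.Udaf;
import io.confluent.ksql.function.udaf.UdafDescription;
import io.confluent.ksql.function.udaf.UdafFactory;

// Scalar UDF: invoked once per row.
@UdfDescription(name = "multiply", description = "Multiplies two integers.")
public class Multiply {
  @Udf(description = "Multiply two non-null INT values.")
  public long multiply(final int v1, final int v2) {
    return (long) v1 * v2;
  }
}

// Aggregate UDAF: carries state across all rows of a group.
@UdafDescription(name = "sum_int", description = "Sums an integer column.")
class SumInt {
  @UdafFactory(description = "Sums non-null INT values.")
  public static Udaf<Integer, Integer> createSumInt() {
    return new Udaf<Integer, Integer>() {
      @Override public Integer initialize() { return 0; }               // empty aggregate
      @Override public Integer aggregate(Integer value, Integer agg) {  // fold one row in
        return agg + value;
      }
      @Override public Integer merge(Integer aggOne, Integer aggTwo) {  // combine partial aggregates
        return aggOne + aggTwo;
      }
    };
  }
}
```

Packaged as a jar and dropped into KSQL's extensions directory, functions like these become callable from any query, e.g. `SELECT MULTIPLY(a, b) FROM my_stream;`.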
https://www.confluent.io/blog/build-udf-udaf-ksql-5-0
Airflow DAG Tests and Unit Tests
Testing is an integral part of any software system; it builds confidence and increases the system's reliability. Recently I joined Grab, and here at Grab we use Airflow to create and manage pipelines. But we were facing issues with Airflow, so I had a conversation with my engineering manager and discussed how we could make Airflow reliable and testable.
Spark SQL Performance in Video Play Sessions
Play Sessions are the bread and butter of the Data Pipelines engineering team at JW Player. They are an attempt to identify a single ‘unit of work’ of a video viewer by computing transformations and aggregations in Spark SQL to compact the data down to a much more manageable size. These operations span roughly 100 columns, which makes for a hefty query and was the impetus for tuning and optimizing our Spark SQL job.
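As a rough sketch of this kind of compaction, here is what a sessionization-style aggregation looks like in Spark's Java API. The paths, column names, and aggregates below are invented for illustration; the real job spans ~100 columns.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class PlaySessions {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("play-sessions").getOrCreate();

    // Ping-level events, one row per heartbeat from a player (schema is hypothetical).
    Dataset<Row> pings = spark.read().parquet("s3://example-bucket/pings/");

    // Collapse many pings into one row per viewer session.
    Dataset<Row> sessions = pings
        .groupBy(col("viewer_id"), col("session_id"))
        .agg(
            min(col("event_time")).alias("session_start"),
            max(col("event_time")).alias("session_end"),
            sum(col("watch_ms")).alias("total_watch_ms"),
            count(lit(1)).alias("ping_count"));

    sessions.write().mode("overwrite").parquet("s3://example-bucket/sessions/");
  }
}
```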
Data at Rezdy
The use of data in a startup becomes increasingly important as its number of users grows. In the early days of any B2B startup, you have to meet with every single customer yourself to get them on board, so understanding what will get them using the platform is simply a matter of asking them. However, as a business scales, this approach to making decisions no longer works.
https://medium.com/rezdy-engineering/an-introduction-to-data-at-rezdy-53b12d9935f5
Postgres Databases with Rails
The first step of architecting is always to figure out what we’re optimizing for. We’ve enjoyed great success with Postgres and saw no reason to depart. It has given us high performance and has an incredible array of datatypes and query functions that have been extremely useful.
Kafka Blindness
The most common response was the need for better tools to monitor and manage Kafka in production. Specifically, users wanted better visibility into what is going on in the cluster across the four key Kafka entities: producers, topics, brokers, and consumers. In fact, because we heard this same response over and over from the users we interviewed, we gave it a name: the Kafka Blindness.
https://hortonworks.com/blog/kafka-blindness/
Scheduling Notebooks at Netflix
At Netflix we’ve put substantial effort into adopting notebooks as an integrated development platform. The idea started as a discussion of what development and collaboration interfaces might look like in the future. It evolved into a strategic bet on notebooks, both as an interactive UI and as the unifying foundation of our workflow scheduler. We’ve made significant strides towards this over the past year, and we’re currently in the process of migrating all 10,000 of the scheduled jobs running on the Netflix Data Platform to use notebook-based execution. When we’re done, more than 150,000 Genie jobs will be running through notebooks on our platform every single day.
https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6
Kafka KSQL
KSQL is a SQL engine for Kafka. It allows you to write SQL queries to analyze a stream of data in real time. Since a stream is an unbounded data set, a query with KSQL will keep generating results until you stop it. KSQL is built on top of Kafka Streams: when you submit a query, it is parsed, and a Kafka Streams topology is built and executed. This means that KSQL offers concepts similar to those of Kafka Streams, but all through a SQL language: streams (KStreams), tables (KTables), joins, windowing functions, etc.
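To make the KSQL-to-Kafka-Streams relationship concrete, here is a hedged sketch: a trivial KSQL filter, and the rough shape of the Kafka Streams topology KSQL would build for it. The topic names, value format, and threshold are made up for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class KsqlEquivalent {
  public static void main(String[] args) {
    // KSQL:  CREATE STREAM big_orders AS SELECT * FROM orders WHERE amount > 100;
    // Roughly the topology KSQL generates for that statement:
    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> orders = builder.stream("orders");
    orders.filter((key, value) -> Double.parseDouble(value) > 100)  // value assumed to be the raw amount
          .to("big_orders");

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ksql-equivalent-demo");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }
}
```

Like the KSQL query it mirrors, this topology keeps running until you stop it.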
http://aseigneurin.github.io/2018/08/22/kafka-tutorial-10-ksql.html
Jupyter Notebooks and Apache Drill
This blog post will walk through the installation and basic usage of the jupyter_drill module for Python, which allows you, from a Jupyter Notebook, to connect to and work with data from Apache Drill using IPython magic functions. If you are looking for the design goals of the project, please see my other blog post, Mining the Data Universe: Sending a Drill to Jupyter, about how this module came to be and the design considerations I used while building it.
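The module itself is Python, but the underlying interaction is simply a SQL query against a Drill endpoint. As a minimal sketch of reaching Drill programmatically (this uses Drill's documented REST API on its default port 8047 and its bundled classpath sample data; it is not a claim about jupyter_drill's internals):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class DrillRestQuery {
  public static void main(String[] args) throws Exception {
    // POST a SQL query to Drill's REST endpoint; results come back as JSON.
    URL url = new URL("http://localhost:8047/query.json");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    // Query against the sample data that ships with Drill.
    String body = "{\"queryType\": \"SQL\", \"query\": \"SELECT * FROM cp.`employee.json` LIMIT 5\"}";
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }
    try (Scanner s = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
      while (s.hasNextLine()) {
        System.out.println(s.nextLine());
      }
    }
  }
}
```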
https://mapr.com/blog/drilling-jupyter/
Citi Bike Real-Time Utilization Using Kafka Streams
This is where Kafka Streams comes in. Kafka Streams is an application API (currently in Java and Scala) that seamlessly integrates stateless (stream) and stateful (table) processing. The underlying premise of the design is very interesting: in short, it is based on the fact that a table can be reconstructed from a stream of change data capture (CDC) or transaction log records. If we have a stream of change logs, a table is just a local store that reflects the latest state of each change record.
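A minimal sketch of that stream-table duality in the Kafka Streams Java API (the topic names and string-valued records are assumptions, not the article's actual code): reducing a change-log stream to the latest value per key materializes exactly such a local store.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class StationState {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // The change-log stream: one record per station status update (station id -> status).
    KStream<String, String> updates = builder.stream("station-status");

    // Keeping only the latest value per key turns the stream into a table:
    // a local store reflecting the latest state of each change record.
    KTable<String, String> latest = updates
        .groupByKey()
        .reduce((oldValue, newValue) -> newValue,
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("latest-station-state"));

    latest.toStream().to("station-state");

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "citibike-utilization-demo");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }
}
```

For a topic that is already a change log, `builder.table("station-status")` achieves the same thing directly.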
AWS EC2 instance store vs EBS for MySQL
If you are using large EBS GP2 volumes for MySQL (i.e. 10TB+) on AWS EC2, you can increase performance and save a significant amount of money by moving to local SSD (NVMe) instance storage. Interested? Then read on for a more detailed examination of how to achieve cost-benefits and increase performance from this implementation.
Performance Comparison of HDP LLAP, Presto, SparkSQL, Hive on Tez, and Hive on MR3
There is a plethora of benchmark results available on the internet, but we still need new ones. Since all SQL-on-Hadoop systems constantly evolve, the landscape gradually changes, and previous benchmark results may already be obsolete. Moreover, the hardware employed in a benchmark may favor certain systems, and a system may not be configured to achieve its best performance. On the other hand, the TPC-DS benchmark remains the de facto standard for measuring the performance of SQL-on-Hadoop systems.
https://mr3.postech.ac.kr/blog/2018/08/15/comparison-llap-presto-spark-mr3/