Recent Hot Picks in the News/Blogs

DataDotz

Aug 22, 2018, 6:12:10 AM
to chenn...@googlegroups.com

UDF and UDAF in KSQL 5.0

KSQL is the open source streaming SQL engine that enables real-time data processing against Apache Kafka. KSQL makes it easy to read, write, and process streaming data in real time, at scale, using SQL-like semantics. KSQL already has plenty of built-in functions, such as SUBSTRING, STRINGTOTIMESTAMP, or COUNT. Even so, many users need additional functions to process their data streams.

KSQL now has an official API for building your own functions. As of the Confluent Platform 5.0 release, KSQL supports creating user-defined scalar functions (UDFs) and user-defined aggregate functions (UDAFs).

https://www.confluent.io/blog/build-udf-udaf-ksql-5-0

http://datadotz.com/datadotz-bigdata-weekly-61/#more-1297

Airflow DAG Tests and Unit Tests

Testing is an integral part of any software system: it builds confidence and increases the system's reliability. Recently I joined Grab, and here at Grab we are using Airflow to create and manage pipelines. But we were facing issues with Airflow, so I had a conversation with my engineering manager and we discussed how we could make Airflow reliable and testable.
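A minimal sketch of the kind of DAG validation test the post describes, written with pytest against Airflow's DagBag; the DAG id below is a made-up example:

from airflow.models import DagBag

def test_dagbag_has_no_import_errors():
    # DagBag parses every file in the dags folder; import_errors collects
    # the files that failed to load, so an empty dict means all DAGs parse.
    dag_bag = DagBag(include_examples=False)
    assert len(dag_bag.import_errors) == 0, dag_bag.import_errors

def test_pipeline_dag_is_well_formed():
    # 'my_pipeline' is a hypothetical DAG id used only for illustration.
    dag_bag = DagBag(include_examples=False)
    dag = dag_bag.get_dag("my_pipeline")
    assert dag is not None
    assert len(dag.tasks) > 0

Because these tests only load and inspect the DAGs, they catch syntax errors, missing imports, and misconfigured definitions without executing any tasks.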

https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c

http://datadotz.com/datadotz-bigdata-weekly-61/#more-1297

Spark SQL Performance in Video Play Sessions

Play Sessions are the bread and butter of the Data Pipelines engineering team at JW Player. They are an attempt to identify a single ‘unit of work’ of a video viewer by computing transformations and aggregations in Spark SQL to compact the data down to a much more manageable size. These query operations are performed across roughly 100 columns, which makes for a hefty query and was the impetus for tuning and optimizing our Spark SQL job.
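As a rough illustration of the kind of compaction involved (not JW Player's actual job; the paths, keys, and columns below are invented), a play-session aggregation in PySpark groups raw player events and collapses them into one row per session:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("play-sessions").getOrCreate()

# Hypothetical input: one row per raw player ping/event.
pings = spark.read.parquet("s3://example-bucket/raw_pings/")

# Collapse many events into a single row per (viewer, session), aggregating
# across a wide set of columns; the real job does this for ~100 columns.
sessions = (
    pings
    .groupBy("viewer_id", "session_id")
    .agg(
        F.min("event_time").alias("session_start"),
        F.max("event_time").alias("session_end"),
        F.sum("watch_seconds").alias("total_watch_seconds"),
        F.countDistinct("video_id").alias("videos_played"),
    )
)

sessions.write.mode("overwrite").parquet("s3://example-bucket/play_sessions/")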

https://medium.com/jw-player-engineering/optimizing-spark-sql-performance-in-video-play-sessions-d49bfcca59b7

http://datadotz.com/datadotz-bigdata-weekly-61/#more-1297

Data at Rezdy

The use of data in a startup, becomes increasingly important as its number of users grows. In the early days of any B2B startup, you have to meet with every single customer yourself to get them on-board, so understanding what will get them using the platform becomes a case of simply asking them. However, as a business scales, this approach to making decisions no longer works.

https://medium.com/rezdy-engineering/an-introduction-to-data-at-rezdy-53b12d9935f5

http://datadotz.com/datadotz-bigdata-weekly-61/#more-1297

Distributing Data Across Multiple Postgres Databases with Rails

The first step of architecting is always to figure out what we’re optimizing for. We’ve enjoyed great success with Postgres and saw no reason to depart. It has given us high performance and has an incredible array of datatypes and query functions that have been extremely useful.
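As a small, hypothetical illustration of those datatypes and query functions (JSONB containment and array membership, queried here from Python via psycopg2; the table and columns are invented):

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=shop")  # placeholder connection string
cur = conn.cursor()

# @> is Postgres's JSONB containment operator; ANY tests array membership.
cur.execute(
    """
    SELECT id, attrs->>'brand'
    FROM products
    WHERE attrs @> %s AND %s = ANY(tags)
    """,
    (Json({"organic": True}), "produce"),
)
rows = cur.fetchall()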

https://tech.instacart.com/scaling-at-instacart-distributing-data-across-multiple-postgres-databases-with-rails-13b1e4eba202

http://datadotz.com/datadotz-bigdata-weekly-61/#more-1297

Kafka Blindness

The most common response was the need for better tools to monitor and manage Kafka in production. Specifically, users wanted better visibility into what is going on in the cluster across Kafka's four key entities: producers, topics, brokers, and consumers. In fact, because we heard this same response over and over from the users we interviewed, we gave it a name: The Kafka Blindness.
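As a minimal sketch of the visibility users are asking for, consumer lag per partition can be computed client-side, for example with the kafka-python library (the broker address, topic, and group id below are placeholders):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="example-group",
    enable_auto_commit=False,
)

topic = "events"
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

for tp in partitions:
    committed = consumer.committed(tp) or 0  # group's last committed offset
    print(f"{tp.topic}[{tp.partition}] lag = {end_offsets[tp] - committed}")

Lag that grows over time is the classic sign that consumers are falling behind producers, which is exactly the producer/topic/broker/consumer correlation the post says users struggle to see.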

https://hortonworks.com/blog/kafka-blindness/

http://datadotz.com/datadotz-bigdata-weekly-61/#more-1297    

--


DataDotz

Aug 29, 2018, 5:22:17 AM
to chenn...@googlegroups.com

Scheduling Notebooks at Netflix

At Netflix we’ve put substantial effort into adopting notebooks as an integrated development platform. The idea started as a discussion of what development and collaboration interfaces might look like in the future. It evolved into a strategic bet on notebooks, both as an interactive UI and as the unifying foundation of our workflow scheduler. We’ve made significant strides towards this over the past year, and we’re currently in the process of migrating all 10,000 of the scheduled jobs running on the Netflix Data Platform to notebook-based execution. When we’re done, more than 150,000 Genie jobs will be running through notebooks on our platform every single day.
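The execution layer behind this is parameterized notebooks. A minimal sketch of that pattern using papermill (an assumption here, as the open source library for parameterized notebook execution associated with this work; the paths and parameters below are invented):

import papermill as pm

# Run a template notebook with job-specific parameters; the output notebook
# is an immutable record of the run, doubling as its log and artifact.
pm.execute_notebook(
    "templates/daily_report.ipynb",
    "runs/daily_report_2018-08-29.ipynb",
    parameters={"run_date": "2018-08-29", "region": "us-east-1"},
)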

https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6

http://datadotz.com/datadotz-bigdata-weekly-62/#more-1302

Kafka KSQL

KSQL is a SQL engine for Kafka. It allows you to write SQL queries to analyze a stream of data in real time. Since a stream is an unbounded data set, a query with KSQL will keep generating results until you stop it. KSQL is built on top of Kafka Streams: when you submit a query, it is parsed and a Kafka Streams topology is built and executed. This means that KSQL offers concepts similar to those of Kafka Streams, but expressed in a SQL language: streams (KStreams), tables (KTables), joins, windowing functions, etc.
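One concrete way to see this (a sketch, assuming a local KSQL server on its default port and an example stream named pageviews): a continuous query can be POSTed to the server's /query REST endpoint, which keeps streaming rows back until the client disconnects.

import json
import requests

resp = requests.post(
    "http://localhost:8088/query",  # placeholder KSQL server address
    json={
        "ksql": "SELECT * FROM pageviews;",
        "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
    },
    stream=True,
)

# The response is a long-lived stream: rows keep arriving as long as the
# query runs, reflecting the unbounded nature of the underlying stream.
for line in resp.iter_lines():
    if line:
        print(json.loads(line))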

http://aseigneurin.github.io/2018/08/22/kafka-tutorial-10-ksql.html

http://datadotz.com/datadotz-bigdata-weekly-62/#more-1302

Jupyter Notebooks and Apache Drill

This blog post will walk through the installation and basic usage of the jupyter_drill module for Python, which allows you, from a Jupyter Notebook, to connect and work with data from Apache Drill using IPython magic functions. If you are looking for the design goals of the project, please see my other blog post, Mining the Data Universe: Sending a Drill to Jupyter, about how this module came to be and the design considerations I used while building it.
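The magics themselves are covered in the post; as a hedged stand-in for working with Drill from plain Python scripts, the separate pydrill client talks to the same Drill REST interface (host, port, and the sample query below are placeholders):

from pydrill.client import PyDrill

drill = PyDrill(host="localhost", port=8047)
if not drill.is_active():
    raise RuntimeError("Drill is not reachable")

# cp.`employee.json` is the sample data source shipped on Drill's classpath.
result = drill.query("SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5")
for row in result:
    print(row)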

https://mapr.com/blog/drilling-jupyter/

http://datadotz.com/datadotz-bigdata-weekly-62/#more-1302

Citi Bike Real-Time Utilization Using Kafka Streams

This is where Kafka Streams comes in. Kafka Streams is a set of application APIs (currently in Java and Scala) that seamlessly integrates stateless (stream) and stateful (table) processing. The underlying premise of the design is very interesting: in short, it is based on the fact that a table can be reconstructed from a stream of change data capture (CDC) or transaction log records. If we have a stream of change logs, a table is just a local store that reflects the latest state of each change record.
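A tiny, language-agnostic sketch of that duality (plain Python, not the Kafka Streams API; station ids and counts are made up): replaying a changelog of (key, value) records rebuilds the table as a local store of each key's latest state.

changelog = [
    ("station_42", {"bikes": 10}),
    ("station_7", {"bikes": 3}),
    ("station_42", {"bikes": 9}),  # a later change record supersedes the earlier one
]

table = {}
for key, value in changelog:
    table[key] = value  # the table is just the latest value per key

print(table)  # {'station_42': {'bikes': 9}, 'station_7': {'bikes': 3}}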

https://towardsdatascience.com/tracking-nyc-citi-bike-real-time-utilization-using-kafka-streams-1c0ea9e24e79

http://datadotz.com/datadotz-bigdata-weekly-62/#more-1302

AWS EC2 instance store vs EBS for MySQL

If you are using large EBS GP2 volumes for MySQL (i.e. 10TB+) on AWS EC2, you can increase performance and save a significant amount of money by moving to local SSD (NVMe) instance storage. Interested? Then read on for a more detailed examination of how this setup decreases cost and increases performance.

https://www.percona.com/blog/2018/08/20/using-aws-ec2-instance-store-vs-ebs-for-mysql-how-to-increase-performance-and-decrease-cost/

http://datadotz.com/datadotz-bigdata-weekly-62/#more-1302

Performance Comparison of HDP LLAP, Presto, SparkSQL, Hive on Tez, and Hive on MR3 

There is a plethora of benchmark results available on the internet, but we still need new ones: since all SQL-on-Hadoop systems constantly evolve, the landscape gradually changes and previous benchmark results may already be obsolete. Moreover, the hardware employed in a benchmark may favor certain systems, and a system may not be configured to achieve its best performance. Meanwhile, the TPC-DS benchmark remains the de facto standard for measuring the performance of SQL-on-Hadoop systems.

https://mr3.postech.ac.kr/blog/2018/08/15/comparison-llap-presto-spark-mr3/

http://datadotz.com/datadotz-bigdata-weekly-62/#more-1302
