Tracing Distributed JVM Applications
Computing frameworks like Apache Spark have been widely adopted to build large-scale data applications. For Uber, data is at the heart of strategic decision-making and product development. To help us better leverage this data, we manage massive deployments of Spark across our global engineering offices.
https://eng.uber.com/jvm-profiler/
http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269
Uber’s Hadoop Distributed File System
Uber Engineering adopted Hadoop as the storage (HDFS) and compute (YARN) infrastructure for our organization’s big data analysis. This analysis powers our services and enables the delivery of more seamless and reliable user experiences.
https://eng.uber.com/scaling-hdfs/
http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269
Automate Replications with Cloudera Manager API
Cloudera Enterprise Backup and Disaster Recovery (BDR) enables you to replicate data across data centers for disaster recovery scenarios. As a lower-cost alternative to geographical redundancy, or as a means to perform an on-premises-to-cloud migration, BDR can also replicate HDFS and Hive data to and from Amazon S3 or a Microsoft Azure Data Lake Store.
http://blog.cloudera.com/blog/2018/06/how-to-automate-replications-with-cloudera-manager-api/
http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269
Exploding Big Data
Big data comes in a variety of shapes. Extract-Transform-Load (ETL) workflows are more or less stripe-shaped and produce an output of a similar size to the input. Reporting workflows are funnel-shaped and progressively reduce the data size by filtering and aggregating.
https://engineering.linkedin.com/blog/2017/06/managing--exploding--big-data
http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269
Apache Kafka
The main goal of this post is to demonstrate the concepts of multiple brokers, partitioning, and replication in Apache Kafka. At the end of the post, steps are included to set up multiple brokers along with partitioning and replication. If you are new to Apache Kafka, you can refer to the posts below to understand Kafka quickly.
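As a hedged sketch of the concepts described (not taken from the post itself), the kafka-python admin client can create a topic spread across several brokers with multiple partitions and a replication factor; the broker addresses, topic name, and counts below are illustrative placeholders.

# Sketch: create a partitioned, replicated topic with kafka-python.
# Broker list, topic name, and counts are illustrative assumptions.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers=["localhost:9092", "localhost:9093", "localhost:9094"]
)

# 3 partitions, each replicated to 2 of the 3 brokers.
topic = NewTopic(name="demo-topic", num_partitions=3, replication_factor=2)
admin.create_topics([topic])
admin.close()

With such a topic in place, partitions are distributed across the brokers and each partition's replicas live on different brokers, which is the setup the post walks through by hand.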
http://nverma-tech-blog.blogspot.com/2015/12/apache-kafka-multibroker-partitioning.html
http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269
Dynamic Machine Translation in the LinkedIn Feed
The need for economic opportunity is global, and that is represented by the fact that more than half of LinkedIn’s active members live outside of the U.S. Engagement across language barriers and borders comes with a certain set of challenges, one of which is providing a way for members to communicate in their native language. In fact, translation of member posts has been one of our most requested features, and now it's finally here.
https://engineering.linkedin.com/blog/2018/06/dynamic-machine-translation-in-the-linkedin-feed-
http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269
SparkSQL
Bastian Haase is an alum of the Insight Data Engineering program in Silicon Valley, now working as a Program Director at Insight Data Science for the Data Engineering and DevOps Engineering programs. In this blog post, he shares his experiences on how to get started working on open source software. As one of the first steps of my Insight project, I wanted to compute the intersection of two columns of a Spark SQL DataFrame containing arrays of strings.
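For readers who just want the end result of that task, here is a minimal PySpark sketch (not from the original post): in Spark 2.4 and later the built-in array_intersect function covers it directly, while older versions needed a UDF. Column names and values are invented for illustration.

# Sketch: intersect two array<string> columns of a Spark DataFrame.
# Requires Spark 2.4+ for array_intersect; names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_intersect

spark = SparkSession.builder.appName("array-intersect-demo").getOrCreate()

df = spark.createDataFrame(
    [(["a", "b", "c"], ["b", "c", "d"])],
    ["left_tags", "right_tags"],
)

df.select(array_intersect("left_tags", "right_tags").alias("common_tags")).show()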
https://blog.insightdatascience.com/a-journey-through-spark-5d67b4af4b24
http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275
Amazon Athena
At Goodreads, we’re currently in the process of decomposing our monolithic Rails application into microservices. For the vast majority of those services, we’ve decided to use Amazon DynamoDB as the primary data store. We like DynamoDB because it gives us consistent, single-digit-millisecond performance across a variety of our storage and throughput needs.
http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275
Transforming Financial Forecasting with Machine Learning
As of 2018, Uber’s ridesharing business operates in more than 600 cities, while the Uber Eats food delivery business has expanded to more than 250 cities worldwide. The gross booking run rate for ridesharing hit $37 billion in 2018. With more than 75 million active riders and 3 million active drivers, Uber’s platform powers 15 million trips every day.
https://eng.uber.com/transforming-financial-forecasting-machine-learning/
http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275
Cloudera Manager
One instance of Cloudera Manager (CM) can manage N clusters. In the current Role Based Access Control (RBAC) model, CM users hold privileges and permissions across everything in CM’s purview (including every cluster that CM manages). For example, Read-Only user John is a user who can perform all the actions of Read-Only users on all clusters managed by CM. The “Cluster Admin” user Chris is a cluster administrator of all the clusters managed by CM.
http://blog.cloudera.com/blog/2018/07/fine-grained-access-control-in-cloudera-manager/
http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275
Apache Beam and Apache NiFi
This story is about transforming XML data into an RDF graph with the help of Apache Beam pipelines run on Google Cloud Platform (GCP) and managed with Apache NiFi. We had the following goal: take data describing commercial companies (collected by the Federal Tax Service of Russia) and turn it into a form that enables querying relations between a company and its parts (owners, management), as well as between different companies.
https://medium.com/datafabric/apache-beam-and-apache-nifi-cooperation-bba42d2de36c
http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275
Gaming Events Data Pipeline with Databricks Delta
The world of mobile gaming is fast-paced and requires the ability to scale quickly. With millions of users around the world generating millions of events per second through game play, you need to calculate key metrics (score adjustments, in-game purchases, in-game actions, etc.) in real time.
Data Pipeline Patterns in the Decoupled Processing Era
A data pipeline is a sequence of transformations that converts raw data into actionable insights. In the past, processing and storage engines were coupled together; for example, a traditional MPP warehouse combines both a processing and a storage engine. With decoupled processing solutions (such as Spark, Redshift Spectrum, etc.) becoming mainstream in both open source and the AWS big data ecosystem, what are the popular data pipeline patterns? This post describes the data pipeline patterns we have defined in the context of decoupled processing engines.
http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282
Apache Hadoop 3.1, YARN & HDP 3.0
GPUs are increasingly becoming a key tool for many big data applications. Deep learning, machine learning, data analytics, genome sequencing, and more all have applications that rely on GPUs for tractable performance. In many cases, GPUs can deliver up to 10x speedups, and in some reported cases up to 300x. Many modern deep-learning applications build directly on top of GPU libraries like cuDNN (CUDA Deep Neural Network library). It’s not a stretch to say that many applications like deep learning cannot live without GPU support.
https://hortonworks.com/blog/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/
http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282
Using StreamSets and MapR
To use StreamSets with MapR, the mapr-client package needs to be installed on the StreamSets host. Alternatively (emphasized because this is important), you can run a separate CentOS Docker container that has the mapr-client package installed, and then share /opt/mapr as a Docker volume with the StreamSets container.
https://mapr.com/blog/using-streamsets-and-mapr-together-in-docker/
http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282
TensorFlow on Spark 2.3
TensorFlow is Google’s open source software library dedicated to high-performance numerical computation. Taking advantage of GPU, TPU, and CPU power, and usable on servers, clusters, and even mobile devices, it is mainly used for machine learning and especially deep learning. It provides support for C++, Python, and, more recently, JavaScript with TensorFlow.js. The library is now at version 1.8 and comes with an official high-level wrapper called Keras.
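As a minimal, hedged illustration of the Keras wrapper mentioned above (not taken from the article), a tiny tf.keras model can be defined, compiled, and trained in a few lines; the layer sizes, shapes, and random data are arbitrary.

# Sketch: a small tf.keras model, just to show the high-level Keras API
# bundled with TensorFlow. Shapes, sizes, and data are arbitrary.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train on random data purely for demonstration.
x = np.random.rand(100, 10).astype("float32")
y = np.random.randint(0, 2, size=(100, 1))
model.fit(x, y, epochs=1, batch_size=16, verbose=0)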
http://www.adaltas.com/en/2018/05/29/spark-tensorflow-2-3/
http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282
Data engineering tech at Unruly
As with many of the products we build, data engineering at Unruly began with a product experiment. As an ad-tech organisation, we wanted to demonstrate the commercial value of predicting whether a user will complete watching a digital video ad. This provides value by meeting advertisers' objectives (KPIs), saving money, serving the end user more relevant ads, and helping Unruly to optimise ad serving.
https://medium.com/unruly-engineering/data-engineering-tech-at-unruly-bb8b4afa2758
http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282
Data Warehousing and Building Analytics
CHSDM (Cooper Hewitt, Smithsonian Design Museum) knows which objects are the most popular at any given time, how long visitors spend in its galleries, and many other quantitative facts about visitor behavior it previously was unable to understand. The museum is beginning to develop a deep understanding of the ways its visitors behave with the Pen, and this paper will attempt to explain how it can continue to develop the tools necessary to dig even deeper.
Apache Pulsar Tiered Storage with Amazon S3
Apache Pulsar’s tiered storage feature enables Pulsar to offload older messages on a topic to a long-term storage system, freeing up space in Apache BookKeeper and taking advantage of scalable, low-cost options such as cloud storage. Tiered storage is valuable for a topic whose data you want to retain for a very long time. For example, if you have a topic containing user actions that you use to train your recommendation systems, you may want to keep that data for a long time so that if you create a new version of your recommendation model you can rerun it against the full user history.
https://streaml.io/blog/configuring-apache-pulsar-tiered-storage-with-amazon-s3
http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350
Machine learning & Kafka KSQL stream processing
KSQL is an open source, streaming SQL engine that enables real-time data processing against Apache Kafka. KSQL supports creating User Defined Scalar Functions (UDFs) via custom jars that are uploaded to the ext/ directory of the KSQL installation. That is, my compiled anomaly score function can be exposed to the KSQL server and executed against the Kafka stream.
http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350
Accelerating Kafka in AWS
Using Kafka to build real-time data pipelines and now experiencing growing pains at scale? Then, no doubt, like Branch you are also running into issues that directly impact your business's continued ability to provide first-class service. While your own Kafka implementation may vary, we hope you can leverage our learnings and ultimate success in optimizing Kafka to effectively scale your data processing and support your growing business.
http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350
Database Migration at Scale
If the database contains quite a small data set, then perhaps it’s possible to start and finish the procedure within a declared maintenance window. By doing so, you guarantee consistency, since the data isn't being changed by your servers. However, this isn't a valid solution if it violates the company’s SLA (service-level agreement) in terms of the product’s availability. For example, at CallApp we very rarely declare a downtime window.
https://medium.com/callapp/database-migration-at-scale-ae85c14c3621
http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350
Benchmarking Google BigQuery at Scale
It is the Core Data Services team that provides this data to our stakeholders across the business. Historically, we have served data using a petabyte-scale, on-premises data warehouse built on top of the Hadoop ecosystem. This came with a number of challenges, ranging from operational overheads and the scalability of applications and services running on the platform, all the way through to there being multiple SQL frameworks for users to access data based on their particular use case, rather than a convenient “Single Pane of Glass” solution.
https://medium.com/@TechKing/benchmarking-google-bigquery-at-scale-13e1e85f3bec
http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350
Apache Flink manages Kafka consumer offsets
The Kafka consumer in Apache Flink integrates with Flink’s checkpointing mechanism as a stateful operator whose state is the set of read offsets in all Kafka partitions. When a checkpoint is triggered, the offsets for each partition are stored in the checkpoint. Flink’s checkpoint mechanism ensures that the stored states of all operator tasks are consistent, i.e., they are based on the same input data. A checkpoint is complete when all operator tasks have successfully stored their state. Hence, the system provides exactly-once state update guarantees when restarting after a system failure.
https://data-artisans.com/blog/how-apache-flink-manages-kafka-consumer-offsets
http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350
History of High Availability
In the days of yore, databases ran on single machines. There was only one node, and it handled all reads and all writes. There was no such thing as a “partial failure”; the database was either up or down. Total failure of a single database was a two-fold problem for the internet: first, computers were being accessed around the clock, so downtime was more likely to directly impact users; second, by being placed under constant demand, computers were more likely to fail. The obvious solution to this problem is to have more than one computer that can handle requests, and this is where the story of distributed databases truly begins.
https://www.cockroachlabs.com/blog/brief-history-high-availability/
http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353
Production PySpark Jobs
Data processing, insights and analytics are at the heart of Addictive Mobility, a division of Pelmorex Corp. We take pride in our data expertise and proprietary technology to offer mobile advertising solutions to our clients. Our primary product offering is a Demand Side Platform (DSP). We offer the technology that allows ad spaces to be bid on in real time and qualified ads to be delivered to customers.
https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e
http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353
Deep Learning Model On Kubernetes With Python, Keras, Flask, and Docker
Kubernetes, and its broader new buzzword, cloud-native, are taking the world by storm. Don’t worry, you are right to be skeptical. We’ve all seen the tech hype bubble turn into a veritable tsunami of jargon with AI, Big Data, and The Cloud. It’s yet to be seen if the same will happen with Kubernetes. But with your data science dilettante guiding you today, I have neither understanding of nor interest in the transformative reasons to use Kubernetes. My motivation is simple: I want to deploy, scale, and manage a REST API that serves up predictions. As you will see, Kubernetes makes this exceedingly easy.
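A minimal sketch of the kind of Flask prediction endpoint the post describes deploying; the model file, route, and JSON input format here are assumptions for illustration, not the author's actual code.

# Sketch: a Flask REST endpoint serving predictions from a saved Keras model.
# The model file name, route, and JSON schema are illustrative assumptions.
from flask import Flask, jsonify, request
from tensorflow import keras
import numpy as np

app = Flask(__name__)
model = keras.models.load_model("model.h5")  # hypothetical saved model

@app.route("/predict", methods=["POST"])
def predict():
    features = np.array(request.get_json()["features"], dtype="float32")
    prediction = model.predict(features.reshape(1, -1))
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Packaging an app like this into a Docker image and exposing it through a Kubernetes Deployment and Service is the part the post argues Kubernetes makes easy.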
http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353
Kubernetes to productionize Spark ML faster
Machine learning (ML) models are nothing new to us. We’ve used them to power products such as our used car valuations and Price Indicators. The processes involved in training and serving these models are complex, involving code written by both data scientists and developers in a variety of languages including Python, Java and R. Serving predictions from these models in real time typically involves a custom application to extract the coefficients from the saved model, and apply them to an input request.
https://engineering.autotrader.co.uk/2018/10/03/productionizing-days-to-sell.html
http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353
Lenses SQL for your Intrusion Detection System
During the last decade, applications dealing with high data throughput were limited to near-real-time operation. Take Network Intrusion Detection Systems (NIDS), for example: a crucial tool in network security, whatever your definition of security is. Until a few years ago, such systems required expensive proprietary hardware solutions, tied to the hardware vendor's (often poor) software tools.
https://www.landoop.com/blog/2018/10/lenses-sql-for-your-intrustion-detection-system/
http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353
Kubeflow on Amazon EKS
The Kubeflow project is designed to simplify the deployment of machine learning projects like TensorFlow on Kubernetes. There are also plans to add support for additional frameworks such as MXNet, PyTorch, Chainer, and more. These frameworks can leverage GPUs in the Kubernetes cluster for machine learning tasks.
https://aws.amazon.com/blogs/opensource/kubeflow-amazon-eks/
Physical storage of your database tables might matter
In our quest to simplify and enrich online grocery shopping for our users, we experimented with serving personalized item recommendations to each one of them. For this we operated in batch mode, pre-computed the top 200 relevant item recommendations for each user, and dumped the results into a table in our OLTP PostgreSQL database so that we could serve these recommendations in real time. Queries on this table were taking too long, which resulted in a bad user experience.
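To make the access pattern concrete, here is a hedged sketch of serving one user's precomputed recommendations from PostgreSQL with psycopg2; the connection string, table, column, and value names are invented, not Grofers'. The article's point is that even with an index, the physical ordering of rows on disk can dominate latency for this kind of query.

# Sketch: read one user's precomputed recommendations from PostgreSQL.
# Connection details, table, and column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT item_id, score
        FROM user_recommendations   -- hypothetical table, ~200 rows per user
        WHERE user_id = %s
        ORDER BY score DESC
        LIMIT 200
        """,
        (12345,),
    )
    recommendations = cur.fetchall()
conn.close()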
https://lambda.grofers.com/why-physical-storage-of-your-database-tables-might-matter-74b563d664d9
http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360
Spark core concepts visualized
Learning Spark is not an easy thing for a person with little background knowledge of distributed systems. Even though I have been using Spark for quite some time, I find it time-consuming to get a comprehensive grasp of all the core concepts in Spark. The official Spark documentation provides a very detailed explanation, yet it focuses more on the practical programming side. Also, the tons of online tutorials can be overwhelming for a starter. Therefore, in this article I would like to note down those Spark core concepts in a more visualized way. Hope you will find it useful as well.
https://medium.com/@pang.xin/spark-study-notes-core-concepts-visualized-5256c44e4090
http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360
Adobe Experience Platform Pipeline with Kafka
The aggregation, enrichment, and near-real-time analysis of high-frequency data present a whole new set of challenges. Particularly when the goal is to take data collected at any experience point, derive insights, and then deliver the next best experience at the speed customers demand, tying multiple messaging systems together is a non-starter.
https://medium.com/adobetech/creating-the-adobe-experience-platform-pipeline-with-kafka-4f1057a11ef
http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360
Amazon Redshift at Equinox Fitness Clubs
Clickstream analysis tools handle their data well, and some even have impressive BI interfaces. However, analyzing clickstream data in isolation comes with many limitations. For example, a customer is interested in a product or service on your website. They go to your physical store to purchase it. The clickstream analyst asks, “What happened after they viewed the product?” and the commerce analyst asks themselves, “What happened before they bought it?”
http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360
Data Artisans Platform on AWS EKS
In this article, we describe the setup of data Artisans Platform using Amazon Web Services Elastic Container Service for Kubernetes (AWS EKS). data Artisans Platform is built on Kubernetes as the underlying resource manager. Kubernetes is available in all major cloud providers, on-premise and also on single machines.
https://data-artisans.com/blog/how-to-get-started-with-data-artisans-platform-on-aws-eks
http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360
Event Processing Design Patterns With Pulsar Functions
Let’s review how we would go about implementing content-based routing using Apache Pulsar Functions. Content-based routing is an integration pattern that has been around for years and is commonly used in event hubs and messaging frameworks. The basic idea is that the content of each message is inspected, and the message is then routed to various destinations based upon values found (or not found) in the content.
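As a hedged sketch of what such a router can look like with the Pulsar Functions Python SDK (the topic names and the routing condition are invented for illustration):

# Sketch: content-based routing as a Pulsar Function (Python SDK).
# Topic names and the routing rule are illustrative assumptions.
from pulsar import Function

class ContentBasedRouter(Function):
    def process(self, input, context):
        # Inspect the message content and publish it to a destination topic
        # chosen by what the content contains.
        if "error" in input:
            context.publish("persistent://public/default/alerts", input)
        else:
            context.publish("persistent://public/default/processed", input)
        return None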
https://streaml.io/blog/eda-event-processing-design-patterns-with-pulsar-functions
http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360
--
Replication Guide On HDFS and Amazon Web Services
Hortonworks’ Data Lifecycle Manager (DLM), an extensible service built on the Hortonworks DataPlane Platform (DPS), provides a complete solution to replicate HDFS and Hive data, metadata, and security policies between on-premises clusters and Amazon S3. This data movement enables data science and ML workloads to execute models in Amazon SageMaker and bring the resulting data back on-premises. To facilitate this use case, the post walks through the three steps for replication between HDFS and the AWS cloud.
http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370
Azure Source
This is an update to the Azure Sphere Operating System, Azure Sphere Security Service, and Visual Studio development environment. This release includes substantial investments in our security infrastructure and our connectivity solutions, and it incorporates some of your feedback. Azure Sphere, which is in public preview, is a secured, high-level application platform with built-in communication and security features for internet-connected devices. It comprises an Azure Sphere microcontroller unit (MCU), tools and an SDK for developing applications, and the Azure Sphere Security Service, through which applications can securely connect to the cloud and web.
https://azure.microsoft.com/en-us/blog/azure-source-volume-58/
http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370
Kafka Connect Deep Dive Converters and Serialization
Kafka Connect is part of Apache Kafka, providing streaming integration between data stores and Kafka. For data engineers, using it requires just JSON configuration files. There are connectors for common (and not-so-common) data stores out there already, including JDBC, Elasticsearch, IBM MQ, S3, and BigQuery, to name but a few. For developers, Kafka Connect has a rich API with which additional connectors can be developed if required. In addition, it also has a REST API for configuration and management of connectors.
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370
Druid enables analytics at Airbnb
Airbnb serves millions of guests and hosts in our community. Every second, their activities on Airbnb.com, such as searching, booking, and messaging, generate a huge amount of data we anonymize and use to improve the community’s experience on our platform. The Data Platform Team at Airbnb strives to leverage this data to improve our customers’ experiences and optimize Airbnb’s business. Our mission is to provide infrastructure to collect, organize, and process this deluge of data (all in privacy-safe ways), and to empower various organizations across Airbnb to derive necessary analytics and make data-informed decisions from it.
https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c
http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370
Higher-Order Functions for Complex Data Types in Apache Spark 2.4
Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (for example, the array type), including higher-order functions. Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying some functions, and then creating the structure again; or 2) building a User Defined Function (UDF). In contrast, the new built-in functions can directly manipulate complex types, and the higher-order functions can manipulate complex values with an anonymous lambda function, similar to UDFs but with much better performance.
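For a flavor of those higher-order functions, here is an illustrative PySpark snippet (not from the original post) applying transform, filter, and aggregate to an array column with lambda syntax; the column name and values are made up.

# Sketch: Spark 2.4 higher-order functions on an array column.
# Column name and values are illustrative; requires Spark 2.4+.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-demo").getOrCreate()
df = spark.createDataFrame([([1, 2, 3, 4],)], ["values"])

df.selectExpr(
    "transform(values, x -> x * 2) AS doubled",             # map each element
    "filter(values, x -> x % 2 = 0) AS evens",              # keep even elements
    "aggregate(values, 0, (acc, x) -> acc + x) AS total",   # sum via a fold
).show()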
http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370
Real-Time Analytics with Pulsar Functions
For many event-driven applications, how quickly data can be processed, understood and reacted to is paramount. In analytics and data processing for those scenarios, calculating a precise value may be time-consuming or impractically resource intensive. In those cases having an approximate answer within a given time frame is much more valuable than waiting to calculate an exact answer.
https://streaml.io/blog/eda-real-time-analytics-with-pulsar-functions
http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370
--