Recent Hot Picks in the News/Blogs


DataDotz DataDotz

Jul 22, 2018, 5:53:28 AM
to chenn...@googlegroups.com

Tracing Distributed JVM Applications

Computing frameworks like Apache Spark have been widely adopted to build large-scale data applications. For Uber, data is at the heart of strategic decision-making and product development. To help us better leverage this data, we manage massive deployments of Spark across our global engineering offices.

https://eng.uber.com/jvm-profiler/

http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269


Uber’s Hadoop Distributed File System

Uber Engineering adopted Hadoop as the storage (HDFS) and compute (YARN) infrastructure for our organization’s big data analysis. This analysis powers our services and enables the delivery of more seamless and reliable user experiences.

https://eng.uber.com/scaling-hdfs/

http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269

Automate Replications with Cloudera Manager API

Cloudera Enterprise Backup and Disaster Recovery (BDR) enables you to replicate data across data centers for disaster recovery scenarios. As a lower-cost alternative to geographical redundancy, or as a means to perform an on-premises-to-cloud migration, BDR can also replicate HDFS and Hive data to and from Amazon S3 or a Microsoft Azure Data Lake Store.

http://blog.cloudera.com/blog/2018/06/how-to-automate-replications-with-cloudera-manager-api/

 http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269

Exploding Big Data

Big data comes in a variety of shapes. Extract-Transform-Load (ETL) workflows are more or less stripe-shaped (the left panel of the figure in the post) and produce an output of a similar size to the input. Reporting workflows are funnel-shaped (the middle panel of the figure) and progressively reduce the data size by filtering and aggregating.

https://engineering.linkedin.com/blog/2017/06/managing--exploding--big-data

 http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269

Apache Kafka

The main goal of this post is to demonstrate the concept of multi-broker partitioning and replication in Apache Kafka. At the end of the post, steps are included to set up multiple brokers along with partitioning and replication. If you are new to Apache Kafka, you can refer to the earlier posts linked there to understand Kafka quickly.
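
The linked post uses Kafka's command-line tools; as a rough Python equivalent, here is a minimal sketch that creates a replicated, partitioned topic with the kafka-python admin client. The broker address, topic name, and counts are placeholders, assuming a local three-broker cluster is already running.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes three local brokers (e.g. listening on 9092/9093/9094) are already up.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# One topic with 3 partitions, each partition replicated to all 3 brokers.
topic = NewTopic(name="demo-topic", num_partitions=3, replication_factor=3)
admin.create_topics([topic])
```

Each partition then has a leader on one broker and followers on the other two, which is the replication behaviour the post walks through with the CLI tools.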

http://nverma-tech-blog.blogspot.com/2015/12/apache-kafka-multibroker-partitioning.html

http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269

Dynamic Machine Translation in the LinkedIn Feed

The need for economic opportunity is global, and that is reflected by the fact that more than half of LinkedIn’s active members live outside of the U.S. Engagement across language barriers and borders comes with a certain set of challenges, one of which is providing a way for members to communicate in their native language. In fact, translation of member posts has been one of our most requested features, and now it's finally here.

https://engineering.linkedin.com/blog/2018/06/dynamic-machine-translation-in-the-linkedin-feed-

 http://datadotz.com/datadotz-bigdata-weekly-58/#more-1269



DataDotz DataDotz

Jul 26, 2018, 9:08:43 AM
to chenn...@googlegroups.com

SparkSQL

Bastian Haase is an alum of the Insight Data Engineering program in Silicon Valley, now working as a Program Director at Insight Data Science for the Data Engineering and DevOps Engineering programs. In this blog post, he shares his experiences on how to get started working on open source software: "As one of the first steps of my Insight project, I wanted to compute the intersection of two columns of a Spark SQL DataFrame containing arrays of strings."
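
As a rough illustration of that starting point, here is a minimal PySpark sketch (column names and data are made up) that computes the intersection of two array-of-strings columns with a Python UDF, which was the typical workaround before a built-in function existed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["a", "b", "c"], ["b", "c", "d"])], ["left", "right"])

# Python UDF computing the set intersection of two array<string> columns.
intersect = udf(lambda a, b: list(set(a) & set(b)), ArrayType(StringType()))
df.withColumn("common", intersect("left", "right")).show()
```

Spark 2.4 later added a built-in array_intersect() that does the same without the UDF serialization overhead.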

https://blog.insightdatascience.com/a-journey-through-spark-5d67b4af4b24

http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275

Amazon Athena

At Goodreads, we’re currently in the process of decomposing our monolithic Rails application into microservices. For the vast majority of those services, we’ve decided to use Amazon DynamoDB as the primary data store. We like DynamoDB because it gives us consistent, single-digit-millisecond performance across a variety of our storage and throughput needs.
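
A minimal sketch of the querying half of that setup using boto3; the database, table, and S3 locations are hypothetical stand-ins for the exported DynamoDB data.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Query a table defined over the DynamoDB export sitting in S3 (names are placeholders).
response = athena.start_query_execution(
    QueryString="SELECT book_id, rating FROM reviews WHERE rating >= 4 LIMIT 10",
    QueryExecutionContext={"Database": "goodreads_export"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])
```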

https://aws.amazon.com/blogs/big-data/how-goodreads-offloads-amazon-dynamodb-tables-to-amazon-s3-and-queries-them-using-amazon-athena/

http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275

Machine Learning at Uber


As of 2018, Uber’s ridesharing business operates in more than 600 cities, while the Uber Eats food delivery business has expanded to more than 250 cities worldwide. The gross booking run rate for ridesharing hit $37 billion in 2018. With more than 75 million active riders and 3 million active drivers, Uber’s platform powers 15 million trips every day.

https://eng.uber.com/transforming-financial-forecasting-machine-learning/

http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275

Cloudera Manager

One instance of Cloudera Manager (CM) can manage N clusters. In the current Role Based Access Control (RBAC) model, CM users hold privileges and permissions across everything in CM’s purview, including every cluster that CM manages. For example, the Read-Only user John can perform all Read-Only actions on every cluster managed by CM, and the Cluster Admin user Chris is a cluster administrator of all the clusters managed by CM.

http://blog.cloudera.com/blog/2018/07/fine-grained-access-control-in-cloudera-manager/

http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275

Apache Beam and Apache NiFi 

This story is about transforming XML data into an RDF graph with the help of Apache Beam pipelines run on Google Cloud Platform (GCP) and managed with Apache NiFi. We had the following goal: take data describing commercial companies (collected by the Federal Tax Service of Russia) and turn it into a form that enables querying relations between a company and its parts (owners, management) as well as between different companies.
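
A minimal sketch of the XML-to-triples idea with the Beam Python SDK; the element names and the tiny inline input are invented, and the real pipeline in the article runs on GCP and is orchestrated with NiFi.

```python
import xml.etree.ElementTree as ET

import apache_beam as beam

def company_to_triples(xml_string):
    # Turn one <company> snippet into simple subject-predicate-object triples.
    root = ET.fromstring(xml_string)
    company = root.attrib.get("id", "unknown")
    for owner in root.findall("owner"):
        yield (company, "hasOwner", owner.text)

with beam.Pipeline() as pipeline:
    (pipeline
     | "ReadXML" >> beam.Create(['<company id="c1"><owner>Alice</owner></company>'])
     | "ToTriples" >> beam.FlatMap(company_to_triples)
     | "Format" >> beam.Map(lambda t: " ".join(t))
     | "Write" >> beam.io.WriteToText("triples"))
```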

https://medium.com/datafabric/apache-beam-and-apache-nifi-cooperation-bba42d2de36c

http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275

Gaming Events Data Pipeline with Databricks Delta

The world of mobile gaming is fast-paced and requires the ability to scale quickly. With millions of users around the world generating millions of events per second through game play, you will need to calculate key metrics (score adjustments, in-game purchases, in-game actions, etc.) in real time.
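
Roughly what such a pipeline looks like in PySpark Structured Streaming, sketched under assumptions: the Kafka topic, schema, and paths are placeholders, and the delta sink needs Databricks or the open-source Delta Lake package available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("event_type", StringType()),
    StructField("player_id", StringType()),
])

# Read raw game events from Kafka and parse the JSON payload.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "game-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Per-minute event counts by type, continuously written to a Delta table.
counts = events.groupBy(window("event_time", "1 minute"), "event_type").count()

(counts.writeStream
 .format("delta")
 .outputMode("complete")
 .option("checkpointLocation", "/tmp/checkpoints/game-events")
 .start("/tmp/delta/game_event_counts"))
```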

https://databricks.com/blog/2018/07/02/build-a-mobile-gaming-events-data-pipeline-with-databricks-delta.html

http://datadotz.com/datadotz-bigdata-weekly-59/#more-1275

DataDotz DataDotz

Aug 2, 2018, 6:38:45 AM
to chenn...@googlegroups.com

Data Pipeline Patterns in the Decoupled Processing Era

A data pipeline is a sequence of transformations that converts raw data into actionable insights. In the past, the processing and storage engines were coupled together; a traditional MPP warehouse, for example, combines both a processing and a storage engine. With decoupled processing solutions (such as Spark, Redshift Spectrum, etc.) becoming mainstream in both the open-source world and the AWS big data ecosystem, what are the popular data pipeline patterns? This post describes the data pipeline patterns we have defined in the context of decoupled processing engines.

https://quickbooks-engineering.intuit.com/re-think-your-data-pipelines-in-the-decoupled-era-5b032bc8b779

http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282

 

Apache Hadoop 3.1, YARN & HDP 3.0

GPUs are increasingly becoming a key tool for many big data applications. Deep learning/machine learning, data analytics, genome sequencing, and more all have applications that rely on GPUs for tractable performance. In many cases, GPUs can deliver up to 10x speedups, and in some reported cases up to 300x speedups. Many modern deep-learning applications build directly on top of GPU libraries like cuDNN (CUDA Deep Neural Network library). It’s not a stretch to say that many applications, deep learning in particular, cannot live without GPU support.

https://hortonworks.com/blog/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/

http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282

 

Using StreamSets and MapR

To use StreamSets with MapR, the mapr-client package needs to be installed on the StreamSets host. Alternatively (emphasized because this is important), you can run a separate CentOS Docker container, which has the mapr-client package installed, then you can share /opt/mapr as a Docker volume with the StreamSets container.
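
A hedged sketch of that volume-sharing idea using the Docker SDK for Python; the image names are placeholders, not the exact images used in the article.

```python
import docker

client = docker.from_env()

# Named volume that will hold the MapR client installation.
client.volumes.create(name="mapr-client")

# Hypothetical CentOS-based image with the mapr-client package installed;
# its /opt/mapr contents populate the shared volume on first mount.
client.containers.run(
    "example/centos-mapr-client:latest",
    command="sleep infinity",
    volumes={"mapr-client": {"bind": "/opt/mapr", "mode": "rw"}},
    detach=True,
)

# StreamSets Data Collector container mounting the same /opt/mapr volume.
client.containers.run(
    "streamsets/datacollector:latest",
    volumes={"mapr-client": {"bind": "/opt/mapr", "mode": "ro"}},
    ports={"18630/tcp": 18630},
    detach=True,
)
```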

https://mapr.com/blog/using-streamsets-and-mapr-together-in-docker/

http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282

 

TensorFlow on Spark 2.3

TensorFlow is Google’s open source software library dedicated to high-performance numerical computation. Taking advantage of the power of GPUs, TPUs, and CPUs, and usable on servers, clusters, and even mobile devices, it is mainly used for machine learning and especially deep learning. It provides support for C++, Python, and recently JavaScript with TensorFlow.js. The library is now at version 1.8 and comes with an official high-level wrapper called Keras.

http://www.adaltas.com/en/2018/05/29/spark-tensorflow-2-3/

http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282

 

Data engineering tech at Unruly

As with many of the products we build, data engineering began at Unruly with a product experiment. As an ad-tech organisation, we wanted to demonstrate the commercial value of predicting whether a user will complete watching a digital video ad. This provides value by meeting advertisers' objectives (KPIs), saving money, showing the end user more relevant ads, and helping Unruly optimise ad serving.

https://medium.com/unruly-engineering/data-engineering-tech-at-unruly-bb8b4afa2758

http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282

 

Data Warehousing and Building Analytics 

CHSDM knows which objects are the most popular at any given time, how long visitors spend in its galleries, and many other quantitative facts about visitor behavior it was previously unable to understand. The museum is beginning to develop a deep understanding of the ways its visitors behave with the Pen, and this paper will attempt to explain how it can continue to develop the tools necessary to dig even deeper.

https://medium.com/micah-walter-studio/data-warehousing-and-building-analytics-at-cooper-hewitt-smithsonian-design-museum-159ec772905e

http://datadotz.com/datadotz-bigdata-weekly-60/#more-1282

DataDotz DataDotz

Oct 17, 2018, 5:48:48 AM
to datadotz...@googlegroups.com, chenn...@googlegroups.com

Apache Pulsar Tiered Storage with Amazon S3

Apache Pulsar’s tiered storage feature enables Pulsar to offload older messages on a topic to a long-term storage system, freeing up space in Apache BookKeeper and taking advantage of scalable, low-cost storage options such as cloud storage. Tiered storage is valuable for a topic whose data you want to retain for a very long time. For example, if you have a topic containing user actions that you use to train your recommendation systems, you may want to keep that data for a long time so that if you create a new version of your recommendation model you can rerun it against the full user history.

https://streaml.io/blog/configuring-apache-pulsar-tiered-storage-with-amazon-s3

http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350

Machine learning & Kafka KSQL stream processing 

KSQL is an open source, streaming SQL engine that enables real-time data processing against Apache Kafka. KSQL supports creating User Defined Scalar Functions (UDFs) via custom jars that are uploaded to the ext/ directory of the KSQL installation. That is, my compiled anomaly score function can be exposed to the KSQL server and executed against the Kafka stream.
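
The UDF itself is written in Java, but once the jar is in ext/ it can be used from any KSQL statement. As a hedged illustration, the sketch below submits such a statement to the KSQL server's REST endpoint; the ANOMALY_SCORE function, stream names, and columns are hypothetical placeholders for the article's custom UDF.

```python
import requests

statement = """
CREATE STREAM scored_readings AS
  SELECT sensor_id, ANOMALY_SCORE(reading) AS score
  FROM sensor_readings;
"""

# Default KSQL server REST endpoint on port 8088.
resp = requests.post(
    "http://localhost:8088/ksql",
    json={"ksql": statement, "streamsProperties": {}},
)
print(resp.status_code, resp.json())
```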

https://medium.com/@simon.aubury/machine-learning-kafka-ksql-stream-processing-bug-me-when-ive-left-the-heater-on-bd47540cd1e8

http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350

Accelerating Kafka in AWS

Using Kafka for building real-time data pipelines and now experiencing growing pains at scale? Then no doubt, like Branch, you are also running into issues that directly impact your business's continued ability to provide first-class service. While your own Kafka implementation may vary, we hope you can leverage our learnings and ultimate success in optimizing Kafka to effectively scale your data processing and support your growing business.

https://medium.com/branch-engineering/optimizing-kafka-hardware-selection-within-aws-amazon-web-services-69fb4b10198

http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350

Database Migration at Scale

If the database contains a fairly small data set, then it may be possible to start and end the procedure within a declared maintenance window. By doing so, you guarantee consistency, since the data is not being changed by your servers. However, it's not a valid solution if it violates the company's SLA (service-level agreement) for the product's availability. For example, at CallApp, we very rarely declare a downtime window.

https://medium.com/callapp/database-migration-at-scale-ae85c14c3621

http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350

Benchmarking Google BigQuery at Scale

It is the Core Data Services team that provides this data to our stakeholders across the business. Historically, we have served data using a petabyte-scale on-premises data warehouse built on top of the Hadoop ecosystem. This came with a number of challenges, ranging from operational overheads and the scalability of applications and services running on the platform, all the way through to there being multiple SQL frameworks for users to access data based on their particular use case, rather than a convenient “Single Pane of Glass” solution.

https://medium.com/@TechKing/benchmarking-google-bigquery-at-scale-13e1e85f3bec

http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350

How Apache Flink Manages Kafka Consumer Offsets

The Kafka consumer in Apache Flink integrates with Flink’s checkpointing mechanism as a stateful operator whose state is the read offsets in all Kafka partitions. When a checkpoint is triggered, the offsets for each partition are stored in the checkpoint. Flink’s checkpoint mechanism ensures that the stored states of all operator tasks are consistent, i.e., they are based on the same input data. A checkpoint is complete when all operator tasks have successfully stored their state. Hence, the system provides exactly-once state update guarantees when restarting from potential system failures.
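
A minimal PyFlink sketch of the behaviour described above: with checkpointing enabled, the Kafka consumer's offsets become part of each checkpoint. The topic, group id, and interval are placeholders, the job assumes the Kafka connector jar is on the classpath, and the original article predates the Python DataStream API, so treat this only as an approximation.

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(5000)  # checkpoint every 5 s; offsets are stored with each checkpoint

consumer = FlinkKafkaConsumer(
    topics="events",
    deserialization_schema=SimpleStringSchema(),
    properties={"bootstrap.servers": "localhost:9092", "group.id": "flink-offsets-demo"},
)

env.add_source(consumer).print()
env.execute("kafka-offsets-in-checkpoints")
```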

https://data-artisans.com/blog/how-apache-flink-manages-kafka-consumer-offsets

http://datadotz.com/datadotz-bigdata-weekly-63/#more-1350

 

 

DataDotz DataDotz

Oct 24, 2018, 6:06:52 AM
to chenn...@googlegroups.com


History of High Availability

In the days of yore, databases ran on single machines. There was only one node and it handled all reads and all writes. There was no such thing as a “partial failure”; the database was either up or down. Total failure of a single database was a two-fold problem for the internet: first, computers were being accessed around the clock, so downtime was more likely to directly impact users; second, being under constant demand made computers more likely to fail. The obvious solution to this problem is to have more than one computer that can handle requests, and this is where the story of distributed databases truly begins.

https://www.cockroachlabs.com/blog/brief-history-high-availability/

http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353

Production PySpark Jobs

Data processing, insights, and analytics are at the heart of Addictive Mobility, a division of Pelmorex Corp. We take pride in our data expertise and proprietary technology to offer mobile advertising solutions to our clients. Our primary product offering is a Demand Side Platform (DSP). We offer technology that allows ad spaces to be bid on in real time and delivers qualified ads to customers.

https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e

http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353

Deep Learning Model On Kubernetes With Python, Keras, Flask, and Docker

Kubernetes, and its broader new buzzword, cloud-native, are taking the world by storm. Don’t worry — you are right to be skeptical. We’ve all seen the tech hype bubble turn into a veritable tsunami of jargon with AI, Big Data, and The Cloud. It’s yet to be seen if the same will happen with Kubernetes. But with your data science dilettante guiding you today, I have no understanding of, nor interest in, the transformative reasons to use Kubernetes. My motivation is simple: I want to deploy, scale, and manage a REST API that serves up predictions. As you will see, Kubernetes makes this exceedingly easy.
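
The heart of that setup is a small Flask app that loads a Keras model and exposes a prediction route; here is a minimal sketch (model path, input shape, and route names are assumptions, not the article's exact code), which you would then containerize and hand to Kubernetes.

```python
import numpy as np
from flask import Flask, jsonify, request
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("model.h5")  # hypothetical saved Keras model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [0.1, 0.2, ...]}
    features = np.array(request.get_json()["features"]).reshape(1, -1)
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```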

https://medium.com/analytics-vidhya/deploy-your-first-deep-learning-model-on-kubernetes-with-python-keras-flask-and-docker-575dc07d9e76

http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353

Kubernetes to productionize Spark ML faster

Machine learning (ML) models are nothing new to us. We’ve used them to power products such as our used car valuations and Price Indicators. The processes involved in training and serving these models are complex, involving code written by both data scientists and developers in a variety of languages including Python, Java and R. Serving predictions from these models in real time typically involves a custom application to extract the coefficients from the saved model, and apply them to an input request.
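
As a rough sketch of the "extract the coefficients and apply them to an input request" step, assuming a saved linear regression model (the actual model type and path used in the article are not stated):

```python
import numpy as np
from pyspark.ml.regression import LinearRegressionModel

# Hypothetical path to a model saved by the training pipeline
# (loading requires an active SparkSession).
model = LinearRegressionModel.load("s3://models/days-to-sell/linreg")

coefficients = model.coefficients.toArray()  # weights as a NumPy array
intercept = model.intercept

def predict(features):
    # Apply the extracted coefficients to a single request, outside of Spark.
    return float(np.dot(coefficients, features) + intercept)
```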

https://engineering.autotrader.co.uk/2018/10/03/productionizing-days-to-sell.html

http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353

Lenses SQL for your Intrusion Detection System

During the last decade, applications dealing with high data throughput were limited to near-real-time operation. Take Network Intrusion Detection Systems (NIDS), for example: a crucial tool in network security, whatever your definition of security is. Until a few years ago, such systems required expensive proprietary hardware solutions, tied to the hardware vendor's (often poor) software tools.

https://www.landoop.com/blog/2018/10/lenses-sql-for-your-intrustion-detection-system/

http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353

Kubeflow on Amazon EKS

The Kubeflow project is designed to simplify the deployment of machine learning projects like TensorFlow on Kubernetes. There are also plans to add support for additional frameworks such as MXNet, Pytorch, Chainer, and more. These frameworks can leverage GPUs in the Kubernetes cluster for machine learning tasks.

https://aws.amazon.com/blogs/opensource/kubeflow-amazon-eks/

http://datadotz.com/datadotz-bigdata-weekly-64/#more-1353

DataDotz DataDotz

Nov 2, 2018, 1:02:15 AM
to chenn...@googlegroups.com



Physical storage of your database tables might matter

In our quest to simplify and enrich online grocery shopping for our users, we experimented with serving personalized item recommendations to each one of them. For this we operated in batch mode, pre-computing the top 200 relevant item recommendations for each user and dumping the results into a table of our OLTP PostgreSQL database so we could serve these recommendations in real time. Queries on this table were taking too long, which resulted in a bad user experience.

https://lambda.grofers.com/why-physical-storage-of-your-database-tables-might-matter-74b563d664d9

http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360

Spark core concepts visualized

Learning Spark is not an easy thing for a person with little background knowledge of distributed systems. Even though I have been using Spark for quite some time, I find it time-consuming to get a comprehensive grasp of all the core concepts in Spark. The official Spark documentation provides a very detailed explanation, yet it focuses more on the practical programming side. Also, tons of online tutorials can be overwhelming to a starter. Therefore, in this article I would like to note down those Spark core concepts, but in a more visualized way. I hope you will find it useful as well.

https://medium.com/@pang.xin/spark-study-notes-core-concepts-visualized-5256c44e4090

http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360

Adobe Experience Platform Pipeline with Kafka

The aggregation, enrichment, and near-real-time analysis of high-frequency data present a whole new set of challenges. Particularly when the goal is to take data collected at any experience point, derive insights, and then deliver the next best experience at the speed customers demand, tying multiple messaging systems together is a non-starter.

https://medium.com/adobetech/creating-the-adobe-experience-platform-pipeline-with-kafka-4f1057a11ef

http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360

Amazon Redshift at Equinox Fitness Clubs

Clickstream analysis tools handle their data well, and some even have impressive BI interfaces. However, analyzing clickstream data in isolation comes with many limitations. For example, a customer is interested in a product or service on your website. They go to your physical store to purchase it. The clickstream analyst asks, “What happened after they viewed the product?” and the commerce analyst asks themselves, “What happened before they bought it?”

https://aws.amazon.com/blogs/big-data/closing-the-customer-journey-loop-with-amazon-redshift-at-equinox-fitness-clubs/

http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360

Data Artisans Platform on AWS EKS

In this article, we describe the setup of data Artisans Platform using Amazon Web Services Elastic Container Service for Kubernetes (AWS EKS). data Artisans Platform is built on Kubernetes as the underlying resource manager. Kubernetes is available in all major cloud providers, on-premise and also on single machines.

https://data-artisans.com/blog/how-to-get-started-with-data-artisans-platform-on-aws-eks

http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360

Event Processing Design Patterns With Pulsar Functions

Let’s review how we would go about implementing content-based routing using Apache Pulsar Functions. Content-based routing is an integration pattern that has been around for years and is commonly used in event hubs and messaging frameworks. The basic idea is that the content of each message is inspected and the message is then routed to various destinations based upon values found (or not found) in the content.
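
A minimal content-based router written as a Pulsar Function in Python might look like the sketch below; the topic names and the "error" check are invented placeholders for whatever routing rule you actually need.

```python
from pulsar import Function

class ContentRouter(Function):
    def process(self, input, context):
        # Inspect the message content and publish it to a topic chosen by a rule.
        if "error" in input:
            context.publish("persistent://public/default/alerts", input)
        else:
            context.publish("persistent://public/default/events-clean", input)
        return None
```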

https://streaml.io/blog/eda-event-processing-design-patterns-with-pulsar-functions

http://datadotz.com/datadotz-bigdata-weekly-65/#more-1360

 


DataDotz DataDotz

Nov 22, 2018, 2:32:22 AM
to chenn...@googlegroups.com


Replication Guide On HDFS and Amazon Web Services

Hortonworks’ Data Lifecycle Manager (DLM), an extensible service built on the Hortonworks DataPlane Platform (DPS), provides a complete solution to replicate HDFS and Hive data, metadata, and security policies between on-premises clusters and Amazon S3. This data movement enables data science and ML workloads to run models in Amazon SageMaker and bring the resulting data back on-premises. To support this use case, the post walks through the three steps for replication from on-premises HDFS to the AWS cloud.

https://hortonworks.com/blog/a-step-by-step-replication-guide-between-on-prem-hdfs-and-amazon-web-services/

http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370


Azure Source

This is an update to the Azure Sphere Operating System, Azure Sphere Security Service, and Visual Studio development environment. This release includes substantial investments in our security infrastructure and our connectivity solutions, and it incorporates some of your feedback. Azure Sphere, which is in public preview, is a secured, high-level application platform with built-in communication and security features for internet-connected devices. It comprises an Azure Sphere microcontroller unit (MCU), tools and an SDK for developing applications, and the Azure Sphere Security Service, through which applications can securely connect to the cloud and web.

https://azure.microsoft.com/en-us/blog/azure-source-volume-58/

http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370

Kafka Connect Deep Dive: Converters and Serialization

Kafka Connect is part of Apache Kafka, providing streaming integration between data stores and Kafka. For data engineers, it just requires JSON configuration files to use. There are connectors for common (and not-so-common) data stores out there already, including JDBC, Elasticsearch, IBM MQ, S3, and BigQuery, to name but a few. For developers, Kafka Connect has a rich API with which additional connectors can be developed if required. In addition, it also has a REST API for configuration and management of connectors.
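
Those JSON configuration files are typically submitted to Connect's REST API; here is a minimal sketch using the Confluent JDBC source connector as an example (connection details, topic prefix, and names are placeholders).

```python
import requests

connector = {
    "name": "example-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db:5432/shop",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "pg-",
        # The converter decides how records are serialized onto the Kafka topic.
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}

# Kafka Connect workers expose their REST API on port 8083 by default.
resp = requests.post("http://localhost:8083/connectors", json=connector)
print(resp.status_code, resp.json())
```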

https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained

http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370

 

Druid enables analytics at Airbnb

Airbnb serves millions of guests and hosts in our community. Every second, their activities on Airbnb.com, such as searching, booking, and messaging, generate a huge amount of data we anonymize and use to improve the community’s experience on our platform. The Data Platform Team at Airbnb strives to leverage this data to improve our customers’ experiences and optimize Airbnb’s business. Our mission is to provide infrastructure to collect, organize, and process this deluge of data (all in privacy-safe ways), and to empower various organizations across Airbnb to derive the analytics they need and make data-informed decisions from it.

https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c

http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370


Higher-Order Functions for Complex Data Types in Apache Spark 2.4

Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (for example, array type), including higher-order functions. Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying some functions, and then creating the structure again; or 2) building a User Defined Function (UDF). In contrast, the new built-in functions can directly manipulate complex types, and the higher-order functions can manipulate complex values with an anonymous lambda function, similar to UDFs but with much better performance.
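
For example, a short PySpark sketch (with made-up data) showing what becomes possible: transform, filter, and aggregate operate on the array column directly via SQL expressions, with no explode and no UDF.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, [1, 2, 3, 4])], ["id", "values"])

result = df.select(
    "id",
    expr("transform(values, x -> x * 10) AS scaled"),            # map over the array
    expr("filter(values, x -> x % 2 = 0) AS evens"),              # keep even elements
    expr("aggregate(values, 0, (acc, x) -> acc + x) AS total"),   # fold to a sum
)
result.show()
```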

https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html

http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370


Real-Time Analytics with Pulsar Functions

For many event-driven applications, how quickly data can be processed, understood, and reacted to is paramount. In analytics and data processing for those scenarios, calculating a precise value may be time-consuming or impractically resource-intensive. In those cases, having an approximate answer within a given time frame is much more valuable than waiting to calculate an exact answer.

https://streaml.io/blog/eda-real-time-analytics-with-pulsar-functions

http://datadotz.com/datadotz-bigdata-weekly-66/#more-1370

 

