Recent Hot Picks in the News/Blogs

DataDotz DataDotz

Jan 23, 2019, 2:16:54 AM
to chenn...@googlegroups.com

Getting Your Feet Wet with Stream Processing

When you create a stream processing application with Kafka’s Streams API, you create a Topology either using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running Kafka cluster and making sure to clean up state between tests adds a lot of complexity and time.

https://www.confluent.io/blog/stream-processing-part-2-testing-your-streaming-application

http://datadotz.com/datadotz-bigdata-weekly-71/#more-1408

Hive vs Impala Schema Loading Case: Reading Parquet Files

Quite often in big data, a scenario arises where raw data is processed in Spark and then needs to be made available to the analytics team. A standard solution is to write the processed data from the Spark application as Parquet files in HDFS and then point a Hive/Impala table at that data, against which the analytics team can fire SQL-like queries.
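
For illustration, a minimal PySpark sketch of that handoff is below; the paths, database, table, and column names are made up for the example, and Impala would still need an INVALIDATE METADATA / REFRESH before it sees the new table or files.

from pyspark.sql import SparkSession

# Hypothetical job: process raw data, land it as Parquet, expose it via the Hive metastore.
spark = (SparkSession.builder
         .appName("parquet-handoff")
         .enableHiveSupport()   # required so spark.sql() can create tables in the metastore
         .getOrCreate())

processed = spark.read.json("/raw/events").filter("status = 'ok'")

# Write the processed data as Parquet files in HDFS.
processed.write.mode("overwrite").parquet("/warehouse/events_parquet")

# Point an external Hive table at the Parquet directory so analysts can query it with SQL.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events_parquet (
        id STRING, status STRING, amount DOUBLE
    )
    STORED AS PARQUET
    LOCATION '/warehouse/events_parquet'
""")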

https://medium.com/@kartik.gupta_56068/hive-vs-impala-schema-loading-case-reading-parquet-files-acd0280c2cb3

http://datadotz.com/datadotz-bigdata-weekly-71/#more-1408

Building A Scalable Interactive Analytics Backend

According to a study by Gartner, diverse organizations perform 12% better than non-diverse ones, with more innovation and better financial returns. Eightfold.ai offers a Talent Diversity solution to our customers to track and analyze their diversity goals and check for any existing bias in the hiring process across different steps such as recruiter screening, hiring manager screening, onsite interviews, etc.

https://medium.com/@eightfold/building-a-scalable-interactive-analytics-backend-aebeb79ee0c8

http://datadotz.com/datadotz-bigdata-weekly-71/#more-1408

Elasticsearch Distributed Consistency Principles Analysis (3) — Data

The previous two articles described the composition of ES clusters, the master election algorithm, and the master meta update process, and analyzed the consistency issues of the election and meta update. This article analyzes the data flow in ES, including its write process, the PacificA algorithm model, SequenceNumber, and Checkpoint, and compares the similarities and differences between the ES implementation and the standard PacificA algorithm.

https://medium.com/@Alibaba_Cloud/elasticsearch-distributed-consistency-principles-analysis-3-data-a98cc436bc6b

http://datadotz.com/datadotz-bigdata-weekly-71/#more-1408

Amazon Managed Streaming For Kafka (MSK) With Apache Spark On Qubole

AWS recently announced Managed Streaming for Kafka (MSK) at AWS re:Invent 2018. Apache Kafka is one of the most popular open source streaming message queues. Kafka provides a high-throughput, low-latency technology for handling data streaming in real time. MSK allows developers to spin up Kafka as a managed service and offload operational overhead to AWS.
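
As a rough sketch (not taken from the post), a PySpark Structured Streaming job consuming from an MSK topic might look like this; the broker address and topic are placeholders, and the spark-sql-kafka connector package must be available to the job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("msk-consumer").getOrCreate()

# Hypothetical MSK bootstrap broker and topic name.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "b-1.example-msk.amazonaws.com:9092")
          .option("subscribe", "clickstream")
          .load())

# Kafka records arrive as binary key/value columns; cast them to strings before use.
events = stream.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Write to the console just to verify the pipeline end to end.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()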

https://www.qubole.com/blog/amazon-managed-streaming-for-kafka/

http://datadotz.com/datadotz-bigdata-weekly-71/#more-1408

Deploy production-grade Spark to Kubernetes in minutes

In December 2018 we released the public beta of Pipeline and introduced a piece of Banzai Cloud terminology: spotguides. We have already gone deep into what spotguides are and how they supercharge Kubernetes deployments of application frameworks (automated deployments, preconfigured GitHub repositories, CI/CD, job-specific automated cluster sizing, Vault-based secret management, etc.). This post is focused on one specific spotguide: Spark with HistoryServer.

https://banzaicloud.com/blog/spotguides-spark/

http://datadotz.com/datadotz-bigdata-weekly-71/#more-1408

 



DataDotz DataDotz

Jan 30, 2019, 5:13:54 AM
to chenn...@googlegroups.com


Using Docker and Pyspark

Pyspark can be a bit difficult to get up and running on your machine. Docker is a quick and easy way to get a Spark environment working on your local machine, and it is how I run Pyspark on mine. I'll start by giving an introduction to Docker. According to Wikipedia, "Docker is a computer program that performs operating-system-level virtualization, also known as 'containerization'". To greatly simplify, Docker creates a walled-off Linux operating system, called a container, to run software on top of your machine's OS.
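
A quick smoke test of that setup, assuming a container started from a stock PySpark image (the image name and snippet are illustrative, not from the post):

# Start a container with a prebuilt PySpark image, for example:
#   docker run -it jupyter/pyspark-notebook
# Inside the container, the following verifies that Spark actually works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("docker-smoke-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
print(df.count())   # expect 2
spark.stop()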

https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867

http://datadotz.com/datadotz-bigdata-weekly-73/#more-1421

Deploying Logstash pipelines to Kubernetes

Towards the end of 2018 I started to wrap up the things I'd been learning and decided to put some structure into my learning for 2019. 2018 had been an interesting year: I'd moved jobs three times and felt like my learning was all over the place. One day I was learning Scala and the next I was learning Hadoop. Looking back, I felt like I didn't gain much ground.

https://towardsdatascience.com/the-basics-of-deploying-logstash-pipelines-to-kubernetes-94a470ad34d9

http://datadotz.com/datadotz-bigdata-weekly-73/#more-1421

Spark Streaming or Kafka Streams or Alpakka Kafka?

Recently we needed to choose a stream processing framework for processing CDC events on Kafka. The CDC events were produced by a legacy system and the resulting state would be persisted in a Neo4j graph database. We had to choose between Spark Streaming, Kafka Streams, and Alpakka Kafka. While we chose Alpakka Kafka over Spark Streaming and Kafka Streams in this particular situation, the comparison we did should be useful to guide anyone choosing a stream processing framework.

https://medium.com/@unmeshvjoshi/choosing-a-stream-processing-framework-spark-streaming-or-kafka-streams-or-alpakka-kafka-fa0422229b25

http://datadotz.com/datadotz-bigdata-weekly-73/#more-1421

Joy and Pain of using Google BigTable

Last year, I wrote about Ravelin's use of, and displeasure with, DynamoDB. After some time battling that database we decided to put it aside and pick up a new battle: Google Bigtable. We have now had a year and a half of using Bigtable and have learned a lot along the way. We have been very impressed by Bigtable: it can absorb almost any load we throw at it, but it isn't without its own eccentricities and issues. Today I want to go through some of those lessons.

https://syslog.ravelin.com/the-joy-and-pain-of-using-google-bigtable-4210604c75be

http://datadotz.com/datadotz-bigdata-weekly-73/#more-1421

Optimising Spark RDD Pipelines

Every day, in THRON, we collect and process millions of events regarding user-content interaction. We do so because we enrich user and content datasets, analyse the time series, extract behaviour patterns, and ultimately infer user interest and content characteristics from them; this fuels lots of different benefits such as recommendations, digital content ROI calculation, predictions, and many more.

https://medium.com/thron-tech/optimising-spark-rdd-pipelines-679b41362a8a

http://datadotz.com/datadotz-bigdata-weekly-73/#more-1421

Serverless Data Lake on AWS

In this post, we talk about designing a cloud-native data warehouse as a replacement for our legacy data warehouse built on a relational database. At the beginning of the design process, the simplest solution appeared to be a straightforward lift-and-shift migration from one relational database to another.

https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws/

http://datadotz.com/datadotz-bigdata-weekly-73/#more-1421

 

 

DataDotz DataDotz

Feb 7, 2019, 5:33:48 AM
to chenn...@googlegroups.com


Scalability Improvement of Apache Impala 2.12.0 in CDH 5.15.0

Apache Impala is a massively-parallel SQL execution engine, allowing users to run complex queries on large data sets with interactive query response times. An Impala cluster is usually comprised of tens to hundreds of nodes, with an Impala daemon (Impalad) running on each node. Communication between the Impala daemons happens through remote procedure calls (RPCs) using the Apache Thrift library. When processing queries, Impala daemons frequently have to exchange large volumes of data with all other nodes in the cluster, for example during a partitioned hash join.

https://blog.cloudera.com/blog/2019/01/scalability-improvement-of-apache-impala-2-12-0-in-cdh-5-15-0/

http://datadotz.com/datadotz-bigdata-weekly-74/#more-1427

Uber’s GPU-Powered Open Source, Real-time Analytics Engine

At Uber, real-time analytics allow us to attain business insights and operational efficiency, enabling us to make data-driven decisions to improve experiences on the Uber platform. For example, our operations team relies on data to monitor the market health and spot potential issues on our platform; software powered by machine learning models leverages data to predict rider supply and driver demand; and data scientists use data to improve machine learning models for better forecasting.

https://eng.uber.com/aresdb/

http://datadotz.com/datadotz-bigdata-weekly-74/#more-1427

Spark surprises for the uninitiated

He started by adding a monotonically increasing ID column to the DataFrame. Spark has a built-in function for this, monotonically_increasing_id — you can find how to use it in the docs. His idea was pretty simple: after creating a new column with this increasing ID, he would select a subset of the initial DataFrame and then do an anti-join with the initial one to find the complement.
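
The pattern looks roughly like the sketch below (toy data, not the author's code); the article then explains why the way these IDs are assigned at evaluation time can make the result surprising:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("anti-join-demo").getOrCreate()

# Tag every row with a monotonically increasing (but not consecutive) ID.
df = spark.range(10).withColumn("row_id", F.monotonically_increasing_id())

# Take some subset, then anti-join on the generated ID to get the complement.
subset = df.sample(fraction=0.5, seed=42)
complement = df.join(subset, on="row_id", how="left_anti")
complement.show()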

https://blog.godatadriven.com/spark-beware

http://datadotz.com/datadotz-bigdata-weekly-74/#more-1427

40x faster hash joiner with vectorized execution

For the past four months, I’ve been working with the incredible SQL Execution team at Cockroach Labs as a backend engineering intern to develop the first prototype of a batched, column-at-a-time execution engine. During this time, I implemented a column-at-a-time hash join operator that outperformed CockroachDB’s existing row-at-a-time hash join by 40x.
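
As a toy Python illustration of the column-at-a-time idea (CockroachDB's real operator works on typed columnar batches in Go and is far more involved), a hash join over whole key/value columns can be sketched like this:

from collections import defaultdict

def columnar_hash_join(build_keys, build_vals, probe_keys, probe_vals):
    # Build phase: hash the entire build-side key column in one pass.
    table = defaultdict(list)
    for i, key in enumerate(build_keys):
        table[key].append(i)

    # Probe phase: scan the probe-side key column and emit matching output columns.
    out_build, out_probe = [], []
    for j, key in enumerate(probe_keys):
        for i in table.get(key, ()):
            out_build.append(build_vals[i])
            out_probe.append(probe_vals[j])
    return out_build, out_probe

print(columnar_hash_join([1, 2, 3], ["a", "b", "c"], [2, 3, 3], ["x", "y", "z"]))
# -> (['b', 'c', 'c'], ['x', 'y', 'z'])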

https://www.cockroachlabs.com/blog/vectorized-hash-joiner/

http://datadotz.com/datadotz-bigdata-weekly-74/#more-1427

Finding Kafka’s throughput limit in Dropbox infrastructure

Apache Kafka is a popular solution for distributed streaming and queuing for large amounts of data. It is widely adopted in the technology industry, and Dropbox is no exception. Kafka plays an important role in the data fabric of many of our critical distributed systems: data analytics, machine learning, monitoring, search, and stream processing (Cape), to name a few.

https://blogs.dropbox.com/tech/2019/01/finding-kafkas-throughput-limit-in-dropbox-infrastructure/

http://datadotz.com/datadotz-bigdata-weekly-74/#more-1427

Google BigQuery's Python SDK: Creating Tables Programmatically

GCP is on the rise, and it's getting harder and harder to have conversations around data without addressing the 500-pound gorilla in the room: Google BigQuery. With most enterprises comfortably settled into their Apache-based Big Data stacks, BigQuery rattles the cages of convention for many. Luckily, Hackers And Slackers is no such enterprise. Thus, we aren't afraid to ask the Big question: how much easier would life be with BigQuery?
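
For context, creating a table with the BigQuery Python SDK looks roughly like the snippet below; the project, dataset, and schema are placeholders, and the dataset is assumed to exist already:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project ID

schema = [
    bigquery.SchemaField("user_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

# Fully qualified table ID: project.dataset.table (the dataset must already exist).
table = bigquery.Table("my-project.analytics.events", schema=schema)
table = client.create_table(table)   # issues the API request
print("Created table {}".format(table.full_table_id))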

https://hackersandslackers.com/getting-started-google-big-query-python/

http://datadotz.com/datadotz-bigdata-weekly-74/#more-1427
