Data Management Strategies for Computer Vision
Computer vision (CV) developers often find that the biggest barrier to success is data management, and yet so much of what you'll find about CV is about the algorithms, not the data. In this blog, I'll describe three separate data management strategies I've used with applications that process images. Through the anecdotes of my experiences, you'll learn about several functions that data platforms provide for CV. The main event here is a discussion of how video can be transported through MapR-ES (MapR's reimplementation of Apache Kafka) and how Docker can be used to elastically scale video processors for face detection.
https://mapr.com/blog/data-management-strategies-for-computer-vision/
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
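Below is a minimal sketch of the pattern the post describes, assuming a hypothetical `video-frames` topic carrying JPEG-encoded frames and standard Kafka clients (MapR-ES exposes the Kafka API, so kafka-python can speak to it); the face detector is OpenCV's stock Haar cascade:

```python
import cv2
import numpy as np
from kafka import KafkaConsumer

# Hypothetical topic of JPEG-encoded frames; the consumer group lets
# Docker replicas of this worker share the partitions between them.
consumer = KafkaConsumer(
    "video-frames",
    bootstrap_servers="localhost:9092",
    group_id="face-detectors",
)

# OpenCV ships a pretrained Haar cascade for frontal faces
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

for message in consumer:
    # Decode the JPEG bytes back into an image matrix
    frame = cv2.imdecode(np.frombuffer(message.value, dtype=np.uint8), cv2.IMREAD_COLOR)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"offset {message.offset}: {len(faces)} face(s) detected")
```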
Big Data Analytics and Machine Learning with PTC and Hortonworks
Today, PTC and Hortonworks announce a strategic partnership to “fast-forward” the realization of Industry 4.0 benefits, including improved manufacturing quality and yield, enhanced asset and plant uptime, and optimized production flexibility and throughput. This collaboration is directed at a state-of-the-art solution composed of complementary offerings from Hortonworks.
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Embed interactive dashboards in your application with Amazon QuickSight
Embedded Amazon QuickSight dashboards allow you to utilize Amazon QuickSight’s serverless architecture and easily scale your insights with your growing user base, while ensuring you only pay for usage with Amazon QuickSight’s unique pay-per-session pricing model.
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Scale your Amazon Redshift clusters
Amazon Redshift is the cloud data warehouse of choice for organizations of all sizes—from fast-growing technology companies such as Turo and Yelp to Fortune 500 companies such as 21st Century Fox and Johnson & Johnson. With quickly expanding use cases, data sizes, and analyst populations, these customers have a critical need for scalable data warehouses. Since we launched Amazon Redshift, our customers have grown with us.
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Stitch & Mobile Webinar Questions & Replay
How do you test MongoDB Stitch functions, how do you store Stitch triggers, and what services can you integrate Stitch with? These were some of the great questions asked and answered in my recent webinar; you can watch the replay of “MongoDB Mobile and MongoDB Stitch”. For those new to MongoDB Stitch, it’s the serverless platform from MongoDB that isolates complexity and ‘plumbing’ so you can build applications faster.
https://www.mongodb.com/blog/post/stitch--mobile-webinar-questions--replay
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Apache Avro as a Built-in Data Source in Apache Spark 2.4
Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystems, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module originated from Databricks’ open source project Avro Data Source for Apache Spark (referred to as spark-avro from now on).
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
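For a feel of the new API, here is a minimal read/write round trip in PySpark. The paths are illustrative, and in 2.4 the module still ships separately, so the job must be launched with `--packages org.apache.spark:spark-avro_2.11:2.4.0` (or the Scala 2.12 build):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-demo").getOrCreate()

# Read Avro files into a DataFrame using the built-in "avro" format name
df = spark.read.format("avro").load("/data/events.avro")
df.printSchema()

# Write the DataFrame back out as Avro
df.write.format("avro").save("/data/events-copy")
```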
Kafka Distributed Message System
Kafka is a message system. Let us understand more about message systems and the problems they solve. Take the currently popular microservice architecture as an example. Let's assume that there are three terminal-facing (WeChat official account, mobile app, and browser) web services (HTTP protocol) at the web end, namely Web1, Web2, and Web3, and three internal application services, App1, App2, and App3.
https://www.alibabacloud.com/blog/an-overview-of-kafka-distributed-message-system_594218
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
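As a concrete illustration of the decoupling the article describes, here is a minimal kafka-python producer/consumer pair; the topic name, servers, and group id are illustrative:

```python
from kafka import KafkaProducer, KafkaConsumer

# Web1/Web2/Web3 would publish events like this instead of calling
# App1-App3 directly, so producers and consumers stay decoupled.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-events", key=b"web1", value=b'{"action": "signup"}')
producer.flush()

# App1/App2/App3 each consume at their own pace; independent consumer
# groups each receive the full event stream.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="app1",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.key, record.value)
```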
Cache warming: Agility for a stateful service
EVCache has been a fundamental part of the Netflix platform (we call it Tier-1), holding petabytes of data. Our caching layer serves multiple use cases, from signup and personalization to search, playback, and more. It comprises thousands of nodes in production and hundreds of clusters, all of which must routinely scale up due to the continued growth of our member base. To address the high demand on our caching layer, we recently discussed the Evolution of Application Data Caching: From RAM to SSD.
https://medium.com/netflix-techblog/cache-warming-agility-for-a-stateful-service-2d3b1da82642
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
Time Series at ShiftLeft
Time series are a major component of the ShiftLeft runtime experience. This is true for many other products and organizations too, but each case involves different characteristics and requirements. This post describes the requirements we have to work with, how we use TimescaleDB to store and retrieve time series data, and the tooling we've developed to manage our infrastructure.
https://blog.shiftleft.io/time-series-at-shiftleft-e1f98196909b
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
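A small sketch of the TimescaleDB pattern mentioned above, via psycopg2; the connection string and `metrics` table are illustrative, while `create_hypertable` and `time_bucket` are TimescaleDB's actual primitives:

```python
import psycopg2

conn = psycopg2.connect("dbname=tsdb user=postgres host=localhost")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        time  TIMESTAMPTZ NOT NULL,
        name  TEXT        NOT NULL,
        value DOUBLE PRECISION
    )
""")
# Promote the plain table to a hypertable partitioned on the time column
cur.execute("SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE)")
conn.commit()

# time_bucket rolls raw points up into fixed-width intervals
cur.execute("""
    SELECT time_bucket('5 minutes', time) AS bucket, name, avg(value)
    FROM metrics
    GROUP BY bucket, name
    ORDER BY bucket
""")
for row in cur.fetchall():
    print(row)
```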
Manage centralized Microsoft Exchange Server logs using Amazon Kinesis
Microsoft Exchange servers store different types of logs. These log types include message tracking, Exchange Web Services (EWS), Internet Information Services (IIS), and application/system event logs. With Exchange servers deployed on a global scale, logs are often scattered in multiple directories that are local to these servers.
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
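A hedged sketch of the shipping side with boto3: an agent on each Exchange server pushes log lines into one central Kinesis stream. The stream name, partition key, and log path are illustrative:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical local message-tracking log on one Exchange server
with open(r"C:\ExchangeLogs\MessageTracking\MSGTRK.LOG") as log_file:
    for line in log_file:
        kinesis.put_record(
            StreamName="exchange-logs",          # one stream aggregates every server
            Data=line.encode("utf-8"),
            PartitionKey="exchange-server-01",   # spreads servers across shards
        )
```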
Building Secure and Governed Microservices with Kafka Streams
With Hortonworks DataFlow (HDF) 3.3 now supporting Kafka Streams, we are truly excited about the possibilities of the applications you can benefit from when it is combined with the rest of our platform. In this post, we will demonstrate how Kafka Streams can be integrated with Schema Registry, Atlas, and Ranger to build a set of microservice apps for a fictitious use case.
https://hortonworks.com/blog/building-secure-and-governed-microservices-with-kafka-streams/
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
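Kafka Streams itself is a Java API, so this is not the post's code; as a rough Python-side illustration of just the Schema Registry piece, here is confluent-kafka's AvroProducer registering and enforcing a made-up schema:

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Illustrative Avro schema; the registry will reject later producers whose
# schema is incompatible with what this topic has already registered.
value_schema = avro.loads("""
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "id",     "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=value_schema,
)

producer.produce(topic="payments", value={"id": "p-1", "amount": 42.0})
producer.flush()
```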
Spark from the trenches
Spark includes a configurable metrics system based on the dropwizard.metrics library. It is set up via the Spark configuration. As we are already heavy users of Graphite and Grafana, we use the provided Graphite sink.
https://medium.com/teads-engineering/spark-from-the-trenches-part-2-f2ff9ab67ea1
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
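A minimal sketch of that setup from PySpark; the same keys normally live in `metrics.properties`, and passing them as `spark.metrics.conf.*` properties is an equivalent route (the Graphite host and port are illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("metrics-demo")
    # Route the dropwizard metrics from all instances to a Graphite sink
    .config("spark.metrics.conf.*.sink.graphite.class",
            "org.apache.spark.metrics.sink.GraphiteSink")
    .config("spark.metrics.conf.*.sink.graphite.host", "graphite.internal")
    .config("spark.metrics.conf.*.sink.graphite.port", "2003")
    .config("spark.metrics.conf.*.sink.graphite.period", "10")    # report every 10
    .config("spark.metrics.conf.*.sink.graphite.unit", "seconds") # ...seconds
    .getOrCreate()
)
```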
Best Practices for Securing Amazon EMR
Amazon EMR is a managed Hadoop framework that you use to process vast amounts of data. One of the reasons that customers choose Amazon EMR is its security features. For example, customers like FINRA in regulated industries such as financial services, and in healthcare, choose Amazon EMR as part of their data strategy. They do so to adhere to strict regulatory requirements from entities such as the Payment Card Industry Data Security Standard (PCI) and the Health Insurance Portability and Accountability Act (HIPAA).
https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
ActiveMQ architecture and key metrics
Apache ActiveMQ is message-oriented middleware (MOM), a category of software that sends messages between applications. Using standards-based, asynchronous communication, ActiveMQ allows loose coupling of the elements in an IT environment, which is often foundational to enterprise messaging and distributed applications.
https://www.datadoghq.com/blog/activemq-architecture-and-metrics/
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
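To make the loose coupling concrete, here is a minimal sketch against ActiveMQ's STOMP port using the stomp.py client; the queue name and credentials are illustrative:

```python
import time
import stomp

class QueueListener(stomp.ConnectionListener):
    def on_message(self, frame):
        print("received:", frame.body)

# 61613 is ActiveMQ's default STOMP port
conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("", QueueListener())
conn.connect("admin", "admin", wait=True)

conn.subscribe(destination="/queue/orders", id=1, ack="auto")
# Asynchronous by design: the sender never waits on, or knows about, the consumer
conn.send(destination="/queue/orders", body="order-123")

time.sleep(2)   # give the broker a moment to deliver
conn.disconnect()
```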
Built-in Image Data Source in Apache Spark 2.4
Apache Spark 2.3 provided the ImageSchema.readImages API (see Microsoft’s post Image Data Support in Apache Spark), which was originally developed in the MMLSpark library. In Apache Spark 2.4, it’s much easier to use because it is now a built-in data source. Using the image data source, you can load images from directories and get a DataFrame with a single image column.
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
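Loading images then takes two lines; the directory path is illustrative, and each row carries a single `image` struct column (origin path, dimensions, raw bytes, and so on):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-demo").getOrCreate()

# The built-in source loads supported image files from a directory
df = spark.read.format("image").load("/data/images")
df.select("image.origin", "image.width", "image.height").show(truncate=False)
```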
Teardown, Rebuild: Migrating from Hive to PySpark
Machine Learning (ML) engineering and software development are both fundamentally about writing correct and robust algorithms. In ML engineering we have the extra difficulty of ensuring mathematical correctness and avoiding the propagation of round-off errors in calculations when working with floating-point representations of numbers.
https://medium.com/@trivagotech/teardown-rebuild-migrating-from-hive-to-pyspark-324176a7ce5
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
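One concrete example of the round-off concern: when validating a Hive-to-PySpark migration, aggregates should be compared with a tolerance rather than exact equality (the numbers below are made up):

```python
import math

hive_result = 1234.5678901   # hypothetical aggregate from the legacy Hive job
spark_result = 1234.5678899  # same aggregate recomputed in PySpark

# Exact equality fails even though both pipelines are arguably correct
assert hive_result != spark_result

# A relative-tolerance comparison treats them as equivalent
assert math.isclose(hive_result, spark_result, rel_tol=1e-9)
```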
Getting Started with Apache Pulsar and Data Collector
Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo, and a top-level Apache project since September 2018. StreamSets Data Collector 3.5.0, released soon after, introduced the Pulsar Consumer and Producer pipeline stages. In this blog entry I'll explain how to get started creating dataflow pipelines for Pulsar.
https://streamsets.com/blog/getting-started-apache-pulsar-streamsets-data-collector/
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
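Separately from the StreamSets pipeline stages the post configures, here is a hedged first-steps sketch with the official pulsar-client package; the service URL, topic, and subscription name are illustrative:

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("my-topic")
producer.send(b"hello pulsar")

# Pulsar tracks acknowledgements per named subscription
consumer = client.subscribe("my-topic", subscription_name="my-sub")
msg = consumer.receive()
print("received:", msg.data())
consumer.acknowledge(msg)

client.close()
```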
Apache Kafka Security | Need and Components of Kafka
The Kafka community added a number of security features in release 0.9.0.0. There is flexibility in how they are used, too: either separately or together, each enhances security in a Kafka cluster.
https://medium.com/@rinu.gour123/apache-kafka-security-need-and-components-of-kafka-52b417d3ca77
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
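From a client's perspective those features look roughly like the confluent-kafka configuration below, combining TLS encryption with SASL/SCRAM authentication; the broker address, certificate path, and credentials are illustrative:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker.internal:9093",
    "security.protocol": "SASL_SSL",             # encrypt in flight and authenticate
    "ssl.ca.location": "/etc/ssl/certs/ca.pem",  # CA used to verify the broker
    "sasl.mechanism": "SCRAM-SHA-256",
    "sasl.username": "app-user",
    "sasl.password": "app-secret",
})
producer.produce("secured-topic", b"hello")
producer.flush()
```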
HyperLogLog in Presto: A significantly faster way to handle cardinality estimation
Computing the count of distinct elements in massive data sets is often necessary but computationally intensive. Say you need to determine the number of distinct people visiting Facebook in the past week using a single machine. Doing this with a traditional SQL query on a data set as massive as the ones we use at Facebook would take days and terabytes of memory. To speed up these queries, we implemented an algorithm called HyperLogLog (HLL) in Presto, a distributed SQL query engine. We have seen great improvements, with some queries now running within minutes, including those used to analyze thousands of A/B tests.
https://code.fb.com/data-infrastructure/hyperloglog/
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
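To give a sense of how HLL surfaces in practice, here is an `approx_distinct` query issued through the presto-python-client; the table, column, and connection details are illustrative:

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.internal", port=8080, user="analyst",
    catalog="hive", schema="web",
)
cur = conn.cursor()

# approx_distinct is the HLL-backed stand-in for COUNT(DISTINCT ...)
cur.execute("""
    SELECT approx_distinct(user_id)
    FROM page_views
    WHERE view_date >= date_add('day', -7, current_date)
""")
print(cur.fetchone())
```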
Kafka Performance Tuning — Ways for Kafka Optimization
There are a few configuration parameters to consider when we talk about Kafka performance tuning. The most important configurations for improving performance are the ones that control the disk flush rate. We can also divide these configurations by component, so let's talk about the producer first.
https://medium.com/@rinu.gour123/kafka-performance-tuning-ways-for-kafka-optimization-fdee5b19505b
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
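A sketch of common producer-side knobs (the broker-side flush settings the article refers to, such as log.flush.interval.messages, live in the broker config instead); the values are illustrative starting points, not recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 20,               # wait briefly so sends batch together
    "batch.num.messages": 10000,   # larger batches amortize request overhead
    "compression.type": "lz4",     # less network/disk at some CPU cost
    "acks": "1",                   # durability vs. latency trade-off
})
for i in range(10000):
    producer.produce("load-test", f"event-{i}".encode())
producer.flush()
```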
New Features of Kafka 2.1
Kafka 2.1 is now available with Java 11 support! Java 11 was released in September 2018, and we get all of its benefits, such as improved SSL and TLS performance (improvements that arrived in Java 9). According to one of the main Kafka committers, it is 2.5 times faster than Java 8.
https://medium.com/@stephane.maarek/new-features-of-kafka-2-1-33fb5396b546
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
Introducing Hive-Kafka integration for real-time Kafka SQL queries
Stream processing engines and libraries like Kafka Streams provide a programmatic stream-processing access pattern for Kafka. Application developers love this access pattern, but when you talk to BI developers, their analytics requirements are quite different, focused on use cases around ad hoc analytics, data exploration, and trend discovery, and they bring a distinct set of requirements for Kafka access.
https://hortonworks.com/blog/introducing-hive-kafka-sql/
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
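The heart of the integration is a Hive DDL that maps a topic onto an external table through the Kafka storage handler; below it is submitted via PyHive, with the topic, brokers, and columns illustrative:

```python
from pyhive import hive

conn = hive.connect(host="hiveserver.internal", port=10000)
cur = conn.cursor()

# The storage handler class is Hive's real Kafka handler; everything
# else here (names, brokers, schema) is a made-up example.
cur.execute("""
    CREATE EXTERNAL TABLE kafka_page_views (
        user_id STRING,
        url     STRING
    )
    STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
    TBLPROPERTIES (
        "kafka.topic" = "page-views",
        "kafka.bootstrap.servers" = "broker1:9092"
    )
""")

# BI users can now point ad hoc SQL at the live topic
cur.execute("SELECT url, count(*) FROM kafka_page_views GROUP BY url")
print(cur.fetchall())
```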
Accelerating Hive Queries with Parquet Vectorization
Apache Hive is a widely adopted data warehouse engine that runs on Apache Hadoop. Features that improve Hive performance can significantly improve the overall utilization of resources on the cluster. Hive processes data using a chain of operators within the Hive execution engine. These operators are scheduled in the various tasks (for example, MapTask, ReduceTask, or SparkTask) of the query execution plan. Traditionally, these operators are designed to process one row at a time.
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
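Vectorized execution is toggled per session; a minimal check through PyHive might look like this (connection details illustrative):

```python
from pyhive import hive

conn = hive.connect(host="hiveserver.internal", port=10000)
cur = conn.cursor()

# Process batches of rows per operator instead of one row at a time
cur.execute("SET hive.vectorized.execution.enabled = true")

cur.execute("EXPLAIN SELECT count(*) FROM sales WHERE amount > 100")
for (line,) in cur.fetchall():
    print(line)   # look for "Execution mode: vectorized" in the plan
```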
Apache Spark — Tips and Tricks for better performance
Apache Spark is quickly gaining steam both in the headlines and in real-world adoption. Top use cases are streaming data, machine learning, interactive analysis, and more. Many well-known companies use it, including Uber and Pinterest. So after working with Spark for more than three years in production, I'm happy to share my tips and tricks for better performance.
https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11