Data Management Strategies for Computer Vision
Computer vision (CV) developers often find that the biggest barrier to success is data management, and yet so much of what you'll find about CV is about the algorithms, not the data. In this blog, I'll describe three separate data management strategies I've used with applications that process images. Through the anecdotes of my experiences, you'll learn about several functions that data platforms provide for CV. The main event here is a discussion of how video can be transported through MapR-ES (MapR's reimplementation of Apache Kafka) and how Docker can be used to elastically scale video processors for face detection.
https://mapr.com/blog/data-management-strategies-for-computer-vision/
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
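Below is a minimal sketch of the pattern the post describes, assuming a hypothetical `video-frames` topic carrying JPEG-encoded frames and standard Kafka clients (MapR-ES exposes the Kafka API, so kafka-python can speak to it); the face detector is OpenCV's stock Haar cascade:

```python
import cv2
import numpy as np
from kafka import KafkaConsumer

# Hypothetical topic of JPEG-encoded frames; the consumer group lets
# Docker replicas of this worker share the partitions between them.
consumer = KafkaConsumer(
    "video-frames",
    bootstrap_servers="localhost:9092",
    group_id="face-detectors",
)

# OpenCV ships a pretrained Haar cascade for frontal faces
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

for message in consumer:
    # Decode the JPEG bytes back into an image matrix
    frame = cv2.imdecode(np.frombuffer(message.value, dtype=np.uint8), cv2.IMREAD_COLOR)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"offset {message.offset}: {len(faces)} face(s) detected")
```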
Big Data Analytics and Machine Learning with PTC and Hortonworks
Today, PTC and Hortonworks announce a strategic partnership to “fast-forward” the realization of Industry 4.0 benefits, including improved manufacturing quality and yield, enhanced asset and plant uptime, and optimized production flexibility and throughput. This collaboration is directed at a state-of-the-art solution composed of complementary offerings from Hortonworks.
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Embed interactive dashboards in your application with Amazon QuickSight
Embedded Amazon QuickSight dashboards allow you to utilize Amazon QuickSight’s serverless architecture and easily scale your insights with your growing user base, while ensuring you only pay for usage with Amazon QuickSight’s unique pay-per-session pricing model.
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Scale your Amazon Redshift clusters
Amazon Redshift is the cloud data warehouse of choice for organizations of all sizes—from fast-growing technology companies such as Turo and Yelp to Fortune 500 companies such as 21st Century Fox and Johnson & Johnson. With quickly expanding use cases, data sizes, and analyst populations, these customers have a critical need for scalable data warehouses. Since we launched Amazon Redshift, our customers have grown with us.
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Stitch & Mobile Webinar Questions & Replay
How do you test MongoDB Stitch functions, how do you store Stitch triggers, and what services can you integrate Stitch with? These were some of the great questions asked and answered in my recent webinar; you can watch the replay of “MongoDB Mobile and MongoDB Stitch”. For those new to MongoDB Stitch, it’s the serverless platform from MongoDB that isolates complexity and ‘plumbing’ so you can build applications faster.
https://www.mongodb.com/blog/post/stitch--mobile-webinar-questions--replay
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
Apache Avro as a Built-in Data Source in Apache Spark 2.4
Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystems, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module originated from Databricks’ open source project Avro Data Source for Apache Spark (referred to as spark-avro from now on).
http://datadotz.com/datadotz-bigdata-weekly-67/#more-1377
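For a feel of the new API, here is a minimal read/write round trip in PySpark. The paths are illustrative, and in 2.4 the module still ships separately, so the job must be launched with `--packages org.apache.spark:spark-avro_2.11:2.4.0` (or the Scala 2.12 build):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-demo").getOrCreate()

# Read Avro files into a DataFrame using the built-in "avro" format name
df = spark.read.format("avro").load("/data/events.avro")
df.printSchema()

# Write the DataFrame back out as Avro
df.write.format("avro").save("/data/events-copy")
```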
Kafka Distributed Message System
Kafka is a message system. Let us understand more about message systems and the problems they solve. Take the currently popular microservice architecture as an example. Let's assume that there are three terminal-facing (WeChat official account, mobile app, and browser) web services (HTTP protocol) at the web end, namely Web1, Web2, and Web3, and three internal application services, App1, App2, and App3.
https://www.alibabacloud.com/blog/an-overview-of-kafka-distributed-message-system_594218
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
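As a concrete illustration of the decoupling the article describes, here is a minimal kafka-python producer/consumer pair; the topic name, servers, and group id are illustrative:

```python
from kafka import KafkaProducer, KafkaConsumer

# Web1/Web2/Web3 would publish events like this instead of calling
# App1-App3 directly, so producers and consumers stay decoupled.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-events", key=b"web1", value=b'{"action": "signup"}')
producer.flush()

# App1/App2/App3 each consume at their own pace; independent consumer
# groups each receive the full event stream.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="app1",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.key, record.value)
```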
Cache warming: Agility for a stateful service
EVCache has been a fundamental part of the Netflix platform (we call it Tier-1), holding petabytes of data. Our caching layer serves multiple use cases, from signup and personalization to search, playback, and more. It comprises thousands of nodes in production and hundreds of clusters, all of which must routinely scale up due to the continued growth of our member base. To address the high demand on our caching layer, we recently discussed the Evolution of Application Data Caching: From RAM to SSD.
https://medium.com/netflix-techblog/cache-warming-agility-for-a-stateful-service-2d3b1da82642
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
Time Series at ShiftLeft
Time series are a major component of the ShiftLeft runtime experience. This is true for many other products and organizations too, but each case involves different characteristics and requirements. This post describes the requirements we have to work with, how we use TimescaleDB to store and retrieve time series data, and the tooling we've developed to manage our infrastructure.
https://blog.shiftleft.io/time-series-at-shiftleft-e1f98196909b
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
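A small sketch of the TimescaleDB pattern mentioned above, via psycopg2; the connection string and `metrics` table are illustrative, while `create_hypertable` and `time_bucket` are TimescaleDB's actual primitives:

```python
import psycopg2

conn = psycopg2.connect("dbname=tsdb user=postgres host=localhost")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        time  TIMESTAMPTZ NOT NULL,
        name  TEXT        NOT NULL,
        value DOUBLE PRECISION
    )
""")
# Promote the plain table to a hypertable partitioned on the time column
cur.execute("SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE)")
conn.commit()

# time_bucket rolls raw points up into fixed-width intervals
cur.execute("""
    SELECT time_bucket('5 minutes', time) AS bucket, name, avg(value)
    FROM metrics
    GROUP BY bucket, name
    ORDER BY bucket
""")
for row in cur.fetchall():
    print(row)
```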
Manage centralized Microsoft Exchange Server logs using Amazon Kinesis
Microsoft Exchange servers store different types of logs. These log types include message tracking, Exchange Web Services (EWS), Internet Information Services (IIS), and application/system event logs. With Exchange servers deployed on a global scale, logs are often scattered in multiple directories that are local to these servers.
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
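A hedged sketch of the shipping side with boto3: an agent on each Exchange server pushes log lines into one central Kinesis stream. The stream name, partition key, and log path are illustrative:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical local message-tracking log on one Exchange server
with open(r"C:\ExchangeLogs\MessageTracking\MSGTRK.LOG") as log_file:
    for line in log_file:
        kinesis.put_record(
            StreamName="exchange-logs",          # one stream aggregates every server
            Data=line.encode("utf-8"),
            PartitionKey="exchange-server-01",   # spreads servers across shards
        )
```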
Building Secure and Governed Microservices with Kafka Streams
With Hortonworks DataFlow (HDF) 3.3 now supporting Kafka Streams, we are truly excited about the possibilities of the applications you can benefit from when it is combined with the rest of our platform. In this post, we will demonstrate how Kafka Streams can be integrated with Schema Registry, Atlas, and Ranger to build a set of microservice apps for a fictitious use case.
https://hortonworks.com/blog/building-secure-and-governed-microservices-with-kafka-streams/
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
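Kafka Streams itself is a Java API, so this is not the post's code; as a rough Python-side illustration of just the Schema Registry piece, here is confluent-kafka's AvroProducer registering and enforcing a made-up schema:

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Illustrative Avro schema; the registry will reject later producers whose
# schema is incompatible with what this topic has already registered.
value_schema = avro.loads("""
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "id",     "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=value_schema,
)

producer.produce(topic="payments", value={"id": "p-1", "amount": 42.0})
producer.flush()
```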
Spark from the trenches
Spark includes a configurable metrics system based on the dropwizard.metrics library. It is set up via the Spark configuration. As we are already heavy users of Graphite and Grafana, we use the provided Graphite sink.
https://medium.com/teads-engineering/spark-from-the-trenches-part-2-f2ff9ab67ea1
http://datadotz.com/datadotz-bigdata-weekly-68/#more-1387
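A minimal sketch of that setup from PySpark; the same keys normally live in `metrics.properties`, and passing them as `spark.metrics.conf.*` properties is an equivalent route (the Graphite host and port are illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("metrics-demo")
    # Route the dropwizard metrics from all instances to a Graphite sink
    .config("spark.metrics.conf.*.sink.graphite.class",
            "org.apache.spark.metrics.sink.GraphiteSink")
    .config("spark.metrics.conf.*.sink.graphite.host", "graphite.internal")
    .config("spark.metrics.conf.*.sink.graphite.port", "2003")
    .config("spark.metrics.conf.*.sink.graphite.period", "10")    # report every 10
    .config("spark.metrics.conf.*.sink.graphite.unit", "seconds") # ...seconds
    .getOrCreate()
)
```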
Best Practices for Securing Amazon EMR
Amazon EMR is a managed Hadoop framework that you use to process vast amounts of data. One of the reasons that customers choose Amazon EMR is its security features. For example, customers like FINRA in regulated industries such as financial services, and in healthcare, choose Amazon EMR as part of their data strategy. They do so to adhere to strict regulatory requirements from entities such as the Payment Card Industry Data Security Standard (PCI) and the Health Insurance Portability and Accountability Act (HIPAA).
https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
ActiveMQ architecture and key metrics
Apache ActiveMQ is message-oriented middleware (MOM), a category of software that sends messages between applications. Using standards-based, asynchronous communication, ActiveMQ allows loose coupling of the elements in an IT environment, which is often foundational to enterprise messaging and distributed applications.
https://www.datadoghq.com/blog/activemq-architecture-and-metrics/
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
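To make the loose coupling concrete, here is a minimal sketch against ActiveMQ's STOMP port using the stomp.py client; the queue name and credentials are illustrative:

```python
import time
import stomp

class QueueListener(stomp.ConnectionListener):
    def on_message(self, frame):
        print("received:", frame.body)

# 61613 is ActiveMQ's default STOMP port
conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("", QueueListener())
conn.connect("admin", "admin", wait=True)

conn.subscribe(destination="/queue/orders", id=1, ack="auto")
# Asynchronous by design: the sender never waits on, or knows about, the consumer
conn.send(destination="/queue/orders", body="order-123")

time.sleep(2)   # give the broker a moment to deliver
conn.disconnect()
```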
Built-in Image Data Source in Apache Spark 2.4
Apache Spark 2.3 provided the ImageSchema.readImages API (see Microsoft’s post Image Data Support in Apache Spark), which was originally developed in the MMLSpark library. In Apache Spark 2.4, it’s much easier to use because it is now a built-in data source. Using the image data source, you can load images from directories and get a DataFrame with a single image column.
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
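Loading images then takes two lines; the directory path is illustrative, and each row carries a single `image` struct column (origin path, dimensions, raw bytes, and so on):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-demo").getOrCreate()

# The built-in source loads supported image files from a directory
df = spark.read.format("image").load("/data/images")
df.select("image.origin", "image.width", "image.height").show(truncate=False)
```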
Teardown, Rebuild: Migrating from Hive to PySpark
Machine Learning (ML) engineering and software development are both fundamentally about writing correct and robust algorithms. In ML engineering we have the extra difficulty of ensuring mathematical correctness and avoiding the propagation of round-off errors in calculations when working with floating-point representations of numbers.
https://medium.com/@trivagotech/teardown-rebuild-migrating-from-hive-to-pyspark-324176a7ce5
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
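One concrete example of the round-off concern: when validating a Hive-to-PySpark migration, aggregates should be compared with a tolerance rather than exact equality (the numbers below are made up):

```python
import math

hive_result = 1234.5678901   # hypothetical aggregate from the legacy Hive job
spark_result = 1234.5678899  # same aggregate recomputed in PySpark

# Exact equality fails even though both pipelines are arguably correct
assert hive_result != spark_result

# A relative-tolerance comparison treats them as equivalent
assert math.isclose(hive_result, spark_result, rel_tol=1e-9)
```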
Getting Started with Apache Pulsar and Data Collector
Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo, and a top-level Apache project since September 2018. StreamSets Data Collector 3.5.0, released soon after, introduced the Pulsar Consumer and Producer pipeline stages. In this blog entry I'll explain how to get started creating dataflow pipelines for Pulsar.
https://streamsets.com/blog/getting-started-apache-pulsar-streamsets-data-collector/
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
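Separately from the StreamSets pipeline stages the post configures, here is a hedged first-steps sketch with the official pulsar-client package; the service URL, topic, and subscription name are illustrative:

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("my-topic")
producer.send(b"hello pulsar")

# Pulsar tracks acknowledgements per named subscription
consumer = client.subscribe("my-topic", subscription_name="my-sub")
msg = consumer.receive()
print("received:", msg.data())
consumer.acknowledge(msg)

client.close()
```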
Apache Kafka Security | Need and Components of Kafka
The Kafka community added a number of security features in release 0.9.0.0. There is flexibility in how they are used, too: either separately or together, each enhances security in a Kafka cluster.
https://medium.com/@rinu.gour123/apache-kafka-security-need-and-components-of-kafka-52b417d3ca77
http://datadotz.com/datadotz-bigdata-weekly-69/#more-1389
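From a client's perspective those features look roughly like the confluent-kafka configuration below, combining TLS encryption with SASL/SCRAM authentication; the broker address, certificate path, and credentials are illustrative:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker.internal:9093",
    "security.protocol": "SASL_SSL",             # encrypt in flight and authenticate
    "ssl.ca.location": "/etc/ssl/certs/ca.pem",  # CA used to verify the broker
    "sasl.mechanism": "SCRAM-SHA-256",
    "sasl.username": "app-user",
    "sasl.password": "app-secret",
})
producer.produce("secured-topic", b"hello")
producer.flush()
```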
HyperLogLog in Presto: A significantly faster way to handle cardinality estimation
Computing the count of distinct elements in massive data sets is often necessary but computationally intensive. Say you need to determine the number of distinct people visiting Facebook in the past week using a single machine. Doing this with a traditional SQL query on a data set as massive as the ones we use at Facebook would take days and terabytes of memory. To speed up these queries, we implemented an algorithm called HyperLogLog (HLL) in Presto, a distributed SQL query engine. We have seen great improvements, with some queries now running within minutes, including those used to analyze thousands of A/B tests.
https://code.fb.com/data-infrastructure/hyperloglog/
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
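To give a sense of how HLL surfaces in practice, here is an `approx_distinct` query issued through the presto-python-client; the table, column, and connection details are illustrative:

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.internal", port=8080, user="analyst",
    catalog="hive", schema="web",
)
cur = conn.cursor()

# approx_distinct is the HLL-backed stand-in for COUNT(DISTINCT ...)
cur.execute("""
    SELECT approx_distinct(user_id)
    FROM page_views
    WHERE view_date >= date_add('day', -7, current_date)
""")
print(cur.fetchone())
```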
Kafka Performance Tuning — Ways for Kafka Optimization
There are a few configuration parameters to consider when we talk about Kafka performance tuning. The most important configurations for improving performance are the ones that control the disk flush rate. We can also divide these configurations by component, so let's talk about the producer first.
https://medium.com/@rinu.gour123/kafka-performance-tuning-ways-for-kafka-optimization-fdee5b19505b
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
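A sketch of common producer-side knobs (the broker-side flush settings the article refers to, such as log.flush.interval.messages, live in the broker config instead); the values are illustrative starting points, not recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 20,               # wait briefly so sends batch together
    "batch.num.messages": 10000,   # larger batches amortize request overhead
    "compression.type": "lz4",     # less network/disk at some CPU cost
    "acks": "1",                   # durability vs. latency trade-off
})
for i in range(10000):
    producer.produce("load-test", f"event-{i}".encode())
producer.flush()
```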
New Features of Kafka 2.1
Kafka 2.1 is now available with Java 11 support! Java 11 was released in September 2018, and we get all of its benefits, such as improved SSL and TLS performance (improvements that arrived in Java 9). According to one of the main Kafka committers, it is 2.5 times faster than Java 8.
https://medium.com/@stephane.maarek/new-features-of-kafka-2-1-33fb5396b546
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
Introducing Hive-Kafka integration for real-time Kafka SQL queries
Stream processing engines and libraries like Kafka Streams provide a programmatic stream-processing access pattern for Kafka. Application developers love this access pattern, but when you talk to BI developers, their analytics requirements are quite different, focused on use cases around ad hoc analytics, data exploration, and trend discovery, and they bring a distinct set of requirements for Kafka access.
https://hortonworks.com/blog/introducing-hive-kafka-sql/
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
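The heart of the integration is a Hive DDL that maps a topic onto an external table through the Kafka storage handler; below it is submitted via PyHive, with the topic, brokers, and columns illustrative:

```python
from pyhive import hive

conn = hive.connect(host="hiveserver.internal", port=10000)
cur = conn.cursor()

# The storage handler class is Hive's real Kafka handler; everything
# else here (names, brokers, schema) is a made-up example.
cur.execute("""
    CREATE EXTERNAL TABLE kafka_page_views (
        user_id STRING,
        url     STRING
    )
    STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
    TBLPROPERTIES (
        "kafka.topic" = "page-views",
        "kafka.bootstrap.servers" = "broker1:9092"
    )
""")

# BI users can now point ad hoc SQL at the live topic
cur.execute("SELECT url, count(*) FROM kafka_page_views GROUP BY url")
print(cur.fetchall())
```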
Accelerating Hive Queries with Parquet Vectorization
Apache Hive is a widely adopted data warehouse engine that runs on Apache Hadoop. Features that improve Hive performance can significantly improve the overall utilization of resources on the cluster. Hive processes data using a chain of operators within the Hive execution engine. These operators are scheduled in the various tasks (for example, MapTask, ReduceTask, or SparkTask) of the query execution plan. Traditionally, these operators are designed to process one row at a time.
http://datadotz.com/datadotz-bigdata-weekly-70/#more-1399
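Vectorized execution is toggled per session; a minimal check through PyHive might look like this (connection details illustrative):

```python
from pyhive import hive

conn = hive.connect(host="hiveserver.internal", port=10000)
cur = conn.cursor()

# Process batches of rows per operator instead of one row at a time
cur.execute("SET hive.vectorized.execution.enabled = true")

cur.execute("EXPLAIN SELECT count(*) FROM sales WHERE amount > 100")
for (line,) in cur.fetchall():
    print(line)   # look for "Execution mode: vectorized" in the plan
```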
Apache Spark — Tips and Tricks for better performance
Apache Spark is quickly gaining steam both in the headlines and in real-world adoption. Top use cases are streaming data, machine learning, interactive analysis, and more. Many well-known companies use it, including Uber and Pinterest. So after working with Spark for more than three years in production, I'm happy to share my tips and tricks for better performance.
https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11