Course Description:
Big Data, Data Science, Cloud Computing... Lots of exciting stuff, lots of media buzz, lots of confusing descriptions. For a programmer armed with a laptop and some knowledge of Bash and Python scripting, what’s a good way to begin working with these new tools for handling large-scale unstructured data?
This course provides an introduction to MapReduce, showing how to prototype applications as Linux command-line scripts and then deploy them at scale in the cloud. Examples use Cascading to manage MapReduce application workflows, and Amazon Elastic MapReduce for cloud-based elastic resources.
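For a taste of that prototyping style, here is a minimal word-count sketch in Python; the script name, invocation, and input file are illustrative, not taken from the course materials:

    #!/usr/bin/env python
    # wordcount.py -- minimal word count, runnable as a local pipeline:
    #
    #   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    #
    # The same mapper and reducer can later run unchanged under Hadoop
    # Streaming, where the framework supplies the sort/shuffle between phases.
    import sys

    def mapper():
        # Map phase: emit a ("word", 1) pair for every word on stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word.lower(), 1))

    def reducer():
        # Reduce phase: stdin arrives sorted by key, so counts for the
        # same word are adjacent and can be summed in a single pass.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]]()

The pipeline maps directly onto MapReduce phases: the first stage is the map, the Linux sort stands in for the shuffle/sort, and the last stage is the reduce.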
In addition to examining "how" things work, we will take a detailed look at "why" MapReduce emerged this way -- what factors led to the popular frameworks, and what typical issues confront large-scale deployments -- so that each student is prepared to keep assessing and learning as the field continues to grow and evolve.
Course Objectives:
1. Describe how MapReduce works, including its architecture, basic operations, and a timing diagram for a typical job step.
2. Compare the common themes among the major innovators in MapReduce history.
3. Articulate trade-offs which indicate good use cases for MapReduce, along with best practices and common troubleshooting issues.
4. Identify resources for self-guided learning about MapReduce, beyond the scope of this course.
5. Understand an end-to-end MapReduce example, including the function of each line of code, and trace each tuple of data to completion.
6. Recognize MapReduce workflows as executions of Directed Acyclic Graphs (DAGs), and understand how Cascading augments frameworks such as Hadoop.
7. Translate between layers in a MapReduce "tech stack" to show correspondences between piped Linux utilities, the timing of MapReduce phases, and Cascading flows (see the sketch following this list).
8. Understand how to manage the scale-out of a MapReduce application, from scripting with Linux command-line utilities to production use of large-scale clusters in the cloud.
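To make objective 7 concrete, the sketch below lines up the three layers side by side. The correspondences are illustrative; the Cascading names (Tap, Each, GroupBy, Every) are standard constructs of that framework, not code from this course.

    # layers.py -- illustrative mapping between a local word-count pipeline,
    # the MapReduce phases it mimics, and the Cascading constructs that
    # express the same steps on a cluster.
    LAYERS = [
        # (Linux pipeline stage,          MapReduce phase,    Cascading construct)
        ("cat input.txt",                 "input split/read", "Tap (source)"),
        ("python wordcount.py map",       "map",              "Each (function)"),
        ("sort",                          "shuffle + sort",   "GroupBy"),
        ("python wordcount.py reduce",    "reduce",           "Every (aggregator)"),
        ("> output.txt",                  "output write",     "Tap (sink)"),
    ]

    if __name__ == "__main__":
        for local, phase, construct in LAYERS:
            print("%-30s %-18s %s" % (local, phase, construct))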
Prerequisites:
Some knowledge of Bash and Python scripting, plus a laptop for working through the examples.
Materials:
Instructor:
Paco Nathan has led engineering teams and advised business decision makers on leveraging large-scale data, machine learning, distributed systems, and cloud computing for advanced data analytics.
Specialties: AWS, EC2, S3, EMR, "Big Data", analytics, NoSQL, cloud computing, Hadoop, R, Python, Redis, Gephi, Lucene, text mining, NLP, machine learning, data visualization, statistics, engineering management, remote teams, ROWE, distributed systems, recommender systems, predictive modeling.