Course Description:
Big Data, Data Science, Cloud Computing... Lots of exciting stuff, lots of media buzz, lots of confusing descriptions. For a programmer armed with a laptop and some knowledge of Bash and Python scripting, what’s a good way to begin working with these new tools for handling large-scale unstructured data?
This course provides an introduction to MapReduce, showing how to prototype applications as Linux command-line scripts and then deploy them at scale in the cloud. Examples use Cascading to manage MapReduce application workflows, and Amazon Elastic MapReduce for cloud-based elastic resources.
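For a taste of that prototyping style, here is a minimal word-count sketch in Python; the script name, invocation, and input file are illustrative, not taken from the course materials:

    #!/usr/bin/env python
    # wordcount.py -- minimal word count, runnable as a local pipeline:
    #
    #   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    #
    # The same mapper and reducer can later run unchanged under Hadoop
    # Streaming, where the framework supplies the sort/shuffle between phases.
    import sys

    def mapper():
        # Map phase: emit a ("word", 1) pair for every word on stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word.lower(), 1))

    def reducer():
        # Reduce phase: stdin arrives sorted by key, so counts for the
        # same word are adjacent and can be summed in a single pass.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]]()

The pipeline maps directly onto MapReduce phases: the first stage is the map, the Linux sort stands in for the shuffle/sort, and the last stage is the reduce.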
In addition to examining "how" things work, we will take a detailed look at "why" MapReduce emerged this way -- what factors led to the popular frameworks, and what typical issues confront large-scale deployments -- so that each student is prepared to keep assessing and learning as the field continues to grow and evolve.
Course Objectives:
1. Describe how MapReduce works, including its architecture, basic operations, and a timing diagram for a typical job step.
2. Compare the common themes among the major innovators in MapReduce history.
3. Articulate trade-offs which indicate good use cases for MapReduce, along with best practices and common troubleshooting issues.
4. Identify resources for self-guided learning about MapReduce, beyond the scope of this course.
5. Understand an end-to-end MapReduce example, including the function of each line of code, and trace each tuple of data to completion.
6. Recognize MapReduce workflows as executions of Directed Acyclic Graphs (DAGs), and understand how Cascading augments frameworks such as Hadoop.
7. Translate between layers in a MapReduce "tech stack" to show correspondences between piped Linux utilities, the timing of MapReduce phases, and Cascading flows (see the sketch following this list).
8. Understand how to manage the scale-out of a MapReduce application, from scripting with Linux command-line utilities to production use of large-scale clusters in the cloud.
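To make objective 7 concrete, the sketch below lines up the three layers side by side. The correspondences are illustrative; the Cascading names (Tap, Each, GroupBy, Every) are standard constructs of that framework, not code from this course.

    # layers.py -- illustrative mapping between a local word-count pipeline,
    # the MapReduce phases it mimics, and the Cascading constructs that
    # express the same steps on a cluster.
    LAYERS = [
        # (Linux pipeline stage,          MapReduce phase,    Cascading construct)
        ("cat input.txt",                 "input split/read", "Tap (source)"),
        ("python wordcount.py map",       "map",              "Each (function)"),
        ("sort",                          "shuffle + sort",   "GroupBy"),
        ("python wordcount.py reduce",    "reduce",           "Every (aggregator)"),
        ("> output.txt",                  "output write",     "Tap (sink)"),
    ]

    if __name__ == "__main__":
        for local, phase, construct in LAYERS:
            print("%-30s %-18s %s" % (local, phase, construct))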
Prerequisites:
Some knowledge of Bash and Python scripting, plus a laptop for working through the examples.
Materials:
Instructor:
Paco Nathan has led engineering teams and advised business decision makers on leveraging large-scale data, machine learning, distributed systems, and cloud computing for advanced data analytics.
Specialties: AWS, EC2, S3, EMR, "Big Data", analytics, NoSQL, cloud computing, Hadoop, R, Python, Redis, Gephi, Lucene, text mining, NLP, machine learning, data visualization, statistics, engineering management, remote teams, ROWE, distributed systems, recommender systems, predictive modeling.