Untitled document

2 views
Skip to first unread message

tr...@trevreport.org

unread,
Jul 28, 2010, 9:37:39 PM7/28/10
to nosql-sum...@googlegroups.com
I've shared Untitled document
Message from tr...@trevreport.org:
This is also posted at http://trevmex.com/ I hope it is helpful.
Click to open:

Philadelphia NoSQL Summer: MapReduce

MapReduce: Simplified Data Processing on Large Clusters

a NOSQL Summer Philadelphia

Nick is MCing

Dave from CIM is hosting!

MapReduce is like really large-scale data processing.

One of the important things is the existence of the GFS., which is useful for controlling network bandwidth.

There is a Map function which groups together all the values with the same key and groups them together to a temporary file as key/value pairs.

The Reduce function aggregates the results of the mapped data.

Why have two functions?

     The map function groups things together for the reduce function to utilize.

All the functions are deterministic. This is useful in the case of failures.

The Master

The master is in charge of the whole state of the MapReduce universe.

Does Hadoop allow you to have multiple masters? (Trotter)

     Unknown, but the master is unlikely to fail, so the Google paper does not consider it.

     The main concern is that it handles multiple worker failures. Workers are more likely to fail because of how many there are.

The master keeps all the task information in memory.

Does all the map tasks have to be done for the reduce tasks to start? (Dave)

     No. You can start reducing as soon as you have a mapped result.

What happens when a worker fails?

     There is a heartbeat from the master. If the worker does not respond, the work is given to another worker.

The state information is small (M * R), why is that?

     The map task can produce R documents.

There aren’t that many things you can do with only one map/reduce function, you usually have to chain them together.

Reducers are notified whenever a machine fails.

The results of a map function are stored on the map machine, which the reducer is notified of.

The final file names are deterministic, that will allow your data to be safe because rewriting in GFS is atomic.

So long as the file system has an atomic move, you are safe (and most all do).

In a non-deterministic system, (i.e. a system where the data set changes), you have weaker results, but they can be pretty accurate.

By using backup tasks, you can speed up slow machines by sending the operations they are using to other finished workers.

You can customize your partitioning function to make all the reduced keys go to one file.

The reduced workers sort before that do the reduce task.

In MongoDB and CouchDB the ONLY way to get aggregate data is through MapReduce. The only other way to get data is to return the full documents and parse them. (bleh!)

Skipping bad records: If there is bad input, all the workers will be aware of it, so the mappers will be aware of it and skip it.

You can add a counter to the mapper which can give you added information (like a little map/reduce running on the Master)

The master pings the workers so that the workers do not have to worry about when to talk back to the master.

In the Google example, they are chaining map/reduce functions together to make it useful.

Erasure coding: divide the object into M parts, and recode to N parts where N is greater than M. Then you can throw around the N parts (this is like the temp files in MapReduce)

In the Erlang book by Joe Armstrong, they implement a MapReduce function, which is easy in Erlang.

MongoDB and CouchDB implement MapReduce in Javascript.

Hadoop is a Java implementation of the MapReduce library that is open source and in wide use.

Next time is Google’s BigTable! I will be MCing. w00t!

 


Google Docs makes it easy to create, store and share online documents, spreadsheets and presentations.
Logo for Google Docs

Trotter Cashion

unread,
Jul 28, 2010, 11:32:10 PM7/28/10
to tr...@trevreport.org, nosql-sum...@googlegroups.com
Nice, thanks!
Reply all
Reply to author
Forward
0 new messages