Question regarding MapReduce library

Prateek Malhotra
Jan 29, 2014, 10:57:54 AM
to google-a...@googlegroups.com
Hello,

I am trying to better understand how a map-reduce job runs. I know a lot of datastore operations are required to keep the state of the job, but how does the mapreduce library keep track of "yielded" data?

I am running a job that processes over 13 million entities, normalizes some data, and dumps the result into Google Cloud Storage (approximately 15GB of data). When I use the FileOutputWriter, where does it keep track of each line I've yielded? How do I end up with only one large file written to GCS? I looked at the bucket while the job was running and saw nothing until the job was done, at which point there was one large file ready for me to use. Does each shard aggregate its data into a blobstore object before a final step merges all the shards' data and writes it to GCS? How is the library able to do this on F1 instances with their memory constraints? I was not able to easily follow the code behind all this, so I was hoping someone more familiar with the process could shed some light.
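
To make my guess concrete, here is a minimal sketch of the pattern I imagine: each shard streams its yielded lines to its own part file, and a final step concatenates the parts in fixed-size chunks so memory use stays small. This is only an illustration of my assumption, not the library's actual code. Local files stand in for blobstore/GCS objects, and the names write_shard, merge_shards, and run_job are hypothetical.

```python
import os
import shutil
import tempfile


def write_shard(out_dir, shard_id, lines):
    """Stream one shard's yielded lines to its own part file (hypothetical)."""
    path = os.path.join(out_dir, "part-%05d" % shard_id)
    with open(path, "w", newline="\n") as f:
        for line in lines:  # lines may be a generator; nothing is held in memory
            f.write(line + "\n")
    return path


def merge_shards(part_paths, dest_path, chunk_size=1024 * 1024):
    """Concatenate part files in shard order, copying one chunk at a time."""
    with open(dest_path, "wb") as dest:
        for path in part_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, dest, chunk_size)
    return dest_path


def run_job(shards, dest_path):
    """shards: one iterable of lines per shard; writes a single merged file."""
    work_dir = tempfile.mkdtemp()
    try:
        parts = [write_shard(work_dir, i, s) for i, s in enumerate(shards)]
        return merge_shards(parts, dest_path)
    finally:
        shutil.rmtree(work_dir)  # part files are only intermediate state
```

If the library does something like this, it would explain why the bucket stays empty until the end: the final object only appears once the merge step has finished.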

I have other use cases in which I need to process a lot of data and end up with a single large output file, but my current method is not very stable and does not fit into a map-reduce job. Knowing the general logic behind aggregating the data and placing it into GCS would be of great benefit to me.

Any and all insight would be greatly appreciated!

Thank you,
Prateek