The following documentation page has been updated:
MeetingNotes - Notes from weekly meetings
Project: Minni: Lightweight MapReduce Library
Summary: added "contributions"
Updated by: Hrishikesh Amur
Created by: Hrishikesh Amur
New content:
Jan 25
---
Generalized libminni infrastructure:
* New Mapper class and PAO class need to be defined (see the sketch after this list)
** PAO class has add and merge
* By default, how many passes do we want on the map-side?
* Google-sponsored Google Code projects
* Vary the size of the 25G dataset and see what happens to the difference
* Bin the PAOs and do a pass at the end
** Parametrize the number of buckets (how many partitions on flash vs. disk)
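A minimal sketch of what the Mapper and PAO interfaces above might look like. Only add and merge come from the notes; the class layout, key/value types, and everything else here are assumptions, not the actual libminni API:

    #include <cstdint>
    #include <string>

    // A PAO holds the partial aggregate for one intermediate key.
    class PAO {
    public:
        explicit PAO(std::string key) : key_(std::move(key)), count_(0) {}
        virtual ~PAO() = default;

        // Fold one new value into this partial aggregate.
        virtual void add(int64_t value) { count_ += value; }

        // Combine another PAO for the same key into this one.
        virtual void merge(const PAO& other) { count_ += other.count_; }

        const std::string& key() const { return key_; }
        int64_t value() const { return count_; }

    private:
        std::string key_;
        int64_t count_;  // e.g., a word-count-style aggregate (assumed)
    };

    // The Mapper emits PAOs rather than raw key-value pairs.
    class Mapper {
    public:
        virtual ~Mapper() = default;
        // Produce zero or more PAOs for one input record (sketch only).
        virtual void map(const std::string& record) = 0;
    };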
Feb 1
---
Look at:
Piccolo
Architecture: log store per bucket; sort buckets (see the sketch after this list).
* Is there a fundamental relationship between the number of output files, the
number of buckets, and the number of reducers that we can come up with?
* Size of the bucket: smaller is better since we have to sort them
* Number of buckets: if it's too large there is more overhead; overhead from
the SSD if we are appending to too many files
* Make a list of designs explored (Hrishi)
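A minimal sketch of the log-store-per-bucket idea, under the assumption that each bucket is an append-only file of serialized records that gets sorted in a final pass; the class name, file layout, and record format are illustrative, not the actual minni design:

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <vector>

    class BucketedLogStore {
    public:
        explicit BucketedLogStore(size_t num_buckets) : logs_(num_buckets) {
            for (size_t b = 0; b < num_buckets; ++b)
                logs_[b].open("bucket_" + std::to_string(b) + ".log",
                              std::ios::app);
        }

        // Append one serialized record to the log for its bucket.
        void append(size_t bucket, const std::string& record) {
            logs_[bucket] << record << '\n';
        }

        // Final pass: read one bucket back and sort it in memory.
        // This is why smaller buckets are better: each must fit in RAM.
        std::vector<std::string> sorted_bucket(size_t b) {
            logs_[b].flush();
            std::ifstream in("bucket_" + std::to_string(b) + ".log");
            std::vector<std::string> records;
            for (std::string line; std::getline(in, line); )
                records.push_back(line);
            std::sort(records.begin(), records.end());
            return records;
        }

    private:
        std::vector<std::ofstream> logs_;  // one append-only log per bucket
    };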
Erik's note on partition functions:
Suppose that there are R reducers, and we want there to be B bins per mapper.
Then we need a universal partition function that specifies P = lcm(R, B)
different bins if it is to be perfectly useful for both (note that it is not
the gcd like I claimed in the meeting). Then you need to group P/R partitions
together to get a partition function for reducers, and group P/B partitions
together to get a partition function for bins (see the sketch after this note).
The perfect partition function would be the P-quantiles of the intermediate
key-space (for some ordering on the keys, e.g. lexicographic). Since we don't
know this, we would have to determine it by either:
1. Domain knowledge
2. Experimentation
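A small worked instance of the grouping Erik describes: with R = 3 reducers and B = 4 bins, P = lcm(3, 4) = 12 universal partitions, so groups of P/R = 4 consecutive partitions give the reducer index and groups of P/B = 3 give the bin index. The hash below is only a stand-in for the unknown quantile-based universal partition function:

    #include <cstdio>
    #include <functional>
    #include <numeric>
    #include <string>

    int main() {
        const unsigned R = 3;               // reducers
        const unsigned B = 4;               // bins per mapper
        const unsigned P = std::lcm(R, B);  // universal partitions: 12

        // Stand-in universal partition function; ideally this would be
        // the P-quantiles of the intermediate key-space.
        auto universal = [&](const std::string& key) {
            return static_cast<unsigned>(std::hash<std::string>{}(key) % P);
        };

        for (const std::string key : {"apple", "banana", "cherry"}) {
            unsigned p = universal(key);
            // Group P/R consecutive partitions for the reducer index,
            // and P/B consecutive partitions for the bin index.
            unsigned reducer = p / (P / R);
            unsigned bin     = p / (P / B);
            std::printf("%-6s -> partition %2u, reducer %u, bin %u\n",
                        key.c_str(), p, reducer, bin);
        }
    }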
Feb 8
---
* For the smallest dataset, is nsort doing the sort in memory?
* nsort parameters
* Need to talk about the following cases:
1. dataset fits in memory
2. dataset larger than memory, but the SSD supports writing N buckets with
max perf. such that each bucket fits in memory
3. dataset larger than that
* Why is hash not doing much better than sort in the first section?
** mod function on Atoms
** replace with Hsieh and check (see the sketch after this list)
** other hash functions?
* Look at partition functions such that we can keep all the buckets at around
the same size.
* List of contributions for the paper
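"Hsieh" presumably refers to Paul Hsieh's SuperFastHash. As an illustration of replacing a plain mod with a stronger hash, and of why mixing helps keep the buckets at around the same size, here is a sketch that uses the MurmurHash3 64-bit finalizer as the mixer; the choice of finalizer is an assumption, not necessarily what minni should adopt:

    #include <cstdint>
    #include <cstdio>

    // MurmurHash3 64-bit finalizer: a cheap mixing step so that keys
    // with poorly distributed low bits still spread across all buckets.
    // A stand-in here for Hsieh's hash or any other candidate.
    uint64_t mix64(uint64_t k) {
        k ^= k >> 33;
        k *= 0xff51afd7ed558ccdULL;
        k ^= k >> 33;
        k *= 0xc4ceb9fe1a85ec53ULL;
        k ^= k >> 33;
        return k;
    }

    int main() {
        const uint64_t num_buckets = 8;
        // Keys that are multiples of num_buckets: a plain mod sends all
        // of them to bucket 0, while mixing first spreads them out,
        // keeping the buckets at around the same size.
        for (uint64_t i = 0; i < 8; ++i) {
            uint64_t key = i * num_buckets;
            std::printf("key %3llu: plain mod -> %llu, mixed -> %llu\n",
                        (unsigned long long)key,
                        (unsigned long long)(key % num_buckets),
                        (unsigned long long)(mix64(key) % num_buckets));
        }
    }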
--
Documentation page: http://sourcery.cmcl.cs.cmu.edu/indefero/p/minni/page/MeetingNotes/