New Documentation Page MeetingNotes - Notes from weekly meetings (minni)

sourcer...@gmail.com

Feb 8, 2011, 5:13:04 PM
to libm...@googlegroups.com
Hello,

A new documentation page has been created:

MeetingNotes - Notes from weekly meetings
Project: Minni: Lightweight MapReduce Library
Created by: Hrishikesh Amur

Content:

Jan 25
---

Generalized libminni infrastructure (a sketch follows this list):
* New Mapper class and PAO class need to be defined
** PAO class has add and merge
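
A minimal sketch of how the two classes might fit together (the names PAO,
add, merge, and Mapper come from the notes; the signatures and the word-count
example are assumptions):

    #include <string>

    // A PAO (Partial Aggregation Object) accumulates values for one key
    // and can be combined with another PAO for the same key.
    class PAO {
    public:
        virtual ~PAO() {}
        virtual void add(const std::string& value) = 0;   // fold in one value
        virtual void merge(const PAO* other) = 0;         // combine two PAOs
    };

    // Example PAO: word count. add() bumps a counter; merge() sums two
    // partial counts.
    class WordCountPAO : public PAO {
    public:
        WordCountPAO() : count_(0) {}
        virtual void add(const std::string&) { count_ += 1; }
        virtual void merge(const PAO* other) {
            count_ += static_cast<const WordCountPAO*>(other)->count_;
        }
    private:
        long count_;
    };

    // A user-defined Mapper emits (key, value) pairs; the framework routes
    // each pair into the PAO for that key.
    class Mapper {
    public:
        virtual ~Mapper() {}
        virtual void map(const std::string& key, const std::string& value) = 0;
    };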

By default, how many passes do we want on the map-side?
* Google-sponsored Google Code projects

* vary the size of the 25G dataset and see what happens to the difference
* bin the PAOs and do a pass at the end (see the sketch after this list)
** parametrize the number of buckets (how many partitions on flash vs. disk)
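
A sketch of the binning idea, with the bucket count as the knob to turn (a
plain counter stands in for a real PAO; how a key is routed to a bucket is
the partition-function question taken up on Feb 1):

    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    typedef std::map<std::string, long> Bin;   // key -> partial aggregate

    // PAO::add / PAO::merge stand-in: equal keys within a bin are merged
    // as they arrive.
    void add_pao(std::vector<Bin>& bins, size_t bucket, const std::string& key) {
        bins[bucket][key] += 1;
    }

    // The pass at the end: each bucket is finalized independently, so more
    // buckets mean smaller working sets but more open files on flash/disk.
    void final_pass(const std::vector<Bin>& bins) {
        for (size_t b = 0; b < bins.size(); ++b)
            for (Bin::const_iterator it = bins[b].begin(); it != bins[b].end(); ++it)
                std::printf("%s\t%ld\n", it->first.c_str(), it->second);
    }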

Feb 1
---

Look at: Piccolo

Architecture: log store per bucket; sort buckets.
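
A sketch of that architecture (Record and the in-memory vectors are
assumptions; a real store would append the per-bucket logs to flash or disk):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<std::string, std::string> Record;  // (key, serialized PAO)

    class BucketLogStore {
    public:
        explicit BucketLogStore(size_t num_buckets) : logs_(num_buckets) {}

        // Appends are sequential per bucket; this is the access pattern the
        // SSD overhead question below is about.
        void append(size_t bucket, const Record& r) { logs_[bucket].push_back(r); }

        // Sort one bucket by key; smaller buckets make this pass cheaper.
        void sort_bucket(size_t bucket) {
            std::sort(logs_[bucket].begin(), logs_[bucket].end());
        }

    private:
        std::vector<std::vector<Record> > logs_;
    };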

* Is there a fundamental relationship between the number of output files,
the number of buckets, and the number of reducers that we can come up with?
* Size of the buckets: smaller is better since we have to sort them
* Number of buckets: too large means more overhead, and the SSD adds overhead
if we are appending to too many files
* make a list of designs explored (Hrishi)

Erik's note on partition functions:
So suppose that there are R reducers, and we want there to be B bins per mapper.
Then we need a universal partition function that specifies P = lcm(R,B)
different partitions if it is to be perfectly useful for both (note that it is
not the gcd like I claimed in the meeting). Then you need to group P/R
partitions together to get a partition function for reducers, and group P/B
partitions together to get a partition function for bins.
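
In code form (a sketch; the gcd/lcm helpers and function names are
assumptions, and universal_partition is whatever index in [0, P) the
universal function assigns to a key):

    #include <cstddef>

    size_t gcd(size_t a, size_t b) { while (b) { size_t t = a % b; a = b; b = t; } return a; }
    size_t lcm(size_t a, size_t b) { return a / gcd(a, b) * b; }

    // Group P/R consecutive universal partitions to get the reducer index:
    size_t reducer_for(size_t universal_partition, size_t P, size_t R) {
        return universal_partition / (P / R);
    }

    // Group P/B consecutive universal partitions to get the bin index:
    size_t bin_for(size_t universal_partition, size_t P, size_t B) {
        return universal_partition / (P / B);
    }

For example, with R = 6 and B = 4, P = lcm(6, 4) = 12: universal partitions
{0,1} map to reducer 0 and {0,1,2} map to bin 0, so both groupings are exact.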

The perfect partition function would be the P-quantiles of the intermediate
key space (for some ordering on the keys, e.g. lexicographic). Since we don't
know this, we would have to determine it by either:
1. Domain knowledge
2. Experimentation
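
A sketch of option 2: approximate the P-quantiles from a sample of
intermediate keys, then partition by binary search over the resulting
splitters (how the sample is drawn, and its size, are left open here):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Sort a key sample lexicographically and take every (n/P)-th element;
    // the P-1 splitters define P approximately equal-mass partitions.
    std::vector<std::string> splitters_from_sample(std::vector<std::string> sample,
                                                   size_t P) {
        std::sort(sample.begin(), sample.end());
        std::vector<std::string> splitters;
        for (size_t i = 1; i < P; ++i)
            splitters.push_back(sample[i * sample.size() / P]);
        return splitters;
    }

    // The partition of a key is the index of the first splitter greater
    // than it.
    size_t partition_of(const std::string& key,
                        const std::vector<std::string>& splitters) {
        return std::upper_bound(splitters.begin(), splitters.end(), key)
               - splitters.begin();
    }

This is also one way to attack the equal-sized-buckets item under Feb 8.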


Feb 8
---
* for the smallest dataset, is nsort doing the sort in memory?
* nsort parameters

* need to talk about the following cases:
1. dataset fits in memory
2. dataset larger than memory, but the SSD supports writing N buckets at max
performance such that each bucket fits in memory
3. dataset larger than that

* why is hash not doing much better than sort in the first section?
** the mod function on Atoms (see the sketch after this list)
** replace it with Hsieh's hash and check
** other hash functions?
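
For comparison (Hsieh's SuperFastHash is not reproduced here; FNV-1a stands
in as an example of a hash that mixes all key bytes, unlike a plain mod on
the Atom value):

    #include <cstddef>
    #include <stdint.h>
    #include <string>

    // What "mod function on Atoms" amounts to: clusters badly whenever key
    // values share low-order structure.
    size_t mod_partition(uint64_t atom, size_t num_buckets) {
        return atom % num_buckets;
    }

    // FNV-1a (stand-in for Hsieh's hash): mixes every byte before the mod.
    uint32_t fnv1a(const std::string& key) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < key.size(); ++i) {
            h ^= static_cast<unsigned char>(key[i]);
            h *= 16777619u;
        }
        return h;
    }

    size_t hash_partition(const std::string& key, size_t num_buckets) {
        return fnv1a(key) % num_buckets;
    }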

* Look at partition functions such that we can keep all
the buckets at around the same size.

--
Documentation page: http://sourcery.cmcl.cs.cmu.edu/indefero/p/minni/page/MeetingNotes/
