New Documentation Page MeetingNotes - Notes from weekly meetings (minni)

sourcer...@gmail.com

Feb 8, 2011, 5:13:04 PM
to libm...@googlegroups.com
Hello,

A new documentation page has been created:

MeetingNotes - Notes from weekly meetings
Project: Minni: Lightweight MapReduce Library
Created by: Hrishikesh Amur

Content:

Jan 25
---

Generalized libminni infrastructure (a sketch follows this list):
* New Mapper class and PAO class need to be defined
** PAO class has add and merge
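
A minimal sketch of how the two classes might fit together (the names PAO,
add, merge, and Mapper come from the notes; the signatures and the word-count
example are assumptions):

    #include <string>

    // A PAO (Partial Aggregation Object) accumulates values for one key
    // and can be combined with another PAO for the same key.
    class PAO {
    public:
        virtual ~PAO() {}
        virtual void add(const std::string& value) = 0;   // fold in one value
        virtual void merge(const PAO* other) = 0;         // combine two PAOs
    };

    // Example PAO: word count. add() bumps a counter; merge() sums two
    // partial counts.
    class WordCountPAO : public PAO {
    public:
        WordCountPAO() : count_(0) {}
        virtual void add(const std::string&) { count_ += 1; }
        virtual void merge(const PAO* other) {
            count_ += static_cast<const WordCountPAO*>(other)->count_;
        }
    private:
        long count_;
    };

    // A user-defined Mapper emits (key, value) pairs; the framework routes
    // each pair into the PAO for that key.
    class Mapper {
    public:
        virtual ~Mapper() {}
        virtual void map(const std::string& key, const std::string& value) = 0;
    };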

By default, how many passes do we want on the map-side?
* Google-sponsored Google Code projects

* vary the size of the 25G dataset and see what happens to the difference
* bin the PAOs and do a pass at the end (see the sketch after this list)
** parametrize the number of buckets (how many partitions on flash vs. disk)
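
A sketch of the binning idea, with the bucket count as the knob to turn (a
plain counter stands in for a real PAO; how a key is routed to a bucket is
the partition-function question taken up on Feb 1):

    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    typedef std::map<std::string, long> Bin;   // key -> partial aggregate

    // PAO::add / PAO::merge stand-in: equal keys within a bin are merged
    // as they arrive.
    void add_pao(std::vector<Bin>& bins, size_t bucket, const std::string& key) {
        bins[bucket][key] += 1;
    }

    // The pass at the end: each bucket is finalized independently, so more
    // buckets mean smaller working sets but more open files on flash/disk.
    void final_pass(const std::vector<Bin>& bins) {
        for (size_t b = 0; b < bins.size(); ++b)
            for (Bin::const_iterator it = bins[b].begin(); it != bins[b].end(); ++it)
                std::printf("%s\t%ld\n", it->first.c_str(), it->second);
    }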

Feb 1
---

Look at: Piccolo

Architecture: log store per bucket; sort buckets.
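
A sketch of that architecture (Record and the in-memory vectors are
assumptions; a real store would append the per-bucket logs to flash or disk):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<std::string, std::string> Record;  // (key, serialized PAO)

    class BucketLogStore {
    public:
        explicit BucketLogStore(size_t num_buckets) : logs_(num_buckets) {}

        // Appends are sequential per bucket; this is the access pattern the
        // SSD overhead question below is about.
        void append(size_t bucket, const Record& r) { logs_[bucket].push_back(r); }

        // Sort one bucket by key; smaller buckets make this pass cheaper.
        void sort_bucket(size_t bucket) {
            std::sort(logs_[bucket].begin(), logs_[bucket].end());
        }

    private:
        std::vector<std::vector<Record> > logs_;
    };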

* Is there a fundamental relationship between the number of output files,
the number of buckets, and the number of reducers that we can come up with?
* Size of the buckets: smaller is better since we have to sort them
* Number of buckets: too large means more overhead, and the SSD adds overhead
if we are appending to too many files
* make a list of designs explored (Hrishi)

Erik's note on partition functions:
So suppose that there are R reducers, and we want there to be B bins per mapper.
Then we need a universal partition function that specifies P = lcm(R,B)
different partitions if it is to be perfectly useful for both (note that it is
not the gcd like I claimed in the meeting). Then you need to group P/R
partitions together to get a partition function for reducers, and group P/B
partitions together to get a partition function for bins.
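
In code form (a sketch; the gcd/lcm helpers and function names are
assumptions, and universal_partition is whatever index in [0, P) the
universal function assigns to a key):

    #include <cstddef>

    size_t gcd(size_t a, size_t b) { while (b) { size_t t = a % b; a = b; b = t; } return a; }
    size_t lcm(size_t a, size_t b) { return a / gcd(a, b) * b; }

    // Group P/R consecutive universal partitions to get the reducer index:
    size_t reducer_for(size_t universal_partition, size_t P, size_t R) {
        return universal_partition / (P / R);
    }

    // Group P/B consecutive universal partitions to get the bin index:
    size_t bin_for(size_t universal_partition, size_t P, size_t B) {
        return universal_partition / (P / B);
    }

For example, with R = 6 and B = 4, P = lcm(6, 4) = 12: universal partitions
{0,1} map to reducer 0 and {0,1,2} map to bin 0, so both groupings are exact.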

The perfect partition function would be the P-quantiles of the intermediate
key space (for some ordering on the keys, e.g. lexicographic). Since we don't
know this, we would have to determine it by either:
1. Domain knowledge
2. Experimentation
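
A sketch of option 2: approximate the P-quantiles from a sample of
intermediate keys, then partition by binary search over the resulting
splitters (how the sample is drawn, and its size, are left open here):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Sort a key sample lexicographically and take every (n/P)-th element;
    // the P-1 splitters define P approximately equal-mass partitions.
    std::vector<std::string> splitters_from_sample(std::vector<std::string> sample,
                                                   size_t P) {
        std::sort(sample.begin(), sample.end());
        std::vector<std::string> splitters;
        for (size_t i = 1; i < P; ++i)
            splitters.push_back(sample[i * sample.size() / P]);
        return splitters;
    }

    // The partition of a key is the index of the first splitter greater
    // than it.
    size_t partition_of(const std::string& key,
                        const std::vector<std::string>& splitters) {
        return std::upper_bound(splitters.begin(), splitters.end(), key)
               - splitters.begin();
    }

This is also one way to attack the equal-sized-buckets item under Feb 8.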


Feb 8
---
* for the smallest dataset, is nsort doing the sort in memory?
* nsort parameters

* need to talk about the following cases:
1. dataset fits in memory
2. dataset larger than memory, but the SSD supports writing N buckets at max
performance such that each bucket fits in memory
3. dataset larger than that

* why is hash not doing much better than sort in the first section?
** the mod function on Atoms (see the sketch after this list)
** replace it with Hsieh's hash and check
** other hash functions?
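
For comparison (Hsieh's SuperFastHash is not reproduced here; FNV-1a stands
in as an example of a hash that mixes all key bytes, unlike a plain mod on
the Atom value):

    #include <cstddef>
    #include <stdint.h>
    #include <string>

    // What "mod function on Atoms" amounts to: clusters badly whenever key
    // values share low-order structure.
    size_t mod_partition(uint64_t atom, size_t num_buckets) {
        return atom % num_buckets;
    }

    // FNV-1a (stand-in for Hsieh's hash): mixes every byte before the mod.
    uint32_t fnv1a(const std::string& key) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < key.size(); ++i) {
            h ^= static_cast<unsigned char>(key[i]);
            h *= 16777619u;
        }
        return h;
    }

    size_t hash_partition(const std::string& key, size_t num_buckets) {
        return fnv1a(key) % num_buckets;
    }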

* Look at partition functions such that we can keep all
the buckets at around the same size.

--
Documentation page: http://sourcery.cmcl.cs.cmu.edu/indefero/p/minni/page/MeetingNotes/
