A couple of things--I'm not sure what will be most helpful since I'm
having trouble identifying what the exact questions are.

> So, specifically, if my two reduces just look like this, will it work? I
> probably need to convert the list of values into a string, with json or
> something?
>
> def reduce(self, key, values):
>     yield values

The latest version of Mrs handles serialization for you, so you usually
don't need to serialize manually anymore.
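For example (just a sketch, and I'm assuming you want the whole list of
values as a single output value), the quoted reduce could simply be:

    def reduce(self, key, values):
        # Mrs serializes Python objects for you, so there's no need to
        # json-encode the list yourself; just yield it.
        yield list(values)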
Depending on how many keys you have, you might be perfectly happy with
the default partition function (hash_partition).
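If you ever do outgrow hash_partition, a partition function is basically
just a mapping from a key and the number of reduce splits to a split
index. Here's a made-up sketch (I'm assuming the same (key, n) convention
that hash_partition uses; check the Mrs source for how to hook it up):

    def alpha_partition(key, n):
        """Group keys by their first character so related keys land in
        the same reduce split (purely illustrative)."""
        if not key:
            return 0
        return ord(str(key)[0].lower()) % n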
Since you have a couple of map-reduce phases, you'll write a run method
that calls job.local_data (for the input), job.map_data,
job.reduce_data, job.map_data, and job.reduce_data, each with the
options that you need.
If you want your program to process all of the output, then at the end
of your run function, you'll do a job.wait on the last reduce dataset,
and then you can iterate over its `data()` method to get all output
key-value pairs.
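To make that concrete, here's a rough sketch of the overall shape. Only
local_data, map_data, reduce_data, wait, and data() come from Mrs; the
other names (input_pairs, map2, reduce2) are placeholders, and I'm
leaving out the per-call options:

    def run(self, job):
        # Feed the initial key-value pairs in from the master.
        source = job.local_data(self.input_pairs())

        # First map-reduce phase.
        inter1 = job.map_data(source, self.map)
        out1 = job.reduce_data(inter1, self.reduce)

        # Second map-reduce phase, reading the first phase's output.
        inter2 = job.map_data(out1, self.map2)
        out2 = job.reduce_data(inter2, self.reduce2)

        # Block until the last reduce dataset is finished, then walk
        # its output key-value pairs.
        job.wait(out2)
        for key, value in out2.data():
            print(key, value)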
I hope this is helpful, at least for getting you started.
With SEPSO we had multiple items related to the same particle, so things
could get a bit more complicated.
> Right, that's what I have. I didn't know about the data() method, though -
> I was passing some arguments to the second reduce_data task. Because I'm expecting
> there to be so many keys, it might be better to stick with outputting them
> in the reduce. Or does it make a difference?

It all depends on what you're trying to do, but at least you know what
your options are now.
It sounds like you've hit a bug that we need to fix, but I don't have
any idea what "my problem really needs hadoop" means. :)
At one point I thought you mentioned getting a backtrace somewhere? Is
this the same thing, or are you hitting some other symptom?
Did you check for errors both on the master and on the slaves?
Is there an easy way for me to try to reproduce this locally?
Yeah, there's a big difference between a cluster and a workstation. So
you don't have ssh access to the cluster?
Part of me wonders whether you might have had problems from running too
many slaves on one workstation and running out of RAM. Is this
possibility easy to rule out?
Proving you wrong would be great. :) We did wordcount with some 50 GB
of Gutenberg files, and Hadoop kept crashing. When we shrank the
dataset, Hadoop finished but was an order of magnitude slower. So we're
not entirely happy with the scalability of Hadoop.
I did that because there wasn't enough space in /var, which is where it defaults to. I can see if there's a local disk that has enough space on it, but I think it's unlikely.
Yeah, trying to do that amount of IO to a remote filesystem can't
possibly work well. How small is the local disk, if it's really too
small for this? It seems crazy that there wouldn't be a reasonably sized
disk.
I've made a few changes to the program in /home/amcnabb/c/mrs/matt which
you might want to see. Most of the changes are fairly minor, but you
might find them interesting.
I have it running right now.
It looks like the biggest problem, so far
anyway, is the lack of granularity in the input files. A whole bunch of
processors are idle from the beginning, and after a couple of minutes,
almost all of them are idle waiting for the last two tasks or so to
finish. If there's any way to split the input files or make tasks that
can read portions of files, it will be much more efficient to
parallelize.
By the way (for future reference), if you really do need to use an NFS
server, "--mrs-shared" would do what you want. With "--mrs-tmpdir",
slave A writes to NFS, and when slave B needs the data, it contacts
slave A, which reads it from NFS and sends it over the network to slave
B. So, it's really bad. On the other hand, with "--mrs-shared" slave B
can read the data directly from NFS without contacting slave A. That
can work okay for CPU-bound programs like Particle Swarm Optimization,
but it's doomed to fail for programs with hundreds of gigabytes of I/O.
It was fun to go through and look for things that would be particularly
confusing for a user. I'm really fairly pleased with how your code
turned out. There were a few spots that weren't quite ideal, but you
weren't led toward doing things horribly wrong in any major way.
So, at least in the Python code, it looks like a whole bunch of paths
are output in the second map operation. For example, if there were a
path A->B->C->D, it would spit out A, A->B, A->B->C, and A->B->C->D,
making the total amount of data O(n^2). Depending on how long these
paths can get (and it looks like they can get pretty long), this is a
little frightening.
The other possibility is to split the files. Since it's exactly 10
bytes per walk, you can jump to any spot in the middle of the file and
know exactly where a new record starts.
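Just to illustrate the idea (none of this is from the actual program;
the names and splitting strategy are made up, and only the 10-byte
record size comes from above):

    import os

    RECORD_SIZE = 10  # each walk is exactly 10 bytes

    def record_aligned_splits(path, num_splits):
        """Yield (start, length) byte ranges that begin and end on record
        boundaries, so each range can be handed to its own map task."""
        total = os.path.getsize(path) // RECORD_SIZE
        per_split = -(-total // num_splits)  # ceiling division
        for i in range(num_splits):
            first = i * per_split
            last = min((i + 1) * per_split, total)
            if first >= last:
                break
            yield first * RECORD_SIZE, (last - first) * RECORD_SIZE

    def read_walks(path, start, length):
        """Read the fixed-size walk records in one byte range."""
        with open(path, 'rb') as f:
            f.seek(start)
            for _ in range(length // RECORD_SIZE):
                yield f.read(RECORD_SIZE)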
I'm suspecting that it has something to do with the O(n^2) thing, but
I'm going to look into it a bit more and see if there's something else
going on.
The formatting of the key makes sense--the formatting of the value (the
defaultdict) does not seem right. Do you have any insights about how
this should be formatted?
Matt, would you be able to tackle the output formatting problem? I think
that the final output should be much, much less than 33 GB.
1) Since we have posted this as an example, could we get a description of what it is doing, in graph-theory terms, application terms, or both?
2) Have you mentioned this to anyone else there? Is there any chance of getting others to use it? It really seems that mrs is (much) easier to program with and also faster, even on a smaller cluster. Someone ought to care.
With this change, the total time is now 11.6 minutes. Just out of
curiosity, what's your final time in Hadoop?
It's not apples to apples, but it looks like it's about 8 minutes on 100
machines for Hadoop (with the granularity fix) vs. about 11 and a half
minutes on 23 machines for Mrs. That's a pretty solid win for Mrs.
It's probably impossible to set all of the factors equal, but if you
send me the Java code (preferably copying the granularity fix from my
version), I can run it on our cluster to compare side by side.
On Wed, Oct 24, 2012 at 02:51:12PM -0400, Matthew Gardner wrote:
> I can send you my hadoop code, but the granularity fix isn't the same - you
> have to modify an InputFormat class, which I'm planning on doing at some
> point, but it's just handled a bit differently in hadoop.

Even in Hadoop, it might be easier to do it the way we did it in Mrs
(the walks files wouldn't technically be the "input files").
On Wed, Oct 24, 2012 at 2:34 PM, Andrew McNabb <amc...@mcnabbs.org> wrote:
> It's not apples to apples, but it looks like it's about 8 minutes on 100
> machines for Hadoop (with the granularity fix) vs. about 11 and a half
> minutes on 23 machines for Mrs. That's a pretty solid win for Mrs.

But you're using 3 threads per machine? So is it really 69 vs. 100? Either
way, though, you're right that mrs is quite impressive.
So, that means I used 100 machines for each map and reduce after the first map (which only used 28), and I likely got a more even split, so there wasn't a big partition holding the other tasks up.
When we are running your hadoop code, we are seeing 30 tasks, as opposed to your 28. Any reason why we are not seeing the same thing?
And one other followup question: I noticed that the Hadoop program has
three maps and three reduces (three MapReduce jobs), but the Mrs program
only has two maps and two reduces. Are the two programs doing different
work?
Is there anything wrong with the data from the two biggest files, or is
it just inconvenient to have large files?