A friend of mine wrote on this topic recently:
http://gridgurus.typepad.com/grid_gurus/2008/04/the-mapreduce-p.html
rw2
Hi All,
I am a graduate student at Indiana University Bloomington.
In my current research work, I have analyzed the MapReduce technique using two scientific applications. I have used Hadoop and also CGL-MapReduce (a streaming-based MapReduce implementation that we developed) for this analysis.
Since the MapReduce discussion is going on, I thought I would share my experience with the technique with you all.
So far, I have implemented two scientific applications in the MapReduce programming model:
1. High Energy Physics (HEP) data analysis involving large volumes of data (I have tested up to 1 terabyte)
2. Kmeans clustering, to cluster a large number of 2D data points into a predefined number of clusters.
The first application is a data/computation-intensive application that passes the data through a single MapReduce computation cycle. The second is an iterative application that requires multiple MapReduce cycles before the final result is obtained.
I implemented the HEP data analysis using Hadoop and CGL-MapReduce and compared their performance.
I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compared their performance.
I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
The conclusions we have drawn so far from our experiments are listed below. (The following documents contain a detailed description of the work.)
[Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
http://jaliyacgl.blogspot.com/2008/08/high-energy-physics-data-analysis-using.html
1. The MapReduce technique scales well for large data/compute-intensive applications and is easy to program.
2. Google's approach (also Hadoop's approach) of transferring intermediate results via a distributed file system adds a large overhead to the computation. However, the effect of this overhead is minimal for very large data/compute-intensive applications. (Please see Figures 6, 7, and 8 of the above paper.)
3. The file-system-based communication overhead is prohibitively large in iterative MapReduce computations. (Please see Figure 9 of the above paper.) There is a Hadoop subproject named Mahout that intends to support iterative MapReduce computations, but since it still runs on Hadoop it will incur these overheads as well. (A driver-loop sketch after this list shows where the per-iteration cost comes in.)
4. Storing and accessing data in a specialized distributed file system works well when the data is in text format and the Map/Reduce functions are written in a language that has APIs for accessing that file system. For binary data, or for Map/Reduce functions written in other languages, this is not a feasible solution; a data catalog would be a better fit for these more generic use cases.
5. Multi-threaded applications run faster on multicore platforms than MapReduce or other parallelization techniques targeted at distributed memory. Of course, MapReduce is simpler to program :)
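To make point 3 concrete, here is a minimal sketch (this is not our actual Hadoop or CGL-MapReduce code; the class name, paths, and configuration key are hypothetical) of the driver loop an iterative algorithm such as Kmeans needs when every iteration is a separate Hadoop job. Each pass pays the full job-startup cost and writes and re-reads its intermediate results (the cluster centers) through the distributed file system:

// Hypothetical driver: one full Hadoop job per Kmeans iteration (old mapred API).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IterativeKmeansDriver {
  public static void main(String[] args) throws Exception {
    String centers = "hdfs:///kmeans/centers-0";      // initial centers (hypothetical path)
    for (int i = 0; i < 20; i++) {                    // fixed iteration count for the sketch
      JobConf conf = new JobConf(IterativeKmeansDriver.class);
      conf.setJobName("kmeans-iteration-" + i);
      conf.set("kmeans.centers.path", centers);       // the (omitted) mapper reads the current centers from here
      FileInputFormat.setInputPaths(conf, new Path("hdfs:///kmeans/points"));
      Path out = new Path("hdfs:///kmeans/centers-" + (i + 1));
      FileOutputFormat.setOutputPath(conf, out);
      // conf.setMapperClass(...); conf.setReducerClass(...); omitted for brevity
      JobClient.runJob(conf);                         // blocks: job startup + HDFS writes, every iteration
      centers = out.toString();                       // the next iteration re-reads the centers from HDFS
    }
  }
}

This repeated job startup and file-system round trip is the overhead behind point 3, and avoiding it is essentially the motivation for the streaming approach in CGL-MapReduce.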
Any thoughts and comments are welcome!
Thanks,
Jaliya
From:
cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On
Behalf Of Chris K Wensel
Sent: Tuesday, August 26, 2008 7:56 PM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?
Hey all
It might be interesting to do a quick test and see what the performance
overhead of using virtual servers is, compared to physical servers.
--
Jim Starkey
President, NimbusDB, Inc.
978 526-1376
Yes, we are planning to do a thorough test as well. I have some initial results on
the overhead of virtual machines for MPI.
This is what we did.
We used the same MPI program that is used for the Kmeans clustering and modified
it so that it iterates a given number of times at each MPI communication
routine. Then we measured the performance on a VM setup (using the "Nimbus"
cloud) and on an exactly similar set of machines without VMs.
The results are shown in the following graphs.
http://www.cs.indiana.edu/~jekanaya/cloud/avg_kmeans_time.png
http://www.cs.indiana.edu/~jekanaya/cloud/vm_no_vm_min_max.png
The VMs introduced a high overhead to the MPI communication. We did a few tests,
changing configurations such as the number of processor cores used in a VM and
the number of VMs used, but still could not pinpoint the reason for this high
overhead. It could be overhead induced by the network emulator or the network
configuration used in Nimbus. We are setting up a cloud at IU, and once it is
done we will be able to verify these figures.
Thanks,
Jaliya
> -----Original Message-----
> From: cloud-c...@googlegroups.com [mailto:cloud-
> comp...@googlegroups.com] On Behalf Of Jim Starkey
> Sent: Wednesday, August 27, 2008 9:45 AM
> To: cloud-c...@googlegroups.com
> Subject: Re: Is Map/Reduce going mainstream?
Greenplum and Aster Data both recently announced MapReduce within SQL
for software-only distributed data warehouses on commodity hardware --
highly parallel, fault tolerant -- great arguments. I've talked with
both vendors. These directions are great to see for several reasons,
and I'll likely become a paying customer. There are cautions as well
for other reasons.
Going beyond the "It's hard to think in MapReduce" point which
GridGurus made ... As an engineering manager at a firm invested in
using lots of SQL, lots of Hadoop, lots of other code written in
neither of the above, and many terabytes of data added each week -- my
primary concern is more about how well a *team* of people can "think"
within a paradigm over a *long* period of time.
Much of what I've seen discussed about Hadoop seems based around an
individual programmer's initial experiences. That's great to see.
However, what really sells a paradigm is when you can evaluate from a
different perspective, namely hindsight.
Consider this as an exercise: imagine that a programmer wrote 1000
lines of code for a project at work based on MapReduce, the company
has run that code in some critical use case for 2-3 years, the
original author moved to another company (or the company was
acquired), and now a team of relatively new eyeballs must reverse
engineer the existing MapReduce code and extend it.
A software lifecycle argument is not heard so much in discussions
about MapReduce, but IMHO it will become more crucial for decision
makers as vendors move to providing SQL, etc.
As a different example, I don't need to argue with exec mgmt or board
members that reworking 10000 lines of PHP with SQL embedded is going
to cost more than reworking the same app if it'd been written in Java
and Hibernate. Or having it written in Python, or whatever other
more-structured language makes sense for the app. The business
decision makers already understand those costs, all too well.
Keep in mind that a previous generation of data warehousing (e.g.,
Netezza) promised parallel processing based on SQL-plus-extensions
atop custom hardware -- highly-parallel, somewhat less fault-tolerant,
but still an advancement. IMHO, what we see now from Greenplum, Aster
Data, Hive, etc., represents a huge leap beyond that previous
generation. Particularly for TCO.
Pros: if you have terabytes of data collected from your different
business units, then those business units will want to run ad-hoc
queries on it, and they won't have time/expertise/patience to write
code for Hadoop. I'd rather give those users means to run some SQL and
be done -- than disrupt my team from core work on algorithms.
Cons: if you have a complex algorithm which is part of the critical
infrastructure and must be maintained over a long period of time by a
team of analysts and developers, then RUN AWAY, DO NOT WALK, from
anyone trying to sell you a solution based in SQL.
I'd estimate the cost of maintaining substantial apps in SQL to be
10-100x that of more structured languages used in Hadoop: Java, C++,
Python, etc. That's based on encountering the problem over and over
for the past 10 years. Doubtless there are studies by people with
much broader views and better data, much to the same effect.
The challenge to Aster Data, Greenplum, et al., will be to provide
access to their new systems from within languages like Java, Python,
etc. Otherwise I see them relegated to the land of ad-hoc queries,
being eclipsed by coding paradigms which are less abstracted or more
geared toward software costs long-term.
Pig provides an interesting approach, and I'm sure that it will be
useful for many projects. It's not clear that it provides the answers to
software lifecycle issues.
From what I've seen through the years in other areas of large scale
data, such as ETL and analytics, first off the data is not
particularly relational.
The concept of workflows seems much more apt -- more likely to become
adopted by the practitioners in quantity once we get past the "slope
of enlightenment" phase of the hype cycle.
Our team has been considering a shift to Cascading. I'd much rather
train a stats PhD on Cascading than have them slog through 50x the lines
of SQL written by the previous stats PhD.
Having access to DSLs from within application code -- e.g., R for
analytics, Python for system programming, etc. -- would also be far
preferable, from my perspective, to MapReduce within SQL.
Paco
This is very interesting work.
> The [HEP] application is data/computation intensive application that
> requires passing the data through a single MapReduce computation cycle.
What HEP application were you using? Was it doing Monte Carlo generation
(MC), reconstruction (reco), analysis, or a mixture of these?
> I implemented the HEP data analysis using Hadoop and CGL-MapReduce and
> compare their performances [...]
Did you (or are you planning to) compare this with PROOF:
http://root.cern.ch/twiki/bin/view/ROOT/PROOF
> [Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
Ta for the link.
Cheers,
Paul.
Thanks :)
> > The [HEP] application is data/computation intensive application that
> > requires passing the data through a single MapReduce computation
> cycle.
>
> What HEP application were you using? Was it doing Monte Carlo
> generation
> (MC), reconstruction (reco), analysis, or a mixture of these?
>
We used Monte Carlo data to search for the Higgs.
The analysis code and the data are both from the High Energy Physics group at
Caltech, and we just used it as a use case for our MapReduce testing.
>
>
> > I implemented the HEP data analysis using Hadoop and CGL-MapReduce
> and
> > compare their performances [...]
>
> Did you (or are you planning to) compare this with PROOF:
> http://root.cern.ch/twiki/bin/view/ROOT/PROOF
>
This would be an interesting comparison. However, PROOF is only for parallelizing
ROOT applications.
Right now our focus is on a generalized framework for supporting these data
analyses.
>
> > [Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
>
> Ta for the link.
Thanks,
Jaliya
>
> Cheers,
>
> Paul.
>
Therefore, what I suggest is that once we have a parallel implementation
(MapReduce or any other), we need to have at least a rough idea of the
efficiency (or speedup) achieved by our solution and see how much overhead
is introduced by the parallelization technique itself.
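For reference, with p processing units, T(1) the sequential running time, and T(p) the parallel running time, the standard definitions are:

  speedup     S(p) = T(1) / T(p)
  efficiency  E(p) = S(p) / p
  overhead    f(p) = (p * T(p) - T(1)) / T(1)

A perfectly parallel run gives S(p) = p, E(p) = 1, and f(p) = 0; whatever gap we measure is the cost introduced by the parallelization technique itself.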
Thanks
Jaliya
> -----Original Message-----
> From: cloud-c...@googlegroups.com [mailto:cloud-
> comp...@googlegroups.com] On Behalf Of Jeff He
> Sent: Thursday, August 28, 2008 11:23 PM
> To: Cloud Computing
> Subject: Re: Is Map/Reduce going mainstream?
>
>
--
Geoffrey Fox  g...@indiana.edu  http://www.infomall.org
Phones: Cell 812-219-4643, Home 812-323-9196, Lab 812-856-7977, FAX 812-856-7972
SkypeIn 812-669-0772 (with voicemail), International cell 812-391-0207
Yes, MapReduce will work in both situations. However, my point is that the
"exponential performance advantage" that we expect to achieve may not be
possible unless we use the right implementation of MapReduce, one that is
suitable for the class of applications.
Thanks
Jaliya
Hence, to say that MapReduce is "going mainstream" is only relative to the fact
that the size of the "market" for such applications is growing.
--Craig
Most certainly so, but it is also abused in inappropriate applications. It
is occasionally used, for example, to parallelize a search of a large table.
A smarter way to handle this would be an index. So let's be wary of a
solution looking for a problem....
That said, how do you build a house in the MapReduce model? Filtering, aggregation, and functional conversions are very simple in MapReduce, but typical workloads are more complex and end up being 2-5-20 individual MapReduce jobs chained together by dependencies (many running in parallel). Thinking that deeply in MapReduce is not trivial, and the resulting code base is not simple (see Google's Sawzall papers). If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher-level abstraction that users and developers can reason in.
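To make the point about chained jobs concrete, here is a minimal sketch of roughly the level a developer works at today with plain Hadoop, using the old-API JobControl class to express dependencies between jobs. The job names and the (omitted) per-job configuration are hypothetical, and this is not what Greenplum or anyone else ships; it just shows how little abstraction the raw model provides:

// Hypothetical example of wiring three dependent Hadoop jobs with JobControl.
import java.util.ArrayList;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    JobConf extractConf = new JobConf();   // mappers/reducers/paths omitted for brevity
    JobConf joinConf    = new JobConf();
    JobConf reportConf  = new JobConf();

    Job extract = new Job(extractConf);                   // no dependencies
    ArrayList<Job> afterExtract = new ArrayList<Job>();
    afterExtract.add(extract);
    Job join = new Job(joinConf, afterExtract);           // runs after extract
    ArrayList<Job> afterJoin = new ArrayList<Job>();
    afterJoin.add(join);
    Job report = new Job(reportConf, afterJoin);          // runs after join

    JobControl control = new JobControl("nightly-chain");
    control.addJob(extract);
    control.addJob(join);
    control.addJob(report);

    new Thread(control).start();                          // JobControl implements Runnable
    while (!control.allFinished()) {
      Thread.sleep(5000);                                 // poll until the whole chain completes
    }
    control.stop();
  }
}

All of this is still per-job plumbing; the business logic lives in the mappers and reducers behind each JobConf, which is exactly why higher-level layers such as Pig, Cascading, or SQL front ends keep appearing.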
-----Original Message-----
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of Jeff He
Sent: Friday, 29 August 2008 1:23 PM
To: Cloud Computing
Subject: Re: Is Map/Reduce going mainstream?
Sassa
2008/8/29 Jaliya Ekanayake <jeka...@cs.indiana.edu>:
Sassa
2008/8/27 Jaliya Ekanayake <jeka...@cs.indiana.edu>:
Horses for courses. There is discrete data and continuous data. Discrete data can be easily partitioned, and I see MapReduce as overkill for this scenario (e.g., student exam results). However, continuous data (e.g., a video stream) is not easily partitioned and does require stuff like this.
Having written a graph/map API, I want to see more developers making use of graph theory, though I think MapReduce might be the wrong product for you :)
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of Roderick Flores
Sent: Sunday, 31 August 2008 4:35 AM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?
On Tue, Aug 26, 2008 at 5:56 PM, Chris K Wensel <ch...@wensel.net> wrote:
Thanks,
Jaliya
> -----Original Message-----
> From: cloud-c...@googlegroups.com [mailto:cloud-
> comp...@googlegroups.com] On Behalf Of Sassa NF
I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compared their performance.
I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
It seems like you have not understood why we use Kmeans in this context. As you mentioned, there are numerous improvements one can make to the data and the initial set of clusters in Kmeans and in most other clustering algorithms. We selected Kmeans because it is a simple example that shows the use of “iterative computations using MapReduce”.
We performed the Kmeans algorithm exactly as you mentioned. All the map tasks compute the distances between some of the points and the set of cluster centers, and once all of them have finished, a set of reduce tasks calculates the new cluster centers. So we preserve the “MapReduce” paradigm in both our algorithm and the implementation. This also allows any fault-tolerance features of the implementation to work: the result will not change if a few map tasks are re-executed after failures. Changes to the initial cluster centers and other improvements to the data will not change this process; they will certainly change the number of iterations we need to perform, but that is a task-specific aspect, not a property of MapReduce. (A minimal sketch of this map/reduce structure is given below.)
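For readers who want to see that structure concretely, here is a minimal sketch of one Kmeans iteration as a map/reduce pair using the old Hadoop API. This is not the actual Hadoop or CGL-MapReduce code from the paper; the point format ("x,y" text lines), the hard-coded centers, and the class names are all simplified and hypothetical. The mapper assigns each 2D point to its nearest current center, and the reducer averages the points assigned to each center to produce the new center:

// Hypothetical sketch of one Kmeans iteration with the old Hadoop mapred API.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KmeansIteration {

  // Map: emit (index of nearest center, point) for each input point "x,y".
  public static class AssignMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centers;                    // current centers, loaded once per map task

    public void configure(JobConf conf) {
      // In a real job the centers would come from HDFS or the distributed cache;
      // they are hard-coded here only to keep the sketch self-contained.
      centers = new double[][] { {0, 0}, {10, 10}, {-10, 5} };
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = value.toString().split(",");
      double x = Double.parseDouble(parts[0]);
      double y = Double.parseDouble(parts[1]);
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centers.length; i++) {
        double dx = x - centers[i][0], dy = y - centers[i][1];
        double d = dx * dx + dy * dy;              // squared Euclidean distance
        if (d < bestDist) { bestDist = d; best = i; }
      }
      out.collect(new IntWritable(best), value);
    }
  }

  // Reduce: average all points assigned to a center to produce the new center.
  public static class RecomputeReducer extends MapReduceBase
      implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable centerId, Iterator<Text> points,
                       OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      double sumX = 0, sumY = 0;
      long n = 0;
      while (points.hasNext()) {
        String[] parts = points.next().toString().split(",");
        sumX += Double.parseDouble(parts[0]);
        sumY += Double.parseDouble(parts[1]);
        n++;
      }
      out.collect(centerId, new Text((sumX / n) + "," + (sumY / n)));
    }
  }
}

A driver then runs this job repeatedly, feeding each iteration's output centers back in until the centers stop moving; re-running a failed map task simply re-emits the same (center, point) pairs, so the final result is unchanged.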
Also, we compared the MapReduce results with MPI to show that, given larger data sets and more compute-intensive operations, most of these systems converge in performance. So we did not try to mimic the classical parallel computing methods using MapReduce.
One last thought: things are very straightforward when we apply this technique to text data sets, but once we try to apply it to scientific applications, where different data formats and different languages come into play, things get complicated. Overall, though, it is a very easy technique to apply compared to classic parallel programming techniques such as MPI.
Hope this explains our motivation.
Thanks,
Jaliya
From:
cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On
Behalf Of Roderick Flores
Sent: Sunday, August 31, 2008 2:57 PM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?
On Tue, Aug 26, 2008 at 9:57 PM, Jaliya Ekanayake <jeka...@cs.indiana.edu> wrote:
Chris, excellent post on that issue.
HA for the name node likely wouldn't have helped in the vast majority
of the failure cases I've seen. Typically, our name node "failures"--
really full garbage collections or the name node getting swapped to
disk--are related to bad user code (too many small file creates), bad
tuning (not enough heap/memory), or full file system (not enough free
blocks on the local disk on the data nodes, too many files in the fs,
etc). HAing the name node could have easily turned into a ping pong
situation.
We've had exactly one hardware related failure in recent memory where
HA might have helped, but in the end didn't really matter. The name
node stayed up due to redundant name dirs, including one on NFS. [We
did, however, find a bug with the secondary doing Bad Things(tm) in
this particular failure case though.] But user jobs continued, albeit
a tad slower.
3. An executor interface that enables you to "ship" code to a cluster
of machines and execute that code in parallel.
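To make that quoted idea concrete, here is a hypothetical sketch (not any particular product's API) of the shape such an executor interface could take in Java, modeled loosely on java.util.concurrent but with the tasks shipped to remote machines:

// Hypothetical "cluster executor" interface: ship code to a cluster and run it in parallel.
import java.io.Serializable;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public interface ClusterExecutor {
  // Submit one serializable task to run somewhere on the cluster.
  // (The Callable itself must be Serializable so it can be shipped.)
  <T extends Serializable> Future<T> submit(Callable<T> task);

  // Broadcast the same task to every node and collect one result per node.
  <T extends Serializable> List<Future<T>> broadcast(Callable<T> task);
}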
Thanks for the great links -
An "executor interface" is close to a question I asked in a recent blog post:
http://ceteri.blogspot.com/2008/08/hadoop-in-cloud-patterns-for-automation.html
For example, you may have large Hadoop jobs as important batch
processes which are best to run on elastic resources. What framework
gets used to launch and monitor those resources, then pull back
results after the job has completed? Can I specify to that framework
some tagged build from my SVN repository? How about data which fits
best in a distributed cache, not in HDFS?
If the batch process is critical to my business, how do I automate it
- like in a crontab? As far as I can tell, none of the cloud
providers address this kind of requirement. Hadoop does not include
those kinds of features.
From what I can see, RightScale probably comes closest to providing
services in that area.
Paco
Nitpick: video is only "not easily partitioned" if one excludes the possibility of using high dimensionality spaces to represent it. This is only a problem if one is tacitly assuming a "points on a line" representation (which most systems do) but that assumption is not strictly required, merely common and well-understood.
-Andrew
" For example, you may have large Hadoop jobs as important batch
processes which are best to run on elastic resources. What framework
gets used to launch and monitor those resources, then pull back
results after the job has completed?"
"Can I specify to that framework
some tagged build from my SVN repository? How about data which fits
best in a distributed cache, not in HDFS?"
I'm not aware of ways to address all those requirements with Hadoop, but there
are other grid platform alternatives that let you do that. Some are provided as
open source and some are not. For example, the Fura system provides a batch
scheduler in Java, and so does JPPF, another FOSS grid framework that provides
similar types of grid schedulers. On the commercial side, DataSynapse and
Platform Computing are probably the leading vendors. Both provide ways to
monitor the jobs and get the results at some later stage. With a
JavaSpace-based solution, each job is normally tagged with a job ID as part of
the Entry attributes, which is used to match all the results belonging to that
job at some other time (a small sketch of this pattern follows below). With
JavaSpaces you can also push the job code to the compute node dynamically using
dynamic code downloading. As for pointing data to a cache, that becomes more an
attribute of the task; i.e., with any of those frameworks the task is code that
you write in Java, and you can point it to any data source that you like,
whether it is a cache or a file system.
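As a concrete illustration of the Entry-tagging pattern mentioned above, here is a minimal sketch against the plain JavaSpaces (Jini) API. The TaskEntry and ResultEntry classes and their field names are hypothetical; the write/take calls are the standard JavaSpace operations. The jobId field is what lets a client later take only the results that belong to its own job:

// Hypothetical JavaSpaces sketch: tag tasks and results with a jobId attribute.
import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class JobIdExample {
  // Entries need public fields and a public no-arg constructor.
  public static class TaskEntry implements Entry {
    public String jobId;
    public String payload;
    public TaskEntry() {}
    public TaskEntry(String jobId, String payload) { this.jobId = jobId; this.payload = payload; }
  }

  public static class ResultEntry implements Entry {
    public String jobId;
    public String result;
    public ResultEntry() {}
    public ResultEntry(String jobId, String result) { this.jobId = jobId; this.result = result; }
  }

  // Client side: write the tasks, then collect only the results carrying this jobId.
  public static void runJob(JavaSpace space, String jobId, String[] payloads) throws Exception {
    for (String p : payloads) {
      space.write(new TaskEntry(jobId, p), null, Lease.FOREVER);
    }
    ResultEntry template = new ResultEntry(jobId, null);   // null fields act as wildcards
    for (int i = 0; i < payloads.length; i++) {
      ResultEntry r = (ResultEntry) space.take(template, null, Long.MAX_VALUE);
      System.out.println("result for " + jobId + ": " + r.result);
    }
  }
  // A worker does the mirror image: take a TaskEntry template, compute,
  // and write back a ResultEntry carrying the same jobId.
}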
The main difference between those alternatives and Hadoop is that Hadoop is
more geared to parallel aggregation over large distributed file systems rather
than being a generic job execution framework. On the other hand, most of the
grid frameworks that I mentioned are geared toward batch execution rather than
synchronous aggregated execution. GridGain is a relatively new and interesting
player in that space that seems to compete more with the Hadoop model than with
the classic batch-processing grid providers, and since it is Java based it
brings some of the simplicity and code-distribution benefits that exist with
the other Java-based solutions.
Having said all that, the requirements that you mentioned indicate some of the
limitations of having a specialized framework such as Hadoop that is narrowed
to a specific scenario. If you want to do something slightly different, such as
a parallel batch job as opposed to a parallel aggregated job, you probably need
to use a totally different solution.
One of the nice things with the Space model is that you can easily use it to
address both scenarios. We extended the JavaSpace model and, beyond the API
abstraction that I mentioned earlier, we added support for monitoring and the
ability to support both asynchronous batch processing and synchronous
aggregated operations, where you write a set of tasks in parallel and block for
the immediate results, similar to the way you would do it with Hadoop (see a
reference here:
http://wiki.gigaspaces.com/display/GS66/OpenSpaces+Core+Component+-+Executors).
Since we use the space as both the data layer and the execution layer, you get
data affinity pretty much built into the model; i.e., you can decide to route
the task to where the data is and save the overhead associated with moving that
data or accessing it over the network. We also provide built-in support for
executing Groovy, JRuby and other dynamic-language tasks, which gives further
dynamic capabilities. One common use case is using dynamic languages as a new
form of stored procedure (in a way, you can think of SQL as a specialized case
of a dynamic language as well). And the nice thing is that it all runs on EC2
today.
Nati S.
GigaSpaces.
-----Original Message-----
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of Paco NATHAN
Sent: Tuesday, September 02, 2008 5:48 AM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?