A friend of mine wrote on this topic recently:
http://gridgurus.typepad.com/grid_gurus/2008/04/the-mapreduce-p.html
rw2
Hi All,
I am a graduate student at Indiana University Bloomington.
In my current research work, I have analyzed the MapReduce technique using two scientific applications. I have used Hadoop and also CGL-MapReduce (a streaming-based MapReduce implementation that we developed) for this analysis.
Since the MapReduce discussion is going on, I thought I would share my experience with the technique with you all.
So far, I have implemented two scientific applications in the MapReduce programming model:
1. High Energy Physics (HEP) data analysis involving large volumes of data (I have tested up to 1 terabyte)
2. Kmeans clustering, to cluster a large number of 2D data points into a predefined number of clusters.
The first application is a data/computation-intensive application that passes the data through a single MapReduce computation cycle. The second is an iterative application that requires multiple MapReduce cycles before the final result is obtained.
I implemented the HEP data analysis using Hadoop and CGL-MapReduce and compared their performance.
I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compared their performance.
I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
The conclusions we have drawn so far from our experiments are listed below. (The following documents contain a detailed description of the work.)
[Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
http://jaliyacgl.blogspot.com/2008/08/high-energy-physics-data-analysis-using.html
1. The MapReduce technique scales well for large data/compute-intensive applications and is easy to program.
2. Google's approach (also Hadoop's approach) of transferring intermediate results via a distributed file system adds a large overhead to the computation. However, the effect of this overhead is minimal for very large data/compute-intensive applications. (Please see Figures 6, 7, and 8 of the above paper.)
3. The file-system-based communication overhead is prohibitively large in iterative MapReduce computations. (Please see Figure 9 of the above paper.) There is a Hadoop subproject named Mahout that intends to support iterative MapReduce computations, but since it still runs on Hadoop it will incur these overheads as well. (A driver-loop sketch after this list shows where the per-iteration cost comes in.)
4. Storing and accessing data in a specialized distributed file system works well when the data is in text format and the Map/Reduce functions are written in a language that has APIs for accessing that file system. For binary data, or for Map/Reduce functions written in other languages, this is not a feasible solution; a data catalog would be a better fit for these more generic use cases.
5. Multi-threaded applications run faster on multicore platforms than MapReduce or other parallelization techniques targeted at distributed memory. Of course, MapReduce is simpler to program :)
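To make point 3 concrete, here is a minimal sketch (this is not our actual Hadoop or CGL-MapReduce code; the class name, paths, and configuration key are hypothetical) of the driver loop an iterative algorithm such as Kmeans needs when every iteration is a separate Hadoop job. Each pass pays the full job-startup cost and writes and re-reads its intermediate results (the cluster centers) through the distributed file system:

// Hypothetical driver: one full Hadoop job per Kmeans iteration (old mapred API).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IterativeKmeansDriver {
  public static void main(String[] args) throws Exception {
    String centers = "hdfs:///kmeans/centers-0";      // initial centers (hypothetical path)
    for (int i = 0; i < 20; i++) {                    // fixed iteration count for the sketch
      JobConf conf = new JobConf(IterativeKmeansDriver.class);
      conf.setJobName("kmeans-iteration-" + i);
      conf.set("kmeans.centers.path", centers);       // the (omitted) mapper reads the current centers from here
      FileInputFormat.setInputPaths(conf, new Path("hdfs:///kmeans/points"));
      Path out = new Path("hdfs:///kmeans/centers-" + (i + 1));
      FileOutputFormat.setOutputPath(conf, out);
      // conf.setMapperClass(...); conf.setReducerClass(...); omitted for brevity
      JobClient.runJob(conf);                         // blocks: job startup + HDFS writes, every iteration
      centers = out.toString();                       // the next iteration re-reads the centers from HDFS
    }
  }
}

This repeated job startup and file-system round trip is the overhead behind point 3, and avoiding it is essentially the motivation for the streaming approach in CGL-MapReduce.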
Any thoughts and comments are welcome!
Thanks,
Jaliya
From:
cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On
Behalf Of Chris K Wensel
Sent: Tuesday, August 26, 2008 7:56 PM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?
Hey all
It might be interesting to do a quick test and see what the performance
overhead of using virtual servers is, compared to physical servers.
--
Jim Starkey
President, NimbusDB, Inc.
978 526-1376
Yes, we are planning to do a thorough test as well. I have some initial results on
the overhead of virtual machines for MPI.
This is what we did.
We used the same MPI program that is used for the Kmeans clustering and modified
it so that it iterates a given number of times at each MPI communication
routine. Then we measured the performance on a VM setup (using the "Nimbus"
cloud) and on an exactly similar set of machines without VMs.
The results are shown in the following graphs.
http://www.cs.indiana.edu/~jekanaya/cloud/avg_kmeans_time.png
http://www.cs.indiana.edu/~jekanaya/cloud/vm_no_vm_min_max.png
The VMs introduced a high overhead to the MPI communication. We did a few tests,
changing configurations such as the number of processor cores used in a VM and
the number of VMs used, but still could not pinpoint the reason for this high
overhead. It could be overhead induced by the network emulator or the network
configuration used in Nimbus. We are setting up a cloud at IU, and once it is
done we will be able to verify these figures.
Thanks,
Jaliya
> -----Original Message-----
> From: cloud-c...@googlegroups.com [mailto:cloud-
> comp...@googlegroups.com] On Behalf Of Jim Starkey
> Sent: Wednesday, August 27, 2008 9:45 AM
> To: cloud-c...@googlegroups.com
> Subject: Re: Is Map/Reduce going mainstream?
Greenplum and Aster Data both recently announced MapReduce within SQL
for software-only distributed data warehouses on commodity hardware --
highly parallel, fault tolerant -- great arguments. I've talked with
both vendors. These directions are great to see for several reasons,
and I'll likely become a paying customer. There are cautions as well
for other reasons.
Going beyond the "It's hard to think in MapReduce" point which
GridGurus made ... As an engineering manager at a firm invested in
using lots of SQL, lots of Hadoop, lots of other code written in
neither of the above, and many terabytes of data added each week -- my
primary concern is more about how well a *team* of people can "think"
within a paradigm over a *long* period of time.
Much of what I've seen discussed about Hadoop seems based around an
individual programmer's initial experiences. That's great to see.
However, what really sells a paradigm is when you can evaluate from a
different perspective, namely hindsight.
Consider this as an exercise: imagine that a programmer wrote 1000
lines of code for a project at work based on MapReduce, the company
has run that code in some critical use case for 2-3 years, the
original author moved to another company (or the company was
acquired), and now a team of relatively new eyeballs must reverse
engineer the existing MapReduce code and extend it.
A software lifecycle argument is not heard so much in discussions
about MapReduce, but IMHO it will become more crucial for decision
makers as vendors move to providing SQL, etc.
As a different example, I don't need to argue with exec mgmt or board
members that reworking 10000 lines of PHP with SQL embedded is going
to cost more than reworking the same app if it'd been written in Java
and Hibernate. Or having it written in Python, or whatever other
more-structured language makes sense for the app. The business
decision makers already understand those costs, all too well.
Keep in mind that a previous generation of data warehousing (e.g.,
Netezza) promised parallel processing based on SQL-plus-extensions
atop custom hardware -- highly-parallel, somewhat less fault-tolerant,
but still an advancement. IMHO, what we see now from Greenplum, Aster
Data, Hive, etc., represents a huge leap beyond that previous
generation. Particularly for TCO.
Pros: if you have terabytes of data collected from your different
business units, then those business units will want to run ad-hoc
queries on it, and they won't have time/expertise/patience to write
code for Hadoop. I'd rather give those users means to run some SQL and
be done -- than disrupt my team from core work on algorithms.
Cons: if you have a complex algorithm which is part of the critical
infrastructure and must be maintained over a long period of time by a
team of analysts and developers, then RUN AWAY, DO NOT WALK, from
anyone trying to sell you a solution based in SQL.
I'd estimate the cost of maintaining substantial apps in SQL to be
10-100x that of more structured languages used in Hadoop: Java, C++,
Python, etc. That's based on encountering the problem over and over
for the past 10 years. Doubtless there are studies by people with
much broader views and better data, much to the same effect.
The challenge to Aster Data, Greenplum, et al., will be to provide
access to their new systems from within languages like Java, Python,
etc. Otherwise I see them relegated to the land of ad-hoc queries,
being eclipsed by coding paradigms which are less abstracted or more
geared toward software costs long-term.
Pig provides an interesting approach, and I'm sure that it will be
useful for many projects. It's not clear that it provides the answers to
software lifecycle issues.
From what I've seen through the years in other areas of large scale
data, such as ETL and analytics, first off the data is not
particularly relational.
The concept of workflows seems much more apt -- more likely to become
adopted by the practitioners in quantity once we get past the "slope
of enlightenment" phase of the hype cycle.
Our team has been considering a shift to Cascading. I'd much rather
train a stats PhD on Cascading than have them slog through 50x the lines
of SQL written by the previous stats PhD.
Having access to DSLs from within application code -- e.g., R for
analytics, Python for system programming, etc. -- would also be far
preferable, from my perspective, to MapReduce within SQL.
Paco
This is very interesting work.
> The [HEP] application is data/computation intensive application that
> requires passing the data through a single MapReduce computation cycle.
What HEP application were you using? Was it doing Monte Carlo generation
(MC), reconstruction (reco), analysis, or a mixture of these?
> I implemented the HEP data analysis using Hadoop and CGL-MapReduce and
> compare their performances [...]
Did you (or are you planning to) compare this with PROOF:
http://root.cern.ch/twiki/bin/view/ROOT/PROOF
> [Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
Ta for the link.
Cheers,
Paul.
Thanks :)
> > The [HEP] application is data/computation intensive application that
> > requires passing the data through a single MapReduce computation
> cycle.
>
> What HEP application were you using? Was it doing Monte Carlo
> generation
> (MC), reconstruction (reco), analysis, or a mixture of these?
>
We used Monte Carlo data to search for the Higgs.
The analysis code and the data are both from the High Energy Physics group at
Caltech, and we just used it as a use case for our MapReduce testing.
>
>
> > I implemented the HEP data analysis using Hadoop and CGL-MapReduce
> and
> > compare their performances [...]
>
> Did you (or are you planning to) compare this with PROOF:
> http://root.cern.ch/twiki/bin/view/ROOT/PROOF
>
This would be an interesting comparison. However, PROOF is only for parallelizing
ROOT applications.
Right now our focus is on a generalized framework for supporting these data
analyses.
>
> > [Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
>
> Ta for the link.
Thanks,
Jaliya
>
> Cheers,
>
> Paul.
>
Therefore, what I suggest is that once we have a parallel implementation
(MapReduce or any other), we need to have at least a rough idea of the
efficiency (or speedup) achieved by our solution and see how much overhead
is introduced by the parallelization technique itself.
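For reference, with p processing units, T(1) the sequential running time, and T(p) the parallel running time, the standard definitions are:

  speedup     S(p) = T(1) / T(p)
  efficiency  E(p) = S(p) / p
  overhead    f(p) = (p * T(p) - T(1)) / T(1)

A perfectly parallel run gives S(p) = p, E(p) = 1, and f(p) = 0; whatever gap we measure is the cost introduced by the parallelization technique itself.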
Thanks
Jaliya
> -----Original Message-----
> From: cloud-c...@googlegroups.com [mailto:cloud-
> comp...@googlegroups.com] On Behalf Of Jeff He
> Sent: Thursday, August 28, 2008 11:23 PM
> To: Cloud Computing
> Subject: Re: Is Map/Reduce going mainstream?
>
>
--
Geoffrey Fox  g...@indiana.edu  http://www.infomall.org
Phones: Cell 812-219-4643, Home 812-323-9196, Lab 812-856-7977, FAX 812-856-7972
SkypeIn 812-669-0772 (with voicemail), International cell 812-391-0207
Yes, MapReduce will work in both situations. However, my point is that the
"exponential performance advantage" that we expect to achieve may not be
possible unless we use the right implementation of MapReduce, one that is
suitable for the class of applications.
Thanks
Jaliya
Hence, to say that MapReduce is "going mainstream" is only relative to the fact
that the size of the "market" for such applications is growing.
--Craig
Most certainly so, but it is also abused in inappropriate applications. It
is occasionally used, for example, to parallelize a search of a large table.
A smarter way to handle this would be an index. So let's be wary of a
solution looking for a problem....
That said, how do you build a house in the MapReduce model? Filtering, aggregation, and functional conversions are very simple in MapReduce, but typical workloads are more complex and end up being 2-5-20 individual MapReduce jobs chained together by dependencies (many running in parallel). Thinking that deeply in MapReduce is not trivial, and the resulting code base is not simple (see Google's Sawzall papers). If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher-level abstraction that users and developers can reason in.
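To make the point about chained jobs concrete, here is a minimal sketch of roughly the level a developer works at today with plain Hadoop, using the old-API JobControl class to express dependencies between jobs. The job names and the (omitted) per-job configuration are hypothetical, and this is not what Greenplum or anyone else ships; it just shows how little abstraction the raw model provides:

// Hypothetical example of wiring three dependent Hadoop jobs with JobControl.
import java.util.ArrayList;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    JobConf extractConf = new JobConf();   // mappers/reducers/paths omitted for brevity
    JobConf joinConf    = new JobConf();
    JobConf reportConf  = new JobConf();

    Job extract = new Job(extractConf);                   // no dependencies
    ArrayList<Job> afterExtract = new ArrayList<Job>();
    afterExtract.add(extract);
    Job join = new Job(joinConf, afterExtract);           // runs after extract
    ArrayList<Job> afterJoin = new ArrayList<Job>();
    afterJoin.add(join);
    Job report = new Job(reportConf, afterJoin);          // runs after join

    JobControl control = new JobControl("nightly-chain");
    control.addJob(extract);
    control.addJob(join);
    control.addJob(report);

    new Thread(control).start();                          // JobControl implements Runnable
    while (!control.allFinished()) {
      Thread.sleep(5000);                                 // poll until the whole chain completes
    }
    control.stop();
  }
}

All of this is still per-job plumbing; the business logic lives in the mappers and reducers behind each JobConf, which is exactly why higher-level layers such as Pig, Cascading, or SQL front ends keep appearing.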
-----Original Message-----
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of Jeff He
Sent: Friday, 29 August 2008 1:23 PM
To: Cloud Computing
Subject: Re: Is Map/Reduce going mainstream?
Sassa
2008/8/29 Jaliya Ekanayake <jeka...@cs.indiana.edu>:
Sassa
2008/8/27 Jaliya Ekanayake <jeka...@cs.indiana.edu>:
Horses for courses. There is discrete data and continuous data. Discrete data can be easily partitioned, and I see MapReduce as overkill for this scenario (e.g., student exam results). However, continuous data (e.g., a video stream) is not easily partitioned and does require stuff like this.
Having written a graph/map API, I want to see more developers making use of graph theory, though I think MapReduce might be the wrong product for you :)
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of Roderick Flores
Sent: Sunday, 31 August 2008 4:35 AM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?
On Tue, Aug 26, 2008 at 5:56 PM, Chris K Wensel <ch...@wensel.net> wrote:
Thanks,
Jaliya
> -----Original Message-----
> From: cloud-c...@googlegroups.com [mailto:cloud-
> comp...@googlegroups.com] On Behalf Of Sassa NF
I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compared their performance.
I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
It seems like you have not understood why we use Kmeans in this context. As you mentioned, there are numerous improvements one can make to the data and the initial set of clusters in Kmeans and in most other clustering algorithms. We selected Kmeans because it is a simple example that shows the use of “iterative computations using MapReduce”.
We performed the Kmeans algorithm exactly as you mentioned. All the map tasks compute the distances between some of the points and the set of cluster centers, and once all of them have finished, a set of reduce tasks calculates the new cluster centers. So we preserve the “MapReduce” paradigm in both our algorithm and the implementation. This also allows any fault-tolerance features of the implementation to work: the result will not change if a few map tasks are re-executed after failures. Changes to the initial cluster centers and other improvements to the data will not change this process; they will certainly change the number of iterations we need to perform, but that is a task-specific aspect, not a property of MapReduce. (A minimal sketch of this map/reduce structure is given below.)
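For readers who want to see that structure concretely, here is a minimal sketch of one Kmeans iteration as a map/reduce pair using the old Hadoop API. This is not the actual Hadoop or CGL-MapReduce code from the paper; the point format ("x,y" text lines), the hard-coded centers, and the class names are all simplified and hypothetical. The mapper assigns each 2D point to its nearest current center, and the reducer averages the points assigned to each center to produce the new center:

// Hypothetical sketch of one Kmeans iteration with the old Hadoop mapred API.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KmeansIteration {

  // Map: emit (index of nearest center, point) for each input point "x,y".
  public static class AssignMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centers;                    // current centers, loaded once per map task

    public void configure(JobConf conf) {
      // In a real job the centers would come from HDFS or the distributed cache;
      // they are hard-coded here only to keep the sketch self-contained.
      centers = new double[][] { {0, 0}, {10, 10}, {-10, 5} };
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = value.toString().split(",");
      double x = Double.parseDouble(parts[0]);
      double y = Double.parseDouble(parts[1]);
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centers.length; i++) {
        double dx = x - centers[i][0], dy = y - centers[i][1];
        double d = dx * dx + dy * dy;              // squared Euclidean distance
        if (d < bestDist) { bestDist = d; best = i; }
      }
      out.collect(new IntWritable(best), value);
    }
  }

  // Reduce: average all points assigned to a center to produce the new center.
  public static class RecomputeReducer extends MapReduceBase
      implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable centerId, Iterator<Text> points,
                       OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      double sumX = 0, sumY = 0;
      long n = 0;
      while (points.hasNext()) {
        String[] parts = points.next().toString().split(",");
        sumX += Double.parseDouble(parts[0]);
        sumY += Double.parseDouble(parts[1]);
        n++;
      }
      out.collect(centerId, new Text((sumX / n) + "," + (sumY / n)));
    }
  }
}

A driver then runs this job repeatedly, feeding each iteration's output centers back in until the centers stop moving; re-running a failed map task simply re-emits the same (center, point) pairs, so the final result is unchanged.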
Also, we compared the MapReduce results with MPI to show that, given larger data sets and more compute-intensive operations, most of these systems converge in performance. So we did not try to mimic the classical parallel computing methods using MapReduce.
One last thought: things are very straightforward when we apply this technique to text data sets, but once we try to apply it to scientific applications, where different data formats and different languages come into play, things get complicated. Overall, though, it is a very easy technique to apply compared to classic parallel programming techniques such as MPI.
Hope this explains our motivation.
Thanks,
Jaliya
From:
cloud-c...@googlegroups.com [mailto:cloud-c...@googlegroups.com] On
Behalf Of Roderick Flores
Sent: Sunday, August 31, 2008 2:57 PM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?
On Tue, Aug 26, 2008 at 9:57 PM, Jaliya Ekanayake <jeka...@cs.indiana.edu> wrote:
Chris, excellent post on that issue.
HA for the name node likely wouldn't have helped in the vast majority
of the failure cases I've seen. Typically, our name node "failures"--
really full garbage collections or the name node getting swapped to
disk--are related to bad user code (too many small file creates), bad
tuning (not enough heap/memory), or full file system (not enough free
blocks on the local disk on the data nodes, too many files in the fs,
etc). HAing the name node could have easily turned into a ping pong
situation.
We've had exactly one hardware related failure in recent memory where
HA might have helped, but in the end didn't really matter. The name
node stayed up due to redundant name dirs, including one on NFS. [We
did, however, find a bug with the secondary doing Bad Things(tm) in
this particular failure case though.] But user jobs continued, albeit
a tad slower.
3. An executor interface that enables you to "ship" code to a cluster
of machines and execute that code in parallel.
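To make that quoted idea concrete, here is a hypothetical sketch (not any particular product's API) of the shape such an executor interface could take in Java, modeled loosely on java.util.concurrent but with the tasks shipped to remote machines:

// Hypothetical "cluster executor" interface: ship code to a cluster and run it in parallel.
import java.io.Serializable;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public interface ClusterExecutor {
  // Submit one serializable task to run somewhere on the cluster.
  // (The Callable itself must be Serializable so it can be shipped.)
  <T extends Serializable> Future<T> submit(Callable<T> task);

  // Broadcast the same task to every node and collect one result per node.
  <T extends Serializable> List<Future<T>> broadcast(Callable<T> task);
}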
Thanks for the great links -
An "executor interface" is close to a question I asked in a recent blog post:
http://ceteri.blogspot.com/2008/08/hadoop-in-cloud-patterns-for-automation.html
For example, you may have large Hadoop jobs as important batch
processes which are best to run on elastic resources. What framework
gets used to launch and monitor those resources, then pull back
results after the job has completed? Can I specify to that framework
some tagged build from my SVN repository? How about data which fits
best in a distributed cache, not in HDFS?
If the batch process is critical to my business, how do I automate it
- like in a crontab? As far as I can tell, none of the cloud
providers address this kind of requirement. Hadoop does not include
those kinds of features.
From what I can see, RightScale probably comes closest to providing
services in that area.
Paco
Nitpick: video is only "not easily partitioned" if one excludes the possibility of using high dimensionality spaces to represent it. This is only a problem if one is tacitly assuming a "points on a line" representation (which most systems do) but that assumption is not strictly required, merely common and well-understood.
-Andrew
" For example, you may have large Hadoop jobs as important batch
processes which are best to run on elastic resources. What framework
gets used to launch and monitor those resources, then pull back
results after the job has completed?"
"Can I specify to that framework
some tagged build from my SVN repository? How about data which fits
best in a distributed cache, not in HDFS?"
I'm not aware of ways to address all those requirements with Hadoop, but there
are other grid platform alternatives that let you do that. Some are provided as
open source and some are not. For example, the Fura system provides a batch
scheduler in Java, and so does JPPF, another FOSS grid framework that provides
similar types of grid schedulers. On the commercial side, DataSynapse and
Platform Computing are probably the leading vendors. Both provide ways to
monitor the jobs and get the results at some later stage. With a
JavaSpace-based solution, each job is normally tagged with a job ID as part of
the Entry attributes, which is used to match all the results belonging to that
job at some other time (a small sketch of this pattern follows below). With
JavaSpaces you can also push the job code to the compute node dynamically using
dynamic code downloading. As for pointing data to a cache, that becomes more an
attribute of the task; i.e., with any of those frameworks the task is code that
you write in Java, and you can point it to any data source that you like,
whether it is a cache or a file system.
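As a concrete illustration of the Entry-tagging pattern mentioned above, here is a minimal sketch against the plain JavaSpaces (Jini) API. The TaskEntry and ResultEntry classes and their field names are hypothetical; the write/take calls are the standard JavaSpace operations. The jobId field is what lets a client later take only the results that belong to its own job:

// Hypothetical JavaSpaces sketch: tag tasks and results with a jobId attribute.
import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class JobIdExample {
  // Entries need public fields and a public no-arg constructor.
  public static class TaskEntry implements Entry {
    public String jobId;
    public String payload;
    public TaskEntry() {}
    public TaskEntry(String jobId, String payload) { this.jobId = jobId; this.payload = payload; }
  }

  public static class ResultEntry implements Entry {
    public String jobId;
    public String result;
    public ResultEntry() {}
    public ResultEntry(String jobId, String result) { this.jobId = jobId; this.result = result; }
  }

  // Client side: write the tasks, then collect only the results carrying this jobId.
  public static void runJob(JavaSpace space, String jobId, String[] payloads) throws Exception {
    for (String p : payloads) {
      space.write(new TaskEntry(jobId, p), null, Lease.FOREVER);
    }
    ResultEntry template = new ResultEntry(jobId, null);   // null fields act as wildcards
    for (int i = 0; i < payloads.length; i++) {
      ResultEntry r = (ResultEntry) space.take(template, null, Long.MAX_VALUE);
      System.out.println("result for " + jobId + ": " + r.result);
    }
  }
  // A worker does the mirror image: take a TaskEntry template, compute,
  // and write back a ResultEntry carrying the same jobId.
}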
The main difference between those alternatives and Hadoop is that Hadoop is
more geared to parallel aggregation over large distributed file systems rather
than being a generic job execution framework. On the other hand, most of the
grid frameworks that I mentioned are geared toward batch execution rather than
synchronous aggregated execution. GridGain is a relatively new and interesting
player in that space that seems to compete more with the Hadoop model than with
the classic batch-processing grid providers, and since it is Java based it
brings some of the simplicity and code-distribution benefits that exist with
the other Java-based solutions.
Having said all that, the requirements that you mentioned indicate some of the
limitations of having a specialized framework such as Hadoop that is narrowed
to a specific scenario. If you want to do something slightly different, such as
a parallel batch job as opposed to a parallel aggregated job, you probably need
to use a totally different solution.
One of the nice things with the Space model is that you can easily use it to
address both scenarios. We extended the JavaSpace model and, beyond the API
abstraction that I mentioned earlier, we added support for monitoring and the
ability to support both asynchronous batch processing and synchronous
aggregated operations, where you write a set of tasks in parallel and block for
the immediate results, similar to the way you would do it with Hadoop (see a
reference here:
http://wiki.gigaspaces.com/display/GS66/OpenSpaces+Core+Component+-+Executors).
Since we use the space as both the data layer and the execution layer, you get
data affinity pretty much built into the model; i.e., you can decide to route
the task to where the data is and save the overhead associated with moving that
data or accessing it over the network. We also provide built-in support for
executing Groovy, JRuby and other dynamic-language tasks, which gives further
dynamic capabilities. One common use case is using dynamic languages as a new
form of stored procedure (in a way, you can think of SQL as a specialized case
of a dynamic language as well). And the nice thing is that it all runs on EC2
today.
Nati S.
GigaSpaces.
-----Original Message-----
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of Paco NATHAN
Sent: Tuesday, September 02, 2008 5:48 AM
To: cloud-c...@googlegroups.com
Subject: Re: Is Map/Reduce going mainstream?