> -----Original Message----- > From: cloud-computing@googlegroups.com [mailto:cloud- > computing@googlegroups.com] On Behalf Of Sassa NF > Sent: Sunday, August 31, 2008 3:27 AM > To: cloud-computing@googlegroups.com > Subject: Re: Is Map/Reduce going mainstream?
> I presume, the VM setup was running on the same number of physical > nodes, as the no-VM setup, right?
> Sassa
> 2008/8/27 Jaliya Ekanayake <jekan...@cs.indiana.edu>: > >> It might be interesting to do a quick test and see what the > performance > >> overhead of using virtual servers compared to physical servers.
> > Yes, planning to do a thorough test as well. I have some initial > results on > > the overhead of virtual machines for MPI. > > This is what we did.
> > We used the same MPI program that is used for Kmeans clustering and > modified > > it so that it will iterate a given number times at each MPI > communication > > routine. Then we measure the performance on a VM setup (Using the > "Nimbus" > > cloud) and also on an exactly similar set of machines without VMs.
> > The VMs introduced high overhead to the MPI communication. We did few > tests > > by changing the configurations such as the number of processor cores > used in > > a VM and the number of VMs used etc.. but still could not exactly > figure out > > the reason for this high overhead. It could be the overhead induced > by the > > network emulator or the network configuration used in Nimbus. We are > setting > > up a Cloud in IU and once it is done, we will be able to verify these > > figures.
> > Thanks, > > Jaliya
> >> -----Original Message----- > >> From: cloud-computing@googlegroups.com [mailto:cloud- > >> computing@googlegroups.com] On Behalf Of Jim Starkey > >> Sent: Wednesday, August 27, 2008 9:45 AM > >> To: cloud-computing@googlegroups.com > >> Subject: Re: Is Map/Reduce going mainstream?
> >> Jaliya Ekanayake wrote:
> >> > Hi All,
> >> > I am a graduate student of the Indiana University Bloomington.
> >> > In my current research work, I have analyzed the MapReduce > technique > >> > using two scientific applications. I have used Hadoop and also > >> > CLG-MapReduce (A streaming based MapReduce implementation that we > >> > developed) for the aforementioned analysis.
> >> > Since the MapReduce discussion is going on, I thought of sharing > my > >> > experience with the MapReduce technique with you all.
> >> > So far, I have two scientific applications which I have > implemented > >> in > >> > MapReduce programming model.
> >> > 1. High Energy Physics(HEP) data analysis involving large > >> > volumes of data (I have tested up to 1 terabytes)
> >> > 2. Kmeans clustering to cluster large number of 2D data > points > >> > to predefined number of clusters.]
> >> > The first application is data/computation intensive application > that > >> > requires passing the data through a single MapReduce computation > >> > cycle. The second application is an iterative application that > >> > requires multiple execution of MapReduce cycles before the final > >> > result is obtained.
> >> > I implemented the HEP data analysis using Hadoop and CGL- > MapReduce > >> > and compare their performances.
> >> > I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, > and > >> > MPI and compare their performances.
> >> > I also performed the Kmeans clustering using CGL-MapReduce, MPI, > and > >> > Java threads on multi-core computers.
> >> It might be interesting to do a quick test and see what the > performance > >> overhead of using virtual servers compared to physical servers.
> >> -- > >> Jim Starkey > >> President, NimbusDB, Inc. > >> 978 526-1376
> --~--~---------~--~----~------------~-------~--~----~ > You received this message because you are subscribed to the Google > Groups "Cloud Computing" group. > To post to this group, send email to cloud-computing@googlegroups.com > To unsubscribe from this group, send email to > cloud-computing-unsubscribe@googlegroups.com > To post job listing, send email to j...@cloudjobs.net (position title, > employer and location in subject, description in message body) or visit > http://www.cloudjobs.net > For more options, visit this group at > http://groups.google.ca/group/cloud-computing?hl=en?hl=en > Posting guidelines: > http://groups.google.ca/group/cloud-computing/web/frequently-asked- > questions > -~----------~----~----~----~------~----~------~--~---
> -----Original Message----- > From: cloud-computing@googlegroups.com [mailto:cloud- > computing@googlegroups.com] On Behalf Of Sassa NF > Sent: Sunday, August 31, 2008 3:22 AM > To: cloud-computing@googlegroups.com > Subject: Re: Is Map/Reduce going mainstream?
> What is meant by "exponential performance advantage"?
> Sassa
> 2008/8/29 Jaliya Ekanayake <jekan...@cs.indiana.edu>: > >> While I agree with your thoughts, I think it is not I/O vs. > >> computational aspect that drives MapReduce - MapReduce will work > well > >> in both situations.
> > Yes, MapReduce will work in both situations. However, my point is > that the > > "exponential performance advantage" that we expect to achieve may not > be > > possible unless we use a right implementation of the MapReduce that > is > > suitable for the class of applications.
> > Thanks > > Jaliya
> >> Cheers > >> <k/>
> >> |-----Original Message----- > >> |From: cloud-computing@googlegroups.com [mailto:cloud- > >> |computing@googlegroups.com] On Behalf Of Jaliya Ekanayake > >> |Sent: Friday, August 29, 2008 7:38 AM > >> |To: cloud-computing@googlegroups.com > >> |Subject: RE: Is Map/Reduce going mainstream? > >> | > >> |IMO the MapReduce program paradigm simplifies the parallel > >> programming. > >> |There may be things that are hard to do in MapReduce compared to > the > >> |traditional parallel programming, but overall it will be a very > good > >> |alternative. However, it is always good to use the right > >> implementation > >> |of MapReduce for the tasks that we need to parallelize. For > example, > >> |although the Google's and Hadoop's way of doing MapReduce (I am not > >> |talking about the API rather the underlying architecture) suite > well > >> for > >> |heavy I/O bound problems, we cannot expect all the problems to be > in > >> |this nature. Some problems may have moderate amount of I/O > operations > >> |but heavy computational requirements. Some may require iterative > >> |application of MapReduce. The MapReduce programming model is > something > >> |we can apply to all these problems but, if we use a MapReduce > >> |implementation that is targeted for heavy I/O bound applications to > >> |solve a problem which is more computational/communication > intensive, > >> we > >> |may not get the parallel performance that we expected. > >> | > >> |Therefore, what I suggest is that once we have a parallel > >> implementation > >> |(MapReduce or any other ) we need to have at least a rough idea of > the > >> |efficiency( or speedup) achieved by our solution and see how much > >> |overhead is introduced by the parallelization technique itself. > >> |Thanks > >> |Jaliya > >> | > >> | > >> |> -----Original Message----- > >> |> From: cloud-computing@googlegroups.com [mailto:cloud- > >> |> computing@googlegroups.com] On Behalf Of Jeff He > >> |> Sent: Thursday, August 28, 2008 11:23 PM > >> |> To: Cloud Computing > >> |> Subject: Re: Is Map/Reduce going mainstream? > >> |> > >> |> > >> |> Our team have used the hadoop stuff to implement some basic > spatial > >> |> location related algorithms, e.g., the 2D-convex hull, and now > some > >> |> more advanced. > >> |> > >> |> >From the view of an ex "traditional" parallel program developer > >> with > >> |> message-passing stuffs, I dont' think the map/reduce is a totally > >> |> different animal from MPI/PVM/MW these HPC stuffs. Coz the born- > >> |> tightly combination with distributed file&data I/O layer, > automatic > >> |> task-availability assurance, simpler API, and the most important > - > >> |> Disk IO bottleneck in a single box, make it most popular among > >> |> developers, thought I would still say, Condor/PBS/SGE + MPI + > MPIO + > >> |> some parallel filesystems over a COTS cluster can do almost the > same > >> |> job because of the divide-and-conquer nature in some of the > workload > >> - > >> |> It seems somethings are invented again! :) > >> |> > >> |> However, from the pointview of programming language, LINQ, > Swazzle, > >> |> and PIG are driving more adoption of functional-style languages > in > >> |> parallel/performance computing. The are quite different from > those > >> of > >> |> HPCs! > >> |> > >> |> Comments r welcome! :) > >> |> > >> |> > >> |> On Aug 28, 9:53 pm, "Jaliya Ekanayake" <jekan...@cs.indiana.edu> > >> |> wrote: > >> |> > > using > >> |> > > > two scientific applications. I have used Hadoop and also > CLG- > >> |> > > MapReduce (A > >> |> > > > streaming based MapReduce implementation that we developed) > >> for > >> |> the > >> |> > > > aforementioned analysis. > >> |> > > >> |> > > This is very interesting work. > >> |> > > >> |> > Thanks :) > >> |> > > >> |> > > > The [HEP] application is data/computation intensive > >> application > >> |> that > >> |> > > > requires passing the data through a single MapReduce > >> computation > >> |> > > cycle. > >> |> > > >> |> > > What HEP application were you using? Was it doing Monte > Carlo > >> |> > > generation (MC), reconstruction (reco), analysis, or a > mixture > >> of > >> |> > > these? > >> |> > > >> |> > We used Monte Carlo data to find "higgs". > >> |> > The analyses code and the data are both from the High Energy > >> Physics > >> |> group @ > >> |> > Caltech, and we just used it as a use case for our MapReduce > >> |testing. > >> |> > > >> |> > > > I implemented the HEP data analysis using Hadoop and CGL- > >> |> MapReduce > >> |> > > and > >> |> > > > compare their performances [...] > >> |> > > >> |> > > Did you (or are planning to) compare this with PROOF: > >> |> > > http://root.cern.ch/twiki/bin/view/ROOT/PROOF > >> |> > > >> |> > This will be a interesting comparison. PROOF is only for > >> |> parallelizing ROOT > >> |> > applications. > >> |> > Right now our focus is on the generalized framework for > supporting > >> |> these data > >> |> > analyses. > >> |> > > >> |> > > >> |> > > >> |> > > > [Draft Paper]http://www.cs.indiana.edu/~jekanaya/draft.pdf > >> |> > > >> |> > > Ta for the link. > >> |> > > >> |> > Thanks, > >> |> > Jaliya > >> |> > > >> |> > > >> |> > > >> |> > > Cheers, > >> |> > > >> |> > > Paul. > >> |> > > >> |> > > > > >> |> > > >> |> > > >> |> > smime.p7s > >> |> > 5KViewDownload > >> |> > >> |>
> --~--~---------~--~----~------------~-------~--~----~ > You received this message because you are subscribed to the Google > Groups "Cloud Computing" group. > To post to this group, send email to cloud-computing@googlegroups.com > To unsubscribe from this group, send email to > cloud-computing-unsubscribe@googlegroups.com > To post job listing, send email to j...@cloudjobs.net (position title, > employer and location in subject, description in message body) or visit > http://www.cloudjobs.net > For more options, visit this group at > http://groups.google.ca/group/cloud-computing?hl=en?hl=en > Posting guidelines: > http://groups.google.ca/group/cloud-computing/web/frequently-asked- > questions > -~----------~----~----~----~------~----~------~--~---
On Tue, Aug 26, 2008 at 9:57 PM, Jaliya Ekanayake <jekan...@cs.indiana.edu>wrote:
> I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI > and compare their performances.
> I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java > threads on multi-core computers.
Some thoughts… Your work seems more like a standard HPC distribution using a MapReduce framework to me -- which is not to say that it isn't an excellent concept. However, my take on the MapReduce algorithm is that it is largely asynchronous by which I mean that once the problem is "mapped" the calculations that are preformed on each node are independent of any of the others until the "reduce" begins.
Specifically, one of the core assumptions of the approach is that compute nodes are unreliable. Any set of nodes might fail to return a timely result to the reduction bits of the program. Subsequently, the calculations expected from those nodes will be rescheduled on different hardware. Once you introduce this expectation, using any MPI implementation difficult could easily result in the loss of the work done on all of your nodes because of one fairly likely failure. I am not suggesting that this assumption is always present; rather I believe that it is a key differentiator between distributed computing on a dedicated cluster and the more loose utility computing models that seem to characterize the cloud.
As I mentioned in my preceding post, there are a number of complexities and operational constraints that we should consider when build these sorts of applications. Of course, once we take on these restrictions they seriously complicate matters for the developer. First of all, we have to figure out how to break a problem down into distinct pieces. Next we then need to decide how the results can be reduced into something digestible by the user. The main problem here is that -- as discussed in the Grid Gurus post -- you have to know a lot about the data with which you are working.
In particular, there is a solid chance that any map operations you come up might well yield nondeterministic results sets. Sadly, this violates one of the original definitions of the MapReduce problem space. For example, take the K-means algorithm that you are developing. If I take a data set and toss it out to a set of 'n' compute nodes, I will be creating 'n' times 'k' clusters that, depending upon the sampling bias, will probably not reduce into the 'k' clusters we expected to produce on a single node. Moreover, if you randomly resample the data, you will most likely get significantly different results.
Should we be forced to only run these algorithms on very stable clusters rather than on groups of loosely coupled machines on the cloud to avoid nondeterministic results? I have put a bit of thought into this over the last year or so and I don't think this is a significant problem. Rather, I see it as an opportunity, not only as an exercise in advancing distributed algorithms, but in producing a better K-means result set (or agglomerative or any other clustering for that matter).
We already know that many clustering results are heavily dependent on data: introducing random errors or dropping data often produces a very different result. In K-means, altering 'k' can produce dramatically different results. In agglomerative approaches, the order that data is presented changes the results. If your distance measure is fairly complex, the clusters may respond in unpredictable ways. Moreover, I am sure there are other subtleties that I am ignoring. Basically, these algorithms are pretty nondeterministic to begin with so distributing them asynchronously should not be our primary concern.
Which leads me to ask, what do we do to take these complicating factors into account? I don't believe that many people bootstrap their cluster error distributions by data replacement, reordering, or data removal. I am equally unsure whether people put slight changes in their distance measures to test the cluster sensitivity to either their algorithm or numerical calculations. Rather, they do some statistics on the distribution of the cluster members from its center. Switching to a MapReduce implementation of these algorithms not only offers us the chance to reduce the response time, it provides us with the opportunity to actually understand the data -- which is really the point. Cheers!
It seems like you have not understand why we use Kmeans in this context. As you have mentioned there are numerous improvements one can perform to the data and the initial set of clusters in Kmeans and most other clustering algorithms. We selected Kmeans as it is a simple example to show the usage of "iterative computations using MapReduce"
We performed the Kmeans algorithm exactly as you have mentioned. All the map tasks compute the distances between some points and the set of cluster centers and once all of them are finished, a set of reduce tasks start calculating the new cluster centers. So we preserve the "MapReduce" paradigm in our algorithm and the implementation. Also this allows any fault tolerance features of the implementation and the result will not change if few Map tasks are re-executed after failures. The changes to the initial cluster centers and other improvements to the data will not change this process, it will definitely change the number of iterations that we need to perform, but that is a task-specific aspect and not the MapReduce.
Also , we compared the MapReduce results with MPI to show that given larger data sets and higher compute intensive operations, most of these systems converge in performance. So we did not try to mimic the classical parallel computing methods using MapReduce.
One last thought as well, things are very much straight forward when we apply this technique to text data sets, but once we try to apply it to scientific applications, where different data formats and different languages come into play, things get complicated. But overall it is a very easy technique to apply compared to the classic parallel programming techniques such as MPI.
Hope this will explain you our motive.
Thanks, Jaliya
From: cloud-computing@googlegroups.com [mailto:cloud-computing@googlegroups.com] On Behalf Of Roderick Flores Sent: Sunday, August 31, 2008 2:57 PM To: cloud-computing@googlegroups.com Subject: Re: Is Map/Reduce going mainstream?
On Tue, Aug 26, 2008 at 9:57 PM, Jaliya Ekanayake <jekan...@cs.indiana.edu> wrote:
I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compare their performances.
I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
Some thoughts. Your work seems more like a standard HPC distribution using a MapReduce framework to me -- which is not to say that it isn't an excellent concept. However, my take on the MapReduce algorithm is that it is largely asynchronous by which I mean that once the problem is "mapped" the calculations that are preformed on each node are independent of any of the others until the "reduce" begins.
Specifically, one of the core assumptions of the approach is that compute nodes are unreliable. Any set of nodes might fail to return a timely result to the reduction bits of the program. Subsequently, the calculations expected from those nodes will be rescheduled on different hardware. Once you introduce this expectation, using any MPI implementation difficult could easily result in the loss of the work done on all of your nodes because of one fairly likely failure. I am not suggesting that this assumption is always present; rather I believe that it is a key differentiator between distributed computing on a dedicated cluster and the more loose utility computing models that seem to characterize the cloud.
As I mentioned in my preceding post, there are a number of complexities and operational constraints that we should consider when build these sorts of applications. Of course, once we take on these restrictions they seriously complicate matters for the developer. First of all, we have to figure out how to break a problem down into distinct pieces. Next we then need to decide how the results can be reduced into something digestible by the user. The main problem here is that -- as discussed in the Grid Gurus post -- you have to know a lot about the data with which you are working.
In particular, there is a solid chance that any map operations you come up might well yield nondeterministic results sets. Sadly, this violates one of the original definitions of the MapReduce problem space. For example, take the K-means algorithm that you are developing. If I take a data set and toss it out to a set of 'n' compute nodes, I will be creating 'n' times 'k' clusters that, depending upon the sampling bias, will probably not reduce into the 'k' clusters we expected to produce on a single node. Moreover, if you randomly resample the data, you will most likely get significantly different results.
Should we be forced to only run these algorithms on very stable clusters rather than on groups of loosely coupled machines on the cloud to avoid nondeterministic results? I have put a bit of thought into this over the last year or so and I don't think this is a significant problem. Rather, I see it as an opportunity, not only as an exercise in advancing distributed algorithms, but in producing a better K-means result set (or agglomerative or any other clustering for that matter).
We already know that many clustering results are heavily dependent on data: introducing random errors or dropping data often produces a very different result. In K-means, altering 'k' can produce dramatically different results. In agglomerative approaches, the order that data is presented changes the results. If your distance measure is fairly complex, the clusters may respond in unpredictable ways. Moreover, I am sure there are other subtleties that I am ignoring. Basically, these algorithms are pretty nondeterministic to begin with so distributing them asynchronously should not be our primary concern.
Which leads me to ask, what do we do to take these complicating factors into account? I don't believe that many people bootstrap their cluster error distributions by data replacement, reordering, or data removal. I am equally unsure whether people put slight changes in their distance measures to test the cluster sensitivity to either their algorithm or numerical calculations. Rather, they do some statistics on the distribution of the cluster members from its center. Switching to a MapReduce implementation of these algorithms not only offers us the chance to reduce the response time, it provides us with the opportunity to actually understand the data -- which is really the point.
Cheers!
-- Roderick Flores
--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Cloud Computing" group. To post to this group, send email to cloud-computing@googlegroups.com To unsubscribe from this group, send email to cloud-computing-unsubscribe@googlegroups.com To post job listing, send email to j...@cloudjobs.net (position title, employer and location in subject, description in message body) or visit http://www.cloudjobs.net For more options, visit this group at http://groups.google.ca/group/cloud-computing?hl=en?hl=en Posting guidelines: http://groups.google.ca/group/cloud-computing/web/frequently-asked-qu... -~----------~----~----~----~------~----~------~--~---
On Wed, Aug 27, 2008 at 12:27 PM, Paco NATHAN <cet...@gmail.com> wrote:
> Chris, excellent post on that issue.
> Greenplum and Aster Data both recently announced MapReduce within SQL > for software-only distributed data warehouses on commodity hardware -- > highly parallel, fault tolerate -- great arguments. I've talked with > both vendors. These directions are great to see for several reasons, > and I'll likely become a paying customer. There are cautions as well > for other reasons.
> Going beyond the "It's hard to think in MapReduce" point which > GridGurus made ... As an engineering manager at a firm invested in > using lots of SQL, lots of Hadoop, lots of other code written in > neither of the above, and many terabytes of data added each week -- my > primary concern is more about how well a *team* of people can "think" > within a paradigm over a *long* period of time.
> Much of what I've seen discussed about Hadoop seems based around an > individual programmer's initial experiences. That's great to see. > However, what really sells a paradigm is when you can evaluate from a > different perspective, namely hindsight.
> I wonder was there a situation with Hadoop, when you needed to have HA
mode for Namenode machine/instance, since according to documentation it is single point of failure. "... The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported."
Is there other implementation, that has this issue solved (I'm reading Dryade docs right now and didn't find the answer yet, other than hw fault tolerance for data).
On Aug 31, 2008, at 8:42 PM, Khazret Sapenov wrote:
> I wonder was there a situation with Hadoop, when you needed to have > HA mode for Namenode machine/instance, since according to > documentation it is single point of failure. > "... The NameNode machine is a single point of failure for an HDFS > cluster. If the NameNode machine fails, manual intervention is > necessary. Currently, automatic restart and failover of the NameNode > software to another machine is not supported."
HA for the name node likely wouldn't have helped in the vast majority of the failure cases I've seen. Typically, our name node "failures"-- really full garbage collections or the name node getting swapped to disk--are related to bad user code (too many small file creates), bad tuning (not enough heap/memory), or full file system (not enough free blocks on the local disk on the data nodes, too many files in the fs, etc). HAing the name node could have easily turned into a ping pong situation.
We've had exactly one hardware related failure in recent memory where HA might have helped, but in the end didn't really matter. The name node stayed up due to redundant name dirs, including one on NFS. [We did, however, find a bug with the secondary doing Bad Things(tm) in this particular failure case though.] But user jobs continued, albeit a tad slower.
A job is written as an entry to the space.
A worker node takes that entry under transaction.
if successful it commit the transaction and moves to the next job.
If it fails the job is rolledback and another worker takes ownership
of that job and continues the execution as if nothing happend.
Nati S
GigaSaces
On Sep 1, 6:42 am, "Khazret Sapenov" <sape...@gmail.com> wrote:
> On Wed, Aug 27, 2008 at 12:27 PM, Paco NATHAN <cet...@gmail.com> wrote:
> > Chris, excellent post on that issue.
> > Greenplum and Aster Data both recently announced MapReduce within SQL
> > for software-only distributed data warehouses on commodity hardware --
> > highly parallel, fault tolerate -- great arguments. I've talked with
> > both vendors. These directions are great to see for several reasons,
> > and I'll likely become a paying customer. There are cautions as well
> > for other reasons.
> > Going beyond the "It's hard to think in MapReduce" point which
> > GridGurus made ... As an engineering manager at a firm invested in
> > using lots of SQL, lots of Hadoop, lots of other code written in
> > neither of the above, and many terabytes of data added each week -- my
> > primary concern is more about how well a *team* of people can "think"
> > within a paradigm over a *long* period of time.
> > Much of what I've seen discussed about Hadoop seems based around an
> > individual programmer's initial experiences. That's great to see.
> > However, what really sells a paradigm is when you can evaluate from a
> > different perspective, namely hindsight.
> > I wonder was there a situation with Hadoop, when you needed to have HA
> mode for Namenode machine/instance, since according to documentation it is
> single point of failure.
> "... The NameNode machine is a single point of failure for an HDFS cluster.
> If the NameNode machine fails, manual intervention is necessary. Currently,
> automatic restart and failover of the NameNode software to another machine
> is not supported."
> Is there other implementation, that has this issue solved (I'm reading
> Dryade docs right now and didn't find the answer yet, other than hw fault
> tolerance for data).
"From what I've seen through the years in other areas of large scale
data, such as ETL and analytics, first off the data is not
particularly relational.
The concept of workflows seems much more apt -- more likely to become
adopted by the practitioners in quantity once we get past the "slope
of enlightenment" phase of the hype cycle."
Nathan i think that your observation is pretty accurate with what i've
seen in the market.
The world of real time analystics is moving to this direction
allready:
See an interesting article on that regard that was published on
infoworld - entitled: Real time drives database virtualization
"Real time drives database virtualization - Database virtualization
will enable real-time business intelligence through a memory grid that
permeates an infrastructure at all levels""
Interestingly enough both where published recently so it seems like
the need is growing and with it the demand of "new" model for
processing large amound of data faster.
In the financial services world many of the real-time reports such as
Reconsiliation, Value at Risk and Profit and Loss are allready using
this model as they face ever increasing volume of data while at the
same time a demand to process that data faster. Clearly the
realisation that forced those financial organisatio to explore new
models is that those two supposibly contradicting requirements can't
be addressed just by twikking our existing database or OLAP server.
I do agree with you and Chris Wensel that the current barrier to entry
that exist with solutions like Haddoop is too steep to many
organisations and therfore making parallel processing model closer to
way people build analytical applicaitons today is critical for higher
adoption.
As you mentioned we allready see different solutions that expose
higher level of API such as SQL that abstract some fo those detailes
from the programmer.
IMO there are other level of absctations that can be used to map
existing programming model and leverage parallel processing as part of
the implementation detailes of those programming model rather then
exposing them explicitly:
1. An SQL intereface that provides a method for exectuing SQL commands
in parallel over a partitioned clusster of data nodes.
2. Using RPC abstration that enables invocation on multiple services
that implements the same interface and use a map/reduce style of
invocation to perform parallel invocation and agregation of the
results. In fact you could use that same model to perform both
synchronious agregation (Map/Reduce) and batch processing. An example
RPC based abstraction is decribed in one of the our recent white
papers Service Virtualization Frameowrk: http://www.gigaspaces.com/viewpdf/975))
3. An executor interface that enables you to "ship" code to a cluster
of machines and execute that code in parallel.
In all three cases the programming model can be kept identical or
fairly close to the existing programming model. It doesn't prevent us
from adding additional enhancemtns to support more sepecilized
scenarios where a user would want more fine grained control over the
underlying parallel processing exectuion (control ponits could be
routing interceptors, reducers etc..). The idea is that adding those
control points wouldn't force you to learn new programming model. It
will be more evolution extention to your existing way of thinking.
Nati S
GigaSpaces.
On Aug 27, 7:27 pm, "Paco NATHAN" <cet...@gmail.com> wrote:
> Greenplum and Aster Data both recently announced MapReduce within SQL
> for software-only distributed data warehouses on commodity hardware --
> highly parallel, fault tolerate -- great arguments. I've talked with
> both vendors. These directions are great to see for several reasons,
> and I'll likely become a paying customer. There are cautions as well
> for other reasons.
> Going beyond the "It's hard to think in MapReduce" point which
> GridGurus made ... As an engineering manager at a firm invested in
> using lots of SQL, lots of Hadoop, lots of other code written in
> neither of the above, and many terabytes of data added each week -- my
> primary concern is more about how well a *team* of people can "think"
> within a paradigm over a *long* period of time.
> Much of what I've seen discussed about Hadoop seems based around an
> individual programmer's initial experiences. That's great to see.
> However, what really sells a paradigm is when you can evaluate from a
> different perspective, namely hindsight.
> Consider this as an exercise: imagine that a programmer wrote 1000
> lines of code for a project at work based on MapReduce, the company
> has run that code in some critical use case for 2-3 years, the
> original author moved to another company (or the company was
> acquired), and now a team of relatively new eyeballs must reverse
> engineer the existing MapReduce code and extend it.
> A software lifecycle argument is not heard so much in discussions
> about MapReduce, but IMHO it will become more crucial for decision
> makers as vendors move to providing SQL, etc.
> As a different example, I don't need to argue with exec mgmt or board
> members that reworking 10000 lines of PHP with SQL embedded is going > to cost more than reworking the same app if it'd been written in Java
> and Hibernate. Or having it written in Python, or whatever other
> more-structured language makes sense for the app. The business
> decision makers already understand those costs, all too well.
> Keep in mind that a previous generation of data warehousing (e.g.,
> Netezza) promised parallel processing based on SQL-plus-extensions
> atop custom hardware -- highly-parallel, somewhat less fault-tolerant,
> but still an advancement. IMHO, what we see now from Greenplum, Aster
> Data, Hive, etc., represents a huge leap beyond that previous
> generation. Particularly for TCO.
> Pros: if you have terabytes of data collected from your different
> business units, then those business units will want to run ad-hoc
> queries on it, and they won't have time/expertise/patience to write
> code for Hadoop. I'd rather give those users means to run some SQL and
> be done -- than disrupt my team from core work on algorithms.
> Cons: if you have a complex algorithm which is part of the critical
> infrastructure and must be maintained over a long-period of time by a
> team of analysts and developers, then RUN AWAY, DO NOT WALK, from
> anyone trying to sell you a solution based in SQL.
> I'd estimate the cost of maintaining substantial apps in SQL to be
> 10-100x that of more structured languages used in Hadoop: Java, C++,
> Python, etc. That's based on encountering the problem over and over
> for the past 10 years. Doubtless that there are studies by people with
> much broader views and better data, much to the same effect.
> The challenge to Aster Data, Greenplum, et al., will be to provide
> access to their new systems from within languages like Java, Python,
> etc. Otherwise I see them relegated to the land of ad-hoc queries,
> being eclipsed by coding paradigms which are less abstracted or more
> geared toward software costs long-term.
> PIG provides an interesting approach and I'm sure that it will be
> useful for many projects. Not clear that it provides the answers to
> software lifecycle issues.
> From what I've seen through the years in other areas of large scale
> data, such as ETL and analytics, first off the data is not
> particularly relational.
> The concept of workflows seems much more apt -- more likely to become
> adopted by the practitioners in quantity once we get past the "slope
> of enlightenment" phase of the hype cycle.
> Our team has been considering a shift to Cascading. I'd much rather
> train a stats PhD on Cascading than have them slog through 50x lines
> of SQL which was written by the previous stats PhD.
> Having access to DSLs from within application code -- e.g., R for
> analytics, Python for system programming, etc. -- that would also be
> much more preferable from my perspective than MapReduce within SQL.
> Paco
> On Tue, Aug 26, 2008 at 6:56 PM, Chris K Wensel <ch...@wensel.net> wrote:
> > Hey all
> > From a practitioners perspective, this is both great news, and a bit
> > troubling.
> > Good news since MapReduce is a good divide and conquer strategy when dealing
> > with large (continuous) datasets.
> > When coupled to a 'global' filesystem with a single namespace like HDFS (the
> > Hadoop file system) or GFS (googles global filesystem), you get a new
> > powerful entity that effectively virtualizes many hardware nodes into a
> > single operating system instance (a wide virtualization), the converse of
> > Xen and VMWare where you get may instances on one hardware node (a narrow
> > virtualization).
> > Being inherently fault tolerant (due to the MapReduce model and nature of
> > the filesystem) data centers could safely migrate batch like workloads into
> > this system. Data-warehousing is a perfect fit when you realize RDBMS data
> > warehouses are simply caches that have leaked up into the architecture due
> > to their being resource constrained, forcing ETL to be yet another complex
> > part of the datacenter and data-warehouse to load this cache.
> > Hadoop like systems allow for a 'lazy evaluation' model of the data, where
> > caching is just an attribute, not an architectural component.
> > Troubling because 'thinking' in MapReduce sucks. If you've ever read "How to
On Mon, Sep 1, 2008 at 9:30 PM, <natisha...@gmail.com> wrote: > ...
> 3. An executor interface that enables you to "ship" code to a cluster > of machines and execute that code in parallel.
Is it something like good old gexec? "GEXEC is a scalable cluster remote execution system which provides fast, RSA authenticated remote execution of parallel and distributed jobs. It provides transparent forwarding of stdin, stdout, stderr, and signals to and from remote processes, provides local environment propagation, and is designed to be robust and to scale to systems over 1000 nodes. Internally, GEXEC operates by building an n-ary tree of TCP sockets and threads between gexec daemons and propagating control information up and down the tree. By using hierarchical control, GEXEC distributes both the work and resource usage associated with massive amounts of parallelism across multiple nodes, thereby eliminating problems associated with single node resource limits (e.g., limits on the number of file descriptors on front-end nodes). An initial release of the software (below) consists of a daemon, a client program, and a library which provides programmatic interface to the GEXEC system. "
For example, you may have large Hadoop jobs as important batch processes which are best to run on elastic resources. What framework gets used to launch and monitor those resources, then pull back results after the job has completed? Can I specify to that framework some tagged build from my SVN repository? How about data which fits best in a distributed cache, not in HDFS?
If the batch process is critical to my business, how do I automated it - like in a crontab? As far as I can tell, none of the cloud providers address this kind of requirement. Hadoop does not include those kinds of features.
From what I can see, RightScale probably comes closest to providing services in that area.
--- On Sat, 8/30/08, Matt Lynch <M...@ttLynch.net> wrote:
> Horses for courses. There is discrete data and continuous data. > Discrete data can be easily partitioned and I see MapReduce as > overkill for this scenario (eg. Student exam results). However, > Continuous data (eg. A video stream) is not easily partitioned > and does require stuff like this.
Nitpick: video is only "not easily partitioned" if one excludes the possibility of using high dimensionality spaces to represent it. This is only a problem if one is tacitly assuming a "points on a line" representation (which most systems do) but that assumption is not strictly required, merely common and well-understood.
" For example, you may have large Hadoop jobs as important batch processes which are best to run on elastic resources. What framework gets used to launch and monitor those resources, then pull back results after the job has completed?" Can I specify to that framework some tagged build from my SVN repository? How about data which fits best in a distributed cache, not in HDFS?"
I'm not aware of ways to address all those requirements with Hadoop but there are other Grid platforms alternatives that lets you do that. Some are provided as OpenSource and some are not. For example: Fura System provides a batch scheduler in Java so is JPPF which another FOSS grid framework that provides similar type of grid schedulers. On the commercial side DataSynapse and Platform computing are probably the leading vendors. Both provides way to monitor the jobs and get the results at some other stage. With JavaSpace based solution each job is normally tagged with Job-ID as part of the Entry attributes which is used to match all the results belong to that JOB at some other time. With JavaSpaces you can also push the job code to the compute node dynamically using dynamic code downloading. As for pointing data to a cache that becomes more an attribute of the task i.e. with any of those frameworks the task is a code that you write in java and you can point it to any data source that you like whether its a cache or a file system.
The main difference between those alternatives and Hadoop is that Hadoop is more geared to parallel aggregation of large distributed file systems rather then being a generic job execution framework. On the other hand most of the Grid frameworks that I mentioned are geared toward batch execution rather then synchronous aggregated execution. GridGain is a relatively new and interesting player in that space that seem to compete more with the Hadoop model then with the classic batch processing grid providers and since it is java based it brings some of the simplicity and code distribution benefits that exist with other java based solution.
Having said all that the requirements that you mentioned indicate some of the limitations of having a specialized framework such as Hadoop narrowed to specific scenario. If you want to do something slightly different such as parallel batch job as appose to parallel aggregated job you probably need to use a totally different solution.
One of the nice things with Space model is that you can easily use it to address both scenarios. We extended the JavaSpace model and beyond the API abstraction that I mentioned earlier we added support for monitoring and the ability to support both asynchronous batch processing or synchronous aggregated operations where you write set of tasks in parallel and block for the immediate results in similar to the way you would do it with Hadoop(See a reference here: http://wiki.gigaspaces.com/display/GS66/OpenSpaces+Core+Component+-+E... s). Since we use the space as the data layer and execution layer you get data affinity pretty much built into the model i.e. you can decide to route the task to where the data is and save the overhead associated with moving of that data or accessing it over the network. We also provide built in support for executing Groovy, JRuby and other dynamic languages tasks which gives another dynamic capabilities. One of the common use case is using dynamic languages as a new form of stored procedure (in away you can think of SQL as a specialized case of dynamic language as well). And the nice things that it all runs in EC2 today.
[mailto:cloud-computing@googlegroups.com] On Behalf Of Paco NATHAN Sent: Tuesday, September 02, 2008 5:48 AM To: cloud-computing@googlegroups.com Subject: Re: Is Map/Reduce going mainstream?
For example, you may have large Hadoop jobs as important batch processes which are best to run on elastic resources. What framework gets used to launch and monitor those resources, then pull back results after the job has completed? Can I specify to that framework some tagged build from my SVN repository? How about data which fits best in a distributed cache, not in HDFS?
If the batch process is critical to my business, how do I automated it - like in a crontab? As far as I can tell, none of the cloud providers address this kind of requirement. Hadoop does not include those kinds of features.
From what I can see, RightScale probably comes closest to providing services in that area.
Paco
On Mon, Sep 1, 2008 at 8:30 PM, <natisha...@gmail.com> wrote:
> 3. An executor interface that enables you to "ship" code to a cluster > of machines and execute that code in parallel.