Is Map/Reduce going mainstream?
From: "Jaliya Ekanayake" <jekan...@cs.indiana.edu>
Date: Sun, 31 Aug 2008 10:15:18 -0400
Local: Sun, Aug 31 2008 10:15 am
Subject: RE: Is Map/Reduce going mainstream?

Yes, we used the same number and exactly the same hardware configuration for
the VM and non-VM tests.

Thanks,
Jaliya

From: "Jaliya Ekanayake" <jekan...@cs.indiana.edu>
Date: Sun, 31 Aug 2008 10:24:42 -0400
Local: Sun, Aug 31 2008 10:24 am
Subject: RE: Is Map/Reduce going mainstream?

Maybe Krishna Sankar used it to mean "the parallel performance". I used the
same term in my reply to him :)

Jaliya

From: "Roderick Flores" <roderick.flo...@gmail.com>
Date: Sun, 31 Aug 2008 12:56:50 -0600
Local: Sun, Aug 31 2008 2:56 pm
Subject: Re: Is Map/Reduce going mainstream?

On Tue, Aug 26, 2008 at 9:57 PM, Jaliya Ekanayake
<jekan...@cs.indiana.edu> wrote:

>  I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI
> and compare their performances.

> I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java
> threads on multi-core computers.

Some thoughts… Your work seems more like a standard HPC distribution using a
MapReduce framework to me -- which is not to say that it isn't an excellent
concept.  However, my take on the MapReduce algorithm is that it is largely
asynchronous, by which I mean that once the problem is "mapped", the
calculations that are performed on each node are independent of any of the
others until the "reduce" begins.

Specifically, one of the core assumptions of the approach is that compute
nodes are unreliable. Any set of nodes might fail to return a timely result
to the reduction bits of the program.  Subsequently, the calculations
expected from those nodes will be rescheduled on different hardware.  Once
you introduce this expectation, using any MPI implementation could easily
result in the loss of the work done on all of your nodes because of one
fairly likely failure.  I am not suggesting that this assumption is
always present; rather, I believe that it is a key differentiator between
distributed computing on a dedicated cluster and the looser utility
computing models that seem to characterize the cloud.

As I mentioned in my preceding post, there are a number of complexities and
operational constraints that we should consider when building these sorts of
applications. Of course, once we take on these restrictions, they seriously
complicate matters for the developer.  First of all, we have to figure out
how to break a problem down into distinct pieces.  Next, we need to
decide how the results can be reduced into something digestible by the
user.  The main problem here is that -- as discussed in the Grid Gurus post
-- you have to know a lot about the data with which you are working.

In particular, there is a solid chance that any map operations you come up
with might well yield nondeterministic result sets.  Sadly, this violates one
of the original definitions of the MapReduce problem space. For example, take
the K-means algorithm that you are developing.  If I take a data set and
toss it out to a set of 'n' compute nodes, I will be creating 'n' times 'k'
clusters that, depending upon the sampling bias, will probably not reduce
into the 'k' clusters we expected to produce on a single node.  Moreover, if
you randomly resample the data, you will most likely get significantly
different results.

Should we be forced to only run these algorithms on very stable clusters
rather than on groups of loosely coupled machines on the cloud to avoid
nondeterministic results?  I have put a bit of thought into this over the
last year or so and I don't think this is a significant problem.  Rather, I
see it as an opportunity, not only as an exercise in advancing distributed
algorithms, but in producing a better K-means result set (or agglomerative
or any other clustering for that matter).

We already know that many clustering results are heavily dependent on data:
introducing random errors or dropping data often produces a very different
result.  In K-means, altering 'k' can produce dramatically different
results.  In agglomerative approaches, the order that data is presented
changes the results.  If your distance measure is fairly complex, the
clusters may respond in unpredictable ways.  Moreover, I am sure there are
other subtleties that I am ignoring.  Basically, these algorithms are pretty
nondeterministic to begin with so distributing them asynchronously should
not be our primary concern.

Which leads me to ask, what do we do to take these complicating factors into
account?  I don't believe that many people bootstrap their cluster error
distributions by data replacement, reordering, or data removal.  I am
equally unsure whether people put slight changes in their distance measures
to test the cluster sensitivity to either their algorithm or numerical
calculations. Rather, they do some statistics on the distribution of the
cluster members from its center.  Switching to a MapReduce implementation of
these algorithms not only offers us the chance to reduce the response time,
it provides us with the opportunity to actually understand the data -- which
is really the point.
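
To make the bootstrapping idea above concrete, here is a minimal sketch rather than anyone's actual implementation. It assumes 1-D data, a plain Lloyd's-style K-means with quantile-based starting centers, and resampling with replacement; the function names (`kmeans_1d`, `bootstrap_centers`, `center_spread`) are hypothetical. Centers are matched across runs simply by sorting them, which is a simplification that only works when clusters are well ordered.

```python
import random
import statistics


def kmeans_1d(points, k, iterations=20):
    """Plain Lloyd's algorithm on 1-D data with quantile-based starting centers."""
    ordered = sorted(points)
    # Spread the initial centers across the sorted sample.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [
            statistics.mean(g) if g else centers[i] for i, g in enumerate(groups)
        ]
    return sorted(centers)


def bootstrap_centers(points, k, resamples=50, seed=0):
    """Re-run K-means on samples drawn with replacement from the data."""
    rng = random.Random(seed)
    runs = []
    for _ in range(resamples):
        sample = rng.choices(points, k=len(points))
        runs.append(kmeans_1d(sample, k))
    return runs


def center_spread(runs):
    """Standard deviation of each (sorted) center across the bootstrap runs."""
    return [statistics.pstdev(column) for column in zip(*runs)]
```

Each resample is independent of the others, so the runs could be handed out as separate map tasks, with the spread computed in the reduce step. A large spread for a center would suggest that cluster is sensitive to which data points happened to be sampled.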
Cheers!

--
Roderick Flores


From: "Jaliya Ekanayake" <jekan...@cs.indiana.edu>
Date: Sun, 31 Aug 2008 22:27:57 -0400
Local: Sun, Aug 31 2008 10:27 pm
Subject: RE: Is Map/Reduce going mainstream?

It seems you have not understood why we use Kmeans in this context. As
you mentioned, there are numerous improvements one can make to the
data and the initial set of clusters in Kmeans and most other clustering
algorithms. We selected Kmeans because it is a simple example that shows the
use of "iterative computations using MapReduce".

We performed the Kmeans algorithm exactly as you have mentioned. All the map
tasks compute the distances between some points and the set of cluster
centers, and once all of them have finished, a set of reduce tasks starts
calculating the new cluster centers. So we preserve the "MapReduce" paradigm
in both our algorithm and the implementation. This also accommodates any
fault-tolerance features of the implementation: the result will not change if
a few map tasks are re-executed after failures.  Changes to the initial
cluster centers and other improvements to the data will not change this
process. They will definitely change the number of iterations that we need
to perform, but that is a task-specific aspect and not a MapReduce one.
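
The iteration described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the CGL-MapReduce or Hadoop code: the function names are hypothetical, everything runs in one process, and each map task emits per-cluster partial sums (a combiner-style choice made here to keep the example short).

```python
from collections import defaultdict


def nearest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_task(split, centers):
    """For one data split, emit (cluster index, (coordinate sums, count))."""
    partial = {}
    for point in split:
        idx = nearest_center(point, centers)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return sorted(partial.items())


def reduce_task(values, old_center):
    """Combine the partial sums for one cluster into its new center."""
    total = [0.0] * len(old_center)
    count = 0
    for sums, n in values:
        total = [t + s for t, s in zip(total, sums)]
        count += n
    if count == 0:
        return list(old_center)  # no points assigned: keep the old center
    return [t / count for t in total]


def kmeans_iteration(splits, centers):
    """One MapReduce round: all map tasks finish, then the reduce tasks run."""
    grouped = defaultdict(list)
    for split in splits:
        for idx, value in map_task(split, centers):
            grouped[idx].append(value)
    return [reduce_task(grouped[i], centers[i]) for i in range(len(centers))]
```

Because `map_task` depends only on its split and the current centers, re-running a failed task yields the same output, and the new centers do not depend on how the points were divided into splits (up to floating-point rounding in the sums).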

Also, we compared the MapReduce results with MPI to show that, given larger
data sets and more compute-intensive operations, most of these systems
converge in performance. So we did not try to mimic the classical parallel
computing methods using MapReduce.

One last thought: things are very straightforward when we
apply this technique to text data sets, but once we try to apply it to
scientific applications, where different data formats and different
languages come into play, things get complicated. Overall, though, it is a
very easy technique to apply compared to classic parallel programming
techniques such as MPI.

I hope this explains our motive.

Thanks,
Jaliya

From: "Khazret Sapenov" <sape...@gmail.com>
Date: Sun, 31 Aug 2008 23:42:11 -0400
Local: Sun, Aug 31 2008 11:42 pm
Subject: Re: Is Map/Reduce going mainstream?

I wonder whether there was a situation with Hadoop where you needed an HA
mode for the NameNode machine/instance, since according to the documentation
it is a single point of failure.
"... The NameNode machine is a single point of failure for an HDFS cluster.
If the NameNode machine fails, manual intervention is necessary. Currently,
automatic restart and failover of the NameNode software to another machine
is not supported."

Is there another implementation that has this issue solved? (I'm reading the
Dryad docs right now and haven't found the answer yet, other than hardware
fault tolerance for data.)

thanks,
KS


From: Allen Wittenauer <allenw+li...@iobm.com>
Date: Sun, 31 Aug 2008 23:17:31 -0700
Local: Mon, Sep 1 2008 2:17 am
Subject: Re: Is Map/Reduce going mainstream?
On Aug 31, 2008, at 8:42 PM, Khazret Sapenov wrote:

> I wonder was there a situation with Hadoop, when you needed to have  
> HA mode for Namenode machine/instance, since according to  
> documentation it is single point of failure.
> "... The NameNode machine is a single point of failure for an HDFS  
> cluster. If the NameNode machine fails, manual intervention is  
> necessary. Currently, automatic restart and failover of the NameNode  
> software to another machine is not supported."

        HA for the name node likely wouldn't have helped in the vast majority  
of the failure cases I've seen.  Typically, our name node "failures"--
really full garbage collections or the name node getting swapped to  
disk--are related to bad user code (too many small file creates),  bad  
tuning (not enough heap/memory), or full file system (not enough free  
blocks on the local disk on the data nodes, too many files in the fs,  
etc).  HAing the name node could have easily turned into a ping pong  
situation.

        We've had exactly one hardware related failure in recent memory where  
HA might have helped, but in the end didn't really matter.  The name  
node stayed up due to redundant name dirs, including one on NFS.  [We  
did, however, find a bug with the secondary doing Bad Things(tm) in  
this particular failure case though.]  But user jobs continued, albeit  
a tad slower.


From: natisha...@gmail.com
Date: Mon, 1 Sep 2008 17:35:29 -0700 (PDT)
Local: Mon, Sep 1 2008 8:35 pm
Subject: Re: Is Map/Reduce going mainstream?
With a tuple space model such as the one defined by JavaSpaces (see:
http://en.wikipedia.org/wiki/JavaSpaces#JavaSpaces ), HA is
maintained through transactions:

A job is written as an entry to the space.
A worker node takes that entry under a transaction.
If successful, it commits the transaction and moves on to the next job.
If it fails, the transaction is rolled back and another worker takes
ownership of that job and continues the execution as if nothing happened.
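
The take/commit/rollback cycle above can be sketched with a toy in-memory stand-in. This is not the JavaSpaces API (which is Java and distributed); the `Space`, `Transaction` and `run_worker` names are hypothetical and exist only to show the control flow.

```python
from collections import deque


class Transaction:
    """Holds one taken entry until the worker commits or rolls back."""

    def __init__(self, space, entry):
        self.space = space
        self.entry = entry
        self.finished = False

    def commit(self):
        # The entry stays removed from the space.
        self.finished = True

    def rollback(self):
        # Put the entry back so another worker can take it.
        if not self.finished:
            self.space.write(self.entry)
            self.finished = True


class Space:
    """Toy single-process stand-in for a tuple space of job entries."""

    def __init__(self):
        self._entries = deque()

    def write(self, entry):
        self._entries.append(entry)

    def take(self):
        """Remove one entry under a transaction, or return None if empty."""
        if not self._entries:
            return None
        return Transaction(self, self._entries.popleft())


def run_worker(space, handler, results):
    """Process at most one job; return True only if it was committed."""
    txn = space.take()
    if txn is None:
        return False
    try:
        results.append(handler(txn.entry))
    except Exception:
        txn.rollback()  # the job becomes visible to other workers again
        return False
    txn.commit()
    return True
```

A worker whose handler raises leaves the job in the space, so the next call to `run_worker` with a healthy handler picks it up as if nothing had happened.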

Nati S
GigaSpaces

On Sep 1, 6:42 am, "Khazret Sapenov" <sape...@gmail.com> wrote:


From: natisha...@gmail.com
Date: Mon, 1 Sep 2008 18:30:52 -0700 (PDT)
Local: Mon, Sep 1 2008 9:30 pm
Subject: Re: Is Map/Reduce going mainstream?
"From what I've seen through the years in other areas of large scale
data, such as ETL and analytics, first off the data is not
particularly relational.
The concept of workflows seems much more apt -- more likely to become
adopted by the practitioners in quantity once we get past the "slope
of enlightenment" phase of the hype cycle."

Nathan, I think that your observation is pretty accurate given what I've
seen in the market.
The world of real-time analytics is already moving in this direction:

See an interesting article in that regard, published on
InfoWorld and entitled "Real time drives database virtualization":

"Real time drives database virtualization - Database virtualization
will enable real-time business intelligence through a memory grid that
permeates an infrastructure at all levels"

http://www.infoworld.com/article/08/08/20/Real_time_drives_database_v...

And another report, by Forrester: Report Warns Of Data Warehouse
'Bottleneck' In Real-Time Analytics
http://www.intelligententerprise.com/showArticle.jhtml?articleID=2101...

Interestingly enough, both were published recently, so it seems that
the need is growing, and with it the demand for a "new" model for
processing large amounts of data faster.

In the financial services world many of the real-time reports, such as
Reconciliation, Value at Risk and Profit and Loss, are already using
this model, as they face an ever-increasing volume of data and at the
same time a demand to process that data faster. Clearly, the
realisation that forced those financial organisations to explore new
models is that those two supposedly contradicting requirements can't
be addressed just by tweaking our existing database or OLAP server.

I do agree with you and Chris Wensel that the current barrier to entry
that exists with solutions like Hadoop is too steep for many
organisations, and therefore bringing the parallel processing model closer
to the way people build analytical applications today is critical for higher
adoption.

As you mentioned, we already see different solutions that expose
a higher-level API, such as SQL, that abstracts some of those details
from the programmer.

IMO there are other levels of abstraction that can be used to map
existing programming models and leverage parallel processing as part of
the implementation details of those programming models rather than
exposing it explicitly:

1. An SQL interface that provides a method for executing SQL commands
in parallel over a partitioned cluster of data nodes.

2. Using an RPC abstraction that enables invocation on multiple services
that implement the same interface, and using a map/reduce style of
invocation to perform parallel invocation and aggregation of the
results. In fact, you could use that same model to perform both
synchronous aggregation (Map/Reduce) and batch processing. An example
RPC-based abstraction is described in one of our recent white
papers, Service Virtualization Framework: http://www.gigaspaces.com/viewpdf/975

3. An executor interface that enables you to "ship" code to a cluster
of machines and execute that code in parallel.
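
As a rough illustration of the executor style in item 3, combined with the map/reduce-style aggregation from item 2, here is a minimal sketch. It assumes a local thread pool standing in for a cluster of machines and in-memory lists standing in for data partitions; `execute_on_partitions` and `count_large_orders` are hypothetical names, not part of any product mentioned here.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_on_partitions(partitions, task, reducer, max_workers=4):
    """Run `task` against every partition in parallel, then reduce the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = list(pool.map(task, partitions))
    return reducer(partial_results)


def count_large_orders(partition):
    """The 'shipped' code: count order amounts above 100 in one partition."""
    return sum(1 for amount in partition if amount > 100)


if __name__ == "__main__":
    # Three partitions of order amounts, as if spread over three data nodes.
    partitions = [[50, 150], [200], [10, 300, 400]]
    print(execute_on_partitions(partitions, count_large_orders, sum))  # prints 4
```

The caller writes an ordinary function and an ordinary reducer; the fan-out and aggregation stay behind the interface, which is the point made about keeping the programming model close to the existing one.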

In all three cases the programming model can be kept identical or
fairly close to the existing programming model. That doesn't prevent us
from adding enhancements to support more specialized
scenarios where a user would want more fine-grained control over the
underlying parallel processing execution (control points could be
routing interceptors, reducers, etc.). The idea is that adding those
control points wouldn't force you to learn a new programming model. It
would be more of an evolutionary extension to your existing way of thinking.

Nati S
GigaSpaces.

On Aug 27, 7:27 pm, "Paco NATHAN" <cet...@gmail.com> wrote:

From: "Khazret Sapenov" <sape...@gmail.com>
Date: Mon, 1 Sep 2008 22:02:02 -0400
Local: Mon, Sep 1 2008 10:02 pm
Subject: Re: Is Map/Reduce going mainstream?

On Mon, Sep 1, 2008 at 9:30 PM, <natisha...@gmail.com> wrote:
> ...

> 3. An executor interface that enables you to "ship" code to a cluster
> of machines and execute that code in parallel.

Is it something like good old gexec?
"GEXEC is a scalable cluster remote execution system which provides fast,
RSA authenticated remote execution of parallel and distributed jobs. It
provides transparent forwarding of stdin, stdout, stderr, and signals to and
from remote processes, provides local environment propagation, and is
designed to be robust and to scale to systems over 1000 nodes. Internally,
GEXEC operates by building an n-ary tree of TCP sockets and threads between
gexec daemons and propagating control information up and down the tree. By
using hierarchical control, GEXEC distributes both the work and resource
usage associated with massive amounts of parallelism across multiple nodes,
thereby eliminating problems associated with single node resource limits
(e.g., limits on the number of file descriptors on front-end nodes). An
initial release of the software (below) consists of a daemon, a client
program, and a library which provides programmatic interface to the GEXEC
system. "

KS


From: "Paco NATHAN" <cet...@gmail.com>
Date: Mon, 1 Sep 2008 21:47:53 -0500
Local: Mon, Sep 1 2008 10:47 pm
Subject: Re: Is Map/Reduce going mainstream?
Nati,

Thanks for the great links -

An "executor interface" is close to a question I asked in a recent blog post:
http://ceteri.blogspot.com/2008/08/hadoop-in-cloud-patterns-for-autom...

For example, you may have large Hadoop jobs as important batch
processes which are best to run on elastic resources. What framework
gets used to launch and monitor those resources, then pull back
results after the job has completed? Can I specify to that framework
some tagged build from my SVN repository?  How about data which fits
best in a distributed cache, not in HDFS?

If the batch process is critical to my business, how do I automate it
- like in a crontab?  As far as I can tell, none of the cloud
providers address this kind of requirement.  Hadoop does not include
those kinds of features.

From what I can see, RightScale probably comes closest to providing
services in that area.

Paco


From: Andrew Rogers <jr89...@yahoo.com>
Date: Mon, 1 Sep 2008 23:17:45 -0700 (PDT)
Local: Tues, Sep 2 2008 2:17 am
Subject: RE: Is Map/Reduce going mainstream?
--- On Sat, 8/30/08, Matt Lynch <M...@ttLynch.net> wrote:

> Horses for courses.  There is discrete data and continuous data.
> Discrete data can be easily partitioned and I see MapReduce as
> overkill for this scenario (eg. Student exam results).  However,
> Continuous data (eg. A video stream) is not easily partitioned
> and does require stuff like this.

Nitpick: video is only "not easily partitioned" if one excludes the possibility of using high dimensionality spaces to represent it.  This is only a problem if one is tacitly assuming a "points on a line" representation (which most systems do) but that assumption is not strictly required, merely common and well-understood.

-Andrew


From: "Nati Shalom" <natisha...@gmail.com>
Date: Mon, 1 Sep 2008 21:21:38 +0300
Local: Mon, Sep 1 2008 2:21 pm
Subject: RE: Is Map/Reduce going mainstream?
Hi Nathan

"For example, you may have large Hadoop jobs as important batch
processes which are best to run on elastic resources. What framework
gets used to launch and monitor those resources, then pull back
results after the job has completed?
Can I specify to that framework
some tagged build from my SVN repository?  How about data which fits
best in a distributed cache, not in HDFS?"

I'm not aware of ways to address all those requirements with Hadoop, but
there are other Grid platform alternatives that let you do that. Some are
provided as open source and some are not. For example, Fura System provides a
batch scheduler in Java, and so does JPPF, which is another FOSS grid
framework that provides similar grid schedulers. On the commercial side,
DataSynapse and Platform Computing are probably the leading vendors. Both
provide ways to monitor the jobs and get the results at some later stage.
With a JavaSpace-based solution, each job is normally tagged with a Job-ID as
part of the Entry attributes, which is used to match all the results
belonging to that job at some other time. With JavaSpaces you can also push
the job code to the compute node dynamically using dynamic code downloading.
As for pointing data to a cache, that becomes more an attribute of the task,
i.e. with any of those frameworks the task is code that you write in Java,
and you can point it to any data source that you like, whether it's a cache
or a file system.

The main difference between those alternatives and Hadoop is that Hadoop is
more geared to parallel aggregation over large distributed file systems rather
than being a generic job execution framework. On the other hand, most of the
Grid frameworks that I mentioned are geared toward batch execution rather
than synchronous aggregated execution. GridGain is a relatively new and
interesting player in that space that seems to compete more with the Hadoop
model than with the classic batch processing grid providers, and since it is
Java-based it brings some of the simplicity and code distribution benefits
that exist with other Java-based solutions.

Having said all that, the requirements that you mentioned indicate some of
the limitations of a specialized framework such as Hadoop that is narrowed to
a specific scenario. If you want to do something slightly different, such as
a parallel batch job as opposed to a parallel aggregated job, you probably
need to use a totally different solution.

One of the nice things about the Space model is that you can easily use it to
address both scenarios. We extended the JavaSpace model, and beyond the API
abstraction that I mentioned earlier we added support for monitoring and the
ability to support both asynchronous batch processing and synchronous
aggregated operations, where you write a set of tasks in parallel and block
for the immediate results, similar to the way you would do it with Hadoop (see
a reference here:
http://wiki.gigaspaces.com/display/GS66/OpenSpaces+Core+Component+-+E...
s). Since we use the space as both the data layer and the execution layer, you
get data affinity pretty much built into the model, i.e. you can decide to
route the task to where the data is and save the overhead associated with
moving that data or accessing it over the network. We also provide built-in
support for executing tasks written in Groovy, JRuby and other dynamic
languages, which adds further dynamic capabilities. One of the common use
cases is using dynamic languages as a new form of stored procedure (in a way,
you can think of SQL as a specialized case of a dynamic language as well). And
the nice thing is that it all runs on EC2 today.

Nati S.
GigaSpaces.

