It's been an interesting summer for Google's MapReduce software paradigm. I'm not going to get into the finer details of MapReduce; the general idea is that it's Google's secret sauce, the thing that lets them process their massively distributed data sets. So any company that wants to be like Google, or needs to compete with Google, should pay attention to MapReduce.
Last month Intel, HP and Yahoo announced a joint research program to examine its usage, and today Greenplum, a provider of database software for what they describe as the "next generation of data warehousing and analytics", announced support for MapReduce within its massively parallel database engine.
Greenplum's announcement that it will integrate MapReduce functionality into its enterprise-focused database is an important step toward taking MapReduce out of academic research labs and moving it to lucrative corporate users.
To give you some background, currently the two most popular open source projects built around MapReduce are Apache Hadoop and the unfortunately named Pig. For those of you who don't know about Hadoop <http://hadoop.apache.org/core/>, it is an open source platform for distributing, managing and then collecting computing work throughout a large computing cloud using MapReduce <http://labs.google.com/papers/mapreduce.html>. Pig <http://incubator.apache.org/projects/pig.html>, a Yahoo Research project currently being incubated at Apache, is a language designed to make it easier to use the Hadoop infrastructure effectively. It has been described as SQL for MapReduce, allowing queries to be written and then parallelised and run on the Hadoop platform.
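For readers who have not seen the model before, here is a minimal single-process sketch of the map, shuffle and reduce steps, using the classic word-count example. It is an illustration only and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_job) are made up for this sketch. A real framework would run the map and reduce calls on many machines and handle the shuffle itself.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(documents):
    """Run map over every document, then shuffle, then reduce per key."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["the cat sat", "the dog sat down"])
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}
```

Because each map call sees only one record and each reduce call sees only one key, the calls are independent of one another, which is what lets a framework spread them across a cluster.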
I found this quote from Greenplum's press release interesting.
"Greenplum has seamlessly integrated MapReduce into its database, making it possible for us to access our massive dataset with standard SQL queries in combination with MapReduce programs," said Roger Magoulas, Research Director, O'Reilly Media. "We are finding this to be incredibly efficient because complex SQL queries can be expressed in a few lines of Perl or Python code."
It's also interesting to note that earlier this year IBM released an Eclipse plug-in that simplifies the creation and deployment of MapReduce programs. This plug-in was developed by the team at IBM Software Group's High Performance On Demand Solutions Unit at the IBM Silicon Valley Laboratory. So it may be only a matter of time before we see IBM offer MapReduce commercially.
So what's next? Will we see a Microsoft implementation or an Oracle MapReduce? For now, MapReduce appears to be the new "coolness", and with all the industry attention it seems to be getting, I think we may be on the verge of finally seeing MapReduce enter the mainstream consciousness.
As a side note, my favorite MapReduce implementation is called Skynet. The name says it all.
From a practitioner's perspective, this is both great news and a bit troubling.
Good news, since MapReduce is a good divide-and-conquer strategy when dealing with large (continuous) datasets.
When coupled to a 'global' filesystem with a single namespace, like HDFS (the Hadoop file system) or GFS (the Google File System), you get a powerful new entity that effectively virtualizes many hardware nodes into a single operating system instance (a wide virtualization). This is the converse of Xen and VMware, where you get many instances on one hardware node (a narrow virtualization).
Because such a system is inherently fault tolerant (due to the MapReduce model and the nature of the filesystem), data centers could safely migrate batch-like workloads into it. Data warehousing is a perfect fit once you realize that RDBMS data warehouses are simply caches that have leaked up into the architecture because they are resource constrained, forcing ETL to become yet another complex part of the datacenter and the data warehouse just to load this cache.
Hadoop-like systems allow for a 'lazy evaluation' model of the data, where caching is just an attribute, not an architectural component.
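As a toy illustration of "caching as an attribute", the sketch below answers a query straight from the raw records and then turns caching on by wrapping the same function. It is not how Hadoop itself works; RAW_EVENTS and total_by_region are hypothetical names, and the in-memory tuple stands in for files on a global filesystem.

```python
from functools import lru_cache

# Hypothetical raw records; in practice these would live in HDFS or similar.
RAW_EVENTS = (
    ("us", 120), ("eu", 80), ("us", 30), ("eu", 20),
)

def total_by_region(region):
    """Compute the answer from the raw data at query time; no load step."""
    return sum(amount for r, amount in RAW_EVENTS if r == region)

# Caching is switched on by wrapping the same function, rather than by
# building a separate store that has to be loaded (ETL) before any query.
cached_total_by_region = lru_cache(maxsize=None)(total_by_region)

print(total_by_region("us"))         # 150, computed lazily from raw data
print(cached_total_by_region("us"))  # 150, first call fills the cache
print(cached_total_by_region("us"))  # 150, served from the cache
```

The query logic is identical in both cases; only the wrapper differs, which is the sense in which the cache is optional rather than a tier of the architecture.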
Troubling because 'thinking' in MapReduce sucks. If you've ever read "How to Write Parallel Programs" by Nicholas Carriero and David Gelernter (http://www.lindaspaces.com/book/), many of their thought experiments and examples are based on a house-building analogy. That is, how would you build a house in model X or model Y? These examples work because the models they present are straightforward.
That said, how do you build a house in the MapReduce model? Filtering, aggregation, and functional conversions are very simple in MapReduce, but typical workloads are more complex and end up being 2, 5, or even 20 individual MapReduce jobs chained together by dependencies (many running in parallel). Thinking that deep in MapReduce is not trivial, and the resulting code base is not simple (see Google's Sawzall papers).
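To make the chaining concrete, here is a small in-memory sketch in which the output of one job is the input of the next. The map_reduce helper and the log records are hypothetical and exist only for this illustration; on a real cluster each call would be a separately scheduled job.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: count requests per user from (user, url) log records.
def hits_mapper(record):
    user, _url = record
    yield user, 1

def sum_reducer(key, values):
    return key, sum(values)

# Job 2: depends on job 1's output; counts how many users share a hit count.
def histogram_mapper(record):
    _user, hits = record
    yield hits, 1

log = [("ann", "/a"), ("bob", "/a"), ("ann", "/b"), ("cat", "/c"), ("ann", "/a")]
hits_per_user = map_reduce(log, hits_mapper, sum_reducer)
users_per_hit_count = map_reduce(hits_per_user, histogram_mapper, sum_reducer)

print(hits_per_user)        # [('ann', 3), ('bob', 1), ('cat', 1)]
print(users_per_hit_count)  # [(1, 2), (3, 1)]
```

Even this two-step question ("how many users made N requests?") needs two jobs and a dependency between them, and nothing in the code says so except the order of the calls. That bookkeeping is what grows unwieldy as the chain gets longer.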
If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher-level abstraction that users and developers can reason in.
Pig was mentioned. Jaql from IBM is also worthy of note. Facebook is contributing Hive, which adopts many data-warehouse trappings.
But if developers wish to continue working at the API level, and need to build their own DSLs or reusable libraries for practitioners of their specific domain, Cascading is likely a better choice. (http://www.cascading.org/)
As the developer of Cascading, I would love to chat with the Greenplum developers to see how Cascading can target their MapReduce implementation to complement the current Hadoop planner.
> On Aug 26, 2008, at 3:58 PM, Reuven Cohen wrote:
I've been getting lots of private messages; it seems there are a lot of shy readers on this list. There are no stupid questions or comments, so please feel free to jump into the fray.
A few people have pointed me to a MapReduce-like research project Microsoft is working on called Dryad. Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.
Interesting, if a little overhyped. I'm not sure three or four implementations on top of it could be termed "a Rich Ecosystem". But it is certainly a step in the right (concurrent) direction. While it appears to be able to span machines and clusters, does anyone know if it can span cores within a machine, or do I need a concurrent language to achieve that?
In my experience, processor core utilization is abstracted within the OS, if you use a multi-threaded compiler and tune the application accordingly.
Chris
From: Ray Nugent Sent: Tuesday, August 26, 2008 7:43 PM
To: cloud-computing@googlegroups.com Subject: Re: Is Map/Reduce going mainstream?
I am a graduate student at Indiana University Bloomington.
In my current research work, I have analyzed the MapReduce technique using two scientific applications. I have used Hadoop and also CGL-MapReduce (a streaming-based MapReduce implementation that we developed) for this analysis.
Since the MapReduce discussion is going on, I thought I would share my experience with the MapReduce technique with you all.
So far, I have two scientific applications which I have implemented in the MapReduce programming model.
1. High Energy Physics (HEP) data analysis involving large volumes of data (I have tested up to 1 terabyte)
2. Kmeans clustering to cluster a large number of 2D data points into a predefined number of clusters.
The first application is a data/computation-intensive application that requires passing the data through a single MapReduce computation cycle. The second application is an iterative application that requires multiple executions of MapReduce cycles before the final result is obtained.
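The iterative structure of the second application can be sketched as follows. This is a single-process illustration under my own assumptions, not the Hadoop or CGL-MapReduce code used in the experiments; the function names and the four sample points are hypothetical. Each call to kmeans_cycle stands for one full MapReduce cycle, and the loop in kmeans repeats cycles until the centroids stop moving.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: (point[0] - centroids[i][0]) ** 2
                             + (point[1] - centroids[i][1]) ** 2)

def kmeans_cycle(points, centroids):
    """One MapReduce cycle: map assigns points, reduce averages each cluster."""
    groups = defaultdict(list)
    for p in points:                    # map: emit (centroid index, point)
        groups[nearest(p, centroids)].append(p)
    new_centroids = list(centroids)     # an empty cluster keeps its centroid
    for i, members in groups.items():   # reduce: mean of the assigned points
        new_centroids[i] = (sum(x for x, _ in members) / len(members),
                            sum(y for _, y in members) / len(members))
    return new_centroids

def kmeans(points, centroids, max_cycles=20):
    """Repeat MapReduce cycles until no centroid moves."""
    for cycle in range(1, max_cycles + 1):
        updated = kmeans_cycle(points, centroids)
        if updated == centroids:        # converged
            return updated, cycle
        centroids = updated
    return centroids, max_cycles

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final, cycles = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final, cycles)  # [(0.0, 1.0), (10.0, 11.0)] 2
```

In this sketch the centroids are simply passed from one cycle to the next in memory. In an implementation that exchanges intermediate results through a distributed file system, every pass of the loop would be a separate job with its own reads and writes, so any per-cycle cost is paid once per iteration.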
I implemented the HEP data analysis using Hadoop and CGL-MapReduce and compared their performance.
I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compared their performance.
I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
Below are the conclusions that we have derived so far from our experiments. (The following documents contain a detailed description of the work I did.)
1. The MapReduce technique scales well for large data/compute-intensive applications and is easy to program.
2. Google's approach (also Hadoop's approach) of transferring intermediate results via a distributed file system adds a large overhead to the computation task. However, the effect of this overhead is minimal for very large data/compute-intensive applications. (Please see Figures 6, 7 and 8 of the above paper)
3. The file-system-based communication overhead is prohibitively large in iterative MapReduce computations. (Please see Figure 9 of the above paper) There is a project under Hadoop named Mahout that intends to support iterative MapReduce computations, but it still uses Hadoop and hence will also have these overheads.
4. The approach of storing and accessing data in a specialized distributed file system works well when the data is in text formats and the Map/Reduce functions are written in a language that has APIs for accessing the file system. But for binary data, and for Map/Reduce functions written in other languages, this is not a feasible solution. A data catalog would be a better solution for these types of generic use cases.
5. Multi-threaded applications run faster on multicore platforms than MapReduce or other parallelization techniques targeted at distributed memory. Of course, MapReduce is simpler to program :)
Any thoughts and comments are welcome!
Thanks, Jaliya
From: cloud-computing@googlegroups.com [mailto:cloud-computing@googlegroups.com] On Behalf Of Chris K Wensel Sent: Tuesday, August 26, 2008 7:56 PM To: cloud-computing@googlegroups.com Subject: Re: Is Map/Reduce going mainstream?
Hey all
> I am a graduate student at Indiana University Bloomington.
> In my current research work, I have analyzed the MapReduce technique using two scientific applications. I have used Hadoop and also CGL-MapReduce (a streaming-based MapReduce implementation that we developed) for this analysis.
> Since the MapReduce discussion is going on, I thought of sharing my experience with the MapReduce technique with you all.
> So far, I have implemented two scientific applications in the MapReduce programming model:
> 1. High Energy Physics (HEP) data analysis involving large volumes of data (I have tested up to 1 terabyte).
> 2. Kmeans clustering to cluster a large number of 2D data points into a predefined number of clusters.
> The first application is a data/computation-intensive application that requires passing the data through a single MapReduce computation cycle. The second is an iterative application that requires multiple executions of MapReduce cycles before the final result is obtained.
> I implemented the HEP data analysis using Hadoop and CGL-MapReduce and compared their performance.
> I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compared their performance.
> I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
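To illustrate the difference between a single MapReduce cycle and an iterative application, here is a minimal single-process sketch of K-means written as repeated map and reduce steps. It is not the Hadoop, CGL-MapReduce, or MPI code from the study; the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_mapreduce`), the iteration cap, and the convergence tolerance are assumptions made purely for illustration.

```python
from collections import defaultdict
from math import dist  # Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all 2D points assigned to one centroid."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans_mapreduce(points, centroids, max_iters=100, tol=1e-9):
    """Run one map/reduce cycle per iteration until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iters):
        groups = defaultdict(list)
        for key, value in (kmeans_map(p, centroids) for p in points):
            groups[key].append(value)  # stands in for the shuffle phase
        # A centroid that received no points keeps its previous position.
        updated = [
            kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))
        ]
        moved = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if moved <= tol:
            break
    return centroids
```

Each pass through the loop corresponds to one full MapReduce cycle, which is why the per-cycle startup cost of a runtime matters far more for this kind of application than for a single-pass analysis.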
It might be interesting to do a quick test and see what the performance overhead of using virtual servers is compared to physical servers.
-- Jim Starkey President, NimbusDB, Inc. 978 526-1376
> It might be interesting to do a quick test and see what the performance overhead of using virtual servers compared to physical servers.
Yes, we are planning to do a thorough test as well. I have some initial results on the overhead of virtual machines for MPI. This is what we did.
We took the same MPI program that is used for Kmeans clustering and modified it so that it iterates a given number of times at each MPI communication routine. Then we measured the performance on a VM setup (using the "Nimbus" cloud) and also on an identical set of machines without VMs.
The VMs introduced high overhead to the MPI communication. We ran a few tests, changing the configuration (such as the number of processor cores used in a VM and the number of VMs used), but still could not pin down the reason for this high overhead. It could be overhead induced by the network emulator or by the network configuration used in Nimbus. We are setting up a cloud at IU, and once it is done we will be able to verify these figures.
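The measurement described above (repeat a communication step a fixed number of times, then compare wall-clock time across two setups) can be sketched roughly as follows. This is not the actual test program: `time_repeated` and `relative_overhead` are hypothetical helper names, and the timed callable merely stands in for whatever MPI routine is being exercised.

```python
import time


def time_repeated(operation, repeats):
    """Return the wall-clock seconds taken to call `operation` `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        operation()
    return time.perf_counter() - start


def relative_overhead(vm_seconds, native_seconds):
    """Fractional slowdown of the VM run relative to the native run."""
    if native_seconds <= 0:
        raise ValueError("native_seconds must be positive")
    return (vm_seconds - native_seconds) / native_seconds
```

With made-up timings of 12.0 s on the VM setup and 10.0 s on bare hardware, `relative_overhead(12.0, 10.0)` comes out at roughly 0.2, i.e. a 20% slowdown.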
> -----Original Message----- > From: cloud-computing@googlegroups.com [mailto:cloud- > computing@googlegroups.com] On Behalf Of Jim Starkey > Sent: Wednesday, August 27, 2008 9:45 AM > To: cloud-computing@googlegroups.com > Subject: Re: Is Map/Reduce going mainstream?
> [...]
Greenplum and Aster Data both recently announced MapReduce within SQL for software-only distributed data warehouses on commodity hardware -- highly parallel, fault tolerant -- great arguments. I've talked with both vendors. These directions are great to see for several reasons, and I'll likely become a paying customer. There are cautions as well, for other reasons.
Going beyond the "It's hard to think in MapReduce" point which GridGurus made ... As an engineering manager at a firm invested in using lots of SQL, lots of Hadoop, lots of other code written in neither of the above, and many terabytes of data added each week -- my primary concern is more about how well a *team* of people can "think" within a paradigm over a *long* period of time.
Much of what I've seen discussed about Hadoop seems based around an individual programmer's initial experiences. That's great to see. However, what really sells a paradigm is when you can evaluate from a different perspective, namely hindsight.
Consider this as an exercise: imagine that a programmer wrote 1000 lines of code for a project at work based on MapReduce, the company has run that code in some critical use case for 2-3 years, the original author moved to another company (or the company was acquired), and now a team of relatively new eyeballs must reverse engineer the existing MapReduce code and extend it.
A software lifecycle argument is not heard so much in discussions about MapReduce, but IMHO it will become more crucial for decision makers as vendors move to providing SQL, etc.
As a different example, I don't need to argue with exec mgmt or board members that reworking 10000 lines of PHP with SQL embedded is going to cost more than reworking the same app if it'd been written in Java and Hibernate. Or having it written in Python, or whatever other more-structured language makes sense for the app. The business decision makers already understand those costs, all too well.
Keep in mind that a previous generation of data warehousing (e.g., Netezza) promised parallel processing based on SQL-plus-extensions atop custom hardware -- highly-parallel, somewhat less fault-tolerant, but still an advancement. IMHO, what we see now from Greenplum, Aster Data, Hive, etc., represents a huge leap beyond that previous generation. Particularly for TCO.
Pros: if you have terabytes of data collected from your different business units, then those business units will want to run ad-hoc queries on it, and they won't have time/expertise/patience to write code for Hadoop. I'd rather give those users means to run some SQL and be done -- than disrupt my team from core work on algorithms.
Cons: if you have a complex algorithm which is part of the critical infrastructure and must be maintained over a long period of time by a team of analysts and developers, then RUN AWAY, DO NOT WALK, from anyone trying to sell you a solution based in SQL.
I'd estimate the cost of maintaining substantial apps in SQL to be 10-100x that of the more structured languages used with Hadoop: Java, C++, Python, etc. That's based on encountering the problem over and over for the past 10 years. Doubtless there are studies by people with much broader views and better data, much to the same effect.
The challenge to Aster Data, Greenplum, et al., will be to provide access to their new systems from within languages like Java, Python, etc. Otherwise I see them relegated to the land of ad-hoc queries, being eclipsed by coding paradigms which are less abstracted or more geared toward software costs long-term.
PIG provides an interesting approach and I'm sure that it will be useful for many projects. Not clear that it provides the answers to software lifecycle issues.
From what I've seen through the years in other areas of large scale data, such as ETL and analytics, first off the data is not particularly relational. The concept of workflows seems much more apt -- more likely to become adopted by the practitioners in quantity once we get past the "slope of enlightenment" phase of the hype cycle.
Our team has been considering a shift to Cascading. I'd much rather train a stats PhD on Cascading than have them slog through 50x lines of SQL which was written by the previous stats PhD.
Having access to DSLs from within application code -- e.g., R for analytics, Python for system programming, etc. -- that would also be much more preferable from my perspective than MapReduce within SQL.
Paco
On Tue, Aug 26, 2008 at 6:56 PM, Chris K Wensel <ch...@wensel.net> wrote:
> Hey all
> From a practitioner's perspective, this is both great news and a bit troubling.
> Good news since MapReduce is a good divide-and-conquer strategy when dealing with large (continuous) datasets.
> When coupled to a 'global' filesystem with a single namespace like HDFS (the Hadoop file system) or GFS (the Google File System), you get a new powerful entity that effectively virtualizes many hardware nodes into a single operating system instance (a wide virtualization), the converse of Xen and VMware, where you get many instances on one hardware node (a narrow virtualization).
> Because the system is inherently fault tolerant (due to the MapReduce model and the nature of the filesystem), data centers could safely migrate batch-like workloads into it. Data warehousing is a perfect fit when you realize RDBMS data warehouses are simply caches that have leaked up into the architecture due to their being resource constrained, forcing ETL to be yet another complex part of the datacenter and data warehouse to load this cache.
> Hadoop-like systems allow for a 'lazy evaluation' model of the data, where caching is just an attribute, not an architectural component.
> Troubling because 'thinking' in MapReduce sucks. If you've ever read "How to write parallel programs" by Nicholas Carriero and David Gelernter (http://www.lindaspaces.com/book/), many of their thought experiments and examples are based on a house-building analogy. That is, how would you build a house in X model or Y model. These examples work because the models they present are straightforward.
> That said, how do you build a house in the MapReduce model? Filtering, aggregation, and functional conversions are very simple in MapReduce, but typical workloads are more complex, and end up being 2-5-20 individual MapReduce jobs chained together by dependencies (many running in parallel).
> Thinking that deep in MapReduce is not trivial, and the resulting code base is not simple (see Google's Sawzall papers).
> If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher-level abstraction that users and developers can reason in.
> Pig was mentioned. Jaql from IBM is also worthy of note. Facebook is contributing Hive, which adopts many data-warehouse trappings.
> But if developers wish to continue working at the API level, and need to build their own DSLs or reusable libraries for practitioners of their specific domain, Cascading is likely a better choice. (http://www.cascading.org/)
> As the developer of Cascading, I would love to chat with the Greenplum developers to see how Cascading can target their MapReduce implementation to complement the current Hadoop planner.
> cheers,
> chris
> --
> Chris K Wensel
> ch...@wensel.net
> http://chris.wensel.net/
> http://www.cascading.org/
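To make the point about chained jobs concrete, here is a toy in-memory sketch of two MapReduce jobs in which the second consumes the output of the first. It is neither Hadoop nor Cascading code; `run_job` and the mapper and reducer names are hypothetical, and the shuffle phase is simulated with a dictionary.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory MapReduce job: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def count_mapper(line):
    """Job 1 map: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, ones):
    """Job 1 reduce: total occurrences of a word."""
    return (word, sum(ones))


def by_count_mapper(pair):
    """Job 2 map: re-key each (word, count) pair by its count."""
    word, count = pair
    return [(count, word)]


def by_count_reducer(count, words):
    """Job 2 reduce: collect the words that share a count."""
    return (count, sorted(words))


def words_grouped_by_frequency(lines):
    """Chain the two jobs; job 2 depends on the full output of job 1."""
    counts = run_job(lines, count_mapper, count_reducer)
    return run_job(counts, by_count_mapper, by_count_reducer)
```

Even this two-step pipeline needs four small functions plus glue code, which hints at why chains of many jobs are easier to manage through a higher-level abstraction.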
On Wednesday 27 August 2008 05:57:05 Jaliya Ekanayake wrote:
> In my current research work, I have analyzed the MapReduce technique using two scientific applications. I have used Hadoop and also CGL-MapReduce (A streaming based MapReduce implementation that we developed) for the aforementioned analysis.
This is very interesting work.
> The [HEP] application is data/computation intensive application that requires passing the data through a single MapReduce computation cycle.
What HEP application were you using? Was it doing Monte Carlo generation (MC), reconstruction (reco), analysis, or a mixture of these?
> I implemented the HEP data analysis using Hadoop and CGL-MapReduce and compare their performances [...]
Did you (or are planning to) compare this with PROOF: http://root.cern.ch/twiki/bin/view/ROOT/PROOF
> [Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
Ta for the link.
Cheers,
Paul.
> > [...] using two scientific applications. I have used Hadoop and also CGL-MapReduce (A streaming based MapReduce implementation that we developed) for the aforementioned analysis.
> This is very interesting work.
Thanks :)
> > The [HEP] application is data/computation intensive application that requires passing the data through a single MapReduce computation cycle.
> What HEP application were you using? Was it doing Monte Carlo generation (MC), reconstruction (reco), analysis, or a mixture of these?
We used Monte Carlo data to find the "Higgs". The analysis code and the data are both from the High Energy Physics group at Caltech, and we just used it as a use case for our MapReduce testing.
> > I implemented the HEP data analysis using Hadoop and CGL-MapReduce and compare their performances [...]
This will be an interesting comparison. PROOF is only for parallelizing ROOT applications. Right now our focus is on the generalized framework for supporting these data analyses.
Our team has used Hadoop to implement some basic spatial,
location-related algorithms, e.g., the 2D convex hull, and now some
more advanced ones.
From the view of a former "traditional" parallel program developer
who worked with message passing, I don't think map/reduce is a totally
different animal from HPC tools such as MPI/PVM/MW. Its built-in tight
coupling with a distributed file and data I/O layer, automatic
task-availability assurance, a simpler API, and, most important, the
disk I/O bottleneck in a single box make it the most popular among
developers, though I would still say Condor/PBS/SGE + MPI + MPI-IO +
some parallel filesystem over a COTS cluster can do almost the same
job because of the divide-and-conquer nature of some of the workload.
It seems some things are invented again! :)
However, from the point of view of programming languages, LINQ,
Sawzall, and Pig are driving more adoption of functional-style
languages in parallel/performance computing. They are quite different
from those of HPC!
Comments are welcome! :)
On Aug 28, 9:53 pm, "Jaliya Ekanayake" <jekan...@cs.indiana.edu>
wrote:
> [...]
IMO the MapReduce programming paradigm simplifies parallel programming. There may be things that are hard to do in MapReduce compared to traditional parallel programming, but overall it will be a very good alternative. However, it is always good to use the right implementation of MapReduce for the tasks that we need to parallelize. For example, although Google's and Hadoop's way of doing MapReduce (I am not talking about the API but rather the underlying architecture) suits heavily I/O-bound problems well, we cannot expect all problems to be of this nature. Some problems may have a moderate amount of I/O but heavy computational requirements. Some may require iterative application of MapReduce. The MapReduce programming model is something we can apply to all these problems, but if we use a MapReduce implementation that is targeted at heavily I/O-bound applications to solve a problem that is more computation/communication intensive, we may not get the parallel performance that we expected.
Therefore, what I suggest is that once we have a parallel implementation (MapReduce or any other), we need to have at least a rough idea of the efficiency (or speedup) achieved by our solution and see how much overhead is introduced by the parallelization technique itself.
Thanks
Jaliya
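The check suggested above can be sketched with the standard definitions of speedup (sequential time divided by parallel time) and parallel efficiency (speedup divided by the number of processing units). The function names and the timings in the note below are illustrative assumptions, not measurements from this thread.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Speedup S = T_1 / T_p."""
    if parallel_seconds <= 0:
        raise ValueError("parallel_seconds must be positive")
    return sequential_seconds / parallel_seconds


def efficiency(sequential_seconds, parallel_seconds, workers):
    """Parallel efficiency E = S / p, where p is the number of workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup(sequential_seconds, parallel_seconds) / workers


def parallel_overhead(sequential_seconds, parallel_seconds, workers):
    """Total overhead p * T_p - T_1: time spent beyond the useful sequential work."""
    return workers * parallel_seconds - sequential_seconds
```

With made-up timings of 100 s sequentially and 20 s on 8 workers, the speedup is 5, the efficiency is 0.625, and the total overhead is 60 worker-seconds, which is the share attributable to the parallelization technique, communication, and idle time.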
> -----Original Message----- > From: cloud-computing@googlegroups.com [mailto:cloud- > computing@googlegroups.com] On Behalf Of Jeff He > Sent: Thursday, August 28, 2008 11:23 PM > To: Cloud Computing > Subject: Re: Is Map/Reduce going mainstream?
> [...]
While I agree with your thoughts, I think it is not I/O vs.
computational aspect that drives MapReduce - MapReduce will work well in
both situations. MapReduce can be applied to
little-I/O-large-computational problems - it will resemble the grids.
What drives MapReduce is the scale and nature of the problem. Any
massively parallel, functionally decomposable problem will achieve
exponential performance advantage with MapReduce - to I/O or not to I/O.
|From: cloud-computing@googlegroups.com [mailto:cloud-
|computing@googlegroups.com] On Behalf Of Jaliya Ekanayake
|Sent: Friday, August 29, 2008 7:38 AM
|To: cloud-computing@googlegroups.com
|Subject: RE: Is Map/Reduce going mainstream?
|> > > Did you (or are planning to) compare this with PROOF:
|> > > http://root.cern.ch/twiki/bin/view/ROOT/PROOF
|> >
|> > This will be an interesting comparison. PROOF is only for
|> > parallelizing ROOT applications.
|> > Right now our focus is on the generalized framework for supporting
|> > these data analyses.
|> >
|> > > > [Draft Paper] http://www.cs.indiana.edu/~jekanaya/draft.pdf
|> >
|> > > Ta for the link.
|> >
|> > Thanks,
|> > Jaliya
|> >
|> > > Cheers,
|> > > Paul.
So let's look at the two features of MapReduce.
1) *High level interface*. Some problems, including obviously very important ones demonstrated by Google, can be implemented elegantly in such functional languages. However, we were told 25 years ago that functional languages would revolutionize all parallel computing. That was found to be false, as many problems are poorly represented in this fashion. Further, the success of MapReduce has been seen before in many of the hundreds of workflow approaches with a variety of visual and language interfaces. So this part of MapReduce is useful, but not so general and not so novel.
2) *Runtime*. Here I hope and expect Google has a great implementation. On the other hand, Hadoop has many limitations, and MPI (possibly augmented by well-studied fault tolerance mechanisms) has some advantages. It is much higher performance and, as the synchronization mechanism is distributed, has natural scaling. MPI of course does scale to many tens of thousands of cores; it will continue to scale on much larger systems. The scaling is intrinsic to its architecture, as I assume it is to Google's implementation.
Krishna Sankar (ksankar) wrote:
> While I agree with your thoughts, I think it is not I/O vs. computational aspect that drives MapReduce - MapReduce will work well in both situations. MapReduce can be applied to little-I/O-large-computational problems - it will resemble the grids. What drives MapReduce is the scale and nature of the problem. Any massively parallel, functionally decomposable problem will achieve exponential performance advantage with MapReduce - to I/O or not to I/O.
> Cheers
> <k/>
--
Geoffrey Fox g...@indiana.edu FAX 8128567972 http://www.infomall.org
Phones: Cell 812-219-4643, Home 8123239196, Lab 8128567977
SkypeIn 812-669-0772 with voicemail, International cell 8123910207
> While I agree with your thoughts, I think it is not the I/O vs. computational aspect that drives MapReduce - MapReduce will work well in both situations.
Yes, MapReduce will work in both situations. However, my point is that the "exponential performance advantage" that we expect may not be achievable unless we use an implementation of MapReduce that suits the class of application.
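To make the earlier suggestion about checking speedup, efficiency, and overhead concrete, here is a minimal sketch. It uses the textbook definitions (speedup = serial time / parallel time, efficiency = speedup / workers, total overhead = workers x parallel time - serial time); the function name and the timings are hypothetical and are not measurements from the Hadoop or CGL-MapReduce runs discussed in this thread.

```python
def parallel_metrics(t_serial: float, t_parallel: float, workers: int) -> dict:
    """Return speedup, efficiency and total overhead for one parallel run."""
    if t_serial <= 0 or t_parallel <= 0 or workers < 1:
        raise ValueError("timings must be positive and workers >= 1")
    speedup = t_serial / t_parallel
    efficiency = speedup / workers
    # Total overhead: worker-seconds spent beyond the serial work itself
    # (scheduling, communication, I/O staging, idle time, ...).
    overhead = workers * t_parallel - t_serial
    return {"speedup": speedup, "efficiency": efficiency, "overhead": overhead}


# Hypothetical timings: 1200 s serially, 200 s on 8 workers.
metrics = parallel_metrics(1200.0, 200.0, 8)
print(metrics)  # speedup 6.0, efficiency 0.75, overhead 400.0
```

Running the same calculation for two implementations on the same problem is one rough way to see how much of the lost efficiency comes from the framework rather than from the algorithm.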
Another important factor for MapReduce's popularity (and effectiveness) is that it matches the target application environment -- that is to say, massive data that may be distributed and can accordingly be partitioned and processed independently. Coupling this well-known functional programming paradigm with a distributed file system for accessing all that data is therefore highly effective for this class of applications, which is now being enabled by the increasingly connected masses of information being collected by various organizations.
So MapReduce is "going mainstream" only in the sense that the "market" for such applications is growing.
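For readers who have not seen the paradigm spelled out, the partition / process independently / combine pattern described above can be sketched in a few lines. This is a single-process illustration only, with hypothetical function names and toy data; a real framework would run the map calls on separate nodes next to the data and do the grouping step itself.

```python
from collections import defaultdict
from functools import reduce


def map_phase(partition):
    """Emit (word, 1) pairs for one partition, independently of the others."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Fold each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Two toy partitions standing in for blocks of a distributed file.
partitions = [
    ["the cloud is big", "the data is bigger"],
    ["map the data", "reduce the data"],
]
pairs = [pair for part in partitions for pair in map_phase(part)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"], counts["data"])  # 4 3
```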
l...@aero.org wrote:
> Another important factor for MapReduce's popularity (and effectiveness) is that it matches the target application environment -- that is to say, massive data that may be distributed and can be accordingly partitioned and processed independently. Hence, coupling this well-known, functional programming paradigm with a distributed file system for accessing all that data is highly effective for this class of applications that is now being enabled by the increasingly connected infomasses being collected by various organizations.
> Hence, to say that MapReduce is "going mainstream" is only relative to the fact that the size of the "market" for such applications is growing.
Most certainly so, but it is also abused in inappropriate applications. It is occasionally used, for example, to parallelize a search of a large table. A smarter way to handle this would be an index. So let's be wary of a solution looking for a problem....
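A small sketch of the difference, assuming a hypothetical in-memory table of (order_id, customer) rows: the scan touches every row on every query (which is the work a parallel search would spread over nodes), while the index is built once and then answers each lookup directly.

```python
from collections import defaultdict

# Hypothetical table: 100,000 (order_id, customer) rows.
rows = [(i, f"customer-{i % 1000}") for i in range(100_000)]


def scan(table, customer):
    """Full scan: touches every row on every query."""
    return [order_id for order_id, name in table if name == customer]


def build_index(table):
    """One-time pass that maps each customer to its order ids."""
    index = defaultdict(list)
    for order_id, name in table:
        index[name].append(order_id)
    return index


index = build_index(rows)
by_scan = scan(rows, "customer-42")   # O(rows) per query
by_index = index["customer-42"]       # dictionary lookup per query
print(by_scan == by_index, len(by_index))  # True 100
```

A database index (a B-tree, say) plays the same role as the dictionary here; the point is only that repeated lookups do not need a cluster if the access path is right.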
Roderick Flores wrote:
On Tue, Aug 26, 2008 at 5:56 PM, Chris K Wensel <ch...@wensel.net> wrote:
> That said, how do you build a house in the MapReduce model? Filtering, aggregation, and functional conversions are very simple in MapReduce, but typical workloads are more complex, and end up being 2-5-20 individual MapReduce jobs chained together by dependencies (many running in parallel). Thinking that deep in MapReduce is not trivial, and the resulting code base is not simple (see Google's Sawzall papers).
> If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher level abstraction that users and developers can reason in.
I agree that typical MapReduce workloads are far more complicated than the word-count class of examples we read about makes them seem. The implementations I have built in the past involve workflows that branch depending on the results of several parameterizations of the data and the whims of the users. What if the user has a choice of reduction algorithms? You don't want to rerun the mapped calculation for each one. Similarly, the user could take the calculations from two different mapping algorithms and run the output through the same reduction algorithm. If you then allow the user to chain any permutation of different map and reduction algorithms into a final result, you have quite the problem.
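A minimal sketch of the reuse problem described above, assuming toy map and reduce algorithms with hypothetical names. Each map result is computed once and cached, so trying every (map, reduce) combination costs one map run per mapping algorithm rather than one per combination. A real system would persist the mapped output to storage instead of holding it in memory.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def mapped(algorithm: str, dataset: tuple) -> tuple:
    """Run one (hypothetical) mapping algorithm; cached per (algorithm, dataset)."""
    if algorithm == "square":
        return tuple(x * x for x in dataset)
    if algorithm == "double":
        return tuple(2 * x for x in dataset)
    raise ValueError(f"unknown map algorithm: {algorithm}")


# Hypothetical reduction algorithms the user can choose between.
REDUCERS = {"sum": sum, "max": max}


def run(map_name: str, reduce_name: str, dataset: tuple):
    """Apply the chosen reducer to the (possibly cached) mapped output."""
    return REDUCERS[reduce_name](mapped(map_name, dataset))


data = (1, 2, 3, 4)
results = {
    (m, r): run(m, r, data)
    for m in ("square", "double")
    for r in ("sum", "max")
}
print(results[("square", "sum")], results[("double", "max")])  # 30 8
print(mapped.cache_info().misses)  # 2 map runs for 4 combinations
```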
As you can imagine, this complicated workflow strews enormous amounts of data throughout the storage fabric, all of which needs to be tracked for the users -- how can a user be expected to distinguish the hundred thousand files associated with one set of input parameters for a particular algorithm from the millions of others generated not only by themselves but by others? This snowballs into quite the management challenge. What if one thread amongst the thousands invoked across all of the distributed nodes fails due to something like a write timeout while the others quietly move forward? It would be nice to only need to recalculate that one bit rather than rerunning the entire job. Using a SQL database is a good approach to tracking all of the metadata associated with these sorts of applications -- filesystems simply aren't made for this type of work.
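As a rough illustration of the metadata-tracking idea, here is a sketch using SQLite from the Python standard library. The table layout, job names, parameters, and paths are all hypothetical; the point is only that "which outputs belong to this parameter set" and "which tasks need rerunning" become simple queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_output (
        job_id  TEXT NOT NULL,
        task_id INTEGER NOT NULL,
        params  TEXT NOT NULL,
        path    TEXT NOT NULL,
        status  TEXT NOT NULL,
        PRIMARY KEY (job_id, task_id)
    )
    """
)


def record(job_id, task_id, params, path, status):
    """Insert or update the bookkeeping row for one task."""
    conn.execute(
        "INSERT OR REPLACE INTO task_output VALUES (?, ?, ?, ?, ?)",
        (job_id, task_id, params, path, status),
    )


# Hypothetical job of five tasks in which task 3 hit a write timeout.
for task_id in range(5):
    status = "failed" if task_id == 3 else "done"
    record("job-a", task_id, "window=10", f"/data/job-a/part-{task_id:05d}", status)


def tasks_to_rerun(job_id):
    """Only the tasks that did not finish need to be recalculated."""
    rows = conn.execute(
        "SELECT task_id FROM task_output "
        "WHERE job_id = ? AND status <> 'done' ORDER BY task_id",
        (job_id,),
    )
    return [task_id for (task_id,) in rows]


def outputs_for(params):
    """Find the finished output files that belong to one parameter set."""
    rows = conn.execute(
        "SELECT path FROM task_output "
        "WHERE params = ? AND status = 'done' ORDER BY path",
        (params,),
    )
    return [path for (path,) in rows]


retry = tasks_to_rerun("job-a")
finished = outputs_for("window=10")
print(retry, len(finished))  # [3] 4
```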
Further, I believe that there are other operational constraints, beyond these more advanced treatments, that MapReduce frameworks should be expected to handle. First and foremost is the expectation that the number of nodes will be limited. Not everyone is willing to, let alone can, run their operations on a utility computing infrastructure. Secondly, jobs may have different priorities depending on any number of factors, including but not limited to customer SLA, user group priority, and production versus development runs. Therefore a particular job might be put on the back-burner for days on end, even if it has started executing. We can also conclude that jobs can and should be broken up into more pieces than there are available nodes. In particular, users shouldn't be forced to carve off chunks of data larger than they would like simply because they are node-limited. Ultimately we know that each calculation will be done as resources become available.
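A toy simulation of these constraints, assuming a fixed pool of nodes and tasks tagged with a priority (lower number runs first). The function and task names are hypothetical, and this is not how any particular MapReduce scheduler or resource manager works; it only shows that when jobs are cut into more pieces than there are nodes, a lower-priority job naturally waits between pieces while higher-priority work drains.

```python
import heapq
from itertools import count


def schedule(tasks, nodes):
    """Simulate running prioritized tasks on a limited number of nodes.

    tasks: list of (priority, name, duration); lower priority number runs first.
    Returns a list of (name, node, start, end) in dispatch order.
    """
    order = count()  # tie-breaker that keeps submission order within a priority
    pending = [(prio, next(order), name, dur) for prio, name, dur in tasks]
    heapq.heapify(pending)
    free_at = [(0.0, node) for node in range(nodes)]  # (time node becomes free, node)
    heapq.heapify(free_at)
    timeline = []
    while pending:
        _, _, name, dur = heapq.heappop(pending)   # most urgent task
        start, node = heapq.heappop(free_at)       # earliest available node
        end = start + dur
        timeline.append((name, node, start, end))
        heapq.heappush(free_at, (end, node))
    return timeline


# Five pieces of work, two nodes: production (0) outranks development (1).
tasks = [
    (1, "dev-0", 2),
    (0, "prod-0", 3),
    (0, "prod-1", 3),
    (1, "dev-1", 2),
    (0, "prod-2", 1),
]
timeline = schedule(tasks, nodes=2)
for entry in timeline:
    print(entry)
print(max(end for _, _, _, end in timeline))  # 6.0
```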
Adding a simple framework to a SQL database is a big step forward for these frameworks. But that is just one step amongst the many that are necessary to make MapReduce frameworks truly usable by the community at large. The complicated operational needs that users have beg for the removal of the simplistic schedulers from inside MapReduce implementations in favor of a global and more feature-rich resource manager. Cheers!
>> It might be interesting to do a quick test and see what the performance overhead of using virtual servers is compared to physical servers.
> Yes, we are planning to do a thorough test as well. I have some initial results on the overhead of virtual machines for MPI. This is what we did.
> We took the same MPI program that we use for Kmeans clustering and modified it so that it iterates a given number of times at each MPI communication routine. Then we measured the performance on a VM setup (using the "Nimbus" cloud) and also on an identical set of machines without VMs.
> The VMs introduced high overhead to the MPI communication. We ran a few tests with different configurations, such as the number of processor cores used in a VM and the number of VMs used, but still could not pin down the reason for this high overhead. It could be the overhead induced by the network emulator or the network configuration used in Nimbus. We are setting up a cloud at IU and, once it is done, we will be able to verify these figures.
> Thanks,
> Jaliya
>> -----Original Message-----
>> From: cloud-computing@googlegroups.com [mailto:cloud-computing@googlegroups.com] On Behalf Of Jim Starkey
>> Sent: Wednesday, August 27, 2008 9:45 AM
>> To: cloud-computing@googlegroups.com
>> Subject: Re: Is Map/Reduce going mainstream?
>> Jaliya Ekanayake wrote:
>> > Hi All,
>> > I am a graduate student at Indiana University Bloomington.
>> > In my current research work, I have analyzed the MapReduce technique using two scientific applications. I have used Hadoop and also CGL-MapReduce (a streaming-based MapReduce implementation that we developed) for this analysis.
>> > Since the MapReduce discussion is going on, I thought of sharing my experience with the MapReduce technique with you all.
>> > So far, I have two scientific applications which I have implemented in the MapReduce programming model.
>> > 1. High Energy Physics (HEP) data analysis involving large volumes of data (I have tested up to 1 terabyte)
>> > 2. Kmeans clustering to cluster a large number of 2D data points into a predefined number of clusters.
>> > The first application is a data/computation-intensive application that requires passing the data through a single MapReduce computation cycle. The second application is an iterative application that requires multiple executions of MapReduce cycles before the final result is obtained.
>> > I implemented the HEP data analysis using Hadoop and CGL-MapReduce and compared their performance.
>> > I implemented the Kmeans clustering using Hadoop, CGL-MapReduce, and MPI and compared their performance.
>> > I also performed the Kmeans clustering using CGL-MapReduce, MPI, and Java threads on multi-core computers.
>> It might be interesting to do a quick test and see what the performance overhead of using virtual servers is compared to physical servers.
>> --
>> Jim Starkey
>> President, NimbusDB, Inc.
>> 978 526-1376
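The iterative case mentioned in the quoted message (Kmeans needing several MapReduce cycles before the result is final) can be sketched as follows. This is a single-process illustration with hypothetical function names and toy 2D points; it is not the Hadoop, CGL-MapReduce, or MPI code discussed in the thread. Each cycle is one map (assign every point to its nearest centroid) followed by one reduce (average the points per centroid), and the driver repeats the cycle until the centroids stop moving.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to a 2D point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2 + (point[1] - centroids[i][1]) ** 2,
    )


def map_phase(points, centroids):
    """Map: emit (centroid index, point) for every point."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs, centroids):
    """Reduce: replace each centroid by the mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    updated = list(centroids)  # a centroid with no points keeps its position
    for idx, members in groups.items():
        updated[idx] = (
            sum(x for x, _ in members) / len(members),
            sum(y for _, y in members) / len(members),
        )
    return updated


def kmeans(points, centroids, max_cycles=20, tol=1e-9):
    """Driver: run MapReduce cycles until the centroids move less than tol."""
    cycle = 0
    for cycle in range(1, max_cycles + 1):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids, cycle


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, cycles = kmeans(points, [(0, 0), (10, 10)])
print(centroids, cycles)
```

In a framework that writes every cycle's output to a distributed file system and re-reads it for the next cycle, the per-cycle startup and I/O cost is paid on every iteration, which is one reason the choice of implementation matters for this class of problem.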
Horses for courses. There is discrete data and continuous data. Discrete data can be easily partitioned, and I see MapReduce as overkill for this scenario (e.g., student exam results). However, continuous data (e.g., a video stream) is not easily partitioned and does require tools like this.
Having written a graph/map API, I want to see more developers making use of graph theory, though I think MapReduce might be the wrong product for you :)