Project for GSoC

tsek...@gmail.com

Feb 25, 2014, 3:12:10 AM

to stratosp...@googlegroups.com

Dear Stratosphere devs and fellow GSoC potential students,

Hello!

I'm Artem, an undergraduate student from Athens, Greece. You can find me on GitHub (https://github.com/atsikiridis) and occasionally on Stack Overflow (http://stackoverflow.com/users/2568511/artem-tsikiridis). Currently, however, I'm in Switzerland, where I am doing my internship at CERN as a back-end software developer for INSPIRE, a library for High Energy Physics (we're running on http://inspirehep.net/). The service is written in Python (based on the open-source project http://invenio.net), and my responsibilities are mostly the integration with Redis, database abstractions, testing (unit, regression) and helping our team integrate modern technologies and frameworks into the current code base.

Moreover, I am very interested in big data technologies, so before coming to CERN I took my first steps in research at the Big Data lab of AUEB, my home university. The main objective of the project I was involved with was the implementation of a dynamic caching mechanism for Hadoop (in a way, trying our cache instead of the built-in distributed cache). Other technologies involved were Redis, Memcached and Ehcache (Terracotta). With this project we gained some insights into the internals of Hadoop (new API, old API, how tasks work, Hadoop serialization, the daemons running etc.) and HDFS, and deployed clusters on cloud computing platforms (OpenStack with Nova, Amazon EC2 with boto). We also used the Java Remote API for some tests.

Unfortunately, I have not used Stratosphere before in a research/production environment. I have only played with the examples on my local machine. It is very interesting and I would love to learn more.

There will probably be a learning curve for me on the Stratosphere side but implementing a Hadoop Compatibility Layer seems like a very interesting project and I believe I can be of use :)

Finally, I was wondering whether there are some command-line tools for deploying Stratosphere automatically on EC2 or OpenStack clouds (for example, Stratosphere-specific abstractions on top of the Python boto API). Do you think that would make sense as a project?

Pardon me for the length of this.

Kind regards,
Artem

fhu...@gmail.com

Feb 25, 2014, 4:20:10 AM

to stratosp...@googlegroups.com, tsek...@gmail.com

Hi Artem,

thanks a lot for your interest in Stratosphere and participating in our GSoC projects!

As you know, Hadoop is the big elephant out there in the Big Data jungle and is widely adopted. Therefore, a Hadoop compatibility layer is a very important feature for any large-scale data processing system.
Stratosphere builds on the foundations of MapReduce but generalizes its concepts and provides a more efficient runtime.
When you have a look at the Stratosphere WordCount example program, you will see that the programming principles of Stratosphere and Hadoop MapReduce are quite similar, although Stratosphere is not compatible with the Hadoop interfaces.
With the proposed project we want to make it possible to execute Hadoop MapReduce jobs on Stratosphere without changing a line of code (if possible).

We already have some pieces for that in place. InputFormats are done (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility), OutputFormats are work in progress. The biggest missing piece is executing Hadoop Map and Reduce tasks in Stratosphere. Hadoop provides quite a few interfaces (e.g., overriding the partitioning function and sorting comparators, counters, distributed cache, ...). It would of course be desirable to support as many of these interfaces as possible, but they can be added step by step once the first Hadoop jobs are running on Stratosphere.
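
To make the list of interfaces above a bit more concrete, here is a minimal Python sketch of where a user-supplied partitioning function and sort order plug into a map/shuffle/reduce run. This is neither Hadoop nor Stratosphere code: all names (`run_job`, `partitioner`, `sort_key`, ...) are hypothetical, and the sort order is given as a key function rather than a real comparator class to keep the example short.

```python
from collections import defaultdict


def hash_partitioner(key, num_partitions):
    # Default behaviour: spread keys over partitions by hash.
    return hash(key) % num_partitions


def run_job(records, mapper, reducer, num_partitions=2,
            partitioner=hash_partitioner, sort_key=None):
    """Run map -> partition -> sort -> reduce in memory.

    `partitioner` and `sort_key` stand in for the user-overridable
    partitioning function and sorting comparator (assumed names).
    """
    # Map phase: each input record may emit any number of (key, value) pairs.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in mapper(record):
            partitions[partitioner(key, num_partitions)][key].append(value)

    # Reduce phase: within each partition, keys are visited in sorted order.
    output = []
    for partition in partitions:
        for key in sorted(partition, key=sort_key):
            output.append(reducer(key, partition[key]))
    return output


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return (key, sum(values))


def first_letter_partitioner(key, num_partitions):
    # Custom partitioner: words starting with a-m go to partition 0.
    return (0 if key[0].lower() <= "m" else 1) % num_partitions
```

Calling `run_job(lines, word_mapper, sum_reducer, partitioner=first_letter_partitioner)` routes keys deterministically, which is the kind of behaviour an existing job may silently rely on and a compatibility layer would have to preserve.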

Regarding your question about cloud deployment scripts, one of our team members is currently working on this (see this thread: https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
I am not sure whether this is still in the making or already done. If you are interested in this as well, just drop a line in the thread. Although I am not very familiar with the details, my gut feeling is that this would be a bit too little for an individual project. However, there might be ways to extend it. So if you have any ideas, share them with us and we will be happy to discuss them.

Again, thanks a lot for your interest and please don't hesitate to ask questions. :-)

Best,
Fabian

Stephan Ewen

Feb 25, 2014, 8:54:04 AM

to stratosp...@googlegroups.com, tsek...@gmail.com

Hi Artem!

Nice to see that you are interested in working with us!

Fabian has said most things already. If you want to get Stratosphere up and running in the cloud to play around with it, you can also have a look at this blog post explaining how to use Stratosphere with Amazon's Elastic MapReduce: http://stratosphere.eu/blog/tutorial/2014/02/18/amazon-elastic-mapreduce-cloud-yarn.html

Greetings,

Stephan

tsek...@gmail.com

Feb 25, 2014, 4:23:09 PM

to stratosp...@googlegroups.com

Hello Fabian,

On Tuesday, February 25, 2014 11:20:10 AM UTC+2, fhu...@gmail.com wrote:
> Hi Artem,
>
> thanks a lot for your interest in Stratosphere and participating in our GSoC projects!
>
> As you know, Hadoop is the big elephant out there in the Big Data jungle and widely adopted. Therefore, a Hadoop compatibility layer is a very! important feature for any large scale data processing system.
> Stratosphere builds on foundations of MapReduce but generalizes its concepts and provides a more efficient runtime.

Great!

> When you have a look at the Stratosphere WordCount example program, you will see, that the programming principles of Stratosphere and Hadoop MapReduce are quite similar, although Stratosphere is not compatible with the Hadoop interfaces.

Yes, I've looked into the examples (WordCount, k-means). I also ran the big test job you have locally and it seems to be OK.

> With the proposed project we want to achieve, that Hadoop MapReduce jobs can be executed on Stratosphere without changing a line of code (if possible).
>
> We have already some pieces for that in place. InputFormats are done (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility), OutputFormats are work in progress. The biggest missing piece is executing Hadoop Map and Reduce tasks in Stratosphere. Hadoop provides quite a few interfaces (e.g., overwriting partitioning function and sorting comparators, counters, distributed cache, ...). It would of course be desirable to support as many of these interfaces as possible, but they can by added step-by-step once the first Hadoop jobs are running on Stratosphere.

So if I understand correctly, the idea is to create logical wrappers for all interfaces used by Hadoop jobs (the way it has been done with the Hadoop data types) so that they can run as transparently as possible on Stratosphere, and in an efficient way. I agree, there are many interfaces, but it's very interesting considering the way Stratosphere defines tasks, which is a bit different (though, as you said, the principle is similar).

I assume the focus is on the YARN version of Hadoop (new api)?

And one last question: serialization in Stratosphere uses Java's default mechanism, right?

>
> Regarding your question about cloud deployment scripts, one of our team members is currently working on this (see this thread: https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> I am not sure, if this is still in the making or already done. If you are interested in this as well, just drop a line to the thread. Although, I am not very familiar with the detail of this, my gut feeling is that this would be a bit too less for an individual project. However, there might be ways to extend this. So if you have any ideas, share them with us and we will be happy to discuss them.

Thank you for pointing out the thread. I will let you know if I come up with anything for this, probably after I try deploying it on OpenStack.

>
> Again, thanks a lot for your interest and please don't hesitate to ask questions. :-)

Thank you for the helpful answers.

Kind regards,
Artem

tsek...@gmail.com

Feb 25, 2014, 4:25:53 PM

to stratosp...@googlegroups.com, tsek...@gmail.com

On Tuesday, February 25, 2014 3:54:04 PM UTC+2, Stephan Ewen wrote:
> Hi Artem!

Hello Stephan!

>
>
> Nice to see that you are interested in working with us!

Yeah, it's a very interesting project!

>
>
> Fabian has said most things already. If you want get Stratosphere up and running on the Cloud to play around with it, you can also have a look at this blog post explaining how to use Stratosphere with Amazon's elastic MapReduce: http://stratosphere.eu/blog/tutorial/2014/02/18/amazon-elastic-mapreduce-cloud-yarn.html

Thank you for pointing me to the resource. I'll definitely check it out.

Kind regards,
Artem

fhu...@gmail.com

Feb 26, 2014, 4:04:52 AM

to stratosp...@googlegroups.com

Hi Artem,

yes, implementing wrappers to run Hadoop code in Stratosphere is what it should boil down to.

In some cases it might also be necessary to go into the Stratosphere code and change it to enable a feature, but this can be decided once we hit an interface that cannot be expressed with the current state of Stratosphere.

You’re also right, that support for the newest Hadoop API is in the focus here.

Internally, Stratosphere uses different serialization techniques for different tasks. In performance uncritical situations we use Java serialization, but during data processing we use our own implementations in most situations. However, the runtime implementation is quite generic such that many things are possible there.

If you still want to apply for a Stratosphere GSoC project, I would suggest to start drafting a proposal in our wiki.

Google’s proposal HowTo: http://en.flossmanuals.net/GSoCStudentGuide/ch008_writing-a-proposal/

Best,

Fabian

--
Fabian Hueske
Phone:      +49 170 5549438
Email:      fhu...@gmail.com
Web:         http://www.user.tu-berlin.de/fabian.hueske

--
You received this message because you are subscribed to the Google Groups "stratosphere-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stratosphere-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/stratosphere-dev.
For more options, visit https://groups.google.com/groups/opt_out.

Robert Metzger

Feb 26, 2014, 4:36:14 AM

to stratosp...@googlegroups.com

Hi,

regarding the Hadoop API versions.

> You’re also right, that support for the newest Hadoop API is in the focus here.

Do you mean the "mapred" or the "mapreduce" API?

"mapred" was deprecated when "mapreduce" was introduced. But they later changed it back and de-deprecated the "mapred" API (read: http://stackoverflow.com/questions/7598422/is-it-better-to-use-the-mapred-or-the-mapreduce-package-to-create-a-hadoop-job)

The Hadoop 2.3.0 documentation states that the "mapred" API has a larger user-base. (read: http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html)

The HadoopDataSource is using the "mapred" API. (I think otherwise we would be unable to switch between Hadoop v1 and Hadoop v2 dependencies).

tl;dr: I suggest to use the "mapred" API.

Regards,

Robert

tsek...@gmail.com

Feb 26, 2014, 7:01:25 PM

to stratosp...@googlegroups.com

Hi,

Thank you for the helpful info. I will start drafting the proposal as soon as possible. I'll let you know when I have a first version.

All the best,
Artem

tsek...@gmail.com

Feb 26, 2014, 7:06:07 PM

to stratosp...@googlegroups.com

Hi Robert,

Thank you for the suggestion. Since HadoopDataSource is using mapred, it's definitely mapred, as you suggest. At least that's what I'm going to include in the project proposal, which I'll draft very, very soon :)

Kind regards,
Artem

tsek...@gmail.com

Feb 27, 2014, 11:34:40 AM

to stratosp...@googlegroups.com

Hello Fabian, I have submitted a first draft of my project proposal at https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis

I would greatly appreciate any comments and observations.

Thank you in advance,
Artem

fhu...@gmail.com

Feb 27, 2014, 12:43:13 PM

to stratosp...@googlegroups.com

Hi Artem,

I like your proposal a lot! Especially the “Who benefits” section is really nice.

One remark here is that I do not expect Hadoop jobs to run significantly faster on Stratosphere compared to Hadoop. The reason is that the jobs are written in a way that leaves the system little room to improve the execution. Also, many jobs rely on exactly the static way in which Hadoop processes data.

Hadoop is “only” doing one static pipeline of Map, local sort, Combine, shuffle, sort, Reduce, but it does this really well, so it is hard to improve on that. Where Stratosphere shines is when data processing tasks require more than one MR job. In that case, Hadoop reads and writes data multiple times from/to distributed storage (e.g., HDFS), while Stratosphere can stream the data without persisting it. However, if you want to execute Hadoop MR jobs on Stratosphere, you cannot easily do the streaming (at least when you consider a single job as a unit of execution).
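
A toy Python sketch of the difference described above, under simplifying assumptions: `CountingStore` is a hypothetical stand-in for distributed storage, and the two processing steps are plain generator functions rather than real Map or Reduce tasks. It only illustrates that chaining the steps avoids persisting the intermediate result.

```python
class CountingStore:
    """Stands in for distributed storage such as HDFS; counts writes."""

    def __init__(self):
        self.datasets = {}
        self.writes = 0

    def write(self, name, records):
        self.datasets[name] = list(records)
        self.writes += 1

    def read(self, name):
        return iter(self.datasets[name])


def tokenize(lines):
    for line in lines:
        for word in line.split():
            yield word


def keep_long_words(words, min_length=4):
    for word in words:
        if len(word) >= min_length:
            yield word


def run_as_separate_jobs(store, input_name, output_name):
    # Job 1 persists its result; job 2 reads it back from storage.
    store.write("tmp", tokenize(store.read(input_name)))
    store.write(output_name, keep_long_words(store.read("tmp")))


def run_as_one_pipeline(store, input_name, output_name):
    # Both steps are chained; only the final result is persisted.
    store.write(output_name, keep_long_words(tokenize(store.read(input_name))))
```

Both functions produce the same output dataset; the chained variant performs one write instead of two, and the gap grows with every additional step in the chain.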

So, the arguments of the benefits section are still valid and good. However, instead of expecting that existing Hadoop jobs will run faster on Stratosphere, I would highlight that the new processing platform (Stratosphere) will be beneficial for new jobs and that existing Hadoop jobs could be gradually ported to Stratosphere if desired.

Your suggested schedule looks good to me.

So aside from the minor comment on execution time, I am very happy with your proposal 😊

Best,

Fabian

Robert Metzger

Feb 27, 2014, 1:23:03 PM

to stratosp...@googlegroups.com

Hi,

I agree with Fabian. The proposal reads very well!

Fabian's remark regarding the performance is right.

BUT: If you design your Hadoop-compat in a similar way to the HadoopDataSource (so that there is also a HadoopMapOperator and a HadoopReduceOperator), it would be possible that Bob (1.) can use his existing MapReduce classes.

Usually, MapReduce tasks go beyond one single Map-Reduce step (e.g. having multiple chained map-red phases). If users spend a few minutes rewriting their Map-Reduce tasks into one "large" Stratosphere job, they will have a noticeable performance benefit.

I would do it the following way:

- Implement a HadoopMapOperator / HadoopReduceOperator and the necessary tooling around it

- Write a special "entry point" that is using HadoopMapOperator / HadoopReduceOperator behind the scenes and simulating the regular Hadoop mapred behavior.
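
The two steps above can be sketched roughly as follows. This is a Python illustration only, not the Stratosphere API: the class names `HadoopMapOperator` and `HadoopReduceOperator` are taken from the suggestion, but their shape here is an assumption, `OutputCollector` is only similar in spirit to the mapred collector, and the Reporter argument of the real mapred interfaces is left out for brevity.

```python
from itertools import groupby
from operator import itemgetter


class OutputCollector:
    """Collects (key, value) pairs emitted by user code."""

    def __init__(self):
        self.pairs = []

    def collect(self, key, value):
        self.pairs.append((key, value))


class HadoopMapOperator:
    """Hypothetical wrapper: turns a mapred-style mapper into a plain function."""

    def __init__(self, mapper):
        self.mapper = mapper

    def __call__(self, records):
        collector = OutputCollector()
        for key, value in records:
            self.mapper.map(key, value, collector)
        return collector.pairs


class HadoopReduceOperator:
    """Hypothetical wrapper: groups by key, then calls a mapred-style reducer."""

    def __init__(self, reducer):
        self.reducer = reducer

    def __call__(self, pairs):
        collector = OutputCollector()
        ordered = sorted(pairs, key=itemgetter(0))
        for key, group in groupby(ordered, key=itemgetter(0)):
            self.reducer.reduce(key, (value for _, value in group), collector)
        return collector.pairs


# "Existing" user classes, written against the mapred-style interface only.
class TokenizerMapper:
    def map(self, key, value, collector):
        for word in value.split():
            collector.collect(word, 1)


class SumReducer:
    def reduce(self, key, values, collector):
        collector.collect(key, sum(values))


def run_mapred_job(records, mapper, reducer):
    """The "entry point": hides the operators behind a job-like call."""
    return HadoopReduceOperator(reducer)(HadoopMapOperator(mapper)(records))
```

Because the operators are ordinary building blocks, a user could also wire several of them into one larger job instead of going through `run_mapred_job`, which is where the performance benefit mentioned above would come from.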

If this all works out, it will be an amazing contribution to our code!

Regards,

Robert

fhu...@gmail.com

Feb 27, 2014, 1:27:01 PM

to stratosp...@googlegroups.com

Good point, Robert!

If you reassemble your multiple Hadoop MR jobs into a single Stratosphere job, you will benefit from the pipelined execution. Although this is no longer “not modifying a single line of code”, it is a very reasonable approach 😉

Fabian

tsek...@gmail.com

Feb 27, 2014, 2:43:31 PM

to stratosp...@googlegroups.com

Hello Fabian,

> I like your proposal a lot! Especially the “Who benefits” section is really nice.

Thank you!

> One remark here is that I do not expect Hadoop jobs to run significantly faster on Stratosphere compared to Hadoop. The reason is, that the jobs are written in a way that there is not much the system can do to improve the execution. Also, many jobs rely on exactly the static way that Hadoop is processing data.
>
>
> Hadoop is “only” doing one static pipeline of Map, local sort, Combine, shuffle, sort, Reduce, but is doing this really good, so that it is hard to improve over that. Where Stratosphere shines is when data processing tasks require more than one MR job. In that case, Hadoop reads and writes data multiple times from/to distributed storage (e.g., HDFS) while Stratosphere can stream the data without persisting it. However, if you want to execute Hadoop MR jobs on Stratosphere you cannot easily do the streaming (at least when you consider a single job as a unit of execution).

Actually, I didn't really mean "Stratosphere + Hadoop faster than Stratosphere". I just thought that even with the overhead of the extra abstraction layer some jobs would be faster. But after your explanation I understand it a bit better.

> So, the arguments of the benefits section are still valid and good. However, instead of expecting that existing Hadoop jobs will run faster on Stratosphere, I would highlight, that the new processing platform (Stratosphere) will be beneficial for new jobs and that existing Hadoop jobs could be gradually ported to Stratosphere if desired.

So, Alice has changed her thoughts a bit now :) I modified the wiki.

>
>
> Your suggested schedule looks good to me.

Cool! I was wondering where I can find the ongoing work on OutputFormats, just to check. Is it on a public branch? If not, it's OK for now.

> So aside from the minor comment on execution time, I am very happy with your proposal 😊

I'm happy you're happy!

Best regards,
Artem

tsek...@gmail.com

Feb 27, 2014, 2:52:03 PM

to stratosp...@googlegroups.com

Hello Robert,

> I agree with Fabian. The proposal reads very well!
>

Ah, great :)

> Fabian's regarding the performance is right.
> BUT: If you design your Hadoop-compat in a similar way to the HadoopDataSource (so that there is also a HadoopMapOperator and a HadoopReduceOperator), it would be possible that Bob (1.) can use his existing MapReduce classes.

>
>
> Usually, MapReduce tasks go beyond one single Map-Reduce step (e.g. having multiple chained map-red phases). If users spend a few minutes rewriting their Map-Reduce tasks into one "large" Stratosphere job, they will have a noticeable performance benefit.
>
>
>
>
> I would do it the following way:
> - Implement a HadoopMapOperator / HadoopReduceOperator and the necessary tooling around it
> - Write a special "entry point" that is using HadoopMapOperator / HadoopReduceOperator behind the scenes and simulating the regular Hadoop mapred behavior.

I see. I still need to do some reading on the current HadoopDataSource. But what you suggest seems very reasonable.

>
>
>
>
> If this all works out, it will be an amazing contribution to our code!

Well, I hope it will. So, now I guess I will pick an issue and submit a pull request. Robert, is it OK if I pick one from your suggestions here: https://groups.google.com/forum/#!topic/stratosphere-dev/BcXBo3uqaOw

Thanks,
Artem

Robert Metzger

Feb 28, 2014, 4:41:48 AM

to stratosp...@googlegroups.com

Hi Artem,

> Well, I hope it will. So, now I guess I will pick an issue and submit a pull-request. Robert, if it is ok if I will pick one from your suggestions here: https://groups.google.com/forum/#!topic/stratosphere-dev/BcXBo3uqaOw

Sure. What you could also do is writing some tests for the HadoopDataSource. There is currently no test coverage for it, and since you want to have a deeper look into the HadoopDataSource anyways, you could combine this with the pull request. (Feel free to change the code of the HadoopDataSource if you see something).
Have a look at our *ITCase-tests. I would like to see a test for SequenceFile and one for Parquet or so.

> I see. I still need to do some reading on current HadoopDataSource. But what you suggest seems very reasonable.

I'm happy to answer open questions.

> Cool! I was wondering where can I find the ongoing work on Outputformats, just to check. Is it on a public branch? If not it's ok for now.

I have no idea ;) I asked on the respective issue in GitHub: https://github.com/stratosphere/stratosphere/issues/455. You can also ask there if you have follow-up questions on the code.

Regards,

Robert

tsek...@gmail.com

Feb 28, 2014, 7:29:01 AM

to stratosp...@googlegroups.com

Hi Robert,

On Friday, February 28, 2014 11:41:48 AM UTC+2, Robert Metzger wrote:
> Hi Artem,
>
>
>
>
>
> Well, I hope it will. So, now I guess I will pick an issue and submit a pull-request. Robert, if it is ok if I will pick one from your suggestions here: https://groups.google.com/forum/#!topic/stratosphere-dev/BcXBo3uqaOw
>
>
>
>
> Sure. What you could also do is writing some tests for the HadoopDataSource. There is currently no test coverage for it, and since you want to have a deeper look into the HadoopDataSource anyways, you could combine this with the pull request. (Feel free to change the code of the HadoopDataSource if you see something).
>
>
> Have a look at our *ITCase-tests. I would like to see a test for SequenceFile and one for Parquet or so.

Unit tests for HadoopDataSource sound cool! I'll deal with it this weekend.

> I'm happy to answer open questions.

Thanks!

>
>
>
>
>
> Cool! I was wondering where can I find the ongoing work on Outputformats, just to check. Is it on a public branch? If not it's ok for now.
>
>
>
>
> I have no idea ;) I asked on the respective issue in GitHub: https://github.com/stratosphere/stratosphere/issues/455. You can also ask there if you have follow-up questions on the code.

Ok, for anything else related to OutputFormats I'll ask there.

Thanks,
Artem

mingliang qi

Feb 28, 2014, 8:09:05 AM

to stratosp...@googlegroups.com, tsek...@gmail.com

Hi Artem,

I'm working on the Hadoop OutputFormat for Stratosphere now. HadoopDataSink now works in local mode for Hadoop 1.2.1. I still need to delete the temporary directory generated by Hadoop and to test cluster mode. The code is on my GitHub branch: https://github.com/qmlmoon/stratosphere/tree/hadoop-datasink/stratosphere-addons/hadoop-compatibility

Mingliang

Robert Metzger

Mar 5, 2014, 11:00:48 AM

to stratosp...@googlegroups.com

Hey,

just a little pointer to this discussion here: https://github.com/stratosphere/stratosphere/pull/531#issuecomment-36687798

Make sure that you're not doing the same work as Mingliang.

Regards,

Robert

tsek...@gmail.com

Mar 5, 2014, 12:16:06 PM

to stratosp...@googlegroups.com

Hello Robert,

Thank you for letting me know! I am preparing a test case with SequenceFileInputFormat and Parquet with HadoopDataSource, so I don't think we have a "conflict". Hopefully I will send the pull request very, very soon (definitely this week :) ).

And Mingliang, thanks for pointing me to the branch.

Thanks,
Artem

Robert Metzger

Mar 14, 2014, 5:34:25 AM

to stratosp...@googlegroups.com

Hi Artem,

what's your progress on the contribution? I would really like to see a good contribution from you, since you've made a strong proposal and we had a good discussion on your summer project.
I'm happy to help if anything is not clear. (That's why Google calls us Mentors ;) )

Regards,

Robert

tsek...@gmail.com

Mar 14, 2014, 8:25:02 AM

to stratosp...@googlegroups.com

Hello Robert,

well, I'm done with the test for SequenceFileInputFormat (example WordCount). Then I wrote a Parquet test, which led me to make some small additions to HadoopDataSource: mostly a ParquetConverter for some types and some other minor stuff like NullValue support, and a use of reflection for InputSplits that are nested in their respective InputFormat class (which is the case for Parquet). The main reason it took me a while is that Parquet considers the mapred API deprecated and only offers mapreduce. Luckily, there is a wrapper, but it took me a while to figure out its config and how it plays with Stratosphere :) Moreover, Parquet has counters "hard-coded" on its latest branch and this doesn't play well with our HadoopDummyReporter. I tried to fix it, but then I moved to an older Parquet branch and it works fine, so I'll push it like that :)

So, the pull request comes to the staging branch tonight. Talk to you then.

Thank you for the interest,
Artem

Robert Metzger

Mar 14, 2014, 11:11:39 AM

to stratosp...@googlegroups.com

Hi,

Cool! Sounds great.

I'm looking forward to your pull request. I'm interested in particular in the ParquetConverter, since I don't exactly understand why it is required.

But we'll see.

Mh. We already had the discussion regarding mapred and mapreduce here. In the worst-case, we have to look into a solution to support both APIs.

Can you estimate the required work for the counters? Stratosphere also has a counter-like feature and I would prefer to use the latest Parquet version. (You don't have to do that now.) The topic is also very relevant for my Stratosphere SQL interface, as I'm planning to support Parquet from early on.

Regards,

Robert

tsek...@gmail.com

Mar 16, 2014, 1:10:31 AM

to stratosp...@googlegroups.com

Hi Robert,

The pull request is (finally) here (https://github.com/stratosphere/stratosphere/pull/590) rebased on top of staging.

> Cool! Sounds great.
> I'm looking forward to your pull request. I'm interested in particular in the ParquetConverter, since I don't exactly understand why it is required.

So Parquet primitives (Binary, int64, int96) are not really Hadoop Writables, so a mapping needs to be present. Moreover, since you absolutely must give a schema, there are some types that should be mapped too (Group for the example schema, Tuple for the Hive schema etc.). Currently, the way Group is handled is not generic. I only support the case where a Group has a binary type, which is just the thing I needed for my test :)

I would justify myself by saying that the test's point was to check whether HadoopDataSource works with parquet and not support parquet. And technically, a programmer can define any schema-support he likes but some are de-facto standards (GroupSupport, TupleSupport for Hive etc.). Should they be supported? What do you think?

>
>
> But we'll see.
>
>
> Mh. We already had the discussion regarding mapred and mapreduce here. In the worst-case, we have to look into a solution to support both APIs.

The wrapper provided by parquet seems to work fine. But it would be great. Actually, do you think I should add it to my proposal or stick to mapred? I believe it should be added, because when you have the logic for a wrapping of a mapred interface it is not very different to repeat for a mapreduce. Do you think such a thing would be realistic or of minor importance?

> Can you estimate the required work for the counters? Stratosphere also has a counter-like feature and I would prefer to use the latest Parquet version. (You don't have to do that now.) The topic is also very relevant for my Stratosphere SQL interface, as I'm planning to support Parquet from early on.

Parquet expects to get counters from the reporter for the input format. In our case, for HadoopInputFormatWrapper we have a DummyReporter which gives it a null. So the idea is to get rid of the dummy and actually return something meaningful.
After reading about the accumulators, if I understand correctly, what we need here is to wrap a simple Stratosphere accumulator as a benchmarking counter (which is the one hard-coded in Parquet and blocking us).

I believe I can have it in a week or so for this particular case, now that I understand better how Stratosphere meets Hadoop :)

Kind Regards,
Artem

Robert Metzger

Mar 16, 2014, 6:16:27 AM

to stratosp...@googlegroups.com

Hi,

thank you very much for the great work!

I think it was a good idea to give you this task as a preparation for your GSoC, because you probably now have a much better understanding of the problems you'll encounter on the way to implementing a full compatibility layer.

> I would justify myself by saying that the test's point was to check whether HadoopDataSource works with parquet and not support parquet.

Yes, that's totally fine. Nobody requested actual Parquet support. It's good to know that our HadoopDataSource is flexible enough to support it as well. If we have an actual user request, we can implement it properly.

> The wrapper provided by parquet seems to work fine. But it would be great. Actually, do you think I should add it to my proposal or stick to mapred? I believe it should be added, because when you have the logic for a wrapping of a mapred interface it is not very different to repeat for a mapreduce. Do you think such a thing would be realistic or of minor importance?

It would be great to have support for it. But add it as an "optional" / "nice to have" feature. Once we have the Hadoop MapReduce examples running, we can evaluate how much work it would be to add "mapreduce" support.

> After, reading about the accumulators, If I understand correctly what we need here is a wrapping of a simple Stratosphere accumulator to a benchmarking counter (which is the one hardcoded in parquet and blocking us).

Let's evaluate, as an optional goal for your GSoC participation, whether it is possible to somehow create a generic wrapper for Hadoop counters with Stratosphere's accumulators. Otherwise, we should make a list of the most common Hadoop counters and see if it is worth implementing them.
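
A rough Python sketch of what such a generic wrapper could look like. Everything here is hypothetical and simplified: `LongAccumulator` stands in for a simple summing accumulator, `DummyReporter` mirrors the reporter that currently returns null, and `AccumulatorReporter` is the assumed replacement. It is not the actual Hadoop or Stratosphere interface.

```python
class LongAccumulator:
    """Hypothetical stand-in for a simple accumulator that sums integers."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


class AccumulatorCounter:
    """Counter-like view on top of an accumulator."""

    def __init__(self, accumulator):
        self._accumulator = accumulator

    def increment(self, amount=1):
        self._accumulator.add(amount)

    def get_value(self):
        return self._accumulator.total


class DummyReporter:
    """Mirrors the current problem: no counter is ever returned."""

    def get_counter(self, group, name):
        return None


class AccumulatorReporter:
    """Hands out counters, creating one accumulator per (group, name)."""

    def __init__(self):
        self.accumulators = {}

    def get_counter(self, group, name):
        accumulator = self.accumulators.setdefault((group, name), LongAccumulator())
        return AccumulatorCounter(accumulator)


def read_records(records, reporter):
    """Input-format-like code that insists on a working counter."""
    counter = reporter.get_counter("benchmark", "records_read")
    for record in records:
        counter.increment()
        yield record
```

With `AccumulatorReporter`, counters defined "on the fly" by user code get an accumulator on first use, so no list of known counters is needed up front; with `DummyReporter`, the same reading code fails as soon as it touches the counter.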

IMPORTANT: Remember to submit your proposal via Google's GSoC tool (Google Melange)! The deadline is in 5 days or so.

Regards,

Robert

tsek...@gmail.com

Mar 16, 2014, 1:48:41 PM

to stratosp...@googlegroups.com

Hi!

On Sunday, March 16, 2014 11:16:27 AM UTC+1, Robert Metzger wrote:
> Hi,
>
>
> thank you very much for the great work!
> I think it was a good idea to give you this task as a preparation for your GSoC because you've probably a much better understanding of the problems you'll encounter on the way to implement a full compatibility layer.

Yes, it was a great chance, indeed. I have a better understanding now.

>
>
> Yes, that's totally fine. Nobody requested actual Parquet support. Its good to know that our HadoopDataSource is flexible enough to support it as well. If we have an actual user request, we can implement it properly.
>

Cool!

> It would be great to have support for it. But add it as a "optional" / "nice to have" feature. Once we got the Hadoop MapReduce examples running, we can evaluate how much work it would be to add "mapreduce" support.
>
>
> After, reading about the accumulators, If I understand correctly what we need here is a wrapping of a simple Stratosphere accumulator to a benchmarking counter (which is the one hardcoded in parquet and blocking us).
>
>
>
> Lets evaluate as an optional goal for your GSoC participation if it is possible to somehow create a generic wrapper for Hadoop counters with Stratosphere's accumulators. Otherwise, we should make a list of the most common Hadoop counters and see if it is worth implementing them.
>

Ok, then. I will add the "mapreduce" support as an optional goal for my milestones. Initially, the Counters would be in my list of mapred interfaces as it is important to run a Hadoop job transparently on Stratosphere (one may define them "on the fly" in Hadoop). Actually, I will put this list of hadoop interfaces so that we can evaluate realistically.

>
>
>
> IMPORTANT: Remember to submit your proposal at Google's GSoC tool (Google Melange!) The deadline is over in 5 days or so.

Yes yes. Will do that :)

All the best,
Artem

tsek...@gmail.com

Apr 22, 2014, 3:39:37 AM

to stratosp...@googlegroups.com, tsek...@gmail.com

Hello guys,

I would also like to say that I am thrilled to have been accepted into this year's GSoC. Thanks a lot! This was my only proposal, you see :) I hope the project benefits from my planned contribution.

Cheers,
Artem

Fabian Hueske

Apr 22, 2014, 4:20:55 AM

to stratosp...@googlegroups.com

Hi Artem,

We are really looking forward to working with you.

Let's try to get together (Skype, Hangout) some time in the next week to discuss how we approach the project and how we want to organize things.

Any suggestions for a time?

Best,

Fabian

Artem Tsikiridis

Apr 22, 2014, 7:19:14 AM

to stratosp...@googlegroups.com

Hi Fabian,

A Skype call would be great! I have sent you an email.

Thanks,

Artem

tsek...@gmail.com

May 9, 2014, 3:43:31 PM

to stratosp...@googlegroups.com, tsek...@gmail.com

Hello everybody,

I have posted a wiki document with some points of my approach for the project. It is here: https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29

Fabian (who is officially my mentor) and I have clarified many points, and there is also some preliminary work I tried that can be useful for hadoop-compatibility (I'll submit a pull request).

And of course the coding phase starts very soon, in 10 days :)

Regards,
Artem

Nirvanesque

Aug 26, 2014, 5:12:04 AM

to stratosp...@googlegroups.com, tsek...@gmail.com

Hello Artem and mentors,

First of all, greetings from INRIA, France.
I hope you had an enjoyable experience in GSoC!
Thanks to Robert (rmetzger) for pointing me here.

At INRIA, we are starting to adopt Stratosphere / Flink.
Our top-level goal is to improve the performance of User Defined Functions (UDFs) in long workflows with multiple MapReduce (M-R) steps, by using the larger set of Second Order Functions (SOFs) in Stratosphere / Flink.
We will demonstrate this improvement by implementing some business Use Cases.
To this end, we have chosen some customer analysis Use Cases based on weblogs and related data, for 2 companies (both appeared interested in trying Stratosphere / Flink):
- a mobile phone app developer: http://www.tribeflame.com
- an anti-virus & Internet security software company: www.f-secure.com
I will be happy to share with you these Use Cases, if you are interested. Just ask me here.

At present, we fit the Alice-Bob-Sam profiles described in your GSoC proposal. :-)
Hadoop seems to be the starting square for the Stratosphere / Flink journey.
The same is true of the developers at the 2 companies above :-)

Briefly,
We have installed and run some example programs from Flink / Stratosphere (versions 0.5.2 and 0.6). We use a cluster (Grid5000) for our Hadoop and Stratosphere installations.
We have a good understanding of Hadoop and of using Hadoop Streaming and Pipes with scripting languages (specifically Python and R).
In the first phase, we would like to run some "Hadoop-like" jobs (mainly multiple M-R workflows) on Stratosphere, preferably with extensive Java or Scala programming.
I refer to your GSoC project map which seems very interesting.
If we could have a Hadoop abstraction as you have mentioned, that would be ideal for our first phase.
In later phases, when we implement complex join and group operations, we would dive deeper into the Stratosphere / Flink Java or Scala APIs.

Hence, I would like to know: what is the current status in this direction?
What has been implemented already? From which version onwards? How can we try it?
What is yet to be implemented, and for which versions is it planned?

You may also like to see my discussion with Robert on this page.
I am still digging through the different discussions, here as well as on JIRA.
Please do refer me to the relevant links, JIRA tickets, etc. if that saves you from re-typing long replies.
It will also help us to understand the collective thinking behind the Stratosphere / Flink roadmap.

Thanks in advance,
Anirvan
PS: Apologies for mixing the old and new names (Stratosphere / Flink); I am not sure which name to use at present.

Fabian Hueske

Aug 26, 2014, 5:42:36 AM

to stratosp...@googlegroups.com

Hi Anirvan,

Thanks for getting in touch!
We made very good progress over GSoC, but we are not fully done yet.

Can you drop a mail on the dev-flink mailing list once you have signed up?

Before I respond there and continue the discussion, I want to make sure that you're subscribed :-)

Cheers, Fabian
