Dear Stratosphere devs and fellow prospective GSoC students,
Hello!
I'm Artem, an undergraduate student from Athens, Greece. You can find me on GitHub (https://github.com/atsikiridis) and occasionally on Stack Overflow (http://stackoverflow.com/users/2568511/artem-tsikiridis). Currently, however, I'm in Switzerland, where I am doing my internship at CERN as a back-end software developer for INSPIRE, a library for High Energy Physics (we're running on http://inspirehep.net/). The service is written in Python (based on the open-source project http://invenio.net) and my responsibilities are mostly the integration with Redis, database abstractions, testing (unit, regression) and helping our team integrate modern technologies and frameworks into the current code base.
Moreover, I am very interested in big data technologies, so before coming to CERN I was taking my first steps in research at the Big Data lab of AUEB, my home university. The main objective of the project I was involved with was the implementation of a dynamic caching mechanism for Hadoop (in a way, trying our own cache instead of the built-in distributed cache). Other technologies involved were Redis, Memcached and Ehcache (Terracotta). With this project we gained some insight into the internals of Hadoop (new API, old API, how tasks work, Hadoop serialization, the running daemons, etc.) and HDFS, and deployed clusters on cloud computing platforms (OpenStack with Nova, Amazon EC2 with boto). We also used the Java Remote API for some tests.
Unfortunately, I have not used Stratosphere before in a research/production environment. I have only played with the examples on my local machine. It is very interesting and I would love to learn more.
There will probably be a learning curve for me on the Stratosphere side but implementing a Hadoop Compatibility Layer seems like a very interesting project and I believe I can be of use :)
Finally, I was wondering whether there are any command-line tools for deploying Stratosphere automatically on EC2 or OpenStack clouds (for example, Stratosphere-specific abstractions on top of the Python boto API). Do you think that would make sense as a project?
Pardon me for the length of this.
Kind regards,
Artem
You’re also right that support for the newest Hadoop API is the focus here.
I would greatly appreciate any comments and observations.
Thank you in advance,
Artem
Well, I hope it will. So, now I guess I will pick an issue and submit a pull request. Robert, is it OK if I pick one from your suggestions here: https://groups.google.com/forum/#!topic/stratosphere-dev/BcXBo3uqaOw
I see. I still need to do some reading on the current HadoopDataSource. But what you suggest seems very reasonable.
Cool! I was wondering where I can find the ongoing work on OutputFormats, just to check. Is it on a public branch? If not, it's OK for now.
Artem
Thanks,
Artem
The pull request is (finally) here (https://github.com/stratosphere/stratosphere/pull/590) rebased on top of staging.
> Cool! Sounds great.
> I'm looking forward to your pull request. I'm interested in particular in the ParquetConverter since I don't exactly understand why it is required.
So Parquet primitives (Binary, int64, int96) are not really Hadoop Writables, so a mapping has to be present. Moreover, since you absolutely must provide a schema, there are some schema types that should be mapped too (Group for the example schema, Tuple for the Hive schema, etc.). Currently, the way Group is handled is not generic. I only support the case where a Group has a binary type, which is just the thing I needed for my test :)
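To illustrate the kind of mapping meant above, here is a minimal sketch. It is not the ParquetConverter from the pull request: the class `ParquetTypeMapping`, the enum `ParquetPrimitive` and the chosen target types are assumptions for illustration only (in particular, treating INT96 and BINARY as raw bytes is just one possible choice). Writable class names are kept as plain strings so the sketch compiles without Hadoop on the classpath.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of Stratosphere or Parquet.
class ParquetTypeMapping {

    // Parquet primitive (physical) types, including the ones named in the thread.
    enum ParquetPrimitive { BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY }

    // Assumed target Writable for each primitive, stored by class name.
    private static final Map<ParquetPrimitive, String> WRITABLES =
            new EnumMap<>(ParquetPrimitive.class);

    static {
        WRITABLES.put(ParquetPrimitive.BOOLEAN, "org.apache.hadoop.io.BooleanWritable");
        WRITABLES.put(ParquetPrimitive.INT32, "org.apache.hadoop.io.IntWritable");
        WRITABLES.put(ParquetPrimitive.INT64, "org.apache.hadoop.io.LongWritable");
        // Assumption: INT96 has no direct Writable, so it is carried as raw bytes.
        WRITABLES.put(ParquetPrimitive.INT96, "org.apache.hadoop.io.BytesWritable");
        WRITABLES.put(ParquetPrimitive.FLOAT, "org.apache.hadoop.io.FloatWritable");
        WRITABLES.put(ParquetPrimitive.DOUBLE, "org.apache.hadoop.io.DoubleWritable");
        // Assumption: BINARY is carried as raw bytes rather than decoded as text.
        WRITABLES.put(ParquetPrimitive.BINARY, "org.apache.hadoop.io.BytesWritable");
    }

    // Returns the class name of the Writable assumed for the given primitive.
    static String writableFor(ParquetPrimitive type) {
        return WRITABLES.get(type);
    }

    public static void main(String[] args) {
        for (ParquetPrimitive p : ParquetPrimitive.values()) {
            System.out.println(p + " -> " + writableFor(p));
        }
    }
}
```

Schema-level types such as Group or Tuple would need their own handling on top of a table like this, which is the non-generic part mentioned above.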
I would justify myself by saying that the point of the test was to check whether HadoopDataSource works with Parquet, not to support Parquet fully. And technically, a programmer can define any schema support he likes, but some are de facto standards (GroupSupport, TupleSupport for Hive, etc.). Should they be supported? What do you think?
>
>
> But we'll see.
>
>
> Mh. We already had the discussion regarding mapred and mapreduce here. In the worst-case, we have to look into a solution to support both APIs.
The wrapper provided by Parquet seems to work fine. But supporting both APIs would be great. Actually, do you think I should add it to my proposal or stick to mapred? I believe it should be added, because once you have the logic for wrapping a mapred interface, repeating it for mapreduce is not very different. Do you think such a thing would be realistic, or is it of minor importance?
> Can you estimate the required work for the counters? Stratosphere also has a counter-like feature and I would prefer to use the latest Parquet version. (You don't have to do that now.) The topic is also very relevant for my Stratosphere SQL interface as I'm planning to support Parquet from early on.
Parquet expects to get counters from the reporter that is passed to the input format. In our case, for HadoopInputFormatWrapper, we have a DummyReporter which returns null for them. So the idea is to get rid of the dummy and actually return something meaningful.
After reading about the accumulators, if I understand correctly, what we need here is to wrap a simple Stratosphere accumulator as a benchmarking counter (which is the one hardcoded in Parquet and blocking us).
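A minimal sketch of that wrapping idea, under loudly stated assumptions: `LongAccumulator`, `SimpleCounter`, `AccumulatorCounter` and `AccumulatorReporter` are hypothetical stand-ins written for this example and are not the Stratosphere accumulator API, the Hadoop `Reporter` interface or Parquet's benchmark counter. The only thing borrowed is the shape of a `getCounter(group, name)` lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an accumulator that sums long values.
class LongAccumulator {
    private long value;

    void add(long delta) { value += delta; }

    long getLocalValue() { return value; }
}

// Hypothetical stand-in for the counter type an input format asks its reporter for.
interface SimpleCounter {
    void increment(long amount);

    long getValue();
}

// Adapter: exposes an accumulator through the counter interface.
class AccumulatorCounter implements SimpleCounter {
    private final LongAccumulator accumulator;

    AccumulatorCounter(LongAccumulator accumulator) { this.accumulator = accumulator; }

    @Override
    public void increment(long amount) { accumulator.add(amount); }

    @Override
    public long getValue() { return accumulator.getLocalValue(); }
}

// Reporter that hands out accumulator-backed counters instead of null.
class AccumulatorReporter {
    private final Map<String, SimpleCounter> counters = new HashMap<>();

    // The same (group, name) pair always yields the same counter instance.
    SimpleCounter getCounter(String group, String name) {
        return counters.computeIfAbsent(
                group + ":" + name,
                key -> new AccumulatorCounter(new LongAccumulator()));
    }
}
```

With something of this shape, code that calls `getCounter("parquet", "bytesread").increment(n)` would update an accumulator instead of hitting a null; the group and counter names here are made up for the example.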
I believe I can have it ready in a week or so for this particular case, now that I understand better how Stratosphere meets Hadoop :)
Kind Regards,
Artem
--
You received this message because you are subscribed to the Google Groups "stratosphere-dev" group.
I would also like to say that I am thrilled to have been accepted into this year's GSoC. Thanks a lot! This was my only proposal, you see :) I hope the project benefits from my planned contribution.
Cheers,
Artem
I have posted a wiki document with some points of my approach for the project. It is here: https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29
Fabian (who is officially my mentor) and I have clarified many points, and there is also some preliminary work I tried that can be useful for hadoop-compatibility (I'll submit a pull request).
And of course the coding phase starts very soon, in 10 days :)
Regards,
Artem