GSOC 2014 in Stratosphere


Shashika Ranga Muramudalige

Feb 25, 2014, 3:13:11 PM
to stratosp...@googlegroups.com
Hi,

I'm Shashika Ranga, a 3rd-year undergraduate in the Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka. I've gone through your GSoC 2014 idea list and found some really interesting projects. These days I'm working as a trainee at hSenid Mobile Solutions (Pvt) Ltd, one of the leading software companies in Sri Lanka. I'm involved in their internal research project, which is implemented in Scala and Java. I have more than 3 years of experience with Java and am now confident with Scala, which has both OOP and functional features.
 
I've already cloned your source and built it with Maven. (It takes more than 5 minutes, as mentioned :D) I'm interested in the Spatial Data Processing Library project. Now I need to get familiar with the code. Please let me know how to proceed further.

Best Regards,
Shashika Ranga

fhu...@gmail.com

Feb 25, 2014, 3:52:29 PM
to stratosp...@googlegroups.com
Hi Shashika,

thanks for your interest in our project!

The Spatial Data Processing Library is a nice project to get familiar with Big Data processing and to learn Stratosphere.
The basic idea of the project is to provide a collection of Stratosphere programs for standard spatial data processing tasks such as spatial joins or spatial partitioning.

I would suggest having a look at our examples, playing around with the system by writing some small jobs, and learning how Stratosphere jobs are implemented and executed.
That way, you should get an impression of how it feels to work with the system.
Our documentation shows how to do a local setup, write a Stratosphere program, and also discusses some of our example programs.
--> http://stratosphere.eu/quickstart/

Let me know if you have any questions.

Best,
Fabian

Shashika Ranga Muramudalige

Feb 25, 2014, 6:02:43 PM
to stratosp...@googlegroups.com
Hi Fabian,

I'm trying out some examples. There is a link to download some test data:
http://www.gutenberg.org/cache/epub/1787/pg1787.txt
But it gives an error.
Could you check it when you are free?


Thank You !


--
You received this message because you are subscribed to a topic in the Google Groups "stratosphere-dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stratosphere-dev/Inf9LuDcZjk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stratosphere-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/stratosphere-dev.
For more options, visit https://groups.google.com/groups/opt_out.



--
Shashika Ranga Muramudalige
Trainee
hSenid Mobile Solutions(Pvt) Ltd
Undergraduate,
Department of Computer Science & Engineering
University of Moratuwa

Fabian Hueske

Feb 25, 2014, 6:27:14 PM
to Shashika Ranga Muramudalige, stratosp...@googlegroups.com

Hi,

It seems like the file was removed from Gutenberg's cache. I downloaded it and it should be available under the given URL again.

The file is just Shakespeare's Hamlet in a plain text file. You can also use any other text file for the example.

Best, Fabian


Stephan Ewen

Feb 25, 2014, 11:18:02 PM
to stratosp...@googlegroups.com
Hey!

Can you replace the cache link with a permanent link? Or is that not possible with Gutenberg?

Stephan

Robert Metzger

Feb 26, 2014, 8:49:02 AM
to stratosp...@googlegroups.com
Hi,

I used this text as a word-count input for my EC2 blog post. It's probably a bit more reliable than Gutenberg.

Shashika Ranga Muramudalige

Feb 27, 2014, 4:05:05 PM
to stratosp...@googlegroups.com
Hi everyone,

I've tried some of the examples in the /examples directory. The WordCount example gave me a great impression of the platform. It is simply awesome how the computation is visualized along with the job time at http://localhost:8081/.

I tried all the examples with the text file at http://www.gutenberg.org/cache/epub/1787/pg1787.txt

Some examples failed with: Exception in thread "main" eu.stratosphere.api.common.InvalidJobException: File path of FileDataSink is empty string.

I think I need to understand those examples better. Is there any documentation about them?
Now I'm curious about this magic.

Now I'm going through http://bigdataclass.org. I hope it will give me a solid understanding of Stratosphere and big data analysis. I also need to learn about processing algorithms, spatial partitioning, and indexing techniques. I'd be glad if you could tell me what my next step should be.


Thank You !
Stratosphere.png

Stephan Ewen

Feb 27, 2014, 4:50:13 PM
to stratosp...@googlegroups.com
Hi!

The examples all need sample data, and you pass the path where the sample data is via a parameter (String). If you do not pass a parameter, you get the exception shown there.

For some examples, the code has data generators (KMeans, WebLog), for some the comments point you to places where you can get sample data or generators (e.g. WordCount, TPCHQuery3).

For most examples, we do not yet have website documentation describing where to get sample data, but if you have a look at the source code of the examples, or at the packages where they are included, you will find pointers and some test data generators.

I hope that helps.

Stephan



Robert Metzger

Feb 27, 2014, 5:03:32 PM
to stratosp...@googlegroups.com
Hi,

If you want, you can contribute documentation for the examples by opening a pull request for website changes. The website's contents are hosted in the "gh-pages" branch of the repository. Use the "preview.sh" script inside this branch to review your changes locally, then open a pull request to update the website.

To pass the arguments Stephan mentioned, you need to use the "-a" argument for the "bin/stratosphere" script.

@Fabian/@Stephan: Can you point Shashika to some material (papers, websites etc.) on spatial data processing?

Regards,
Robert

fhu...@gmail.com

Feb 27, 2014, 6:18:12 PM
to stratosp...@googlegroups.com
Hi Shashika,

our original plan was to use the JTS library (http://tsusiatsoftware.net/jts/main.html) for local processing (geo distance / intersection computations, data structures, etc.), but I just found out that JTS is released under the LGPL license, which is incompatible with the Apache license that Stratosphere is released under.

So, this is a problem. We can check whether there is an Apache-licensed (or otherwise compatibly licensed) Java library that can do this job. If we do not find one, the task becomes a lot more challenging, and I have to admit that I am not very familiar with spatial algorithms. As I said, I thought it would boil down to partitioning and shuffling the data in the right way and using the library on the local partitions.

I will check whether I can find something, but you should also look for alternatives.

Best,
Fabian

--
Fabian Hueske
Phone:      +49 170 5549438
Email:      fhu...@gmail.com
Web:         http://www.user.tu-berlin.de/fabian.hueske

Robert Metzger

Feb 27, 2014, 6:28:53 PM
to stratosp...@googlegroups.com
Hey,

what about this one? https://github.com/Esri/geometry-api-java Is it suitable? 

Robert



fhu...@gmail.com

Feb 27, 2014, 6:47:38 PM
to stratosp...@googlegroups.com
Looks very promising at first glance, esp. as it is also used for Hive.

Good catch 😊

Stephan Ewen

Feb 28, 2014, 10:40:28 AM
to stratosp...@googlegroups.com

Hey!

I think we could go with JTS anyway. The license is only a problem if we make it part of the core (Apache-licensed) module. We could have a separate module for geospatial processing, with different licensing conditions. LGPL is not toxic to use, but it is not possible to redistribute LGPL code under the Apache License.

As far as I have heard, JTS is more or less the industry standard for geospatial processing in Java, so it might be a good idea to use it anyway.

Greetings,
Stephan




Shashika Ranga Muramudalige

Mar 1, 2014, 4:38:54 PM
to stratosp...@googlegroups.com
Hi,

I tried to write some documentation for the word count example. I've attached it. There might be various mistakes and faults. Feel free to modify and correct it as you see fit.

So far, I've implemented task 1 and task 2 of the exercises at http://bigdataclass.org. You can see them on my GitHub. Now I'm working on the further tasks.

My idea for the geospatial processing is that we first evaluate both https://github.com/Esri/geometry-api-java and the JTS library and identify the challenges we may face in the future. I'll work on that too. Could anyone please give me a brief but precise idea of what the spatial data processing library is going to do in Stratosphere, so that I can study those things well?


stratosphere-wordcount-documentation.odt

fhu...@gmail.com

Mar 2, 2014, 6:57:28 PM
to stratosp...@googlegroups.com
Hi,

the spatial data processing library should consist of a few components:
  1. Data types that represent spatial objects, e.g., Points, Paths, Polygons, …
    Data types might be based on the data types of a spatial data processing framework (JTS or ESRI). Efficient serialization and deserialization are important for data types.
  2. Import methods to read spatial data and convert it to spatial data types.
    Import methods for reading, for example, the OpenStreetMap data set or GPS log traces would be good. There might be other input formats that are worth supporting. Depending on the progress, this component could be extended.
  3. Operators and algorithms to process spatial data.
    The most important operators IMHO are spatial joins. These joins can be of different types, e.g., intersection or containment of spatial objects, for example points (positions of tweets) in areas (ZIP code or city boundaries) or intersection of polygons (GPS traces) with areas. Later these methods can be extended to support fuzzy joins, i.e., to join not only on exact matches but also within a specified range (points that are within 100 meters of streets).
    These kinds of operations should be supported by libraries such as JTS, but not for parallel and distributed systems. To support these operations, the data needs to be appropriately partitioned such that the joins can be done on the local partitions. Often, the partitioning cannot be done without objects that span multiple partitions. Depending on the use case, this might require duplicate elimination.
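
To make the containment join in component 3 concrete, here is a minimal single-machine sketch in plain Python. It is not Stratosphere, JTS, or ESRI code: `contains` and `containment_join` are hypothetical names, polygons are assumed to be simple rings given as vertex lists, and the point-in-polygon test uses the standard ray-casting rule as a stand-in for whatever a library such as JTS would provide.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order; the ring is closed implicitly


def contains(polygon: Polygon, point: Point) -> bool:
    """Ray casting: count how often a ray going right from the point crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containment_join(points: Dict[str, Point],
                     areas: Dict[str, Polygon]) -> List[Tuple[str, str]]:
    """Naive nested-loop join: test every point against every area."""
    return [(point_id, area_id)
            for point_id, p in points.items()
            for area_id, poly in areas.items()
            if contains(poly, p)]
```

The nested loop is exactly the part that does not scale; partitioning the data so that each node only compares points and areas from the same region is what the distributed version would add.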

Hope this helps as a first starting point.
Let me know, if you have further questions.

Best,
Fabian


Shashika Ranga Muramudalige

Mar 5, 2014, 10:28:27 AM
to stratosp...@googlegroups.com
Hi Fabian and all,

Thank you for the above explanation. I've done some research on the task. I've tried some simple examples in both the JTS and ESRI frameworks. Though JTS is under the LGPL licence, it is a standard spatial data processing framework with more features. I've found that some functionality is still not implemented in ESRI (e.g., calculateLength2D, calculateArea2D). I think JTS is better for our work, and it also has complete documentation.

When reading spatial data, I need to know what the input data will look like and what it contains (e.g., id, latitude, longitude). I've checked:
  • OpenStreetMap data uses the OSM format
  • GPS trace data uses the GPX format
Should I have methods to read OSM and GPX files, or txt files?

As you said, the crucial part is how to partition spatial data for parallel processing with load balancing. For that, I need some mathematical concepts too. I'm now working on those algorithms and will present them in my proposal. First I'll focus on the intersection of paths and polygons. Please let me know if I'm wrong.

Thank You !

fhu...@gmail.com

Mar 5, 2014, 10:43:54 AM
to stratosp...@googlegroups.com
Hi Shashika,

the internal representation of spatial objects should be independent of the input format (OSM, GPX, …).
Supporting many input formats is always a plus for a library. So I would say the more the better; however, we should focus on the most relevant ones. Since OSM is the largest(?) available map data set, I think supporting it would be very good. Plain text would also be good. It is easy to do and a good point to start from.

I think joining/intersecting points with polygons (point is contained in area) is probably the easiest thing to start with. Since points have no length / area, they are only contained in a single partition (assuming non-overlapping partitioning), which makes joining much easier (no post-pass to clean up duplicates).
From there, going to intersections of paths and polygons would be the next step.
Windowed joins (e.g., everything that is within 100 m of a point, path, etc.) would be the last thing to cover.

Just as a side note, spatial indexes (R-Trees, Quad-Trees, …) are often created on local partitions for efficient look-ups of intersecting objects.
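
As a rough illustration of such a local index, the following sketch uses a uniform grid over bounding boxes. This is a deliberately simpler structure than the R-Trees or Quad-Trees mentioned above, and `GridIndex`, `cell_size`, and `candidates` are hypothetical names chosen for this example. The index only prunes candidates; an exact geometry test would still run on whatever it returns.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


class GridIndex:
    """Uniform grid over bounding boxes; a simple stand-in for an R-tree or quadtree."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
        self.boxes: Dict[str, BBox] = {}

    def _cell(self, x: float, y: float) -> Tuple[int, int]:
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, obj_id: str, box: BBox) -> None:
        """Register the object in every grid cell its bounding box touches."""
        self.boxes[obj_id] = box
        cx0, cy0 = self._cell(box[0], box[1])
        cx1, cy1 = self._cell(box[2], box[3])
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                self.cells[(cx, cy)].add(obj_id)

    def candidates(self, x: float, y: float) -> List[str]:
        """Ids of objects whose bounding box contains (x, y), in sorted order."""
        found = []
        for obj_id in self.cells.get(self._cell(x, y), ()):
            min_x, min_y, max_x, max_y = self.boxes[obj_id]
            if min_x <= x <= max_x and min_y <= y <= max_y:
                found.append(obj_id)
        return sorted(found)
```

A lookup touches only the objects registered in one cell instead of all objects on the partition, which is the effect a tree-based index achieves with better behaviour on skewed data.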

Let me know if you have further questions.

Shashika Ranga Muramudalige

Mar 8, 2014, 4:57:08 AM
to stratosp...@googlegroups.com
Hi Fabian,

What is the best way to parallelize this process? I think it's better to split the file into chunks and process each chunk separately. What kinds of problems can occur in this process?


Is it OK to use the JTS library for the spatial data processing?


fhu...@gmail.com

Mar 8, 2014, 8:28:22 AM
to stratosp...@googlegroups.com
Hi Shashika,

reading data / files in parallel is a common problem, and the solution depends on the actual file format. For a CSV file, it is quite simple. Records are individual text lines which are separated by line breaks. For binary representations, it is usually more difficult if there is no special record delimiter.
So, this really depends on the file format that you want to read. In some cases, files need to be prepared in a sequential preprocessing step. This step could split a file into chunks as you said. Problems can occur due to the size of the file and / or the computational overhead of this step, which cannot be parallelized.
We should check the file formats that we want to support and figure out whether we need a sequential preprocessing step or not.
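
For the simple line-delimited case, a minimal sketch of parallel-friendly reading could look as follows. It is plain Python, not Stratosphere's input format API, and `split_ranges` and `read_split` are hypothetical names. The assumption is that each worker gets a byte range and owns exactly those lines whose first byte falls inside it, so a line that crosses a range boundary is read once, by the worker in whose range it starts.

```python
from typing import List, Tuple


def split_ranges(size: int, num_splits: int) -> List[Tuple[int, int]]:
    """Cut [0, size) into contiguous byte ranges of nearly equal length."""
    step = max(1, -(-size // num_splits))  # ceiling division, at least 1
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def read_split(path: str, start: int, end: int) -> List[str]:
    """Return the lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next line break. If `start`
            # is already a line start, this consumes only the preceding "\n".
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.decode("utf-8").rstrip("\n"))
    return lines
```

This works without any preprocessing because the record delimiter can be found from an arbitrary offset. A format without such a delimiter is the case where a sequential preparation step becomes necessary.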

I am fine with the JTS library, but as Stephan pointed out, the spatial library cannot be part of the Stratosphere system / distribution since Apache and LGPL are incompatible licenses. We can offer the spatial library under a different license in a separate repository though.

Best,
Fabian


Shashika Ranga Muramudalige

Mar 8, 2014, 6:03:34 PM
to stratosp...@googlegroups.com
Hi Fabian,

Our library should be capable of reading OSM data because it is a standard format that most map applications use. GPX is also an important format. Since both are XML formats, we can use tools to filter the data. OSM has a rich Java tool called Osmosis, which we can use to filter the data as we want. I think it will be helpful for us. Even after filtering, the data can be huge, so partitioning is still needed.

I've looked at an algorithm called k-nearest neighbours. With it, we can find the neighbouring points within an area.

Honestly, I need to know more about my scope in our spatial data processing library. Simply put, we have a big data analysis platform (Stratosphere) and a spatial data processing library (JTS).

We have a big data file as input. We read the data into our own data types. Then we need to partition the data, assign sub-tasks, and hand the data to JTS for processing.
Am I right?
What kind of output do we expect from our library?
Do I need to worry about how operations such as intersecting or creating polygons work inside JTS?

If you could give me a real-world example that we are going to solve with our library in the future, it would help me a lot in clarifying my scope.



fhu...@gmail.com

Mar 9, 2014, 5:08:01 PM
to stratosp...@googlegroups.com
Hi Shashika,

yes, OSM support is definitely important! The suggested tool looks good at first glance. We would need to check whether it can be nicely integrated with Stratosphere.

Let me walk you through an example to make things (hopefully) a bit more clear.
Assume you have the spatial coordinates of the world's county/city borders (maybe from OpenStreetMap) and a set of tweets with position information (Long/Lat). The goal is to aggregate tweets by county and hour of day.

To do that, you need to do a spatial join that finds for each tweet coordinate (point) the county (area) it is contained in, group by county and hour, and count.
First of all, you need to read the data (OSM and plain coordinate data). This data needs to be converted into an internal representation of spatial objects. Such types are provided by JTS but need to be wrapped in order to be processable by Stratosphere. Once you have these types, you want to partition the data to enable parallel processing. For example, you partition the points into non-overlapping regions and distribute them over the compute nodes. Then, you partition the county areas according to the same regions. However, it is possible that a single area overlaps with more than one region. In this case, the county area needs to be shipped to all overlapping region partitions in order to get the correct result.
On each local node, every point is checked against the county areas to see whether it is contained in one of them (this check can be improved by using a spatial index that is built on the county areas). If a point is contained in an area, the point (and corresponding tweet) is emitted. All emitted tweets are grouped by area and time and counted.

The challenge in this case is to find a good region partitioning (e.g., partitioning should be finer in NYC or SF than in North Dakota) and efficient local processing.

I hope this example made things a bit more clear. Things become more complicated if you want to join paths and areas, or arbitrary spatial objects, and if you add some distance tolerance…
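
The walk-through can be condensed into a small single-process sketch. It is plain Python rather than a Stratosphere job, and it makes several simplifying assumptions that are not part of the message above: counties are modelled as axis-aligned rectangles instead of real polygons, the regions are a fixed uniform grid of hypothetical cell size `size`, and the "compute nodes" are just dictionary keys.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)
Tweet = Tuple[float, float, int]          # (lon, lat, hour_of_day)
Cell = Tuple[int, int]


def cell_of(lon: float, lat: float, size: float) -> Cell:
    """Region (grid cell) that a coordinate falls into."""
    return (int(lon // size), int(lat // size))


def cells_overlapping(rect: Rect, size: float) -> List[Cell]:
    """All regions a county rectangle overlaps; it is replicated to each of them."""
    cx0, cy0 = cell_of(rect[0], rect[1], size)
    cx1, cy1 = cell_of(rect[2], rect[3], size)
    return [(cx, cy) for cx in range(cx0, cx1 + 1) for cy in range(cy0, cy1 + 1)]


def tweets_per_county_and_hour(tweets: List[Tweet], counties: Dict[str, Rect],
                               size: float) -> Dict[Tuple[str, int], int]:
    # Partition the points: each tweet lands in exactly one region.
    tweet_parts: Dict[Cell, List[Tweet]] = defaultdict(list)
    for tweet in tweets:
        tweet_parts[cell_of(tweet[0], tweet[1], size)].append(tweet)

    # Partition the areas by the same regions, replicating where they overlap.
    county_parts: Dict[Cell, List[Tuple[str, Rect]]] = defaultdict(list)
    for name, rect in counties.items():
        for cell in cells_overlapping(rect, size):
            county_parts[cell].append((name, rect))

    # Local join per region, then group by (county, hour) and count.
    counts: Counter = Counter()
    for cell, local_tweets in tweet_parts.items():
        for lon, lat, hour in local_tweets:
            for name, (x0, y0, x1, y1) in county_parts.get(cell, ()):
                if x0 <= lon < x1 and y0 <= lat < y1:  # half-open: borders count once
                    counts[(name, hour)] += 1
    return dict(counts)
```

Because every tweet lives in exactly one region, replicating a county to several regions cannot produce duplicate matches here. That property is lost once the left side consists of paths or polygons that themselves span regions.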

Shashika Ranga Muramudalige

Mar 14, 2014, 11:59:35 AM
to stratosp...@googlegroups.com
Hi Fabian,

The example above gave me a clear picture of our requirements and scope. Based on those requirements for the spatial data library, I've gone through some research papers over the last few days, especially on SpatialHadoop.

I've created my proposal and submitted it to Google Melange.

You can check it.

Thank You !


Fabian Hueske

Mar 14, 2014, 11:07:13 PM
to stratosp...@googlegroups.com
Hi Shashika,

thank you for posting your proposal.
Looks good in my opinion!

Best,
Fabian


Shashika Ranga Muramudalige

Mar 18, 2014, 6:56:35 PM
to stratosp...@googlegroups.com
Hi Fabian,

As a first step, I've tried to create some new data types that are compatible with both the Stratosphere and the JTS data types. There are also some examples that I wrote to get familiar with the JTS functionality. When you have time, please check them out here. If you can, please give me feedback on any issues or needed changes.

Thank you !

fhu...@gmail.com

Mar 19, 2014, 11:43:20 PM
to stratosp...@googlegroups.com
Hi Shashika,

Sorry for not getting back to you earlier.
I once participated in a hackathon about spatial data on Stratosphere (this also triggered the idea of having a spatial library) and wrote some code to serialize JTS data types for Stratosphere.
The idea is to simply wrap the JTS data types and provide efficient de/serialization and locally work on these JTS objects using JTS code.
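
To illustrate the wrap-and-serialize idea in isolation, here is a minimal sketch in plain Python. It is not the hackathon code referred to in this message and uses neither JTS nor Stratosphere's serialization interface: `PathValue` is a hypothetical wrapper, and the binary layout (a 4-byte vertex count followed by pairs of 8-byte doubles) is an assumption picked for this example.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PathValue:
    """Hypothetical wrapper around the coordinates of a path (line string)."""
    coords: List[Tuple[float, float]]

    def write(self) -> bytes:
        # 4-byte big-endian vertex count, then two 8-byte doubles per vertex
        out = struct.pack(">i", len(self.coords))
        for x, y in self.coords:
            out += struct.pack(">dd", x, y)
        return out

    @classmethod
    def read(cls, data: bytes) -> "PathValue":
        (count,) = struct.unpack_from(">i", data, 0)
        coords = [struct.unpack_from(">dd", data, 4 + 16 * i) for i in range(count)]
        return cls(coords)
```

A fixed-width layout like this keeps records compact and cheap to decode, which matters because every object crosses this code path whenever data is shuffled between partitions.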

You can have a look at the code here:

Best,
Fabian


Shashika Ranga Muramudalige

Mar 21, 2014, 1:55:01 PM
to stratosp...@googlegroups.com
Hi Fabian,

Thank you for that link. I looked closely at your code for the de/serialization of the data types and learnt a lot from it. I played with those data types and created a few of my own here. I had also tried to build such types before, so I now have a proper understanding of the de/serialization of Stratosphere data types. We can use them for both Stratosphere and JTS data processing without issues.
I also made small changes to my proposal describing how Stratosphere's unique features will help our spatial data library.

Thank You !



Shashika Ranga Muramudalige

Sep 12, 2014, 12:29:56 PM
to stratosp...@googlegroups.com
Hi Fabian,

It has been a long time. I would like to contribute further to your project.

Robert Metzger

Sep 12, 2014, 12:41:08 PM
to d...@flink.incubator.apache.org, stratosp...@googlegroups.com
Hey Shashika,

Stratosphere was renamed to Apache Flink when we entered the Apache Incubator. We are now using a new mailing list at d...@flink.incubator.apache.org.
Please subscribe to dev@flink by sending an (empty) email to dev-su...@flink.incubator.apache.org.

I think adding a spatial data processing library on top of Flink is still a very interesting project.


Robert
