Regarding the spatialRDD indexing

Mayur Bhosale

Jul 17, 2018, 2:35:51 AM
to GeoSpark Discussion Board
Hi,

We are working on Spark3D (https://github.com/astrolabsoftware/spark3D), a Spark extension for processing large-scale 3D data sets.

Our approach is fairly similar to how GeoSpark supports 2D spatial data sets.

We are looking at indexing strategies we could deploy, similar to the RTREE- and QUADTREE-based indexing in GeoSpark.

I went through the indexing code and have a few questions about it. I would really appreciate it if someone could clarify them:

1. What are the use cases for indexedRDD? In which cases does indexedRDD perform better than rawRDD/spatialPartitionedRDD, and how?

2. How exactly is indexedRDD different from spatialPartitionedRDD?
One difference I could spot in the QUADTREE indexing code: for spatialPartitionedRDD, `placeObject` only finds the right node/partition for the object's location within the data structure, whereas for indexedRDD, `.insert` actually places the object inside the underlying data structure.

3. Why doesn't KNN support QUADTREE-based indexing? (Maybe the answers to the two questions above will clarify this one as well.)
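
To make the distinction in question 2 concrete, here is a minimal, self-contained Python sketch. It is not GeoSpark's code: the `QuadNode` class and the `build`, `place_object`, `insert` and `range_query` functions are hypothetical names chosen for illustration, and the tree is a fixed-depth toy. It only shows the difference as I understand it: a partitioning-style lookup returns a leaf id and stores nothing, while an index-style insert keeps the object in the leaf so that later queries can skip whole subtrees.

```python
from dataclasses import dataclass, field


@dataclass
class QuadNode:
    # Bounding box of this node (hypothetical toy structure, not GeoSpark's).
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    leaf_id: int = -1                              # set on leaves only
    children: list = field(default_factory=list)   # empty for leaves
    items: list = field(default_factory=list)      # used by insert() only

    def child_for(self, x, y):
        xm = (self.xmin + self.xmax) / 2
        ym = (self.ymin + self.ymax) / 2
        # Child order: low-x/low-y, high-x/low-y, low-x/high-y, high-x/high-y.
        return self.children[(1 if x >= xm else 0) + (2 if y >= ym else 0)]


def build(xmin, ymin, xmax, ymax, depth):
    """Build a full quadtree of the given depth and number its leaves."""
    counter = [0]

    def rec(x0, y0, x1, y1, d):
        node = QuadNode(x0, y0, x1, y1)
        if d == 0:
            node.leaf_id = counter[0]
            counter[0] += 1
            return node
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        node.children = [rec(x0, y0, xm, ym, d - 1), rec(xm, y0, x1, ym, d - 1),
                         rec(x0, ym, xm, y1, d - 1), rec(xm, ym, x1, y1, d - 1)]
        return node

    return rec(xmin, ymin, xmax, ymax, depth)


def place_object(root, x, y):
    """Partitioning-style lookup: return the leaf id; the tree stores nothing."""
    node = root
    while node.children:
        node = node.child_for(x, y)
    return node.leaf_id


def insert(root, x, y, payload):
    """Index-style insert: the leaf keeps the object for later queries."""
    node = root
    while node.children:
        node = node.child_for(x, y)
    node.items.append((x, y, payload))


def range_query(node, qx0, qy0, qx1, qy1):
    """Return payloads inside the query box, skipping non-overlapping nodes."""
    if node.xmax < qx0 or node.xmin > qx1 or node.ymax < qy0 or node.ymin > qy1:
        return []
    if not node.children:
        return [p for (x, y, p) in node.items
                if qx0 <= x <= qx1 and qy0 <= y <= qy1]
    out = []
    for child in node.children:
        out.extend(range_query(child, qx0, qy0, qx1, qy1))
    return out


tree = build(0, 0, 8, 8, depth=1)  # one split, four leaves
points = [(1, 1, "a"), (6, 1, "b"), (1, 6, "c"), (6, 6, "d"), (7, 7, "e")]

# Partitioning: each point is only mapped to a partition id.
partition_ids = [place_object(tree, x, y) for x, y, _ in points]

# Indexing: each point is stored, so a range query can prune three leaves.
for x, y, p in points:
    insert(tree, x, y, p)
hits = range_query(tree, 5, 5, 8, 8)
```

In this toy run the partition ids only say where each point would be sent, while the range query over the populated tree answers directly from the one leaf that overlaps the query box.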

Thanks in advance :) 

Jia Yu

Jul 22, 2018, 6:26:39 PM
to mayu...@gmail.com, GeoSpark Discussion Board
Hi Mayur,

1. Regarding the purpose and meaning of each RDD, please refer to this link [1].

Note that spatialPartitionedRDD and indexedRDD are only used in SpatialJoinQuery/DistanceJoinQuery.

rawRDD and indexedRawRDD are for range queries and KNN queries.

Based on my recent benchmark, indexedRawRDD and indexedRDD outperform their non-indexed versions when the geometries are polygons or the queries are highly selective (<0.1%).

QuadTree indexes perform better than R-Tree indexes in most cases.

2. The R-Tree index is naturally good for KNN because it uses MBRs (minimum bounding rectangles) and clusters objects by distance. The Quad-Tree, in contrast, is a space-partitioning tree, so you can hardly prune data by distance. For the details of how to do KNN in an R-Tree, please read this famous paper [2]. I implemented this idea in both LocationTech JTS and the JTSNesrplus.

3. If you want to implement this in GeoSpark, you will probably need to extend the Quad-Tree to an Octree and the R-Tree to a 3D R-Tree.
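
The distance pruning mentioned in point 2 can be sketched roughly as follows. This is a simplified, flat illustration in Python, not the JTS or GeoSpark implementation and not the full branch-and-bound algorithm from the paper [2]: it assumes a single level of leaves, each with an MBR, and the names `mbr_of`, `mindist` and `knn` are hypothetical. The idea it shows is that the minimum possible distance from the query point to an MBR is a lower bound for every object inside it, so a leaf whose bound is not smaller than the current k-th best distance never has to be opened.

```python
import heapq
import math


def mbr_of(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of 2D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mindist(q, mbr):
    """Smallest possible distance from point q to anything inside the MBR."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)


def knn(leaves, q, k):
    """leaves: list of (mbr, points).

    Returns (k nearest points sorted by distance, number of leaves opened).
    """
    # Visit leaves in order of their lower bound on distance.
    order = sorted(leaves, key=lambda leaf: mindist(q, leaf[0]))
    best = []    # max-heap via negated distance: entries are (-dist, point)
    opened = 0
    for mbr, pts in order:
        if len(best) == k and mindist(q, mbr) >= -best[0][0]:
            break  # every remaining leaf is at least this far away
        opened += 1
        for p in pts:
            d = math.hypot(q[0] - p[0], q[1] - p[1])
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    nearest = [p for _, p in sorted(best, key=lambda t: -t[0])]
    return nearest, opened


groups = [
    [(0, 0), (1, 1), (2, 0)],
    [(10, 10), (11, 12), (12, 10)],
    [(20, 0), (21, 1), (22, 2)],
]
leaves = [(mbr_of(g), g) for g in groups]
neighbours, opened = knn(leaves, (0.5, 0), 2)
```

For a 3D version, the same lower bound extends with a third term for the z axis, so the pruning logic itself does not change when the rectangles become boxes.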

Please feel free to let me know if you need any help implementing your idea in GeoSpark. I am available via email, Skype, Slack, Zoom, phone calls, video conference, ...


Thanks,
Jia

[2] Nearest Neighbor Queries https://dl.acm.org/citation.cfm?id=223794 

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science



Jia Yu

Jul 22, 2018, 6:31:38 PM
to mayu...@gmail.com, GeoSpark Discussion Board
Hi,

I replied to you a couple of days ago, but my reply was put into the Discussion Board's spam folder. It is back now. Please read it.

Thanks,
Jia

