Partitioning and Indexing questions


ashish agarwal

Aug 11, 2020, 1:48:58 PM
to GeoSpark Discussion Board

1) If we increase the number of partitions for an SRDD while doing spatial partitioning with KDBTree, our assumption is that it will divide the data into progressively smaller grid cells. Is there a limit beyond which partitioning stops and a further increase in the number of partitions no longer helps?

2) What is the memory overhead if we increase the number of partitions exponentially for a spatially partitioned RDD?

3) We have business points and parcels for the USA and we are doing a point-in-polygon join. Can you suggest which dataset we should use for geospark.join.indexbuildside and which for geospark.join.spatitionside?

4) What is the indexing overhead? Since we use the data only once for the point-in-polygon join, does it make sense to index the data?

ashish agarwal

Aug 12, 2020, 12:34:51 PM
to GeoSpark Discussion Board
Please help

ashish agarwal

Aug 17, 2020, 1:10:36 PM
to GeoSpark Discussion Board
A gentle reminder



Jia Yu

Aug 18, 2020, 7:22:41 PM
to ashish agarwal, GeoSpark Discussion Board


---------- Forwarded message ---------
From: Jia Yu <ji...@apache.org>
Date: Tue, Aug 18, 2020 at 4:21 PM
Subject: Re: Partitioning and Indexing questions
To: ashish agarwal <ashishag...@gmail.com>
Cc: GeoSpark Discussion Board <geospark-dis...@googlegroups.com>


Hi Ashish,

1. Building an index on the larger dataset will help, even if the index is a one-time, on-the-fly index.
2. You cannot keep increasing the number of partitions, because that will eventually decrease performance. If your data includes many overlapping polygons, the join query will be slow in any case, because the spatial partitioning technique cannot handle them well.

Thanks,
Jia
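
[Illustration] One way to see why adding partitions eventually stops paying off is a rough, self-contained sketch like the one below. It is not GeoSpark code and does not use KDBTree: it assumes a plain uniform grid over the unit square and assumes that an object spanning several cells has to be copied into each cell it touches. All names and sizes (`make_rects`, `replicated_count`, 1000 rectangles of side 0.1) are hypothetical, chosen only for the demonstration.

```python
import random


def make_rects(n, size, seed=0):
    """Generate n axis-aligned squares of side `size` inside the unit square."""
    rng = random.Random(seed)
    rects = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0 - size)
        y = rng.uniform(0.0, 1.0 - size)
        rects.append((x, y, x + size, y + size))
    return rects


def replicated_count(rects, grid_n):
    """Count (rectangle, cell) pairs for a grid_n x grid_n uniform grid.

    Assumption: a rectangle is copied into every cell it intersects.
    """
    total = 0
    for x0, y0, x1, y1 in rects:
        # Range of cell indices touched along each axis, clamped to the grid.
        cx0 = min(int(x0 * grid_n), grid_n - 1)
        cx1 = min(int(x1 * grid_n), grid_n - 1)
        cy0 = min(int(y0 * grid_n), grid_n - 1)
        cy1 = min(int(y1 * grid_n), grid_n - 1)
        total += (cx1 - cx0 + 1) * (cy1 - cy0 + 1)
    return total


rects = make_rects(1000, 0.1)
for grid_n in (1, 2, 4, 8, 16, 32):
    copies = replicated_count(rects, grid_n)
    # Replication factor = stored copies per original rectangle.
    print(f"partitions={grid_n * grid_n:5d}  copies={copies:7d}  "
          f"factor={copies / len(rects):.2f}")
```

Under these assumptions the replication factor starts at 1 and keeps growing as the cells shrink below the size of the objects, so beyond some point more partitions mostly means more copies to store, shuffle, and deduplicate rather than less work per partition. Where that point lies for real data depends on the actual geometries and the partitioner.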

--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/a6f8e144-7513-4dba-be0b-1ec44ffbd10do%40googlegroups.com.

------------------------------------

Jia Yu (new email: jia...@wsu.edu)

Assistant Professor

Washington State University School of EECS

Reach me via: Homepage | GitHub

ashish agarwal

Aug 20, 2020, 1:40:20 PM
to GeoSpark Discussion Board



@Jia - Thanks for the response, and yes, we have overlapping polygons. Can you please suggest some ideas/thoughts on the question listed below as well?
3) We have business points and parcels for the USA and we are doing a point-in-polygon join. Can you suggest which dataset we should use for geospark.join.indexbuildside and which for geospark.join.spatitionside?



Jia Yu

Aug 20, 2020, 1:45:27 PM
to ashish agarwal, GeoSpark Discussion Board
Say you have a point df (left) and a polygon df (right). Usually, the point df is much larger than the polygon df, so the index build side should be left, and the spatial partition side should be left as well.
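
[Illustration] The rule of thumb in this thread (index and partition on the larger side) can be written down as a small helper. This is only a sketch: the function name `join_side_settings` and the row counts are hypothetical, and it only builds the key/value pairs for the two parameters named in the question. How those values are handed to the Spark session is left out on purpose and should be taken from the GeoSpark documentation.

```python
def join_side_settings(left_count, right_count):
    """Pick the larger side of the join for both index build and partitioning.

    Assumption: "left" and "right" refer to the order of the two DataFrames
    in the join, as in the reply above.
    """
    larger = "left" if left_count >= right_count else "right"
    return {
        "geospark.join.indexbuildside": larger,
        "geospark.join.spatitionside": larger,
    }


# Hypothetical sizes: many points on the left, fewer polygons on the right.
settings = join_side_settings(left_count=1_000_000, right_count=50_000)
for key, value in settings.items():
    print(f"{key}={value}")
```

If the polygon table were the larger one, the same helper would return "right" for both settings.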
