Joining a SpatialRDD with a shapefile


Kaushik Roy

Nov 1, 2019, 7:25:11 PM
to GeoSpark Discussion Board
Hi everyone,

My use case is as follows:

The client has a lot of spatial data all across Ontario. Third-party shapefiles arrive in batches and need to be joined against the resident data to get answers. So far they have been using PostGIS and GeoPandas, which work OK but do not scale.
I have been able to read shapefiles and run SQL using GeoSpark, but it is quite slow. I am looking into tuning it, and I am also trying the RDD solution.

As of GeoSpark version 1.2.0, my questions are as follows:

1. Is there any noticeable difference between RDD joins and SQL joins as of the current version? I have been following this project for quite some time and have seen a lot of improvements added over the last 4 years.
2. Does the order of the inputs in the join matter? On which side should I put the shapefile, and on which side the data? Is the behaviour the same for RDD and SQL?
3. There are a bunch of joins explained in the docs. How do I know which join to use where? For example, so far we have been building an R-tree out of the data and then joining the shapefiles using that. Do we build an R-tree in this case too?
4. Can the result be saved as Parquet? I read somewhere that the data will be saved as WKT. Can we use JTS library functions in the Spark DSL?
5. Can I run the joins without spatial partitioning? Alternatively, suppose I partition the data by Ontario counties/regions. Could that help with the joins against the shapefiles?
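
For reference, the index-then-probe pattern mentioned in question 3 looks roughly like the sketch below. This is plain Python with no GeoSpark or GeoPandas calls. A uniform grid stands in for the R-tree to keep it short, and every name in it (`build_grid_index`, `point_in_polygon`, `join`, the sample data) is hypothetical and only for illustration.

```python
from collections import defaultdict


def build_grid_index(points, cell):
    """Bucket (id, x, y) points by grid cell; a simple stand-in for an R-tree."""
    index = defaultdict(list)
    for pid, x, y in points:
        index[(int(x // cell), int(y // cell))].append((pid, x, y))
    return index


def point_in_polygon(x, y, ring):
    """Ray-casting test against a polygon given as a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join(index, cell, polygons):
    """Yield (polygon_id, point_id): filter by bounding box, then refine exactly."""
    for poly_id, ring in polygons:
        xs = [p[0] for p in ring]
        ys = [p[1] for p in ring]
        # Filter step: visit only grid cells overlapping the polygon's bounding box.
        for cx in range(int(min(xs) // cell), int(max(xs) // cell) + 1):
            for cy in range(int(min(ys) // cell), int(max(ys) // cell) + 1):
                for pid, x, y in index.get((cx, cy), []):
                    # Refine step: exact geometric test on the candidates.
                    if point_in_polygon(x, y, ring):
                        yield poly_id, pid


# Made-up sample data: three resident points and one square "shapefile" polygon.
points = [("a", 1.0, 1.0), ("b", 5.0, 5.0), ("c", 2.5, 2.5)]
polygons = [("sq", [(0, 0), (3, 0), (3, 3), (0, 3)])]

index = build_grid_index(points, cell=2.0)
matches = sorted(join(index, 2.0, polygons))
```

The index is built once over the resident data and each incoming polygon probes it, which is why I am asking which side of the join should carry the index in GeoSpark.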

Any clarity on this would be very welcome.

Thanks,
Roy