How do I cache a spatial dataframe with spatial partitioning and indexes?


fajwt...@gmail.com

Feb 19, 2019, 9:31:51 AM
to GeoSpark Discussion Board
In the manual there is the following query:

http://datasystemslab.github.io/GeoSpark/api/sql/GeoSparkSQL-Optimizer/#range-join

SELECT * FROM polygondf, pointdf WHERE ST_Contains(polygondf.polygonshape,pointdf.pointshape)

Or, alternatively, formulated in Scala:

var joindf = polygondf.as("polygons").join(pointdf.as("points"), callUDF("ST_Contains", $"polygons.polygonshape", $"points.pointshape"))

How could I cache the polygondf, such that it is spatially partitioned and indexed?

I have a pipeline where pointdf is actually a stream, and what seems to be happening is that polygondf is resampled and repartitioned in every batch, even though I tried to cache it by calling:

var polygondf = sourcedf.cache()

I'm guessing polygondf at this point only has the raw spatial data before it has been sampled, partitioned and indexed.

I am experimenting with converting to a SpatialRDD [1], forcing the partitioning and index computation, converting back to a dataframe, and then calling .cache(). However, it's not totally clear how to do this or whether it will work. There may also be an easier way that I have overlooked.

Thanks,


Frank

[1] http://datasystemslab.github.io/GeoSpark/tutorial/sql/#dataframe-to-spatialrdd

Jia Yu

Feb 21, 2019, 10:40:39 PM
to fajwt...@gmail.com, GeoSpark Discussion Board
Hi,

In the GeoSpark DataFrame API, there is currently no way to cache a spatial index. Every join query does its indexing and spatial partitioning on the fly.

To have the freedom to cache everything, you have to use the GeoSpark RDD API.
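
A rough sketch of what that could look like with the RDD API follows; it is not a tested recipe. It assumes a GeoSpark 1.x release whose Adapter exposes toSpatialRdd(dataFrame, geometryFieldName) (older releases use the conversion shown in the tutorial linked as [1]), and it reuses the column names polygonshape and pointshape from the query above. The object and method names (CachedPolygonJoin, prepareCachedPolygons, joinBatch) are made up for illustration, and the grid and index types are arbitrary choices.

```scala
import com.vividsolutions.jts.geom.Geometry
import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.SpatialRDD
import org.datasyslab.geosparksql.utils.Adapter

// Hypothetical helper object, not part of GeoSpark.
object CachedPolygonJoin {

  // Run once, outside the streaming loop: partition, index and cache the polygons.
  def prepareCachedPolygons(polygonDf: DataFrame): SpatialRDD[Geometry] = {
    // Assumption: this Adapter overload exists in your GeoSpark version.
    val polygonRdd = Adapter.toSpatialRdd(polygonDf, "polygonshape")
    polygonRdd.analyze()                              // boundary and count, needed by the partitioner
    polygonRdd.spatialPartitioning(GridType.KDBTREE)  // sampling happens here, once
    polygonRdd.buildIndex(IndexType.QUADTREE, true)   // true: index the spatially partitioned RDD
    polygonRdd.indexedRDD.persist(StorageLevel.MEMORY_ONLY)
    polygonRdd.indexedRDD.count()                     // action that forces the index to be built and cached
    polygonRdd
  }

  // Run for every micro-batch; pointDf must be a static (per-batch) DataFrame.
  def joinBatch(polygonRdd: SpatialRDD[Geometry],
                pointDf: DataFrame): JavaPairRDD[Geometry, Geometry] = {
    val pointRdd = Adapter.toSpatialRdd(pointDf, "pointshape")
    // Reuse the polygons' partitioner so both sides share the same grid.
    pointRdd.spatialPartitioning(polygonRdd.getPartitioner)
    // Arguments: objects, query windows, useIndex, considerBoundaryIntersection.
    // false keeps only points fully covered by a polygon, which is meant to
    // approximate ST_Contains. Result pairs are (polygon, point).
    JoinQuery.SpatialJoinQueryFlat(pointRdd, polygonRdd, true, false)
  }
}
```

The idea is that only the points are repartitioned per batch, while the polygon partitions and index come from the cache. Whether the join actually picks up the existing polygon index depends on the GeoSpark version, so it is worth checking the Spark UI to confirm that no polygon sampling stage shows up in later batches.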

Thanks,
Jia

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science


