kNNQuery is extremely slow


ps-spark

Feb 14, 2019, 12:52:53 PM
to GeoSpark Discussion Board
Hello Everyone,

I started using GeoSpark a month ago and have gained some experience with it. I like it, but I cannot figure out how to use SpatialKnnQuery in a performant way. I imported a shapefile (the road network of Germany) as a LineStringRDD. I also have GPS coordinates, which I imported as an RDD[(key, Point)]. I can run a single SpatialKnnQuery fine. Now I want to find the roads the cars have driven on.


My first approach: the points to reverse-geocode are in an RDD

val spark = ... // SparkSession with appropriate configs (KryoSerializer, etc.)

val roads: LineStringRDD = ShapeFileReader.readToLineStringRDD(x, y, z)
roads.analyse()
roads.spatialPartitionedRDD.persist()

val geometryFactory = new GeometryFactory()

val gpsPoints: RDD[(key, Point)] = ... // using geometryFactory
gpsPoints.partitionBy(roads.getPartitioner)  // not sure whether this makes any sense

val result: RDD[(key, List[LineString])] = gpsPoints.mapValues(p => KNNQuery.SpatialKnnQuery(roads, p, 1, false))

I get an "object not serializable" exception because SpatialKnnQuery is used inside an RDD transformation :( Why can't I use two RDDs in the KNNQuery to distribute both the shape data and the points?

My second approach: the points to reverse-geocode are in an Array

val spark = ... // SparkSession with appropriate configs (KryoSerializer, etc.)

val roads: LineStringRDD = ShapeFileReader.readToLineStringRDD(x, y, z)
roads.analyse()
roads.spatialPartitionedRDD.persist()

val geometryFactory = new GeometryFactory()

val gpsPoints: RDD[(key, Point)] = ... // using geometryFactory
val gpsPointsArray: Array[(key, Point)] = gpsPoints.collect()

val result: Array[(key, List[LineString])] =
  gpsPointsArray.map(item => (item._1, KNNQuery.SpatialKnnQuery(roads, item._2, 1, false)))


It works, but it is extremely slow (about 30 points/min). Nominatim handles 250 points/s!

What am I doing wrong? What is the best way to run many KNN queries performantly?

Any help is appreciated!

Best wishes,
ps-spark

Jia Yu

Feb 15, 2019, 12:32:36 PM
to ps-spark, GeoSpark Discussion Board
Hi Ps-Spark,

There are two major improvements you can try:
(1) Use an index to speed up the query.
(2) Cache the built index in memory for repeated queries.

In your case, the "objectRDD" is your LineStringRDD.

There is no need to do spatial partitioning for a KNN query.
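
A rough sketch of what that looks like (assuming the GeoSpark 1.x Scala API; roads is the LineStringRDD from your snippet and queryPoint stands for one of your GPS points):

import org.apache.spark.storage.StorageLevel
import org.datasyslab.geospark.enums.IndexType
import org.datasyslab.geospark.spatialOperator.KNNQuery

// Build an R-tree on the raw RDD (false = do not build it on the spatially partitioned RDD).
roads.buildIndex(IndexType.RTREE, false)

// Cache the indexed RDD so repeated KNN queries reuse the same index.
roads.indexedRawRDD.persist(StorageLevel.MEMORY_ONLY)

// useIndex = true makes the query go through the R-tree.
val nearestRoad = KNNQuery.SpatialKnnQuery(roads, queryPoint, 1, true)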

I believe these solutions can significantly accelerate your queries.

Thanks,
Jia

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science




ps-spark

Feb 19, 2019, 12:13:52 PM
to GeoSpark Discussion Board
Hi Jia,

Thank you for your prompt answer. I have tried your suggestion, but the street network SpatialRDD does not fit into memory.

2019-02-19T16:38:37.212 WARN cara core Not enough space to cache rdd_24_0 in memory! (computed 2.8 GB so far) (org.apache.spark.storage.memory.MemoryStore:66)

There is a Kryo serializer parameter (spark.kryoserializer.buffer.max), but it only supports values up to 2048 MB. The street network is not especially large: it contains 1.3M LineStrings (only one state of Germany). The shapefile takes only 0.9 GB, but the LineStringRDD obviously exceeds the 2 GB limit :( I had to reduce the street network to 1M objects to be able to load the RDD into memory. Indeed, it is fast after that with indexing. It is a pity that I cannot use more than a 2 GB buffer size, even though I have 32 GB of memory in my cluster.

Bye,
ps-spark

ps-spark

Feb 19, 2019, 12:17:23 PM
to GeoSpark Discussion Board
Anyway, I have also tried another approach: I created a PointRDD from the GPS coordinates and used a spatial JoinQuery to get the intersections. Unfortunately, it did not return a single element. Is that because none of the objects (Point, LineString) are closed?

Jia Yu

Feb 19, 2019, 12:32:32 PM
to ps-spark, GeoSpark Discussion Board
Hi,

1. The cached LineStringRDD is certainly larger than the original RDD because the in-memory line strings are stored as loose Java objects; that is also why it is fast.
I recommend caching it in serialized format (MEMORY_ONLY_SER). This loses a little performance but reduces the memory footprint. Please make sure you use the GeoSpark Kryo serializer.
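
Roughly like this (a sketch assuming the GeoSpark 1.x API; the two config keys are the standard Spark serializer settings):

import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator

val spark = SparkSession.builder()
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .getOrCreate()

// Cache the indexed RDD in serialized form: a bit slower to read back,
// but much smaller in memory than loose Java objects.
roads.indexedRawRDD.persist(StorageLevel.MEMORY_ONLY_SER)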


2. For the JoinQuery, did you use DistanceJoinQuery? SpatialJoinQuery definitely returns nothing, because a LineString is not a closed ring and therefore covers no points.
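
Something along these lines (a sketch assuming the GeoSpark 1.x API; gpsPointRDD is your PointRDD and the 0.0001-degree radius is only a placeholder to tune):

import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.CircleRDD

// Buffer each GPS point into a small circle so it can intersect nearby line strings.
val circles = new CircleRDD(gpsPointRDD, 0.0001) // placeholder radius, in degrees

// Both sides must share the same spatial partitioner before the join.
roads.spatialPartitioning(GridType.KDBTREE)
circles.spatialPartitioning(roads.getPartitioner)

// Build the index on the spatially partitioned RDD this time.
roads.buildIndex(IndexType.RTREE, true)

// considerBoundaryIntersection = true so a point lying on a line string counts as a match.
val matchedRoads = JoinQuery.DistanceJoinQueryFlat(roads, circles, true, true)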

Thanks,
Jia


------------------------------------

Jia Yu,

Ph.D. Student in Computer Science


