GeoSpark job failing on Save step with OOM issues

27 views
Skip to first unread message

Ryan Bessey

unread,
Jan 30, 2019, 2:34:13 PM1/30/19
to GeoSpark Discussion Board
We have a job that does a spatial join of 27GB of location data with the Texas state boundary.  It works fine on smaller datasets but fails with the 27GB dataset.  Large datasets with smaller state boundaries also work fine.  The errors seem to be all related to Out Of Memory.  We've tried tuning many spark configurations (executor memory, driver memory, executor cores, driver cores, num executors, maxAppAttempts, madResultSize, memoryOverhead, network timeout, etc.) .  We've also tried different GeoSpark SQL settings (different index types and grid types), and different SQL functions (ST_INTERSECTS, ST_CONTAINS, ST_WITHINS).  Nothing seems to work.  It gets really close to finishing the Save job and then fails.  We noticed that the partitions that are getting written to s3 vary greatly in size.  Some are a few KB and others a couple hundred MB.  It seems that the partitions from the join are greatly skewed and is causing the issue.  Do you have any suggestions?

Thank you,
Ryan

Jia Yu

unread,
Feb 1, 2019, 3:09:24 AM2/1/19
to Ryan Bessey, GeoSpark Discussion Board
Hi Ryan,

This is probably caused by the inappropriate partition numbers in SQL based spatial join. To solve this issue, I suggest you use GeoSparkSQL+GeoSparkRDD solution. A full example is here:


This uses GeoSpark Core for joins and then convert the RDD to Dataframe.

Note that:
1. This is the API for 1.2.0-SNAPSHOT. You don’t need to worry about the column names of DataFrame, GeoSpark-core will carry them and return them back to the result DataFrame
2. You can consider increasing the size of PointRDD partition.

Regards,
Jia

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science



--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To post to this group, send email to geospark-dis...@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/77836015-ccd4-415d-895d-2ad894bf9243%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ryan Bessey

unread,
Feb 5, 2019, 4:39:04 PM2/5/19
to GeoSpark Discussion Board
Hi Jia,

We've tried implementing as you suggested, but are still having similar issues.  Here is the code:

var polygonDf = sparkSession.sql("select ST_GeomFromWKT(w.geometry) as polygeom, w.* from wkt_table w")
var polygonRDD = new SpatialRDD[Geometry]
polygonRDD
.rawSpatialRDD = Adapter.toRdd(polygonDf)
polygonRDD
.analyze()

var pointDf = sparkSession.sql("select ST_Point(l.long, l.lat) as pointgeom, l.* from locations l")
var pointRDD = new SpatialRDD[Geometry]
pointRDD
.rawSpatialRDD = Adapter.toRdd(pointDf)
pointRDD
.analyze()

pointRDD.spatialPartitioning(GridType.QUADTREE)
polygonRDD.spatialPartitioning(pointRDD.getPartitioner)

pointRDD.buildIndex(IndexType.QUADTREE, true)

var joinResultPairRDD = JoinQuery.SpatialJoinQueryFlat(pointRDD, polygonRDD, true, true)

val list1 = polygonDf.columns.toList
val list2 = pointDf.columns.toList

var joinResultDf = Adapter.toDf(joinResultPairRDD, list1, list2, sparkSession)

joinResultDf

Do you see any issues or have any suggestions?

Thanks,
Ryan

Jia Yu

unread,
Feb 7, 2019, 4:04:12 AM2/7/19
to Ryan Bessey, GeoSpark Discussion Board
Hi Ryan,

Before you convert Df to PointRDD, please try to double or triple the partition size. In addition, use GridType.KDBTree instead.

image.png

Thanks,
Jia

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science


--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To post to this group, send email to geospark-dis...@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
Reply all
Reply to author
Forward
0 new messages