I have a point DataFrame with around 3 million points and a complex polygon DataFrame with around 700k polygons. Each polygon's geometry has between 500 and nearly 1000 vertices. I am running ST_Within(pointdf.point, polygondf.polygon).
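For reference, the query is roughly the following (a simplified sketch; the temp view names mirror the DataFrame names and the SELECT list is a placeholder, not the exact query):

pointDf.createOrReplaceTempView("pointdf")
polygonDf.createOrReplaceTempView("polygondf")
val joined = sparkSession.sql(
  "SELECT p.*, g.* FROM pointdf p, polygondf g WHERE ST_Within(p.point, g.polygon)")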
The job fails with a Spark memory issue:
Lost task 7.0 in stage 10.0 (TID 8219, 10.120.11.156, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 37.6 GB of 37 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I tried increasing the executor memory (spark.executor.memory) and the memory overhead (spark.yarn.executor.memoryOverhead), but it did not help.
Below are the settings used:
Settings
GeoSpark version = 1.2.0
Apache Spark version = 2.3
JRE version = 1.8
API type = Scala
I added all the necessary configuration, such as:
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator

val sparkSession = SparkSession.builder()
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .config("geospark.join.gridtype", "kdbtree")
  .config("spark.sql.shuffle.partitions", 2000)
  .master("yarn").appName("GeoSpatialAnalysis").getOrCreate()
I also repartitioned pointDf and polygonDf to 2000 partitions (see the snippet below).
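That is just the plain DataFrame repartition; the val names here are only for illustration:

val pointDfRepart = pointDf.repartition(2000)
val polygonDfRepart = polygonDf.repartition(2000)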
I am still hitting the same error.
If GeoSpark core (the RDD API) is more performant than GeoSpark SQL, how would I express this WITHIN operation as a spatial join with the RDD API? My current guess, based on the RDD join-query tutorial, is sketched below, but I am not sure it is correct.
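The geometry column names come from the SQL query above; the conversion through Adapter, the KDB-tree partitioning and the quadtree index are my assumptions, not a tested setup:

import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geosparksql.utils.Adapter

// Convert both DataFrames to SpatialRDDs on their geometry columns
val pointRDD = Adapter.toSpatialRdd(pointDf, "point")
val polygonRDD = Adapter.toSpatialRdd(polygonDf, "polygon")

// Collect the statistics GeoSpark needs before spatial partitioning
pointRDD.analyze()
polygonRDD.analyze()

// Partition both sides with the same KDB-tree partitioner
pointRDD.spatialPartitioning(GridType.KDBTREE)
polygonRDD.spatialPartitioning(pointRDD.getPartitioner)

// Optional local index on the partitioned point RDD
pointRDD.buildIndex(IndexType.QUADTREE, true)

// considerBoundaryIntersection = false should give "fully within the polygon" semantics
val usingIndex = true
val result = JoinQuery.SpatialJoinQueryFlat(pointRDD, polygonRDD, usingIndex, false)

Is this the right way to express WITHIN on the RDD API, and would it behave better memory-wise than the SQL join?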