I have a point DataFrame with around 3 million points and a complex polygon DataFrame with around 700k polygons. Each polygon's geometry has between 500 and nearly 1000 vertices. I am running ST_Within(pointdf.point, polygondf.polygon).
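For reference, the query is roughly the following (a simplified sketch; the temp view names mirror the DataFrame names and the SELECT list is a placeholder, not the exact query):

pointDf.createOrReplaceTempView("pointdf")
polygonDf.createOrReplaceTempView("polygondf")
val joined = sparkSession.sql(
  "SELECT p.*, g.* FROM pointdf p, polygondf g WHERE ST_Within(p.point, g.polygon)")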
The job fails with a Spark memory issue:
Lost task 7.0 in stage 10.0 (TID 8219, 10.120.11.156, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 37.6 GB of 37 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I tried increasing the executor memory (spark.executor.memory) and the memory overhead (spark.yarn.executor.memoryOverhead), but it did not help.
Below are the settings used:
Settings
GeoSpark version = 1.2.0
Apache Spark version = 2.3
JRE version = 1.8
API type = Scala
I added all the necessary configuration, such as:
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator

val sparkSession = SparkSession.builder()
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .config("geospark.join.gridtype", "kdbtree")
  .config("spark.sql.shuffle.partitions", 2000)
  .master("yarn").appName("GeoSpatialAnalysis").getOrCreate()
I also repartitioned pointDf and polygonDf to 2000 partitions (see the snippet below).
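That is just the plain DataFrame repartition; the val names here are only for illustration:

val pointDfRepart = pointDf.repartition(2000)
val polygonDfRepart = polygonDf.repartition(2000)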
I am still hitting the same error.
If GeoSpark core (the RDD API) is more performant than GeoSpark SQL, how would I express this WITHIN operation as a spatial join with the RDD API? My current guess, based on the RDD join-query tutorial, is sketched below, but I am not sure it is correct.
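The geometry column names come from the SQL query above; the conversion through Adapter, the KDB-tree partitioning and the quadtree index are my assumptions, not a tested setup:

import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geosparksql.utils.Adapter

// Convert both DataFrames to SpatialRDDs on their geometry columns
val pointRDD = Adapter.toSpatialRdd(pointDf, "point")
val polygonRDD = Adapter.toSpatialRdd(polygonDf, "polygon")

// Collect the statistics GeoSpark needs before spatial partitioning
pointRDD.analyze()
polygonRDD.analyze()

// Partition both sides with the same KDB-tree partitioner
pointRDD.spatialPartitioning(GridType.KDBTREE)
polygonRDD.spatialPartitioning(pointRDD.getPartitioner)

// Optional local index on the partitioned point RDD
pointRDD.buildIndex(IndexType.QUADTREE, true)

// considerBoundaryIntersection = false should give "fully within the polygon" semantics
val usingIndex = true
val result = JoinQuery.SpatialJoinQueryFlat(pointRDD, polygonRDD, usingIndex, false)

Is this the right way to express WITHIN on the RDD API, and would it behave better memory-wise than the SQL join?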