Slow join with ST_CONTAINS

Alex To

Jun 7, 2018, 9:06:04 PM
to GeoSpark Discussion Board
Hi

I have a simple join statement as follows:

"SELECT df1.*, df2.* FROM df1 
 INNER JOIN df2 
 ON 1=1 WHERE ST_CONTAINS(df1.WKB_GEOMETRY, df2.WKB_GEOMETRY) = TRUE"

Basically, I only want the rows from both df1 and df2 where df2.WKB_GEOMETRY is completely contained by df1.WKB_GEOMETRY.

I ran the above query in both PostGIS and GeoSpark (on the same server). They seem to produce similar results, but GeoSpark takes significantly longer (almost 2 days), whereas PostGIS takes about 1 hour.

I am using local[*], so all 48 cores are maxed at 100%. The server has 512GB of RAM, of which about 256GB is used by the Spark job. I initialized the SparkSession like this:

val spark: SparkSession = SparkSession
  .builder()
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .appName("bitre")
  .getOrCreate()

Any suggestions about what I can do to speed it up?

Thank you


Jia Yu

Jun 8, 2018, 12:26:59 AM
to Alex To, GeoSpark Discussion Board
Hi Alex,

The INNER JOIN keyword cannot be captured by the GeoSpark join query optimizer. We use pattern matching to recognize certain join shapes and optimize them, and I forgot to capture this one...

Since it is not optimized, Spark falls back to a cartesian product join, which has O(n^2) complexity and is extremely slow.

Actually, the default GeoSpark joins are all inner joins, so you can go ahead and write it this way:

SELECT *
FROM polygondf, pointdf
WHERE ST_Contains(polygondf.polygonshape,pointdf.pointshape)
http://datasystemslab.github.io/GeoSpark/api/sql/GeoSparkSQL-Optimizer/#range-join
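Putting that together, a minimal end-to-end sketch might look like the following (assuming GeoSpark 1.x, where GeoSparkSQLRegistrator.registerAll is the documented way to register the ST_* functions so the optimizer rule can rewrite the spatial predicate):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.serializer.KryoSerializer
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

val spark = SparkSession
  .builder()
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .appName("range-join-example")
  .getOrCreate()

// Register the ST_* functions and the optimizer rules
GeoSparkSQLRegistrator.registerAll(spark)

// The comma-separated FROM plus ST_Contains in the WHERE clause is the
// pattern the optimizer recognizes as a range join (no ON 1=1, no "= TRUE")
val joined = spark.sql(
  """SELECT *
    |FROM polygondf, pointdf
    |WHERE ST_Contains(polygondf.polygonshape, pointdf.pointshape)""".stripMargin)
```

The table and column names here are placeholders from the documentation example, not from your data.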

Thanks,
Jia

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science



--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To post to this group, send email to geospark-dis...@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/d121246d-e0c5-43af-a55b-00a4663ed518%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex To

Jun 8, 2018, 12:35:24 AM
to Jia Yu, GeoSpark Discussion Board
Cool, thank you very much for your prompt reply.

You mentioned that you forgot to capture it. Is there any plan to implement this in a future release?

Best Regards




--

Alex To

PhD Candidate

School of Information Technologies

Knowledge Discovery and Management Research Group

Faculty of Engineering & IT

THE UNIVERSITY OF SYDNEY | NSW | 2006

Desk 4e69 | Building J12| 1 Cleveland Street

M. +61423330656

Jia Yu

Jun 8, 2018, 12:38:04 AM
to Alex To, GeoSpark Discussion Board
Yes, I can add it in 1.2.0. I guess this is just a quick fix. Can you create an issue on GeoSpark's GitHub so that we can track it?

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science



Alex To

Jun 8, 2018, 12:41:48 AM
to Jia Yu, GeoSpark Discussion Board
Sure, I just created an issue on GitHub.

Best Regards


Alex To

Jul 1, 2018, 11:29:14 PM
to GeoSpark Discussion Board
Hi, just a quick update on this issue. I think my measurement was not correct.

PostGIS did not take 1 hour; it took several hours too.

I made several changes to my project, and now I can't remember whether I used the KryoSerializer or the default Java serializer for this process previously.

The version I have now is configured like this (I didn't use kdbtree and quadtree before, and I'm not sure whether I used KryoSerializer previously):

.config("spark.serializer", classOf[KryoSerializer].getName)
.config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
.config("geospark.join.gridtype", "kdbtree")
.config("geospark.global.indextype", "quadtree")

Suddenly the process finishes in an hour and a half. So do you think the INNER JOIN still needs further optimisation? Sorry for the false alarm.
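For reference, pulling those settings into one builder would look roughly like this (a sketch; the geospark.* keys are read by GeoSpark's join optimizer at query time, and local[*] matches the setup described earlier in the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.serializer.KryoSerializer
import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator

val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .config("geospark.join.gridtype", "kdbtree")      // spatial partitioning grid for the join
  .config("geospark.global.indextype", "quadtree")  // per-partition spatial index
  .appName("bitre")
  .getOrCreate()
```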

Best Regards

Jia Yu

Jul 2, 2018, 2:32:48 AM
to Alex To, GeoSpark Discussion Board
Are you saying that with this code:

"SELECT df1.*, df2.* FROM df1 
 INNER JOIN df2 
 ON 1=1 WHERE ST_CONTAINS(df1.WKB_GEOMETRY, df2.WKB_GEOMETRY) = TRUE"
you can get the expected performance? Can you print the query plan of this SQL query in Spark?

1 hour is a long time for GeoSpark join. What's your data type and data size?

Thanks,
Jia

------------------------------------

Jia Yu,

Ph.D. Student in Computer Science



Alex To

Jul 2, 2018, 2:57:13 AM
to GeoSpark Discussion Board
Hi

My exact code is:

lga.createOrReplaceTempView("df1")
linkRef.createOrReplaceTempView("df2")
val mapping = spark.sql(
  """SELECT df2.LINK_ID, df2.DIR_TRAVEL, df1.LGA_CODE17, df1.LGA_NAME17, df1.STE_NAME17
     FROM df1, df2
     WHERE ST_CONTAINS(
       ST_Transform(ST_GeomFromWKT(df1.WKT), 'epsg:4283', 'epsg:4326'),
       ST_GeomFromWKT(df2.WKT))""")

Here is the printout from mapping.explain:

== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [LGA_CODE17#11, LGA_NAME17#12, STE_NAME17#14]
+- Filter isnotnull(WKT#10)
   +- Relation[WKT#10,LGA_CODE17#11,LGA_NAME17#12,STE_CODE17#13,STE_NAME17#14,AREASQKM17#15] csv
and
Project [LINK_ID#1224, DIR_TRAVEL#1256]
+- Filter  **org.apache.spark.sql.geosparksql.expressions.ST_Contains$**
   +- Relation[WKT#1223,LINK_ID#1224,ST_NAME#1225,FEAT_ID#1226,ST_LANGCD#1227,NUM_STNMES#1228,ST_NM_PREF#1229,ST_TYP_BEF#1230,ST_NM_BASE#1231,ST_NM_SUFF#1232,ST_TYP_AFT#1233,ST_TYP_ATT#1234,ADDR_TYPE#1235,L_REFADDR#1236,L_NREFADDR#1237,L_ADDRSCH#1238,L_ADDRFORM#1239,R_REFADDR#1240,R_NREFADDR#1241,R_ADDRSCH#1242,R_ADDRFORM#1243,REF_IN_ID#1244,NREF_IN_ID#1245,N_SHAPEPNT#1246,... 75 more fields] csv
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;

Here is the data file: data.zip

The lga data frame is read from lga_aust_2017.csv and the linkRef data frame from link_ref.csv.
lga has around 544 rows and link_ref about 3 million rows.

I do not know what kind of performance I should expect, because previously I did something wrong that made me think GeoSpark was several times slower than PostGIS, which is not the case. This query now finishes in an hour and a half, which is acceptable for me.