"SELECT df1.*, df2.* FROM df1
INNER JOIN df2
ON 1=1 WHERE ST_CONTAINS(df1.WKB_GEOMETRY, df2.WKB_GEOMETRY) = TRUE"
Basically, what I want to do is that I only want the rows from both df1 and df2 where df2.WKB_GEOMETRY is completely contained by df1.WKB_GEOMETRY
I tried to run the above query in PostGIS and GeoSpark (on the same server) and they seem to produce similar results but GeoSpark takes significantly longer (like almost 2 days)
where PostGIS takes about 1 hour.
I am using local[*] so all my 48 cores are max at 100%, the server has 512GB Ram, about 256GB is used by the Spark job. I also initialized sparkSession like this
val spark: SparkSession = SparkSession
.builder()
.config("spark.serializer", classOf[KryoSerializer].getName)
.config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
.appName("bitre")
.getOrCreate()Any suggestions about what I can do to speed it up?Thank you
SELECT * FROM polygondf, pointdf WHERE ST_Contains(polygondf.polygonshape,pointdf.pointshape)http://datasystemslab.github.io/GeoSpark/api/sql/GeoSparkSQL-Optimizer/#range-join
------------------------------------
Jia Yu,
Ph.D. Student in Computer Science
Reach me via Jia Yu's Homepage | GitHub Repositories
--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To post to this group, send email to geospark-dis...@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/d121246d-e0c5-43af-a55b-00a4663ed518%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussion-board+unsub...@googlegroups.com.
To post to this group, send email to geospark-discussion-board@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/d121246d-e0c5-43af-a55b-00a4663ed518%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Alex To
PhD Candidate
School of Information Technologies
Knowledge Discovery and Management Research Group
Faculty of Engineering & IT
THE UNIVERSITY OF SYDNEY | NSW | 2006
Desk 4e69 | Building J12| 1 Cleveland Street
M. +61423330656
------------------------------------
Jia Yu,
Ph.D. Student in Computer Science
Reach me via Jia Yu's Homepage | GitHub Repositories
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To post to this group, send email to geospark-dis...@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/d121246d-e0c5-43af-a55b-00a4663ed518%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--Alex To
PhD Candidate
School of Information Technologies
Knowledge Discovery and Management Research Group
Faculty of Engineering & IT
THE UNIVERSITY OF SYDNEY | NSW | 2006
Desk 4e69 | Building J12| 1 Cleveland Street
M. +61423330656
--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To post to this group, send email to geospark-dis...@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/CAA2QdypUzsLu1hVbds9Mf3nVmMq0UXkZk4R-%2BzqPYHi3y-TaTQ%40mail.gmail.com.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussion-board+unsub...@googlegroups.com.
To post to this group, send email to geospark-discussion-board@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/d121246d-e0c5-43af-a55b-00a4663ed518%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--Alex To
PhD Candidate
School of Information Technologies
Knowledge Discovery and Management Research Group
Faculty of Engineering & IT
THE UNIVERSITY OF SYDNEY | NSW | 2006
Desk 4e69 | Building J12| 1 Cleveland Street
M. +61423330656
--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussion-board+unsub...@googlegroups.com.
To post to this group, send email to geospark-discussion-board@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/CAA2QdypUzsLu1hVbds9Mf3nVmMq0UXkZk4R-%2BzqPYHi3y-TaTQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
PostGIS did not take 1 hour, PostGIS took several hours too.
I made several changes to my project and now I can't remember whether for this process I used KryoSerializer previously or the default Java serializer.
The last version I have now is configured like this (I didn't use kdbtree and quadtree before, not sure if I used KryoSerializer previously)
.config("spark.serializer", classOf[KryoSerializer].getName)
.config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
.config("geospark.join.gridtype", "kdbtree")
.config("geospark.global.indextype", "quadtree")
"SELECT df1.*, df2.* FROM df1
INNER JOIN df2
ON 1=1 WHERE ST_CONTAINS(df1.WKB_GEOMETRY, df2.WKB_GEOMETRY) = TRUE"
------------------------------------
Jia Yu,
Ph.D. Student in Computer Science
Reach me via Jia Yu's Homepage | GitHub Repositories
--
You received this message because you are subscribed to the Google Groups "GeoSpark Discussion Board" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussio...@googlegroups.com.
To post to this group, send email to geospark-dis...@googlegroups.com.
Visit this group at https://groups.google.com/group/geospark-discussion-board.
To view this discussion on the web visit https://groups.google.com/d/msgid/geospark-discussion-board/8e5993cb-1983-4150-a760-e7f1e2c68ecb%40googlegroups.com.
lga.createOrReplaceTempView("df1")linkRef.createOrReplaceTempView("df2")val mapping = spark.sql( """SELECT df2.LINK_ID, df2.DIR_TRAVEL, df1.LGA_CODE17, df1.LGA_NAME17, df1.STE_NAME17 FROM df1, df2 WHERE ST_CONTAINS( ST_Transform(ST_GeomFromWKT(df1.WKT), 'epsg:4283', 'epsg:4326'), ST_GeomFromWKT(df2.WKT))""")
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [LGA_CODE17#11, LGA_NAME17#12, STE_NAME17#14]
+- Filter isnotnull(WKT#10)
+- Relation[WKT#10,LGA_CODE17#11,LGA_NAME17#12,STE_CODE17#13,STE_NAME17#14,AREASQKM17#15] csv
and
Project [LINK_ID#1224, DIR_TRAVEL#1256]
+- Filter **org.apache.spark.sql.geosparksql.expressions.ST_Contains$**
+- Relation[WKT#1223,LINK_ID#1224,ST_NAME#1225,FEAT_ID#1226,ST_LANGCD#1227,NUM_STNMES#1228,ST_NM_PREF#1229,ST_TYP_BEF#1230,ST_NM_BASE#1231,ST_NM_SUFF#1232,ST_TYP_AFT#1233,ST_TYP_ATT#1234,ADDR_TYPE#1235,L_REFADDR#1236,L_NREFADDR#1237,L_ADDRSCH#1238,L_ADDRFORM#1239,R_REFADDR#1240,R_NREFADDR#1241,R_ADDRSCH#1242,R_ADDRFORM#1243,REF_IN_ID#1244,NREF_IN_ID#1245,N_SHAPEPNT#1246,... 75 more fields] csv
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
To unsubscribe from this group and stop receiving emails from it, send an email to geospark-discussion-board+unsub...@googlegroups.com.