I tested stuff in this PR https://github.com/graphframes/graphframes/pull/477 and then I made this PR https://github.com/graphframes/graphframes/pull/478

On Sun, 12 Jan 2025 at 23:10, Ángel <angel.alva...@gmail.com> wrote:

Hi Russell,

I've just got the OOM error during Test 13. I'm running it from IntelliJ on Windows with Java 11. I'll look into it over the course of the next week.

Regards,
Ángel
Are you sure that temporarily disabling a global setting like AQE is the best approach to fix this issue?
I increased the number of shuffle partitions in the Spark session configuration in GraphFrameTestSparkContext.scala from 4 to 10, and the "checkpoint interval" test ran perfectly without throwing an OOM error. Why? No idea, but it worked.
My only change was increasing 4 to 10 in the spark.sql.shuffle.partitions setting (it still doesn't work for any value lower than 10).

I've performed some tests, changed some code, and I'm starting to grasp what's behind this OOM, but I'm still working on it. Quite an interesting issue, indeed (thanks!). According to MAT, a humongous execution plan is being created.
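For reference, the change amounts to something like this in the test session builder (a sketch; the actual builder in GraphFrameTestSparkContext.scala may set a different master, app name, and other options):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the test Spark session setup under discussion; the real builder in
// GraphFrameTestSparkContext.scala may differ in master, app name and other options.
val spark = SparkSession
  .builder()
  .master("local[2]")
  .appName("GraphFramesTest")
  // Raising this from 4 to 10 is the only change that made the
  // "checkpoint interval" test finish without an OOM.
  .config("spark.sql.shuffle.partitions", "10")
  .getOrCreate()
```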
Don't thank me; I'm really learning a lot about Spark and GraphFrames internals and having fun digging into this issue. I'm planning to write an article about it next weekend.
I'm getting closer and closer...
By the way, I've also noticed a couple of issues with Spark when working with large plans. I find it weird that nobody has noticed them before. I want to double-check them, and also test them on Spark 4, but if I'm correct I'll open a ticket on the Spark project and submit a PR this week. As mentioned, I could be wrong, but with the improvements the test went from taking 3–4 minutes to only 1.5 minutes.
Hi Team,
I believe I’ve finally managed to fix the OOM issue, though I’m still analyzing some details and conducting tests to confirm my findings. The issue seems to be related to how Spark updates and propagates the plan across the internal bus (AdaptiveSparkPlanExec.onUpdatePlan) but, above all, to how Spark generates the physical plan when DataFrames are cached iteratively—such as in GraphFrames. I’ve fixed the OOM issue by adding some minor improvements to the Spark code, but I’d also like to explore whether a solution might be achievable by modifying only the GraphFrames code, in case my improvements to Spark are not accepted.
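To make the caching pattern I'm referring to concrete, here is a simplified sketch (my own illustration, not the actual GraphFrames code) of a loop that caches a new DataFrame on every iteration; without an occasional checkpoint to cut the lineage, the plan keeps growing, and under AQE every re-planned query is also pushed through AdaptiveSparkPlanExec.onUpdatePlan:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Simplified illustration only, not the GraphFrames algorithm. Each iteration derives
// a new cached DataFrame from the previous one, so the logical plan grows every round.
// checkpoint() (which needs spark.sparkContext.setCheckpointDir to be configured)
// materializes the data and truncates the lineage, keeping the plan from exploding.
def iterate(input: DataFrame, iterations: Int, checkpointInterval: Int): DataFrame = {
  var df = input.cache()
  for (i <- 1 to iterations) {
    // Stand-in for the real per-iteration step (joins and aggregations in connected components).
    df = df.withColumn("iteration", lit(i)).cache()
    if (checkpointInterval > 0 && i % checkpointInterval == 0) {
      df = df.checkpoint() // cut the lineage every few iterations
    }
  }
  df
}
```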
Over the next few weeks, I’m planning to write three articles summarizing everything I’ve discovered. I’ll share the links once they’re published, and I’d greatly appreciate any feedback you can provide.
@Bjørn Jørgensen, I was genuinely surprised by the links you shared last week. Shouldn't Connected Components be thoroughly tested in the GraphFrames unit tests? Regardless, while that other issue also appears to be fixed when AQE is disabled, I don't believe there is a connection. It seems the way GraphFrames assigned ids only functioned correctly when data partitioning was static, that is, in older Spark versions where AQE didn't exist or was disabled by default. Do you happen to have a simple dataset we could use for further testing?
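To illustrate the general hazard I mean (a self-contained example, not GraphFrames' actual id-assignment code): ids generated with monotonically_increasing_id encode the partition index, so anything that changes the partition layout between runs, including AQE coalescing shuffle partitions, changes the generated values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

// General illustration only; GraphFrames' actual id assignment may work differently.
val spark = SparkSession.builder().master("local[4]").appName("id-demo").getOrCreate()
import spark.implicits._

val data = (1 to 1000).toDF("v")

// The generated id embeds the partition index in its upper bits, so the same row
// can receive a different id when the partition layout changes (for example when
// AQE coalesces shuffle partitions).
data.repartition(4).withColumn("id", monotonically_increasing_id()).show(5)
data.repartition(2).withColumn("id", monotonically_increasing_id()).show(5)
```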
Regards,
Ángel
Russell Jurney | rju...@graphlet.ai | graphlet.ai | Graphlet AI Blog | LinkedIn | BlueSky