GraphFrames' ConnectedComponentsSuite test 'two components and two dangling vertices' fails with OutOfMemoryError: Java heap space


Russell Jurney

Jan 11, 2025, 3:22:53 AM
to graph...@googlegroups.com, user
Friends of GraphFrames (github.com/graphframes/graphframes), I have a question for you...

I can't get the unit test 'two components and two dangling vertices' in the org.graphframes.lib.ConnectedComponentsSuite to pass. It fails with an 'OutOfMemoryError: Java heap space' error. I am a little stuck on completing a docs release with a motif finding tutorial due to this issue.


Can someone else please try this and see if it passes on the master branch?

> build/sbt clean compile package test

I've tried giving it lots of RAM just to see if it would help, as much as 32g for the driver and 16g for the executors, and it has no effect. The test graph has 8 nodes and 6 edges, so it shouldn't have a memory problem. Yet when it runs, all 24 cores of my CPU get used, and usage spikes as indicated in the image in the gist.
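
For a sense of how small this workload is, here is a minimal pure-Python sketch (no Spark involved) that computes connected components with union-find on a graph of the shape described above. The exact vertex ids and edges are an assumption for illustration (two triangles plus two isolated vertices); the suite's actual fixture may differ.

```python
def connected_components(vertices, edges):
    """Return the components of an undirected graph as a set of frozensets."""
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Attach the larger root under the smaller one, so each
            # component's root ends up being its minimum vertex id.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return {frozenset(g) for g in groups.values()}


# Hypothetical fixture: 8 vertices, 6 edges, two components, two dangling vertices.
vertices = range(8)
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]

components = connected_components(vertices, edges)
print(sorted(sorted(c) for c in components))
# prints [[0, 1, 2], [3, 4, 5], [6], [7]]
```

A graph this size resolves in microseconds on a single core, which is why heap exhaustion here points at plan or execution overhead rather than at the data.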

I am running the following setup:

* Ubuntu 20.04 (22.04 in the Docker image)
* OpenJDK 11 (I also tried 8, same problem)
* Scala 2.12.20 (I also tried 2.13, same problem)
* Python 3.11 (I also tried 3.9, same problem)

Alternatively, I run the Dockerfile in the gist.

Any help much appreciated! Thanks

-----------------------------------------------------------------
Oh, some new community stuff for GraphFrames. Hackathon announced next week :)

Thanks!
Russell Jurney
@rjurney russell...@gmail.com LI FB datasyndrome.com

Russell Jurney

Jan 13, 2025, 10:45:40 AM
to Bjørn Jørgensen, Ángel, graph...@googlegroups.com, user
Merged, thanks guys!

Russ

On Sun, Jan 12, 2025 at 2:23 PM Bjørn Jørgensen <bjornjo...@gmail.com> wrote:

On Sun, Jan 12, 2025 at 11:10 PM Ángel <angel.alva...@gmail.com> wrote:
Hi Russell,

I've just got the OOM error during Test 13. I'm running it from IntelliJ on Windows with Java 11.

image.png
I'll look into it over the course of the next week.
  
Regards,
Ángel


--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Russell Jurney

Jan 14, 2025, 3:52:53 PM
to Ángel, Bjørn Jørgensen, graph...@googlegroups.com, user
Can you please share the code? It doesn't seem an ideal solution, but if AQE is confused, disabling it makes sense. I can't figure out why an 8-node, 6-edge network would fail with a low partition count and require a lot of partitions... users may have different partition counts... do you suggest we enforce some minimum partition count?

On Tue, Jan 14, 2025 at 7:33 AM Ángel <angel.alva...@gmail.com> wrote:

Are you sure that temporarily disabling a global setting like AQE is the best approach to fix this issue?
I increased the number of shuffle partitions in the Spark session configuration in GraphFrameTestSparkContext.scala from 4 to 10, and the "checkpoint interval" test ran perfectly without throwing an OOM error. Why? No idea, but it worked.
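
The two workarounds discussed in this thread come down to two Spark SQL settings, `spark.sql.shuffle.partitions` and `spark.sql.adaptive.enabled`. The sketch below is PySpark rather than the Scala of GraphFrameTestSparkContext.scala, and the helper names, master URL, and app name are hypothetical. It only shows how such settings could be applied to a local test session, not what the suite actually does.

```python
def workaround_conf(shuffle_partitions=10, disable_aqe=False):
    """Build the Spark SQL settings for the workarounds discussed in this thread."""
    conf = {"spark.sql.shuffle.partitions": str(shuffle_partitions)}
    if disable_aqe:
        # Turns Adaptive Query Execution off for the whole session.
        conf["spark.sql.adaptive.enabled"] = "false"
    return conf


def build_test_session(conf, master="local[4]", app_name="cc-oom-repro"):
    """Create a local SparkSession with the given settings (requires pyspark)."""
    # Imported lazily so workaround_conf() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(workaround_conf())
print(workaround_conf(shuffle_partitions=4, disable_aqe=True))
```

Calling `build_test_session(workaround_conf())` would mirror the 4-to-10 change, while `disable_aqe=True` mirrors the AQE approach. Both settings are session-wide, which is the concern raised about disabling AQE.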



Russell Jurney

Jan 15, 2025, 7:15:21 PM
to Ángel, Bjørn Jørgensen, graph...@googlegroups.com, user
Thank you SO much for looking into this! GraphFrames lives!

As an update, here's the most popular logo from a poll I took:

image.png

Although this one in second place fits more with the Apache logo directory:

image.png

Which one makes you want to work on this issue more? :D 

Russ

On Wed, Jan 15, 2025 at 5:04 AM Ángel <angel.alva...@gmail.com> wrote:
My only change was increasing 4 to 10 in the spark.sql.shuffle.partitions setting (still doesn't work for a value lower than 10).

image.png

I've performed some tests and changed some code, and I'm starting to grasp what's behind this OOM, but I'm still working on it. Quite an interesting issue indeed! (Thanks.)

According to MAT (Eclipse Memory Analyzer), a humongous execution plan is being created ...
image.png

Bjørn Jørgensen

Jan 16, 2025, 3:00:53 AM
to Ángel, Russell Jurney, graph...@googlegroups.com, user

On Thu, Jan 16, 2025 at 8:21 AM Ángel <angel.alva...@gmail.com> wrote:

Don't thank me; I'm really learning a lot about Spark and GraphFrames internals and having fun digging into this issue. I'm planning to write an article about it next weekend.

I'm getting closer and closer...

By the way, I've also noticed a couple of issues with Spark when working with large plans. I find it odd that nobody has noticed this before. I want to double-check them (and also test them in Spark 4), but if I'm correct, I'll open a ticket on the Spark project and submit a PR this week. As just mentioned, I could be wrong, but with the improvements, the test went from taking 3-4 minutes to only 1.5 minutes.


Ángel

Jan 21, 2025, 9:15:23 PM
to Bjørn Jørgensen, Russell Jurney, graph...@googlegroups.com, user

Hi Team,

I believe I’ve finally managed to fix the OOM issue, though I’m still analyzing some details and conducting tests to confirm my findings. The issue seems to be related to how Spark updates and propagates the plan across the internal bus (AdaptiveSparkPlanExec.onUpdatePlan) but, above all, to how Spark generates the physical plan when DataFrames are cached iteratively—such as in GraphFrames. I’ve fixed the OOM issue by adding some minor improvements to the Spark code, but I’d also like to explore whether a solution might be achievable by modifying only the GraphFrames code, in case my improvements to Spark are not accepted. 
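
To illustrate how iterative algorithms can end up with very large plans, here is a toy model in plain Python. It is not Spark code and it is not the diagnosis described above. It makes a strong simplifying assumption: each iteration builds a new plan node that references the previous iteration's plan twice (for example, a join whose two inputs both derive from the previous result), and nothing truncates the lineage unless a checkpoint is taken. The `Plan` class and the `iterate`, `checkpoint`, and `run` helpers are hypothetical stand-ins, not Spark APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """Hypothetical stand-in for a query plan node; not a Spark class."""
    name: str
    children: List["Plan"] = field(default_factory=list)

    def size(self) -> int:
        # Count nodes the way a naive tree walk sees them: a shared subplan
        # is counted once for every place that references it.
        return 1 + sum(child.size() for child in self.children)


def iterate(prev: Plan, step: int) -> Plan:
    # Assumed shape of one iteration: a join whose inputs both derive from prev.
    return Plan(f"join_{step}", [prev, Plan(f"agg_{step}", [prev])])


def checkpoint(step: int) -> Plan:
    # Models lineage truncation: the materialized result becomes a leaf scan.
    return Plan(f"checkpoint_{step}")


def run(iterations: int, checkpoint_interval: int = 0) -> int:
    """Return the largest plan size observed across all iterations."""
    plan = Plan("scan")
    largest = plan.size()
    for step in range(1, iterations + 1):
        plan = iterate(plan, step)
        largest = max(largest, plan.size())
        if checkpoint_interval and step % checkpoint_interval == 0:
            plan = checkpoint(step)
    return largest


print(run(10))                         # prints 3070
print(run(10, checkpoint_interval=2))  # prints 10
```

Under this assumption the plan roughly doubles per iteration, so its size depends on the iteration count rather than on the number of vertices or edges, which would be consistent with a tiny graph still exhausting the heap.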

Over the next few weeks, I’m planning to write three articles summarizing everything I’ve discovered. I’ll share the links once they’re published, and I’d greatly appreciate any feedback you can provide. 

@Bjørn Jørgensen, I was genuinely surprised by the links you shared last week. Shouldn’t Connected Components be thoroughly tested in the GraphFrames unit tests? Regardless, while this other issue appears to be also fixed when AQE is disabled, I believe there isn't a connection. It seems like the way GraphFrames assigned ids only functioned correctly when data partitioning was static—that is, in older Spark versions where AQE didn’t exist or was disabled by default. Do you happen to have a simple dataset we could use for further testing?
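
As an illustration of how ids can depend on partitioning, the sketch below models a partition-based id scheme in plain Python. Whether GraphFrames derives its ids this way is an assumption here; the bit layout (partition index in the upper bits, row number within the partition in the lower 33 bits) is the one documented for Spark's `monotonically_increasing_id`, used only as an example of such a scheme. The helper names and the sample rows are hypothetical.

```python
def partition_based_id(partition_index, row_in_partition):
    # Partition index in the upper bits, row number in the lower 33 bits.
    return (partition_index << 33) + row_in_partition


def assign_ids(partitions):
    """Map each row key to an id, given rows grouped into partitions."""
    return {
        key: partition_based_id(p, i)
        for p, rows in enumerate(partitions)
        for i, key in enumerate(rows)
    }


rows = ["a", "b", "c", "d"]
static_layout = [["a", "b"], ["c", "d"]]   # two partitions
coalesced_layout = [["a", "b", "c", "d"]]  # same rows after partitions are merged

ids_first = assign_ids(static_layout)
ids_second = assign_ids(coalesced_layout)

changed = sorted(k for k in rows if ids_first[k] != ids_second[k])
print(changed)
# prints ['c', 'd']
```

If the same DataFrame is evaluated twice and the partition layout differs between evaluations, rows can receive different ids each time, so any step that assumes stable ids would silently break.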

Regards,
Ángel

Russell Jurney

Jan 23, 2025, 11:31:04 PM
to Ángel, Bjørn Jørgensen, graph...@googlegroups.com, user
I have the stats.meta.stackexchange.com graph I'm hoping to wrangle into the tests, which is around 100K edges. I should have the tutorial and associated tools ready in a week or two. I could send you the node and edge Parquet files if that would be helpful?

Ángel

Jan 24, 2025, 2:34:52 AM
to Russell Jurney, Bjørn Jørgensen, graph...@googlegroups.com, user
Super helpful. Thanks, Russell.
 
I'm planning to dedicate the entire weekend to this issue, and I hope to create some Spark and/or GraphFrames PRs as well.

Russell Jurney

Jan 24, 2025, 7:26:31 PM
to Ángel, Bjørn Jørgensen, graph...@googlegroups.com, user
Okay, I just sent you some instructions privately and will update the list tomorrow.

Ángel

Feb 1, 2025, 10:54:10 AM
to Bjørn Jørgensen, Russell Jurney, GraphFrames, user

Russell Jurney

Feb 1, 2025, 12:20:13 PM
to Ángel, Bjørn Jørgensen, Russell Jurney, GraphFrames, user
Very cool, thanks for sharing!

Thanks,



Ángel

Feb 19, 2025, 1:17:25 AM
to Russell Jurney, Bjørn Jørgensen, Russell Jurney, GraphFrames, user
The second part of the mini-series ... link