Workflow sometimes crashing due to failure to download jars

David Gallagher

Jan 26, 2022, 9:27:51 AM
to Google Cloud Dataproc Discussions
I have a workflow that runs multiple PySpark jobs in parallel on a managed cluster. The jobs use Delta Lake tables, and so depend on the Delta Lake jar, which I've included in the properties for the cluster. At startup, Dataproc starts three jobs in parallel. Two of them complete successfully, but the third crashes; the error is:

Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: io.delta#delta-core_2.12;1.0.0: not found]

But the other two jobs resolve the dependency successfully:

The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-64f049ab-f494-4927-8219-06cf5214840a;1.0
        confs: [default]
        found io.delta#delta-core_2.12;1.0.0 in central
        found org.antlr#antlr4;4.7 in central
        found org.antlr#antlr4-runtime;4.7 in central
        found org.antlr#antlr-runtime;3.5.2 in central
        found org.antlr#ST4;4.0.8 in central
        found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
        found org.glassfish#javax.json;1.0.4 in central
        found com.ibm.icu#icu4j;58.2 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.0/delta-core_2.12-1.0.0.jar ...
        [SUCCESSFUL ] io.delta#delta-core_2.12;1.0.0!delta-core_2.12.jar (30ms)
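For reference, the cluster property is set roughly like this; the template name, cluster name, and region below are placeholders, but the property is the same spark.jars.packages one:

    gcloud dataproc workflow-templates set-managed-cluster my-template \
        --cluster-name=my-managed-cluster \
        --region=us-central1 \
        --properties='spark:spark.jars.packages=io.delta:delta-core_2.12:1.0.0'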

Is this the kind of intermittent failure I should expect? Is there a way to make sure the jar is available on the managed cluster? Or do I need to include the jars in the job steps?
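If the job-step route is the right one, I assume it would look something like this, with the same package pinned on each job (the template name, step id, and bucket path are placeholders):

    gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/my_job.py \
        --workflow-template=my-template \
        --region=us-central1 \
        --step-id=job-3 \
        --properties='spark.jars.packages=io.delta:delta-core_2.12:1.0.0'

Or, alternatively, pre-stage the jar in GCS and pass it with --jars, so nothing has to be fetched from Maven Central when the job starts.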

Thanks,

Dave