I have a workflow that runs multiple PySpark jobs in parallel on a managed Dataproc cluster. The jobs use Delta Lake tables and therefore depend on the Delta Lake jar, which I've included in the cluster's properties. At startup, Dataproc launches three jobs in parallel; two of them complete successfully, but the third crashes with this error:
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: io.delta#delta-core_2.12;1.0.0: not found]
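For context, each job just builds a plain Delta-enabled session and relies on the cluster-level property to supply the package itself. Roughly like this (a trimmed sketch; the app name and table path are placeholders, not my real ones):

```python
from pyspark.sql import SparkSession

# Trimmed sketch of how each job starts. The Delta package is NOT pinned
# here; it is expected to come from the cluster-level property, which is
# what kicks off the ivy resolution shown in the log below.
spark = (
    SparkSession.builder
    .appName("delta-job")  # placeholder name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Typical access pattern once the session is up (placeholder path).
df = spark.read.format("delta").load("gs://my-bucket/tables/events")
```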
The other two jobs, however, resolve the dependency successfully; their startup logs show:
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-64f049ab-f494-4927-8219-06cf5214840a;1.0
confs: [default]
found io.delta#delta-core_2.12;1.0.0 in central
found org.antlr#antlr4;4.7 in central
found org.antlr#antlr4-runtime;4.7 in central
found org.antlr#antlr-runtime;3.5.2 in central
found org.antlr#ST4;4.0.8 in central
found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
found org.glassfish#javax.json;1.0.4 in central
found com.ibm.icu#icu4j;58.2 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.0/delta-core_2.12-1.0.0.jar ...
[SUCCESSFUL ] io.delta#delta-core_2.12;1.0.0!delta-core_2.12.jar (30ms)
Is this a kind of intermittent failure I should expect? Is there a way to make sure the jar is always available on the managed cluster? Or do I need to include the jars in the job steps themselves?
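To make that last question concrete, this is the kind of per-job pin I could add if job-level packages are the recommended route (again just a sketch; the coordinates match the version in the error above):

```python
from pyspark.sql import SparkSession

# Hypothetical per-job pin of the Delta package, instead of relying only on
# the cluster property. As I understand it, spark.jars.packages has to be in
# place before the SparkContext is created for the resolution to happen.
spark = (
    SparkSession.builder
    .appName("delta-job")  # placeholder name
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .getOrCreate()
)
```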
Thanks,
Dave