First, about the Spark RDD: you are absolutely right that RDD.foreachPartition() is the right method to use, my bad. Because it returns void, there are no later Spark steps that could trigger a second execution. But does that mean that your Spark job did not finish successfully, despite the few transaction failures? I would expect Spark to reschedule the corresponding task until it succeeds. The only problem you can have then is that transactions are not properly closed (the reason for the exception you showed?), which is why I suggested catching the exception, rolling back the transaction and raising your own exception towards Spark.
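To make the suggestion concrete, here is a minimal sketch of the catch/rollback/rethrow pattern for the body you pass to foreachPartition. The Tx interface is a simplified stand-in for a JanusGraph transaction (only commit/rollback), not the real API, so the example is self-contained:

```java
class PartitionWriter {

    /** Simplified stand-in for a JanusGraph transaction (commit/rollback only). */
    interface Tx {
        void commit();
        void rollback();
    }

    /**
     * One transaction per partition: on any failure, roll back so no
     * transaction is left open, then rethrow so Spark marks the task as
     * failed and reschedules it.
     */
    static void writePartition(Iterable<String> rows, Tx tx) {
        try {
            for (String row : rows) {
                // write the row into the graph here
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();  // leave no dangling transaction behind
            throw new RuntimeException(
                "partition write failed, letting Spark retry the task", e);
        }
    }
}
```

The important part is that the exception is rethrown after the rollback: swallowing it would make Spark consider the task successful and the rows would silently be lost.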
Your other questions.
1) If you use Spark, I would expect that you have a singleton object per Spark executor that contains the JanusGraph connection and that you manage parallelism on the executor with the number of cores per executor. If you use more transactions per Spark task/core, you lose the option to roll back the transaction if needed and have Spark reschedule the task.
2) It is just something that people sometimes complain about. I guess this should be recognizable from the exceptions raised. Of course, it will not hurt to monitor CPU and RAM usage of your Elasticsearch instances. It will only happen if the Elasticsearch cluster is the weakest link in the chain, that is, if JanusGraph and HBase can process more transactions than Elasticsearch can handle.
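The singleton-per-executor setup from point 1) can be sketched as below. Because the class is loaded once per executor JVM, all tasks (cores) on that executor share the one connection. GraphConnection is a placeholder type, not part of the JanusGraph API:

```java
final class GraphHolder {

    /** Placeholder for the real JanusGraph instance; assumption, not real API. */
    static final class GraphConnection {
        final String url;
        GraphConnection(String url) { this.url = url; }
    }

    // double-checked locking so the connection is opened exactly once per JVM
    private static volatile GraphConnection instance;

    private GraphHolder() {}

    static GraphConnection get(String url) {
        GraphConnection local = instance;
        if (local == null) {
            synchronized (GraphHolder.class) {
                local = instance;
                if (local == null) {
                    local = new GraphConnection(url);  // opened once per executor
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Every task then calls GraphHolder.get(...) inside foreachPartition and receives the same connection, while parallelism stays bounded by the executor's core count.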
One last remark: it is not unusual for a few Spark tasks to fail; that just happens for all kinds of reasons in complex distributed setups. Your application must simply be able to handle these failures and reschedule the task.
Best wishes, Marc