I'm wondering if Spark is suitable for distributing work that's primarily meant to be side-effecting. My simplified use-case would be to spin up, say, N instances and have them all load data into a database (in my case, DynamoDB). In pseudo-code:
val input = sc.textFile(...) ++ ...
val acc = sc.accumulator(...)
input foreach { item =>
  acc += insertIntoDB(item)
}
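By insertIntoDB I mean something roughly like the following, a sketch against the AWS Java SDK for DynamoDB (the "my-table" table name, the "id" attribute, and returning 1 per insert are placeholders I made up):

object DynamoWriter {
  import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
  import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
  import scala.collection.JavaConverters._

  // lazy so the (non-serializable) client is created in each worker JVM, not on the driver
  lazy val client = new AmazonDynamoDBClient()

  def insertIntoDB(item: String): Int = {
    val attrs = Map("id" -> new AttributeValue(item)).asJava
    client.putItem(new PutItemRequest("my-table", attrs))
    1 // contributes one unit to the accumulator per successful insert
  }
}

In the foreach above I'd either import DynamoWriter._ or call DynamoWriter.insertIntoDB directly, so only the object reference (not a client instance) is captured by the closure.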
(BTW, are there significant differences between using foreach() with accumulators and using map()/flatMap() to collect stats and potential errors from side-effecting operations? I'm wondering if there's more to it than coding style... though I guess map()/flatMap() are assumed to be pure/non-side-effecting in case the RDD needs to be recomputed.)
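Concretely, the two variants I'm comparing look roughly like this (just a sketch; insertIntoDB is the placeholder defined above):

// Variant A: foreach plus an accumulator to count failures
val failures = sc.accumulator(0)
input foreach { item =>
  try insertIntoDB(item)
  catch { case e: Exception => failures += 1 }
}

// Variant B: the side effect inside flatMap, with errors collected back to the driver;
// since flatMap is a transformation, the inserts could run again if the RDD is recomputed
val errors = input.flatMap { item =>
  try { insertIntoDB(item); None }
  catch { case e: Exception => Some(item -> e.getMessage) }
}.collect()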
What I mean by "suitable" is whether it's possible to easily control the level of parallelism on each worker node; e.g., depending on the instance type (m1.small), I'd like to use, say, 20 threads per worker. I've read the documentation about
setting the level of parallelism, but it's unclear whether I can control it for `foreach`, which I assume runs within the map phase. Does Spark pick the parallelism for `foreach` based on the data size, and how can I override it?
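For reference, here's roughly what I was imagining, assuming that repartition() to fix the number of tasks plus a fixed-size thread pool inside foreachPartition() is a reasonable way to get ~20 concurrent inserts per node (the partition count, the pool size, and insertIntoDB are all placeholders):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// fix the number of concurrent tasks, e.g. roughly one per core across the cluster
input.repartition(16).foreachPartition { items =>
  // up to 20 concurrent inserts within this task; for large partitions I'd probably
  // batch with items.grouped(...) instead of materializing all the futures at once
  val pool = Executors.newFixedThreadPool(20)
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  val inserts = items.map(item => Future(insertIntoDB(item))).toList
  Await.result(Future.sequence(inserts), Duration.Inf)
  pool.shutdown()
}

Or is the right knob something else entirely, e.g. the number of partitions passed to textFile or spark.default.parallelism?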
thanks!
alex