After more experimenting with this, I realized that union was
unnecessary, because all the conflicting entries would get resolved by
the groupBy/mapValues/maxBy operations later, and so a simple concat is
better (and probably more efficient).
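A minimal sketch of that reasoning, assuming entries are hypothetical (key, version) pairs and that "conflict" means two entries sharing a key. The names `Entry` and `merge` are illustrative only and not taken from the real code; `groupBy` followed by `map` stands in for the groupBy/mapValues/maxBy chain so that the sketch compiles on both older and newer Scala versions.

```scala
// Hypothetical entry type: (key, version). Illustrative names only.
object ConcatThenResolve {
  type Entry = (String, Int)

  // Concatenate without any deduplication, then keep the
  // highest-version entry for each key.
  def merge(a: Seq[Entry], b: Seq[Entry]): Map[String, Entry] =
    (a ++ b)
      .groupBy(_._1)                                   // key -> all entries with that key
      .map { case (key, entries) => key -> entries.maxBy(_._2) }

  def main(args: Array[String]): Unit = {
    val a = Seq(("x", 1), ("y", 5))
    val b = Seq(("x", 3), ("z", 2))
    println(merge(a, b))
  }
}
```

Because the grouping step already collapses duplicate keys, any work spent removing them before the concatenation is redundant.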
So a better question would be: is there any way of improving upon the
default concatenation operation?
I did notice in my own benchmarks that, when a.length < b.length,
(a ++ b) was more efficient than (b ++ a), so that's how I've
implemented it for now.
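The ordering rule can be sketched as a small helper. The name `ShorterFirst.concat` is hypothetical, and `Seq` is an assumption about the collection type. Note that this changes the order of the elements in the result, so it is only safe when the later steps do not depend on that order (ties inside maxBy could, for instance).

```scala
object ShorterFirst {
  // Put the shorter sequence on the left-hand side of ++,
  // following the benchmark observation described above.
  def concat[A](a: Seq[A], b: Seq[A]): Seq[A] =
    if (a.length <= b.length) a ++ b else b ++ a
}
```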
I also just completed the parallel programming
course on Coursera, so I was tinkering with
my own array concatenation, based on the example here,
but I did not detect any major improvement in performance.
Finally, I was wondering about the effect of using so many conversions
to/from Java, in the form of the .asJava/.asScala methods.
Are these creating unnecessary performance overhead that I can avoid?
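One way to probe this is to check whether a conversion copies anything. The sketch below assumes Scala 2.13 or later (`scala.jdk.CollectionConverters`; on 2.12 the equivalent import would be `scala.collection.JavaConverters._`) and uses a mutable `ArrayBuffer` purely as an example. The object and method names are hypothetical. As far as the standard library documents it, these converters produce wrappers that delegate to the underlying collection, so the cost should be one small allocation per call rather than a copy, but that is worth confirming against the actual collection types in use.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.jdk.CollectionConverters._

object ConverterOverhead {
  // True if converting to Java and back yields the very same object,
  // which would mean no copy was made along the way.
  def roundTripIsSameInstance(buf: ArrayBuffer[Int]): Boolean =
    buf.asJava.asScala eq buf

  // True if a write through the Java view shows up in the Scala buffer,
  // i.e. the view shares storage with the original.
  def writesThroughView(buf: ArrayBuffer[Int]): Boolean = {
    val before = buf.length
    buf.asJava.add(42)
    buf.length == before + 1
  }

  def main(args: Array[String]): Unit = {
    val buf = ArrayBuffer(1, 2, 3)
    println(roundTripIsSameInstance(buf))
    println(writesThroughView(buf))
  }
}
```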