An efficient union + merge algorithm

57 views
Skip to first unread message

Denis Papathanasiou

unread,
Oct 8, 2016, 2:00:52 PM10/8/16
to scala-user
I'm using jackson to marshal two sets of strings into complex case
classes, based on an underlying java class definition (this application
combines both java and scala classes in a single jar).

I wish to merge the results, using a simple algorithm: union the
resulting case classes, group by a unique id present in each record,
and use a timestamp to resolve possible conflicts in the unique id.

Since the merged result needs to be passed back as a java object, I also
need to apply some conversions, and create a new instance of the main
case class before the function returns.

The whole thing looks like this:

val mergedResult = a.union(b)
  .groupBy(_.uniqueID)
  .mapValues(_.maxBy(_.timestamp))
  .values
  .toList
  .asJava

val result = new ComplexCaseClass
result.setRecords(mergedResult.asInstanceOf[java.util.List[ComplexSubClass]])
result

While it is functionally correct, I wonder about its efficiency.

Are all these various casts and conversions adding unnecessary overhead?

In general, are there ways I can make it more efficient?

Denis Papathanasiou

unread,
Oct 11, 2016, 7:33:55 PM10/11/16
to scala-user
After more experimenting with this, I realized that union was
unnecessary, because all the conflicting entries would get resolved by
the groupBy/mapValues/maxBy operations later, and so a simple concat is
better (and probably more efficient).

So a better question would be: is there any way of improving upon the
default concatenation operation?

I did notice in my own benchmarks that when a.length < b.length (a ++
b) was more efficient than (b ++ a) so that's how I've implemented it
for now.

I also just completed the parallel programming
course on coursera, so I was tinkering with
my own array concatenation, based on the example here --
-- but I did not detect any major improvement in performance.

Finally, I was wondering about the effect of using so many conversions
to/from java, in the form of the .asJava/.asScala methods.

Are these creating unnecessary performance overhead that I can avoid?
Reply all
Reply to author
Forward
0 new messages