Greetings. I wanted to share with you our open source project we called 'akka-mapreduce' for the lack of a better name. We are developing a small Map/Reduce framework using Akka in Scala.
https://github.com/projetoeureka/akka-mapreduceThe idea is to run map-reduce jobs keeping the aggregated output in memory all the way to the end. It is a more lightweight alternative to Spark, Storm or Hadoop Streaming directed to ad-hoc data processing problems.
When your data is too big you have to read it iteratively, and that makes the problem unsuitable for things like Scala parallel collections. Akka-mapreduce attempts to be a better fit for those situations when, while you can't load the input to the memory all at once, it is small enough to be processed by a monolithic application in a single multi-core machine.
We have developed a reducer component with a "decimator" actor that allows us to know that all the data has been processed. We are also creating a builder class that lets you define a processing pipeline like this
val pipeline = pipe_mapkv {
row: String => row split raw"\s+"
} times 4 map {
word => Some(KeyVal(word.trim.toLowerCase, 1))
} times 4 reduce (_ + _) times 8 output self
We started the project for our own purposes, but we thought it might be of interest to someone else, and decided to publish it. It is still a very young project, so there are many rough edges and things are changing very fast, but we would love to hear some feedback from the community.
There appears to be a lot of Akka map-reduce examples out there on the Internet, but few of them are in Scala, and even fewer offer a really good solution to properly reading the data and to figure out that processing has ended. We are not sure the framework will be adopted by anyone, but it might have a couple of ideas that can be borrowed by your individual projects.
Cheers,
++nic