Scala and batch processing


Kostas kougios

Apr 19, 2015, 3:54:48 PM
to scala...@googlegroups.com
We will be building a lot of batch jobs, and we've been asked to find the best framework for batch processing. The things proposed so far are spring-batch & camel.

Our requirements will vary; a typical example is a batch job that incrementally ftps some files, unzips them, extracts info from the xml files within those zips, and feeds s3, elastic search, hadoop etc.

So spring-batch and camel are well-proven products from well-known organizations, but I get the feeling that if we choose them, we will get stuck with boilerplate code and abstractions that get in the way of implementing our requirements.

In scala we can have a function URL => File that downloads an ftp file, then a File => File that unzips the file, and so on, and compose them to do the job. It sounds like a very natural, boilerplate-free way of doing things, and the code can also be reused outside batch processing. The same of course can be done with jdk8 functions, but the code I've seen for spring-batch is too jdk6-ish and full of boilerplate.
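
Something like this (step bodies elided, names made up):

import java.io.File
import java.net.URL

// hypothetical step functions; the actual download/unzip bodies are elided
val download: URL => File = url => ???  // fetch the file over ftp
val unzip: File => File   = zip => ???  // extract it, return the output dir

// compose the steps into a single URL => File pipeline
val pipeline: URL => File = download andThen unzip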

In addition I've been examining akka-streams, which composes different functions very nicely into stream pipelines and seems a very good fit for what we need. But it seems to lack some nice features of spring-batch, like resuming failed steps (i.e. if the ftp succeeded but the unzip failed, we should be able to resume just the unzip) and a web interface. Resuming failed steps is very easy to implement, provided you can serialize the input of a function, but it is still not supported.
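
A rough sketch of that idea (the checkpointed helper is made up), assuming the step input is serializable:

import java.io.{FileOutputStream, ObjectOutputStream}

// persist a step's input before running it, so if the step fails
// only that step needs to be rerun from the saved input
def checkpointed[A <: Serializable, B](name: String)(step: A => B): A => B = { input =>
  val out = new ObjectOutputStream(new FileOutputStream(name + ".ckpt"))
  try out.writeObject(input) finally out.close()
  step(input) // on failure, deserialize name + ".ckpt" and call the step again
}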

On top of that, there doesn't seem to be a scala batch framework with features similar to spring-batch.

Has anyone used scala for batch processing, and how did it go?

Vincent Marquez

Apr 21, 2015, 1:38:47 AM
to Kostas kougios, scala-user
Hi Kostas, 

We are using Scala for this type of work and it's been going great.  Previously we had a few C# applications that did this fairly well, and we're finding Scala is doing an even better job.  Because of the functional nature of Scala and the ease of doing concurrency with Futures, we rolled a lot of our own functionality for retries and logging with fairly minimal effort, and it's worked out fine.
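
For example, a retry combinator along these lines (names made up) is only a few lines:

import scala.concurrent.{ExecutionContext, Future}

// re-run a Future-producing task up to `retries` more times if it fails
def retry[A](retries: Int)(task: () => Future[A])(implicit ec: ExecutionContext): Future[A] =
  task().recoverWith { case _ if retries > 0 => retry(retries - 1)(task) }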

--Vincent


Koert Kuipers

Apr 21, 2015, 1:56:50 AM
to Kostas kougios, scala-user
we use
* drake for the workflow management, concurrency and ability to resume where we left off
* pieces of shell script called within drake for the glue and simple commands (like ftp retrieval or a file copy to hdfs)
* scala projects for the heavy lifting (hbase, hadoop, elasticsearch, kafka, etc.), called from within drake
* all of it deployed and scheduled with cron using chef




Konstantinos Kougios

Apr 21, 2015, 3:38:32 PM
to Vincent Marquez, scala-user
Yes, it doesn't require much work to replicate what batch frameworks do. Retry is very simple for f(x)=y, since we can serialize x and then rerun the function as many times as we want. In addition to multithreading with Futures, distribution of tasks can easily be done with akka or akka-streams (or both), and scheduling with akka-quartz. Composing jobs from functions is easy and can be done with chained calls, as in akka-streams, or:

job("my job)(function1) // start a job by executing a function
.andThen(function2) // take the output of function1 and feed function2
.distribute(function3,10 minutes) // run function3 with input from function2 distributed in multiple servers and timeout after 10 minutes
.scatter(function4,function5) // take the output of function3 and pass it both to 4 & 5 (like tee)

e.g.
job(...)(new FileDownloader)
.andThen(new FileUnzipper)
.andThen(new CsvParser)
.scatter(new HadoopImport,new DatabaseImport)

In fact I am thinking of making an open-source batch framework like the above, or one built on akka-streams. It is something that (I think) is missing for scala.
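
For what it's worth, here is a minimal synchronous sketch of such a DSL (all names made up; a real version would run the steps on akka or akka-streams):

// a job is just a named function, built up by composing plain functions
case class Job[A, B](name: String, run: A => B) {
  def andThen[C](next: B => C): Job[A, C] = Job(name, run andThen next)
  def scatter[C, D](left: B => C, right: B => D): Job[A, (C, D)] =
    Job(name, a => { val b = run(a); (left(b), right(b)) })
}
def job[A, B](name: String)(first: A => B): Job[A, B] = Job(name, first)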

Kostas

Konstantinos Kougios

Apr 21, 2015, 3:41:47 PM
to Koert Kuipers, scala-user
Interesting approach; it seems to take advantage of OS commands too.

Vincent Marquez

Apr 21, 2015, 5:42:24 PM
to Konstantinos Kougios, scala-user
You may want to look at Task from Scalaz, which has functionality similar to Future but is a bit more flexible in some ways.

http://timperrett.com/2014/07/20/scalaz-task-the-missing-documentation/ is a great writeup on it.  I still use Future often because of its simplicity, but Task can be useful for the kind of batch processing you are talking about.
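
For instance (a sketch, assuming the scalaz 7.1 Task API):

import scala.concurrent.duration._
import scalaz.concurrent.Task

// nothing runs until .run is invoked, so the value can be composed and retried freely
val step: Task[String] = Task.delay(scala.io.Source.fromFile("in.xml").mkString)
val withRetries = step.retry(Seq(1.second, 5.seconds)) // re-run on failure, with delays
val result: String = withRetries.run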

I think an open source batch library would be neat, but I will say I'm much more inclined to pull in libraries and DSLs than frameworks.  I find the former are usually less opinionated and much more composable, while frameworks are more rigid.

Just my two cents.  

--Vincent

Konstantinos Kougios

Apr 21, 2015, 5:57:11 PM
to Vincent Marquez, scala-user


On 21/04/15 22:42, Vincent Marquez wrote:
You may want to look at Task from Scalaz which has similar functionality to future but a bit more flexible in some ways.

http://timperrett.com/2014/07/20/scalaz-task-the-missing-documentation/ is a great writeup on it.  I still use future often because of the simplicity, but Task can be useful for the kind of batch processing you are talking about. 
I was thinking of using akka because the scheduled tasks can run over a cluster of servers. This way the batch jobs won't be limited to one box.


I think an open source batch library would be neat, but I will say i'm much more inclined to pull in libraries and DSLs than I am frameworks.  I find usually the former is less opinionated and much more composable, while frameworks are more rigid.
Yes, that makes sense. The idea is to optionally wire a scheduler (akka-quartz?) to a lightweight job-declaration DSL (job().andThen...distribute...scatter...gather...) on top of akka that can stream bits of batch jobs so that all servers are utilized simultaneously. Then of course there are dozens of libs for http/ftp/hadoop/elastic/database/csv/xml manipulation, so no need to do anything there. All parts of a job (tasks) will be plain functions so that they can be reused in any other context. Finally, maybe add an optional web interface (like spring-batch's) to manage jobs, and also an sbt-based console to manually run jobs with custom arguments.
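
The scheduler wiring could look something like this (a sketch, assuming the akka-quartz-scheduler extension with a "Nightly" cron expression configured in application.conf):

import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.akka.extension.quartz.QuartzSchedulerExtension

case object RunJob
class JobRunner extends Actor {
  def receive = { case RunJob => /* kick off job("nightly")(...) here */ }
}

val system = ActorSystem("batch")
val runner = system.actorOf(Props[JobRunner], "runner")
// "Nightly" must be defined under akka.quartz.schedules in application.conf
QuartzSchedulerExtension(system).schedule("Nightly", runner, RunJob)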

All in all, a set of libs + a UI for scalable batch execution.

Cheers

Oliver Ruebenacker

Apr 21, 2015, 6:01:20 PM
to Konstantinos Kougios, Vincent Marquez, scala-user

     Hello,

  If you want batch jobs in Scala on a cluster, it sounds like you want Apache Spark.

     Best, Oliver
--
Oliver Ruebenacker
Solutions Architect at Altisource Labs
Be always grateful, but never satisfied.

Konstantinos Kougios

Apr 21, 2015, 6:07:13 PM
to Oliver Ruebenacker, Vincent Marquez, scala-user
Hi Oliver,

But isn't spark based on RDDs (redundant distributed datasets, is it?). So the benefit is when we've got a lot of data and want to distribute and process it. How would it work for batch processing? I suppose there are similarities, as it takes an input, processes it in many ways, and then feeds e.g. a database.

It might be worth having a system where we can trigger spark, akka-streams and other jobs. There is no reason to limit it to just one lib.

Cheers

Oliver Ruebenacker

Apr 22, 2015, 8:10:28 AM
to Konstantinos Kougios, Vincent Marquez, scala-user

     Hello,

  For the record, RDD stands for resilient distributed dataset. Resilient means that if one of your processes dies and loses data, that data will be automatically recalculated. From the controller thread, an RDD looks like a simple collection, and Spark takes care of spreading data and calculation out over multiple nodes.
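
  For example, from the driver an RDD pipeline reads like ordinary collection code (a minimal sketch; the master is normally supplied by spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("example"))
// map looks like a plain collection operation, but it runs across the
// cluster, and lost partitions are recomputed automatically
val squares = sc.parallelize(1 to 1000).map(n => n * n).collect()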

  Perhaps I am misunderstanding your use case. Spark is for when you have a large calculation that you want to distribute and you don't care about the details of how it is distributed.

  If, on the other hand, your use case is to perform the same maintenance on every node, that would not fit Spark. 

  I should also mention that Spark is peculiar about its dependencies. It still uses Scala 2.10. It also has its own edition of Akka included, which may lead to version issues if you use another edition of Akka (say, the standard one) within the same app. For example, I once tried to use Spark and Play (which uses Akka) in the same app, got some errors I could not resolve, and then split the project into two apps (one for Spark, one for Play) that communicate over the network.

     Best, Oliver

Konstantinos Kougios

Apr 22, 2015, 8:14:20 AM
to Oliver Ruebenacker, Vincent Marquez, scala-user
Hi Oliver,

Is spark a fit for batch jobs, e.g. something that downloads and unzips files? I mean, of course you could do this with spark, but I believe spark is more suited to big data processing. We also need a way to stop jobs that are running and to resume failed jobs after we, say, fix a bug.

Cheers

Oliver Ruebenacker

Apr 22, 2015, 8:34:02 AM
to Konstantinos Kougios, Vincent Marquez, scala-user

     Hello,

  That depends on what kind of batch jobs you have. You are right, it is for big data processing. If your batch jobs are of a very different kind, then maybe Spark is not for you.

     Best, Oliver
