Aggregating millions of records using Akka-Streams

Aditya pavan Kumar

unread,

Aug 8, 2019, 5:02:13 AM8/8/19

to Akka User List

I am running a simulation which generates a million records every second.
I’m writing them to Kafka and reading them through Akka Streams. I’m performing a few aggregations on this data and writing the output back to Kafka.

The data contains a timestamp based on which the aggregations are grouped.
Using the timestamps, I’m creating windows of data and performing aggregation on these windows.
Since there are a million records each second, the aggregations are taking about 40 seconds for one million records.
This is really slow because new data is being generated and written to Kafka every second.

I referred this blog post for performing window aggregations: https://dvirgiln.github.io/akka-streams-windowing/

Is there any better way to perform these aggregations in lesser time(preferably less than one second) using Akka Streams?

Johannes Rudolph

unread,

Aug 8, 2019, 5:42:02 AM8/8/19

to Akka User List

Reposted here: https://discuss.lightbend.com/t/aggregating-a-million-records-using-akka-streams/4781

Brian Maso

unread,

Aug 8, 2019, 4:18:33 PM8/8/19

to akka...@googlegroups.com

The word that come to mind is "sharding". Very simplistically ignoring many potential other factors: if one process can only handle about 25K msgs/sec (1Mil msgs/40 secs = 25K msgs/sec), you could use 40 separate processes to handle the load.

First I would figure out where the bottleneck is: how many messages can you Kafka pipeline push through per sec with null processing on each message? Can your network handle that load? etc.

I don't know what your situation is, but I wonder if Kafka is really the right solution? Would lower level UDP/datagram sockets possibly work? That technology can handle way more interactions/sec, though with reliability and message size limit trade-offs -- but lising a small percentage of samples wouldn't significantly effect most aggregation calculations -- that is, error would stay within acceptable limits (if you have reasonable error limits).

Anyway, sounds like some engineering work needs to be done to design a system with sufficient throughput and error tolerances. Akka Streams isn't the issue, there's more global architectural issues at play here.

Brian Maso

--
*****************************************************************************************************
** New discussion forum: https://discuss.akka.io/ replacing akka-user google-group soon.
** This group will soon be put into read-only mode, and replaced by discuss.akka.io
** More details: https://akka.io/blog/news/2018/03/13/discuss.akka.io-announced
*****************************************************************************************************
>>>>>>>>>>
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/akka-user/cc8f1634-76d2-4f31-a226-950c37a239cd%40googlegroups.com.

Reply all

Reply to author

Forward