Akka-stream - aggregate record counts while writing to a sink, and update an object in the middle of the flow process with the aggregated data.


Eugene Dzhurinsky

Nov 3, 2016, 10:57:59 PM
to Akka User List
Hello, I want to implement the following workflow:

- a source emits the sequence of IDs to process
- an initial flow INIT fetches the document by ID and extracts the profile
- another flow FETCH is spawned; it fetches one or more of the associated documents and stores them in some sink DATA (a file)
- the flow must also count the number of records saved to the sink DATA
- once the flow FETCH completes, the profile is updated with the count of fetched documents
- then the profile is written to the sink METADATA (a file)
- optionally, if the record count from the FETCH phase doesn't match the expected number set in the metadata, some key should be written to yet another sink ERROR
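
A rough sketch of that workflow in akka-stream terms (every name here — fetchProfile, fetchDocs, dataSink, metadataSink, withCount — is a hypothetical placeholder, not an existing API) could be:

```scala
// Hypothetical sketch only: all identifiers below are placeholders.
// The point is the overall pipeline shape, not a working program.
Source(ids)                                     // source of IDs
  .mapAsync(4)(id => fetchProfile(id))          // INIT: ID -> profile
  .flatMapConcat { profile =>
    fetchDocs(profile)                          // FETCH: Source[Doc, _]
      .alsoTo(dataSink)                         // DATA sink gets each doc
      .fold(0)((count, _) => count + 1)         // count saved records
      .map(count => (profile, count))
  }
  // mismatches between count and the expected number could be routed
  // to an ERROR sink with e.g. a Partition stage
  .map { case (profile, count) => profile.withCount(count) }
  .to(metadataSink)                             // METADATA sink
  .run()
```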


So far it's not clear how I would

- aggregate the records produced by a flow to count the records processed for a given input
- keep the intermediate profile object somewhere until the records have been fetched and saved in another flow

Please advise.

Thanks!

Eugene Dzhurinsky

Nov 4, 2016, 8:26:05 AM
to Akka User List
The more I think about it, the more I am convinced that this must be implemented via nested flows: the step that extracts the initial user profile and then extracts all associated records should be a separate graph, invoked as part of the transformation of the input stream.

So I would have a flow nested in another flow with its own materializer: for every user ID from the outer flow I spawn another instance of the inner flow, wait until it completes, and then send the data down to the appropriate sinks.

Am I missing something?

Viktor Klang

Nov 4, 2016, 8:43:43 AM
to Akka User List
Why would it need its own materializer?


Eugene Dzhurinsky

Nov 4, 2016, 8:19:21 PM
to Akka User List
On Friday, November 4, 2016 at 8:43:43 AM UTC-4, √ wrote:
Why would it need its own materializer?

I have a stream of IDs coming from the database.
There is a flow that maps each ID to a profile (fetching the details from some external storage). So far it is simple enough:

val profileFetcher: Flow[ID, Profile, NotUsed] = Flow.fromFunction(id => ....)


Now I have to take the profile and run another flow that will
- query 1..N different resources (depending on the content of Profile)
- transform the content of the resources and save them into a separate file (sink)
- aggregate the results of that processing (for now, count the records fetched from the external resources) and update the profile object
- stream the profile object down into another file (sink).

So far I can see that the function that goes downstream

val profileSaver: Flow[Profile, ByteString, NotUsed] = Flow.fromFunction(profile => ....)

must take the profile, create its own flow, wait until that flow completes, and then update the profile object before converting it into a ByteString.

And that's why I need to materialize the inner flow.
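
One way to sketch that inner materialization is to run the per-profile stream to a counting Sink.fold and use its materialized Future[Int] (fetchDocs, docFileSink and savedCount are assumed names, not real APIs):

```scala
// Sketch: run the inner stream per profile and take the count from the
// materialized value of a counting Sink.fold. fetchDocs, docFileSink
// and savedCount are assumptions for illustration only.
def saveAndCount(profile: Profile)(implicit mat: Materializer,
                                   ec: ExecutionContext): Future[Profile] =
  fetchDocs(profile)                        // Source[Doc, _]
    .alsoTo(docFileSink(profile))           // each doc also goes to the file
    .runWith(Sink.fold(0)((n, _) => n + 1)) // materializes Future[Int]
    .map(count => profile.copy(savedCount = count))

// from the outer stream, e.g.:
// Source(ids).mapAsync(4)(fetchProfile).mapAsync(1)(saveAndCount).to(metadataSink)
```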


Makes sense?

matheus...@gmail.com

Nov 4, 2016, 11:22:40 PM
to Akka User List
This can easily be modeled as:
val source = Source(ids)
source
    .mapAsync(4)(fetchDocumentById)
    .map(_.profile)
    .flatMapConcat { prof =>
        sourceOfRelatedDocs
            .mapAsync(4)(persistDoc)
            .fold(0)((count, _) => count + 1)
            .map(count => (count, prof))
    }
    .mapAsync(4) { case (count, prof) => updateProfile(count, prof) }
    .to(Sink.ignore)
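
The fold stage here is the standard counting idiom: ignore each element and increment the accumulator. The same idiom on a plain collection:

```scala
// Counting via fold: the element is ignored, the accumulator grows by 1.
val persisted = List("doc1", "doc2", "doc3")
val count = persisted.foldLeft(0)((n, _) => n + 1)
// count == 3
```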

Eugene Dzhurinsky

Nov 5, 2016, 8:41:55 AM
to Akka User List
So far I understood that persistDoc is another function that should persist something into the appropriate file. However, I believe that in my case it is a flow that has some source and sink attached, and the sink is a FileIO.

Basically, I don't see how that is supposed to work out: FileIO doesn't return the number of operations performed, or anything like that.
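
For what it's worth, FileIO.toPath does materialize a Future[IOResult], but that reports the number of bytes written, not records. A sketch of counting upstream of the file sink instead (likesSource, serialize and fileSink are assumed names):

```scala
// Sketch: count records before they are serialized into the file sink,
// since IOResult only carries the byte count.
// likesSource, serialize and fileSink are assumptions, not real APIs.
val recordCount: Future[Int] =
  likesSource                                        // Source[Like, _]
    .alsoTo(Flow[Like].map(serialize).to(fileSink))  // bytes go to the file
    .runWith(Sink.fold(0)((n, _) => n + 1))          // records counted here
```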

Thanks!

matheus...@gmail.com

Nov 5, 2016, 3:55:24 PM
to Akka User List
If I understood correctly, it's enough to replace mapAsync(4)(persistDoc) with via(persistDoc), if persistDoc is a flow. The fold stage aggregates the number of persistence operations performed by the persistDoc stage. I understood that you want to fetch a list of docs related to a profile, persist them, and then update the profile with the number of docs persisted. This is the logic executed by the flatMapConcat stage I described previously.
Best regards.

Eugene Dzhurinsky

Nov 5, 2016, 10:43:36 PM
to Akka User List
Okay, perhaps the simple source snippet will make it clear:

    type OptProfile = Option[FullProfile]
    type LikesAndCount = (Int, Stream[Profile])

    val src: Source[Int, NotUsed] = Source[Int](Conf.startId() to Conf.endId())
    val fetchProfileFlow: Flow[Int, OptProfile, NotUsed] = Flow.fromFunction(Profile.extractFullProfile)
    val fetchLikesFlow: Flow[Int, LikesAndCount, NotUsed] = Flow.fromFunction(Likes.extractUserList)
    val profileDataSink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(new File(base, "profiles").toPath)

    val fetchLikesAndUpdateProfile = Flow[(OptProfile, LikesAndCount)].flatMapConcat {
      case (Some(profile), (likesExpected, stream)) => GraphDSL.create() { implicit builder =>
        import GraphDSL.Implicits._

        val in = builder.add(Source(stream))
        val profileLikesSink = builder.add(FileIO.toPath(new File(base, s"likes_${profile.id}").toPath))

        in ~> Flow.fromFunction[Profile.Profile, ByteString](p => ByteString(s"${p.id}:${p.username}\n")) ~> profileLikesSink

        // update profile here with the number of records and emit it
        Source.single(profile).shape
      }
    }

    RunnableGraph.fromGraph(
      GraphDSL.create() { implicit builder =>
        import GraphDSL.Implicits._

        val inlet = builder.add(Broadcast[Int](2))
        val merge = builder.add(Zip[OptProfile, LikesAndCount])

        src ~> inlet.in
        inlet.out(0) ~> fetchProfileFlow ~> merge.in0
        inlet.out(1) ~> fetchLikesFlow ~> merge.in1
        merge.out ~> fetchLikesAndUpdateProfile ~>
          Flow.fromFunction[Profile.FullProfile, ByteString](p => ByteString(s"$p\n")) ~>
          profileDataSink

        ClosedShape
      }
    ).run()


So far it is not clear how I would write fetchLikesAndUpdateProfile so that it will
- create a sink for storing the list of fetched data (every profile has an associated file named after the profile ID)
- retrieve the number of stored records in fetchLikesAndUpdateProfile and update the property in the FullProfile object.

Thanks!

matheus...@gmail.com

Nov 6, 2016, 12:08:10 PM
to Akka User List
In fetchLikesAndUpdateProfile, you can broadcast the input to profileLikesSink and also to a flow that performs a fold to count all elements passed through and update the profile.
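
That suggestion could be sketched with alsoTo broadcasting each like to the per-profile file sink while a fold counts them (withLikesCount is an assumed updater method on the profile, not a real API):

```scala
// Sketch of the broadcast-and-fold idea: each like goes both to the
// per-profile file sink (via alsoTo) and to the counting fold.
// withLikesCount is an assumed method for updating the profile.
val fetchLikesAndUpdateProfile: Flow[(OptProfile, LikesAndCount), FullProfile, NotUsed] =
  Flow[(OptProfile, LikesAndCount)].flatMapConcat {
    case (Some(profile), (_, stream)) =>
      Source(stream)
        .alsoTo(Flow[Profile.Profile]
          .map(p => ByteString(s"${p.id}:${p.username}\n"))
          .to(FileIO.toPath(new File(base, s"likes_${profile.id}").toPath)))
        .fold(0)((n, _) => n + 1)
        .map(count => profile.withLikesCount(count))
    case (None, _) => Source.empty
  }
```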

Eugene Dzhurinsky

Nov 6, 2016, 11:07:01 PM
to Akka User List
Sorry, I don't really get it. Before I can get the number of records from the likes stream, I have to materialize that stream in order to get the result.

So far I can think of a custom GraphStage that takes 2 inputs - the profile and the stream - and produces one output - the profile - emitted only after the last element of the likes stream is consumed, when the number of records is known.

Eugene Dzhurinsky

Nov 8, 2016, 8:49:31 PM
to Akka User List
Got some time to experiment: with this definition

    val fetchLikesAndUpdateProfile = Flow[(OptProfile, LikesAndCount)].flatMapConcat {
      case (Some(profile), (likesExpected, stream)) => GraphDSL.create() { implicit builder =>
        import GraphDSL.Implicits._

        val in = builder.add(Source(stream))
        val profileLikesSink = builder.add(FileIO.toPath(new File(base, s"likes_${profile.id}").toPath))

        in ~> Flow.fromFunction[Profile.Profile, ByteString](p => ByteString(s"${p.id}:${p.username}\n")) ~> profileLikesSink

        // update profile here with the number of records and emit it
        Source.single(profile).shape
      }
    }


The graph fails at runtime:


java.lang.IllegalArgumentException: requirement failed: The inlets [] and outlets [single.out] must correspond to the inlets [] and outlets []
 at scala.Predef$.require(Predef.scala:219)
 at akka.stream.Shape.requireSamePortsAs(Shape.scala:168)
 at akka.stream.impl.StreamLayout$CompositeModule.replaceShape(StreamLayout.scala:426)
 at akka.stream.scaladsl.GraphApply$class.create(GraphApply.scala:19)
 at akka.stream.scaladsl.GraphDSL$.create(Graph.scala:993)
 at sample.Aggregate$$anonfun$6.apply(Aggregate.scala:112)
 at sample.Aggregate$$anonfun$6.apply(Aggregate.scala:111)


It looks like it doesn't like Source.single(profile).shape, but the type of the result should be Graph[SourceShape[T], M], and I assume that the SourceShape should actually emit the updated profile object. Can somebody please explain how this is supposed to work, because I'm lost?

Thanks!

mathe...@sagaranatech.com

Nov 9, 2016, 8:33:40 AM
to Akka User List
I guess you need to return a source built with the builder. The other problem is that you build a closed graph with in ~> flow ~> sink.
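
Concretely, the shape returned from GraphDSL.create must be made of ports created with that same builder. A sketch of a per-profile Source doing the broadcast-and-count that way (likesPath and updateProfile are assumed names):

```scala
// Sketch: a GraphDSL-built Source whose SourceShape wraps an outlet
// created with this builder, which avoids the "must correspond" error.
// likesPath and updateProfile are assumptions for illustration only.
val perProfileSource: Source[FullProfile, NotUsed] =
  Source.fromGraph(GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._
    val likes = builder.add(Source(stream))
    val bcast = builder.add(Broadcast[Profile.Profile](2))
    val sink  = builder.add(FileIO.toPath(likesPath))
    val count = builder.add(Flow[Profile.Profile]
                  .fold(0)((n, _) => n + 1)
                  .map(n => updateProfile(profile, n)))

    likes ~> bcast
    bcast.out(0) ~> Flow[Profile.Profile].map(p => ByteString(s"${p.id}\n")) ~> sink
    bcast.out(1) ~> count

    SourceShape(count.out)   // an outlet from this builder, not a foreign shape
  })
```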