Scenario: Read from one database using an ActorPublisher, write to another database using a subscriber.
I expect the reads to be much faster than the writes, so the reads need to be slowed down at some threshold: growing an unbounded queue of data will simply OOM. The code below works for small datasets, but with large datasets the gap between reads and writes becomes enormous and the process runs out of memory.
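For comparison, this minimal standalone sketch (unrelated to Elasticsearch) shows the behaviour I'm expecting, where backpressure alone keeps a fast source in step with a slow sink:

// Minimal demo of the behaviour I expect: the source emits only when the
// slow sink signals demand, so memory usage stays flat.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object BackpressureDemo extends App {
  implicit val system = ActorSystem("demo")
  implicit val materializer = ActorMaterializer()

  Source(1 to 1000000)            // effectively instantaneous producer
    .runWith(Sink.foreach { n =>
      Thread.sleep(100)           // artificially slow consumer
      println(s"consumed $n")
    })
}

With my actual publisher below, this pacing never kicks in and memory grows instead.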
My ActorPublisher:
class ScrollPublisher(clientFrom: ElasticClient, config: Config) extends ActorPublisher[SearchHits] {

  import akka.stream.actor.ActorPublisherMessage._

  val logger = Logger(LoggerFactory.getLogger(this.getClass))
  var readCount = 0
  var processing = false

  @volatile var executeQuery = () => clientFrom.execute {
    search in config.indexFrom / config.mapping scroll "30m" limit config.scrollSize
  }

  def nextHits(): Unit = {
    if (!processing) {
      processing = true
      val future = executeQuery()
      future.foreach { response =>
        processing = false
        if (response.getHits.hits.nonEmpty) {
          logger.info("Fetched: \t" + response.getHits.getHits.length + " documents in\t" + response.getTookInMillis + "ms.")
          readCount += response.getHits.getHits.length
          logger.info("Total Fetched:\t" + readCount)
          if (isActive && totalDemand > 0) {
            executeQuery = () => clientFrom.execute {
              searchScroll(response.getScrollId).keepAlive("30m")
            }
            nextHits()
            onNext(response.getHits) // sends elements to the stream
          }
        } else {
          onComplete()
        }
      }
      future.onFailure {
        case t =>
          processing = false
          throw t
      }
    }
  }

  def receive = {
    case Request(cnt) =>
      logger.info("ActorPublisher Received: \t" + cnt)
      if (isActive && totalDemand > 0) {
        nextHits()
      }
    case Cancel =>
      context.stop(self)
    case _ =>
  }
}
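For reference, here's what I think a demand-respecting receive would look like, based on my reading of the ActorPublisher docs (untested sketch; the SearchResponse and Status.Failure matches are my assumptions about how pipeTo would deliver the Future's result back as a message):

// Sketch only, not tested: my understanding of the ActorPublisher contract
// is that onNext must be called from inside the actor, at most totalDemand
// times, so the Future is piped back to the actor as a message instead of
// being handled in a callback on another thread.
import akka.actor.Status
import akka.pattern.pipe
import org.elasticsearch.action.search.SearchResponse
import context.dispatcher // ExecutionContext for pipeTo (inside the class body)

def receive = {
  case Request(_) =>
    if (isActive && totalDemand > 0 && !processing) {
      processing = true
      executeQuery() pipeTo self // response arrives as a normal message
    }
  case response: SearchResponse =>
    processing = false
    if (response.getHits.hits.isEmpty) {
      onComplete()
    } else if (isActive && totalDemand > 0) {
      executeQuery = () => clientFrom.execute {
        searchScroll(response.getScrollId).keepAlive("30m")
      }
      onNext(response.getHits) // consumes exactly one unit of demand
      if (totalDemand > 0) {   // keep scrolling only while demand remains
        processing = true
        executeQuery() pipeTo self
      }
    }
  case Status.Failure(t) =>
    processing = false
    onError(t) // fail the stream properly instead of throwing
  case Cancel =>
    context.stop(self)
}

I haven't switched to this yet because I'm not sure it addresses the actual flaw in my version above.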
Source declaration:
// SearchHits Akka Stream Source
val documentSource = Source.actorPublisher[SearchHits](Props(new ScrollPublisher(clientFrom, config)))
  .map(searchHits => searchHits.getHits)
My Sink, which performs an asynchronous write to the new database:
documentSource.buffer(16, OverflowStrategy.backpressure).runWith(Sink.foreach { searchHits =>
  Thread.sleep(1000) // simulated write lag, see below
  totalRec += searchHits.size
  logger.info("\t\t\tRECEIVED: " + searchHits.size + " \t\t\t TOTAL RECEIVED: " + totalRec)
  val bulkIndexes = searchHits.map(hit => (hit.`type`, hit.id, hit.sourceAsString())).collect {
    case (typ, _id, source) =>
      index into config.indexTo / config.mapping id _id -> typ doc JsonDocumentSource(source)
  }
  val future = clientTo.execute {
    bulk(bulkIndexes)
  }
})
The sleep is there to simulate write lag during local development. I've tried changing the buffer size and the materializer's initial/max input-buffer settings, and the stream still seems to ignore backpressure.
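One variant I'm considering (sketch only; it assumes clientTo.execute returns a Future that completes when Elasticsearch acknowledges the bulk request) is to tie the write future into the stream with mapAsync, so demand only propagates upstream once each bulk write completes:

// Sketch: mapAsync(1) withholds demand until each bulk-write Future
// completes, so the writes themselves become the backpressure signal.
documentSource
  .mapAsync(parallelism = 1) { searchHits =>
    val bulkIndexes = searchHits.map(hit => (hit.`type`, hit.id, hit.sourceAsString())).collect {
      case (typ, _id, source) =>
        index into config.indexTo / config.mapping id _id -> typ doc JsonDocumentSource(source)
    }
    clientTo.execute {
      bulk(bulkIndexes)
    }
  }
  .runWith(Sink.ignore)

With parallelism = 1 the writes are strictly serial; raising it would allow a bounded number of in-flight bulk requests instead.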
Is there a logic flaw in this code?