Hi,
When reading from Kafka, I noticed that all the reading is done by a single Spark instance; it does not seem to be distributed among multiple spark-workers.
(When testing on a 10-node Spark cluster, we indeed saw that one node, the one reading from Kafka, had a CPU usage of almost 100%, while the others were much lower.)
Taking a look at the Kafka streaming code:
/**
* Input stream that pulls messages from a Kafka Broker.
*
* @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
* in its own thread.
* @param storageLevel RDD storage level.
*/
private[streaming]
class KafkaInputDStream[T: ClassManifest, D <: Decoder[_]: Manifest]
It seems that each partition is indeed consumed in its own thread, but all of those threads run within the same task, on a single worker.
So basically my question: Is it possible to distribute the reading from Kafka among multiple workers?
If not, this could become a huge bottleneck, as you're bound by the network bandwidth and processing power of that single instance.
We could implement our own KafkaInputDStream classes so that we end up with, e.g., one DStream per partition.
That would automatically make the reading distributed, no?
Or am I missing something here?
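
For what it's worth, a similar effect might already be achievable without touching KafkaInputDStream, by creating several Kafka input streams and unioning them, so that each receiver can be scheduled on a different worker. Below is a rough, untested sketch of that idea. It assumes the KafkaUtils.createStream helper from the newer spark-streaming-kafka external module (in the release quoted above the entry point would be ssc.kafkaStream, I believe), and the master URL, ZooKeeper quorum, consumer group and topic names are all made up for illustration:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ParallelKafkaRead {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("spark://master:7077", "ParallelKafkaRead", Seconds(2))

    val zkQuorum = "zk1:2181"         // hypothetical ZooKeeper quorum
    val group    = "spark-consumers"  // hypothetical consumer group id
    val topics   = Map("events" -> 1) // hypothetical topic -> consumer threads per receiver

    // One input DStream per receiver; Spark should place the receivers on
    // (potentially) different workers, so the network and CPU load of
    // consuming is no longer concentrated on a single node.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topics)
    }

    // Union the partial streams back into a single DStream for processing.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

That wouldn't remove the per-receiver ceiling, but it would spread the load over as many workers as there are receivers; the per-partition KafkaInputDStream we have in mind would take the same approach one step further.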
Greets,
Bart