MultiWorkUnit load?

Brian Orwig

Jun 26, 2017, 12:36:07 PM

to gobblin-users

Can someone explain what load means for a multiworkunit?

2017-06-26T16:30:26.890 [gobblin] [INFO] [gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker] - MultiWorkUnit 7: estimated load=300.101737, partitions=[[topicA:1]]
2017-06-26T16:30:26.890 [gobblin] [INFO] [gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker] - MultiWorkUnit 8: estimated load=156.292414, partitions=[[topicB:1]]
2017-06-26T16:30:26.890 [gobblin] [INFO] [gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker] - MultiWorkUnit 9: estimated load=73.114013, partitions=[[topicC:0]]
2017-06-26T16:30:26.891 [gobblin] [INFO] [gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker] - MultiWorkUnit 10: estimated load=25.697858, partitions=[[topicD:0]]
2017-06-26T16:30:26.891 [gobblin] [INFO] [gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker] - MultiWorkUnit 11: estimated load=1731.634858, partitions=[[topicE:1]]
2017-06-26T16:30:26.891 [gobblin] [INFO] [gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker] - MultiWorkUnit 12: estimated load=891.672269, partitions=[[topicF:1]]
2017-06-26T16:30:26.891 [gobblin] [INFO] [gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker] - MultiWorkUnit 13: estimated load=15.171912, partitions=[[topicG:1]]

Is this the amount of data that needs to be processed? The processing load? Something else?

Thanks

Issac Buenrostro

Jun 26, 2017, 12:45:15 PM

to Brian Orwig, gobblin-users

Hi Brian,

The load is (# of events) * (estimated pull time per event). It is used for bin packing the partitions into mappers with about equal load.
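
To make the formula and the packing concrete, here is a minimal sketch. It is not Gobblin's actual KafkaWorkUnitPacker code: the function names, the greedy "largest load first" strategy, and the choice of 3 mappers are assumptions made for illustration. The loads are the ones from the log lines above.

```python
import heapq


def estimated_load(num_events, pull_time_per_event):
    """Load as described in the thread: (# of events) * (estimated pull time per event)."""
    return num_events * pull_time_per_event


def pack_partitions(loads, num_mappers):
    """Greedy packing (an assumed strategy): take partitions from the largest
    load to the smallest and give each one to the currently least-loaded mapper.

    loads: dict mapping partition name -> estimated load.
    Returns (bins, totals): the partitions per mapper and the load per mapper.
    """
    heap = [(0.0, i) for i in range(num_mappers)]  # (current load, mapper index)
    heapq.heapify(heap)
    bins = [[] for _ in range(num_mappers)]
    for partition, load in sorted(loads.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)  # least-loaded mapper so far
        bins[idx].append(partition)
        heapq.heappush(heap, (total + load, idx))
    totals = [0.0] * num_mappers
    for total, idx in heap:
        totals[idx] = total
    return bins, totals


# Estimated loads copied from the log lines in the question.
loads = {
    "topicA:1": 300.101737,
    "topicB:1": 156.292414,
    "topicC:0": 73.114013,
    "topicD:0": 25.697858,
    "topicE:1": 1731.634858,
    "topicF:1": 891.672269,
    "topicG:1": 15.171912,
}
bins, totals = pack_partitions(loads, num_mappers=3)
for mapper, (parts, total) in enumerate(zip(bins, totals)):
    print(mapper, round(total, 6), parts)
```

With these numbers the two largest partitions each get a mapper to themselves and the five small ones share the third, which is the "about equal load" idea applied as well as the indivisible partitions allow.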

Best,

Issac

--
You received this message because you are subscribed to the Google Groups "gobblin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gobblin-users+unsubscribe@googlegroups.com.
To post to this group, send email to gobbli...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gobblin-users/b414ceba-dbd0-4639-a652-c0734ba68c01%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brian Orwig

Jun 26, 2017, 1:01:05 PM

to gobblin-users, brian...@derbysoft.net

Thank you for the quick response.

Follow-up question: for the partitions with high load, do I need to improve the ingestion/processing rate by increasing the number of mappers, threads, etc.? Is there a performance tuning guide anywhere that details what to tune to increase throughput?


Issac Buenrostro

Jun 26, 2017, 1:10:33 PM

to Brian Orwig, gobblin-users

Hi Brian,

Sort of. The horizontal scalability of the Kafka source is limited by partitions. When a large partition gets assigned to a single mapper, adding more mappers or increasing threading will not help anymore, as the partition will never be assigned to more than one mapper (this is necessary for correct watermarking). In the logs you pasted above it looks like there is only one partition per mapper.
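
The limit described here can be illustrated with a small calculation. This is only a sketch under the assumption that a partition's load cannot be split across mappers; the function name is hypothetical and the loads are the ones from the log lines earlier in the thread.

```python
def makespan_lower_bound(partition_loads, num_mappers):
    """Lower bound on the load of the busiest mapper.

    No packing can do better than the average load per mapper, and none can
    do better than the single largest partition, because a partition is
    never split across mappers.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return max(max(partition_loads), sum(partition_loads) / num_mappers)


# Estimated loads copied from the log lines in the question.
partition_loads = [300.101737, 156.292414, 73.114013, 25.697858,
                   1731.634858, 891.672269, 15.171912]

for mappers in (1, 2, 7, 100):
    print(mappers, round(makespan_lower_bound(partition_loads, mappers), 6))
```

From 2 mappers onward the bound stays at 1731.634858, the load of the topicE:1 partition, so adding mappers beyond that point cannot shorten the run.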

In general, if the consume speed is ~2x the produce speed, you should be fine as long as you run Gobblin regularly. If you run pulls at a low frequency, or are trying to bootstrap or catch up, you might encounter a problem if your consume/produce ratio is low. In our runs, we see consumption rates anywhere from 1k to 50k events per second per partition, depending on the size and deserialization cost of the events. If your producer rates are close to this, you may need to increase the partitioning of your topics.
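
A rough way to reason about catch-up for a single partition is sketched below. The function name and the example numbers are hypothetical, and the calculation assumes the pull runs continuously at a steady rate, which a periodically scheduled job only approximates.

```python
def catch_up_seconds(backlog_events, consume_rate, produce_rate):
    """Seconds needed to drain a backlog on one partition.

    The backlog shrinks at (consume_rate - produce_rate) events per second.
    If consumption is not faster than production, the backlog never drains.
    """
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return float("inf")
    return backlog_events / net_rate


# Hypothetical numbers: 36M events behind, consuming 10k events/s
# while producers write 5k events/s (consume speed is 2x produce speed).
print(catch_up_seconds(36_000_000, 10_000, 5_000))

# Same backlog when producers are almost as fast as the consumer.
print(catch_up_seconds(36_000_000, 10_000, 9_000))
```

As the produce rate approaches the consume rate, the catch-up time grows quickly, which is why splitting a hot topic into more partitions is the suggested fix.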

If you have any more questions, let me know.

Best,

Issac


Brian Orwig

Jun 26, 2017, 1:24:37 PM

to gobblin-users, brian...@derbysoft.net

Thank you again for the detailed response. We have lots of topics, and on several we are definitely playing catch-up since we just turned on consumption for all of them. All of the topics have more than one partition (I just included a snippet above); however, I will go through the ones we are having trouble catching up on and look at increasing their partitions based on the produced vs. consumed rates.

-Brian
