Speeding up Hadoop Index Task

aaro...@gmail.com

unread,

May 4, 2017, 11:54:13 AM5/4/17

to Druid User

I am back-filling large periods of data from HDFS but it's proving slow (10+ of hours per month) and am looking to speed up hadoop ingestion.

My data has 35 dimensions and 36 metrics and I need it aggregated at the hourly level. It is being stored as CSVs on HDFS and amounts to ~40 GB per day.

Sample Row:

2017-03-01T07:00:00.000Z,US:OK,157876,0,7,604094,...

Here are the granularity and tuning portions from my ingestion spec.

"granularitySpec":{

"type":"uniform",

"segmentGranularity":"hour",

"queryGranularity": "none",

"rollup":true,

"intervals":[ "2016-01-01/P1M" ]

},

"tuningConfig": {

"type":"hadoop",

"targetPartitionSize":5000000,

"rowFlushBoundary":75000,

"numShards":-1,

"indexSpec":{

"bitmap":{

"type":"concise"

},

"dimensionCompression":"lz4",

"metricCompression":"lz4",

"longEncoding":"longs"

},

"buildV9Directly":false,

"forceExtendableShardSpecs":true

}

I am running each hadoop ingestion task on a month's worth of data at once and also confirmed that the MapReduce task is not running locally.

2017-05-03T17:29:55,357 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-05-03T17:29:55,531 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 744
2017-05-03T17:29:56,187 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6913
2017-05-03T17:29:56,256 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1492526761603_0101

Is there anything I can change in the ingestion spec or my process (load a day's worth instead of a month's worth) to speed up ingestion?

Gian Merlino

unread,

May 5, 2017, 2:47:10 PM5/5/17

to druid...@googlegroups.com

On the Druid side one thing you can do is set buildV9Directly to true, or upgrade to Druid 0.10.0 where it's true by default. Raising rowFlushBoundary could help too, although not too high since you don't want to run out of memory. Other than that, the best thing to do is follow 'standard' Hadoop performance tuning steps: turn on combining if you have a lot of small files, potentially adjust mapper/reducer container sizing, that kind of thing.

Gian

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+unsubscribe@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/8ca1f66e-c3df-4910-ba78-8e327d3d13c5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

aaro...@gmail.com

unread,

May 10, 2017, 11:46:29 AM5/10/17

to Druid User

Thanks Gian, your suggestions greatly sped up the ingestion task.

Gian

To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.

Reply all

Reply to author

Forward