I am back-filling large periods of data from HDFS, but it is proving slow (10+ hours per month of data), and I am looking to speed up Hadoop ingestion.
My data has 35 dimensions and 36 metrics, and I need it aggregated at the hourly level. It is stored as CSVs on HDFS and amounts to ~40 GB per day.
Sample Row:
2017-03-01T07:00:00.000Z,US:OK,157876,0,7,604094,...
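For context, the parser portion of the spec is just the standard CSV parseSpec; the column names below are placeholders and the lists are trimmed, while the real spec enumerates all 35 dimensions and 36 metric source columns:
"parser": {
  "type": "hadoopyString",
  "parseSpec": {
    "format": "csv",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "columns": [ "timestamp", "dim1", "dim2", ..., "dim35", "metric1", ..., "metric36" ],
    "dimensionsSpec": { "dimensions": [ "dim1", "dim2", ..., "dim35" ] }
  }
}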
Here are the granularity and tuning portions from my ingestion spec.
"granularitySpec":{
"type":"uniform",
"segmentGranularity":"hour",
"queryGranularity": "none",
"rollup":true,
"intervals":[ "2016-01-01/P1M" ]
},
"tuningConfig": {
"type":"hadoop",
"targetPartitionSize":5000000,
"rowFlushBoundary":75000,
"numShards":-1,
"indexSpec":{
"bitmap":{
"type":"concise"
},
"dimensionCompression":"lz4",
"metricCompression":"lz4",
"longEncoding":"longs"
},
"buildV9Directly":false,
"forceExtendableShardSpecs":true
}
}
I am running each Hadoop ingestion task on a month's worth of data at a time, and I have confirmed that the MapReduce job is not running locally. Relevant lines from the task log:
2017-05-03T17:29:55,357 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-05-03T17:29:55,531 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 744
2017-05-03T17:29:56,187 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - number of splits:6913
2017-05-03T17:29:56,256 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1492526761603_0101
Is there anything I can change in the ingestion spec or in my process (e.g., load a day's worth of data instead of a month's worth) to speed up ingestion?
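If splitting the work up would help, I assume the only per-task change in the spec would be narrowing the interval, e.g. a single day instead of a month:
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "hour",
  "queryGranularity": "none",
  "rollup": true,
  "intervals": [ "2016-01-01/P1D" ]
}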