Map input records not equal to Map output records?


wangm...@gmail.com

Jul 28, 2015, 1:52:10 AM
to Druid User
Hi,
   When I use the indexing service to ingest data with an index_hadoop task, the job completes successfully, but the Map input records counter does not equal the Map output records counter: some records have been discarded.

      See the job log below:
            2015-07-28T12:15:20,039 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - map 100% reduce 96%
2015-07-28T12:15:24,059 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 97%
2015-07-28T12:16:59,417 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 98%
2015-07-28T12:21:00,535 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 99%
2015-07-28T12:24:24,513 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job -  map 100% reduce 100%
2015-07-28T13:11:01,180 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Job job_1431571906396_22852 completed successfully
2015-07-28T13:11:01,383 INFO [task-runner-0] org.apache.hadoop.mapreduce.Job - Counters: 53
	File System Counters
		FILE: Number of bytes read=27922640633
		FILE: Number of bytes written=55943546674
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=25971081731
		HDFS: Number of bytes written=2618327862
		HDFS: Number of read operations=1191
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=236
	Job Counters 
		Failed map tasks=10
		Killed reduce tasks=9
		Launched map tasks=298
		Launched reduce tasks=109
		Other local map tasks=10
		Data-local map tasks=278
		Rack-local map tasks=10
		Total time spent by all maps in occupied slots (ms)=64678934
		Total time spent by all reduces in occupied slots (ms)=280668248
		Total time spent by all map tasks (ms)=32339467
		Total time spent by all reduce tasks (ms)=70167062
		Total vcore-seconds taken by all map tasks=32339467
		Total vcore-seconds taken by all reduce tasks=70167062
		Total megabyte-seconds taken by all map tasks=132462456832
		Total megabyte-seconds taken by all reduce tasks=574808571904
	Map-Reduce Framework
		Map input records=36490889
		Map output records=36490840
		Map output bytes=27776676541
		Map output materialized bytes=27922812701
		Input split bytes=38016
		Combine input records=0
		Combine output records=0
		Reduce input groups=9
		Reduce shuffle bytes=27922812701
		Reduce input records=36490840
		Reduce output records=0
		Spilled Records=72981680
		Shuffled Maps =28800
		Failed Shuffles=0
		Merged Map outputs=28800
		GC time elapsed (ms)=6669226
		CPU time spent (ms)=115836130
		Physical memory (bytes) snapshot=492926922752
		Virtual memory (bytes) snapshot=1436702986240
		Total committed heap usage (bytes)=817816928256
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=25971043715
	File Output Format Counters 
		Bytes Written=0
2015-07-28T13:11:01,525 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-11T00:00:00.000+08:00_2015-01-12T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments
2015-07-28T13:11:01,531 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-12T00:00:00.000+08:00_2015-01-13T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments
2015-07-28T13:11:01,535 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-13T00:00:00.000+08:00_2015-01-14T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments
2015-07-28T13:11:01,540 INFO [task-runner-0] io.druid.indexer.IndexGeneratorJob - Adding segment logdata_format_2015-01-14T00:00:00.000+08:00_2015-01-15T00:00:00.000+08:00_2015-07-28T12:02:24.665+08:00 to the list of published segments

     As the log above shows, Map input records (36,490,889) does not equal Map output records (36,490,840): 49 records were dropped. Why does this happen, and what do you advise here? Thanks.

Himanshu

Jul 28, 2015, 9:55:32 AM
to Druid User
Hi,

Do you have "ignoreInvalidRows" (http://druid.io/docs/0.8.0/ingestion/batch-ingestion.html#tuningconfig) set to true? Can you share full task log?
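For reference, `ignoreInvalidRows` lives in the `tuningConfig` section of the batch ingestion spec (see the docs link above). A minimal sketch, assuming the 0.8-era Hadoop tuningConfig field names; when this flag is true, rows that fail to parse are silently skipped by the mapper, which shows up as exactly this kind of input/output counter mismatch:

```json
"tuningConfig": {
  "type": "hadoop",
  "ignoreInvalidRows": true
}
```

If you want the job to fail loudly on unparseable rows instead, leave it at its default of false.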

-- Himanshu


Gian Merlino

Jul 28, 2015, 11:21:05 AM
to druid...@googlegroups.com
There may also be some records that are outside the "intervals" in your job json. Those will be ignored.
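For context, the `intervals` list sits in the `granularitySpec` of the dataSchema, and rows whose timestamps fall outside those intervals are dropped before indexing. A hedged sketch, with interval values chosen to match the segment dates in the log above (your actual spec may differ):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "intervals": ["2015-01-11/2015-01-15"]
}
```

A quick way to check for this cause is to scan the source data for timestamps that fall outside the configured intervals and count them.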
