Struggling with simple example


Igor Gatis

Nov 1, 2013, 12:39:29 PM
to dumbo...@googlegroups.com
I'm trying to understand -inputformat and -outputformat so that I can chain MapReduce jobs. In the example below, I expected input.seq.txt to be identical to input.txt, but it is not. Why? Also, notice that input.seq.txt starts with an invalid leading TAB. This is on a single-node Hadoop 1.0.3 cluster.

(Side question: my ultimate goal is to create a sequence file containing binary data with python-hadoop, run a couple of chained Hadoop MapReduce jobs over it, and read the final data back from a sequence file, also with python-hadoop. How do I do that with Dumbo?)

$ cat input.txt 
k1      v1
k2      v2
k3      v3
k4      v4

$ dumbo start test.py -hadoop hadoop \
  -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -input /samples/input.txt \
  -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
  -output /samples/input.seq \
  -overwrite yes

$ dumbo get /samples/input.seq/part-00000 input.seq -hadoop hadoop

$ xxd < input.seq
0000000: 5345 5106 2f6f 7267 2e61 7061 6368 652e  SEQ./org.apache.
0000010: 6861 646f 6f70 2e74 7970 6564 6279 7465  hadoop.typedbyte
0000020: 732e 5479 7065 6442 7974 6573 5772 6974  s.TypedBytesWrit
0000030: 6162 6c65 2f6f 7267 2e61 7061 6368 652e  able/org.apache.
0000040: 6861 646f 6f70 2e74 7970 6564 6279 7465  hadoop.typedbyte
0000050: 732e 5479 7065 6442 7974 6573 5772 6974  s.TypedBytesWrit
0000060: 6162 6c65 0000 0000 0000 0d2c 2382 bfbc  able.......,#...
0000070: 5d6a 4e4d 772a 44e6 9962 0000 0012 0000  ]jNMw*D..b......
0000080: 0009 0000 0005 0700 0000 0000 0000 0507  ................
0000090: 0000 0000 0000 0016 0000 000b 0000 0007  ................
00000a0: 0700 0000 0276 3100 0000 0707 0000 0002  .....v1.........
00000b0: 6b31 0000 0016 0000 000b 0000 0007 0700  k1..............
00000c0: 0000 0276 3200 0000 0707 0000 0002 6b32  ...v2.........k2
00000d0: 0000 0016 0000 000b 0000 0007 0700 0000  ................
00000e0: 0276 3300 0000 0707 0000 0002 6b33 0000  .v3.........k3..
00000f0: 0016 0000 000b 0000 0007 0700 0000 0276  ...............v
0000100: 3400 0000 0707 0000 0002 6b34            4.........k4
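
For reference, here is a minimal sketch of how one record in the dump above can be decoded by hand. It assumes the uncompressed SequenceFile record layout (4-byte record length, 4-byte key length, key, value), a 4-byte length prefix on each TypedBytesWritable, and typed bytes type code 7 for strings. The helper names are hypothetical and not part of Dumbo or python-hadoop. The sample bytes are copied from offsets 0x94 to 0xb1.

```python
import struct

# Typed bytes type code for a string: 1-byte code, 4-byte length, then data.
TYPEDBYTES_STRING = 7


def read_typedbytes_writable(buf, pos):
    """Decode one TypedBytesWritable holding a string; return (text, next_pos)."""
    (size,) = struct.unpack_from(">i", buf, pos)  # length prefix of the writable
    pos += 4
    end = pos + size
    code = buf[pos]
    if code != TYPEDBYTES_STRING:
        raise ValueError("unsupported typed bytes code: %d" % code)
    (strlen,) = struct.unpack_from(">i", buf, pos + 1)
    text = buf[pos + 5:pos + 5 + strlen].decode("utf-8")
    return text, end


def read_record(buf, pos):
    """Decode one uncompressed record; return (key, value, next_pos)."""
    # The two lengths are read only to skip past them in this sketch.
    record_len, key_len = struct.unpack_from(">ii", buf, pos)
    pos += 8
    key, pos = read_typedbytes_writable(buf, pos)
    value, pos = read_typedbytes_writable(buf, pos)
    return key, value, pos


# Second record of input.seq, copied from the xxd output above.
record = bytes.fromhex(
    "00000016" "0000000b"          # record length 22, key length 11
    "00000007" "07" "00000002" "7631"  # key bytes
    "00000007" "07" "00000002" "6b31"  # value bytes
)

key, value, _ = read_record(record, 0)
print(key, value)
```

Read this way, the record stores v1 as the key and k1 as the value, that is, swapped relative to input.txt. The record before it (offsets 0x7a to 0x93) holds an empty key and an empty value.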


$ dumbo start test.py -hadoop hadoop \
  -inputformat org.apache.hadoop.mapred.SequenceFileInputFormat \
  -input /samples/input.seq/part-00000 \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat \
  -output /samples/input.seq.txt \
  -overwrite yes

$ dumbo get /samples/input.seq.txt/part-00000 input.seq.txt -hadoop hadoop

$ xxd < input.seq.txt
0000000: 0953 4551 062f 6f72 672e 6170 6163 6865  .SEQ./org.apache
0000010: 2e68 6164 6f6f 702e 7479 7065 6462 7974  .hadoop.typedbyt
0000020: 6573 2e54 7970 6564 4279 7465 7357 7269  es.TypedBytesWri
0000030: 7461 626c 652f 6f72 672e 6170 6163 6865  table/org.apache
0000040: 2e68 6164 6f6f 702e 7479 7065 6462 7974  .hadoop.typedbyt
0000050: 6573 2e54 7970 6564 4279 7465 7357 7269  es.TypedBytesWri
0000060: 7461 626c 6500 0000 0000 000a 0000 0005  table...........
0000070: 0700 0000 0000 0000 0507 0000 0000 0000  ................
0000080: 0016 0000 000b 0000 0007 0700 0000 0276  ...............v
0000090: 3100 0000 0707 0000 0002 6b31 0000 0016  1.........k1....
00000a0: 0000 000b 0000 0007 0700 0000 0276 3200  .............v2.
00000b0: 0000 0707 0000 0002 6b32 0000 0016 0000  ........k2......
00000c0: 000b 0000 0007 0700 0000 0276 3300 0000  ...........v3...
00000d0: 0707 0000 0002 6b33 0000 0016 0000 000b  ......k3........
00000e0: 0000 0007 0700 0000 0276 3400 0000 0707  .........v4.....
00000f0: 0000 0002 6b34 092c 23ef bfbd efbf bdef  ....k4.,#.......
0000100: bfbd 5d6a 4e4d 772a 44ef bfbd 6200 0000  ..]jNMw*D...b...
0000110: 1200 0000 0a                             .....
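
One observation on this dump, offered as a guess rather than a diagnosis: the 16-byte sync marker from input.seq (offsets 0x6a to 0x79 in the first dump) shows up here with its non-UTF-8 bytes replaced by ef bf bd, the UTF-8 encoding of U+FFFD. That is what a lossy bytes-to-text-to-bytes round trip produces, which suggests the file contents passed through a text conversion instead of being parsed as sequence-file records. The sketch below reproduces the substitution in plain Python; the function name is hypothetical, and it assumes the conversion in the job behaves like Python's "replace" error handler.

```python
# 16-byte sync marker as it appears in input.seq.
sync_marker = bytes.fromhex("0d2c2382bfbc5d6a4e4d772a44e69962")

# Bytes seen in input.seq.txt right after the TAB at offset 0xf6.
seen_in_output = bytes.fromhex("2c23efbfbdefbfbdefbfbd5d6a4e4d772a44efbfbd62")


def through_utf8_text(raw):
    """Decode bytes as UTF-8 text, replacing invalid sequences, then re-encode."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


mangled = through_utf8_text(sync_marker)

# The leading 0x0d (carriage return) is skipped in the comparison because it
# does not appear in the text output; it seems to have been treated as a line
# terminator.
print(mangled[1:] == seen_in_output)
```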
