Struggling with simple example

Igor Gatis

Nov 1, 2013, 12:39:29 PM

to dumbo...@googlegroups.com

I'm trying to understand how -inputformat and -outputformat work so that I can chain MapReduce jobs. In the example below, I expected input.seq.txt to be identical to input.txt, but it is not. Why? Also, notice that input.seq.txt starts with an unexpected leading TAB. This is on a single-node cluster running Hadoop 1.0.3.

(Side question: my ultimate goal is to create a sequence file containing binary data with python-hadoop, run it through a couple of chained Hadoop MapReduce jobs, and read the final output from a sequence file, again with python-hadoop. How do I do that with Dumbo?)

$ cat input.txt

k1 v1

k2 v2

k3 v3

k4 v4
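
For reference, this is how I understand KeyValueTextInputFormat to treat each line of input.txt: it splits at the first separator (a TAB by default), and a line without a separator becomes a key with an empty value. A minimal Python sketch of that rule follows; `split_key_value` is a hypothetical helper written for illustration, not part of Hadoop or Dumbo.

```python
def split_key_value(line, separator="\t"):
    """Split a text line into (key, value) at the first separator.

    Mirrors the rule described above: everything before the first
    separator is the key, everything after it is the value, and a line
    with no separator yields the whole line as key and an empty value.
    """
    key, _, value = line.partition(separator)
    return key, value


if __name__ == "__main__":
    for line in ["k1\tv1", "no-separator-here", ""]:
        print(repr(line), "->", split_key_value(line))
```

Under this rule an empty line yields an empty key and an empty value.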

$ dumbo start test.py -hadoop hadoop \
    -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -input /samples/input.txt \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
    -output /samples/input.seq \
    -overwrite yes

$ dumbo get /samples/input.seq/part-00000 input.seq -hadoop hadoop

$ xxd < input.seq

0000000: 5345 5106 2f6f 7267 2e61 7061 6368 652e SEQ./org.apache.

0000010: 6861 646f 6f70 2e74 7970 6564 6279 7465 hadoop.typedbyte

0000020: 732e 5479 7065 6442 7974 6573 5772 6974 s.TypedBytesWrit

0000030: 6162 6c65 2f6f 7267 2e61 7061 6368 652e able/org.apache.

0000040: 6861 646f 6f70 2e74 7970 6564 6279 7465 hadoop.typedbyte

0000050: 732e 5479 7065 6442 7974 6573 5772 6974 s.TypedBytesWrit

0000060: 6162 6c65 0000 0000 0000 0d2c 2382 bfbc able.......,#...

0000070: 5d6a 4e4d 772a 44e6 9962 0000 0012 0000 ]jNMw*D..b......

0000080: 0009 0000 0005 0700 0000 0000 0000 0507 ................

0000090: 0000 0000 0000 0016 0000 000b 0000 0007 ................

00000a0: 0700 0000 0276 3100 0000 0707 0000 0002 .....v1.........

00000b0: 6b31 0000 0016 0000 000b 0000 0007 0700 k1..............

00000c0: 0000 0276 3200 0000 0707 0000 0002 6b32 ...v2.........k2

00000d0: 0000 0016 0000 000b 0000 0007 0700 0000 ................

00000e0: 0276 3300 0000 0707 0000 0002 6b33 0000 .v3.........k3..

00000f0: 0016 0000 000b 0000 0007 0700 0000 0276 ...............v

0000100: 3400 0000 0707 0000 0002 6b34 4.........k4
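
To make the dump above easier to read, here is a small Python sketch that decodes the first two records (the bytes from offset 0x7a up to 0xb1). It assumes the usual uncompressed SequenceFile record layout (4-byte record length, 4-byte key length, key bytes, value bytes), that each TypedBytesWritable is serialized as a 4-byte length followed by its payload, and that typed-bytes type code 7 means a length-prefixed UTF-8 string. The function names are my own, and the sketch skips the file header and sync marker entirely.

```python
import struct


def read_typed_string(buf, pos):
    """Decode one TypedBytesWritable that holds a typed-bytes string.

    Assumed layout: 4-byte big-endian payload length, then a payload of
    1 type-code byte (7 = string), a 4-byte string length, and the bytes.
    Returns (text, position just past the payload).
    """
    (payload_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    type_code = buf[pos]
    if type_code != 7:
        raise ValueError("expected typed-bytes string (7), got %d" % type_code)
    (str_len,) = struct.unpack_from(">i", buf, pos + 1)
    if payload_len != 1 + 4 + str_len:
        raise ValueError("payload length does not match string length")
    start = pos + 5
    return buf[start:start + str_len].decode("utf-8"), pos + payload_len


def read_records(buf):
    """Decode consecutive (key, value) records from raw record bytes."""
    records = []
    pos = 0
    while pos < len(buf):
        record_len, key_len = struct.unpack_from(">ii", buf, pos)
        body = pos + 8
        key, key_end = read_typed_string(buf, body)
        value, value_end = read_typed_string(buf, key_end)
        # Sanity-check the parsed sizes against the length fields.
        if key_end - body != key_len or value_end - body != record_len:
            raise ValueError("record lengths do not match parsed sizes")
        records.append((key, value))
        pos = value_end
    return records


# Bytes copied from the xxd output above, offsets 0x7a through 0xb1.
DUMP = bytes.fromhex(
    "00000012" "00000009" "00000005" "07" "00000000" "00000005" "07" "00000000"
    "00000016" "0000000b" "00000007" "07" "00000002" "7631"
    "00000007" "07" "00000002" "6b31"
)

if __name__ == "__main__":
    for key, value in read_records(DUMP):
        print(repr(key), repr(value))
```

Read this way, the first record has an empty key and an empty value, and the second has key "v1" and value "k1", so keys and values appear swapped relative to input.txt.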

$ dumbo start test.py -hadoop hadoop \
    -inputformat org.apache.hadoop.mapred.SequenceFileInputFormat \
    -input /samples/input.seq/part-00000 \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat \
    -output /samples/input.seq.txt \
    -overwrite yes

$ dumbo get /samples/input.seq.txt/part-00000 input.seq.txt -hadoop hadoop

$ xxd < input.seq.txt

0000000: 0953 4551 062f 6f72 672e 6170 6163 6865 .SEQ./org.apache

0000010: 2e68 6164 6f6f 702e 7479 7065 6462 7974 .hadoop.typedbyt

0000020: 6573 2e54 7970 6564 4279 7465 7357 7269 es.TypedBytesWri

0000030: 7461 626c 652f 6f72 672e 6170 6163 6865 table/org.apache

0000040: 2e68 6164 6f6f 702e 7479 7065 6462 7974 .hadoop.typedbyt

0000050: 6573 2e54 7970 6564 4279 7465 7357 7269 es.TypedBytesWri

0000060: 7461 626c 6500 0000 0000 000a 0000 0005 table...........

0000070: 0700 0000 0000 0000 0507 0000 0000 0000 ................

0000080: 0016 0000 000b 0000 0007 0700 0000 0276 ...............v

0000090: 3100 0000 0707 0000 0002 6b31 0000 0016 1.........k1....

00000a0: 0000 000b 0000 0007 0700 0000 0276 3200 .............v2.

00000b0: 0000 0707 0000 0002 6b32 0000 0016 0000 ........k2......

00000c0: 000b 0000 0007 0700 0000 0276 3300 0000 ...........v3...

00000d0: 0707 0000 0002 6b33 0000 0016 0000 000b ......k3........

00000e0: 0000 0007 0700 0000 0276 3400 0000 0707 .........v4.....

00000f0: 0000 0002 6b34 092c 23ef bfbd efbf bdef ....k4.,#.......

0000100: bfbd 5d6a 4e4d 772a 44ef bfbd 6200 0000 ..]jNMw*D...b...

0000110: 1200 0000 0a .....
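
One more observation, offered as a sketch rather than a diagnosis: the 16 bytes that follow the header in input.seq (0d2c 2382 bfbc 5d6a 4e4d 772a 44e6 9962, which I take to be the sync marker) show up near the end of input.seq.txt with several bytes turned into ef bf bd, the UTF-8 encoding of U+FFFD. That is what a lossy UTF-8 decode followed by a re-encode produces. The Python below only demonstrates that round trip on those bytes; it does not model what Hadoop or Dumbo actually do internally.

```python
# Bytes copied from the input.seq dump, offsets 0x66 through 0x75.
sync_marker = bytes.fromhex("0d2c2382bfbc5d6a4e4d772a44e69962")

# Decode as UTF-8, replacing invalid sequences with U+FFFD, then re-encode.
round_tripped = sync_marker.decode("utf-8", errors="replace").encode("utf-8")

if __name__ == "__main__":
    print(sync_marker.hex())
    print(round_tripped.hex())
```

Apart from the leading 0d, which is missing from input.seq.txt, the round-tripped bytes match the sequence 2c 23 ef bf bd ... 62 seen at offset 0xf7 of that dump. So the sequence file seems to have been handled as text somewhere along the way.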
