reading dumbo output data


Seongsu Lee

Aug 21, 2011, 2:16:40 AM
to dumbo-user
Hi,

I am trying to use Dumbo to run my Python scripts on Hadoop.

I wrote a test wordcount.py program to run on Hadoop with the Dumbo
framework, with typedbytes enabled. The reducer output looks like
this:

SEQ^F/org.apache.hadoop.typedbytes.TypedBytesWritable/
org.apache.hadoop.typedbytes.TypedBytesWritable^@^@^@^@^@^@S???^?????
p?.^Y??^@^@^@^U^@^@^@^L^@^@^@^H^G^@^ [... much more follows]

I tried to read the above data with the following Python code at the
Python prompt:

>>> import typedbytes
>>> for x in typedbytes.PairedInput(open('part-00000', 'rb')).reads(): print x
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/typedbytes.py", line 355, in reads
  File "build/bdist.linux-x86_64/egg/typedbytes.py", line 85, in _reads
  File "build/bdist.linux-x86_64/egg/typedbytes.py", line 74, in _read
  File "build/bdist.linux-x86_64/egg/typedbytes.py", line 163, in invalid_typecode
struct.error: Invalid type byte: 83

Do I need to convert the reducer output before passing it to the
typedbytes.PairedInput() function? Or any other ideas? Thanks in advance.
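As a hint, the failing type byte is not arbitrary: the output is a
Hadoop SequenceFile, so it begins with the magic bytes "SEQ" instead of
a typed bytes type code. A minimal sketch in plain Python (nothing
Dumbo-specific assumed):

```python
import struct

# The typedbytes reader interprets the first byte of the stream as a
# type code. A SequenceFile starts with the magic "SEQ", and ord('S')
# is 83 -- exactly the value in the struct.error above.
magic = b"SEQ"
type_byte = struct.unpack("!B", magic[:1])[0]
print(type_byte)  # 83
```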

Seongsu

----
wordcount.py follows:

#!/usr/bin/env python

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)

I am using Hadoop 0.21.0, Dumbo 0.21.30, and Python 2.7.2.
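The mapper and reducer above can be sanity-checked locally without
Hadoop or Dumbo. A minimal sketch that simulates the shuffle (sort by
key, then group) in plain Python, assuming the same mapper/reducer
definitions:

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

# Simulate the Hadoop shuffle: run the mapper, sort the emitted pairs
# by key, group them, and feed each group's values to the reducer.
pairs = sorted(mapper(None, "a b a"), key=itemgetter(0))
result = []
for word, group in groupby(pairs, key=itemgetter(0)):
    result.extend(reducer(word, (v for _, v in group)))
print(result)  # [('a', 2), ('b', 1)]
```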

air

Aug 21, 2011, 10:48:49 PM
to dumbo...@googlegroups.com
use -outputformat text when running your program

2011/8/21 Seongsu Lee <se...@senux.com>

--
You received this message because you are subscribed to the Google Groups "dumbo-user" group.
To post to this group, send email to dumbo...@googlegroups.com.
To unsubscribe from this group, send email to dumbo-user+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/dumbo-user?hl=en.




--
Knowledge Management.

Seongsu Lee

Aug 22, 2011, 1:01:52 AM
to dumbo-user
Hi, thanks for your reply.
Using '-outputformat text' makes the reducer output plain text. But
what I want is to save the reducer output in typedbytes format and
read it back, to save parsing time and to avoid verbose parsing code.

On Aug 22, 11:48 am, air <cnwe...@gmail.com> wrote:
> use *-outputformat text* when running your program

air

Aug 22, 2011, 3:33:00 AM
to dumbo...@googlegroups.com
try 

dumbo get 



2011/8/22 Seongsu Lee <se...@senux.com>





Klaas Bosteels

Aug 22, 2011, 4:52:04 AM
to dumbo...@googlegroups.com
(Sorry if I posted this message several times, the mail app in Lion
seems to be a bit buggy.)

Hey Seongsu,

Dumbo outputs typed bytes writables to sequence files by default. You
can convert this output to text by means of the "dumbo cat" command, or
you can use Hadoop Streaming's "dumptb" command to convert the
sequence files to typed bytes first and then read them using the
"typedbytes" module (which is what the "dumbo cat" command does under
the hood).

Hope this helps,
-Klaas
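To give a feel for what the "typedbytes" module does with the dumptb
output, here is a minimal sketch of a typed bytes pair reader. It uses
only the stdlib and handles just two type codes (3 = 32-bit int,
7 = UTF-8 string); it is an illustration, not the real typedbytes
implementation:

```python
import struct
from io import BytesIO

def read_value(stream):
    # Each value is a 1-byte type code followed by a typed payload.
    code = stream.read(1)
    if not code:
        return None  # end of stream
    code = struct.unpack("!B", code)[0]
    if code == 3:   # 32-bit big-endian signed int
        return struct.unpack("!i", stream.read(4))[0]
    if code == 7:   # length-prefixed UTF-8 string
        length = struct.unpack("!i", stream.read(4))[0]
        return stream.read(length).decode("utf-8")
    raise ValueError("unhandled type code: %d" % code)

def read_pairs(stream):
    # Records are key/value pairs written back to back.
    while True:
        key = read_value(stream)
        if key is None:
            return
        yield key, read_value(stream)

# A hand-built stream holding one ("word", 2) pair:
buf = BytesIO(b"\x07\x00\x00\x00\x04word\x03\x00\x00\x00\x02")
print(list(read_pairs(buf)))  # [('word', 2)]
```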

Seongsu Lee

Aug 22, 2011, 5:53:50 AM
to dumbo-user
Hi Klaas,

I found that the data converted by "dumptb" can be read by the
"typedbytes" module in Python. It works and is very helpful.

Thank you!
Seongsu

Klaas Bosteels

Oct 24, 2011, 10:19:18 AM
to Bharath Krishnan, dumbo...@googlegroups.com
Ah, I see -- that won't work, indeed.

The problem is that dumbo cat behaves very differently depending on what backend it uses:

1) When you provide a -hadoop /path/to/hadoop option, it will use the hadoop streaming backend and run the dumptb command under the hood and convert the outputted typed bytes on the fly to text.

2) When you don't provide a -hadoop option, it will use the local unix backend which will expect the output file to be a text file containing python repr strings.

-Klaas
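The difference is visible in code: the local backend ends up parsing
text lines of the form repr(key)<TAB>repr(value), roughly like the
sketch below (using ast.literal_eval instead of eval for safety). Fed
binary sequence-file data, such a parser fails with a SyntaxError, just
like the quoted traceback:

```python
import ast

def loadcode(lines):
    # The local unix backend expects "repr(key)\trepr(value)" text lines
    # and evaluates each side back into Python objects.
    for line in lines:
        key_repr, value_repr = line.rstrip("\n").split("\t", 1)
        yield ast.literal_eval(key_repr), ast.literal_eval(value_repr)

print(list(loadcode(["'word'\t2\n"])))  # [('word', 2)]
```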


On 24 Oct 2011, at 16:06, Bharath Krishnan wrote:

> Hi Klaas,
>
> I ran the streaming job on Amazon & copied one of the output files to
> my local disk from s3.
>
> Then I just called:
>
> dumbo cat filename
>
> I re-read the thread after I emailed you and figured out a way to make
> it work. (It still is confusing to me why dumbo cat wouldn't work).
>
> If I do the following:
>
> hadoop jar hadoop-streaming.jar dumptb part-00000 > data.tb
>
> and then read the file in python using the typedbytes module, it works.
>
> import typedbytes
> for x in typedbytes.PairedInput(open('data.tb', 'rb')).reads():
>     print x
>
> dumbo cat data.tb does not work though.
>
> Thanks!
>
> -bharath
>
>
> On Mon, Oct 24, 2011 at 10:00 AM, Klaas Bosteels
> <klaas.b...@gmail.com> wrote:
>> Hey bharath,
>>
>> It's normal for your output to be saved as sequence files containing typed bytes writables, but you should be able to print it using the dumbo cat command.
>>
>> The output below seems to suggest that you are somehow running dumbo cat as a map reduce job on hadoop. What command are you executing exactly?
>>
>> -K
>>
>> On 24 Oct 2011, at 15:56, bharath wrote:
>>
>>> Hi Klaas,
>>>
>>> I too have similar problems reading dumbo output. The data file starts
>>> like this:
>>>
>>> SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/
>>> org.apache.hadoop.typedbytes.TypedBytesWritable?l|?7gB?\[]'?_?
>>> A1920G0009_10_G? ??{0?#c?L2613L0024_LB_??&??u[A1812G0???
>>> eA1812G0019_J6_?͛?㭇12_01_??GBv?,?H7705H0061_QD_??GBv?,?
>>> H7704G0033_82_??
>>>
>>> When I try to convert it to text using dumbo cat, I get a bunch of
>>> warnings followed by an error:
>>>
>>> WARNING: skipping bad input (SEQ/
>>> org.apache.hadoop.typedbytes.TypedBytesWritable/
>>> org.apache.hadoop.typedbytes.TypedBytesWritable?l|?7gB?\[]'?_?
>>> A1920G0009_10_G? ??{0?#c?L???eA1812G0019_J6_?͛?㭇812G0019_01_????
>>> wcmA3726M0012_01_??GBv?,?H7705H0061_QD_??GBv?,?H7704G0033_82_??
>>> ?=?B2324H0034_TB_????]M??L2720L0013_2W_????5A??B8267H0005_01_???P?S?
>>> B2720L0023_01_?ɮ??k?B8263F0004_10_??uB?WA1813G0017_J6_??l(ݣ ?
>>> H3827M0002_9I_??L^?d[M1812G0025_NN??????-M1812G0026_NN_??Wn?
>>> M3612M0073_WI_?ǓU+?pB8267C0003_01_??j????M1810G0056_CW_????
>>> 3S2611L0109_7E_??)
>>> reporter:counter:Dumbo,Bad inputs,1
>>>
>>> Traceback (most recent call last):
>>>   File "/usr/local/Cellar/python/2.7.1/bin/dumbo", line 9, in <module>
>>>     load_entry_point('dumbo==0.21.30', 'console_scripts', 'dumbo')()
>>>   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/site-packages/dumbo/__init__.py", line 32, in execute_and_exit
>>>     sys.exit(dumbo())
>>>   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/site-packages/dumbo/cmd.py", line 41, in dumbo
>>>     retval = cat(sys.argv[2], parseargs(sys.argv[2:]))
>>>   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/site-packages/dumbo/cmd.py", line 94, in cat
>>>     return create_filesystem(opts).cat(path, opts)
>>>   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/site-packages/dumbo/backends/unix.py", line 116, in cat
>>>     return decodepipe(opts + [('file', path)])
>>>   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/site-packages/dumbo/cmd.py", line 155, in decodepipe
>>>     for output in dumptext(outputs):
>>>   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/site-packages/dumbo/util.py", line 65, in dumptext
>>>     for output in outputs:
>>>   File "/usr/local/Cellar/python/2.7.1/lib/python2.7/site-packages/dumbo/util.py", line 55, in loadcode
>>>     yield map(eval, input.split('\t', 1))
>>>   File "<string>", line 1
>>>     ?
>>>     ^
>>> SyntaxError: unexpected EOF while parsing
>>>
>>> Any help is much appreciated!
>>>
>>> Thanks,
>>>
>>> -bharath

Bharath Krishnan

Oct 24, 2011, 10:21:41 AM
to Klaas Bosteels, dumbo...@googlegroups.com
Thanks, that makes sense.

-bharath

On Mon, Oct 24, 2011 at 10:19 AM, Klaas Bosteels
