HBase & Dumbo error: Cast string to buffer


Nathan

Nov 29, 2010, 2:01:52 PM
to dumbo-user
I had a demo of tf-idf.py working and writing to an HBase table called
webstat. I created the table with one column family, "w". But now I
receive this warning, which causes a broken pipe error on the Python
side. Here is the Java error I found:

2010-11-29 12:16:09,591 WARN org.apache.hadoop.streaming.PipeMapRed:
java.io.IOException: couldn't get column values, expecting Map<Buffer, Map<Buffer, Buffer>>
    at fm.last.hbase.mapred.TypedBytesTableOutputFormat$TableRecordWriter.write(TypedBytesTableOutputFormat.java:100)
    at fm.last.hbase.mapred.TypedBytesTableOutputFormat$TableRecordWriter.write(TypedBytesTableOutputFormat.java:51)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.hadoop.record.Buffer
    at fm.last.hbase.mapred.TypedBytesTableOutputFormat$TableRecordWriter.write(TypedBytesTableOutputFormat.java:83)
    ... 3 more

This is the final reduce job in the chain, where it writes back to
HBase:

from math import log

class Reducer2(Reducer):
    def __init__(self):
        Reducer.__init__(self)
        self.doccount = float(self.params["doccount"])
    def secondary(self, key, values):
        idf = log(self.doccount / self.sum)
        for (doc, tf) in values:
            yield key, {"w": {"tfidf": tf * idf}}  # columns

I am using the current lasthbase code from GitHub. The patched Java
files posted here earlier aren't available anymore, so I don't know
whether they would fix my problem. I am using HBase 0.20.6 and
Cloudera's CDH2. I had this working before, but I don't know what has
changed.

Rafael Carrascosa

Nov 29, 2010, 4:44:44 PM
to dumbo...@googlegroups.com
I don't know if this has anything to do with it, but in case it
helps... what about trying "str(tf * idf)" instead of "tf * idf"?
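
That is, keeping the rest of Reducer2 (and its imports) the same, a
sketch of the changed method would be:

def secondary(self, key, values):
    idf = log(self.doccount / self.sum)
    for (doc, tf) in values:
        # hand the value to the output format as a plain byte string
        yield key, {"w": {"tfidf": str(tf * idf)}}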


Nathan

Nov 29, 2010, 5:31:00 PM
to dumbo-user
Yeah, that is what I had to do the first time around, when it worked.
For some reason, both variations now give me exactly the same error,
which makes me think the problem lies elsewhere. I will keep plugging
away and post if I have any success.


mr. luk

Dec 3, 2010, 4:36:18 AM
to dumbo-user
Hi there,
It was me who broke TypedBytesTableOutputFormat. Sorry for that!
lasthbase originally converted everything coming from HBase to a
UTF-8 encoded unicode string. As there are some byte sequences that
are not representable in UTF-8, binary data stored in HBase and
retrieved through lasthbase gets altered (i.e. non-UTF-8 byte
sequences are replaced by "0xEF 0xBF 0xBD"). That is why I made the
changes. Unfortunately, I had no code in place for testing
TypedBytesTableOutputFormat (see
http://github.com/tims/lasthbase/issues/1).
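
A minimal illustration of that corruption (the cell value here is
hypothetical): decoding arbitrary binary data as UTF-8 with
replacement turns every invalid byte into U+FFFD, which re-encodes to
exactly those three bytes:

raw = '\xff\xfe\x00'                      # hypothetical binary cell value
decoded = raw.decode('utf-8', 'replace')  # roughly what the old conversion did
print repr(decoded.encode('utf-8'))       # '\xef\xbf\xbd\xef\xbf\xbd\x00'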

When I dug a bit deeper into this issue, I discovered that in the
Python implementation of typedbytes (which is, afaik, the data de-/
serialization protocol used in Hadoop streaming), Python strings
(which are just byte sequences) are read and written as typedbytes
STRINGs. But in the typedbytes protocol definition
(http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/typedbytes/package-summary.html)
typedbytes STRINGs are UTF-8 encoded strings. So I think there's a
mapping mismatch (as all possible byte sequences can occur in Python
strings).
I would find it more consistent to map Python strings to the
typedbytes BYTES data type and Python unicode objects to typedbytes
STRINGs. This would also mean treating everything as bytes in
lasthbase (as it is now), which would be consistent with the Java API.

What do you think?

Best regards,
Lukas

Klaas Bosteels

Dec 3, 2010, 6:22:07 AM
to dumbo...@googlegroups.com
I think we used to do it like that for a short while but eventually we
decided to use regular python strings for the typed bytes strings
after all, mainly for performance/speed reasons iirc. I'm open to
clever suggestions that would improve the situation without affecting
performance or backwards compatibility too much though...

-K

mr. luk

Dec 3, 2010, 8:55:24 AM
to dumbo-user

I would propose to treat Python strings as typedbytes BYTES and
Python unicode objects as typedbytes STRINGs.

I fixed up TypedBytesTableOutputFormat to work with this mapping (and
tested it), so it should not throw the ClassCastException anymore.
However, for this to work, typedbytes.py (and supposedly ctypedbytes
as well) needs to be patched as follows:

--- typedbytes.py.orig  2010-12-03 14:36:29.921496000 +0100
+++ typedbytes.py       2010-12-03 14:35:12.789370000 +0100
@@ -181,7 +181,7 @@
     LONG: read_long,
     FLOAT: read_float,
     DOUBLE: read_double,
-    STRING: read_string,
+    STRING: read_bytes,
     VECTOR: read_vector,
     LIST: read_list,
     MAP: read_map,
@@ -322,7 +322,7 @@
     IntType: write_int,
     LongType: write_long,
     FloatType: write_double,
-    StringType: write_string,
+    StringType: write_bytes,
     TupleType: write_vector,
     ListType: write_list,
     DictType: write_map,


Regarding backward compatibility, I don't see a problem for UTF-8
encoded unicode strings. However, I might have overlooked something.

Best regards,
Lukas

Klaas Bosteels

Dec 3, 2010, 10:09:40 AM
to dumbo...@googlegroups.com
There might be a performance penalty for that though. Strings are a
very common datatype on Hadoop and converting between utf8 and unicode
all the time can add up to substantial amounts of wasted time. I think
I'd at least like to see some benchmark numbers before implementing
the switch.
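
For instance, a rough micro-benchmark along these lines (payload size
and iteration count are arbitrary) would already give an idea of the
marginal cost of decoding every string:

import timeit

setup = "data = 'abcdefgh' * 128"  # a 1 KiB ASCII payload as a plain str
bytes_time = timeit.timeit("data", setup=setup, number=1000000)
unicode_time = timeit.timeit("data.decode('utf-8')", setup=setup, number=1000000)
print "pass-through: %.3fs, decode to unicode: %.3fs" % (bytes_time, unicode_time)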

-K

Klaas Bosteels

Dec 3, 2010, 10:49:13 AM
to dumbo...@googlegroups.com
That patch seems wrong btw, I think you meant to do something like this:

    STRING: read_unicode
    StringType: write_bytes

I guess we could also fix your problem by only changing the latter line, i.e.:

* read typedbytes bytes to regular python strings
* read typedbytes strings to regular python strings
* write regular python strings to typedbytes bytes
* write unicode python strings to typedbytes strings

That should still work but would be a bit weird I guess.
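
Spelled out as dispatch-table entries in the style of the patch above
(UnicodeType and write_unicode are assumed names here, not checked
against the module), that mapping would be:

    # input mapping
    BYTES: read_bytes,           # typedbytes bytes   -> regular python str
    STRING: read_string,         # typedbytes strings -> regular python str

    # output mapping
    StringType: write_bytes,     # regular python str -> typedbytes bytes
    UnicodeType: write_unicode,  # python unicode     -> typedbytes strings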

And another way of avoiding the speed issue might be to implement fast
unicode reading and writing C functions in ctypedbytes maybe.

In any case, it would be good to create a ticket for this and gather
the different thoughts there. Could you be persuaded to do that you
think, Lukas? :)

-K

mr. luk

Dec 3, 2010, 11:15:41 AM
to dumbo-user
Hi Klaas,
I agree with you. But with the patch applied, there is imho even less
string conversion going on in the pipeline.
Before, TypedBytesTableInputFormat converted the bytes coming from
HBase to a UTF-16 encoded Java string (Bytes.toString()), which was
then encoded to UTF-8 bytes in hadoop.streaming
(String.getBytes("UTF-8")).
In the other direction, the bytes coming from Python were interpreted
as a UTF-8 encoded string, converted to a UTF-16 encoded Java string
in hadoop.streaming (new String(buffer, "UTF-8")), and then encoded
back to UTF-8 bytes in TypedBytesTableOutputFormat (Bytes.toBytes()).
With the proposed modifications, neither conversion happens anymore,
as the data is just treated as bytes (which should be faster).
As soon as I get time, I hope to verify the above statements with a
proper benchmark and report back.

Best regards,
Lukas



Klaas Bosteels

Dec 3, 2010, 11:22:06 AM
to dumbo...@googlegroups.com
I was mostly talking about more traditional Hadoop use cases actually.
It's very common to take big text files as input on Hadoop, which will
get served up to dumbo scripts as typedbytes strings. Right now those
strings will not be converted further, but if we switch to reading
typedbytes strings as Python unicode strings then the dumbo script
will have to do lots of utf8 -> Python unicode conversions.

-K

mr. luk

Dec 3, 2010, 11:28:00 AM
to dumbo-user
On Dec 3, 4:49 pm, Klaas Bosteels <klaas.boste...@gmail.com> wrote:
> That patch seems wrong btw, I think you meant to do something like this:
>
>     STRING: read_unicode
>     StringType: write_bytes

Yes, I mixed things up in the patch there.

> I guess we could also fix your problem by only changing the latter line, i.e.:
>
> * read typedbytes bytes to regular python strings
> * read typedbytes strings to regular python strings
> * write regular python strings to typedbytes bytes
> * write unicode python strings to typedbytes strings
>
> That should still work but would be a bit weird I guess.

Almost; I think
* read typedbytes bytes to regular python strings
* read typedbytes strings to python unicode strings
* write regular python strings to typedbytes bytes
* write unicode python strings to typedbytes strings
would imho not be weird.

> And another way of avoiding the speed issue might be to implement fast
> unicode reading and writing C functions in ctypedbytes maybe.
In my case speed was not the issue, but data corruption (see my first
comment).

> In any case, it would be good to create a ticket for this and gather
> the different thoughts there. Could you be persuaded to do that you
> think, Lukas? :)

Hehe, of course.


mr. luk

Dec 3, 2010, 11:57:05 AM
to dumbo-user
Oh, I was not fully aware of that.
On the one hand, I feel this (the utf8 to Python unicode conversion)
would be the proper and clean way to go; on the other hand, I see the
big performance issue.
I'll open an issue in the lasthbase issue tracker (as you proposed)
and we can discuss it there.

Best,
Lukas


mr. luk

Dec 5, 2010, 9:38:57 AM
to dumbo-user
Hi there,
The issue now has a new home at
http://github.com/tims/lasthbase/issues/issue/3
I tried to give a short summary of this discussion there. Please feel
free to add any comments.

Best regards,
Lukas
