INFO: inputting typed bytes
INFO: buffersize = 168960
INFO: outputting typed bytes
Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/home/mat/work/lalisio/data/hadoop-0.20/mapred/local/taskTracker/jobcache/job_200907271130_0005/attempt_200907271130_0005_m_000000_0/work/mrtest.py", line 10, in <module>
    dumbo.run(mapper,reducer,combiner=reducer)
  File "build/bdist.linux-i686/egg/dumbo/core.py", line 611, in run
  File "build/bdist.linux-i686/egg/typedbytes.py", line 371, in writes
  File "build/bdist.linux-i686/egg/typedbytes.py", line 237, in _writes
  File "build/bdist.linux-i686/egg/typedbytes.py", line 210, in flatten
  File "build/bdist.linux-i686/egg/dumbo/core.py", line 748, in redfunc_iter
  File "build/bdist.linux-i686/egg/dumbo/core.py", line 755, in <genexpr>
  File "build/bdist.linux-i686/egg/dumbo/util.py", line 32, in sorted
  File "build/bdist.linux-i686/egg/dumbo/util.py", line 32, in <genexpr>
  File "build/bdist.linux-i686/egg/dumbo/core.py", line 733, in mapfunc_iter
  File "build/bdist.linux-i686/egg/typedbytes.py", line 355, in reads
  File "build/bdist.linux-i686/egg/typedbytes.py", line 85, in _reads
  File "build/bdist.linux-i686/egg/typedbytes.py", line 74, in _read
  File "build/bdist.linux-i686/egg/typedbytes.py", line 163, in invalid_typecode
struct.error: Invalid type byte: 50
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 255
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:564)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Probably the ResultSet from HBase cannot be deserialized by typedbytes.
So do I need to write my own InputType, or is there an existing one?
Or is there a completely different way to process HBase tables with
dumbo?
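For readers unfamiliar with the format behind this error, here is a minimal sketch of a typed bytes reader. It assumes the wire layout described in the Hadoop streaming typedbytes package documentation (a one-byte type code followed by a big-endian payload) and handles only four codes. The names `read_typed_value` and `_read_exact` are hypothetical and are not the typedbytes module's API.

```python
import io
import struct


def _read_exact(stream, n):
    """Read exactly n bytes or fail (hypothetical helper)."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("stream ended inside a typed bytes value")
    return data


def read_typed_value(stream):
    """Decode one value; only codes 0, 3, 4 and 7 are handled in this sketch."""
    code = _read_exact(stream, 1)[0]
    if code == 0:  # raw bytes: 4-byte length, then the bytes
        (length,) = struct.unpack(">i", _read_exact(stream, 4))
        return _read_exact(stream, length)
    if code == 3:  # 32-bit signed integer
        return struct.unpack(">i", _read_exact(stream, 4))[0]
    if code == 4:  # 64-bit signed integer
        return struct.unpack(">q", _read_exact(stream, 8))[0]
    if code == 7:  # string: 4-byte length, then UTF-8 bytes
        (length,) = struct.unpack(">i", _read_exact(stream, 4))
        return _read_exact(stream, length).decode("utf-8")
    # Any other leading byte is rejected, as in the traceback above.
    raise ValueError("Invalid type byte: %d" % code)


# A well-formed string value decodes cleanly.
print(read_typed_value(io.BytesIO(b"\x07\x00\x00\x00\x04row1")))  # row1
```

In this sketch, feeding the reader plain text such as `b"2009"` raises `Invalid type byte: 50`, because 50 is the ASCII code of the character `2`. Whether that is what happened in the job above cannot be told from the traceback alone.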
I've been working on this.
You need a custom mapper that takes the key/value pair
(ImmutableBytesWritable, RowResult) and outputs
(TypedBytesWritable, TypedBytesWritable). Then you can just use dumbo's
Java integration.
I've written a mapper that should do that, but either I've missed
something or there is a bug.
I keep getting: struct.error: Invalid type byte: 50
Here's my mapper class:
/**
 * Mapper class for fetching columns from HBase and returning typed
 * bytes for use with dumbo. Uses the deprecated mapred API, because
 * Hadoop streaming uses it.
 *
 * @author tims
 */
public class HBaseDumboMapper implements
        Mapper<ImmutableBytesWritable, RowResult, TypedBytesWritable, TypedBytesWritable> {
    private static Logger log = Logger.getLogger(HBaseDumboMapper.class);
I think you might want to avoid passing byte[]s to
TypedBytesWritable.setValue() (either directly or indirectly as part
of a collection). Under the hood, the following method gets called:
/**
 * Writes a Java object as a typed bytes sequence.
 *
 * @param obj the object to be written
 * @throws IOException
 */
public void write(Object obj) throws IOException {
    if (obj instanceof Buffer) {
        writeBytes(((Buffer) obj).get());
    } else if (obj instanceof Byte) {
        writeByte((Byte) obj);
    } else if (obj instanceof Boolean) {
        writeBool((Boolean) obj);
    } else if (obj instanceof Integer) {
        writeInt((Integer) obj);
    } else if (obj instanceof Long) {
        writeLong((Long) obj);
    } else if (obj instanceof Float) {
        writeFloat((Float) obj);
    } else if (obj instanceof Double) {
        writeDouble((Double) obj);
    } else if (obj instanceof String) {
        writeString((String) obj);
    } else if (obj instanceof ArrayList) {
        writeVector((ArrayList) obj);
    } else if (obj instanceof List) {
        writeList((List) obj);
    } else if (obj instanceof Map) {
        writeMap((Map) obj);
    } else {
        throw new RuntimeException("cannot write objects of this type");
    }
}
Not sure why you're not getting the "cannot write objects of this
type" exception, but I'd expect things to work when you use
org.apache.hadoop.record.Buffer objects instead of byte[]s (or convert
the byte[]s to other proper objects first).
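The type dispatch in that method can be mirrored in a few lines of Python, which may make it easier to see what ends up on the wire. This is only a sketch under assumptions: the type codes and big-endian layout are taken from the Hadoop typedbytes package documentation, Python `bytes` stands in for `org.apache.hadoop.record.Buffer`, only a subset of types is covered, and `write_typed_value` is a hypothetical name rather than part of any library.

```python
import struct


def write_typed_value(obj):
    """Return a typed bytes encoding of obj (subset of types, sketch only)."""
    if isinstance(obj, bytes):  # stands in for the Buffer branch
        return b"\x00" + struct.pack(">i", len(obj)) + obj
    if isinstance(obj, bool):  # checked before int: bool is a subclass of int
        return b"\x02" + (b"\x01" if obj else b"\x00")
    if isinstance(obj, int):
        if -2**31 <= obj < 2**31:
            return b"\x03" + struct.pack(">i", obj)  # 32-bit integer
        return b"\x04" + struct.pack(">q", obj)  # 64-bit integer
    if isinstance(obj, float):
        return b"\x06" + struct.pack(">d", obj)  # double
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return b"\x07" + struct.pack(">i", len(data)) + data
    if isinstance(obj, dict):  # map: entry count, then key/value pairs
        out = b"\x0a" + struct.pack(">i", len(obj))
        for key, value in obj.items():
            out += write_typed_value(key) + write_typed_value(value)
        return out
    # Anything outside the chain falls through, like the final else above.
    raise TypeError("cannot write objects of this type")


print(write_typed_value("hi"))  # b'\x07\x00\x00\x00\x02hi'
```

A value of a type that is not in the chain (a `set` in this sketch) reaches the last line and raises, which is the behaviour one would expect for a bare byte[] in the Java method.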
On Mon, Jul 27, 2009 at 1:43 PM, Tim Sell<trs...@gmail.com> wrote:
> I've made a mapper that should do that, but I've missed something, or
> there is a bug.
> I keep getting: struct.error: Invalid type byte: 50
On Mon, Jul 27, 2009 at 2:00 PM, Klaas Bosteels<klaas.boste...@gmail.com> wrote:
> Not sure why you're not getting the "cannot write objects of this
> type" exception, but I'd expect things to work when you use
> org.apache.hadoop.record.Buffer objects instead of byte[]s (or convert
> the byte[]s to other proper objects first).
        TypedBytesWritable row = new TypedBytesWritable();
        row.setValue(new String("help"));
        TypedBytesWritable tbColumns = new TypedBytesWritable();
        tbColumns.setValue(new String("I'm stuck in a row factory"));
        collector.collect(row, tbColumns);
    }
And I still get "Invalid type byte: 50".
I get the same error if I don't do any collecting at all.
I have dumbo installed in a virtual environment.
Here is how I am running it:
$ ../python/env/bin/dumbo test.py -hadoop ~/hadoop-0.20.0.hbase/ \
    -libjar build/dist/bobbie-mapred.jar -output /user/search/test \
    -inputformat org.apache.hadoop.hbase.mapred.TableInputFormat \
    -hadoopconf hbase.mapred.tablecolumns="fam1:qualifier1" -input tablename
Thanks for your help, Klaas. It turned out it was just because I was
using an old version :P heh. So frustrating...
So below is my final class for outputting HBase columns to typed bytes
(as strings).
You will get typed bytes with
key = row
value = a dict { family : qualifier : value }
To run this, do:
dumbo test.py -hadoop <hadoopdir> -libjar <jar with below mapper in it> \
    -inputformat org.apache.hadoop.hbase.mapred.TableInputFormat \
    -hadoopconf hbase.mapred.tablecolumns="family1:qualifier1 family2:qualifier2" \
    -input <tablename> -output <outdir>
/**
 * Mapper class for fetching columns from HBase and returning typed
 * bytes for use with dumbo. Uses the deprecated mapred API, because
 * Hadoop streaming uses it.
 *
 * @author tims
 */
public class HBaseDumboMapper implements
        Mapper<ImmutableBytesWritable, RowResult, TypedBytesWritable, TypedBytesWritable> {
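On the Python side, a script consuming these records might look roughly like the sketch below. It assumes the value arrives as a nested dict of the form {family: {qualifier: value}}, which is one reading of the description above; the function bodies and the sample rows are made up for illustration, and the small driver only simulates the shuffle locally.

```python
from collections import defaultdict


def mapper(row, columns):
    """Emit ("family:qualifier", 1) for every column present in a row.

    Assumes columns is a nested dict {family: {qualifier: value}}.
    """
    for family, qualifiers in columns.items():
        for qualifier in qualifiers:
            yield "%s:%s" % (family, qualifier), 1


def reducer(column, counts):
    """Sum the per-row counts for one column name."""
    yield column, sum(counts)


# Local stand-in for the map/shuffle/reduce cycle, using made-up rows.
rows = [
    ("row1", {"family1": {"qualifier1": "a"}}),
    ("row2", {"family1": {"qualifier1": "b"}, "family2": {"qualifier2": "c"}}),
]
grouped = defaultdict(list)
for row, columns in rows:
    for key, value in mapper(row, columns):
        grouped[key].append(value)
result = dict(pair for key in sorted(grouped) for pair in reducer(key, grouped[key]))
print(result)  # {'family1:qualifier1': 2, 'family2:qualifier2': 1}
```

In an actual dumbo script the two functions would be handed to dumbo.run(mapper, reducer), as in the dumbo.run call shown in the traceback at the top of the thread, instead of being driven by the local loop.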
>>>>>> On Mon, Jul 27, 2009 at 2:00 PM, Klaas Bosteels<klaas.boste...@gmail.com> wrote:
>>>>>>> Think you might want to avoid passing byte[]s to
>>>>>>> TypedBytesWritable.setValue() (either directly or indirectly as part
>>>>>>> of a collection). What gets called under the hood is the following
>>>>>>> method:
>>>>>>> /**
>>>>>>> * Writes a Java object as a typed bytes sequence.
>>>>>>> *
>>>>>>> * @param obj the object to be written
>>>>>>> * @throws IOException
>>>>>>> */
>>>>>>> public void write(Object obj) throws IOException {
>>>>>>> if (obj instanceof Buffer) {
>>>>>>> writeBytes(((Buffer) obj).get());
>>>>>>> } else if (obj instanceof Byte) {
>>>>>>> writeByte((Byte) obj);
>>>>>>> } else if (obj instanceof Boolean) {
>>>>>>> writeBool((Boolean) obj);
>>>>>>> } else if (obj instanceof Integer) {
>>>>>>> writeInt((Integer) obj);
>>>>>>> } else if (obj instanceof Long) {
>>>>>>> writeLong((Long) obj);
>>>>>>> } else if (obj instanceof Float) {
>>>>>>> writeFloat((Float) obj);
>>>>>>> } else if (obj instanceof Double) {
>>>>>>> writeDouble((Double) obj);
>>>>>>> }
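For readers unfamiliar with the format: the "type byte" in the error message is the one-byte tag that precedes every typed bytes value. The sketch below is a hand-rolled Python 3 illustration for strings and maps only, not the `typedbytes` module itself. It assumes type code 7 for strings and 10 for maps, each followed by a 4-byte big-endian length or entry count; the real libraries handle many more codes than this subset.

```python
import struct

# Assumed type codes from the Hadoop typed bytes format (subset only).
STRING, MAP = 7, 10


def encode(obj):
    """Encode a str or a dict of encodable values as typed bytes."""
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return struct.pack(">Bi", STRING, len(data)) + data
    if isinstance(obj, dict):
        out = struct.pack(">Bi", MAP, len(obj))
        for key, value in obj.items():
            out += encode(key) + encode(value)
        return out
    raise TypeError("this sketch only handles str and dict")


def decode(buf, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    code = buf[pos]
    (length,) = struct.unpack_from(">i", buf, pos + 1)
    pos += 5
    if code == STRING:
        return buf[pos:pos + length].decode("utf-8"), pos + length
    if code == MAP:
        result = {}
        for _ in range(length):
            key, pos = decode(buf, pos)
            value, pos = decode(buf, pos)
            result[key] = value
        return result, pos
    raise ValueError("type byte not handled by this sketch: %d" % code)


encoded_row = encode("row1")
record, _ = decode(encode({"family1": {"qualifier1": "value1"}}))
```

A reader that meets a tag it does not know cannot tell how long the value is, so it has to give up at that point, which is why the writer and the reader need to agree on the set of type codes.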
Talking with Mat Lehmann on #hbase, it would be nice to be able to
wrap a java mapper in a python mapper. Then we could do things like
use a python mapper on top of the above HBaseDumboMapper seamlessly.
Imo this would be much nicer than writing a custom inputformat for
hbase that emits typedbytes.
Can we do this already in dumbo? Or shall I make an issue?
On Mon, Jul 27, 2009 at 5:32 PM, Tim Sell<trs...@gmail.com> wrote:
> Talking with Mat Lehmann on #hbase, it would be nice to be able to
> wrap a java mapper in a python mapper, then we could do things like
> use a python mapper on the above HBaseDumboMapper seamlessly, imo this
> would be much nicer than writing a custom inputformat for hbase that
> emits typedbytes.
> Can we do this already in dumbo? or shall I make an issue?
I'm afraid not. Feel free to ticket, but this would probably require changes to Streaming so it's probably not exactly an easy addition.
Why would this be better/nicer than just implementing an extension of the hbase inputformat that outputs typed bytes tho?
> On Mon, Jul 27, 2009 at 5:32 PM, Tim Sell<trs...@gmail.com> wrote:
>> Talking with Mat Lehmann on #hbase, it would be nice to be able to
>> wrap a java mapper in a python mapper, then we could do things like
>> use a python mapper on the above HBaseDumboMapper seamlessly, imo this
>> would be much nicer than writing a custom inputformat for hbase that
>> emits typedbytes.
>> Can we do this already in dumbo? or shall I make an issue?
> I'm afraid not. Feel free to ticket, but this would probably require
> changes to Streaming so it's probably not exactly an easy addition.
> Why would this be better/nicer than just implementing an extension of
> the hbase inputformat that outputs typed bytes tho?
I couldn't be bothered figuring out how to set the max number of
mappers from dumbo, so I just left it at the number of regions in
the table you are scanning.
It outputs typed bytes as above. Still strings, since there's the
buffer bug and I'm using strings anyway.
It's quite nice to use, way better than using the mapper I posted earlier.
eg:
<pre>
# test2.py
import dumbo

def mapper(key, columns):
    # key is the row; columns maps family -> {qualifier: value}
    for family in columns:
        for qualifier, value in columns[family].iteritems():
            yield key, (family, qualifier, value)

def runner(job):
    job.additer(mapper)

def starter(prog):
    # no extra options are set here
    pass

if __name__ == "__main__":
    dumbo.main(runner, starter)
</pre>
I should probably stick this on github or something, and eventually
add to hbase.
Now we just need a TypedBytesTableOutputFormat :)
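The mapper from test2.py can be tried without Hadoop or HBase by feeding it a hand-written record. This is only a local sketch: the sample row and its values are invented, and `.items()` is used in place of `.iteritems()` so that it also runs on Python 3.

```python
def mapper(key, columns):
    # Same logic as in test2.py: one output pair per cell in the row.
    for family in columns:
        for qualifier, value in columns[family].items():
            yield key, (family, qualifier, value)


# Hypothetical input: (row key, {family: {qualifier: value}}) pairs.
sample_rows = [
    ("row1", {
        "family1": {"qualifier1": "value1"},
        "family2": {"qualifier2": "value2"},
    }),
]

output = [pair for key, columns in sample_rows
          for pair in mapper(key, columns)]
```

Each cell of the row becomes its own `(row, (family, qualifier, value))` pair, which is the shape a reducer further down the job would receive.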
> 2009/7/27 Tim Sell <trs...@gmail.com>:
>> Just because it's easier to write a mapper, but if we have to change
>> streaming, it's probably easier to write a new input format.