Equivalent of 'hadoop dfs -getmerge'?


Erik Forsberg

Nov 30, 2009, 1:50:09 PM
to dumbo...@googlegroups.com
Hi!

If I have a bunch of result part files (part-00000, part-00001, ...,
part-0000N), all of them in the typedbytes format, what's the best way
to transfer them to the local filesystem, preferably in sorted form?

'dumbo cat' only seems to be able to cat single files, and that's kind
of inconvenient if the number of reducers is large.

I guess I'm looking for something like the 'hadoop dfs -getmerge'
command, but for typedbytes.
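(For what it's worth, since typedbytes is a self-delimiting binary format, a getmerge-style merge is just byte-level concatenation of the part files in sorted name order. A minimal local-filesystem sketch of the idea, using a hypothetical helper name that is not part of Dumbo or Hadoop:

```python
import glob
import os

def getmerge_local(parts_dir, dest_path):
    """Concatenate part-* files in sorted name order into one file.

    Byte-level concatenation preserves validity because each typedbytes
    record carries its own length, so streams can simply be appended.
    """
    parts = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(dest_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                out.write(src.read())
    return parts  # the files that were merged, in order
```

The real `hadoop dfs -getmerge` does the same thing but reads from HDFS rather than the local filesystem.)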

Thanks,
\EF

Klaas Bosteels

Nov 30, 2009, 4:26:30 PM
to dumbo...@googlegroups.com
dumbo cat /path/on/dfs/to/parts/dir -hadoop /path/to/hadoop should work

-Klaas

Erik Forsberg

Dec 1, 2009, 1:50:48 AM
to dumbo...@googlegroups.com
On Mon, 30 Nov 2009 22:26:30 +0100
Klaas Bosteels <klaas.b...@gmail.com> wrote:

> dumbo cat /path/on/dfs/to/parts/dir -hadoop /path/to/hadoop should
> work

Doesn't work for me (Dumbo 0.21.21):

bin/dumbo cat /user/forsberg/test0 -hadoop /usr/lib/hadoop
java.io.IOException: Cannot open filename /user/forsberg/test0/_logs
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1181)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1172)
        at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:355)
        at org.apache.hadoop.dfs.DistributedFileSystem.open(DistributedFileSystem.java:163)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:364)
        at org.apache.hadoop.streaming.AutoInputFormat.getRecordReader(AutoInputFormat.java:56)
        at org.apache.hadoop.streaming.DumpTypedBytes.dumpTypedBytes(DumpTypedBytes.java:101)
        at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:82)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:43)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Adding a / to the path doesn't help:

bin/dumbo cat /user/forsberg/test0/ -hadoop /usr/lib/hadoop
java.io.IOException: Cannot open filename /user/forsberg/test0/_logs
        (followed by the same stack trace as the first attempt)

For reference, the contents of the DFS directory:

hadoop dfs -ls /user/forsberg/test0
Found 3 items
drwxr-xr-x   - forsberg supergroup        0 2009-11-30 10:34 /user/forsberg/test0/_logs
-rw-r--r--   1 forsberg supergroup 21164658 2009-11-30 10:35 /user/forsberg/test0/part-00000
-rw-r--r--   1 forsberg supergroup 21243935 2009-11-30 10:35 /user/forsberg/test0/part-00001

If I first do 'hadoop dfs -rmr /user/forsberg/test0/_logs', 'dumbo cat'
works as intended. Hmm... googling for _logs gives me
http://dumbotics.com/2009/05/31/dumbo-on-clouderas-distribution/, so it
seems this is a known problem.

I guess I can disable log creation, or move the logs somewhere else,
but still, I think this is a bug. Which reminds me that when I try to
join the dumbo space at Assembla, it just gives me a blank page :-(.

Thanks for your reply, it got me thinking! :-)
\EF

Zak Stone

Dec 1, 2009, 2:09:37 AM
to dumbo...@googlegroups.com
Can you confirm that dumbo cat actually concatenates all parts of the
output when you remove the _logs directory? I got the impression that
it didn't at one point, but that was a while back, and I didn't have
time to set up a proper test.

To avoid the whole issue, I wrote a little wrapper around "hadoop fs
-ls" that ignores paths that contain "_logs" and explicitly iterates
through all of the rest.
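(A sketch of that kind of filter, assuming the old `hadoop fs -ls` output format shown elsewhere in this thread, where the path is the last whitespace-separated field; the function name is hypothetical, not Zak's actual wrapper:

```python
def non_log_paths(ls_output):
    """Pull paths out of `hadoop fs -ls` output, skipping _logs entries.

    Assumes the pre-0.21 listing format where each entry line ends with
    the path; header lines like "Found 3 items" are skipped.
    """
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        if len(fields) < 2:                 # skip blanks / malformed lines
            continue
        if fields[0].startswith("Found"):   # "Found N items" header
            continue
        path = fields[-1]
        if "_logs" not in path:
            paths.append(path)
    return paths
```

Each surviving path can then be fed to `dumbo cat` one file at a time.)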

Zak

Erik Forsberg

Dec 1, 2009, 2:27:27 AM
to dumbo...@googlegroups.com
On Tue, 1 Dec 2009 02:09:37 -0500
Zak Stone <zst...@gmail.com> wrote:

> Can you confirm that dumbo cat actually concatenates all parts of the
> output when you remove the _logs directory? I got the impression that
> it didn't at one point, but that was a while back, and I didn't have
> time to set up a proper test.

With my Dumbo 0.21.21, I can confirm that it does indeed. I tested as
follows:

$ bin/dumbo cat /user/forsberg/test0/ -hadoop /usr/lib/hadoop > dumbo-merged

$ bin/dumbo cat /user/forsberg/test0/part-00000 -hadoop /usr/lib/hadoop > part-00000

$ bin/dumbo cat /user/forsberg/test0/part-00001 -hadoop /usr/lib/hadoop > part-00001

$ cat part-00000 part-00001 > part-merged

$ sha1sum part-merged dumbo-merged
00478c563f5cf3e59e77a4575352a5d2c3af3d90 part-merged
00478c563f5cf3e59e77a4575352a5d2c3af3d90 dumbo-merged

\EF

Klaas Bosteels

Dec 1, 2009, 4:43:12 AM
to dumbo...@googlegroups.com
Ah yes, the _logs directory can indeed cause trouble. I tend to forget
about it because we disable the creation of that dir on all our
clusters. I'd be happy to accept a patch that provides a (proper)
solution for this problem tho.. :)

As mentioned in the blog post, a simple workaround is to use

dumbo cat /user/forsberg/test0/part*

instead of

dumbo cat /user/forsberg/test0/

-Klaas

Zak Stone

Dec 1, 2009, 9:13:15 AM
to dumbo...@googlegroups.com
Thanks, Klaas. Erik, could you run that test again with Dumbo's part* syntax?

Zak

Erik Forsberg

Dec 3, 2009, 11:41:47 AM
to dumbo...@googlegroups.com
On Tue, 1 Dec 2009 09:13:15 -0500
Zak Stone <zst...@gmail.com> wrote:

> Thanks, Klaas. Erik, could you run that test again with Dumbo's part*
> syntax?

Doesn't work on Dumbo 0.21.21. It only gets the first part file
(part-00000), with no error message or anything.

But I've disabled user log creation now, so no problem.
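(For anyone else hitting this: on Hadoop versions of this era, writing job history into the output directory's _logs could, if I recall correctly, be disabled with the `hadoop.job.history.user.location` property, which is also what the dumbotics blog post discusses; verify against your version's documentation. A sketch of the hadoop-site.xml fragment:

```xml
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
</property>
```

)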

\EF

Zak Stone

Dec 3, 2009, 11:48:16 AM
to dumbo...@googlegroups.com
Then that's the problem I encountered before; glad to know I wasn't
hallucinating. Klaas, any idea how easy it would be to fix this?

Thanks,
Zak

Klaas Bosteels

Dec 3, 2009, 12:19:04 PM
to dumbo...@googlegroups.com
Right, the workaround I suggested indeed doesn't work. Not sure how
difficult it would be to fix, and I'm afraid I don't have time to look
into it right now. I'd be happy to investigate it later if someone
files a ticket for it though...

-K

Zak Stone

Dec 3, 2009, 1:26:22 PM
to dumbo...@googlegroups.com
Thanks, Klaas -- I just created a ticket!

Zak
