hdfs partitions and rhadoop


Martin Eggenberger

Mar 18, 2014, 3:55:29 PM3/18/14
to rha...@googlegroups.com
I am trying to run a rather large rmr job where the file data is partitioned as follows:


/data/folder/partition1
/data/folder/partition2
/data/folder/partitionN

When running the following job I get a "Not a file" error.


library(rmr2)

large.job.mapper <- function( key, values )
{
  output.key    =  key  
  output.value  = data.frame( OrderCount = 1 )
 
  keyval( output.key, output.value )
}

large.job.mr <- function (inputPath, outputPath = NULL )
{
  mapreduce( input = inputPath, 
             output = outputPath,
             map = large.job.mapper,
             verbose=T
  )
}

result =
large.job.mr ( '/data/folder/' )


OUTPUT

14/03/18 15:52:46 INFO mapred.JobClient: Cleaning up the staging area hdfs://pxpmhwtmn001.gid.gap.com:8020/user/bdload/.staging/job_201403071300_22423
14/03/18 15:52:46 ERROR security.UserGroupInformation: PriviledgedActionException as:bdload cause:java.io.IOException: Not a file: Not a file: hdfs://pxpmhwtmn001.gid.gap.com:8020/data/folder/partiton1
14/03/18 15:52:46 ERROR streaming.StreamJob: Error Launching job : Not a file: hdfs://pxpmhwtmn001.gid.gap.com:8020/data/folder/partiton1
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  :
  hadoop streaming failed with error code 5



Is there any way to set an input filter in rmr2?
Thank you
-me







Antonio Piccolboni

Mar 18, 2014, 4:06:15 PM3/18/14
to RHadoop Google Group
On Tue, Mar 18, 2014 at 12:55 PM, Martin Eggenberger <meg...@gmail.com> wrote:
I am trying to run a rather large rmr job where the file data is partitioned as follows:


/data/folder/partition1
/data/folder/partition2
/data/folder/partitionN

When running the following job I am getting a not a file error.

And partition1 is a directory containing the actual files, is that correct?
 


Is there any way to set an input filter in rmr2?

What is an input filter and is this question related to the error report in any way? Thanks


Antonio

 
--
post: rha...@googlegroups.com ||
unsubscribe: rhadoop+u...@googlegroups.com ||
web: https://groups.google.com/d/forum/rhadoop?hl=en-US
---
You received this message because you are subscribed to the Google Groups "RHadoop" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rhadoop+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Antonio Piccolboni

Mar 18, 2014, 4:22:37 PM3/18/14
to rha...@googlegroups.com
The internet seems to support two views:

1. The standard input format doesn't support nested directories. Write your own, or build a list of the actual files with a recursive list operation.
2. Just use globs in the input path.

The two camps seem busy downvoting each other on Stack Overflow. The second option is easy enough to try, so the globbing question should be quick to settle. Your input could be /data/folder/* or /data/folder/*/*, according to different oral traditions. Alternatively, you could use hdfs.ls from the rhdfs package to build a vector of partition paths and pass it to mapreduce as the input argument. I think it would be of general interest if you reported back on your solution. Thanks
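A minimal sketch of both suggestions, using the hypothetical paths from this thread (assumes rmr2 and rhdfs are installed, Hadoop is configured, and large.job.mapper is defined as in the original post):

```r
library(rmr2)
library(rhdfs)
hdfs.init()

# Option 1: pass a glob; expansion happens on the Hadoop side, not in R
result.glob <- mapreduce(input = "/data/folder/*",
                         map   = large.job.mapper)

# Option 2: build an explicit vector of partition paths with rhdfs
# and hand it to mapreduce, which accepts a vector of inputs
partitions  <- hdfs.ls("/data/folder")$file
result.list <- mapreduce(input = partitions,
                         map   = large.job.mapper)
```

Option 2 has the advantage that the list can be filtered in R before the job is launched.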


Antonio

Antonio Piccolboni

Mar 18, 2014, 4:27:04 PM3/18/14
to rha...@googlegroups.com
I just found that Ted Dunning espouses the single-glob theory (/data/folder/*). If Ted says so, I don't care what the rest of Stack Overflow says: that's what we need to try first, and it will work.


Antonio




Martin Eggenberger

Mar 18, 2014, 7:52:54 PM3/18/14
to rha...@googlegroups.com
I am using an input filter as follows. That allows me to create a dynamic partition in Hive and subsequently filter based on it.


# Requires rhdfs (hdfs.ls) with hdfs.init() already called.
input.filter = function ( directory, filter, start, end )
{
    files  = hdfs.ls( directory )$file              # full HDFS paths of the partitions
    filter = paste( directory, filter, sep = '/' )  # common prefix to strip
    keys   = substr( files, nchar( filter ) + 1, 100 )  # partition key after the prefix
    # keep the contiguous range of partitions from start to end;
    # both keys must exist in the listing
    files[ which( keys == start ) : which( keys == end ) ]
}

Works like a charm. The start and end tags can be passed in on the command line.
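A hypothetical invocation, assuming partitions named partition1 … partitionN as in this thread and a script run via Rscript (the file name job.R and the key values are illustrative):

```r
# Rscript job.R 1 5
args   <- commandArgs(trailingOnly = TRUE)  # e.g. c("1", "5")

# select partitions /data/folder/partition1 through /data/folder/partition5
paths  <- input.filter('/data/folder', 'partition', args[1], args[2])

# run the job only over the selected partitions
result <- mapreduce(input = paths, map = large.job.mapper)
```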