Filesystem issues with Hfs TextLine tap on a zip file


Andrew Xue

Oct 15, 2012, 7:25:49 PM
to cascadi...@googlegroups.com
Hi, I'm trying to put a tap over some text files that have been zipped up. The issue I'm running into is that the ZipInputFormat class complains that the file is a directory. The error I get is this:

IOException does not support directories: s3://my_bucket/my_path/my_file.zip  cascading.tap.hadoop.ZipInputFormat.listPathsInternal (ZipInputFormat.java:119)

It seems the issue is that when isFile is called on the FileSystem object for the file path (i.e., s3://my_bucket/my_path/my_file.zip), it returns false, even though it's a file.

Is there something I am missing? Has anyone seen an issue like this? I get it both on my local computer and when running from S3 with Amazon EMR. If you unzip s3://my_bucket/my_path/my_file.zip, it extracts a file (with a different name, but into the same directory).

Would really appreciate any guidance! Thanks


P.S. Below are the logging lines (the LOG.info and LOG.error calls) I put into a custom subclass of TextLine, and the output I get:

    public void sourceInit( Tap tap, JobConf conf )
    {
        if( hasZippedFiles( FileInputFormat.getInputPaths(conf), conf) )
        {
            LOG.info("USING ZIP FORMAT!!!");
            conf.setInputFormat( ZipInputFormat.class );
        }
        else
            conf.setInputFormat( TextInputFormat.class );
    }

    private boolean hasZippedFiles( Path[] paths, JobConf conf )
    {

        boolean isZipped = paths[ 0 ].getName().endsWith( ".zip" );

        for (Path p : paths)
        {
            LOG.info(p);
        }
        LOG.info("path name: "+paths[ 0 ].getName()+" is zipped? "+isZipped);

        try
        {
            FileSystem fs = paths[0].getFileSystem( conf );
            LOG.info("file system: "+fs.toString());
            LOG.info("is it a file? "+fs.isFile(paths[0]));
        }
        catch (Exception e) {LOG.error(e); }

        for( int i = 1; i < paths.length; i++ )
        {
            if( isZipped != paths[ i ].getName().endsWith( ".zip" ) )
                throw new IllegalStateException( "cannot mix zipped and unzipped files" );
        }

        return isZipped;
    }

output:

12/10/15 23:10:35 INFO s3://my_bucket/my_path/my_file.zip
12/10/15 23:10:35 INFO path name: s3://my_bucket/my_path/my_file.zip is zipped? true
12/10/15 23:10:35 INFO  file system: org.apache.hadoop.fs.s3native.NativeS3FileSystem@2b79ef
12/10/15 23:10:35 INFO is it a file? false
12/10/15 23:10:35 INFO USING ZIP FORMAT!!!

Andrew Xue

Oct 15, 2012, 8:02:46 PM
to cascadi...@googlegroups.com
Never mind, I just noticed a dumb mistake on my side that caused this error. The message made me think the fs was treating the path as a directory instead of a file, but the real mistake was a typo in the file path. In other words, it was a "file doesn't exist" error, somewhat obscured by the "does not support directories" message.
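
For anyone who lands on the same message: as this thread shows, an isFile-style check returning false does not tell you whether the path is a directory or simply does not exist. Below is a minimal sketch of a more explicit check. It uses java.nio.file from the JDK as a stand-in so it runs without Hadoop; the class name PathCheck and the method classify are made up for illustration and are not part of Cascading or Hadoop. With a Hadoop FileSystem, the analogous calls would presumably be fs.exists(path) followed by fs.isFile(path).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Cascading or Hadoop.
class PathCheck {

    // Separates the "missing" case from the "directory" case,
    // which a bare isFile-style check collapses into a single false.
    static String classify(Path p) {
        if (!Files.exists(p)) {
            return "missing";
        }
        if (Files.isDirectory(p)) {
            return "directory";
        }
        return "file";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pathcheck");
        Path real = Files.createFile(dir.resolve("my_file.zip"));
        Path typo = dir.resolve("my_fle.zip"); // deliberately misspelled, never created

        // Both of these are false, for different reasons.
        System.out.println("isRegularFile(dir):  " + Files.isRegularFile(dir));
        System.out.println("isRegularFile(typo): " + Files.isRegularFile(typo));

        // The explicit check reports which case applies.
        System.out.println("real -> " + classify(real));
        System.out.println("dir  -> " + classify(dir));
        System.out.println("typo -> " + classify(typo));
    }
}
```

Logging an existence check alongside the isFile result in hasZippedFiles would likely have pointed at the typo right away.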