Hi, I'm trying to put a tap over some text files that have been zipped up. The issue I am running into is that the ZipInputFormat class complains that the file is a directory. The error I get is this:
IOException does not support directories: s3://my_bucket/my_path/my_file.zip cascading.tap.hadoop.ZipInputFormat.listPathsInternal (ZipInputFormat.java:119)
It seems like the issue is that when isFile is called on the FileSystem object with the file path (i.e., s3://my_bucket/my_path/my_file.zip), it returns false, even though it's a file.
Is there something I am missing? Has anyone seen an issue like this? I get it both on my local computer and when running from S3 with Amazon EMR. If you run unzip on s3://my_bucket/my_path/my_file.zip, it extracts a file (with a different name, but into the same directory).
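To double-check what is actually inside an archive without relying on the unzip tool, something like the sketch below could be used. It only uses java.util.zip from the JDK, not Cascading or Hadoop, and it reads a small in-memory archive so that it runs on its own. The class name, method names, and the entry name some_other_name.txt are made up for illustration; for a real check the stream would come from the actual zip file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryCheck {

    // Returns one line per archive entry, marking directory entries.
    public static List<String> describeEntries(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                lines.add((entry.isDirectory() ? "dir  " : "file ") + entry.getName());
            }
        }
        return lines;
    }

    // Builds a tiny in-memory archive with a single text entry (hypothetical name).
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("some_other_name.txt"));
            out.write("line one\nline two\n".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Swap the in-memory sample for a FileInputStream to inspect a real archive.
        for (String line : describeEntries(new ByteArrayInputStream(sampleZip()))) {
            System.out.println(line);
        }
    }
}
```

If the archive lists a single "file" entry and no "dir" entries, the zip itself is probably fine and the directory complaint is more likely coming from how the path is resolved.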
Would really appreciate any guidance! Thanks
P.S. Below are the logging lines (the LOG.info and LOG.error calls) I put into a custom subclass of TextLine, and the output I get:
public void sourceInit( Tap tap, JobConf conf )
  {
  if( hasZippedFiles( FileInputFormat.getInputPaths( conf ), conf ) )
    {
    LOG.info( "USING ZIP FORMAT!!!" );
    conf.setInputFormat( ZipInputFormat.class );
    }
  else
    {
    conf.setInputFormat( TextInputFormat.class );
    }
  }

private boolean hasZippedFiles( Path[] paths, JobConf conf )
  {
  boolean isZipped = paths[ 0 ].getName().endsWith( ".zip" );

  // added logging: every input path
  for( Path p : paths )
    {
    LOG.info( p );
    }

  LOG.info( "path name: " + paths[ 0 ].getName() + " is zipped? " + isZipped );

  // added logging: which FileSystem handles the path, and what isFile reports
  try
    {
    FileSystem fs = paths[ 0 ].getFileSystem( conf );
    LOG.info( "file system: " + fs.toString() );
    LOG.info( "is it a file? " + fs.isFile( paths[ 0 ] ) );
    }
  catch( Exception e )
    {
    LOG.error( e );
    }

  // all paths must agree on whether they are zipped
  for( int i = 1; i < paths.length; i++ )
    {
    if( isZipped != paths[ i ].getName().endsWith( ".zip" ) )
      throw new IllegalStateException( "cannot mix zipped and unzipped files" );
    }

  return isZipped;
  }
output:
12/10/15 23:10:35 INFO s3://my_bucket/my_path/my_file.zip
12/10/15 23:10:35 INFO path name: s3://my_bucket/my_path/my_file.zip is zipped? true
12/10/15 23:10:35 INFO file system: org.apache.hadoop.fs.s3native.NativeS3FileSystem@2b79ef
12/10/15 23:10:35 INFO is it a file? false
12/10/15 23:10:35 INFO USING ZIP FORMAT!!!
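One thing the "is it a file? false" line above cannot tell me is whether the path was seen as a directory or was simply not found, since a single boolean collapses both cases. I have not confirmed which of the two is happening with NativeS3FileSystem. As a local analogy only, here is a minimal sketch using java.nio.file from the JDK (not the Hadoop FileSystem API) of the three-way check that would be more informative in the log. The class name, method name, and file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PathKindCheck {

    // Classifies a local path as "missing", "directory", "file", or "other".
    public static String classify(Path p) {
        if (!Files.exists(p)) {
            return "missing";
        }
        if (Files.isDirectory(p)) {
            return "directory";
        }
        if (Files.isRegularFile(p)) {
            return "file";
        }
        return "other";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("my_path");
        Path file = Files.createTempFile(dir, "my_file", ".zip");
        Path missing = dir.resolve("not_there.zip");

        // isRegularFile is false for both the directory and the missing path,
        // while classify() keeps the two cases apart.
        for (Path p : new Path[] { dir, file, missing }) {
            System.out.println(classify(p)
                + " (isRegularFile=" + Files.isRegularFile(p) + "): " + p.getFileName());
        }

        // Clean up the temporary file and directory.
        Files.delete(file);
        Files.delete(dir);
    }
}
```

The equivalent check against the Hadoop FileSystem object would presumably go through its file status call rather than isFile, but I have left that out since I have not tested it here.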