Regarding combining lzo files


rajit...@gmail.com

Jun 12, 2013, 1:37:03 PM
to elephant...@googlegroups.com
Hi,

My MR job consumes LZO files and spawns a certain number of mappers based on the number of input files. I am trying to optimise mapper usage with CombineFileInputFormat so that input files are combined and a single mapper processes more data. Hence I have a custom input format class extending CombineFileInputFormat which uses LzoThriftBlockRecordReader.

Can anyone tell me whether LZO files can be combined? I have also tried running the LZO indexer on my input, and I get an error saying there is no codec for the index file.

Following is the error I am getting. Please let me know whether my approach is correct, or if there is anything I am missing.

java.lang.RuntimeException: java.io.IOException: No codec for file /data/globalbasesummary/201305072330-hkg1-part-r-00000.lzo.index found, cannot run
Caused by: java.io.IOException: No codec for file /data/globalbasesummary/201305072330-hkg1-part-r-00000.lzo.index found, cannot run
	at com.twitter.elephantbird.mapreduce.input.LzoRecordReader.initialize(LzoRecordReader.java:80)
	at com.twitter.elephantbird.mapreduce.input.LzoBinaryBlockRecordReader.initialize(LzoBinaryBlockRecordReader.java:83)	
	at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:161)

Raghu Angadi

Jun 12, 2013, 2:04:12 PM
to elephant...@googlegroups.com, rajit...@gmail.com

You are ending up reading the '*.lzo.index' files as well through your input format. These files are used for splitting large LZO files and do not contain any data; the LzoInputFormats skip them.

By the way, if you use Pig, it does the combining of multiple files into one mapper for you.

Raghu.


rajitha r

Jun 13, 2013, 3:29:31 AM
to Raghu Angadi, elephant...@googlegroups.com
Raghu,

Thanks for replying. I had used LzoThriftBlockInputFormat and it doesn't complain, but it spawned too many mappers. I want to avoid this and hence wanted to combine the LZO files, so I wrote a custom input format class extending Hadoop's CombineFileInputFormat and passing in LzoThriftBlockRecordReader. Do the LzoInputFormats allow combining? I am not aware of this. If yes, can you please suggest which classes would help me do that?
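
Roughly, the format class is wired up like this (a simplified sketch rather than my exact code; LzoThriftCombineReaderWrapper stands in for the adapter I put around LzoThriftBlockRecordReader):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombinedLzoThriftInputFormat
    extends CombineFileInputFormat<LongWritable, Writable> {

  @Override
  public RecordReader<LongWritable, Writable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader builds one LzoThriftCombineReaderWrapper (my
    // adapter around LzoThriftBlockRecordReader) per file in the combined
    // split and iterates over them in order.
    return new CombineFileRecordReader<LongWritable, Writable>(
        (CombineFileSplit) split, context, LzoThriftCombineReaderWrapper.class);
  }
}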

I am not using Pig; I launch MR jobs through Oozie.

Thanks,
Rajitha
--
Regards,

Rajitha.R

Raghu Angadi

Jun 13, 2013, 2:26:46 PM
to rajitha r, elephant...@googlegroups.com
By the way, how big are these LZO files? If they are small, most of the time you don't even need the .index files; you can just delete them.

In your custom input format, you should override listStatus() to drop the .lzo.index files, something like this:

@Override
protected List<FileStatus> listStatus(JobContext job) throws IOException {
  List<FileStatus> files = super.listStatus(job);
  List<FileStatus> nonIndexFiles = new ArrayList<FileStatus>();
  for (FileStatus f : files) {
    // keep everything except the .lzo.index files; they only hold split offsets
    if (!f.getPath().getName().endsWith(".lzo.index")) {
      nonIndexFiles.add(f);
    }
  }
  return nonIndexFiles;
}

I don't think LzoInputFormat supports combining multiple splits into one.

Raghu.

Jeremy Davis

Dec 4, 2013, 7:17:44 PM
to elephant...@googlegroups.com, rajitha r
Just checking in on this topic.
I ran into some issues that I think might be related to CombineFileInputFormat and LZO.
Has anyone had success spanning indexed LZO files?

-JD

Raghu Angadi

Dec 4, 2013, 7:26:35 PM
to elephant...@googlegroups.com, rajitha r
This happens a lot at Twitter. Pig is often configured to combine multiple splits up to 1GB (in some cases even larger).


Dmitriy Ryaboy

Dec 4, 2013, 7:30:40 PM
to elephant...@googlegroups.com, rajitha r
Jeremy, what exactly did you run into?
Are you using Pig or Cascading (or, more specifically, mapred or mapreduce)?

Pig doesn't use CIF. It has its own thing, and works fine.

CIF can be made to work with the elephant-bird input formats, but this requires some changes to EB that we are currently testing.
--
Dmitriy V Ryaboy
Twitter Analytics
http://twitter.com/squarecog

Jeremy Davis

Dec 4, 2013, 8:28:29 PM
to elephant...@googlegroups.com, rajitha r
I'm using Cascading 2.2.1 and I ran into an issue trying out the new CIF feature.
Input was a bunch of indexed LZO files, and I just assumed that the issue was related to LZO splits. (Could be wrong.)
I was just putting out feelers to see if other people had used the feature or run into any issues.
-JD

Dmitriy Ryaboy

Dec 4, 2013, 8:44:20 PM
to elephant...@googlegroups.com, rajitha r
As luck would have it, this is exactly why we are working on the patch :-)
This isn't due to the splits, it's due to the fact that CIF expects certain things that aren't true for EB input formats, and EB input formats expect certain things that aren't true for CIF. But we discussed this with Chris Wensel and are pretty sure we have a fix.
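
For a bit of context: the mapreduce CombineFileInputFormat hands each file of a combined split to a wrapper record reader that must expose a (CombineFileSplit, TaskAttemptContext, Integer) constructor, roughly like the sketch below. This is illustrative only, not the actual patch; createUnderlyingReader() is just a placeholder for however the wrapped EB reader gets built.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PerFileReaderWrapper extends RecordReader<LongWritable, Writable> {

  private final FileSplit fileSplit;   // the single file this wrapper reads
  private final RecordReader<LongWritable, Writable> delegate;

  // CombineFileRecordReader instantiates the wrapper reflectively with exactly
  // this constructor signature, once per file in the combined split.
  public PerFileReaderWrapper(CombineFileSplit split, TaskAttemptContext context,
                              Integer index) throws IOException {
    fileSplit = new FileSplit(split.getPath(index), split.getOffset(index),
                              split.getLength(index), split.getLocations());
    delegate = createUnderlyingReader(context);  // placeholder, see note above
  }

  @Override
  public void initialize(InputSplit ignored, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // Always initialize against the single-file split, never the combined
    // split that Hadoop passes in here.
    delegate.initialize(fileSplit, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return delegate.nextKeyValue();
  }

  @Override
  public LongWritable getCurrentKey() throws IOException, InterruptedException {
    return delegate.getCurrentKey();
  }

  @Override
  public Writable getCurrentValue() throws IOException, InterruptedException {
    return delegate.getCurrentValue();
  }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return delegate.getProgress();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }

  // Placeholder: construct whatever EB record reader you are wrapping.
  private RecordReader<LongWritable, Writable> createUnderlyingReader(
      TaskAttemptContext context) {
    throw new UnsupportedOperationException("build your EB record reader here");
  }
}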

Stay tuned..

Jeremy Davis

Dec 5, 2013, 10:09:10 AM
to elephant...@googlegroups.com, rajitha r
Excellent!
I don't use EB per se, but I cribbed our LzoTextDelimited Scheme from your work.
Any chance I can get early access to see your changes in flight?
-JD

Dmitriy Ryaboy

Dec 5, 2013, 7:44:32 PM
to elephant...@googlegroups.com, rajitha r
Current EB master + this pull request: https://github.com/kevinweil/elephant-bird/pull/360 should do the trick

Watch out, though -- we think there's a bit of a correctness bug in the pull request that mangles the first record read by the wrapped IF; Chandler is looking at it. So it's good for verifying that the approach generally works, but don't put this into production.

Jeremy Davis

Dec 6, 2013, 4:21:20 PM
to elephant...@googlegroups.com, rajitha r
I've applied the CIF patch, and I notice that:
if I turn on CIF, then I do not enter LzoInputFormat.listStatus();
if I turn off CIF, then I DO enter LzoInputFormat.listStatus().
Consequently, with CIF it tries to read the .lzo.index files and I get:
No codec for file file:/tmp/test/in4.old/part-00001.lzo.index found, cannot run

-JD

Jeremy Davis

Dec 6, 2013, 4:58:59 PM
to elephant...@googlegroups.com, rajitha r
I also notice that if I get rid of the index files (to sidestep the file status problem), then when it jumps to the next file the first line is an empty String.
I wonder if it is related to skipping the header.. Still tracing it through…
-JD

Dmitriy Ryaboy

Dec 6, 2013, 5:30:00 PM
to elephant...@googlegroups.com, rajitha r
Yeah, that empty string is the bug Chandler is chasing. The other one, about the index files, we hadn't noticed yet... Need to think about it.

Jeremy Davis

Dec 6, 2013, 8:03:09 PM
to elephant...@googlegroups.com, rajitha r
My workaround for the latter was to set the property:
mapred.input.pathFilter.class

Dmitriy Ryaboy

Dec 6, 2013, 8:50:28 PM
to elephant...@googlegroups.com, rajitha r
Good call! Mind making a pull request with this change?

Dmitriy Ryaboy

Dec 9, 2013, 5:19:41 PM
to elephant...@googlegroups.com, rajitha r
Jeremy, 
We just merged the patch I referenced earlier (with the empty string issue fixed).

Do you have the mapred.input.pathFilter.class change handy?

Jeremy Davis

Dec 9, 2013, 10:07:28 PM
to elephant...@googlegroups.com, rajitha r
Dmitriy,
Sorry for the delay..
No patch here, I'm afraid, as I'm using a home-grown version.

    @Override
    public void sourceConfInit(FlowProcess<JobConf> flowProcess, Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf) {
        conf.setInputFormat(DeprecatedLzoTextInputFormat.class);
        // drop .lzo.index (and hidden) files before input splits are computed
        conf.set("mapred.input.pathFilter.class", "path.to.whatever.LzoFilter");
        // let Cascading combine multiple input files into a single mapper
        conf.set("cascading.hadoop.hfs.combine.files", "true");
    }

/**
 * PathFilter that keeps LZO data files and drops hidden files and .lzo.index files.
 */
public class LzoFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        // adjust as needed
        return !name.startsWith(".") &&
               !name.startsWith("_") &&
               !name.endsWith(".lzo.index");
    }
}

Coincidentally, did you run into this issue:

Caused by: java.io.IOException: Compressed length 1748762994 exceeds max block size 67108864 (probably corrupt file)
	at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:286)
	at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:256)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)

(I promise my file isn't corrupt) :)
-JD

Jeremy Davis

Dec 10, 2013, 6:34:21 PM
to elephant...@googlegroups.com, rajitha r
I didn't realize until I tried it on the cluster that this method (accept) is also called with directory names.
That makes it a little more complicated.
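
One way around it might be something like this: just a sketch of what I'm considering, assuming the filter picks up a Configuration through Configurable so it can stat the path (isDir() because we're on the old API):

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class LzoFilter extends Configured implements PathFilter {

    @Override
    public boolean accept(Path path) {
        try {
            if (getConf() != null) {
                // Never filter out directories themselves; only plain files
                // get the name-based rules below.
                FileSystem fs = path.getFileSystem(getConf());
                if (fs.getFileStatus(path).isDir()) {
                    return true;
                }
            }
        } catch (IOException e) {
            // If we can't stat the path, fall back to name-based filtering.
        }
        String name = path.getName();
        return !name.startsWith(".") &&
               !name.startsWith("_") &&
               !name.endsWith(".lzo.index");
    }
}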

Dmitriy Ryaboy

Dec 10, 2013, 7:41:51 PM
to elephant...@googlegroups.com, rajitha r
I am told CIF doesn't call the wrapped input formats to get splits, thus avoiding their path filters. No idea why someone thought that was a good idea :-/

Jeremy Davis

Mar 7, 2014, 7:04:28 PM
to elephant...@googlegroups.com, rajitha r
I have some time to start looking back into this issue.
Has anyone conclusively solved the LZO + CFIF issue?

-JD

Jeremy Davis

Mar 7, 2014, 7:07:34 PM
to elephant...@googlegroups.com, rajitha r
Specifically with Cascading.

Viswanathan J

Mar 8, 2014, 12:17:13 PM
to elephant...@googlegroups.com, rajitha r

I am getting the following error when running jobs in Hadoop/Pig:

java.lang.Exception: java.lang.RuntimeException: java.io.IOException: No codec for file


Caused by: java.io.IOException: No codec for file 

	at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:176)
	at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:256)

Please help.

Viswanathan J

Mar 8, 2014, 12:18:03 PM
to elephant...@googlegroups.com, rajitha r
I am getting the above issue with Hadoop 2 / Pig.

Dmitriy Ryaboy

Mar 10, 2014, 11:45:07 AM
to elephant...@googlegroups.com, rajitha r
Getting "no codec" is just an issue of not having the codec configured, and is not relevant to the file-combination issue.
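
If it helps, the usual fix is making sure the hadoop-lzo codecs are registered in the configuration, something along these lines (a sketch only; the class and method names here are made up, and the exact codec list depends on your setup):

import org.apache.hadoop.conf.Configuration;

public class LzoCodecSetup {

  // Registers the LZO codecs so Hadoop's CompressionCodecFactory can resolve
  // .lzo files. This normally lives in core-site.xml, but setting it on the
  // job configuration also works.
  public static void registerLzoCodecs(Configuration conf) {
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
      + "org.apache.hadoop.io.compress.GzipCodec,"
      + "com.hadoop.compression.lzo.LzoCodec,"
      + "com.hadoop.compression.lzo.LzopCodec");
    conf.set("io.compression.codec.lzo.class",
        "com.hadoop.compression.lzo.LzoCodec");
  }
}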

After taking a long look at the implementation of Hadoop's CombineFileInputFormat, which underlies this feature in Cascading, I've come to the conclusion that it's deficient at a fundamental level -- it's very naive about the formats it wraps, and doesn't delegate to the wrapped format where it should. It's also not written in such a way that this behavior is easily modifiable by extending or wrapping the class. My conclusion is that to make this work properly in Cascading or elsewhere, the whole CFIF needs to be replaced with a different implementation which would do the appropriate delegation.

Shouldn't be a terribly large amount of work, but no one has stepped up so far.

D





--
Dmitriy V Ryaboy
Data Platform @ Twitter
http://twitter.com/squarecog

Jeremy Davis

Mar 10, 2014, 1:56:55 PM
to elephant...@googlegroups.com, rajitha r
That's what I was afraid of. 
-JD