Issues with DCFIF and LZO

jdavis....@gmail.com

Nov 13, 2014, 8:31:53 PM
to elephant...@googlegroups.com

Hello all,
I'm using 4.6rc6, and I'm running into an issue with LZO w/ CFIF.
If all the files are smaller than my combine size, then everything works as expected.
_I believe_ the problem occurs when one of the input files is larger than my target split size.

In my case I targeted 256MB, and some of the files are 330MB.
I either get the lzo -6 error, or it appears that I lose the newline and pick up partway into the next record, which Cascading catches for me.

I'm working on a test case...
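
While I work on that, here's the arithmetic I suspect is in play (a sketch only; the 256MB/330MB numbers are from above, and I'm assuming CFIF carves oversized files into fixed-size ranges at raw byte offsets, the way plain CombineFileInputFormat does):

public class SplitArithmetic {
  public static void main(String[] args) {
    long targetSplitSize = 256L * 1024 * 1024; // my combine target
    long fileLength      = 330L * 1024 * 1024; // one oversized .lzo input

    for (long offset = 0; offset < fileLength; offset += targetSplitSize) {
      long length = Math.min(targetSplitSize, fileLength - offset);
      System.out.printf("split: offset=%d length=%d%n", offset, length);
    }
    // The second range starts at byte 268435456. Unless an index maps that
    // offset back to an LZO block boundary, a reader starting there sees a
    // partial block (lzo1x_decompress_safe -6) or starts mid-line, which
    // would explain both stack traces below.
  }
}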


cascading.flow.FlowException: internal error during mapper execution
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:148)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:282)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1117)
	at org.apache.hadoop.mapred.Child.main(Child.java:271)
Caused by: java.lang.InternalError: lzo1x_decompress_safe returned: -6
	at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
	at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:315)
	at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
	at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:247)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
	at java.io.InputStream.read(InputStream.java:101)
	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:294)
	at util.lzo.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:79)
	at com.uss.utils.lzo.CompositeRecordReader.nextKeyValue(CompositeRecordReader.java:84)
	at util.lzo.DeprecatedInputFormatWrapper$RecordReaderWrapper.next(DeprecatedInputFormatWrapper.java:330)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:227)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:212)
	at cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:61)
	at cascading.scheme.hadoop.TextDelimited.source(TextDelimited.java:1005)
	at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:163)
	at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:136)
	at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
	at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
	... 7 more



cascading.tuple.TupleException: unable to read from input identifier: 'unknown'
	at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:149)
	at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
	at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:282)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1117)
	at org.apache.hadoop.mapred.Child.main(Child.java:271)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 8, got: 14:ACCOUNT:123735572,15:42:06,20131211,1,EMAIL_OPEN_SALE,LC,12.11.13 lc  2TWACCOUNT:100883092,12:02:18,20140111,1,EMAIL_SEND_CORP_OTHER,NM,01.11.14 nm johnny was - remainder,
	at cascading.scheme.util.DelimitedParser.onlyParseLine(DelimitedParser.java:404)
	at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:341)
	at cascading.scheme.hadoop.TextDelimited.source(TextDelimited.java:1015)
	at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:163)
	at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:136)
	... 10 more

Dmitriy Ryaboy

unread,
Nov 13, 2014, 8:44:45 PM11/13/14
to elephant...@googlegroups.com
Are you using elephantbird?

The stack trace indicates that you are not -- though the name DeprecatedInputFormatWrapper suggests whatever you are using might have at some point forked off EB:

at util.lzo.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:79)
	at com.uss.utils.lzo.CompositeRecordReader.nextKeyValue(CompositeRecordReader.java:84)
	at util.lzo.DeprecatedInputFormatWrapper$RecordReaderWrapper.next(DeprecatedInputFormatWrapper.java:330)
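
If it did fork off EB at some point, the first thing I'd check is that the job is wired to the real classes. Purely illustrative (the forked names are from your trace; the elephant-bird package is the one the library ships):

// before (the fork your trace is running):
import util.lzo.DeprecatedInputFormatWrapper;

// after (the class elephant-bird ships):
import com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper;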


--
Dmitriy V Ryaboy
Data Platform @ Twitter
http://twitter.com/squarecog

dataso...@gmail.com

Nov 14, 2014, 12:22:20 AM
to elephant...@googlegroups.com
Hmm… I will double-check my wiring.

dataso...@gmail.com

Nov 14, 2014, 2:53:22 AM
to elephant...@googlegroups.com
Yep… wrong import from our earlier attempts.
Thanks a bunch for this library!
All these little files have been a pain in my butt!

CombinedSequenceFile for intermediate was an unexpected bonus.

I'll let you know if I run into any problems.

Regards,
-JD

dataso...@gmail.com

Nov 15, 2014, 11:56:54 AM
to elephant...@googlegroups.com
Working on a concise repro…
But this happens with CFIF on, and not with it off. All files pass "lzop -t".

-JD

java.lang.RuntimeException: Could not read first record (and it was not an EOF)
	at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.initKeyValueObjects(DeprecatedInputFormatWrapper.java:280)
	at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.createKey(DeprecatedInputFormatWrapper.java:291)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.createKey(MapTask.java:203)
	at cascading.tap.hadoop.util.MeasuredRecordReader.createKey(MeasuredRecordReader.java:76)
	at cascading.scheme.hadoop.TextLine.sourcePrepare(TextLine.java:410)
	at cascading.scheme.hadoop.TextDelimited.sourcePrepare(TextDelimited.java:995)
	at cascading.tuple.TupleEntrySchemeIterator.<init>(TupleEntrySchemeIterator.java:107)
	at cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:49)
	at cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:44)
	at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:518)
	at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:109)
	at cascading.flow.stream.SourceStage.map(SourceStage.java:74)
	at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
	at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:282)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1117)
	at org.apache.hadoop.mapred.Child.main(Child.java:271)
Caused by: java.io.IOException: Invalid LZO header
	at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:116)
	at com.hadoop.compression.lzo.LzopInputStream.<init>(LzopInputStream.java:55)
	at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:105)
	at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:113)
	at com.twitter.elephantbird.mapreduce.input.LzoRecordReader.initialize(LzoRecordReader.java:93)
	at com.twitter.elephantbird.mapreduce.input.combine.CompositeRecordReader$DelayedRecordReader.createRecordReader(CompositeRecordReader.java:72)
	at com.twitter.elephantbird.mapreduce.input.combine.CompositeRecordReader.nextKeyValue(CompositeRecordReader.java:120)
	at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.initKeyValueObjects(DeprecatedInputFormatWrapper.java:271)
	... 20 more
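
Since "Invalid LZO header" means the first bytes of some stream aren't the lzop magic, even though every .lzo file passes "lzop -t", I'm going to dump the headers of everything the input glob actually matches. Quick throwaway sketch (the magic constant is the standard lzop one; the class is mine):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.util.Arrays;

public class LzopMagicCheck {
  // lzop file magic: 0x89 'L' 'Z' 'O' \0 \r \n 0x1a \n
  private static final byte[] LZOP_MAGIC = {
      (byte) 0x89, 'L', 'Z', 'O', 0x00, 0x0d, 0x0a, 0x1a, 0x0a
  };

  public static void main(String[] args) throws Exception {
    for (String name : args) {
      byte[] head = new byte[LZOP_MAGIC.length];
      try (DataInputStream in = new DataInputStream(new FileInputStream(name))) {
        in.readFully(head); // throws EOFException for files shorter than the magic
      }
      System.out.println(name + ": "
          + (Arrays.equals(head, LZOP_MAGIC) ? "lzop" : "NOT lzop"));
    }
  }
}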



Dmitriy Ryaboy

Nov 17, 2014, 3:08:06 PM
to elephant...@googlegroups.com
How are you loading? I wonder if you are picking up the index files themselves with CFIF, rather than just the lzo files.
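
If that's it, putting a path filter in front of the input glob should make it go away. A sketch against the stock Hadoop hook (the class name is mine; with the old API I believe you register it via FileInputFormat.setInputPathFilter on the JobConf):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Keep the .lzo data files (and directories); drop the .index companions.
public class LzoDataOnlyFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    return !path.getName().endsWith(".index");
  }
}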

dataso...@gmail.com

Nov 18, 2014, 1:25:10 PM
to elephant...@googlegroups.com
Can't repro yet… I'm hoping it never comes back and that it was user error.
(unrelated) I did notice that this counter never gets updated.
Map input bytes    0    0    0

-JD