More issues after switching to cascading 2.2.1 with cascading.hadoop.hfs.combine.files=true


jd

Dec 4, 2013, 5:44:46 PM
to cascadi...@googlegroups.com
Love this feature, this is going to make a huge difference for us.

I'm running into an issue, though (it goes away if I toggle combine.files off).
In our case these files are lzo compressed and are indexed.
My guess is that it is either trying to read an index as a data file, or it is not using the indexes correctly once it spans files.
Perhaps our Lzo Text Delimited Scheme needs some updating?

-JD



2013-12-04 14:15:25,005 WARN cascading.flow.stream.TrapHandler: exception trap on branch: 'mtinput_/attribution/H11001/etl/output/201312021545/mt/coremetric', for [uninitialized]
cascading.tuple.TupleException: unable to read from input identifier: 'unknown'
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
    at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:127)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 6, got: 4:[binary data from a .lzo.index file read as text; elided]
    at cascading.scheme.util.DelimitedParser.onlyParseLine(DelimitedParser.java:404)
    at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:341)
    at cascading.scheme.hadoop.TextDelimited.source(TextDelimited.java:1008)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    ... 10 more



Jeremy Davis

Dec 6, 2013, 7:56:29 PM
to cascadi...@googlegroups.com
The issue is that CombineFileInputFormat does not exclude .lzo.index files.
To work around this I used the property "mapred.input.pathFilter.class".
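For reference, a minimal sketch of such a filter (the class name is hypothetical; the property just has to name a class implementing Hadoop's PathFilter):

import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical filter; register it with
// -Dmapred.input.pathFilter.class=com.example.LzoIndexPathFilter
class LzoIndexPathFilter extends PathFilter {
  // Accept everything except the .lzo.index side files, so only the
  // actual .lzo data files are treated as input.
  override def accept(path: Path): Boolean =
    !path.getName.endsWith(".lzo.index")
}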
Is there another suggested approach?

-JD

Sangjin Lee

Dec 10, 2013, 1:54:53 PM
to cascadi...@googlegroups.com
That sounds like as reasonable a solution as any other. One way to think about the combine.files support in Cascading is that it basically enables Hadoop's CombineFileInputFormat for your data.

As such, I suspect you could reproduce the same issue with a vanilla MR job (without Cascading) that uses CombineFileInputFormat in the same manner against your lzo data. CombineFileInputFormat treats all files in the input paths as data. One effective way to filter out the index files is to use mapred.input.pathFilter.class.

Another solution would be to create a CombineFileInputFormat subclass specific to the lzo data that filters out the index files directly, and use that as your input format class instead of the combine.files support; see the sketch below.
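A rough, untested sketch (old mapred API, matching Hadoop 1.x per the stack trace; the class name is hypothetical, and getRecordReader for your lzo records is left abstract):

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapred.lib.CombineFileInputFormat

// Drop .lzo.index entries before splits are computed, so only real
// data files get combined.
abstract class LzoCombineFileInputFormat[K, V] extends CombineFileInputFormat[K, V] {
  override protected def listStatus(job: JobConf): Array[FileStatus] =
    super.listStatus(job).filterNot(_.getPath.getName.endsWith(".lzo.index"))
}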

Let me know if that makes sense.

Thanks,
Sangjin


On Fri, Dec 6, 2013 at 4:13 PM, jd <jerd...@speakeasy.net> wrote:
The issue here is that CombineFileInputFormat.listStatus() is not excluding the .lzo.index files.
As a workaround I set the property mapred.input.pathFilter.class to point to a class that implements a proper PathFilter.
Is there a better way?

-JD

On Wednesday, December 4, 2013 2:44:46 PM UTC-8, jd wrote:


Jeremy Davis

Dec 10, 2013, 2:37:49 PM
to cascadi...@googlegroups.com
Thanks for the reply.
I did consider subclassing; I was just wondering if a path had already been taken here.

Serega Sheypak

Jul 20, 2015, 5:12:44 AM
to cascadi...@googlegroups.com
Hi, what is the right way to pass these properties to a Scalding job: -Dproperty=value or -D property=value?
I'm using hadoop com.twitter.scalding.Tool to run my Scalding job.
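For example, something like this (jar and job class names are placeholders; I'm assuming Hadoop's ToolRunner/GenericOptionsParser picks up the -D options, which have to come before the job class):

hadoop jar my-scalding-job.jar com.twitter.scalding.Tool \
  -Dcascading.hadoop.hfs.combine.files=true \
  -Dcascading.hadoop.hfs.combine.max.size=268435456 \
  com.example.MyJob --hdfs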

On Tuesday, December 10, 2013 at 8:37:49 PM UTC+1, Jeremy Davis wrote:

Serega Sheypak

Jul 20, 2015, 7:50:41 AM
to cascadi...@googlegroups.com
This is the right way:

override def config: Map[AnyRef, AnyRef] = {
  if (combineInput.nonEmpty) {
    // Enable Cascading's combine-files support and cap combined splits
    // at 256 MB (268435456 bytes).
    super.config ++ Map(
      "cascading.hadoop.hfs.combine.files" -> "true",
      "cascading.hadoop.hfs.combine.max.size" -> "268435456")
  } else {
    super.config
  }
}

Set these properties before running the job; the config override is picked up when Scalding builds the flow.

On Monday, July 20, 2015 at 11:12:44 AM UTC+2, Serega Sheypak wrote: