Defect found in TextDelimited?

Michael Peterson

Jun 15, 2013, 2:31:39 PM
to cascadi...@googlegroups.com
How does one file a defect report for Cascading? I looked at the GitHub repo, but didn't find a way to post an issue, so I'm sending it here.  Let me know what else I should do with it.

First, I'm using Cascading (Java API) 2.1.2.

Here is the issue I'm seeing.  I'm using a TextDelimited scheme with the "quote" parameter so that it strips the quotes as it reads the Tuple values in.  I hit a snag where when there is an unclosed quote, Cascading hangs.  It doesn't do this for most input with an unclosed quote, but here is one case where it does:

74,"F","5",,"1","28465",,"AK","012",103373031,\N,\N,"2","Non-health care facility point of origin","3","Urgent","02/05/2012",6,\N,23,\N,947,\N,13,"Medicine",4,"A","MA","75757",0,"02/01/2012","51612",0,0,,111,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
58,"M","5",,"1","28461",,"AK","012",103373031,\N,\N,"2","Non-health care facility point of origin","4","Urgent","02/02/2012",2,"Discharged/transferred to a short term general hospital for inpatient care",10,"Endocrine,637,\N,13,"Medicine",1,"B","CI","56592",0,"03/11/2011","43434",0,0,,112,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N

There are two lines here; both have 48 entries.  I have de-identified the data.  The first line parses fine.  The second line does not, because the entry "Endocrine has no closing double quote.  If I close the double quote, it parses fine.  I would expect Cascading to throw an error, but instead it runs indefinitely, consuming 100% CPU.

This defect occurs both in local mode and in Hadoop mode (using an Hfs Tap).  I've seen it in local mode on both a Windows machine and a CentOS system.

Here is the code I used:

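    // Imports needed by this snippet (Cascading 2.x).
    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    String inputPath = "enc.csv";
    String outPath = "local.out";
    LocalFlowConnector flowConnector = new LocalFlowConnector();

    // One field name per column, 48 in total.
    Fields f = new Fields("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14",
            "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26",
            "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40",
            "41", "42", "43", "44", "45", "46", "47", "48");

    // Source: comma-delimited, with the double quote as the quote character.
    Tap<?,?,?> doc1 = new FileTap( new cascading.scheme.local.TextDelimited(f, false, ",", "\""), inputPath );

    Pipe pipe = new Pipe("in");

    // Sink: write all fields back out, comma-delimited, replacing any existing output.
    Tap<?,?,?> outTap = new FileTap(new cascading.scheme.local.TextDelimited(Fields.ALL, false, ","), outPath, SinkMode.REPLACE);

    FlowDef flowDef = FlowDef.flowDef().addSource(pipe, doc1)
            .addTailSink(pipe, outTap);
    Flow<?> flow = flowConnector.connect(flowDef);
    flow.complete();
    System.out.println("DONE");  // never executes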

The Hadoop/Hfs code is identical except for the FlowConnector and Hfs Tap.


Interestingly, if I shorten the number of entries per Tuple to, say, 33 by taking away the first 15 entries (while still leaving "Endocrine without a closing quote), it no longer hangs and instead throws an error (the error below is from running in local mode):

cascading.tuple.TupleException: unable to read from input identifier: enc.csv
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 33, got: 28:"Urgent","02/02/2012",2,"Discharged/transferred to a short term general hospital for inpatient care",10,"Endocrine,637,\N,13,"Medicine",1,"B","CI","56592",0,"03/11/2011","43434",0,0,,112,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
    at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
    at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    ... 8 more
[pool-3-thread-1] ERROR cascading.flow.stream.SourceStage - caught throwable
cascading.tuple.TupleException: unable to read from input identifier: enc.csv
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 33, got: 28:"Urgent","02/02/2012",2,"Discharged/transferred to a short term general hospital for inpatient care",10,"Endocrine,637,\N,13,"Medicine",1,"B","CI","56592",0,"03/11/2011","43434",0,0,,112,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
    at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
    at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    ... 8 more
[flow] INFO cascading.flow.Flow - [] stopping all jobs
[flow] INFO cascading.flow.FlowStep - [] stopping: local
[flow] INFO cascading.flow.Flow - [] stopped all jobs
[flow] INFO cascading.flow.Flow - [] shutting down job executor
[flow] INFO cascading.flow.Flow - [] shutdown complete
Exception in thread "main" cascading.flow.FlowException: local step failed
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:204)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: cascading.tuple.TupleException: unable to read from input identifier: enc.csv
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
    ... 5 more
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 33, got: 28:"Urgent","02/02/2012",2,"Discharged/transferred to a short term general hospital for inpatient care",10,"Endocrine,637,\N,13,"Medicine",1,"B","CI","56592",0,"03/11/2011","43434",0,0,,112,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
    at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
    at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    ... 8 more   

Chris K Wensel

Jun 15, 2013, 6:36:06 PM
to cascadi...@googlegroups.com
> I hit a snag where when there is an unclosed quote, Cascading hangs.

if it's a malformed csv file it won't get parsed. can't really expect it would.

the solution is to use TextLine and build out your own parsing mechanism in the assembly/flow. use that flow to clean your data.

ckw


Michael Peterson

Jun 15, 2013, 7:16:02 PM
to cascadi...@googlegroups.com
Your proposed alternative in terms of what I should do seems fine, but the fact that the Cascading job in this scenario goes into an infinite loop that never returns seems like an issue that should be addressed.  An error being thrown would be much preferable.

Chris K Wensel

Jun 15, 2013, 8:56:48 PM
to cascadi...@googlegroups.com
> Your proposed alternative in terms of what I should do seems fine, but the fact that the Cascading job in this scenario goes into an infinite loop that never returns seems like an issue that should be addressed.  An error being thrown would be much preferable.

fair, 

but I suspect the infinite loop is within the regex (Pattern/Matcher) code in the java libraries, a stack trace would be helpful.

Michael Peterson

Jun 15, 2013, 11:11:28 PM
to cascadi...@googlegroups.com
Here's a thread dump taken with jstack while the job is churning away.  As you suspected, the loop appears to be in the java.util.regex.Pattern code called from the DelimitedParser.createSplit method.

Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.6-b01 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007fb00c003800 nid=0x109d waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"pool-3-thread-1" prio=10 tid=0x00007fb00c001800 nid=0x107c runnable [0x00007fb011e66000]
   java.lang.Thread.State: RUNNABLE
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3760)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Neg.match(Pattern.java:4598)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4304)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4314)
    at java.util.regex.Pattern$Prolog.match(Pattern.java:4251)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Pos.match(Pattern.java:4572)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Start.match(Pattern.java:3055)
    at java.util.regex.Matcher.search(Matcher.java:1105)
    at java.util.regex.Matcher.find(Matcher.java:535)
    at java.util.regex.Pattern.split(Pattern.java:1000)
    at cascading.scheme.util.DelimitedParser.createSplit(DelimitedParser.java:209)
    at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:290)
    at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x00000000bd08a328> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"pool-2-thread-1" prio=10 tid=0x00007fb0040ac800 nid=0x107b waiting on condition [0x00007fb011f6b000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000bd08a2c0> (a java.util.concurrent.FutureTask$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at cascading.flow.local.planner.LocalStepRunner.call(LocalStepRunner.java:103)
    at cascading.flow.local.planner.LocalStepRunner.call(LocalStepRunner.java:43)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x00000000bd017ba0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"pool-1-thread-1" prio=10 tid=0x00007faffc010800 nid=0x107a waiting on condition [0x00007fb01206c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at cascading.flow.planner.FlowStepJob.sleepForPollingInterval(FlowStepJob.java:318)
    at cascading.flow.planner.FlowStepJob.blockTillCompleteOrStopped(FlowStepJob.java:247)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:191)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x00000000bcfcca18> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"Timer-0" daemon prio=10 tid=0x00007faffc00d000 nid=0x1079 in Object.wait() [0x00007fb01216d000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000bcfc5e68> (a java.util.TaskQueue)
    at java.util.TimerThread.mainLoop(Timer.java:509)
    - locked <0x00000000bcfc5e68> (a java.util.TaskQueue)
    at java.util.TimerThread.run(Timer.java:462)

   Locked ownable synchronizers:
    - None

"flow" prio=10 tid=0x00007fb028122000 nid=0x1078 waiting on condition [0x00007fb02410d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000bcfcc9b0> (a java.util.concurrent.FutureTask$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:205)
    at cascading.management.UnitOfWorkExecutorStrategy.start(UnitOfWorkExecutorStrategy.java:45)
    at cascading.flow.BaseFlow.spawnJobs(BaseFlow.java:1121)
    at cascading.flow.BaseFlow.run(BaseFlow.java:1065)
    at cascading.flow.BaseFlow.access$100(BaseFlow.java:77)
    at cascading.flow.BaseFlow$1.run(BaseFlow.java:749)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - None

"Low Memory Detector" daemon prio=10 tid=0x00007fb028089000 nid=0x1076 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"C2 CompilerThread1" daemon prio=10 tid=0x00007fb028086800 nid=0x1075 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"C2 CompilerThread0" daemon prio=10 tid=0x00007fb028084000 nid=0x1074 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Signal Dispatcher" daemon prio=10 tid=0x00007fb028082000 nid=0x1073 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Finalizer" daemon prio=10 tid=0x00007fb028065800 nid=0x1072 in Object.wait() [0x00007fb02475d000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000bca01300> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <0x00000000bca01300> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

   Locked ownable synchronizers:
    - None

"Reference Handler" daemon prio=10 tid=0x00007fb028063800 nid=0x1071 in Object.wait() [0x00007fb02485e000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000bca011d8> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x00000000bca011d8> (a java.lang.ref.Reference$Lock)

   Locked ownable synchronizers:
    - None

"main" prio=10 tid=0x00007fb028008000 nid=0x106c in Object.wait() [0x00007fb02c623000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000bcf70880> (a java.lang.Thread)
    at java.lang.Thread.join(Thread.java:1186)
    - locked <0x00000000bcf70880> (a java.lang.Thread)
    at java.lang.Thread.join(Thread.java:1239)
    at cascading.flow.BaseFlow.complete(BaseFlow.java:801)
    at thornydev.Main.v15_TextDelimitedQuoteBugLocalTapCommaDelim(Main.java:109)
    at thornydev.Main.main(Main.java:53)

   Locked ownable synchronizers:
    - None

"VM Thread" prio=10 tid=0x00007fb02805d000 nid=0x1070 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fb028094000 nid=0x1077 waiting on condition

JNI global references: 971
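
The dump above shows the worker thread deep inside java.util.regex.Pattern, reached through Pattern.split. One possible way to turn such a runaway match into an error is to hand the regex engine a CharSequence that checks a deadline on every character read. The sketch below is only an illustration using the JDK; the names BoundedRegexSplit, DeadlineCharSequence, and RegexTimeoutException are hypothetical, this is not how Cascading's DelimitedParser is written, and the "," pattern is a stand-in rather than Cascading's actual split pattern.

```java
import java.util.regex.Pattern;

public class BoundedRegexSplit {

    /** Hypothetical exception, thrown when a regex operation runs past its deadline. */
    static class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence wrapper that checks a deadline on every character access. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads characters constantly while backtracking,
            // so this check fires soon after the deadline passes.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Splits the line with the pattern, failing if matching runs longer than timeoutMillis. */
    static String[] split(Pattern pattern, String line, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.split(new DeadlineCharSequence(line, deadline));
    }

    public static void main(String[] args) {
        // Stand-in pattern for illustration only.
        String[] parts = split(Pattern.compile(","), "a,b,c", 1000);
        System.out.println(parts.length);
    }
}
```

The deadline is only checked when the engine reads a character, which is enough for a backtracking loop like the one in the dump. The caller still has to decide what to do with the rejected line.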

Jeremy Davis

Apr 7, 2015, 4:04:16 PM
to cascadi...@googlegroups.com, midpe...@gmail.com
We are running into this from time to time as well, and it tends to bring down the entire pipeline. Have any changes, fixes, or workarounds been made since this message?
For us, I think it would be enough to shim in a check before the regex parse that looks for an invalid number of quotes and then throws out the line entirely.
Any thoughts?
-JD
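
A minimal sketch of the pre-check described above, assuming that "an invalid number of quotes" means an odd count of quote characters on the line. The class and method names (QuoteCheck, hasBalancedQuotes) are hypothetical and not part of Cascading; where such a check would be wired in (for example, a filter ahead of the parse) is left open here.

```java
/** Hypothetical helper for illustration; not part of the Cascading API. */
final class QuoteCheck {
    private QuoteCheck() {}

    /**
     * Returns true when the line holds an even number of quote characters.
     * Assumption: quotes are escaped by doubling ("") if at all, so a
     * well-formed line always has an even count. Backslash-escaped quotes
     * would need a different rule.
     */
    static boolean hasBalancedQuotes(String line, char quote) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == quote) {
                count++;
            }
        }
        return count % 2 == 0;
    }
}
```

A line for which hasBalancedQuotes(line, '"') returns false could then be dropped or logged before it ever reaches the regex-based parse. Note that an even count does not prove the line is well formed; it only catches the unclosed-quote case reported in this thread.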

Oscar Boykin

Apr 7, 2015, 4:07:00 PM4/7/15
to cascadi...@googlegroups.com, midpe...@gmail.com
Side note: regexes are often very slow. A parser that avoids regular expressions will probably speed things up.

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/8654a013-5acf-4a14-b82b-49e9d6ea1486%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Oscar Boykin :: @posco :: http://twitter.com/posco

Jeremy Davis

Apr 7, 2015, 4:13:24 PM4/7/15
to cascadi...@googlegroups.com, midpe...@gmail.com
Agreed,
I don't believe we are using a regex intentionally. It looks like the quote/escape parsing apparently uses one.

-JD

Ken Krugler

Apr 7, 2015, 5:08:19 PM4/7/15
to cascadi...@googlegroups.com
In general we avoid using TextDelimited when sourcing potentially messy (quoted, invalid) text files.

We use OpenCSV - see http://opencsv.sourceforge.net/

Currently via a base Function that we extend as needed.

But it would be useful to have this as a regular scheme.

-- Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Cyrille Chépélov

Apr 7, 2015, 5:48:22 PM4/7/15
to cascadi...@googlegroups.com
Ran into this too a couple of months ago. Ended up writing a non-regex parser, which works fine for what we're doing here. I never got around to submitting a proper pull request, for lack of time to write the necessary test cases, but now that we're knee-deep in it, the need to clean this up has started to creep up…

(On a regular Cascading workload, I guess putting the attached file in src/main/java/* should be enough. On our Scalding workload, we're doing it with a copy-paste implementation of Tsv/Csv/etc., in addition to putting this in src/main/java/* alongside the (Scala) app.)

    -- Cyrille
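
For readers without the attachment, here is a rough sketch of what a non-regex, quote-aware line splitter can look like. It is not the attached SafeDelimitedParser.java (whose contents are not shown in this thread), and the class name SimpleQuotedLineParser is made up for illustration. It assumes a single-character delimiter and quote, treats a doubled quote inside a quoted field as a literal quote, and throws on an unclosed quote instead of looping.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-pass splitter for illustration; not the attached class. */
final class SimpleQuotedLineParser {
    private SimpleQuotedLineParser() {}

    /** Splits one line into fields, stripping the surrounding quotes. */
    static List<String> parse(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    // A doubled quote inside a quoted field is a literal quote.
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        current.append(quote);
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == quote) {
                inQuotes = true;
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }

        // Fail fast on malformed input rather than hanging.
        if (inQuotes) {
            throw new IllegalArgumentException("Unclosed quote in line: " + line);
        }
        fields.add(current.toString());
        return fields;
    }
}
```

Because every character is visited exactly once, the running time is linear in the line length, so a malformed line such as the "Endocrine one from the original report produces an exception rather than a spin at 100% CPU.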


--

Cyrille CHÉPÉLOV
Chief Innovation Officer

Transparency Rights Management
15 rue Jean-Baptiste Berlier - Hall B, 75013 Paris
T : +33 1 84 16 52 74 / F : +33 1 84 17 83 34

SafeDelimitedParser.java

Koert Kuipers

Apr 7, 2015, 5:51:42 PM4/7/15
to cascadi...@googlegroups.com


Ken Krugler

Apr 7, 2015, 7:49:01 PM4/7/15
to cascadi...@googlegroups.com
Thanks Koert!

BTW, I'm trying to remember if throwing an exception from a source Scheme always kills the Flow, even if there's a trap on the pipe.

-- Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Koert Kuipers

Apr 8, 2015, 12:46:51 PM4/8/15
to cascadi...@googlegroups.com
We have not used traps. I am not sure either. Let me know if there is a better way to do this.

Luis Casillas

Apr 16, 2015, 2:42:16 PM4/16/15
to cascadi...@googlegroups.com
We've also had problems with TextDelimited that we've had to work around.  Strictly speaking a separate scheme is not necessary; the TextDelimited scheme supports pluggable DelimitedParser objects, and although this class can be a bit messy to deal with, it's the quickest fix.  

We wrote the following class to deal with both the hanging issue and some problems where Redshift did not like the CSV output from TextDelimited.  Warning: it's a total hack, and I make no promise that it's suitable for any purpose:

