Bug in TextDelimited Scheme quote char support for some field combinations


Mayilazhagan K

Feb 7, 2012, 9:15:52 AM
to cascading-user
Hi,

I have a .csv file delimited by commas. Some data fields contain a comma
as part of the value, and those fields are enclosed in double quotes;
the rest of the fields are not quoted. For the combination below, the
row fields are extracted incorrectly.

"a",b,,"d1,d2",3

I tested this combination with TextDelimitedTest in the Cascading test
project, using the test method testQuotedTextAll, and I get an error.

I replaced data line 3 with my data.

delimited.txt
------------------

foo,bar,baz,bin,1
foo,"bar",baz,bin,2
"a",b,,"d1,d2",3
foo,"bar"",bar",baz,bin,4
foo,"bar"""",bar",baz,bin,5
,"",baz,,6
,,,,7
foo,,,,8
,"",,,9
"f",,,,"10"
"f",,,",bin","11"
"f",,,",bin","11"

Below is the error in parsing.

2012-02-07 19:32:21,155 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(256)) - job_local_0001
cascading.operation.OperationException: number of input tuple values: 4, does not match destination array size: 5
    at cascading.tuple.Tuples.asArray(Tuples.java:48)
    at cascading.scheme.TextDelimited.sink(TextDelimited.java:670)
    at cascading.tap.Tap.sink(Tap.java:280)
    at cascading.flow.stack.SinkMapperStackElement.operateSink(SinkMapperStackElement.java:95)
    at cascading.flow.stack.SinkMapperStackElement.collect(SinkMapperStackElement.java:72)
    at cascading.flow.stack.FlowMapperStack.map(FlowMapperStack.java:220)
    at cascading.flow.FlowMapper.map(FlowMapper.java:75)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2012-02-07 19:32:25,799 WARN flow.FlowStep (FlowStep.java:logWarn(643)) - [pipe] task completion events identify failed tasks

Has this bug been addressed, and is a fix available?

I am using Cascading 1.2.3 currently.

Thanks,
Mayilazhagan.K

Chris K Wensel

Feb 7, 2012, 12:10:27 PM
to cascadi...@googlegroups.com
I'll see if I can add this as a test and resolve it simply, but note that TextDelimited is only intended to handle the best case plus some variations. Otherwise the regex we use would be too slow for any particular use other than passing the tests.

If data is fairly complicated, it makes sense to build your own parsing rules in the Flow to cleanse the data.

chris

> --
> You received this message because you are subscribed to the Google Groups "cascading-user" group.
> To post to this group, send email to cascadi...@googlegroups.com.
> To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.
>

--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com

Ken Krugler

Feb 7, 2012, 12:23:21 PM
to cascadi...@googlegroups.com
For more complex CSV handling, we use the au.com.bytecode.opencsv.CSVParser class in a custom Cascading Function.

-- Ken



--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Chris K Wensel

Feb 7, 2012, 12:28:40 PM
to cascadi...@googlegroups.com
Ken,

Do you know if it's faster than TextDelimited by any margin?

chris

Ken Krugler

Feb 7, 2012, 1:27:13 PM
to cascadi...@googlegroups.com
On Feb 7, 2012, at 9:28am, Chris K Wensel wrote:

> Ken
>
> do you know if its faster than TextDelimited by any margin?

No - haven't done any speed comparisons.

My guess is that it would be slower, since it has to do a bit more work to maintain state to handle quoting/escaping properly.

Mayilazhagan K

Feb 7, 2012, 10:28:16 PM
to cascading-user
Chris,

I have tested with this data as well:

a,b,,"d1,d2",3

This also does not work.

My feeling is that quoted text doesn't work if an empty field is
present immediately before it. Maybe the regular expression has to be
tweaked.

Our requirement is quite simple: the record is delimited by commas,
and if a comma is present inside a field's data, that field is
enclosed in double quotes.

But the quote char option doesn't support this, and this issue makes
it unusable.

If you have any thoughts on progress, please share.

I have two thoughts so far:
i) Use either OpenCSV (as Ken suggested) or the JavaCSV API in a
custom TextDelimited implementation in the meantime.
ii) Tweak the existing TextDelimited regex, which looks fairly
complicated at first sight, to handle this bug.

I need help on the second thought especially.
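
On the regex option, one way to sidestep a single large split expression is to match one field at a time. The following is only a rough standalone sketch of that idea in plain Java; it is not the regex TextDelimited uses, and the class name RegexSplitSketch, the FIELD pattern, and the hard-coded comma and double-quote characters are assumptions made for illustration. Each match consumes either a quoted field (with "" as an escaped quote) or an unquoted field, followed by a comma or the end of the line. Empty fields come back as empty strings here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Cascading.
class RegexSplitSketch {
    // group 1: contents of a quoted field, group 2: an unquoted field,
    // group 3: the trailing comma, or empty when the end of the line was reached.
    private static final Pattern FIELD =
        Pattern.compile("(?:\"((?:[^\"]|\"\")*)\"|([^,\"]*))(,|$)");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<String>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            // Anchor each attempt at the current position so malformed input is not skipped.
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("malformed field at index " + pos);
            }
            String quoted = m.group(1);
            // Inside a quoted field, a doubled quote stands for one literal quote.
            fields.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // the field ended at the end of the line
            }
            pos = m.end();
        }
        return fields;
    }

    public static void main(String[] args) {
        // The two rows reported in this thread.
        System.out.println(split("\"a\",b,,\"d1,d2\",3"));
        System.out.println(split("a,b,,\"d1,d2\",3"));
    }
}
```

Because every field is matched on its own, an empty field in front of a quoted one is just another match. Whether this is fast enough compared with the existing regex has not been measured.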

Thanks
Mayil

Chris K Wensel

Feb 7, 2012, 11:00:12 PM
to cascadi...@googlegroups.com
It will be faster if you use a custom solution.

I'm unlikely to add any more edge cases to the current regex (see below), so I need more time to evaluate the impact of the change.

chris


Mayilazhagan K

Feb 8, 2012, 3:53:28 AM
to cascading-user
Chris,

As an interim fix, I have extended TextDelimited with a custom class,
overridden the createSplit method, and ignored cleanSplit, so that it
uses stream-based parsing instead of the regex.

The createSplit method is based on the OpenCSV parsing logic, with
modifications to emit null for empty fields.
However, there may be a performance impact compared to the regex.

// Requires: import java.util.ArrayList; import java.util.List;
private String[] createSplit(String line) {
    if (line == null) {
        return null;
    }

    List<String> tokensOnThisLine = new ArrayList<String>();
    StringBuilder sb = new StringBuilder();
    boolean inQuotes = false;

    for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);
        if (c == getQuoteChar()) {
            // This gets complex: the quote may end a quoted block or escape
            // another quote, so do a 1-char lookahead.
            if (inQuotes // we are in quotes, so there can be escaped quotes in here
                    && line.length() > (i + 1) // there is another character to check
                    && line.charAt(i + 1) == getQuoteChar()) { // and that char is a quote also
                // Two quote chars in a row == one quote char, so consume them
                // both and put one on the token. We do *not* exit the quoted text.
                sb.append(line.charAt(i + 1));
                i++;
            } else {
                inQuotes = !inQuotes;
                // The tricky case of an embedded quote in the middle: a,bc"d"ef,g
                if (i > 2 // not at the beginning of the line
                        && line.charAt(i - 1) != getDelimiterChar() // not at the beginning of an escape sequence
                        && line.length() > (i + 1)
                        && line.charAt(i + 1) != getDelimiterChar()) { // not at the end of an escape sequence
                    sb.append(c);
                }
            }
        } else if (c == getDelimiterChar() && !inQuotes) {
            // Emit null rather than an empty string for an empty field.
            tokensOnThisLine.add(sb.length() == 0 ? null : sb.toString());
            sb = new StringBuilder(); // start work on the next token
        } else {
            sb.append(c);
        }
    }

    if (inQuotes) {
        // The line ended inside a quoted section: re-append the newline.
        sb.append("\n");
    }
    // The last token is added as-is, so an empty last field stays an empty string.
    tokensOnThisLine.add(sb.toString());
    return tokensOnThisLine.toArray(new String[0]);
}
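
To try this kind of character-by-character splitting outside a Flow, a simplified standalone version can be run against the rows from this thread. The sketch below is an illustration under stated assumptions, not the exact method above: the class name QuotedSplitSketch is made up, the delimiter and quote characters are hard-coded instead of coming from getDelimiterChar() and getQuoteChar(), the embedded-quote heuristic is left out (a stray quote inside an unquoted field is simply dropped), and every empty field, including the last one, becomes null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone helper for illustration only; not part of Cascading.
class QuotedSplitSketch {
    private static final char DELIMITER = ',';
    private static final char QUOTE = '"';

    // Splits one line into fields; empty fields are returned as null.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == QUOTE) {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == QUOTE) {
                    // A doubled quote inside a quoted section is one literal quote.
                    current.append(QUOTE);
                    i++;
                } else {
                    // Opening or closing quote: toggles state, not part of the value.
                    inQuotes = !inQuotes;
                }
            } else if (c == DELIMITER && !inQuotes) {
                fields.add(current.length() == 0 ? null : current.toString());
                current.setLength(0); // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.length() == 0 ? null : current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String[] rows = {
            "\"a\",b,,\"d1,d2\",3",
            "a,b,,\"d1,d2\",3",
            "foo,\"bar\"\",bar\",baz,bin,4",
            ",,,,7"
        };
        for (String row : rows) {
            List<String> fields = split(row);
            System.out.println(fields.size() + " fields: " + fields);
        }
    }
}
```

Each of the four sample rows should come back with five fields, which is the count the sink in the stack trace earlier in the thread expected.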

Chris K Wensel

Feb 8, 2012, 11:38:38 AM
to cascadi...@googlegroups.com
That's great!

It would be wonderful if you packaged that up and put it on the conjars.org Maven repo; I'll then link to it from the .org site.

Or just stick it on GitHub and I can link to it.

chris

