Bug in TextDelimited Scheme for quotes char support - For Some combination
Subject: Re: Bug in TextDelimited Scheme for quotes char support - For Some combination
From: Chris K Wensel <ch...@wensel.net>
Date: Tue, 7 Feb 2012 09:10:27 -0800
To: cascading-user@googlegroups.com
I'll see if I can add this as a test and resolve it simply, but know that
TextDelimited is only intended to handle the best case plus some variations.
Otherwise the regex we use would be too slow for any particular use
other than passing the tests.
If the data is fairly complicated, it makes sense to build your own parsing
rules in the Flow to cleanse the data.
chris
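
To illustrate the "build your own parsing rules" suggestion, here is a minimal sketch of a quote-aware line splitter in plain Java. It is not Cascading's TextDelimited code and uses no Cascading API; the class name QuotedCsvSplitter and the method split are hypothetical, and wiring it into a custom operation in the Flow is left out. It assumes a field may optionally be wrapped in the quote character, that a doubled quote inside a quoted field stands for a literal quote, and that a record never spans lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cascading.
final class QuotedCsvSplitter {

    /** Splits one line on delimiter, honouring quoted fields and doubled quotes. */
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        current.append(quote); // doubled quote: keep one literal quote
                        i++;                   // skip the second quote of the pair
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.append(c);         // delimiters inside quotes are data
                }
            } else if (c == quote) {
                inQuotes = true;               // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString()); // end of field (may be empty)
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());        // last field on the line
        return fields;
    }

    public static void main(String[] args) {
        // The row from the report below: the quoted comma stays inside one field.
        List<String> row = split("\"a\",b,,\"d1,d2\",3", ',', '"');
        System.out.println(row.size() + " fields: " + row);
    }
}
```

A character-by-character scan like this avoids the regex cost mentioned above, at the price of maintaining the parsing rules yourself. Checking the field count of each row against the expected number of columns before sinking would also surface malformed lines early.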
On Feb 7, 2012, at 6:15 AM, Mayilazhagan K wrote:
> Hi,
>
> I have a .csv file delimited by commas. Some data fields contain a comma as
> part of the value; those fields are enclosed in double quotes, i.e., only
> fields that contain a comma as part of the data value are wrapped in
> double quotes. The rest of the fields are not quoted. For the
> combination below, the row fields are extracted incorrectly.
>
> "a",b,,"d1,d2",3
>
> I tested this combination with TextDelimitedTest from the
> Cascading-Test project, using the test method
> testQuotedTextAll, and I am getting an error.
>
> I replaced data line 3 with my data.
>
> delimited.txt
> ------------------
>
> foo,bar,baz,bin,1
> foo,"bar",baz,bin,2
> "a",b,,"d1,d2",3
> foo,"bar"",bar",baz,bin,4
> foo,"bar"""",bar",baz,bin,5
> ,"",baz,,6
> ,,,,7
> foo,,,,8
> ,"",,,9
> "f",,,,"10"
> "f",,,",bin","11"
> "f",,,",bin","11"
>
> Below is the error in parsing.
>
> 2012-02-07 19:32:21,155 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(256)) - job_local_0001
> cascading.operation.OperationException: number of input tuple values: 4, does not match destination array size: 5
>     at cascading.tuple.Tuples.asArray(Tuples.java:48)
>     at cascading.scheme.TextDelimited.sink(TextDelimited.java:670)
>     at cascading.tap.Tap.sink(Tap.java:280)
>     at cascading.flow.stack.SinkMapperStackElement.operateSink(SinkMapperStackElement.java:95)
>     at cascading.flow.stack.SinkMapperStackElement.collect(SinkMapperStackElement.java:72)
>     at cascading.flow.stack.FlowMapperStack.map(FlowMapperStack.java:220)
>     at cascading.flow.FlowMapper.map(FlowMapper.java:75)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 2012-02-07 19:32:25,799 WARN flow.FlowStep (FlowStep.java:logWarn(643)) - [pipe] task completion events identify failed tasks
>
> Has this bug been addressed, and is a fix available?
>
> I am currently using Cascading 1.2.3.
>
> Thanks,
> Mayilazhagan.K
--
Chris K Wensel
ch...@concurrentinc.com
http://concurrentinc.com