Scalding Csv reader breaks when a record contains a double quote

Haidar Hadi

Sep 9, 2013, 10:43:35 PM
to cascadi...@googlegroups.com
I modified the sample data file https://github.com/twitter/scalding/blob/develop/tutorial/data/phones.txt 
from 
john smith 5551212 30 US
harry bovik 4122680000 55 US 
jane doe 4125551212 40 CN

to
john smith"s 5551212 30 US
harry bovik 4122680000 55 US 
jane doe 4125551212 40 CN 

and then I ran Tutorial6:

$ scripts/scald.rb --local tutorial/Tutorial6.scala
compiling tutorial/Tutorial6.scala
scalac -classpath /tmp/maven/hadoop-core-0.20.2.jar:/tmp/maven/slf4j-log4j12-1.6.6.jar:/tmp/maven/log4j-1.2.15.jar:/tmp/maven/commons-httpclient-3.1.jar:/tmp/maven/commons-cli-1.2.jar:/tmp/maven/zookeeper-3.3.4.jar:/root/.sbt/boot/scala-2.10.0/lib/scala-library.jar:/root/.sbt/boot/scala-2.10.0/lib/scala-reflect.jar:/mnt/storage1/scala/myprj7/scalding-develop/scalding-core/target/scala-2.10/scalding-core-assembly-0.8.5.jar -d /tmp/script-build tutorial/Tutorial6.scala
13/09/09 19:41:54 INFO property.AppProps: using app.id: D3BF88BADED0BF6E107309480870DE83
13/09/09 19:41:54 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6] starting
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6]  source: FileTap["TextDelimited[['first', 'last', 'phone', 'age', 'country']]"]["tutorial/data/phones.txt"]
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6]  sink: FileTap["TextDelimited[[UNKNOWN]->[ALL]]"]["tutorial/data/output6.tsv"]
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6]  parallel execution is enabled: true
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6]  starting jobs: 1
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6]  allocating threads: 1
13/09/09 19:41:54 INFO flow.FlowStep: [Tutorial6] starting step: local
13/09/09 19:41:54 ERROR stream.TrapHandler: caught Throwable, no trap available, rethrowing
cascading.tuple.TupleException: unable to read from input identifier: tutorial/data/phones.txt
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 5, got: 4:john smith"s,5551212,30,US
        at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
        at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
        at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
        ... 8 more
13/09/09 19:41:54 ERROR stream.SourceStage: caught throwable
cascading.tuple.TupleException: unable to read from input identifier: tutorial/data/phones.txt
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 5, got: 4:john smith"s,5551212,30,US
        at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
        at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
        at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
        ... 8 more
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6] stopping all jobs
13/09/09 19:41:54 INFO flow.FlowStep: [Tutorial6] stopping: local
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6] stopped all jobs
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6] shutting down job executor
13/09/09 19:41:54 INFO flow.Flow: [Tutorial6] shutdown complete
cascading.flow.FlowException: local step failed
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:208)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: cascading.tuple.TupleException: unable to read from input identifier: tutorial/data/phones.txt
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
        ... 5 more
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 5, got: 4:john smith"s,5551212,30,US
        at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
        at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
        at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
        ... 8 more
If you know what exactly caused this error, please consider contributing to GitHub via following link.



Is this a bug, or is there a better way to split the record?
Thanks,
Haidar.

William Briggs

Sep 9, 2013, 10:51:19 PM
to cascadi...@googlegroups.com
I'm mostly a lurker on this list, but at a glance, it likely expects quotation marks to be escaped with a backslash - this is pretty typical for delimited file formats.

-Will


Haidar Hadi

Sep 9, 2013, 11:00:08 PM
to cascadi...@googlegroups.com
I tried that:
John smith\"s 5551212 30 US
and still got exactly the same error.

Haidar.
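
A note on the numbers in the trace: "expected: 5, got: 4", with the first value echoed as john smith"s, is what one would see if the parser only splits on a separator that is followed by an even number of quote characters. The sketch below is plain Scala and only models that assumed behavior; it is not Cascading's DelimitedParser code, and the name QuoteAwareSplit is made up for illustration. In this model a backslash is not an escape character, so the quote count stays odd and the result is the same, which would match the "exactly the same error" report.

```scala
import java.util.regex.Pattern

// Hypothetical helper, not part of Scalding or Cascading.
// Assumption: the separator only counts as a field boundary when the
// remainder of the line contains an even number of double quotes.
object QuoteAwareSplit {
  // Lookahead: zero or more quote pairs, then no further quotes up to end of line.
  private val evenQuotesAhead = """(?=(?:[^"]*"[^"]*")*[^"]*$)"""

  def split(line: String, separator: String): Array[String] =
    line.split(Pattern.quote(separator) + evenQuotesAhead, -1)

  def main(args: Array[String]): Unit = {
    // Unmodified record: 5 fields.
    println(split("john smith 5551212 30 US", " ").mkString(","))
    // Stray quote: the first space is followed by an odd number of quotes,
    // so it is not a boundary and only 4 fields come back.
    println(split("john smith\"s 5551212 30 US", " ").mkString(","))
    // Backslash before the quote: still one quote, still 4 fields.
    println(split("John smith\\\"s 5551212 30 US", " ").mkString(","))
  }
}
```

If this model is right, a record with balanced quotes would parse into the expected five columns again, since every separator would then be followed by an even number of quotes.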

Mansur Ashraf

Sep 11, 2013, 4:55:08 PM
to cascadi...@googlegroups.com
Your record is either tab (\t) or space separated. Try changing the separator and it should work:

Csv("input", separator = "\t").read

Haidar Hadi

Sep 11, 2013, 10:26:56 PM
to cascadi...@googlegroups.com
Thank you, Mansur, for answering. Unfortunately, it is still not working.

I changed the reader to:
  Csv("tutorial/data/phones.txt", separator = "\t", fields = Schema)

but when I ran the code, I got:


$ scripts/scald.rb --local tutorial/Tutorial6.scala
compiling tutorial/Tutorial6.scala
scalac -classpath /tmp/maven/hadoop-core-0.20.2.jar:/tmp/maven/slf4j-log4j12-1.6.6.jar:/tmp/maven/log4j-1.2.15.jar:/tmp/maven/commons-httpclient-3.1.jar:/tmp/maven/commons-cli-1.2.jar:/tmp/maven/zookeeper-3.3.4.jar:/root/.sbt/boot/scala-2.10.0/lib/scala-library.jar:/root/.sbt/boot/scala-2.10.0/lib/scala-reflect.jar:/mnt/storage1/scala/myprj7/scalding-develop/scalding-core/target/scala-2.10/scalding-core-assembly-0.8.5.jar -d /tmp/script-build tutorial/Tutorial6.scala
13/09/11 19:21:14 INFO property.AppProps: using app.id: 2E0FA8BA367CAD527CB38AC08161813F
13/09/11 19:21:14 INFO util.Version: Concurrent, Inc - Cascading 2.1.6
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6] starting
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6]  source: FileTap["TextDelimited[['first', 'last', 'phone', 'age', 'country']]"]["tutorial/data/phones.txt"]
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6]  sink: FileTap["TextDelimited[[UNKNOWN]->[ALL]]"]["tutorial/data/output6.tsv"]
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6]  parallel execution is enabled: true
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6]  starting jobs: 1
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6]  allocating threads: 1
13/09/11 19:21:14 INFO flow.FlowStep: [Tutorial6] starting step: local
13/09/11 19:21:14 ERROR stream.TrapHandler: caught Throwable, no trap available, rethrowing
cascading.tuple.TupleException: unable to read from input identifier: tutorial/data/phones.txt
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 5, got: 1:john smith"s 5551212 30 US
        at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
        at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
        at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
        ... 8 more
13/09/11 19:21:14 ERROR stream.SourceStage: caught throwable
cascading.tuple.TupleException: unable to read from input identifier: tutorial/data/phones.txt
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 5, got: 1:john smith"s 5551212 30 US
        at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
        at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
        at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
        ... 8 more
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6] stopping all jobs
13/09/11 19:21:14 INFO flow.FlowStep: [Tutorial6] stopping: local
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6] stopped all jobs
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6] shutting down job executor
13/09/11 19:21:14 INFO flow.Flow: [Tutorial6] shutdown complete
cascading.flow.FlowException: local step failed
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:208)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:145)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:120)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: cascading.tuple.TupleException: unable to read from input identifier: tutorial/data/phones.txt
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
        ... 5 more
Caused by: cascading.tap.TapException: did not parse correct number of values from input data, expected: 5, got: 1:john smith"s 5551212 30 US
        at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:297)
        at cascading.scheme.local.TextDelimited.source(TextDelimited.java:567)
        at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
        at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
        ... 8 more
If you know what exactly caused this error, please consider contributing to GitHub via following link.
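
The "got: 1" in this second trace most likely just means the line contains no tab at all: the value echoed in the message is the whole line, so this copy of phones.txt appears to be space separated, and the double quote is still there to trip the parser once the separator is set back to a space. One possible workaround, sketched below in plain Scala, is to split each line with no quote handling at all. PlainSplit, PhoneRecord and parsePhoneRecord are hypothetical names; the five-column layout is taken from the field list in the log ('first', 'last', 'phone', 'age', 'country').

```scala
import scala.util.Try

// Hypothetical helper, not part of Scalding or Cascading.
object PlainSplit {
  // phone stays a String: 4122680000 does not fit in an Int.
  final case class PhoneRecord(first: String, last: String, phone: String, age: Int, country: String)

  // Splits on single spaces and treats quotes as ordinary characters.
  // Returns None when the line does not have exactly five fields
  // or the age column is not an integer.
  def parsePhoneRecord(line: String): Option[PhoneRecord] =
    line.trim.split(" ", -1) match {
      case Array(first, last, phone, age, country) =>
        Try(age.toInt).toOption.map(a => PhoneRecord(first, last, phone, a, country))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    println(parsePhoneRecord("john smith\"s 5551212 30 US"))
    println(parsePhoneRecord("harry bovik 4122680000 55 US "))
  }
}
```

In a Scalding job, a function like this could be applied to lines read as raw text (for example with TextLine) instead of going through Csv. Whether Csv in this Scalding version offers a way to switch quote handling off is not confirmed anywhere in this thread.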

