Malformed Sentence.csv file

Jerrad

Mar 15, 2017, 4:04:51 PM
to tatoeba
I am trying to parse the sentences.csv file from the download group, and I noticed that running csvformat on it (from the Python csvkit package) fails with a message saying that one of the fields is 13000 characters too long. On closer inspection, it looks like an end-of-line is being misread: the parser treats several lines as a single field instead of reading one line at a time.

Can anyone else confirm or deny?

Jerrad

Alan F

Mar 15, 2017, 7:30:02 PM
to tatoeba...@googlegroups.com
First of all, many thanks for letting me know about csvkit. It's a nice package, and lets me do things I thought I'd have to write my own code to do.

Yes, I can confirm that I get that message when I execute utilities from that package, but it goes away when I use "-u 3" (or "--quoting 3") as an argument. I got that solution from nealmcb's comment on this Stack Overflow answer:

http://stackoverflow.com/a/18408911/523124

For instance, this command:

csvgrep -r "three months[?!]" -c 1 sentences.csv

gives this output:

CSV contains fields longer than maximum length of 131072 characters. Try raising the maximum with the field_size_limit parameter, or try setting quoting=csv.QUOTE_NONE.

But this command:

csvgrep -r "three months[?!]" -c 1 -u 3 sentences.csv

works just fine.
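
For anyone wondering what the 3 stands for: csvkit's -u / --quoting values correspond to the quoting constants in Python's standard csv module, and 3 is csv.QUOTE_NONE, the setting named in the error message above. A small sketch to list them (the QUOTING_STYLES name is just something I made up for illustration):

```python
import csv

# Hypothetical helper table: the four classic quoting constants from the
# standard library, which are the values that -u 0..3 select.
QUOTING_STYLES = {
    name: getattr(csv, name)
    for name in ("QUOTE_MINIMAL", "QUOTE_ALL", "QUOTE_NONNUMERIC", "QUOTE_NONE")
}

for name, value in QUOTING_STYLES.items():
    print(value, name)
```

So "-u 3" tells the parser to treat the " character as ordinary text rather than as a field delimiter.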

If that doesn't work for you, let me know.

Hope this helps,
Alan

Gilles Bedel

Mar 15, 2017, 9:17:00 PM
to tatoeba...@googlegroups.com
Hello Jerrad,

It seems the problem is related to quotes (the " character) inside
sentences. They are interpreted as field quotation marks even though
they are not meant to be. For instance, there is a sentence that goes like this:

282871 eng When he said "water," she gave him water.

When csvgrep parses this line, it thinks a quoted field has started
and keeps reading across the lines that follow. As Alan suggested, try
adding the option -u 3 to csvgrep.
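
Here is a rough sketch of the effect using Python's own csv module, which csvkit builds on. The three rows in SAMPLE are made up for the demonstration, not taken from the export, and I am assuming the file is tab-separated. As far as I can tell, in Python's parser it is a quote at the very start of a field that opens a quoted field, so the first made-up sentence begins with an unclosed quote:

```python
import csv
import io

# Made-up, tab-separated rows in the shape "id<TAB>lang<TAB>text".
# The first sentence starts with a quote that is never closed on its line.
SAMPLE = (
    '1\teng\t"Stop!\n'
    '2\teng\tHe left.\n'
    '3\teng\tShe said "yes" twice.\n'
)

def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter="\t", quoting=quoting)
    return list(reader)

# Default quoting: the opening quote swallows the following lines
# until the next quote character turns up.
default_rows = parse(SAMPLE, csv.QUOTE_MINIMAL)

# QUOTE_NONE (what "-u 3" selects): quotes are plain characters.
literal_rows = parse(SAMPLE, csv.QUOTE_NONE)

print(len(default_rows), "row(s) with default quoting")
print(len(literal_rows), "row(s) with QUOTE_NONE")
```

With default quoting the three lines collapse into a single row; with QUOTE_NONE each line stays its own row and the quotes are kept in the text. On the full file the same effect can make one field grow past the size limit.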

— gillux