Malformed Sentence.csv file


Mar 15, 2017, 4:04:51 PM
to tatoeba
I am trying to parse the sentence.csv file from the download group, and I noticed that when I run csvformat on it (from Python's csvkit), it fails and reports that one of the fields is 13000 characters too long. On closer inspection, it appears that an end-of-line is being misread: the parser is counting several lines as one field instead of treating each line separately.

Can anyone else confirm or deny?


Alan F

Mar 15, 2017, 7:30:02 PM
First of all, many thanks for letting me know about csvkit. It's a nice package, and lets me do things I thought I'd have to write my own code to do.

Yes, I can confirm that I get that message when I execute utilities from that package, but it goes away when I use "-u 3" (or "--quoting 3") as an argument. I got that solution from nealmcb's comment on this Stack Overflow answer:

For instance, this command:

csvgrep -r "three months[?!]" -c 1 sentences.csv

gives this output:

CSV contains fields longer than maximum length of 131072 characters. Try raising the maximum with the field_size_limit parameter, or try setting quoting=csv.QUOTE_NONE.

But this command:

csvgrep -r "three months[?!]" -c 1 -u 3 sentences.csv

works just fine.

If that doesn't work for you, let me know.

Hope this helps,

Gilles Bedel

Mar 15, 2017, 9:17:00 PM
Hello Jerrad,

It seems the problem is related to quotes (the " character) inside
sentences. They are interpreted as field quoting even though they are
not meant to be. For instance, there is a sentence that goes like this:

282871 eng When he said "water," she gave him water.

When csvgrep parses this line, it thinks a quoted field has started
and that it runs on over the following lines. As Alan suggested, try
adding the option -u 3 to csvgrep.
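
For anyone who wants to see the effect outside csvkit, here is a minimal sketch using Python's standard csv module. The value 3 passed to -u is the same number as csv.QUOTE_NONE. The sample rows are made up for illustration (they are not taken from the real file), and the tab delimiter is an assumption about the file layout. The sample puts an unclosed quote at the very start of the text field, which is where, as far as I understand it, the csv reader begins a quoted field.

```python
import csv
import io

# Hypothetical sample rows (id, language, text), assumed to be tab-separated.
# The text of row 2 starts with a double quote that is never closed.
SAMPLE = (
    '1\teng\tFirst sentence.\n'
    '2\teng\t"Water, he said.\n'
    '3\teng\tThird sentence.\n'
)


def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return list(reader)


# Default quoting: the opening quote starts a quoted field, so the
# reader keeps consuming the following line(s) as part of that field.
default_rows = parse(SAMPLE, csv.QUOTE_MINIMAL)

# QUOTE_NONE: quote characters are treated as ordinary text, so every
# physical line stays its own row.
literal_rows = parse(SAMPLE, csv.QUOTE_NONE)

print(len(default_rows), "rows with default quoting")
print(len(literal_rows), "rows with csv.QUOTE_NONE")
```

On the real file, the runaway field can grow past the reader's size limit, which would explain the "fields longer than maximum length" message. Turning quoting off avoids the problem at its source, whereas raising the limit would only hide it.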

— gillux