Re: read_csv : escaped newlines in some columns


Emanuele

Aug 24, 2012, 10:47:52 AM
to pyd...@googlegroups.com
On Friday, August 24, 2012 4:41:43 PM UTC+2, Emanuele wrote:
> Hi,
>
> I'm trying to parse a MySQL TAB-separated csv dump of a db table with read_csv. Unfortunately for some records the textual fields have newlines ("\r\n") inside - sometimes even escaped ("\\r\t").

Obviously I meant "\\r\n" :)

Best,

Emanuele

Wes McKinney

Aug 27, 2012, 3:38:19 PM
to pyd...@googlegroups.com
On Sun, Aug 26, 2012 at 11:11 PM, Randy Olson <rhi...@gmail.com> wrote:
> Probably the easiest thing to do is pre-process the MySQL tab-separated
> files and strip all of the \r\n & \\r\n. Something like:
>
>> open file for reading and writing & read it into memory
>> replace all \r\n and \\r\n with empty strings
>> overwrite file with processed text
>> close file
>>
>> open file with read_csv
>
>
> On Friday, August 24, 2012 10:41:43 AM UTC-4, Emanuele wrote:
>>
>> Hi,
>>
>> I'm trying to parse a MySQL TAB-separated csv dump of a db table with
>> read_csv. Unfortunately for some records the textual fields have newlines
>> ("\r\n") inside - sometimes even escaped ("\\r\t"). These newlines interfere
>> with parsing, in the sense that the csv parser stops reading the record as
>> soon as a newline is encountered. This behavior is usually correct but in
>> some cases the point where it stops is the middle of a text field of a
>> record, and not its end. As you can imagine in these cases parsing goes
>> wrong and usually throws exceptions after a little while.
>>
>> A way to overcome this issue could be to assume that a record consists of
>> a known number of TAB-separated values. So the parser would go on reading a
>> record until an appropriate number of TABs are found [0]. This is different
>> from the current assumption, i.e. records are separated by newlines.
>>
>> Is there a way to tell read_csv to act in these different ways?
>>
>> Best,
>>
>> Emanuele
>>
>> [0]: of course there would still be the problem of where to stop once the
>> last TAB is found, because the last field could contain newlines etc. This is
>> not my case, so this potential issue is not a problem for me.
>
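The preprocessing steps quoted above might look roughly like this (a sketch only; the exact escape tokens depend on how mysqldump wrote the file):

```python
def strip_newlines(text):
    """Remove escaped and literal CR/LF sequences, per the steps above.

    Rough sketch: the tokens below are assumptions about the dump format,
    and note that this also removes the record terminators, which turns
    out to be a problem (see the replies below).
    """
    for token in ("\\r\\n", "\r\n"):  # escaped sequence first, then literal
        text = text.replace(token, "")
    return text
```

For example, `strip_newlines("a\\r\\nb\r\nc")` collapses both kinds of break and returns `"abc"`.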

Are the fields containing the newlines quoted? It may be a limitation
of Python's built-in CSV parser, but it could be simply a dialect
option. Have you looked at the csv module documentation?

- Wes

Emanuele Olivetti

Aug 27, 2012, 6:45:14 PM
to pyd...@googlegroups.com
Randy and Wes, thanks for your answers.

@Randy: if I remove all newlines then I guess read_csv will not work, because
it will never find where a record ends. As far as I've seen, you cannot specify the
number of fields when reading a csv file.

@Wes: I read the csv module's doc but did not find a quick solution to my problem.

At the moment I am writing a custom parser and trying to pipe data to pandas.
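Roughly, it follows the tab-counting idea from my first message; a sketch, assuming fields never contain TABs and the broken field is never the last one:

```python
import io
import pandas as pd

def records_by_tab_count(text, n_fields):
    """Join physical lines until a record holds n_fields - 1 TABs."""
    buf, tabs = [], 0
    for line in text.splitlines():
        buf.append(line)
        tabs += line.count("\t")
        if tabs >= n_fields - 1:
            yield " ".join(buf)  # fold the broken line back into its field
            buf, tabs = [], 0

# Toy data: the middle field of record 2 is split by a stray newline.
raw = "1\tgood\tx\n2\tline one\nline two\tok\n3\tfine\ty\n"
clean = "\n".join(records_by_tab_count(raw, n_fields=3))
df = pd.read_csv(io.StringIO(clean), sep="\t", header=None)
# df.iloc[1, 1] is "line one line two"
```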

If you have further ideas and hints, do not hesitate to send them here!

Best,

Emanuele



Randy Olson

Aug 27, 2012, 7:09:23 PM
to pyd...@googlegroups.com
Then only replace all \r and \\r?


--
Randal S. Olson

Computer Science PhD Student
Michigan State University

Emanuele Olivetti

Aug 28, 2012, 3:56:52 AM
to pyd...@googlegroups.com
There are (sometimes) "\\r\n" and "\r\n" in a certain text field, and there is "\r\n" at
the end of each record. If I remove "\\r" and "\r" I am still left with "\n" both
in the troubled field and at the end of each record, so the situation would be
ambiguous anyway.

Thanks for your hints. In the meanwhile I have set up a custom parser specific to my
data which works decently. Anyway, in the future it would be nice to add further flexibility
to read_csv in order to improve the already awesome pandas! :)

Best,

Emanuele

