On Sun, Aug 26, 2012 at 11:11 PM, Randy Olson <
rhi...@gmail.com> wrote:
> Probably the easiest thing to do is pre-process the MySQL tab-separated
> files and strip all of the \r\n & \\r\n. Something like:
>
>> open file for reading and writing & read it into memory
>> replace all \r\n and \\r\n with empty strings
>> overwrite file with processed text
>> close file
>>
>> open file with read_csv
>
>
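The pre-processing steps Randy sketches above could look something like the following. This is a minimal sketch, assuming the whole dump fits in memory and that records are terminated by a bare "\n" (stripping "\r\n" would otherwise destroy the row separators too); the filename is a placeholder.

```python
def strip_embedded_newlines(path):
    # Read the raw dump; newline="" disables newline translation so the
    # literal "\r\n" pairs survive into the string.
    with open(path, "r", newline="") as f:
        text = f.read()

    # Remove the escaped form ("\\r\\n", i.e. backslash-r-backslash-n in
    # the file) first, so the plain "\r\n" pass does not leave stray
    # backslashes behind; then remove the literal CR/LF pairs.
    text = text.replace("\\r\\n", "").replace("\r\n", "")

    # Overwrite the file with the cleaned text.
    with open(path, "w", newline="") as f:
        f.write(text)

# afterwards: read the cleaned file with read_csv(path, sep="\t")
```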
> On Friday, August 24, 2012 10:41:43 AM UTC-4, Emanuele wrote:
>>
>> Hi,
>>
>> I'm trying to parse a MySQL TAB-separated csv dump of a db table with
>> read_csv. Unfortunately for some records the textual fields have newlines
>> ("\r\n") inside - sometimes even escaped ("\\r\n"). These newlines interfere
>> with parsing, in the sense that the csv parser stops reading the record as
>> soon as a newline is encountered. This behavior is usually correct but in
>> some cases the point where it stops is the middle of a text field of a
>> record, and not its end. As you can imagine in these cases parsing goes
>> wrong and usually throws exceptions after a little while.
>>
>> A way to overcome this issue could be to assume that a record consists of
>> a known number of TAB-separated values. So the parser would go on reading a
>> record until an appropriate number of TABs are found [0]. This is different
>> from the current assumption, i.e. records are separated by newlines.
>>
>> Is there a way to tell read_csv to act in these different ways?
>>
>> Best,
>>
>> Emanuele
>>
>> [0]: of course there still would be the problem on where to stop when the
>> last TAB is found because the last field could contain newlines etc. This is
>> not my case, so this potential issue is not a problem for me.
>
Are the fields containing the newlines quoted? It may be a limitation
of Python's built-in CSV parser, but it could simply be a dialect
option. Have you looked at the csv module documentation?
- Wes
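For reference, a minimal sketch of what the quoting question is about: if the fields with embedded newlines are quoted in the dump, the stdlib csv module can parse them directly with the right dialect options, no pre-processing needed. The sample data here is invented.

```python
import csv
import io

# Invented sample: the second field of the first record contains a
# literal "\r\n", but the field is quoted, so the parser should not
# treat that newline as a record separator.
data = 'a\t"line one\r\nline two"\tb\r\nc\td\te\r\n'

reader = csv.reader(io.StringIO(data), delimiter="\t", quotechar='"')
rows = list(reader)
# rows[0] keeps the embedded newline inside its middle field instead
# of splitting the record in two.
```

If the fields are quoted like this, recent versions of read_csv should accept matching `sep` and `quotechar` arguments; if they are not quoted at all, something like the pre-processing approach is probably unavoidable.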