
Dealing with \r in CSV fields in Python2.4


Tim Chase

Sep 4, 2013, 11:04:03 AM
to pytho...@python.org
I've got some old 2.4 code (requires an external lib that hasn't been
upgraded) that needs to process a CSV file where some of the values
contain \r characters. It appears that in more recent versions (just
tested in 2.7; docs suggest this was changed in 2.5), Python does the
Right Thing™ and just creates values in the row containing that \r.
However, in 2.4, the csv module chokes on it with

_csv.Error: newline inside string

as demoed by the example code at the bottom of this email. What's the
best way to deal with this? At the moment, I'm just using something
like

def unCR(f):
    for line in f:
        yield line.replace('\r', '')

f = file('input.csv', 'rb')
for row in csv.reader(unCR(f)):
    code_to_process(row)

but this throws away data that I'd really prefer to keep if possible.

I know 2.4 isn't exactly popular, and in an ideal world, I'd just
upgrade to a later 2.x version that does what I need. Any old-time
2.4 pythonistas have sage advice for me?

-tkc


from cStringIO import StringIO
import csv
f = file('out.txt', 'wb')
w = csv.writer(f)
w.writerow(["One", "Two"])
w.writerow(["First\rSecond", "Third"])
f.close()

f = file('out.txt', 'rb')
r = csv.reader(f)
for i, row in enumerate(r): # works in 2.7, fails in 2.4
    print repr(row)
f.close()


Skip Montanaro

Sep 4, 2013, 11:20:36 AM
to Tim Chase, Python
> _csv.Error: newline inside string

How are the lines actually terminated, with \r\n or with just \n? If
it's just \n, what happens if you specify \n as the line terminator?

Skip
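
One way to answer that question is to count the terminators in the raw bytes. This is a minimal sketch for a current Python (the b'' literals are not valid in 2.4); the function name count_line_endings and the sample data are illustrative assumptions, not something from the thread:

```python
def count_line_endings(data):
    # Count CRLF pairs, bare LFs, and bare CRs in a bytes object.
    crlf = data.count(b'\r\n')
    lf = data.count(b'\n') - crlf   # '\n' not preceded by '\r'
    cr = data.count(b'\r') - crlf   # '\r' not followed by '\n'
    return {'crlf': crlf, 'lf': lf, 'cr': cr}


# Sample shaped like the file from the original post, with DOS line endings.
sample = b'One,Two\r\n"First\rSecond",Third\r\n'
counts = count_line_endings(sample)
```

For a real file, the bytes would come from reading it in binary mode. A nonzero 'cr' count next to a nonzero 'crlf' count suggests bare carriage returns inside fields rather than old Mac-style line endings.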

MRAB

Sep 4, 2013, 11:31:06 AM
to pytho...@python.org
On 04/09/2013 16:04, Tim Chase wrote:
> I've got some old 2.4 code (requires an external lib that hasn't been
> upgraded) that needs to process a CSV file where some of the values
> contain \r characters. It appears that in more recent versions (just
> tested in 2.7; docs suggest this was changed in 2.5), Python does the
> Right Thing™ and just creates values in the row containing that \r.
> However, in 2.4, the csv module chokes on it with
>
> _csv.Error: newline inside string
>
> as demoed by the example code at the bottom of this email. What's the
> best way to deal with this? At the moment, I'm just using something
> like
>
> def unCR(f):
>     for line in f:
>         yield line.replace('\r', '')
>
> f = file('input.csv', 'rb')
> for row in csv.reader(unCR(f)):
>     code_to_process(row)
>
> but this throws away data that I'd really prefer to keep if possible.
>
> I know 2.4 isn't exactly popular, and in an ideal world, I'd just
> upgrade to a later 2.x version that does what I need. Any old-time
> 2.4 pythonistas have sage advice for me?
>
[snip]
You could try replacing the '\r' with another character that doesn't
appear elsewhere and then change it back afterwards.

MARKER = '\x01'

def cr_to_marker(f):
    for line in f:
        yield line.replace('\r', MARKER)

def marker_to_cr(item):
    return item.replace(MARKER, '\r')

f = file('out.txt', 'rb')
r = csv.reader(cr_to_marker(f))
for i, row in enumerate(r): # works in 2.7, fails in 2.4
    row = [marker_to_cr(item) for item in row]
    print repr(row)
f.close()

Which OS are you using? On Windows the lines (rows) end with '\r\n', so
the last item of each row will end with '\r', which you'll need to
strip off. (That would be a problem only if the last item of a row
could end with '\r'.)
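
If the rows end with '\r\n', one variation is to leave the row terminator out of the replacement and swap only the carriage returns inside the row. This is a sketch under stated assumptions: MARKER never occurs in the data, embedded carriage returns are bare '\r' rather than '\r\n', and the names cr_to_marker_keep_eol and read_rows are made up for illustration. It sticks to syntax that should be valid in 2.4, but it has not been run there:

```python
import csv

MARKER = '\x01'  # assumption: this character never appears in the real data


def cr_to_marker_keep_eol(lines):
    # Replace '\r' inside each line with MARKER and emit a plain '\n' ending,
    # so the row terminator itself is never turned into a marker.
    for line in lines:
        if line.endswith('\r\n'):
            body, eol = line[:-2], '\n'
        elif line.endswith('\n'):
            body, eol = line[:-1], '\n'
        else:
            body, eol = line, ''   # last line without a terminator
        yield body.replace('\r', MARKER) + eol


def read_rows(lines):
    # Parse the marked-up lines and restore '\r' in every field.
    rows = []
    for row in csv.reader(cr_to_marker_keep_eol(lines)):
        rows.append([item.replace(MARKER, '\r') for item in row])
    return rows


# Lines as a file opened with 'rb' would yield them when split on '\n'.
rows = read_rows(['One,Two\r\n', '"First\rSecond",Third\r\n'])
```

Because the terminator is split off before the replacement, the last item of a row does not pick up a stray '\r', and a last item that really does end in '\r' keeps it.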

Tim Chase

Sep 4, 2013, 11:32:48 AM
to Skip Montanaro, Python
Unfortunately, the customer feed contains DOS newlines ("\r\n").

I'm not quite sure what """
Note
The reader is hard-coded to recognize either '\r' or '\n' as
end-of-line, and ignores lineterminator. This behavior may change in
the future.
""" means at [1]. Does that mean that efforts to change the
lineterminator don't have any effect? Or that you can't (currently)
specify anything other than "\r" or "\n"? Though that is a bit
tangential to the actual issue.

-tkc


[1] http://docs.python.org/2/library/csv.html
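
A quick experiment on a current Python points to the first reading of that note: the reader accepts a lineterminator argument but does not use it when splitting rows. The sketch below is one way to check this; the '|' terminator and the sample data are arbitrary choices for illustration, and the behaviour on 2.4 would need to be confirmed there:

```python
import csv

# One input line that contains '|' where a custom terminator would split it.
lines = ['a,b|c,d']

# The reader is asked to treat '|' as the line terminator...
rows = list(csv.reader(lines, lineterminator='|'))

# ...but the documented behaviour is that it ignores the setting, so the
# result should be a single row with '|' kept as ordinary field data.
```

The lineterminator setting still matters for csv.writer, which uses it to end each row it writes.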




Tim Chase

Sep 4, 2013, 11:41:17 AM
to pytho...@python.org
On 2013-09-04 16:31, MRAB wrote:
> You could try replacing the '\r' with another character that doesn't
> appear elsewhere and then change it back afterwards.
>
> MARKER = '\x01'
>
> def cr_to_marker(f):
>     for line in f:
>         yield line.replace('\r', MARKER)
>
> def marker_to_cr(item):
>     return item.replace(MARKER, '\r')
>
> f = file('out.txt', 'rb')
> r = csv.reader(cr_to_marker(f))
> for i, row in enumerate(r): # works in 2.7, fails in 2.4
>     row = [marker_to_cr(item) for item in row]
>     print repr(row)
> f.close()

This works pretty well. I'm not sure if there's a grave performance
penalty for mucking with strings so much, but at this point my
Care-o-Meter is barely registering, as long as it works.

> Which OS are you using? On Windows the lines (rows) end with
> '\r\n', so the last item of each row will end with '\r', which
> you'll need to strip off. (That would be a problem only if the last
> item of a row could end with '\r'.)

It's on Win32.

-tkc


Terry Reedy

Sep 4, 2013, 5:15:10 PM
to pytho...@python.org
On 9/4/2013 11:04 AM, Tim Chase wrote:
> I've got some old 2.4 code (requires an external lib that hasn't been
> upgraded) that needs to process a CSV file where some of the values
> contain \r characters. It appears that in more recent versions (just
> tested in 2.7; docs suggest this was changed in 2.5), Python does the
> Right Thing™ and just creates values in the row containing that \r.
> However, in 2.4, the csv module chokes on it with
>
> _csv.Error: newline inside string
>
> as demoed by the example code at the bottom of this email.

While probably not necessary for this problem, one can use more than one
Python version to solve a problem. For instance, you could use a current
version to read the data and transform it so that it can be piped to 2.4
code running in a subprocess.
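
A rough sketch of the first half of that idea, meant for a current Python 3 rather than 2.4: read the feed with the newer csv module and re-emit it with embedded carriage returns swapped for a placeholder that the 2.4 side can swap back after parsing. The name escape_cr, the MARKER value, and the sample data are assumptions for illustration; how the output reaches the 2.4 process (a pipe, a temporary file) is left open:

```python
import csv
import io

MARKER = '\x01'  # assumption: this character never appears in the real data


def escape_cr(src, dst):
    # Copy CSV rows from src to dst, replacing '\r' inside fields with MARKER
    # and writing plain '\n' row terminators.
    writer = csv.writer(dst, lineterminator='\n')
    for row in csv.reader(src):
        writer.writerow([field.replace('\r', MARKER) for field in row])


# In-memory stand-ins for real files; real files should be opened with
# newline='' so the csv module sees the original line endings.
src = io.StringIO('One,Two\r\n"First\rSecond",Third\r\n', newline='')
dst = io.StringIO(newline='')
escape_cr(src, dst)
escaped = dst.getvalue()
```

The 2.4 code would then parse the escaped text with its own csv.reader and replace MARKER with '\r' in each field.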

> What's the
> best way to deal with this? At the moment, I'm just using something
> like
>
> def unCR(f):
>     for line in f:
>         yield line.replace('\r', '')
>
> f = file('input.csv', 'rb')
> for row in csv.reader(unCR(f)):
>     code_to_process(row)
>
> but this throws away data that I'd really prefer to keep if possible.
>
> I know 2.4 isn't exactly popular, and in an ideal world, I'd just
> upgrade to a later 2.x version that does what I need. Any old-time
> 2.4 pythonistas have sage advice for me?
>
> -tkc
>
>
> from cStringIO import StringIO
> import csv
> f = file('out.txt', 'wb')
> w = csv.writer(f)
> w.writerow(["One", "Two"])
> w.writerow(["First\rSecond", "Third"])
> f.close()
>
> f = file('out.txt', 'rb')
> r = csv.reader(f)
> for i, row in enumerate(r): # works in 2.7, fails in 2.4
>     print repr(row)
> f.close()
>
>


--
Terry Jan Reedy

