pandas: line terminator for DOS file (ctrl-M)?


Samantha Zeitlin

May 15, 2014, 12:26:20 PM
to pystat...@googlegroups.com
Hi, 

I'm trying to import a wide table format "txt" file and parse the rows, e.g.

pandas.read_csv(filename, sep='\t', lineterminator='\r')

I also tried '\015', and it doesn't throw an error, but it still gives me back the whole thing as one big string (75 columns and 0 rows) with ^M scattered throughout. 

Am I missing something in the documentation? It looks like pandas.read_csv doesn't support this?

I can't tell if I have the mythical "C parser" or not, and I couldn't find documentation explaining what that is or if it's now built in to the version of pandas I'm using (0.13.1)? 
I pip installed Cython, but that didn't seem to fix anything - do I need to rebuild pandas?

Alternatively, I'm guessing I would need to use the universal line terminator mode 'rU', e.g.

o = open(mypath,'rU')
mydata = csv.reader(o)

Unfortunately, pandas.read_csv won't let me just enter

lineterminator='rU' 

because it's a string of length 2. :(

Any other tricks to get around this and shorten the journey to a dataframe?

Thanks, 

Sam

Paul Hobson

May 15, 2014, 1:43:01 PM
to pystat...@googlegroups.com
I've never had to specify the line terminator parameter on Windows. What happens when you don't specify it?
-p

Skipper Seabold

May 15, 2014, 1:46:38 PM
to pystat...@googlegroups.com
On Thu, May 15, 2014 at 1:43 PM, Paul Hobson <pmho...@gmail.com> wrote:
> I've never had to specify the line terminator parameter on Windows. What happens when you don't specify it?
> -p

Just a guess, but I'm wondering if there are mixed line endings in the file. The ^M characters show up in vim when this is the case.

Kyle Kastner

May 15, 2014, 1:53:29 PM
to pystat...@googlegroups.com
You could run dos2unix on it (if you are on a Linux PC). I think that will handle mixed line endings and convert them all to Unix style. unix2dos, not surprisingly, performs the opposite conversion.

Samantha Zeitlin

May 15, 2014, 2:31:52 PM
to pystat...@googlegroups.com
I'm on OS X, using bash, vim, and IPython. 

I could probably do some kind of trick to export it from vim, but this is something I expect to have to do frequently, so… 

Turns out it's not a DOS file: it has a <ff><fe> BOM at the top and comes out as a bunch of not-even-Unicode garbage if I just do vim -b. Might be a Mac OS 9 file (?).
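
For anyone trying to diagnose a similar file, here is a minimal sketch of checking the raw bytes for a BOM and counting line-ending styles. The helper name `sniff` and the sample bytes are made up for illustration; it only looks for UTF-8 and UTF-16 BOMs and ignores UTF-32, whose little-endian BOM also begins with <ff><fe>.

```python
import codecs

def sniff(raw):
    """Guess a BOM-based encoding and count line-ending styles in raw bytes.

    Hypothetical helper for illustration only; UTF-32 BOMs are not checked.
    """
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    # First matching BOM wins; None means no BOM was recognised.
    encoding = next((name for bom, name in boms if raw.startswith(bom)), None)
    # latin-1 maps every byte to a character, so it never fails as a fallback.
    text = raw.decode(encoding) if encoding else raw.decode("latin-1")
    crlf = text.count("\r\n")
    return {
        "encoding": encoding,
        "crlf": crlf,
        "cr": text.count("\r") - crlf,   # bare carriage returns (old Mac style)
        "lf": text.count("\n") - crlf,   # bare line feeds (Unix style)
    }

# Made-up sample: <ff><fe> BOM, tab-separated, bare '\r' line endings.
sample = codecs.BOM_UTF16_LE + u"a\tb\rc\td\r".encode("utf-16-le")
report = sniff(sample)
print(report)
```

On a real file you would pass in the bytes from `open(path, 'rb').read()`. A nonzero count in more than one of the three fields would point to the mixed line endings suggested earlier in the thread.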

Looks like I can do:

o=open('file.txt')
firstkb = o.read(1024).split('\r')

then I get 2 strings with null ('\00') characters in between all the letters. Yuck!

This sort of works (I get back an English string with tab symbols in the appropriate places):

firstkb[0].decode('UTF16')

but this doesn't:

firstkb[1].decode('UTF16')

UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 440: truncated data

presumably because the BOM is in the first string but not the second...
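
Besides the missing BOM, there is probably a second factor: in little-endian UTF-16 a carriage return is the two bytes '\r\x00', so splitting the raw bytes on '\r' cuts that pair in half and leaves every later chunk starting with a stray null byte. The sketch below uses made-up sample data (not the original file) to show the misalignment, and that decoding first and splitting afterwards avoids it.

```python
import codecs

# Made-up sample: two tab-separated rows with bare '\r' line endings,
# stored as little-endian UTF-16 with an <ff><fe> BOM.
text = u"col1\tcol2\rval1\tval2\r"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting the raw bytes on '\r' removes only the first byte of each '\r\x00' pair.
chunks = raw.split(b"\r")
first = chunks[0].decode("utf-16")        # BOM present and even length: decodes cleanly
leftover = chunks[1][:1]                  # the orphaned b'\x00' from the split '\r\x00'

try:
    chunks[1].decode("utf-16-le")         # odd number of bytes, so the codec gives up
    second_failed = False
except UnicodeDecodeError:
    second_failed = True

# Decoding the whole buffer first, then splitting the text, keeps everything aligned.
lines = raw.decode("utf-16").split("\r")
print(first, leftover, second_failed, lines)
```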

Looks like I should try codecs open…

It's really strange, though, that pandas does fine with the UTF16 decoding and the tab separation, it just doesn't break the rows on the line endings, whether I specify the line terminator or not. 

Sam

Samantha Zeitlin

May 15, 2014, 3:45:25 PM
to pystat...@googlegroups.com
fwiw, bc this was a PITA in case you ever run into it, here's what I ended up doing:

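import codecs
import pandas

# Let codecs decode the UTF-16 (BOM included) so pandas receives text, not raw bytes.
opened = codecs.open("filename.txt", 'rU', 'UTF16')

df = pandas.read_csv(opened, sep='\t')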

Sam
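
The snippet above is Python 2 era code, and the 'U' mode flag is no longer accepted by recent Python 3 releases. As a rough Python 3 sketch of the same idea, the built-in open() can do both the UTF-16 decoding and the universal-newline translation. The sample file written here is a made-up stand-in for the real data, and the standard csv module is used only so the example runs without pandas installed.

```python
import csv
import os
import tempfile

# Build a small stand-in file: UTF-16 with a BOM, tab-separated, bare '\r' line endings.
sample = u"name\tvalue\ralpha\t1\rbeta\t2\r"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(sample.encode("utf-16"))  # the utf-16 codec prepends a BOM

# encoding="utf-16" uses the BOM to pick the byte order and strips it;
# newline=None enables universal newlines, so '\r', '\r\n' and '\n' all become '\n'.
with open(path, encoding="utf-16", newline=None) as handle:
    rows = list(csv.reader(handle, delimiter="\t"))

os.remove(path)
print(rows)
```

An open text handle like this one could presumably be passed to pandas.read_csv(handle, sep='\t') in the same way as the codecs handle above, though that variant has not been tested here.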