CSV with variable number of columns


Daniel Cordeiro

Dec 2, 2014, 12:57:51 PM
to pyd...@googlegroups.com
Hi all,

I want to read (from sys.stdin) a bunch of lines in CSV format that may have different numbers of columns.
Some lines have 11 columns and some have 18. I only need the first 6 columns, so this should not be a problem. Unfortunately, I always get the error message:

ValueError: Expected 11 fields in line 776483, saw 18

Since the data comes from sys.stdin, I cannot make any assumptions about the number of columns or the order of the lines.
I find it odd that I get this error even when passing "usecols = range(0,6)". Should I report this as a bug?

Before I start to write my own parser, do you know if there is a way to use read_csv() in my context?
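
For reference, a hand-written fallback of the kind mentioned above could be quite small. This is only a sketch using the standard csv module. The helper name first_columns and the sample data are made up for illustration, and an io.StringIO object stands in for sys.stdin:

```python
import csv
import io


def first_columns(lines, n=6):
    """Yield the first n fields of each CSV row, ignoring any extra fields."""
    for row in csv.reader(lines):
        if row:  # csv.reader returns [] for blank lines; skip those
            yield row[:n]


# Stand-in for sys.stdin: one row with 11 fields, one with 18.
sample = io.StringIO(
    "1,2,3,4,5,6,7,8,9,10,11\n"
    "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r\n"
)
rows = list(first_columns(sample))
```

The resulting list of rows can be passed to pd.DataFrame(rows). All values stay strings here, so any type conversion would still have to be done afterwards.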


Thanks much,
Daniel

Ivan Ogasawara

Dec 2, 2014, 1:41:22 PM
to pyd...@googlegroups.com
Hi Daniel,

I think it is a bug, because it seems to work when the header argument is set to False or True.

I used an ugly trick to work around it (temporarily, until the bug is fixed). This is my example:

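import io
import pandas as pd

csv = u'''19,29,39
10,20,30,40,50
11,21,31
12,22,32,52,42
13,23,33'''

# read only the first three columns; the first row ends up as the column labels
df = pd.read_csv(io.StringIO(csv), header=False, usecols=list(range(3)))
# keep the values of that first row
f_line = df.keys().values
# relabel the columns as 0, 1, 2
df.rename(columns={k: i for i, k in enumerate(f_line)}, inplace=True)
# put the first row back on top of the data and renumber the index
df = pd.concat((pd.DataFrame(f_line).T, df))
df.reset_index(drop=True, inplace=True)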


Result:

    0   1   2
0  19  29  39
1  10  20  30
2  11  21  31
3  12  22  32
4  13  23  33



Maybe this can help you in the meantime.


My best regards




Daniel Cordeiro

Dec 2, 2014, 2:59:26 PM
to pyd...@googlegroups.com
Hi,

Thanks for your prompt response.

Your workaround works only for small inputs. If I try:

csv = '19,29,39\n'*1024 + '10,20,30,40\n'
df = pd.read_csv(io.StringIO(csv), engine='python', header=False, usecols=list(range(3)))

I get:

ValueError: Expected 3 fields in line 1025, saw 4

:-(

Just for info, I'm using pandas 0.15.0 (as distributed by Debian) on Python 3.4.2.


Thanks,
Daniel

Ivan Ogasawara

Dec 3, 2014, 1:13:55 PM
to pyd...@googlegroups.com
You're right, Daniel, I'm so sorry. If I find another solution, I will let you know.

You can open an issue at https://github.com/pydata/pandas/issues and report the error.

My best regards,

Ivan Ogasawara

Daniel Cordeiro

Dec 3, 2014, 2:32:23 PM
to pyd...@googlegroups.com
I just did: #8985
Thanks for your help.

Best regards,
Daniel

Dharhas Pothina

Dec 3, 2014, 4:09:30 PM
to pyd...@googlegroups.com

As a workaround: when pandas' read_csv doesn't do what I need (which was quite often in earlier versions), I have used numpy's genfromtxt function to read the CSV data into a numpy array and then converted it into a pandas DataFrame. genfromtxt has (or at least used to have) a larger set of flags for dealing with oddball CSV files than pandas.read_csv.
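
A minimal sketch of that approach, using the sample rows from earlier in the thread. It assumes a NumPy version whose genfromtxt accepts text streams, integer data, and that every row has at least the selected columns. The variable names are illustrative only:

```python
import io

import numpy as np
import pandas as pd

# Rows with 3 or 5 fields; only the first 3 columns are wanted.
raw = "19,29,39\n10,20,30,40,50\n11,21,31\n12,22,32,52,42\n13,23,33\n"

# usecols picks the wanted columns from each row, so extra trailing
# fields on the longer rows are ignored.
arr = np.genfromtxt(io.StringIO(raw), delimiter=",",
                    usecols=list(range(3)), dtype=int)

# Wrap the 2-D array in a DataFrame; columns are labelled 0, 1, 2.
df = pd.DataFrame(arr)
```

For real input, sys.stdin would take the place of the io.StringIO object, and usecols would be list(range(6)) for the first six columns.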

- dharhas

Daniel Cordeiro

Dec 8, 2014, 7:40:14 AM
to pyd...@googlegroups.com
Works like a charm.

Thanks much!