pandas: how to import data from a text file with a custom parser and missing values?


EOL

Nov 5, 2011, 5:24:00 PM
to pystatsmodels
In pandas, what would be the best way of importing data from a text
file through a custom parser that can yield missing values?

The goal is to read a relatively large (300 MB, 3 million line) file
line by line and to custom-parse each line in turn. Some of the lines
contain missing data, which should appear as such in pandas.

A custom parser is needed so as to convert strings contained in some
columns to integers. The type of the column can be set in advance, if
necessary.

Can this kind of input file be read easily in pandas? NumPy has
the quite powerful genfromtxt(), which can do that, but I did not
succeed in feeding pandas with the NumPy masked array returned by
genfromtxt() (missing integer values were converted to -1 instead
of appearing as NaN).

Any suggestion would be very much appreciated.

Wes McKinney

Nov 5, 2011, 5:40:45 PM
to pystat...@googlegroups.com

hi Eric,

have you tried out the parsing functions in pandas?

http://pandas.sourceforge.net/io.html

They take a chunksize parameter so you can process the file piece by
piece if you wish. Depends on what you mean by "custom parsing"-- if
you have a set of strings to be recognized as NA, you can pass those
to the parser. If read_csv or read_table can't handle your data, I'd
be interested to know exactly why so I can improve them.
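
In code, something like this (untested sketch -- 'data.txt', the NA
strings, and the chunk size are just placeholders):

import pandas as pd

# read the file in 100,000-row pieces; the listed strings become NaN
reader = pd.read_csv('data.txt', na_values=['NA', ''], chunksize=100000)
for chunk in reader:
    process(chunk)  # process() stands in for your own per-chunk logic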

Performance should be a lot better than np.genfromtxt, about 3-4x
better in many cases.

best,
Wes

EOL

Nov 6, 2011, 5:42:56 AM
to pystatsmodels
On Nov 5, 10:40 pm, Wes McKinney <wesmck...@gmail.com> wrote:
Thank you for your answer, Wes. I now understand that NaN values
are only available for float columns, so I stopped wanting an
integer column with NaN values (the missing values returned by
genfromtxt() were converted to -1 when put into integer DataFrame
columns, which was initially a problem for me).

However, the data still cannot be handled by read_csv or
read_table, because I first need to transform the values in some
of the cells; this can be illustrated with the following data:

1,3.14,HELLO
,2.718,BONJOUR

The custom parser for each line would convert the strings to integers
(HELLO->0, BONJOUR->1) before they are appended to the DataFrame being
built (so that later data access and row manipulation are faster).

I guess I can first go through numpy.loadtxt() and its custom
parser (the converters optional argument) and then create the
DataFrame from the resulting NumPy array, no? Would this double
the memory used, though, compared to a situation where
pandas.read_csv() had a converters optional argument?
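
Concretely, I am thinking of something like this (only a sketch;
the column handling and the NaN convention are my guesses):

import numpy as np
import pandas as pd

labels = {'HELLO': 0, 'BONJOUR': 1}
converters = {0: lambda s: float(s) if s else np.nan,  # empty cell -> NaN
              2: lambda s: labels[s]}                  # string -> integer code
arr = np.loadtxt('data.txt', delimiter=',', converters=converters)
df = pd.DataFrame(arr)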

Wes McKinney

Nov 7, 2011, 1:26:49 AM
to pystat...@googlegroups.com

Wouldn't you rather just apply post-processing to the DataFrame? For
example, you could easily do:

mapping = {'HELLO' : 0, 'BONJOUR' : 1}
df[col] = df[col].map(mapping)

Or mapping could be any function.
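
With your sample data, that would look roughly like this (untested;
header=None and the column names are assumptions about your file):

df = pd.read_csv('data.txt', header=None, names=['a', 'b', 'c'])
df['c'] = df['c'].map({'HELLO': 0, 'BONJOUR': 1})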

I do see the use case for a converters argument-- should be
straightforward to add:

https://github.com/wesm/pandas/issues/343

- Wes

EOL

Nov 7, 2011, 4:17:11 PM
to pystatsmodels
Great. I did not know that the map() method existed and that the
column types of a DataFrame could be changed like this.

I do see the case for a converters argument too, though. It would
be more time- and memory-efficient than the df[col] replacement
approach. It would also allow users to perform sanity checks
before, say, pandas decides to cast integers to floats because
their column contains empty cells, which could be a problem if
digits are lost, etc. :) So, a big upvote for a converters
argument for read_csv()!
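
For instance, one could then write something like this (purely
hypothetical, since the argument does not exist yet):

# hypothetical converters argument: column index -> callable
df = pd.read_csv('data.txt', header=None,
                 converters={2: {'HELLO': 0, 'BONJOUR': 1}.get})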

EOL


Wes McKinney

Nov 8, 2011, 10:46:14 AM
to pystat...@googlegroups.com