Using data_utils.py, reading csv files, for beginners guide


Vincent Davis

Apr 11, 2010, 11:55:40 PM
to pystat...@googlegroups.com
I am working on a StatsModels/Stata side-by-side guide. Actually, it might be a beginner's guide to which Stata or other software code could be added to make side-by-side examples.
I am assuming that the user knows little about Python and has maybe looked at the Python tutorial.
With that in mind, how would you recommend presenting importing data from a csv file? I see there is data_utils.py. Should I use this, or the Python or numpy csv reader?

Skipper Seabold

Apr 12, 2010, 12:05:32 AM
to pystat...@googlegroups.com
On Sun, Apr 11, 2010 at 11:55 PM, Vincent Davis
<vin...@vincentdavis.net> wrote:
> I am working on a StatsModels/Stata side by side. Actually it might be a beginners guide that Stata or other software code could be added to make side by side examples.
> I am assuming that the user knows little about python, and has maybe looked at the python tutorial.
> With that in mind how would you recommend presenting the importing data from a csv file. I see there is data_utils.py should I use this or the python or numpy csv reader.

data_utils is for making datasets that go into our datasets
collection, but I am probably going to rewrite this at some point in
the near future, so it shouldn't be of too much concern and is
definitely not for reading data.

You want to look at

numpy.genfromtxt (and related functions in npyio).

http://docs.scipy.org/numpy/docs/numpy.lib.npyio.genfromtxt/

You might also have a look at

scikits.statsmodels.lib.io.genfromdta

It will read relatively recent Stata binary .dta files into structured
ndarrays. Though just looking at the docs, they're pretty
non-existent. Note that it only works on Stata datasets for data
version >= 9 and might not be bug-free (hence it being hidden for
now). I won't have too much time to look at this soon, so you might
just want to remember it for later.

Skipper
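
A minimal sketch of the genfromtxt approach recommended above, for readers following along. The column names and values are made up for illustration, and an in-memory buffer stands in for a real file name such as "data.csv" (also hypothetical).

```python
import io
import numpy as np

# Hypothetical CSV content; in practice pass a file name instead.
csv_text = io.StringIO(
    "price,quantity\n"
    "1.5,10\n"
    "2.0,20\n"
    "3.25,30\n"
)

# names=True takes the column names from the first row.
# delimiter="," is needed because genfromtxt splits on whitespace by default.
data = np.genfromtxt(csv_text, delimiter=",", names=True)

print(data.dtype.names)  # column names read from the header row
print(data["price"])     # one column, converted to floats
```

The result is a structured array, so columns are selected by name rather than by position.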

Skipper Seabold

Apr 12, 2010, 12:08:17 AM
to pystat...@googlegroups.com

Vincent Davis

Apr 12, 2010, 12:10:02 AM
to pystat...@googlegroups.com

Vincent Davis

Apr 12, 2010, 12:53:19 AM
to pystat...@googlegroups.com
numpy.lib.npyio.loadtxt/
I am not able to import this. I actually don't have an npyio file or folder in my numpy install, and I am not sure why.

But I do have this:
numpy.lib.io.recfromcsv



Vincent Davis

Apr 12, 2010, 12:58:28 AM
to pystat...@googlegroups.com
On Sun, Apr 11, 2010 at 10:53 PM, Vincent Davis <vin...@vincentdavis.net> wrote:
> numpy.lib.npyio.loadtxt/
> I am not able to import this, I actually don't have a npyio file/folder in my numpy install. Not sure why this is
>
> But I do have this
> numpy.lib.io.recfromcsv

Dumb mistake, I was trying to import it from numpy.lib.npyio.
numpy.loadtxt works fine.

Skipper Seabold

Apr 12, 2010, 1:16:43 AM
to pystat...@googlegroups.com
On Mon, Apr 12, 2010 at 12:58 AM, Vincent Davis
<vin...@vincentdavis.net> wrote:
>
> On Sun, Apr 11, 2010 at 10:53 PM, Vincent Davis <vin...@vincentdavis.net> wrote:
>>>
>>> numpy.lib.npyio.loadtxt/
>>
>>  I am not able to import this, I actually don't have a npyio file/folder in my numpy install. Not sure why this is
>> But I do have this
>> numpy.lib.io.recfromcsv
>
> Dumb mistake, I was trying to import it from  numpy.lib.npyio
> numpy.loadtxt works fine
>

They recently changed numpy.lib.io to numpy.lib.npyio because of a
conflict with python's built-in io. Not a big fan of this change, but
so it goes...
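
One way to stay clear of the module rename discussed here is to go through the top-level numpy namespace, which is what worked for Vincent with numpy.loadtxt. A small sketch; the `readers` dictionary is only an illustrative name, not part of numpy.

```python
import numpy as np

# The text readers and writers are exposed on the top-level package,
# so code does not need to know whether the implementing module is
# called numpy.lib.io or numpy.lib.npyio.
readers = {
    name: getattr(np, name)
    for name in ("loadtxt", "genfromtxt", "savetxt")
}

for name, func in readers.items():
    print(name, callable(func))
```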

Mike

Apr 12, 2010, 9:10:34 AM
to pystatsmodels
Before I say anything, I'll point out I have never used Stata, and
came to python from an R/S-Plus background.

So, in python I tend to use csv very heavily. It is simple and
straightforward and clear what you are doing - you can load SOMETHING
fairly quickly - even if it's not in the right layout straight away!
You don't get the data into the pretty dataset form you might need,
but I find it much easier to think about doing that once I am in
python. I often find with R that it plain refuses to load a file for
one reason or another, and it can be a real pain to debug.

Also the advantage csv has is that it comes with python (or numpy) so
the user will already have it installed. Since dependencies can be
some of the most confusing aspects of getting a new piece of kit
working, I think having a dependency that is only needed for an
example would be a little irritating.

By all means support and demo the complicated fancy methods, but I
think we should also include the basic simple ones - even if they take
more lines of code - so that it's clear what is going on.

There will be a lot of people coming to this module (like me) who have
large libraries of existing code, and just need it for a couple of
specific algos - so they will already have the data loaded from their
databases etc how they want it and won't use any of these loaders.

Just my two penneth; feel free to ignore.
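
A minimal sketch of the plain csv-module route described above, using only the standard library. The data and column names are invented for illustration, and the in-memory buffer stands in for an opened file.

```python
import csv
import io

# Hypothetical CSV content; with a real file, use open(path, newline="").
csv_text = io.StringIO(
    "name,score\n"
    "alice,3.5\n"
    "bob,4.0\n"
)

reader = csv.reader(csv_text)
header = next(reader)            # first row holds the column names
rows = [row for row in reader]   # every value is still a string here

# Convert the numeric column by hand once the raw rows are loaded.
score_index = header.index("score")
scores = [float(row[score_index]) for row in rows]

print(header)
print(rows)
print(scores)
```

This takes more lines than a one-call loader, but each step is visible and nothing beyond Python itself is required.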


josef...@gmail.com

Apr 12, 2010, 10:08:20 AM
to pystat...@googlegroups.com
On Mon, Apr 12, 2010 at 9:10 AM, Mike <m.j.a...@googlemail.com> wrote:
> Before I say anything, I'll point out I have never used Stata, and
> came to python from an R/S-Plus background.
>
> So, in python I tend to use csv very heavily.  It is simple and
> straightforward and clear what you are doing - you can load SOMETHING
> fairly quickly - even if its not in the right layout straight away!
> You don't get the data into the pretty dataset form you might need,
> but I find it much easier to think about doing that once I am in
> python.  I often find with R that it plain refuses to load a file for
> one reason or another, and it can be a real pain to debug.
>
> Also the advantage csv has is that is comes with python (or numpy) so
> the user will already have it installed.  Since dependencies can be
> some of the most confusing aspects of getting a new piece of kit
> working, I think having a dependency that is only needed for an
> example would be little irritating.

I used the python csv module a lot in the past, but Pierre, with the help
of Skipper and Bruce, improved genfromtxt a lot last year. And it
only requires numpy. There are still a few rough edges, but in
contrast to the csv module, it is very nice to have automatic type
conversion, from string to numbers with nan handling. And it works
well for clean csv files.

The advantage of using (a unicode enabled) csv module is that it is
much more flexible to handle "weird" csv files, e.g. I have some with
different non-ASCII characters for various missing value codes and any
automatic conversion barfs (or it might be possible but with more
effort than just using plain python).

Another small problem with genfromtxt examples is that they may
require numpy 1.4 and might not work with numpy 1.3.

my 2 cents (Canadian)

Josef
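
A small sketch of the automatic conversion and nan handling mentioned above, assuming a clean file whose only oddities are an empty field and an "NA" code. The data and the "NA" marker are invented for illustration.

```python
import io
import numpy as np

# Hypothetical data with one empty field and one "NA" missing-value code.
csv_text = io.StringIO(
    "x,y\n"
    "1.0,2.0\n"
    ",4.0\n"
    "5.0,NA\n"
)

# Strings are converted to floats; fields that are empty or match
# missing_values end up as nan instead of raising an error.
data = np.genfromtxt(csv_text, delimiter=",", names=True, missing_values="NA")

print(data["x"])
print(data["y"])
print(np.nanmean(data["x"]))  # mean of x that skips the nan entry
```

With the csv module the same file would load as strings, and the missing-value codes would have to be translated by hand.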



Vincent Davis

Apr 12, 2010, 10:53:02 AM
to pystat...@googlegroups.com
Josef wrote: "it is very nice to have automatic type
conversion, from string to numbers with nan handling."

I agree. I have not used these, but genfromtxt, loadtxt, and recfromcsv seem really nice.

Maybe I missed it, but is there documentation that points out that you can specify the delimiter in recfromcsv?
There is also recfromtxt, but its docs are the same as for recfromcsv (I think), which is a little strange.

Skipper Seabold

Apr 12, 2010, 11:01:12 AM
to pystat...@googlegroups.com
On Mon, Apr 12, 2010 at 10:53 AM, Vincent Davis
<vin...@vincentdavis.net> wrote:
>>
>> Josef  "it is very nice to have automatic type
>>
>> conversion, from string to numbers with nan handling."
>
>
> I agree, I have not used these but The genfromtxt and loadtxt the recfromcsv seems really nice.
> Maybe I missed it but is the documentation that points out that you can specify the delimiter in recfromcsv?
> There is also recfromtxt but The docs are the same as for recfromcsv  (I think) Which is a little strange.
>

Almost all of these (except loadtxt, I think, without checking) just
call genfromtxt but specify different defaults, so I mainly use
genfromtxt and fiddle with the arguments myself. So recfromcsv
defaults to a comma as the delimiter, though you can change it, I
think. There is also recfromtxt, which splits on whitespace (the
default in genfromtxt).

So basically, just see genfromtxt for the arguments you can specify
except in the case of loadtxt, which can't handle names and missing
data, I don't think, so is a bit different but more "lightweight".

Also note that savetxt doesn't let you specify names (there is a
ticket filed for this, but I can't ever get anyone to commit the one
line change), so I added a savetxt to our scikits.statsmodels.lib.io

Skipper
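
A rough sketch of the differences described above: loadtxt for plain numeric data, genfromtxt for named columns, and writing a header row. The `header` and `comments` arguments of numpy.savetxt were added in NumPy releases newer than the ones discussed in this thread, so this assumes a reasonably current NumPy. The array contents and column names are made up.

```python
import io
import numpy as np

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# Write with a header row; comments="" stops savetxt from
# prefixing the header with "# ".
buf = io.StringIO()
np.savetxt(buf, arr, delimiter=",", header="a,b", comments="", fmt="%.1f")
text = buf.getvalue()

# loadtxt handles plain numeric data, so the header line is skipped.
plain = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)

# genfromtxt can pick up the column names from the header row.
named = np.genfromtxt(io.StringIO(text), delimiter=",", names=True)

print(text)
print(plain)
print(named.dtype.names)
```

Passing delimiter="," and names=True to genfromtxt is roughly what the recfromcsv defaults amount to, which is why fiddling with genfromtxt directly covers most cases.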

josef...@gmail.com

Apr 12, 2010, 11:09:27 AM
to pystat...@googlegroups.com

I didn't know about the statsmodels savetxt. A summary of IO will be
very useful, I always struggle with this.

Thanks,

Josef



Skipper Seabold

Apr 12, 2010, 11:14:05 AM
to pystat...@googlegroups.com

Yeah, I got sick of rewriting it every time I wanted to save a csv
with a header row. It's in my revision 2003 (really 2004 with a bug
fix). Note that I used the slightly older version of numpy.savetxt,
since savetxt has been rewritten for the Python 3 transition and it
requires some compatibility functions that are only available in newer
versions of numpy and I didn't want to force people to be on the
bleeding edge.

Skipper

Bruce Southey

Apr 12, 2010, 11:59:02 AM
to pystat...@googlegroups.com
It was a little more than that!
Basically it broke the 2to3 tool so there had to be some solution.

However, I do not recall any complaints about that change.

Bruce

Skipper Seabold

Apr 12, 2010, 12:25:07 PM
to pystat...@googlegroups.com

IIUC, Isn't that why relative imports were introduced?

http://docs.python.org/whatsnew/2.5.html#pep-328

> However, I do not recall any complaints about that change.
>

Someone suggested changing to relative imports IIRC, but not very loudly.

Skipper

Bruce Southey

Apr 12, 2010, 12:56:04 PM
to pystat...@googlegroups.com
On 04/12/2010 11:25 AM, Skipper Seabold wrote:
> On Mon, Apr 12, 2010 at 11:59 AM, Bruce Southey <bsou...@gmail.com> wrote:
>> On 04/12/2010 12:16 AM, Skipper Seabold wrote:
>>> They recently changed numpy.lib.io to numpy.lib.npyio because of a
>>> conflict with python's built-in io. Not a big fan of this change, but
>>> so it goes...
>>
>> It was a little more than that!
>> Basically it broke the 2to3 tool so there had to be some solution.
>
> IIUC, Isn't that why relative imports were introduced?
>
> http://docs.python.org/whatsnew/2.5.html#pep-328

As Robert Kern showed, part of the issue was a 2to3 bug.

>> However, I do not recall any complaints about that change.
>
> Someone suggested changing to relative imports IIRC, but not very loudly.

I do not know if that would have even solved the problem. Also I think
at least Chuck does not like having modules with the same name across
projects. I know in R it causes some issues of shadowing of functions
that can be a problem.

Bruce

Skipper Seabold

Apr 12, 2010, 1:04:37 PM
to pystat...@googlegroups.com
On Mon, Apr 12, 2010 at 12:56 PM, Bruce Southey <bsou...@gmail.com> wrote:
>>> It was a little more than that!
>>> Basically it broke the 2to3 tool so there had to be some solution.
>>>
>>>
>>
>> IIUC, Isn't that why relative imports were introduced?
>>
>> http://docs.python.org/whatsnew/2.5.html#pep-328
>>
>>
>
> As Robert Kern showed, part of the issue was a 2to3 bug.
>

Yeah you're right. Oh well.

Skipper
