datasets: obsolete modules? and other thoughts


josef...@gmail.com

Apr 1, 2011, 6:25:54 PM
to pystat...@googlegroups.com
Skipper,

As far as I can see, the old dataset modules, e.g. grunfeld.py,
randhi.py, and so on, are not used in data.py anymore.

Are they still used for anything?

Also, I find the long, full license text a bit annoying; can we just
mention it and refer to a central location? How much is there actually
left of David's original dataset proposal?

The convert function writes out all the data as text, which makes 2to3
conversion very slow.
For some tests that I had written using matlab for comparison, I had
automatically created module files that created np.arrays (with
numbers) directly. The script file is not in the repository, but if
there is interest we could add a function "dump2module" that
saves some numpy arrays to a module file, for quickly dumping
test_result files.

For example, we could dump
"scikits\statsmodels\tsa\var\tests\results\vars_results.npz" to a
module, since the pickled arrays in the npz cannot be loaded with
python 3.2.
I think "scikits\statsmodels\iolib\tests\results\macrodata.npy" didn't
have any problems in python 3.2, if I remember correctly.
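
As a rough illustration of the "dump2module" idea (just a sketch of one
way it could look; the signature, the repr-based formatting, and the
usage lines below are not an existing function):

import numpy as np

def dump2module(filename, **arrays):
    """Write each array as an np.array(...) literal in a plain-text module."""
    with open(filename, "w") as fh:
        fh.write("import numpy as np\n\n")
        for name, arr in sorted(arrays.items()):
            # repr of the nested list gives a literal that imports unchanged
            # on python 2 and python 3, with no pickling involved
            fh.write("%s = np.array(%r)\n\n" % (name, np.asarray(arr).tolist()))

# e.g. run this under python 2 (where the npz still loads) and then import
# the generated module under python 3.2:
# res = np.load(r"scikits\statsmodels\tsa\var\tests\results\vars_results.npz")
# dump2module("vars_results_data.py", **dict((k, res[k]) for k in res.files))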

Just some thoughts from the python 3.2 porting, datasets is all your work.

Cheers and a happy April 1st :)

Josef

Skipper Seabold

Apr 1, 2011, 7:01:38 PM
to pystat...@googlegroups.com
On Fri, Apr 1, 2011 at 6:25 PM, <josef...@gmail.com> wrote:
> Skipper,
>
> As far as I can see, the old dataset modules, e.g. grunfeld.py,
> randhi.py, and so on, are not used in data.py anymore.
>
> Are they still used for anything?
>

No, I think they can be removed. They're vestiges from before the
refactor to the current usage.

Looking more, hmm, I think I need to change some of the current
documentation to reflect that this isn't how we do it anymore.

> Also, I find the long, full license text a bit annoying; can we just
> mention it and refer to a central location? How much is there actually
> left of David's original dataset proposal?

Sure. I don't really know what best practice is here for "must retain
this notice," or if it even really applies to the datasets as I've
been implementing them. I think moving it to the README or LICENSE
would probably work fine. We'd just need to remove it from all the
data.py files and from the template.

I've definitely changed it a bit to fit our needs:

http://statsmodels.sourceforge.net/devel/dataset_proposal.html#dataset-proposal
http://www.scipy.org/scipy/scikits/browser/trunk/learn/scikits/learn/datasets/DATASET_PROPOSAL.txt

>
> The convert function writes out all the data as text, which makes 2to3
> conversion very slow.

Right, and I'm not even using this anymore. I'm just importing the
.csv file directly in data.py.
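
Roughly along these lines (a minimal sketch with a placeholder filename
and column handling; not the actual data.py code):

import os
import numpy as np

def _load_raw_csv():
    # the csv lives next to data.py in the dataset's directory
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "somedata.csv")
    # open in binary mode and let genfromtxt infer dtypes and column
    # names from the header row
    return np.recfromtxt(open(path, "rb"), delimiter=",", names=True, dtype=None)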

> For some tests that I had written using matlab for comparison, I had
> automatically created module files that created np.arrays (with
> numbers) directly. The script file is not in the repository, but if
> there is interest we could add a function "dump2module" that
> saves some numpy arrays to a module file, for quickly dumping
> test_result files.

Sure. I've been editing a lot of the test results stuff by hand, and I
shudder to think about it. It was more of a pain, but my vim-fu has
gotten better...

>
> For example, we could dump
> "scikits\statsmodels\tsa\var\tests\results\vars_results.npz" to a
> module, since the pickled arrays in the npz cannot be loaded with
> python 3.2.
> I think "scikits\statsmodels\iolib\tests\results\macrodata.npy" didn't
> have any problems in python 3.2, if I remember correctly.
>
> Just some thoughts from the python 3.2 porting, datasets is all your work.

But just look how easy it is to add data now...

http://statsmodels.sourceforge.net/devel/datasets_developer.html#add-data

So basically, I'll remove those few fname.py files, update the docs
(mainly the proposal) to reflect the current state of the code, and
move the notice to a single README or LICENSE in the datasets
folder.

Skipper

Skipper Seabold

Apr 3, 2011, 1:43:26 PM
to pystat...@googlegroups.com
On Fri, Apr 1, 2011 at 6:25 PM, <josef...@gmail.com> wrote:
>
> Skipper,
>
> As far as I can see, the old dataset modules, e.g. grunfeld.py,
> randhi.py, and so on, are not used in data.py anymore.
>

<snip>

> The convert function writes out all the data as text, which makes 2to3
> conversion very slow.

See if the latest commit in devel makes 2to3 better.

Skipper

josef...@gmail.com

Apr 7, 2011, 8:41:49 PM
to pystat...@googlegroups.com

Thanks Skipper, I forgot to finish my reply yesterday.

You've confirmed pretty much how I saw the refactoring of the datasets.
Using csv files looks much cleaner, but it makes numpy 1.3
compatibility pretty difficult.

I will try 2to3 again tomorrow, but I expect it will be considerably
faster without the data modules.

The only thing that I haven't merged yet from my python 3.2 version of
statsmodels is replacing all filenames passed to genfromtxt (and family)
with open file handles. As the discussion on the numpy list has shown,
we can then just use "rb" as long as we can ignore old Mac "\r" line
endings. I think we should use file handles instead of filenames for
now so that we don't need numpy 1.6 as a requirement.
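
Something like this minimal sketch (the filename is just a placeholder,
and it assumes the data files don't use old Mac "\r" line endings):

import numpy as np

# instead of passing a filename,
#     data = np.genfromtxt("somedata.csv", delimiter=",", names=True)
# pass an already opened byte-mode handle; the same call then works on
# python 2 and python 3.2 without requiring numpy 1.6
fh = open("somedata.csv", "rb")
data = np.genfromtxt(fh, delimiter=",", names=True)
fh.close()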

Josef

>
> Skipper
>

josef...@gmail.com

Apr 7, 2011, 8:42:23 PM
to pystat...@googlegroups.com
On Thu, Apr 7, 2011 at 8:41 PM, <josef...@gmail.com> wrote:
> On Sun, Apr 3, 2011 at 1:43 PM, Skipper Seabold <jsse...@gmail.com> wrote:
>> On Fri, Apr 1, 2011 at 6:25 PM, <josef...@gmail.com> wrote:
>>>
>>> Skipper,
>>>
>>> As far as I can see, the old dataset modules, e.g. grunfeld.py,
>>> randhi.py, and so on, are not used in data.py anymore.
>>>
>>
>> <snip>
>>
>>> The convert function writes out all the data as text, which makes 2to3
>>> conversion very slow.
>>
>> See if the latest commit in devel makes 2to3 better.
>
> Thanks Skipper, I forgot to finish my reply yesterday.
Not yesterday, last weekend.