structured array to larry conversion


Kyle Foreman

Jul 6, 2010, 11:18:26 PM
to labeled-array
I've got a bunch of Stata files that I want to turn into larrys, but
I'm bumping into a few problems. The genfromdta() in scikits.statsmodels
returns structured arrays, and I haven't yet found a way to turn those
into fully functional larrys. #5 below gets pretty close, but not
quite.... Any suggestions? There's probably something really simple
I'm overlooking. As a last resort I'll alter the genfromdta()
function, but I'm certain there's a better way.



import numpy as np
from la import larry

mystarry = np.zeros((2,),dtype=('i4,f4,a10'))
mystarry[:] = [(1,2.,'Hello'),(2,3.,"World")]

mylarry1 = larry(mystarry)
# no labels on axis 1, correct data types

mylarry2 = larry(mystarry.tolist(), label=[range(len(mystarry)),
list(mystarry.dtype.names)])
# correct labels, but lost the datatypes

mylarry3 = larry(mystarry.tolist(), label=[range(len(mystarry)),
list(mystarry.dtype.names)], dtype=mystarry.dtype)
# ValueError: Exactly one label per dimension needed

mylarry4 = larry.fromlist(mystarry.tolist())
# TypeError: zip argument #1 must support iteration

mylarry5 = larry(mystarry.tolist(), dtype=mystarry.dtype)
mylarry5.label
# only one dimension of labels
mylarry5.label = [mylarry5.label[0], list(mystarry.dtype.names)]
mylarry5[0,['f2']]
# ValueError: setting an array element with a sequence.
mylarry5[0]['f2']
# sort of works, but not in the normal larry way...
mylarry5[:]['f2']
# ValueError: Exactly one label per dimension needed
# can't find any way to grab a column

Vincent Davis

Jul 6, 2010, 11:35:09 PM
to labele...@googlegroups.com

You will need to convert your structured array into an ndarray,
probably of floats, i.e. with no string values.
I have some code you might find useful for converting strings to
categorical values. It is very beta but should work:
http://bazaar.launchpad.net/~vincent-vincentdavis/statsmodels/Descriptive-Stats/annotate/head%3A/scikits/statsmodels/sandbox/categorical.py
There is also a function for this kind of categorizing in statsmodels
(I forget where). One way or another you will need to keep track of the
conversion from strings to numbers.

Let me know if this helps
Vincent

Keith Goodman

Jul 7, 2010, 12:05:49 AM
to labele...@googlegroups.com

Vincent is the expert on this stuff. I'll just repeat the part about
larry storing the data in an ndarray, which means the data type must be
the same for each element in a larry.

One of your examples:

>> mylarry1 = larry(mystarry)
>> #       no labels on axis 1, correct data types

I think that gives the right result, even though it is not the result
you want. Note that mystarry is 1d:

mystarry.shape --> (2,)
mystarry.ndim --> 1

That's why mylarry1 only has labels for one dimension. But if you
could convert it to a 2x3 ndarray where each element has the same data
type then you'd get what you wanted. Can that be done with your data?
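The point about dimensionality can be checked with NumPy alone. This is only a sketch: it rebuilds the array from the original post (using 'S10', the current spelling of 'a10', and byte strings), and numeric_names and data2d are names made up for the illustration. It stacks just the numeric fields, since the string field cannot share a float dtype.

```python
import numpy as np

# Same structured array as in the original post.
mystarry = np.zeros((2,), dtype="i4,f4,S10")
mystarry[:] = [(1, 2.0, b"Hello"), (2, 3.0, b"World")]

print(mystarry.shape)        # a single axis: each element is a whole record
print(mystarry.ndim)
print(mystarry.dtype.names)  # field names, not a second axis

# Copy the numeric fields into one homogeneous 2-d float array.
numeric_names = ["f0", "f1"]  # hypothetical choice: skip the string field
data2d = np.column_stack(
    [mystarry[name].astype(float) for name in numeric_names]
)
print(data2d.shape)
```

An array like data2d has one axis per label list, so labels such as [list(range(len(mystarry))), numeric_names] would line up with it in the way the larry(...) calls in the original post intended.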

Vincent Davis

Jul 7, 2010, 12:34:28 AM
to labele...@googlegroups.com

I had an 18-month-old helping on the last email so I had to be brief :-)

@Kyle, I struggle with data of mixed types often. Structured arrays
are ok but in the end you are often forced into needing an ndarray.
This is where you need some method of keeping track of the conversion
from string values (you might also have numerical values that you need
to recode) to numerical ones. And then if you want to have dummy
variables you need to keep track of those too. I have been working on
and thinking about ways to do this similar to Stata, e.g. in Stata you
can encode() and then in a regression you can use i.x, if x is a
categorical variable, to get the dummy variables automatically, even
labeled with their string representation. I am up for helping out or
getting feedback on this.

I have some examples I was experimenting with where the label for an
individual column in a larry is actually a dict() (it can be done other
ways). I was doing this to allow the conversion from string-type
values to numerical ones and still be able to use the dict to "translate"
the numbers. This would be similar to encode() in Stata, where Stata
takes the string values, "encodes" them, and applies labels to make
it easy to do other things without having to remember what the encoding
was.
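
A rough sketch of that encode-with-a-dict idea in plain Python follows. It is not the code from the linked sandbox module or anything in larry; the function name encode and the sample country codes are made up for the illustration.

```python
def encode(values):
    """Map each distinct string to an integer code.

    Returns the list of codes plus a dict that translates codes back
    to the original strings.
    """
    code_to_label = dict(enumerate(sorted(set(values))))
    label_to_code = {label: code for code, label in code_to_label.items()}
    codes = [label_to_code[value] for value in values]
    return codes, code_to_label


countries = ["USA", "GBR", "USA", "FRA"]  # hypothetical sample data
codes, code_to_label = encode(countries)

# The dict "translates" the numbers back to their string labels.
decoded = [code_to_label[code] for code in codes]
print(codes)
print(decoded)
```

Keeping code_to_label next to the numeric data is what makes it possible to print, sort, or select by the string value later on.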

@Keith, essentially this is labeling the data points in the
columns/rows of an array. What do you think about this? It would need
to be a dict() but then data could be printed, sorted, selected based
on the string value rather than the numerical value. I would be very
interested in this and would be able to help provide tools for
converting from strings to numerical values and building the
dictionary. I have been meaning to spend time but need some direction
from you as to how you would like it to fit in.

Vincent

Kyle Foreman

Jul 7, 2010, 12:36:07 AM
to labeled-array
No wonder I was confused, I kept going through thinking that the
mystarry (which I grabbed from the numpy docs) was 2d! Not used to np
arrays yet, I guess. Oops...

I have an array with multiple data types right now (i.e. year as int,
gdp per capita as float, country code as iso3), but I could definitely
convert the strings to numbers as you suggested, Vincent. I guess I
would lose some efficiency by having to store my integers and the
integer-coded strings as floats, but should be worth it for the
convenience I suppose. Thanks!

Kyle Foreman

Jul 7, 2010, 12:47:33 AM
to labeled-array
Shortly after I get all this data in I'll start doing robust linear
regression (using RLM in statsmodels) so I will quickly run into the
same dummy variables problem, Vincent. I had so far only come up with
using set() on the np.array and then looping through those values to
produce the dummies, but I haven't gotten to the point where I've put
any more thought into it. I'll definitely keep you updated. Thanks
again, I'll tackle the larry thing again in the morning.
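
For reference, the set()-and-loop approach described above could look roughly like this with NumPy. It is a sketch with made-up sample data and variable names, not the statsmodels implementation.

```python
import numpy as np

# Hypothetical categorical column (e.g. iso3 country codes).
codes = np.array(["USA", "GBR", "USA", "FRA"])

# Distinct values, sorted so the column order is reproducible.
levels = sorted(set(codes.tolist()))

# One 0/1 column per level: 1.0 where the row matches that level.
dummies = np.column_stack(
    [(codes == level).astype(float) for level in levels]
)

print(levels)
print(dummies)
```

In a regression that already includes a constant, one of these columns is normally dropped so the dummies are not collinear with the intercept.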


Skipper Seabold

Jul 7, 2010, 12:50:55 AM
to labele...@googlegroups.com
On Wed, Jul 7, 2010 at 12:47 AM, Kyle Foreman <kylef...@gmail.com> wrote:
> Shortly after I get all this data in I'll start doing robust linear
> regression (using RLM in statsmodels) so I will quickly run into the
> same dummy variables problem, Vincent. I had so far only come up with
> using set() on the np.array and then looping through those values to
> produce the dummies, but I haven't gotten to the point where I've put
> anymore thought into it. I'll definitely keep you updated. Thanks
> again, I'll tackle the larry thing again in the morning.
>

If you are dealing with strings (or integers) that are a categorical
variable, there is scikits.statsmodels.tools.categorical, which will make
dummy variables and take care of the names. It's well-tested and
optionally returns a column to map values to the column number for the
dummy variable or a structured array. For your example,

In [6]: import scikits.statsmodels as sm

In [7]: mystarry = sm.tools.categorical(mystarry, col=2, drop=True)

In [8]: mystarry
Out[8]:
array([(1, 2.0, 1.0, 0.0), (2, 3.0, 0.0, 1.0)],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2_Hello', '<f8'),
             ('f2_World', '<f8')])

To view it as a float array (for larry)

In [9]: mystarry.view((float,3))
Out[9]:
array([[  2.,   1.,   0.],
       [ 32.,   0.,   1.]])


Where 3 is the number of columns. There are other ways to use this
and it is pretty well documented. Let me know if I can improve it or
if anything is unclear. To handle the ints though, you have to make a
copy as far as I know.
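
The remark about needing a copy for the ints can be illustrated as follows. A view reinterprets the raw bytes of each record, so the 4-byte int and 4-byte float fields end up read together as one 8-byte float, which appears to be why the first column of Out[9] shows 2. and 32. rather than the original values. The sketch below rebuilds an array shaped like Out[8] by hand (no statsmodels call) and copies each field into a float array; starry and as_float are illustrative names.

```python
import numpy as np

# An array with the same fields and values as Out[8], built by hand.
dt = [("f0", "<i4"), ("f1", "<f4"), ("f2_Hello", "<f8"), ("f2_World", "<f8")]
starry = np.array([(1, 2.0, 1.0, 0.0), (2, 3.0, 0.0, 1.0)], dtype=dt)

# Copy field by field, converting each one to float64 on the way.
as_float = np.column_stack(
    [starry[name].astype(np.float64) for name in starry.dtype.names]
)

print(as_float.shape)
print(as_float)
```

This costs a copy but keeps every column's value intact. NumPy releases much newer than the ones used in this thread also provide numpy.lib.recfunctions.structured_to_unstructured for the same kind of conversion.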

Skipper

Skipper Seabold

Jul 7, 2010, 12:53:15 AM
to labele...@googlegroups.com
On Wed, Jul 7, 2010 at 12:36 AM, Kyle Foreman <kylef...@gmail.com> wrote:
> No wonder I was confused, I kept going through thinking that the
> mystarry (which I grabbed from the numpy docs) was 2d! Not used to np
> arrays yet, I guess. Oops...

Structured arrays are all 1d where each "row" is an object that holds
a number of items of (possibly) different types.

>
> I have an array with multiple data types right now (i.e. year as int,
> gdp per capita as float, country code as iso3), but I could definitely
> convert the strings to numbers as you suggested, Vincent. I guess I
> would lose some efficiency by having to store my integers and the
> integer-coded strings as floats, but should be worth it for the
> convenience I suppose. Thanks!
>

We are working on doing this efficiently behind the scenes, but it's
not there yet.

Skipper

Vincent Davis

Jul 7, 2010, 1:01:00 AM
to labele...@googlegroups.com
On Tue, Jul 6, 2010 at 10:47 PM, Kyle Foreman <kylef...@gmail.com> wrote:
> Shortly after I get all this data in I'll start doing robust linear
> regression (using RLM in statsmodels) so I will quickly run into the
> same dummy variables problem, Vincent. I had so far only come up with
> using set() on the np.array and then looping through those values to
> produce the dummies, but I haven't gotten to the point where I've put
> anymore thought into it. I'll definitely keep you updated. Thanks
> again, I'll tackle the larry thing again in the morning.

If you are planning to use RLM in statsmodels then I would use
scikits.statsmodels.tools.categorical as Skipper suggests as it is well
tested for use with statsmodels. I am working on the printing of summary
results for RLM and GLM and might have a draft done this week, I am
nearly done with the GLM summary. Note that I am only talking about
printing the results the values.

Vincent
