You will need to convert your structured array into an ndarray,
probably of floats, i.e. with no string values.
I have some code you might find useful for converting strings to
categorical values. It is very beta but should work:
http://bazaar.launchpad.net/~vincent-vincentdavis/statsmodels/Descriptive-Stats/annotate/head%3A/scikits/statsmodels/sandbox/categorical.py
There is also a function for this kind of categorizing in statsmodels
(I forget where). One way or another you will need to keep track of the
conversion from strings to numbers.
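
A minimal sketch of one way to keep track of that conversion, using plain NumPy rather than either of the modules mentioned above. The country codes and all variable names are made up for illustration.

```python
import numpy as np

# Hypothetical string column, e.g. country codes
countries = np.array(["USA", "CAN", "USA", "MEX"])

# labels: sorted unique strings; codes: position of each value in labels
labels, codes = np.unique(countries, return_inverse=True)

# Keep both directions of the mapping so the numbers can be translated back
str_to_code = {str(label): i for i, label in enumerate(labels)}
code_to_str = {i: str(label) for i, label in enumerate(labels)}

# Float codes can sit next to other float columns in a plain ndarray
as_float = codes.astype(float)
decoded = [code_to_str[int(c)] for c in as_float]
```

The two dicts are the part that has to travel with the array. Without them the float codes cannot be turned back into strings.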
Let me know if this helps
Vincent
Vincent is the expert on this stuff. I'll just repeat the part about
larry storing the data in an ndarray, which means the data type must be
the same for each element in a larry.
One of your examples:
>> mylarry1 = larry(mystarry)
>> # no labels on axis 1, correct data types
I think that gives the right result, even though it is not the result
you want. Note that mystarry is 1d:
mystarry.shape --> (2,)
mystarry.ndim --> 1
That's why mylarry1 only has labels for one dimension. But if you
could convert it to a 2x3 ndarray where each element has the same data
type then you'd get what you wanted. Can that be done with your data?
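
A rough sketch of that conversion with NumPy only. The array below is a guess at what mystarry looks like, based on the field names and values that appear later in this thread. The integer recoding of the string field is just one possible choice.

```python
import numpy as np

# Hypothetical stand-in for mystarry: two records, three fields of mixed type
mystarry = np.array([(1, 2.0, "Hello"), (2, 3.0, "World")],
                    dtype=[("f0", "<i4"), ("f1", "<f4"), ("f2", "U5")])

print(mystarry.shape)  # (2,)
print(mystarry.ndim)   # 1

# Recode the string field as integers (codes follow the sorted order of the strings)
labels, codes = np.unique(mystarry["f2"], return_inverse=True)

# Stack the fields as float columns to get a homogeneous 2x3 ndarray
arr = np.column_stack([mystarry["f0"].astype(float),
                       mystarry["f1"].astype(float),
                       codes.astype(float)])
```

An array like arr has two dimensions and a single data type, so it should be possible to pass it to larry in place of mystarry and get labels on both axes.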
I had an 18-month-old helping on the last email so I had to be brief :-)
@Kyle, I often struggle with data of mixed types. Structured arrays
are OK, but in the end you are often forced into needing an ndarray.
This is where you need some method of keeping track of the conversion
from string values to numerical ones (you might also have numerical
values that you need to recode). Then, if you want dummy variables,
you need to keep track of those as well. I have been working on and
thinking about ways to do this similar to Stata, e.g. in Stata you can
encode() and then in a regression you can use i.x, if x is a
categorical variable, to get the dummy variables automatically, even
labeled with their string representation. I am up for helping out or
getting feedback on this.
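
To make the dummy-variable bookkeeping concrete, here is a small NumPy sketch. It is not the Stata behavior itself and not any existing statsmodels or larry function. The column x and the "x_" naming scheme are assumptions for illustration.

```python
import numpy as np

# Hypothetical categorical column
x = np.array(["low", "high", "mid", "low"])

# levels: sorted unique strings; codes: position of each value in levels
levels, codes = np.unique(x, return_inverse=True)

# One indicator column per level: row i has a 1 in column codes[i]
dummies = (codes[:, None] == np.arange(len(levels))).astype(float)

# Column names that carry the original string labels
names = ["x_" + str(level) for level in levels]
```

In a regression with an intercept, one of these columns would normally be dropped to avoid perfect collinearity.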
I have some examples I was experimenting with where the label for an
individual column in a larry is actually a dict() (it can be done other
ways). I was doing this to allow the conversion from string-type
values to numerical ones while still being able to use the dict to
"translate" the numbers. This would be similar to encode() in Stata,
where Stata takes the string values, "encodes" them, and applies labels
to make it easy to do other things without having to remember what the
encoding was.
@Keith, essentially this is labeling the data points in the
columns/rows of an array. What do you think about this? The label would
need to be a dict(), but then data could be printed, sorted, or selected
based on the string value rather than the numerical value. I would be
very interested in this and would be able to help provide tools for
converting from strings to numerical values and building the
dictionary. I have been meaning to spend time on it but need some
direction from you as to how you would like it to fit in.
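
Roughly what I have in mind, sketched with a plain ndarray and a dict instead of an actual larry. The data, the dict names, and the way the dict is attached to column 0 are all hypothetical.

```python
import numpy as np

# Hypothetical dict labeling the values of column 0, plus its inverse
code_to_str = {0: "CAN", 1: "MEX", 2: "USA"}
str_to_code = {s: c for c, s in code_to_str.items()}

# Columns: encoded country, some numeric measurement
data = np.array([[2.0, 10.5],
                 [0.0, 7.2],
                 [2.0, 11.1],
                 [1.0, 3.4]])

# Select rows by string value rather than by number
usa_rows = data[data[:, 0] == str_to_code["USA"]]

# Sort rows by the string label of the first column
order = sorted(range(len(data)), key=lambda i: code_to_str[int(data[i, 0])])
sorted_data = data[order]

# Print with the codes translated back to strings
for row in sorted_data:
    print(code_to_str[int(row[0])], row[1])
```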
Vincent
If you are dealing with strings (or integers) that are a categorical
variable, there is scikits.statsmodels.tools.categorical, which will make
dummy variables and take care of the names. It's well-tested and
optionally returns a column to map values to the column number for the
dummy variable or a structured array. For your example,
In [6]: import scikits.statsmodels as sm
In [7]: mystarry = sm.tools.categorical(mystarry, col=2, drop=True)
In [8]: mystarry
Out[8]:
array([(1, 2.0, 1.0, 0.0), (2, 3.0, 0.0, 1.0)],
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2_Hello', '<f8'),
             ('f2_World', '<f8')])
To view it as a float array (for larry)
In [9]: mystarry.view((float,3))
Out[9]:
array([[  2.,   1.,   0.],
       [ 32.,   0.,   1.]])
Where 3 is the number of float columns in the view. There are other
ways to use this and it is pretty well documented. Let me know if I can
improve it or if anything is unclear. To handle the ints, though, you
have to make a copy as far as I know, since a view only reinterprets
the existing bytes.
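
A small NumPy-only sketch of why the first column of the view reads 2. and 32., and of one way to make the copy. The array is rebuilt by hand from Out[8] above. The column_stack approach is only one option, not something categorical does for you.

```python
import numpy as np

# The array from Out[8] above, rebuilt by hand
mystarry = np.array([(1, 2.0, 1.0, 0.0), (2, 3.0, 0.0, 1.0)],
                    dtype=[("f0", "<i4"), ("f1", "<f4"),
                           ("f2_Hello", "<f8"), ("f2_World", "<f8")])

# Reinterpret the 24 bytes of each record as three 8-byte floats:
# f0 and f1 (4 bytes each) get fused into one meaningless float64
raw = mystarry.view("<f8").reshape(len(mystarry), 3)

# Copying each field into its own float column keeps every value intact
clean = np.column_stack([mystarry[name].astype(float)
                         for name in mystarry.dtype.names])
```

Here raw has three columns and a garbage first column, while clean has four columns holding the original values as floats.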
Skipper
Structured arrays are all 1d where each "row" is an object that holds
a number of items of (possibly) different types.
>
> I have an array with multiple data types right now (i.e. year as int,
> gdp per capita as float, country code as iso3), but I could definitely
> convert the strings to numbers as you suggested, Vincent. I guess I
> would lose some efficiency by having to store my integers and the
> integer-coded strings as floats, but should be worth it for the
> convenience I suppose. Thanks!
>
We are working on doing this efficiently behind the scenes, but it's
not there yet.
Skipper
If you are planning to use RLM in statsmodels then I would use
scikits.statsmodels.tools.categorical as Skipper suggests, as it is well
tested for use with statsmodels. I am working on the printing of summary
results for RLM and GLM and might have a draft done this week. I am
nearly done with the GLM summary. Note that I am only talking about
printing the results, not the values themselves.
Vincent