Recommended skills: Proficiency with Ruby, Good understanding of designing Ruby APIs, preferably should have worked with data analysis and statistics in the past.
This project can be subdivided into 2 major components:
- Support categorical data with a new :categorical data type and CategoricalIndex index class in daru.
- Support operations on categorical data from daru on statsample and statsample-glm.
...and change the regression methods in statsample and statsample-glm so that they support categorical data supplied by daru.
--
You received this message because you are subscribed to the Google Groups "SciRuby Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sciruby-dev...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hello Alexej
Thanks for explaining. I think I have some idea now. All that's needed is to convert the category values to attributes, assign them 0/1 values based on whether a row has a particular value of the categorical variable, and then do the regression.
One way I can think of implementing this is to have a procedure in Daru that explodes the categorical variable into multiple columns of 0/1 values; statsample could then simply take them and do the regression as with any other variable. Am I thinking in the right direction?
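A minimal sketch of the "exploding" step described above, using plain Ruby arrays rather than Daru. The helper name `dummy_code`, the column naming scheme and the sample data are assumptions made for illustration; none of them is an existing daru method.

```ruby
# Expand one categorical column into 0/1 indicator columns,
# one column per distinct category (hypothetical helper, not a daru API).
def dummy_code(name, values)
  values.uniq.each_with_object({}) do |category, columns|
    columns["#{name}_#{category}"] = values.map { |v| v == category ? 1 : 0 }
  end
end

gender = %w[male female female male]
columns = dummy_code('gender', gender)
columns.each { |col, data| puts "#{col}: #{data.inspect}" }
# gender_male: [1, 0, 0, 1]
# gender_female: [0, 1, 1, 0]
```

This version produces one column per category. Whether all of them can enter a regression together depends on the other terms in the model.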
Regards
Lokesh
--
From what I understand, I guess this can be done by first trying a particular coding scheme and checking whether it gives a full column rank matrix. If it doesn't, then we have to move on to another coding scheme, such as Helmert coding, deviation coding, etc., for that particular categorical variable until a suitable scheme is found that doesn't produce dependent columns.
So, in essence, we have to implement different coding systems in Daru, and then have a mechanism to choose a suitable one depending on the data, so that it doesn't cause a problem in regression. Right?
--
If the coding is implemented correctly (i.e. without redundancy among dummy variables), then the resulting matrix will usually have full column rank.
Keep in mind that when you replace a categorical column with multiple numerical dummy variables, you have to make sure that the resulting matrix has full column rank, because otherwise the regression methods will fail.
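To make the full-column-rank point concrete, here is a small sketch that computes the rank of a design matrix by Gaussian elimination. The helper `column_rank` and the four-row sample data are assumptions for illustration only. It compares an intercept plus dummies for every level of a two-level variable against an intercept plus a single dummy.

```ruby
# Rank of a matrix given as an array of rows, via Gaussian elimination
# on Rationals to avoid floating-point issues (hypothetical helper).
def column_rank(rows)
  m = rows.map { |row| row.map { |x| Rational(x) } }
  rank = 0
  m.first.size.times do |col|
    pivot = (rank...m.size).find { |r| m[r][col] != 0 }
    next unless pivot
    m[rank], m[pivot] = m[pivot], m[rank]
    ((rank + 1)...m.size).each do |r|
      factor = m[r][col] / m[rank][col]
      m[r] = m[r].zip(m[rank]).map { |a, b| a - factor * b }
    end
    rank += 1
  end
  rank
end

# Columns: intercept, a_yes, a_no (the two dummies sum to the intercept)
with_all_levels = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
# Columns: intercept, a_no (one level dropped as the reference)
with_reference_dropped = [[1, 0], [1, 1], [1, 0], [1, 1]]

puts column_rank(with_all_levels)        # 2, but there are 3 columns
puts column_rank(with_reference_dropped) # 2, equal to the number of columns
```

In the first matrix the rank is smaller than the number of columns, which is the situation in which the regression would fail.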
Finally I have some implementation in mind. The following is the high level picture of the implementation.
(You can click here to see the diagram)
What if the variables are "nested", for example, 3 schools ("A", "B", "C") and 12 classes within each school ("1", "2", ..., "12")?
Hello Victor. Could you explain what "inside the blocks" means? I'm not able to understand it.
If we make a Ruby gem like Patsy, do you want it to behave just like a formula parser, or do you also expect it to return model matrices like the actual Patsy does?
From the Python Patsy README, it seems to act both as a formula parser and as a builder of design matrices.
"Patsy is a Python library for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. Patsy brings the convenience of R "formulas" to Python."
Thanks Sameer for the new idea. I'd love to make a new gem. Does your idea include coding categorical variables of the dataframe in ruby-pasty or in Daru?
Thank you Alexej for your feedback and links you mentioned. They are helping me to move forward.
--
Seems like we can't build such a DSL as lme4 offers, but we can still define something like:
f = formula { a =~ b + c - d + t + 1 }
puts f
# => 'a =~ b + c - d + t + 1'
# full example with implementation at https://gist.github.com/dilcom/42a0ec18565c38458603
The problem is that all the operators have to be existing Ruby binary operators, so we can't define ':' (there is no such operator at all) or '~' (it's a unary operator). There is also a gem for defining custom operators (https://rubygems.org/gems/superators19/versions/0.9.3), but it can't define ':' or '~' either.
Regards,
Ivan.
Uses of Categorical Index:
> Are there more reasons why we need Categorical Data and Categorical Index?
I think we need at least basic plotting capabilities for categorical data, such as histograms and pie charts and the like (not sure if what is already present in daru needs to be adjusted). And basic summary statistics for categorical data (for example, table how many times each factor occurs; return unique factors, etc.; again, might be already present in daru).
Alexej
--
Hi,
I am applying for GSoC this year. I was having a look at various project ideas. The one I like the most is Categorical data support, because I like Python and am familiar with the tools used for data analysis in Python, like SciPy, NumPy and Pandas. I had a look at the Recommended Skills section, which says: "Recommended skills: Proficiency with Ruby, Good understanding of designing Ruby APIs, preferably should have worked with data analysis and statistics in the past."
I am familiar with Ruby and have worked with Rails, but I don't know what designing Ruby APIs means. Does it mean that one should be familiar with creating gems, and where can I look to get some experience with that? What else are the requirements for this project? I am looking forward to filling the gaps in my knowledge to be able to implement this idea.
Thank You
Your project will involve adding a third type called :categorical. Please be very clear about this. The ACTUAL DATA that you will store in daru could be stored in either Array or NMatrix, depending on the argument supplied to the :dtype parameter in the Vector/DataFrame constructor.
Do you like it?
Their presence would help me decide what naming convention to follow and whether to create a new function or not.
And about the storage structure of :category, I checked how it's done in Pandas. They seem to be using the same data structure that I suggested; however, how they map each category to the set of elements belonging to that category is a little mysterious. Here's what I found in their code:
# there were two ways if categories are present
# - the old one, where each value is a int pointer to the levels
# array -> not anymore possible, but code outside of pandas could
# call us like that, so make some checks
# - the new one, where each value is also in the categories array
# (or np.nan)
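As a rough illustration of the "each value is an int pointer to the levels array" layout mentioned in that comment, here is a sketch in plain Ruby. The class name `CategoricalSketch` and its methods are hypothetical; this is neither pandas' nor daru's actual implementation, only one way such storage could look.

```ruby
# Code-based categorical storage: the distinct categories are stored once,
# and each value is an integer index into that list (hypothetical class).
class CategoricalSketch
  attr_reader :categories, :codes

  def initialize(values)
    @categories = values.compact.uniq
    lookup = @categories.each_with_index.to_h
    # nil marks a missing value, playing the role of np.nan
    @codes = values.map { |v| lookup[v] }
  end

  # Rebuild the original values from the codes.
  def to_a
    @codes.map { |code| code && @categories[code] }
  end

  # How many times each category occurs.
  def frequencies
    @categories.each_with_index.map { |cat, i| [cat, @codes.count(i)] }.to_h
  end
end

vector = CategoricalSketch.new(['a', 'b', nil, 'a'])
puts vector.categories.inspect # ["a", "b"]
puts vector.codes.inspect      # [0, 1, nil, 0]
```

With this layout, a frequency table only needs to count integer codes, and renaming a category touches a single entry in the categories list.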
Sameer,
I was working on adding visualization support for categorical data in my proposal. Could you comment on whether Daru already has the following abilities:
- Show a summary of a vector, such as its maximum value, minimum value, etc.
- Frequency of each value in a vector.
- Plotting histogram of a vector.
I am working on the details of implementing the formula parser for statsample. Please help me figure out the following:
1. Is there any specific reason why there are three gems for statistical analysis? Do they all perform different functions?
- Statsample
- Statsample-glm
- mixed_models
2. Here's my initial idea for implementing the formula parser in statsample and statsample-glm. If it's fine, I will go on to develop it. I will implement a class similar to "LMMFormula" in mixed_models. To keep things simple for now, I will only support two operations, "+" and ":".
I will split the expression by "+" to get a number of terms. For example, "a + b + b:c + c:d" would give me "a", "b", "b:c" and "c:d".
Now I will have two types of terms: interaction terms like "b:c" and "c:d", and non-interaction terms like "a" and "b". I will call #contrast_code on the dataframe with ["a", "b"] as the argument to code the "a" and "b" columns. Then I will call #contrast_interact with the argument [[b, c], [c, d]] to code "b:c" and "c:d". The resulting dataframe will then be ready for regression.
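The splitting step described above could be sketched as follows. The method name `split_terms` and the shape of its return value are assumptions for illustration; the results are the kind of arguments that the proposed #contrast_code and #contrast_interact would receive.

```ruby
# Split the right-hand side of a formula into non-interaction terms and
# interaction terms. Only "+" and ":" are handled (hypothetical helper).
def split_terms(expression)
  terms = expression.split('+').map(&:strip)
  interactions, main_effects = terms.partition { |t| t.include?(':') }
  {
    main_effects: main_effects,
    interactions: interactions.map { |t| t.split(':').map(&:strip) }
  }
end

result = split_terms('a + b + b:c + c:d')
puts result[:main_effects].inspect # ["a", "b"]
puts result[:interactions].inspect # [["b", "c"], ["c", "d"]]
```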
What's your opinion regarding this type of formula parser?
One thing I don't quite understand is what the expression on the LHS does. What's the difference between "a + b" and "y ~ a + b"? I am sorry for my lack of knowledge of regression; I will fill the gap soon.
Also, please help me understand what to do when there are more terms on the LHS, for example "y + x ~ a + b".
3. Besides regression, are there other important operations on categorical variables that I should consider supporting?
Regards
Lokesh
--
This is one part of it. However, say a user wants to perform a regression that does not fit into your formula language (maybe because you could not handle such a case with the formula language), he should be able to use methods/options that enable this. They will be more complex than the formula language, but they should do the job.
"a / b" is shorthand for "a + a:b", and is intended to be useful in cases where one wants to fit a standard sort of ANOVA model, but b is nested within a. Isn't that right Alexej?
Hello Alexej
Could you tell me how you implemented the R-like formula parser in mixed_models? I saw that Patsy did it using the shunting-yard algorithm.
Thank you
Formula interface
LMM#from_formula takes in a String containing a formula specifying the model, for example
"z ~ x + y + x:y + (x | u)".
It transforms this formula into another String, for the above example:
"lmm_formula(:intercept) + lmm_variable(:x) + lmm_variable(:y) + lmm_variable(:x) * lmm_variable(:y) + (lmm_variable(:intercept) + lmm_variable(:x) | lmm_variable(:u)))",
adding intercept terms and wrapping all variables in lmm_variable().
The Ruby expression in the String is evaluated with eval. This evaluation uses a specially defined class LMMFormula (defined here), which overloads the +, * and | operators, in order to combine the variable names into arrays, which can be fed into LMM#from_daru. The class LMMFormula was an idea that I got from Will Levine (wlevine). In particular, the method LMMFormula#to_input_for_lmm_from_daru transforms an LMMFormula object into a number of Array, which have the form required by LMM#from_daru.
Finally, LMM#from_daru constructs the model matrices, vectors and the covariance function Proc, which are passed on to LMM#initialize that performs the actual model fit.
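The rewrite-then-eval idea described above can be illustrated with a much smaller stand-in. Everything in this sketch (the class `TermList`, the helper `var`, the regular expressions) is hypothetical and is not the LMMFormula code from mixed_models. It supports only "+" and ":", leaves out the "|" operator and the intercept handling, and uses eval on a trusted string only.

```ruby
# Overloaded operators accumulate model terms instead of doing arithmetic.
# Each term is an array of variable names, e.g. [:x, :y] for the x:y interaction.
class TermList
  attr_reader :terms

  def initialize(terms)
    @terms = terms
  end

  # "+" concatenates the terms of both operands.
  def +(other)
    TermList.new(terms + other.terms)
  end

  # "*" pairs every term on the left with every term on the right.
  def *(other)
    TermList.new(terms.product(other.terms).map { |l, r| l + r })
  end
end

def var(name)
  TermList.new([[name]])
end

source = 'x + y + x:y'
# Rewrite ":" to "*" first, then wrap every variable name in var(...).
ruby_source = source.gsub(':', ' * ').gsub(/[a-z_]\w*/) { |name| "var(:#{name})" }
formula = eval(ruby_source)

puts ruby_source           # var(:x) + var(:y) + var(:x) * var(:y)
puts formula.terms.inspect # [[:x], [:y], [:x, :y]]
```

Because Ruby gives "*" higher precedence than "+", the interaction is grouped before the terms are concatenated, which matches the way ":" binds more tightly than "+" in the formula.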
--
Gem::ConflictError: Unable to activate mixed_models-0.1.1, because daru-0.1.2 conflicts with daru (= 0.1.0)
What are your thoughts Alexej?
But have you looked at how it is implemented?
>> Formula.new.from_formula '(a+b):(c+d)'
=> "a:c+a:d+b:c+b:d"
>> Formula.new.from_formula 'a*b*c*d'
=> "a+b+a:b+c+a:c+b:c+a:b:c+d+a:d+b:d+a:b:d+c:d+a:c:d+b:c:d+a:b:c:d"
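For reference, the two expansions shown in the session above can be reproduced with a few lines of plain Ruby. The helpers `expand_star` and `expand_colon` are hypothetical and are not the Formula class used in the session; they handle only these two simple shapes of input.

```ruby
# Expand "a*b*c" into main effects plus all interactions. Each new factor
# is added on its own and crossed with every term collected so far.
def expand_star(expression)
  factors = expression.split('*').map(&:strip)
  terms = factors.reduce([]) do |acc, factor|
    acc + [factor] + acc.map { |t| "#{t}:#{factor}" }
  end
  terms.join('+')
end

# Expand "(a+b):(c+d)" given the two term lists: pair every left term
# with every right term.
def expand_colon(left, right)
  left.product(right).map { |l, r| "#{l}:#{r}" }.join('+')
end

puts expand_colon(%w[a b], %w[c d])
# a:c+a:d+b:c+b:d
puts expand_star('a*b*c*d')
# a+b+a:b+c+a:c+b:c+a:b:c+d+a:d+b:d+a:b:d+c:d+a:c:d+b:c:d+a:b:c:d
```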
Do you also plan to include capabilities to fit models without an intercept term? It is denoted with 0 or -1 in formulas (see "intercept handling" in https://patsy.readthedocs.org/en/latest/formulas.html).
--
I don't think it's quite right. If the model has an intercept term (as it usually does by default), then you cannot code any variable with n dummy variables, because their sum would give a vector of ones, which corresponds to the intercept term. Right?
Non-interaction terms are always coded by #contrast_code to n-1 variables. To compute terms such as 'a:b', I code 'a' to either n or n-1 variables, but only temporarily, for computing 'a:b'. The following example illustrates this point:
Let the expression be 'y ~ a + a:b'.
Now we need to code 'a' and 'a:b'. Let's say 'a' has categories 'yes', 'no' and 'b' has categories 'true' and 'false'.
Firstly 'a' will be coded to dummy variables ['a_no'].
Now, to compute 'a:b', I will code 'a' to ['a_yes', 'a_no'] and 'b' to ['b_false']. This is because 'a' is in the expression but 'b' is not. Notice that this coding of 'a' and 'b' is only used to compute 'a:b'. The coding of 'a' that goes into the final matrix is performed by #contrast_code.
Therefore 'a:b' will be coded to ['a_yes:b_false', 'a_no:b_false'].
Total dummy variables will be ['a_no', 'a_yes:b_false', 'a_no:b_false']
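The worked example above can be traced with plain arrays. The sample data and the helper `indicator` are assumptions made for illustration; the sketch hard-codes the choices from the example (n-1 columns for the main effect 'a', all levels of 'a' and n-1 levels of 'b' for the interaction) and does not implement #contrast_code or #contrast_interact themselves.

```ruby
# Indicator column for one category: 1 where the value matches, else 0.
def indicator(values, category)
  values.map { |v| v == category ? 1 : 0 }
end

a = %w[yes no yes no]
b = %w[true false false true]

# Main effect 'a': n-1 columns, with 'yes' as the dropped level.
design = { 'a_no' => indicator(a, 'no') }

# Interaction 'a:b': every level of 'a' times the single coded level of 'b'.
%w[yes no].each do |level|
  design["a_#{level}:b_false"] =
    indicator(a, level).zip(indicator(b, 'false')).map { |x, y| x * y }
end

design.each { |name, column| puts "#{name}: #{column.inspect}" }
# a_no: [0, 1, 0, 1]
# a_yes:b_false: [0, 0, 1, 0]
# a_no:b_false: [0, 1, 0, 0]
```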
In the gist I haven't mentioned #contrast_code. It always converts every variable to n-1 variables, and the coding that happens within #contrast_interact to code interaction terms is only temporary.
Hope that makes things clear. Please let me know if something else needs clarification.
Regards
Lokesh
--
Oh I see! Yes, that's a problem. Guess I'll have to look at the Patsy dev guide and adopt their method. BTW, how did you implement it?
--
Nice. My method won't work for interactions of more than two variables. I looked at Patsy's method of resolving redundancy, which divides the interaction terms into subterms, chooses the small subterms mentioned in the expression, and then combines the remaining subterms into bigger terms. It requires some additional work, though. With this strategy it will also be easy to work with intercept terms, so I'm thinking of including them too.
Sometimes you want a formula that has no left-hand side, and you can write that as ~ x1 + x2 or even x1 + x2.
--
BTW, is adding support for visualization of categorical variables in nyaplot sufficient? Or should we also include support for gnuplotrb?
Regards
Lokesh