GSoC Idea [Categorical data support]

Lokesh Sharma

Feb 13, 2016, 5:16:20 AM
to SciRuby Development
Hi,

I am applying for GSoC this year. I was having a look at the various project ideas, and the one I like the most is Categorical data support. That's because I like Python and am familiar with the tools used for data analysis in Python, like SciPy, NumPy and Pandas. I had a look at the Recommended Skills section, which says:


Recommended skills: Proficiency with Ruby, Good understanding of designing Ruby APIs, preferably should have worked with data analysis and statistics in the past.

I am familiar with Ruby and have worked with Rails, but I don't know what designing Ruby APIs means. Does it mean that one should be familiar with creating gems, and where can I look to get some experience with that? What else are the requirements for this project? I am looking forward to filling the gaps in my knowledge to be able to implement this idea.

Thank You


Sameer Deshmukh

Feb 15, 2016, 8:55:38 AM
to SciRuby Development
"Designing Ruby APIs" basically means that you should be able to make a Ruby-like API for the new functionality, i.e. the API should ideally take advantage of Ruby's unique features like blocks and metaprogramming.

In other words, you should not simply port functions from R or Python.

Sameer Deshmukh

Feb 15, 2016, 8:57:03 AM
to SciRuby Development
Increasing your awareness of this will mostly involve having a good look at some of the well-known Ruby gems.

A good one to start with would be rails/arel (https://github.com/rails/arel). 


Lokesh Sharma

Mar 4, 2016, 7:08:54 AM
to SciRuby Development
Any idea where I can learn to implement a data type in Ruby?

Sameer Deshmukh

Mar 4, 2016, 2:15:36 PM
to SciRuby Development
What exactly do you mean? AFAIK classes are analogous to data types :P

Lokesh Sharma

Mar 5, 2016, 1:39:42 PM
to SciRuby Development
That's great news. Does Daru have any data types defined in Ruby? I just want to learn things like which class to inherit from, which methods to override, etc.

Lokesh Sharma

Mar 5, 2016, 2:41:38 PM
to SciRuby Development

This project can be subdivided into 2 major components:
  • Support categorical data with a new :categorical data type and CategoricalIndex index class in daru.
  • Support operations on categorical data from daru on statsample and statsample-glm.
The first component seems perfectly doable since now I have some familiarity with Daru.

I am not sure about the second component though.

> ...and change the regression methods in statsample and statsample-glm so that they support categorical data supplied by daru.

Could you give me a rough idea of what the requirements will be for implementing the regression methods, apart from programming skills?

Thank You

Alexej Gossmann

Mar 5, 2016, 2:53:27 PM
to sciru...@googlegroups.com
Hi Lokesh,

> Could you give me a rough idea of what the requirements will be for implementing the regression methods, apart from programming skills?

You should understand how a categorical variable can be translated into a set of numerical variables (see ANOVA models, contrasts). For example, see in the description of this issue how in R a categorical input variable in a linear regression model gets automatically replaced by four 0-1-valued dummy variables.
You can research how it is done in patsy (from Python's statsmodels), see for example this. You can also look at how it is done in R.

Best,
Alexej


Lokesh Sharma

Mar 6, 2016, 9:53:48 AM
to sciru...@googlegroups.com

Hello Alexej

Thanks for explaining. I think I have some idea now. All that's needed is to convert the category values to attributes, assign each a 0/1 value depending on whether the row has that particular value of the categorical variable, and then do the regression.

One way I can think of implementing this is a procedure in Daru that explodes the categorical variable into multiple columns of 0/1 values; statsample could then simply take them and do the regression as with any other variable. Am I thinking in the right direction?

Regards
Lokesh

Alexej Gossmann

Mar 6, 2016, 10:07:45 PM
to sciru...@googlegroups.com
Yes, exactly!
Keep in mind that when you replace a categorical column with multiple numerical dummy variables, you have to make sure that the resulting matrix has full column rank, because otherwise the regression methods will fail (for example, if the model already has an intercept term which in the corresponding model matrix is represented by a column of ones, then whichever 0/1-valued columns represent a categorical variable should not sum up to a column of ones). The term "contrasts" refers in statistics to the different ways to recode a categorical variable into numerical variables.
So, if you design a procedure in daru to recode categorical variables before passing them to statsample, it should be done in a clever way to take the above into account.
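
To make the rank problem concrete, here is a minimal sketch in plain Ruby (arrays instead of daru vectors; the helper name indicator_columns is hypothetical and not part of daru). It shows that the k indicator columns of a k-level variable add up, row by row, to the column of ones that represents the intercept, and that dropping one level removes this dependency.

```ruby
# One 0/1 indicator column per level (k columns for k levels).
# Hypothetical helper for illustration; not a daru method.
def indicator_columns(values)
  values.uniq.to_h do |level|
    [level, values.map { |v| v == level ? 1 : 0 }]
  end
end

values    = %w[A1 A2 A1 A3 A2]
columns   = indicator_columns(values)
intercept = Array.new(values.size, 1)

# Row by row, the k indicator columns sum to 1, i.e. to the intercept column,
# so intercept + k indicators are linearly dependent.
row_sums = columns.values.transpose.map(&:sum)
puts row_sums == intercept

# Dropping one level (here the first one) removes that dependency.
reduced      = columns.reject { |level, _| level == values.uniq.first }
reduced_sums = reduced.values.transpose.map(&:sum)
puts reduced_sums == intercept
```

The first comparison holds and the second does not, which is exactly why a model with an intercept needs fewer than k columns for a k-level variable.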

Best,
Alexej

Lokesh Sharma

Mar 8, 2016, 12:29:20 PM
to SciRuby Development
Oh, it's starting to make sense now. Thank You Alexej.

From what I understand, I guess this can be done by first trying a particular coding scheme and seeing whether it gives a full column rank matrix. If it doesn't, we move on to another coding scheme for that categorical variable, like Helmert coding, deviation coding, etc., until we find one that doesn't give dependent columns.

So, in essence, we have to implement different coding systems in Daru, and then have a mechanism that chooses a suitable one depending on the data, so that it doesn't cause a problem in regression. Right?

Regards
Lokesh

Alexej Gossmann

Mar 8, 2016, 2:55:47 PM
to sciru...@googlegroups.com
> From what I understand, I guess this can be done by first trying a particular coding scheme and seeing whether it gives a full column rank matrix. If it doesn't, we move on to another coding scheme for that categorical variable, like Helmert coding, deviation coding, etc., until we find one that doesn't give dependent columns.
> So, in essence, we have to implement different coding systems in Daru, and then have a mechanism that chooses a suitable one depending on the data, so that it doesn't cause a problem in regression. Right?

No, I don't think that that would be the right approach. :)
Let me explain:
If we have a categorical variable with k levels, then it must be replaced by (k-1) numeric variables (not k numeric variables!). If we replaced it with k variables (one for each factor level) instead, then their sum would give a column of ones, which is not linearly independent of the intercept (aka constant) term of the linear model. However, if we replace a k-valued categorical variable with (k-1) numeric ones, then each of those numeric variables *does not* correspond to a particular categorical value, which makes interpretation of the result difficult. The different coding schemes exist so that the user can choose a particular coding scheme which *makes sense in interpretation* of the results in their particular use case.
For example, if the categorical variable has levels A1, A2 and A3, then the default coding scheme in R and Python assumes that the first level, A1, is the base level, and the two generated numerical dummy variables represent the differences (A2-A1) and (A3-A1). But this might not be reasonable in every case. Maybe you want A3 to be the base level; then the two generated numerical variables should represent (A1-A3) and (A2-A3). Or maybe you are interested in how the effect of A3 differs from the average of A1 and A2; then the numerical variables should code for (A2-A1) and (A3 - average(A1,A2)). Or maybe your categorical variable represents some kind of ordered factor (like 5 different concentrations of the same drug), and you want to find out if there is a polynomial pattern; then you would use polynomial contrasts.
So, I think that the user should choose a coding scheme suitable to their application. If the coding is implemented correctly (i.e. without redundancy among dummy variables), then the resulting matrix will usually have full column rank, no matter the coding scheme. However, the method should probably still check whether the matrix has full column rank, and return an error (or warning) if that's not the case.

I'm sure you can find more on this online or in books on linear regression and ANOVA.

Best,
Alexej

Lokesh Sharma

Mar 9, 2016, 3:51:56 AM
to sciru...@googlegroups.com
Thank You for explaining in detail. Please clarify one thing.

I understand now that the different coding schemes are there to aid interpretation, not to fit the regression model better. In fact, all coding schemes give the same performance, i.e. the same fit of the model. One thing I am not sure about is what you meant when you said:

> If the coding is implemented correctly (i.e. without redundancy among dummy variables), then the resulting matrix will usually have full column rank

> Keep in mind that when you replace a categorical column with multiple numerical dummy variables, you have to make sure that the resulting matrix has full column rank, because otherwise the regression methods will fail.

I don't really understand what implementing a coding scheme *correctly* means. For example, suppose we follow the dummy coding scheme, in which we replace k levels with (k-1) dummy codes. Does implementing this scheme *correctly* simply mean that we should have (k-1) dummy codes (and not k), or does correctness mean more than that?

Further, say we apply (k-1) dummy codes for k levels and the resulting matrix isn't of full column rank. Is there anything we can do to solve it?

Regards
Lokesh



Alexej Gossmann

Mar 9, 2016, 11:08:21 AM
to sciru...@googlegroups.com
> I don't really understand what implementing coding scheme *correctly* means.

Sorry, if my comment on that was a little confusing. All I really wanted to say is that the dummy coding might lead to an issue with the model matrix not being of full column rank. And I think you have to keep that in mind when you decide on the design.
I think you need to consider whether your design plays well with different use cases, and decide which use cases you want to cover.
For example, do you want to give the user the option to fit a regression model without an intercept term? In that case you would actually replace the k-valued categorical variable with k numeric ones. So, your coding scheme would be slightly different depending on whether the user wants an intercept term or not (but you might just not give that option to the user, you need to decide).
Another possible question is: what happens if you have multiple categorical variables? Do you replace each of them with (k-1) numeric ones? What if the variables are "nested", for example, 3 schools ("A", "B", "C") and 12 classes within each school ("1", "2", ..., "12")? Do you code them as a total of 13 variables: 2 variables for the schools and 11 variables for the classes? Or do you do something else, such as code them as (3*12-1) variables, i.e., something like "A_1", "A_2", ..., "A_12", ..., "C_1", ..., "C_12"? Or do you let the user decide? I'm not sure if any of that actually poses a problem, but you should probably think about these kinds of questions when you come up with the implementation.
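
The two column counts in the schools example can be checked with a few lines of plain Ruby. This is only arithmetic on the level lists from the example above; it does not touch daru, and the variable names are made up for illustration.

```ruby
schools = %w[A B C]
classes = (1..12).map(&:to_s)

# Additive coding: (k - 1) dummy columns per categorical variable.
additive = (schools.size - 1) + (classes.size - 1)

# One combined factor with a level per school/class pair ("A_1", ..., "C_12"),
# minus one base level.
combined_levels = schools.product(classes).map { |s, c| "#{s}_#{c}" }
combined = combined_levels.size - 1

puts additive
puts combined
```

This prints 13 and 35, matching the 2 + 11 and 3*12 - 1 counts in the text.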

Best,
Alexej

Sameer Deshmukh

Mar 9, 2016, 2:19:13 PM
to SciRuby Development
If you need a point of reference, you can consider checking out the pandas.CategoricalIndex class.

Lokesh Sharma

Mar 12, 2016, 1:51:58 PM
to SciRuby Development

Finally I have some implementation in mind. The following is the high level picture of the implementation.




(You can click here to see the diagram)


At the first step the Statsample will accept the input and figure out what categorical variables are needed to perform the particular regression.

Next, it will invoke some function in Daru which will code the particular categorical variables required for the regression according to the information sent by Statsample.

And finally Daru will return a modified DataFrame on which regression will be easily performed by the Statsample.

If you have any concerns, please feel free to mention them.

Now, below are some low-level details of how I plan to implement each step.

Step 1: The important thing here is to decide how to code the categorical variables, and which ones. To get a full column rank matrix, I am in favor of replacing k-valued categories with (k-1) dummy variables.

There is also the problem of handling multiple categorical variables. I have a simple solution in mind for that.

> What if the variables are "nested", for example, 3 schools ("A", "B", "C") and 12 classes within each school ("1", "2", ..., "12").

If the user simply wants to add the effects of two different categorical variables (i.e. no interaction), they would supply (y ~ a+b). The coding is trivial in this case: (k-1) dummy variables for each categorical variable. If the user wants the two variables to interact, they would supply (y ~ a*b) for the regression.

In a case like the one Alexej mentioned, where the variables are nested, I think we have a case of interaction. It can be handled by introducing an interaction between the two variables, i.e. ("A", "B", "C") and (1, 2, ..., 12). The user would then supply (y ~ a*b) instead of (y ~ a+b) to specify that the effect of the 12 classes differs depending on the 3 schools. I think this is equivalent to saying that the user wants the 12 classes to be treated as nested inside the 3 schools. Please correct me if my assumption about the equivalence of these two is incorrect.
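
As an illustration of the (y ~ a+b) versus (y ~ a*b) distinction, here is a small sketch that expands the right-hand side of such a formula string into individual terms, using the R convention that "a*b" is shorthand for "a + b + a:b". The function name expand_formula is hypothetical, and only "+" and a single two-way "*" per term are handled.

```ruby
# Expand the right-hand side of an R-style formula into individual terms.
# Sketch only: supports "+" and one two-way "*" per term.
def expand_formula(formula)
  response, rhs = formula.split("~", 2).map(&:strip)
  terms = rhs.split("+").map(&:strip).flat_map do |term|
    if term.include?("*")
      a, b = term.split("*", 2).map(&:strip)
      [a, b, "#{a}:#{b}"]   # main effects plus their interaction
    else
      [term]
    end
  end
  [response, terms.uniq]
end

_, additive_terms    = expand_formula("y ~ a+b")
_, interaction_terms = expand_formula("y ~ a*b")

p additive_terms
p interaction_terms
```

The additive formula yields the terms "a" and "b"; the interaction formula additionally yields "a:b", which is the term that would need its own set of coded columns.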

Step 2: In this step, the Categorical Data class in Daru is responsible for coding the categorical variable according to the specification sent by Statsample through an interface that the class provides.

Step 3: Here Statsample accepts the dataframe provided by Daru, performs the regression and outputs the result.

Please let me know what you think about this.

Regards
Lokesh







Sameer Deshmukh

Mar 12, 2016, 3:05:59 PM
to SciRuby Development
Looks good to me.

Alexej Gossmann

Mar 12, 2016, 8:42:36 PM
to sciru...@googlegroups.com
Good plan!

Regarding what you say about step 1: if you want to use the formula language, such as "y ~ a + b" or "y ~ a*b", as you mention, then I highly suggest that you take a look at "How formulas work" from Python's statsmodels, especially the section "Redundancy and categorical factors". And I would suggest that you keep it simple in the beginning. I implemented a similar interface for mixed_models last summer, see LMMFormula and LMM#from_formula. But I am not quite satisfied with the way I did it, because the code that generates the model matrices turned out very confusing. I would probably adapt your code for use in mixed_models if it turns out nice :)

Alexej

Sameer Deshmukh

Mar 13, 2016, 2:03:50 PM
to SciRuby Development
We might also consider creating a gem like Python's patsy for making sense of the formula language, especially if Lokesh has something to add to Alexej's LMMFormula interface. It would be a good way to start that project (ruby-patsy). 

Victor Shepelev

Mar 13, 2016, 2:34:14 PM
to sciru...@googlegroups.com
I always wondered if there can be a Ruby DSL for such formulae (maybe inside blocks)?..

Lokesh Sharma

Mar 14, 2016, 10:53:33 PM
to sciru...@googlegroups.com

Hello Victor. Could you explain what "inside blocks" means? I'm not able to understand.

If we make a Ruby gem like patsy, do you want it to behave just like a formula parser, or do you also expect it to return model matrices like the actual patsy does?

From the Python patsy README, it seems that it acts both as a formula parser and as a builder of design matrices.
"Patsy is a Python library for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. Patsy brings the convenience of R "formulas" to Python."

Thanks, Sameer, for the new idea. I'd love to make a new gem. Does your idea include coding the categorical variables of a dataframe in ruby-patsy or in Daru?

Thank you, Alexej, for your feedback and the links you mentioned. They are helping me move forward.

Alexej Gossmann

Mar 15, 2016, 12:25:30 AM
to sciru...@googlegroups.com
Lokesh,

I think if you make a gem like patsy, it would be more convenient to implement it, so that it returns a modified daru data frame (with new columns for categorical contrasts, interaction effects, etc.) rather than a model matrix. Statsample and Statsample-glm should then be able to fit a model from such a data frame.

In general, I think that you should think about how much is feasible to do over the summer. Especially, since this project requires that you first implement CategoricalIndex in daru. If it is not possible to replicate all of the formula stuff in the given time frame, then maybe we should keep it more simple. For example, by using only the most simple coding schemes for categorical factors, or not implementing categorical interaction effect support at first, but doing it in such a way that the missing functionality can be added without problems in the future.

A Ruby DSL for model formulas would be awesome, but I'm also not sure how to do it. 

Best,
Alexej

Иван Евграфов

Mar 15, 2016, 3:07:42 AM
to sciru...@googlegroups.com
It seems we can't build a DSL like the one lme4 offers, but we can still define something like

f = formula { a =~ b + c - d + t + 1 }
puts f
# => 'a =~ b + c - d + t + 1'
# full example with implementation at https://gist.github.com/dilcom/42a0ec18565c38458603

The problem is that all the operators should be named as default Ruby binary operators, so we can't define ':' (there is no such operator at all) or '~' (it is a unary operator). There is also a gem for defining custom operators (https://rubygems.org/gems/superators19/versions/0.9.3), but it can't define ':' or '~' either.
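
For readers without access to the gist, here is one way such a block-based `formula` helper could work. This is an independent sketch, not the code from the gist: the class names Term and FormulaBuilder are made up for illustration. Unknown identifiers inside the block are caught by method_missing and turned into terms, and the ordinary Ruby operators `+`, `-` and `=~` are overloaded to build up the formula string.

```ruby
# A term in the formula; operators return new terms that remember the text.
class Term
  attr_reader :expr

  def initialize(expr)
    @expr = expr.to_s
  end

  def self.wrap(value)
    value.is_a?(Term) ? value : Term.new(value)
  end

  def +(other)
    Term.new("#{expr} + #{Term.wrap(other).expr}")
  end

  def -(other)
    Term.new("#{expr} - #{Term.wrap(other).expr}")
  end

  # "=~" stands in for R's "~", which Ruby only offers as a unary operator.
  def =~(other)
    Term.new("#{expr} =~ #{Term.wrap(other).expr}")
  end

  def to_s
    expr
  end
end

# Evaluation context for the block: every unknown name becomes a Term.
class FormulaBuilder < BasicObject
  def method_missing(name, *_args)
    ::Term.new(name)
  end
end

def formula(&block)
  FormulaBuilder.new.instance_eval(&block)
end

f = formula { a =~ b + c - d + t + 1 }
puts f
```

Because `+` and `-` bind more tightly than `=~` in Ruby, the right-hand side is assembled first and then attached to the response, so the printed string matches the block's source text.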

Regards,
Ivan

Victor Shepelev

Mar 15, 2016, 2:14:20 PM
to sciru...@googlegroups.com
Yes, I meant something like what Ivan already said. Also, I've already answered privately to "The problem is that all the operators should be named as default Ruby binary operators":

My idea (maybe wrong!) was that we should not necessarily 100% mimic this notation. As far as I can understand (maybe wrong!), it is not some heavenly predefined best notation ever, just an agreement to use lme4's formula language.

Maybe defining a similar, but not completely copycat, DSL can provide a great opportunity for developing Ruby tools (instead of just mimicking R tools).

All in all, Rubyists are writing Math.sqrt(x) and x**2 instead of √x or x² and nobody complains.

Or maybe it's all just my fantasy :) I'm not too familiar with the topic, but it seems that R's notation is far from "common scientific" notation (as it looks in Wikipedia explanations of linear mixed models).

Alexej Gossmann

Mar 15, 2016, 3:34:14 PM
to sciru...@googlegroups.com
> My idea (maybe wrong!) was that we should not necessarily 100% mimic this notation. As far as I can understand (maybe wrong!), it is not some heavenly predefined best notation ever, just an agreement to use lme4's formula language. Maybe defining a similar, but not completely copycat, DSL can provide a great opportunity for developing Ruby tools (instead of just mimicking R tools).

It's not just the notation of the lme4 package. It is actually part of base R (by which I mean the base installation of R), and is used in most modeling packages in R. Also, the same notation is used in Python's statistical modeling packages, as far as I can tell (I'm not too familiar with Python).
I'm not sure about the exact origin of this notation, but since it is in base R, which was developed by statisticians for statisticians, I'm sure that a lot of thought went into it. So, my personal opinion is that, we should not dismiss it so quickly.

For my gem mixed_models I have actually already implemented a type of DSL with the class LMMFormula (linked to in a previous message). But I also ran into the difficulty that ":" can't be used as an operator. So I ended up using "*" for what ":" is in R, and nothing for what "*" is in R (because "a*b" is shorthand for "a+b+a:b"). However, the user still supplies the formula in R notation as a string to LMM#from_formula, where "a:b" is automatically replaced with "a*b", and "a*b" replaced with "a+b+a:b", etc. Then the result of that is evaluated with eval() according to the LMMFormula DSL. As I said before, I'm not too happy with that, but I'm also not sure what a better solution would be. I'm extremely busy this month, but I will definitely put some more thought into this issue.

Alexej

Lokesh Sharma

Mar 16, 2016, 1:45:05 PM
to SciRuby Development
I guess I will leave the idea of creating a separate gem for now, because I am not sure how to do it, and I do not want to leave non-trivial things like interaction effects for later. I think it's better to implement them from the start, because adding them later might not turn out to be that easy: I fear that once the simple functionality is in place, adding things like interaction effects will probably involve changing it in significant ways. On the other hand, while working with Statsample I will make sure that the formula-parsing functionality added there is written so that it can easily be moved to another gem if required in the future.

What do you say?

Regarding implementing Categorical Index and Categorical Data in Daru, I am working on deciding what functionality they should have. For that purpose I am collecting the reasons why we need them. Below are the ones I know of:

Uses of Categorical Data:
  • To be able to efficiently store categorical data.
  • To code categorical variables using different schemes.

Uses of Categorical Index:

  • Efficiently store an index made out of categorical values


Are there more reasons why we need Categorical Data and Categorical Index?

This would really help me decide what other things should be kept in mind while designing Categorical Data and Categorical Index.

Thank You

Sameer Deshmukh

Mar 16, 2016, 4:23:56 PM
to SciRuby Development
None that I can think of.

Alexej Gossmann

Mar 16, 2016, 4:57:13 PM
to sciru...@googlegroups.com

> Are there more reasons why we need Categorical Data and Categorical Index?

I think we need at least basic plotting capabilities for categorical data, such as histograms and pie charts and the like (not sure if what is already present in daru needs to be adjusted). And basic summary statistics for categorical data (for example, table how many times each factor occurs; return unique factors, etc.; again, might be already present in daru).

Alexej

Lokesh Sharma

Mar 19, 2016, 1:27:04 PM
to SciRuby Development
Thanks for the other features you mentioned Alexej.

Sameer,
I checked out your blog here, and from it I understand that there are currently two data types in Daru: Array and NMatrix. The one this project aims to add is Categorical Data. Right?

Could you give a brief overview of how the array data type is implemented, so that I can think about implementing the Categorical Data data type? I think a hash table would be the best way to implement it, as it would store the categorical data efficiently and would also make it easy to form coding schemes. It would also naturally answer the questions Alexej mentioned, like how many times each factor occurs, what the unique factors are, etc.

Regards
Lokesh


Sameer Deshmukh

Mar 19, 2016, 2:50:28 PM
to SciRuby Development
You misunderstood. Array and NMatrix aren't data types, they are merely data containers.

You will have to work with the TYPE of the data. It is stored in the @type variable of a Vector. Currently it can be either :object (for all sorts of generic data) or :numeric (only numbers). The @type is determined internally and the user does not have control over it.

Your project will involve adding a third type called :categorical. Please have this very clear. The ACTUAL DATA that you will store in daru could be stored in either Array or NMatrix, depending on the argument supplied to the :dtype parameter in the Vector/DataFrame constructor.

You may possibly give control over the @type instance variable with some other option to Vector/DataFrame.

Sameer Deshmukh

Mar 19, 2016, 2:51:20 PM
to SciRuby Development
I have answered a query posted by Arafat Khan about this. Please go through it: https://groups.google.com/forum/#!topic/sciruby-dev/ZAPz15ERER4

Lokesh Sharma

Mar 20, 2016, 12:38:09 AM
to SciRuby Development
Thank you Sameer for clarifying things. Please help me understand a few things here.



> Your project will involve adding a third type called :categorical. Please have this very clear. The ACTUAL DATA that you will store in daru could be stored in either Array or NMatrix, depending on the argument supplied to the :dtype parameter in the Vector/DataFrame constructor.

One of the aims of Categorical Data is to represent duplicate values efficiently. I thought that if we implemented Categorical Data as a data container, it would let us store duplicate values efficiently. As far as I can tell, :object and :numeric have no control over how objects are stored and managed inside the storage class.

If we implement :categorical inside :array, how will we store it efficiently inside that storage class and still include all the functionality it should have? Even if we somehow manage to store it efficiently, I don't see an easy way to cleanly implement functionality like indexing.

In Pandas, the Categorical Data class, like other data types such as object and int64, has full control over everything from storage to indexing. This is different from what we have in Daru: we have storage containers, as you mentioned, like :array and :nmatrix, and they store :numeric and :object data. The :object and :numeric data types have no control over indexing, storage, etc.

I may have made wrong assumptions about some of these things. If so, please bear with my ignorance and correct me.

Thank You



Sameer Deshmukh

Mar 20, 2016, 5:55:09 AM
to SciRuby Development
I'm saying that whether a data container has @type = :categorical will determine if the data stored in the daru data structure is categorical in nature. You can implement your own data containers for efficiently storing Categorical Data using Ruby Array/NMatrix.

The data container will be different when the @type variable is :categorical. If you feel that using pure Ruby Array or NMatrix will not do the job, then please feel free to suggest an implementation of your own (I think that's going to be necessary). It need not build upon :array or :nmatrix.

When the @type is :categorical, the question of :object or :numeric will not arise, since the data in the Vector or DataFrame will not be numerical or object.

Lokesh Sharma

Mar 20, 2016, 10:20:46 AM
to SciRuby Development
Perfect! Everything seems to fall into place now. Thank you for clarifying matters.

I want to implement :categorical using a hash table and an array. Here's a prototype:

Say we want to store [:a, :b, :a, :c, :b, :b] with categories :a, :b and :c.

I plan to use one hash table and one array.
  • The hash table would be useful to get all the positions that fall under a given category. In the example, the hash table would be: {a: [0, 2], b: [1, 4, 5], c: [3]}
  • The array would be useful to find, for every element, the category it belongs to through its index. First define map_cats = [:a, :b, :c]. The array in the example would then be [0, 1, 0, 2, 1, 1].
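
A minimal sketch of this prototype in plain Ruby. The class name CategoricalStore and its method names are hypothetical and only illustrate the hash-plus-array layout described above; this is not how daru is implemented.

```ruby
# Stores categorical values as integer codes plus a category -> positions hash.
class CategoricalStore
  attr_reader :categories, :codes, :positions

  def initialize(values)
    @categories = values.uniq                        # map_cats in the text
    lookup      = @categories.each_with_index.to_h   # category -> integer code
    @codes      = values.map { |v| lookup[v] }       # one small integer per element
    @positions  = Hash.new { |hash, key| hash[key] = [] }
    values.each_with_index { |v, i| @positions[v] << i }
  end

  # Element access goes through the code array and the category list.
  def [](index)
    categories[codes[index]]
  end

  # How many times each category occurs.
  def frequencies
    positions.transform_values(&:size)
  end
end

store = CategoricalStore.new([:a, :b, :a, :c, :b, :b])
p store.positions
p store.codes
p store[3]
p store.frequencies
```

With both structures in place, the unique categories and per-category counts fall out directly, and a coding scheme only needs `categories` and `codes` to build its dummy columns.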

Do you like it?


Regards
Lokesh

Sameer Deshmukh

Mar 20, 2016, 1:56:55 PM
to SciRuby Development
Looks good. You should also explore how other languages have implemented this and see if you can come up with something better.

Lokesh Sharma

Mar 21, 2016, 2:48:47 AM
to SciRuby Development

Sameer,

I was working on adding visualization support for categorical data in my proposal. Could you comment on whether Daru already has the following abilities:
  • Show a summary of vector like what's the maximum value, minimum value, etc.
  • Frequency of each value in a vector.
  • Plotting histogram of a vector.

Their presence would help me decide what naming convention to follow and whether to create a new function or not.


And about the storage structure of :categorical, I looked at how it's done in Pandas. They seem to use the same data structure I suggested; however, how they map each category to the set of elements belonging to it is a little mysterious. Here's what I found in their code:


            # there were two ways if categories are present
            # - the old one, where each value is a int pointer to the levels
            #   array -> not anymore possible, but code outside of pandas could
            #   call us like that, so make some checks
            # - the new one, where each value is also in the categories array
            #   (or np.nan)


I am not clear on how each value can itself be in the categories array. For now I am assuming they do what I think they do, and I am going ahead with my implementation until it becomes clearer.

Regards
Lokesh

Sameer Deshmukh

Mar 22, 2016, 4:49:31 AM3/22/16
to SciRuby Development
Replies inline:


On Monday, March 21, 2016 at 12:18:47 PM UTC+5:30, Lokesh Sharma wrote:

Sameer,

I was working on adding visualization support for categorical data in my proposal. Could you comment on whether Daru already has the following abilities:
  • Show a summary of vector like what's the maximum value, minimum value, etc.
See DataFrame#describe or Vector#describe. 
  • Frequency of each value in a vector.
Yes see Vector#frequencies. 
  • Plotting histogram of a vector.

Lokesh Sharma

Mar 23, 2016, 5:11:01 AM3/23/16
to SciRuby Development
I am working on the details of implementing a formula parser for statsample. Please help me figure out the following:

1. Is there any specific reason why there are three gems for statistical analysis? Do they all perform different functions?
  • Statsample
  • Statsample-glm
  • mixed_models


2. Here's my initial idea to implement the formula parser in statsample and statsample-glm. If it's fine, I will go on to develop it. I will implement a class similar to "LMMFormula" in mixed_models. To keep things simple for now I will only support two operations "+" and ":".


I will split the expression by "+" to have a number of terms. For example "a + b + b:c + c:d" would give me "a", "b", "b:c" and "c:d".

Now I will have two types of terms: interaction terms like "b:c" and "c:d", and non-interaction terms like "a" and "b". I will call #contrast_code on the dataframe with ["a", "b"] as argument to code the "a" and "b" columns. Then I will call #contrast_interact with argument [[b, c], [c, d]] to code "b:c" and "c:d". The resulting dataframe will then be ready for regression.
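
A small Ruby sketch of the splitting step only. The helper name split_terms is hypothetical, and #contrast_code and #contrast_interact are the proposed methods from this message, so they are not called here.

```ruby
# Split the right-hand side of a formula into non-interaction and
# interaction terms. Only '+' and ':' are understood, as in the plan above.
def split_terms(rhs)
  terms = rhs.split('+').map(&:strip).reject(&:empty?)
  interactions, main = terms.partition { |term| term.include?(':') }
  {
    main: main,
    interactions: interactions.map { |term| term.split(':').map(&:strip) }
  }
end

parts = split_terms('a + b + b:c + c:d')
parts[:main]          # => ["a", "b"]
parts[:interactions]  # => [["b", "c"], ["c", "d"]]
```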


What's your opinion regarding this type of formula parser?


One thing I don't quite understand is what the expression on the LHS does. What's the difference between "a + b" and "y ~ a + b"? I am sorry for my lack of knowledge of regression; I will fill that gap soon.


Also please help me understand what to do when there are more terms on the LHS, for example "y + x ~ a + b"?


3. Besides regression, are there other important operations on categorical variables that I should consider supporting?


Regards

Lokesh

Sameer Deshmukh

Mar 23, 2016, 3:04:28 PM3/23/16
to SciRuby Development
Alexej is the best person to answer regression-specific questions. However, I have answered some below:


On Wednesday, March 23, 2016 at 2:41:01 PM UTC+5:30, Lokesh Sharma wrote:
I am working on the details of implementing a formula parser for statsample. Please help me figure out the following:

1. Is there any specific reason why there are three gems for statistical analysis? Do they all perform different functions?
  • Statsample
  • Statsample-glm
  • mixed_models
Each caters to different methods. statsample-glm was created to abstract GLM methods away into a separate gem. mixed_models is specific to mixed models.

2. Here's my initial idea to implement the formula parser in statsample and statsample-glm. If it's fine, I will go on to develop it. I will implement a class similar to "LMMFormula" in mixed_models. To keep things simple for now I will only support two operations "+" and ":".


I will split the expression by "+" to have a number of terms. For example "a + b + b:c + c:d" would give me "a", "b", "b:c" and "c:d".

Now I will have two types of terms: interaction terms like "b:c" and "c:d", and non-interaction terms like "a" and "b". I will call #contrast_code on the dataframe with ["a", "b"] as argument to code the "a" and "b" columns. Then I will call #contrast_interact with argument [[b, c], [c, d]] to code "b:c" and "c:d". The resulting dataframe will then be ready for regression.


What's your opinion regarding this type of formula parser?


One thing I don't quite understand is what the expression on the LHS does. What's the difference between "a + b" and "y ~ a + b"? I am sorry for my lack of knowledge of regression; I will fill that gap soon.


Also please help me understand what to do when there are more terms on the LHS, for example "y + x ~ a + b"?


This is one part of it. However, say a user wants to perform a regression that does not fit into your formula language (maybe because you could not handle such a case with the formula language); he should be able to use methods/options that enable this. They will be more complex than the formula language, but they should do the job.


3. Besides regression, are there other important operations on categorical variables that I should consider supporting?

Visualization, for one. As a start you can go through the statsample methods and see if there are any that need categorical data support but currently don't have it. Have you seen the statsample-timeseries gem? 


Regards

Lokesh

Alexej Gossmann

Mar 23, 2016, 8:04:53 PM3/23/16
to sciru...@googlegroups.com
Hi Lokesh!

To answer your questions:

1.) The three gems fit different types of statistical models. 
- statsample only has capabilities to fit the basic linear regression models, which have the general form y = X*b+error. You use it when you have a continuous response variable y that depends linearly on a set of predictors X.
- statsample-glm can fit generalized linear models, for example a logistic regression model (which has the general form logit(p)=X*b). You use it when you have a continuous or discrete response variable y, such that a function of the mean of y depends linearly on the predictor variables X (i.e., much more general than the basic linear model, but more difficult to solve).
- mixed_models is for linear mixed models, which have the general form y = X*b + Z*u + error where u are random effects (in particular, the formula language for mixed models is different because the random effects u need to be specified too). You use it when you have a continuous response variable y, which depends linearly not only on the predictor variables X but also on random effects in a way specified in the columns of Z. Again it is a generalization of the basic linear model, because it reduces to the basic linear model when Z=0, but it is not efficient to use when Z=0.

2.) The formula parser you suggest looks fine to me :)
I have never seen multiple terms on the left hand side. The left hand side represents just the response vector, i.e., a vector which depends linearly on the variables on the right hand side; I don't see how it could be necessary to have multiple terms on the left hand side.

Alexej


--

Lokesh Sharma

Mar 24, 2016, 5:12:47 AM3/24/16
to SciRuby Development
Thank you Alexej for the detailed information. Wouldn't it be great in the future to have a single gem to serve all three purposes? Perhaps we can create a 4th gem that would integrate all 3 of these gems. I think this would make things easier for users because they would then have to learn just one interface for all purposes.



This is one part of it. However, say a user wants to perform a regression that does not fit into your formula language (maybe because you could not handle such a case with the formula language); he should be able to use methods/options that enable this. They will be more complex than the formula language, but they should do the job.

If I am not wrong, this simple formula language with just '+' and ':' should be able to handle a regression equation of any sort. After all, the other operations such as '/', '*' and '**' are shorthand that reduces to forms involving only '+' and ':'. For example, "a / b" is shorthand for "a + a:b", and is intended for cases where one wants to fit a standard sort of ANOVA model but b is nested within a. Isn't that right, Alexej?

I will look into how they've done it in Patsy. Hope it isn't that complicated ;)

Regards
Lokesh

Lokesh Sharma

Mar 27, 2016, 1:26:44 AM3/27/16
to sciru...@googlegroups.com

Hello Alexej

Could you tell me how you implemented the R-like formula parser in mixed_models? I saw that Patsy did it using the shunting-yard algorithm.

Thank you

Alexej Gossmann

Mar 28, 2016, 12:41:03 AM3/28/16
to sciru...@googlegroups.com
Hi Lokesh,

I did not use Shunting-yard or any named algorithm. I rather came up with my own ad hoc solution.

A section from my blog post:

Formula interface 
LMM#from_formula takes in a String containing a formula specifying the model, for example
"z ~ x + y + x:y + (x | u)".
It transforms this formula into another String, for the above example:
"lmm_formula(:intercept) + lmm_variable(:x) + lmm_variable(:y) + lmm_variable(:x) * lmm_variable(:y) + (lmm_variable(:intercept) + lmm_variable(:x) | lmm_variable(:u))",
adding intercept terms and wrapping all variables in lmm_variable().
The Ruby expression in the String is evaluated with eval. This evaluation uses a specially defined class LMMFormula (defined here), which overloads the +, * and | operators, in order to combine the variable names into arrays, which can be fed into LMM#from_daru. The class LMMFormula was an idea that I got from Will Levine (wlevine). In particular, the method LMMFormula#to_input_for_lmm_from_daru transforms an LMMFormula object into a number of Arrays, which have the form required by LMM#from_daru.
Finally, LMM#from_daru constructs the model matrices, vectors and the covariance function Proc, which are passed on to LMM#initialize that performs the actual model fit.

Some additions to the above:

- The formula language for mixed models is richer than that for "ordinary" linear models, because the random effects need to be specified in addition to the fixed effect terms. That's why the additional symbols "|", "(" and ")" play a very important role.
- LMMFormula uses "*" for what ":" is in R
- LMM has essentially three constructors. LMM#initialize operates on design matrices; LMM#from_daru takes in a daru data frame and many arguments, which it puts in proper form to call LMM#new; LMM#from_formula takes in a formula as a character string, and transforms it into arguments for LMM#from_daru.
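
To illustrate the general idea of rewriting a formula into a Ruby expression and evaluating it with overloaded operators, here is a much reduced sketch. It is not the mixed_models code: TermSketch, var and parse_with_eval are hypothetical names, only fixed-effect terms with '+' and ':' are handled, and intercepts and the "|" operator are left out.

```ruby
# Hypothetical stand-in for the operator-overloading idea; not LMMFormula itself.
class TermSketch
  attr_reader :terms

  def initialize(terms)
    @terms = terms
  end

  # '+' collects the terms of both sides.
  def +(other)
    TermSketch.new(terms + other.terms)
  end

  # '*' plays the role of R's ':' and pairs every left term with every right term.
  def *(other)
    TermSketch.new(terms.product(other.terms).map { |left, right| left + right })
  end
end

def var(name)
  TermSketch.new([[name]])
end

def parse_with_eval(rhs)
  # eval runs arbitrary code, so reject anything outside a small whitelist first.
  raise ArgumentError, 'unsupported formula' unless rhs.match?(/\A[\w\s+:()]+\z/)

  # "x + y + x:y" becomes "var(:x) + var(:y) + var(:x) * var(:y)".
  source = rhs.gsub(':', ' * ').gsub(/[A-Za-z_]\w*/) { |name| "var(:#{name})" }
  eval(source).terms
end

parse_with_eval('x + y + x:y')  # => [[:x], [:y], [:x, :y]]
```

Ruby's own operator precedence ('*' binds tighter than '+') and parentheses do the parsing work here, which is what makes the approach short. The trade-off is the reliance on eval, hence the whitelist check before evaluating.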

Let me know if you have more questions.

Best,
Alexej

--

Alexej Gossmann

Mar 28, 2016, 12:47:38 AM3/28/16
to sciru...@googlegroups.com
Lokesh,

By the way, here is the method from mixed_models that creates k or k-1 dummy variables for each non-numeric variable in the data frame:

Alexej

Lokesh Sharma

Mar 28, 2016, 6:52:37 AM3/28/16
to SciRuby Development
Thank You Alexej for the explanation. It has made many things easy to understand for me.

After skimming over your code, I realize it is basically what I have in mind right now, though I am not sure.

One problem with my approach is handling brackets and operators such as '*', '/', '**'. I am having trouble evaluating terms like (a+b):(c+d), i.e. reducing them to a:c+a:d+b:c+b:d. I took a brief look at your implementation and I think it doesn't handle cases like (a+b):(c+d) either. Is that right?

I wasn't able to verify it because of this error which I got by running the notebook:

Gem::ConflictError: Unable to activate mixed_models-0.1.1, because daru-0.1.2 conflicts with daru (= 0.1.0)

Now, thinking about this problem of expression evaluation, I am dissatisfied with my approach. It works with '+' and ':' but not with '*', '/', '**', '()', etc. But I'm thrilled by the way Patsy has done it. They use the shunting-yard algorithm, which basically converts an infix expression to a postfix expression. It's an elegant solution and everything fits very naturally with what we need. We input a complicated expression like (a+b)*(c+d) and it naturally takes care of precedence and gives us an expression without brackets, which is very easy to evaluate. All we then need to do is evaluate this expression, and evaluating a postfix expression is way easier than evaluating an infix expression.

Here's a demo:

3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3  # Evaluating this directly is a nightmare for a computer, just as evaluating (a+b)*(c+d) is for us
3 4 2 * 1 5 - 2 3 ^ ^ / +      # This is the equivalent postfix expression and is way easier to evaluate (Google "evaluation of postfix notation")
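
For reference, a compact Ruby sketch of the shunting-yard conversion and of postfix evaluation for the arithmetic demo above. The precedence table follows ordinary arithmetic conventions ('^' is right-associative); the method names are made up, and this is not how Patsy structures its parser.

```ruby
# Precedence and associativity for ordinary arithmetic operators.
OPERATORS = {
  '^' => { prec: 4, assoc: :right },
  '*' => { prec: 3, assoc: :left },
  '/' => { prec: 3, assoc: :left },
  '+' => { prec: 2, assoc: :left },
  '-' => { prec: 2, assoc: :left }
}.freeze

# Shunting-yard: convert an infix expression to a postfix string.
def to_postfix(infix)
  output = []
  stack = []
  infix.scan(/\d+|[a-z]\w*|[-+*\/^()]/i).each do |token|
    if OPERATORS.key?(token)
      current = OPERATORS[token]
      # Pop operators that bind tighter, or equally tight if left-associative.
      while (top = stack.last) && OPERATORS.key?(top) &&
            (OPERATORS[top][:prec] > current[:prec] ||
             (OPERATORS[top][:prec] == current[:prec] && current[:assoc] == :left))
        output << stack.pop
      end
      stack << token
    elsif token == '('
      stack << token
    elsif token == ')'
      output << stack.pop until stack.empty? || stack.last == '('
      raise ArgumentError, 'unbalanced parentheses' if stack.empty?
      stack.pop # discard the matching '('
    else
      output << token
    end
  end
  output << stack.pop until stack.empty?
  output.join(' ')
end

# Evaluate a numeric postfix expression with a single stack.
def eval_postfix(postfix)
  stack = []
  postfix.split.each do |token|
    if OPERATORS.key?(token)
      right = stack.pop
      left = stack.pop
      stack << (token == '^' ? left**right : left.public_send(token, right))
    else
      stack << token.to_f
    end
  end
  stack.pop
end

to_postfix('3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3')  # => "3 4 2 * 1 5 - 2 3 ^ ^ / +"
```

For a formula language the same loop would apply with a different operator table (for example ':' binding tighter than '*', which binds tighter than '+'), and the "evaluation" step would combine term lists instead of numbers.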

If we go this way and reject my earlier proposed method, a lot of things will change and need to be reconsidered. I would encourage it because of the shortcoming I mentioned above, and I am ready for it because I think we have enough time. What do you say?

Regards
Lokesh


Lokesh Sharma

Mar 28, 2016, 9:25:07 PM3/28/16
to SciRuby Development
Good news! I came across this function in Patsy, ModelDesc.from_formula. It takes any expression like '(a+b):(c+d)' and converts it into an expression involving only '+' and ':', i.e. (a:c+a:d+b:c+b:d). So it turns out we don't need to reconsider the whole thing. I can stick with my previous plan to implement a language involving only ':' and '+' and then reduce all expressions to this language. This is exactly what I was looking for. All I need to do now is study how this works, and I am sticking with my previous proposal.
 

Lokesh Sharma

Mar 30, 2016, 10:46:48 AM3/30/16
to sciru...@googlegroups.com

What are your thoughts Alexej?

Alexej Gossmann

Mar 30, 2016, 12:46:12 PM3/30/16
to sciru...@googlegroups.com
Hi Lokesh!

In general I think that it would be great to have a function like ModelDesc.from_formula from patsy. It looks like it processes the entirety of R's formula language and beyond.
But have you looked at how it is implemented? We need to make sure that your plan is feasible in the given time frame.
I took a quick look at ModelDesc.from_formula, then I ended up in parse_formula, which looks pretty complicated to me.
Honestly, I have doubts that it would be feasible to implement the full capabilities of ModelDesc.from_formula, given all the other things you have to do. But I might be wrong, as I have not spent very much time on it.

Regarding my implementation of mixed_models, I did not include capabilities for interaction terms like '(a+b):(c+d)' and 'a:b:c' because I don't think that they are used very often, and I did not want to spend more than a week on the formula language.

Alexej

2016-03-30 9:46 GMT-05:00 Lokesh Sharma <lokeshh...@gmail.com>:

What are your thoughts Alexej?

Lokesh Sharma

Mar 31, 2016, 4:45:56 PM3/31/16
to SciRuby Development

But have you looked at how it is implemented?

Yes Alexej, for the past few days I've been trying to learn how ModelDesc.from_formula works. Though I do not entirely understand it, I have been able to come up with a way to parse the expression using some of the ideas, like the shunting-yard algorithm. Here's a simple implementation of a formula parser which I have made. It takes a formula language involving '*', '/', ':' and brackets and reduces it to an expression involving only '+' and ':', from which it's very easy to perform the regression. Here's how one can use it:

>> Formula.new.from_formula '(a+b):(c+d)'
=> "a:c+a:d+b:c+b:d"

>> Formula.new.from_formula 'a*b*c*d'
=> "a+b+a:b+c+a:c+b:c+a:b:c+d+a:d+b:d+a:b:d+c:d+a:c:d+b:c:d+a:b:c:d"
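
The implementation referred to above is not reproduced in this thread. As an independent illustration of the reduction, here is a small recursive-descent sketch that expands '+', ':', '*' and brackets (but not '/' or '**'). FormulaSketch is a hypothetical name and is not the Formula class used in the examples.

```ruby
# Hypothetical sketch: expand '+', ':', '*' and parentheses into '+' and ':' only.
class FormulaSketch
  def initialize(formula)
    @tokens = formula.scan(/\w+|[+:*()]/)
  end

  # Returns the expanded formula as a String.
  def expand
    terms = parse_sum
    raise ArgumentError, "unexpected token #{@tokens.first}" unless @tokens.empty?

    terms.map { |term| term.join(':') }.join('+')
  end

  private

  # sum := product ('+' product)*
  def parse_sum
    terms = parse_product
    terms = (terms + parse_product).uniq while accept('+')
    terms
  end

  # product := interaction ('*' interaction)*, where a*b means a + b + a:b
  def parse_product
    terms = parse_interaction
    while accept('*')
      right = parse_interaction
      terms = (terms + right + interact(terms, right)).uniq
    end
    terms
  end

  # interaction := atom (':' atom)*
  def parse_interaction
    terms = parse_atom
    terms = interact(terms, parse_atom) while accept(':')
    terms
  end

  # atom := NAME | '(' sum ')'
  def parse_atom
    token = @tokens.shift
    raise ArgumentError, 'unexpected end of formula' if token.nil?

    if token == '('
      terms = parse_sum
      raise ArgumentError, "expected ')'" unless accept(')')
      terms
    elsif token.match?(/\A\w+\z/)
      [[token]]
    else
      raise ArgumentError, "unexpected token #{token}"
    end
  end

  # Consume the next token if it equals symbol.
  def accept(symbol)
    @tokens.first == symbol && @tokens.shift
  end

  # Combine every term on the left with every term on the right.
  def interact(left, right)
    left.product(right).map { |l, r| (l + r).uniq }.uniq
  end
end

FormulaSketch.new('(a+b):(c+d)').expand  # => "a:c+a:d+b:c+b:d"
```

Each term is kept as an array of variable names, so ':' is a pairwise product of term lists and '*' is "both sides plus their interaction". This reproduces both outputs shown above for these two inputs.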

Let me know what you think about this.

Regards
Lokesh
 


Alexej Gossmann

Apr 2, 2016, 12:52:58 AM4/2/16
to sciru...@googlegroups.com
Hi Lokesh,

It looks pretty good to me. Sorry for the late reply.

Do you also plan to include capabilities to fit models without an intercept term? It is denoted with 0 or -1 in formulas (see "intercept handling" in https://patsy.readthedocs.org/en/latest/formulas.html). Then you would cover pretty much the entire R formula language. But I'm actually not sure if statsample can fit linear models without an intercept term.
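
One way intercept handling could look, as a sketch: treat "- 1" or a "0" term as "no intercept" and keep everything else as ordinary terms. The helper name extract_intercept is hypothetical, and it only handles flat right-hand sides without brackets.

```ruby
# Returns [terms, intercept?] for a flat right-hand side such as "a + b - 1".
def extract_intercept(rhs)
  intercept = true
  terms = []
  rhs.scan(/([+-]?)\s*([\w:]+)/).each do |sign, term|
    case [sign, term]
    when ['-', '1'], ['+', '0'], ['', '0']
      intercept = false            # "- 1", "+ 0" or a leading "0" removes the intercept
    when ['+', '1'], ['', '1']
      intercept = true             # an explicit "1" asks for the intercept
    else
      if sign == '-'
        terms.delete(term)         # "- x" removes a previously added term
      else
        terms << term
      end
    end
  end
  [terms, intercept]
end

extract_intercept('a + b - 1')  # => [["a", "b"], false]
extract_intercept('0 + a')      # => [["a"], false]
extract_intercept('a + a:b')    # => [["a", "a:b"], true]
```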

Best,
Alexej

Lokesh Sharma

Apr 4, 2016, 9:43:49 AM4/4/16
to sciru...@googlegroups.com
Thank you for the feedback Alexej.
 
Do you also plan to include capabilities to fit models without an intercept term? It is denoted with 0 or -1 in formulas (see "intercept handling" in https://patsy.readthedocs.org/en/latest/formulas.html).

I will have a look at it. I haven't thought about that yet.


Could you take a look at the way I've resolved redundancy in the coding of categorical variables? It's here (it's the same as I have in my proposal).

I created this on my own and it works for all the cases I've seen so far but it would be great if you could verify if it works for all cases.

Thanks again :)

Alexej Gossmann

Apr 5, 2016, 8:56:05 AM4/5/16
to sciru...@googlegroups.com
Could you take a look at the way I've resolved redundancy in the coding of categorical variables? It's here (it's the same as I have in my proposal).

I don't think it's quite right. If the model has an intercept term (as it usually does by default), then you cannot code any variable with n dummy variables, because their sum would give a vector of ones, which corresponds to the intercept term. Right?
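
The point can be checked with a tiny sketch. dummy_code is a hypothetical helper (not a Daru or statsample method) that codes a column with all n dummy variables or with n-1 by dropping the first category.

```ruby
# Code a categorical column as 0/1 dummy columns.
# full_rank: true drops the first category (n-1 columns); false keeps all n.
def dummy_code(values, full_rank: true)
  categories = values.uniq
  categories = categories.drop(1) if full_rank
  categories.map { |cat| [cat, values.map { |v| v == cat ? 1 : 0 }] }.to_h
end

a = %w[yes no no yes]

all_n = dummy_code(a, full_rank: false)
# => {"yes"=>[1, 0, 0, 1], "no"=>[0, 1, 1, 0]}

# Summing the n columns row by row gives a column of ones,
# which is exactly the intercept column.
all_n.values.transpose.map(&:sum)  # => [1, 1, 1, 1]

dummy_code(a)                      # => {"no"=>[0, 1, 1, 0]}
```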

Alexej

--

Lokesh Sharma

Apr 5, 2016, 10:13:14 PM4/5/16
to SciRuby Development

I don't think it's quite right. If the model has an intercept term (as it usually does by default), then you cannot code any variable with n dummy variables, because their sum would give a vector of ones, which corresponds to the intercept term. Right?

Hi Alexej

Let me explain what is happening, because in the gist I've provided only the part which codes interaction terms, so it might be confusing.

There are two functions:
  • #contrast_code: to code non-interaction terms such as 'a', 'b', 'c', etc., which don't involve ':'
  • #contrast_interact: to code interaction terms such as 'a:b', 'b:c:a', etc., which include ':'

Non-interaction terms are always coded by #contrast_code to n-1 variables. In order to compute terms such as 'a:b', I code 'a' to either n variables or n-1 variables, but that's just temporary, only for computing 'a:b'. The following example illustrates this point:


Let the expression be 'y ~ a + a:b'.

Now we need to code 'a' and 'a:b'. Let's say 'a' has categories 'yes', 'no' and 'b' has categories 'true' and 'false'.

Firstly 'a' will be coded to dummy variables ['a_no'].

Now to compute 'a:b' I will code 'a' to ['a_yes', 'a_no'] and 'b' to ['b_false']. This is because 'a' is in the expression but not 'b'. Notice, coding of 'a' and 'b' is just to compute 'a:b'. The coding of 'a' which goes into final matrix will be performed by #contrast_code.

Therefore 'a:b' will be coded to ['a_yes:b_false', 'a_no:b_false']


Total dummy variables will be ['a_no', 'a_yes:b_false', 'a_no:b_false']


In the gist I haven't included #contrast_code. It always converts every variable to n-1 variables, and the coding which happens within #contrast_interact to code interaction terms is only temporary.


Hope that makes things clear. Please let me know if something else needs clarification.


Regards

Lokesh

Alexej Gossmann

Apr 5, 2016, 11:46:07 PM4/5/16
to sciru...@googlegroups.com
Hi Lokesh,

I think this approach works in most cases, but it can still give you redundancies.
If I understand you correctly, you would code the interaction term in the model "y~a:b" with n*m dummy variables? Then the sum of the n*m variables is going to give a constant column (which conflicts with the intercept term).
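
A small sketch of that redundancy, using made-up data for 'a' (yes/no) and 'b' (true/false). The helper names are hypothetical; the column names follow the 'a_yes:b_false' style used earlier in the thread.

```ruby
# One 0/1 indicator column per category of a variable.
def indicators(values)
  values.uniq.map { |cat| [cat, values.map { |v| v == cat ? 1 : 0 }] }.to_h
end

# All n*m interaction columns: elementwise products of the indicator columns.
def interaction_columns(a, b)
  columns = {}
  indicators(a).each do |cat_a, col_a|
    indicators(b).each do |cat_b, col_b|
      columns["a_#{cat_a}:b_#{cat_b}"] = col_a.zip(col_b).map { |x, y| x * y }
    end
  end
  columns
end

a = %w[yes no yes no]
b = %w[true true false false]

columns = interaction_columns(a, b)
columns.keys
# => ["a_yes:b_true", "a_yes:b_false", "a_no:b_true", "a_no:b_false"]

# Every row falls in exactly one (a, b) cell, so the n*m columns sum to a
# constant column of ones, which duplicates the intercept.
columns.values.transpose.map(&:sum)  # => [1, 1, 1, 1]
```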

Alexej

--

Lokesh Sharma

Apr 6, 2016, 12:42:41 AM4/6/16
to sciru...@googlegroups.com

Oh I see! Yes, that's a problem. I guess I'll have to read the Patsy dev guide and adopt their method. BTW how did you implement it?

Alexej Gossmann

Apr 6, 2016, 12:58:22 AM4/6/16
to sciru...@googlegroups.com
I coded it with n*m-1 dummy variables in that case. You can see it here: https://github.com/agisga/mixed_models/blob/master/lib/mixed_models/ModelSpecification.rb#L307

2016-04-05 23:42 GMT-05:00 Lokesh Sharma <lokeshh...@gmail.com>:

Oh I see! Yes, that's a problem. I guess I'll have to read the Patsy dev guide and adopt their method. BTW how did you implement it?

--

Lokesh Sharma

Apr 8, 2016, 2:43:22 PM4/8/16
to SciRuby Development
Nice. My method won't work for interactions of more than two variables. I looked at Patsy's method of resolving redundancy, which divides the interaction terms into subterms, chooses the small subterms mentioned in the expression, and then combines the remaining subterms into bigger terms. It requires some additional work, though. With this strategy it will also be easy to work with intercept terms, so I'm thinking of including them too.

Alexej Gossmann

Apr 11, 2016, 6:21:57 PM4/11/16
to sciru...@googlegroups.com
Sounds good! It would be great if you can build Patsy-like capabilities. But keep in mind that if it's too complex and you need more time to work on other important things, you can always restrict the scope of the formula language (for example, more than 2-way interactions are rarely used in statistics).

Alexej

2016-04-08 13:43 GMT-05:00 Lokesh Sharma <lokeshh...@gmail.com>:
Nice. My method won't work for interactions of more than two variables. I looked at Patsy's method of resolving redundancy, which divides the interaction terms into subterms, chooses the small subterms mentioned in the expression, and then combines the remaining subterms into bigger terms. It requires some additional work, though. With this strategy it will also be easy to work with intercept terms, so I'm thinking of including them too.

Lokesh Sharma

Apr 12, 2016, 8:55:31 PM4/12/16
to SciRuby Development
I will keep that in mind. I will try to make it in such a way that complex functionality can be introduced later on if not now.

One question: What does it mean when there aren't any terms on the left side of a regression equation?

I saw this case in the Patsy dev guide. It says:


Sometimes you want a formula that has no left-hand side, and you can write that as ~ x1 + x2 or even x1 + x2.

From what I know, regression models the relationship between one dependent variable and some independent variables. So terms on both sides seem necessary.

Regards
Lokesh

Alexej Gossmann

Apr 12, 2016, 9:42:24 PM4/12/16
to sciru...@googlegroups.com
One question: What does it mean when there aren't any terms on the left side of a regression equation?

I don't know how it is used in Python packages. I guess there are statistical methods (outside of regression) where the formula language makes sense but no left hand side is necessary. In the context of regression, I can imagine it being useful if, for example, you want to pass the right hand side of a formula to a function which connects different left hand side terms to it. So I think it can be useful, but I don't know of a very common use case.
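
As a sketch of how a parser could accept both forms, the hypothetical helper below returns nil for a missing left hand side, so the caller can decide whether a response is required.

```ruby
# Split a formula into [lhs, rhs]; lhs is nil when no response is given.
def split_formula(formula)
  lhs, tilde, rhs = formula.partition('~')
  return [nil, lhs.strip] if tilde.empty?   # "x1 + x2": no '~' at all

  lhs = lhs.strip
  [lhs.empty? ? nil : lhs, rhs.strip]
end

split_formula('y ~ a + b')   # => ["y", "a + b"]
split_formula('~ x1 + x2')   # => [nil, "x1 + x2"]
split_formula('x1 + x2')     # => [nil, "x1 + x2"]
```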

Best,
Alexej

--

Lokesh Sharma

Apr 25, 2016, 3:42:17 PM4/25/16
to SciRuby Development
Hello Sameer and Alexej

Since the community bonding period is on, I was wondering what tasks I should perform during this period. Here are a couple of things I have in mind:

  • Migrating the tests in statsample to rspec. Here's an issue open regarding this.
  • Deciding upon high level integration tests.

BTW, is adding visualization support for categorical variables in nyaplot sufficient? Or should we also include support for gnuplotrb?


Regards

Lokesh

Sameer Deshmukh

Apr 29, 2016, 2:51:19 PM4/29/16
to SciRuby Development
It would be ideal if you add support for both nyaplot and gnuplotrb. Both are equally mature projects and are important in their own respects.

Migrating statsample to rspec is somewhat lower priority since it's working fine the way it is now. You can definitely start with high level integration tests. Don't forget to give us daily updates, as discussed in today's meeting.