fixed effects

Matthieu

Jun 9, 2015, 6:35:53 PM

to julia...@googlegroups.com

Hello,

I'm just starting with Julia today, and I've coded a simple algorithm to demean columns of a DataFrame with respect to multiple high-dimensional fixed effects (where groups may be defined by multiple columns).

FixedEffects.jl

The algorithm simply returns a new DataFrame with the partialled-out columns. One may then use these partialled-out variables in a simple OLS regression. This corresponds to a very basic version of the felm function in the R package lfe.
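For readers unfamiliar with the idea, here is a minimal sketch of demeaning with respect to several fixed effects by alternating over them until nothing changes. It is not the package's code: the names `demean_by_group!` and `demean_sketch`, the integer group ids, and the tolerance are all assumptions made for illustration.

```julia
# Subtract group means from x in place; `groups` holds an integer group id per row.
function demean_by_group!(x::Vector{Float64}, groups::Vector{Int})
    ngroups = maximum(groups)
    sums = zeros(ngroups)
    counts = zeros(Int, ngroups)
    for i in eachindex(x)
        sums[groups[i]] += x[i]
        counts[groups[i]] += 1
    end
    for i in eachindex(x)
        x[i] -= sums[groups[i]] / counts[groups[i]]
    end
    return x
end

# Hypothetical driver: sweep over all fixed effects until the vector stops changing.
function demean_sketch(x::Vector{Float64}, fes::Vector{Vector{Int}};
                       tol = 1e-10, maxiter = 1000)
    res = copy(x)              # leave the caller's data untouched
    old = similar(res)         # buffer for the previous iterate, allocated once
    for _ in 1:maxiter
        copyto!(old, res)      # remember the previous iterate
        for g in fes
            demean_by_group!(res, g)
        end
        maximum(abs.(res .- old)) < tol && break
    end
    return res
end

x = [1.0, 2.0, 4.0, 7.0, 3.0, 5.0]
id = [1, 1, 2, 2, 3, 3]
year = [1, 2, 1, 2, 1, 2]
res = demean_sketch(x, [id, year])
```

In this balanced toy panel a single sweep is enough; with unbalanced groups the loop typically needs several sweeps.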

I've tried to minimize copies (using SubDataFrames) and to return residuals aligned with the original DataFrame (when some rows contain NA).

Since this is my first experience with Julia, I'd welcome any kind of feedback.

A couple of beginner questions:

- Is `copy` the best way to keep the previous result in [this iteration loop](https://github.com/matthieugomez/FixedEffects.jl/blob/master/src/fixedeffects.jl#L60)?

- What's the best way to add a subset argument to my function? I'd like this argument to let the user estimate the model and return the residuals on a subset of the DataFrame only.
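One possible shape for the subset argument, sketched on plain vectors rather than a DataFrame: a Boolean keyword that selects the rows used for estimation, with the result kept the same length as the input. The function name `demean_subset`, the keyword name `subset`, and the use of `missing` (instead of the NA of the DataFrames version discussed here) are assumptions, not the package's interface.

```julia
# Hypothetical helper: demean `x` within `groups`, using only rows where `subset` is true.
# Rows outside the subset come back as `missing`, so the output stays aligned with the input.
function demean_subset(x::Vector{Float64}, groups::Vector{Int};
                       subset::AbstractVector{Bool} = trues(length(x)))
    out = Vector{Union{Missing, Float64}}(missing, length(x))
    rows = findall(subset)
    sums = Dict{Int, Float64}()
    counts = Dict{Int, Int}()
    for i in rows
        sums[groups[i]] = get(sums, groups[i], 0.0) + x[i]
        counts[groups[i]] = get(counts, groups[i], 0) + 1
    end
    for i in rows
        out[i] = x[i] - sums[groups[i]] / counts[groups[i]]
    end
    return out
end

x = [1.0, 3.0, 10.0, 4.0, 8.0]
g = [1, 1, 1, 2, 2]
res = demean_subset(x, g; subset = [true, true, false, true, true])
```

Because the third row is excluded, it does not contribute to the mean of group 1 and its slot in `res` is `missing`.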

Patrick Kofod Mogensen

Jun 13, 2015, 2:52:46 PM

Just an FYI: when you export the demean function, it enters the caller's namespace as soon as they write using. This means that you can just write demean(...); there is no need to write Package.Function(args...).
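A small self-contained illustration of that point, using a toy module. The module name `DemoFixedEffects`, the one-line `demean`, and `helper` are made up for this example and are not the package's code.

```julia
module DemoFixedEffects      # hypothetical module name, for illustration only

export demean                # exported names are brought into scope by `using`

# Toy stand-in: subtract the overall mean from a vector.
demean(x::AbstractVector) = x .- sum(x) / length(x)

# Not exported: still reachable, but only with the module prefix.
helper(x) = 2x

end # module

using .DemoFixedEffects

a = demean([1.0, 2.0, 3.0])          # no prefix needed
b = DemoFixedEffects.helper(2)       # unexported name needs the prefix
```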

I take it you are an econometrics student of some sort; feel free to hit me up if you have any future projects you need help with. Stuff like this is nice to have if we want Julia to enter exercise classes at universities.

Good to have you on board!
Patrick

Matthieu

Jun 24, 2015, 12:25:42 PM

Thanks.

The current version of the package now estimates models with instrumental variables (2SLS), high-dimensional fixed effects, and White / clustered standard errors. This makes it possible to estimate a large share of the models used in applied economics research. Moreover, this function seems faster than the corresponding Stata and R functions (areg and lfe, respectively), in particular for models with one high-dimensional fixed effect.

Two more points make this function differ from the lm function in GLM:

1. The regression result object is very light (basically the initial formula, a vector of coefficients, and a covariance matrix). In contrast, since the output of GLM contains the original DataFrame, the converted matrix of regressors, the model response, etc., the output from GLM can actually take much more space than the initial DataFrame.

I have chosen to return a light object because it makes it possible to estimate multiple models without requiring more RAM at every step. Methods such as predict and residuals can still be defined, as long as the user provides a DataFrame.

2. The function has an argument that changes the way standard errors are computed. In R, correct errors are generally estimated in a second step, through a different package like vcov, multiwayvcov. This strikes me as inefficient and counterintuitive.

I've defined an abstract type AbstractVcov. Any user can define a new type (a child of this abstract type), as long as they define a method, vcov, that acts on a regressor matrix (X), a hat matrix (X'X in the simple case), and a vector of residuals. This seems enough to define a wide range of standard errors.

I've only defined 3 types (simple, white, clustered).

For instance, to estimate a model with White robust standard errors:

reg(formula, df, VceWhite())

To estimate a model with clustered standard errors:

reg(formula, df, VceCluster(:clustervar))
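To make the design concrete, here is a rough sketch of how a covariance type can drive dispatch and how a light result object can hold only coefficients and a covariance matrix. It works on plain matrices rather than a formula and a DataFrame, and every name in it (`AbstractVcovSketch`, `SimpleVcov`, `WhiteVcov`, `vcov_sketch`, `RegResultSketch`, `reg_sketch`) is a stand-in chosen for this example, not the package's actual code.

```julia
using LinearAlgebra

abstract type AbstractVcovSketch end          # stand-in for the AbstractVcov described above
struct SimpleVcov <: AbstractVcovSketch end
struct WhiteVcov  <: AbstractVcovSketch end

# Each method receives the regressors X, the cross-product X'X, and the residuals.
function vcov_sketch(::SimpleVcov, X::Matrix{Float64}, XtX::Matrix{Float64},
                     resid::Vector{Float64})
    n, k = size(X)
    return inv(XtX) * (sum(abs2, resid) / (n - k))
end

function vcov_sketch(::WhiteVcov, X::Matrix{Float64}, XtX::Matrix{Float64},
                     resid::Vector{Float64})
    Xu = X .* resid                      # scale each row of X by its residual
    invXtX = inv(XtX)
    return invXtX * (Xu' * Xu) * invXtX  # sandwich form, no small-sample correction
end

# Light result: coefficients and covariance only, no copy of the data.
struct RegResultSketch
    coef::Vector{Float64}
    vcov::Matrix{Float64}
end

function reg_sketch(X::Matrix{Float64}, y::Vector{Float64},
                    v::AbstractVcovSketch = SimpleVcov())
    XtX = X' * X
    coef = XtX \ (X' * y)
    resid = y - X * coef
    return RegResultSketch(coef, vcov_sketch(v, X, XtX, resid))
end

X = [ones(5) [1.0, 2.0, 3.0, 4.0, 5.0]]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
r_simple = reg_sketch(X, y)
r_white = reg_sketch(X, y, WhiteVcov())
```

A new kind of standard error then only needs a new subtype and one more `vcov_sketch` method; the estimation function itself does not change.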

Milan Bouchet-Valat

Jun 24, 2015, 1:19:18 PM

On Wednesday, June 24, 2015 at 09:25 -0700, Matthieu wrote:
> Thanks.
>
> The current version of the package now estimates models with
> instrumental variables (2SLS), high dimensional fixed effects, and
> white / clustered standard errors. This allows to estimate a large
> part of models used in applied economics research. Moreover, this
> function seems faster than Stata and R corresponding functions
> (respectively areg / lfe), in particular for models with one high
> dimensional fixed effect.

I'm not very familiar with these models, but that looks really nice.
Have you considered using the fit() function with a model type to be
more similar to GLM.jl?
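As a rough illustration of what "fit() with a model type" means, the sketch below dispatches on a type passed as the first argument. It deliberately defines its own `fit_sketch` instead of extending the StatsBase function, and `SketchModel` and `OLSSketch` are hypothetical names.

```julia
using LinearAlgebra

abstract type SketchModel end

# Hypothetical model type holding only the estimated coefficients.
struct OLSSketch <: SketchModel
    coef::Vector{Float64}
end

# The first argument is the model *type*; dispatch picks the estimation routine.
fit_sketch(::Type{OLSSketch}, X::Matrix{Float64}, y::Vector{Float64}) =
    OLSSketch(X \ y)          # least-squares solve

X = [ones(4) [0.0, 1.0, 2.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
m = fit_sketch(OLSSketch, X, y)
```

Other model types would add their own `fit_sketch` methods, so one verb covers all of them.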

> Two more points make this function differ from the lm function in
> GLM:
>
> 1. The regression result object is very light (basically the initial
> formula, a vector of coefficients, and a covariance matrix). In
> contrast, since the output of GLM contains the original dataframe,
> the converted matrix of regressors, the model response etc, the
> output from GLM can actually take much more space than the initial
> DataFrame.
> I have chosen to return a light object because it allows to estimate
> multiple models without requiring more RAM at every step. Methods
> such as predict and residual can be defined as long as the user
> provides a DataFrame

I agree that's likely a good idea. With data sources like databases, it
wouldn't make any sense to try saving all of the data with the model.
We could imagine adding an argument to keep a copy of the data, if it
turns out that's needed.

I think the only case where having the data in the model object is useful is when
calling predict(). Maybe it would be possible to save just the name of
the data frame, and use it if it's in scope?

> 2. The function has an argument that allows to change the way errors
> are computed. In R, correct errors are generally estimated in a
> second step, through a different package like vcov, multiwayvcov.
> This strikes me as inefficient and counterintuitive.
>
> I've defined an abstract type AbstractVcov. Any user can define a new
> type (child of this abstract type), as long as he/she defines a
> method, vcov, that acts on a regressor matrix (X), a hat matrix (X'X
> in the simple case), and a vector of residuals. This seems enough to
> define a wide range of standard errors.
>
> I've only defined 3 types (simple, white, clustered).
> For instance, to estimate a model with white robust standard errors
> reg(formula, df, VceWhite())
>
> To estimate a model with clustered standard errors
> reg(formula, df, VceCluster(:clustervar))

Sounds cool. I had opened an issue in GLM.jl about this:
https://github.com/JuliaStats/GLM.jl/issues/42

Do you have any ideas about how to handle bootstrap in the same
framework?

Regards

Matthieu Gomez

Jun 24, 2015, 10:58:49 PM

Thanks! I'm glad you also think standard errors should be an argument of the fit function!

I have considered using the fit function, but I don't really understand what the first argument is supposed to be: the syntax is very different between, say, GLM, MixedModels, and NLreg (https://github.com/JuliaStats/StatsBase.jl/issues/116).

Patrick Kofod Mogensen

Jun 26, 2015, 5:55:59 AM

I'll have a look at the updates later. Would you be against having the standard errors as a keyword instead, with some default (sandwich, or whatever)? The "standard" way (or at least how a lot of people seem to be doing it) is to have

reg(formula, df; se = :sandwich)

so you would run
reg(y ~ x + z, df)

for default, and

reg(y ~ x + z, df; se = :my_custom_se)

for some other standard error method. You would have to do the clustering a bit differently, but I think you get the idea. You can see Optim.jl or QuantileRegression.jl to see what I mean (they have "method" keywords).
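A minimal sketch of the keyword style, on plain matrices instead of a formula and a DataFrame. The function name `reg_kw` and the symbols `:simple` and `:white` are assumptions for this example only.

```julia
using LinearAlgebra

# Hypothetical keyword-based interface: `se` selects the standard error method.
function reg_kw(X::Matrix{Float64}, y::Vector{Float64}; se::Symbol = :simple)
    coef = X \ y                         # least-squares coefficients
    resid = y - X * coef
    invXtX = inv(X' * X)
    if se == :simple
        n, k = size(X)
        V = invXtX * (sum(abs2, resid) / (n - k))
    elseif se == :white
        Xu = X .* resid                  # scale each row of X by its residual
        V = invXtX * (Xu' * Xu) * invXtX
    else
        throw(ArgumentError("unknown se method: $se"))
    end
    return coef, sqrt.(diag(V))          # coefficients and their standard errors
end

X = [ones(5) [1.0, 2.0, 3.0, 4.0, 5.0]]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
coef_default, se_default = reg_kw(X, y)
coef_white, se_white = reg_kw(X, y; se = :white)
```

The trade-off is that a bare symbol cannot carry extra information, which is why clustering would need different handling: something like VceCluster(:clustervar) bundles the method and the cluster variable in one value, and new methods can be added without editing the if/else chain.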

Patrick Kofod Mogensen

Jun 26, 2015, 1:50:18 PM

Ah, now I see what you did (I looked in the repo): VceWhite() is a constructor. I thought it was the vcov function for White standard errors :)
