I would like to contribute: episode 2


Jean-Baptiste FONCIN

Dec 16, 2016, 4:46:20 AM
to pystatsmodels
Hello everyone,

A few weeks ago, I posted here asking for help building tests for survival models that I wish to share with the statsmodels community.

Now my tests are done, and I would like to know how to properly fork and create a new branch on GitHub, which I have never done.

First of all, here are my tests, so you can check that they are OK.

Abstract:

My model is a Weibull survival distribution with one special feature: the survival time can be infinite. This is useful for analyses such as unemployment duration, where some people will never find a job: they enter an "absorbing state".
The model accepts truncated values and is estimated by maximum likelihood with reparameterization.
The covariance matrix of the coefficients is estimated by the delta method. The optimizer is Powell because it is the only one that converges; the likelihood is rather oddly shaped.
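
For readers who have not met this kind of model, the sketch below shows one common way to write a Weibull survival function with an absorbing state, often called a mixture cure model. It is only an assumption about the general shape of the model, not the implementation discussed in this thread: the function names and the parameter p (the proportion in the absorbing state) are hypothetical, truncation is ignored, and 0 < p < 1 is assumed.

```python
import math


def survival(t, lam, k, p):
    """S(t) = p + (1 - p) * exp(-(t / lam) ** k); tends to p, not 0, as t grows."""
    return p + (1.0 - p) * math.exp(-((t / lam) ** k))


def density(t, lam, k, p):
    """Density of observed exits at t > 0: (1 - p) times the Weibull pdf."""
    z = (t / lam) ** k
    return (1.0 - p) * (k / t) * z * math.exp(-z)


def log_likelihood(durations, is_complete, lam, k, p):
    """Complete spells contribute log f(t); incomplete spells contribute log S(t)."""
    total = 0.0
    for t, done in zip(durations, is_complete):
        if done:
            total += math.log(density(t, lam, k, p))
        else:
            total += math.log(survival(t, lam, k, p))
    return total


# Two complete spells and one incomplete spell with a very long duration.
print(log_likelihood([1.0, 3.0, 10000.0], [1, 1, 0], lam=2.0, k=1.0, p=0.2))
```

With k = 1 the Weibull part reduces to an exponential with scale lam, which is the special case used in the test procedure.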

Here is the test procedure. I use an exponential distribution because it is just a special case of the Weibull:

import numpy as np
from pandas import DataFrame

# Weibull_AS_Estimator is the class I wish to contribute; it must already be defined or imported.

def test():
    # Durations drawn from an exponential, i.e. a Weibull with k = 1 and lambda = 2.
    test_sample = np.random.exponential(scale=2, size=25000)
    # The last 5000 observations (20%) stand for the absorbing state:
    # a very large duration, flagged as incomplete.
    test_sample[20000:] = 10000.0
    test_is_complete = np.ones(shape=25000, dtype=float)
    test_is_complete[20000:] = 0.0
    test_df = DataFrame({"duration": test_sample, "is_complete": test_is_complete})
    test_reg = Weibull_AS_Estimator("duration", "is_complete", test_df)
    test_reg.fit(tol=0.0)
    print("estimated lambda is " + str(test_reg.estimated_coeff["lambda"]))
    print("lambda used for simulation is 2.0")
    print("estimated k is " + str(test_reg.estimated_coeff["k"]))
    print("k used for simulation is 1")
    print("estimated proportion in absorbing state is " + str(test_reg.estimated_coeff["AS prop"]))
    print("proportion in absorbing state used for simulation is 0.2")
    test_reg.plot_survival()

And here is what the tests return:

[Inline image 1: output of the test, not preserved]

Does this look correct to you? If you like this model, please tell me: I can implement many other survival analysis tools.

Paul Hobson

Dec 16, 2016, 11:25:36 AM
to pystat...@googlegroups.com
Jean-Baptiste,

Statsmodels follows a git and GitHub workflow similar to that of the matplotlib project, whose docs do a decent job of going through all of the steps:


-Paul

josef...@gmail.com

Dec 16, 2016, 11:52:26 AM
to pystatsmodels
On Fri, Dec 16, 2016 at 11:25 AM, Paul Hobson <pmho...@gmail.com> wrote:
> Jean-Baptiste,
>
> Statsmodels follows a git and GitHub workflow similar to that of the matplotlib project, whose docs do a decent job of going through all of the steps:


Thanks, Paul.

Here is our version: http://www.statsmodels.org/stable/dev/git_notes.html#working-with-the-statsmodels-code, which hasn't been updated in a while but should still apply.

Aside: except for basic setup and complicated git commands, I use git-gui, which requires less memorization and provides a nice view for checking the changes that are about to be committed.

Once a PR has been opened, the unit tests run automatically after each push.
The unit tests need to use the assert_xxx functions from numpy.testing. We don't use print in the regular code or in unit tests.
Preferably, the data sets and the computation time should be as small as possible given what they are testing.
Also, we compare the numbers with other stats packages where possible. Large-sample or Monte Carlo tests can be OK, but we need to be careful about both computation time and memory consumption.
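
As a rough illustration of these points, a small, seeded test using assert_allclose from numpy.testing might look like the sketch below. The estimator is a hypothetical stand-in (the closed-form maximum likelihood estimate of an exponential scale under right censoring), not the Weibull model from this thread, and the seed, sample size, censoring point and tolerance are arbitrary choices.

```python
import numpy as np
from numpy.testing import assert_allclose


def fit_exponential_scale(durations, is_complete):
    """Stand-in estimator: MLE of the exponential scale under right censoring."""
    return durations.sum() / is_complete.sum()


def test_exponential_scale():
    # A fixed seed makes the test deterministic, so a moderate sample is enough.
    rng = np.random.RandomState(12345)
    durations = rng.exponential(scale=2.0, size=2000)
    is_complete = np.ones(2000)

    # Right-censor every spell longer than 5.0.
    censored = durations > 5.0
    durations[censored] = 5.0
    is_complete[censored] = 0.0

    estimate = fit_exponential_scale(durations, is_complete)
    # Compare with the value used for the simulation instead of printing it.
    assert_allclose(estimate, 2.0, rtol=0.1)


test_exponential_scale()
```

Seeding the generator means a test cannot fail on some runs just because a draw landed in the tails, which allows a much smaller sample than an unseeded comparison.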

Josef

Jean-Baptiste FONCIN

Dec 17, 2016, 5:03:49 AM
to pystatsmodels
Thank you very much for your answers and advice.

I will use asserts and avoid printing. As for the sample size, I use a large one because the comparison could sometimes fail just because the sample falls in the tails of the distribution. As for computation time, it is quite reasonable: around five seconds, and my computer is old.

That is because the function is written in Cython. As for comparison, I don't actually know of any open source package that implements this model.

Jean-Baptiste

josef...@gmail.com

Dec 17, 2016, 9:05:25 AM
to pystatsmodels
On Sat, Dec 17, 2016 at 5:03 AM, Jean-Baptiste FONCIN <jeanbapti...@gmail.com> wrote:
> Thank you very much for your answers and advice.
>
> I will use asserts and avoid printing. As for the sample size, I use a large one because the comparison could sometimes fail just because the sample falls in the tails of the distribution. As for computation time, it is quite reasonable: around five seconds, and my computer is old.
>
> That is because the function is written in Cython. As for comparison, I don't actually know of any open source package that implements this model.

Once you have opened a PR or put the code somewhere publicly visible, I can start looking at it.

I don't know how complex your models are, but Stata has parametric survival/duration models in proportional hazards form, accelerated failure time form, or both. R also has them, but I have never looked at those closely.

Whether a package is open source is not relevant for the unit tests; we just need to be able to get the results for an example. And being open source does not really help when we cannot look at or translate the code because its license is not compatible.

Josef

Jean-Baptiste FONCIN

Dec 17, 2016, 11:56:42 AM
to pystatsmodels
Josef,

I tried to follow every step in your link, but I ran into administration problems using git from the shell, probably related to my Linux distribution, which is driving me crazy. I am about 200 lines of shell scripting into solving the problem and I need a break.

My contribution is now in the pull requests, but I don't really know how I did it or whether everything is correct...

I feel totally dumb with git. I will ask friends to teach me how it works, because it is very frustrating.

josef...@gmail.com

Dec 17, 2016, 1:01:54 PM
to pystatsmodels
On Sat, Dec 17, 2016 at 11:56 AM, Jean-Baptiste FONCIN <jeanbapti...@gmail.com> wrote:
> Josef,
>
> I tried to follow every step in your link, but I ran into administration problems using git from the shell, probably related to my Linux distribution, which is driving me crazy. I am about 200 lines of shell scripting into solving the problem and I need a break.
>
> My contribution is now in the pull requests, but I don't really know how I did it or whether everything is correct...

You created the PR from your master branch.

It would be better to work from your feature branch.
There is a "New pull request" button next to the branch name; the branch seems to have the same content as your master PR.

After hitting the "New pull request" button, you can verify that the source and target branches and the changes are OK.

You could try that and then keep your master as a copy of statsmodels/master.
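
A minimal sketch of what these steps could look like on the command line, written as a Python dry run so that nothing is executed by default. It assumes remotes named "upstream" (statsmodels/statsmodels) and "origin" (the personal fork); the function names and the branch name are hypothetical.

```python
import subprocess


def move_pr_to_feature_branch(branch, upstream="upstream", fork="origin"):
    """Return the git commands, as argument lists, for the steps described above."""
    return [
        # Create the feature branch at the current tip of master, keeping the commits.
        ["git", "branch", branch, "master"],
        # Publish it on the fork so that a pull request can be opened from it.
        ["git", "push", "--set-upstream", fork, branch],
        # Make local master a plain copy of the upstream master again.
        ["git", "fetch", upstream],
        ["git", "checkout", "master"],
        # Destructive: only safe once the commits live on the feature branch.
        ["git", "reset", "--hard", upstream + "/master"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run(move_pr_to_feature_branch("weibull-absorbing-state"))
```

The dry run only prints the commands, so they can be read and checked before anything is run for real inside a clone.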

> I feel totally dumb with git. I will ask friends to teach me how it works, because it is very frustrating.

Unfortunately, that's the usual experience and feeling when starting with git.
The basic commands for making commits and the like are relatively easy to pick up.
Rebasing and resolving merge conflicts are a bit trickier.
And then there are many features in git that we had better not bother to figure out, or only when we cannot avoid it.

One recommendation: if you are not sure what to do, create a temporary local copy of a branch to try things out. I did that for several years for everything that wasn't routine for me.
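
One possible way to make such a temporary copy, again as a Python dry run that only prints the git commands. The function name and the branch names are hypothetical.

```python
def scratch_branch_commands(branch, scratch="tmp-experiment"):
    """Return (before, after): commands to create and later discard a scratch copy."""
    before = [
        # Create the scratch branch at the tip of `branch` and switch to it.
        ["git", "checkout", "-b", scratch, branch],
    ]
    after = [
        # Go back to the real branch, which was never touched.
        ["git", "checkout", branch],
        # Throw the experiment away, even if it holds unmerged commits.
        ["git", "branch", "-D", scratch],
    ]
    return before, after


before, after = scratch_branch_commands("weibull-absorbing-state")
for cmd in before + after:
    print(" ".join(cmd))
```

Anything tried between the two steps (a rebase, a merge, a reset) only affects the scratch branch, so the original branch stays as it was.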

Josef

Jean-Baptiste FONCIN

Dec 20, 2016, 10:04:55 AM
to pystatsmodels
Thank you again for your understanding and advice. Even if Git is hard to understand, I will spend as much time as needed to fully understand how it works. This tool seems to be necessary for building large projects...