Distributions/Rmath, DataFrames/(NullableArrays,CategoricalArrays)

Douglas Bates

Jul 18, 2016, 1:38:21 PM
to julia-stats
I know people are working hard on reconciling changes w.r.t. Rmath split off as a package and the Distributions package, and converting DataFrames to use CategoricalArrays and NullableArrays.  I don't want to cause anyone even more work to respond to repeated questions but it would be handy to have an ETA on when DataFrames and Distributions will be able to pass Pkg.test on version 0.5 and/or v0.4.

Also, are there things others can do to help - reviewing and revising documentation, reviewing and revising tests, etc.?

Andreas Noack

Jul 18, 2016, 1:46:06 PM
to julia...@googlegroups.com
Distributions should work as of this morning. I think the only thing left for DataFrames is to tag a version for Distributions so we should be pretty close as that could be done now.

Douglas Bates

Jul 18, 2016, 1:54:01 PM
to julia-stats
On Monday, July 18, 2016 at 12:46:06 PM UTC-5, Andreas Noack wrote:
> Distributions should work as of this morning. I think the only thing left for DataFrames is to tag a version for Distributions so we should be pretty close as that could be done now.

In which version of Julia should Distributions be expected to work? I just did a Pkg.update() on version 0.5-dev and Pkg.test("Distributions") threw a lot of errors.

julia> versioninfo()
Julia Version 0.5.0-dev+5484
Commit eec64e1* (2016-07-18 14:47 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)

julia> Pkg.status("Distributions")
 - Distributions                 0.9.0

julia> Pkg.status("Rmath")

julia> Pkg.status("StatsFuns")
 - StatsFuns                     0.2.2



David Anthoff

Jul 18, 2016, 1:57:15 PM
to julia...@googlegroups.com

But note that everything is broken on Windows until Rmath.jl works on Windows. Rmath was removed from Base last week without Rmath.jl being ready on Windows…

Andreas Noack

Jul 18, 2016, 2:16:53 PM
to julia...@googlegroups.com
I haven't tagged Distributions yet, so you'll have to check out master. I've just run the tests locally without errors.
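
For anyone following along, checking out master with the package manager of that era could look roughly like the sketch below. It assumes the built-in `Pkg` of Julia 0.4/0.5; later Julia releases replaced these calls, so treat it as illustrative only.

```julia
# Julia 0.4/0.5-era package manager (these calls do not exist in current Julia).
Pkg.update()                      # refresh METADATA and the tagged versions
Pkg.checkout("Distributions")     # track the master branch instead of the latest tag
Pkg.test("Distributions")         # run the package's test suite against master

# Once a fixed version has been tagged, return to tagged releases:
Pkg.free("Distributions")
```

Running `Pkg.status("Distributions")` afterwards is a quick way to confirm whether the package is on a branch or on a tagged version.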

David Anthoff

Jul 18, 2016, 2:20:51 PM
to julia...@googlegroups.com

You mean it should work on Win? How so? Distributions depends on StatsFuns, StatsFuns depends on Rmath, and Rmath doesn’t work on Windows. So nothing precompiles and nothing can be used on Windows, right?

Douglas Bates

Jul 18, 2016, 2:25:39 PM
to julia-stats
On Monday, July 18, 2016 at 1:20:51 PM UTC-5, David Anthoff wrote:
> You mean it should work on Win? How so? Distributions depends on StatsFuns, StatsFuns depends on Rmath, and Rmath doesn’t work on Windows. So nothing precompiles and nothing can be used on Windows, right?


I think Andreas was replying to my question about the version on non-Windows systems. 

Andreas Noack

Jul 18, 2016, 2:25:58 PM
to julia...@googlegroups.com
Sorry for not being clear. That was to Doug so it's only about Mac and Linux. I'm not up to date with the Windows issues.

Andreas Noack

Jul 18, 2016, 2:27:35 PM
to julia...@googlegroups.com
DataFrames should as of now also work on 0.5 (on Mac and Linux)

Douglas Bates

Jul 18, 2016, 2:33:49 PM
to julia-stats
On Linux, Pkg.test("Distributions") fails at

        From worker 2:      testing Weibull(20.0, 1.0)
        From worker 2:      testing Weibull(1.0, 2.0)
        From worker 2:      testing Weibull(5.0, 2.0)
        From worker 2:
ERROR: LoadError: On worker 3:
LoadError: assertion failed: |cov(Xs,2) - cov(g)| <= 0.01
  cov(Xs,2) = [1.55864 0.529817; 0.529817 4.52288]
  cov(g) = [1.5605 0.53; 0.53 4.508]
  difference = 0.014877712873417437 > 0.01
 in test_approx_eq at ./test.jl:843
 in test_mixture at /home/bates/.julia/v0.5/Distributions/test/mixture.jl:121
 in include_string at ./loading.jl:380
 in include_from_node1 at ./loading.jl:429
 in #5 at /home/bates/.julia/v0.5/Distributions/test/runtests.jl:44
 in #501 at ./multi.jl:1193
 in run_work_thunk at ./multi.jl:844
 in macro expansion at ./multi.jl:1193 [inlined]


The problem seems to be reproducible.  That is, it doesn't look like it is caused by an unfortunate set of values from an RNG.

Andreas Noack

Jul 18, 2016, 2:36:14 PM
to julia...@googlegroups.com
How many workers do you have?

Douglas Bates

Jul 18, 2016, 2:37:12 PM
to julia-stats
Problem goes away when updating to latest master.  Thanks Andreas.

Andreas Noack

Jul 19, 2016, 10:04:05 AM
to julia...@googlegroups.com
David, I just tried on Windows and it seems that we are also all set there now thanks to Tony.

David Anthoff

Jul 19, 2016, 10:52:26 AM
to julia...@googlegroups.com

Yep, Tony saved the world (for Windows users, at least)! Thanks, David

Douglas Bates

Jul 19, 2016, 12:23:05 PM
to julia-stats
Yes, thanks to Tony, Andreas, Milan and others who worked on this.

At the risk of making myself unpopular, I would like to return to the issue of ModelFrame, ModelMatrix, etc., because a lot of code is still broken for me. At present `DataFrames/REQUIRE` lists `DataArrays 0.3.4` but neither `NullableArrays` nor `CategoricalArrays`. Contrasts are defined in `DataFrames/src/statsmodels/formula.jl`, but we would need to require `CategoricalArrays` if contrasts for that type were to be defined there. To me it would make more sense to define the contrasts where the array types are defined.

I can add `CategoricalArrays` to `DataFrames/REQUIRE` to get ModelMatrix working again but that might have a knock-on effect for many packages that require `DataFrames`.

Although I'd really like to get ModelMatrix working again, I don't want to make changes like DataFrames requiring CategoricalArrays that later need to be backed out.

Andreas Noack

Jul 19, 2016, 1:18:17 PM
to julia...@googlegroups.com
It seems to me that contrasts should be defined in the array packages and not in DataFrames. We'd probably need the functions to be defined in an upstream package like StatsBase (or ArrayBase/DataBase?) so that all array packages can extend them.

We have the usual problem of optional dependencies. Should DataFrames depend on any data array package or all of them? Is it possible that DataFrames doesn't use any features of concrete data array types and only defines methods for abstract types? Then the user would have to load a specific array package. This might be a bit demanding to keep working, and from a user perspective a single good implementation might be better.
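
One way to picture this arrangement, as a minimal sketch rather than anything proposed in the thread: an upstream package owns an empty generic function, an array package adds a method for its own type, and the frame package only ever calls the generic. All module, type, and function names below (`ToyContrastsBase`, `ToyCategorical`, `ToyFrames`, `contrastmatrix`, `modelcolumns`) are hypothetical, and the code uses current Julia syntax (`struct`) rather than the 0.5-era `immutable`.

```julia
# Hypothetical upstream package: owns the generic function, defines no methods.
module ToyContrastsBase
export contrastmatrix
function contrastmatrix end
end

# Hypothetical array package: defines its own type and extends the generic.
module ToyCategorical
import ..ToyContrastsBase: contrastmatrix

struct ToyCategoricalVector
    levels::Vector{String}   # distinct level labels
    refs::Vector{Int}        # index into `levels` for each observation
end

# Treatment (dummy) coding with the first level as the reference level.
function contrastmatrix(v::ToyCategoricalVector)
    m = zeros(length(v.refs), length(v.levels) - 1)
    for (i, r) in enumerate(v.refs)
        if r > 1
            m[i, r - 1] = 1.0
        end
    end
    return m
end
end

# Hypothetical frame package: calls the generic and never names a concrete array type.
module ToyFrames
import ..ToyContrastsBase: contrastmatrix
# Intercept column followed by the contrast columns of one categorical column.
modelcolumns(col) = hcat(ones(size(contrastmatrix(col), 1)), contrastmatrix(col))
end

v = ToyCategorical.ToyCategoricalVector(["a", "b", "c"], [1, 2, 3, 2])
X = ToyFrames.modelcolumns(v)   # 4x3 matrix: intercept plus two indicator columns
```

The cost mentioned above shows up here too: `ToyFrames` does nothing useful until some package that implements `contrastmatrix` has been loaded.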

What are the specific issues you are having right now? Are the things that are broken things that used to work, or is this work in progress towards using Nullable and Categorical arrays?

Douglas Bates

Jul 19, 2016, 1:45:59 PM
to julia-stats
Could someone who is able to use git tell me how I should rebase the db/modelmatrix branch to incorporate the changes in master from the last commit? According to the manual it is exceedingly simple: I just check out the db/modelmatrix branch and run

bates@thin206:~/.julia/v0.5/DataFrames⟫ git rebase master
First, rewinding head to replay your work on top of it...
Applying: Refactor ModelMatrix - still pending contrast changes
Using index info to reconstruct a base tree...
M       src/statsmodels/formula.jl
Falling back to patching base and 3-way merge...
Auto-merging src/statsmodels/formula.jl
CONFLICT (content): Merge conflict in src/statsmodels/formula.jl
error: Failed to merge in the changes.
Patch failed at 0001 Refactor ModelMatrix - still pending contrast changes
The copy of the patch that failed is found in: .git/rebase-apply/patch

When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".


Other than losing all the changes that I made and leaving me in a state that, in my experience, it is impossible to recover from, that worked well. Is there any hope of my being able to recover?

Andreas Noack

Jul 19, 2016, 1:56:18 PM
to julia...@googlegroups.com
I've pushed a rebased version to anj/modelmatrix.

What I did was to edit formula.jl and then run

git add src/statsmodels/formula.jl
git rebase --continue

Douglas Bates

Jul 19, 2016, 1:58:04 PM
to julia-stats

On Tuesday, July 19, 2016 at 12:18:17 PM UTC-5, Andreas Noack wrote:
> It seems to me that contrasts should be defined in the array packages and not in DataFrames. We'd probably need the functions to be defined in an upstream package like StatsBase (or ArrayBase/DataBase?) so that all array packages can extend them.

That's the approach that makes the most sense to me too.  Right now CategoricalArrays only requires Compat and it does not seem that Milan is available to make changes in it.

> We have the usual problem of optional dependencies. Should DataFrames depend on any data array package or all of them? Is it possible that DataFrames doesn't use any features of concrete data array types and only defines methods for abstract types? Then the user would have to load a specific array package. This might be a bit demanding to keep working, and from a user perspective a single good implementation might be better.

> What are the specific issues you are having right now? Are the things that are broken things that used to work, or is this work in progress towards using Nullable and Categorical arrays?

I was trying to use CategoricalArrays and failing.  This only affects PooledDataArrays and CategoricalArrays but there are other aspects like the termnames methods, whose generic is currently defined in DataFrames, but is linked to the contrasts.

Ultimately if PooledDataArray is replaced by CategoricalArray then these generics can all go into CategoricalArrays.  It would be necessary to have DataFrames require CategoricalArrays but I suspect that would happen anyway.

In a way I would like to split the Formula/Terms/ModelFrame/ModelMatrix material into a separate package but that package would need to depend on DataFrames so it wouldn't buy us much.

Douglas Bates

Jul 19, 2016, 1:58:49 PM
to julia-stats
Sorry, I meant to discard this rant.  The issue has been, to some extent, resolved.

Milan Bouchet-Valat

Jul 20, 2016, 6:07:19 PM
to julia...@googlegroups.com
The CategoricalArrays port won't be ready in time for the Julia 0.5
release, so we need to get a DataFrames version based on DataArrays to
work anyway. Do whatever improvements you think are needed, and then
I'll port them to CategoricalArrays in a later step.

Regarding dependencies, I agree we should move the model frame/matrix
methods to StatsBase (or a standalone package), and import it from
DataFrames to define actual methods. That will allow packages to
support the formula interface without adding a dependency on
DataFrames, and will allow experimenting with other implementations
like TypedTables.
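
To make the dependency argument concrete, here is a small sketch under stated assumptions: a lightweight interface package declares `modelmatrix`, two unrelated table implementations add methods to it, and a modeling package fits by least squares without importing either table package. Every name here (`ToyModelInterface`, `ToyColumnTable`, `ToyRowTable`, `ToyModels`, `lsfit`) is hypothetical and stands in for the real packages discussed above; none of this is their actual API.

```julia
# Hypothetical stand-in for the interface package (StatsBase or a standalone package).
module ToyModelInterface
export modelmatrix
function modelmatrix end   # generic function with no methods of its own
end

# Hypothetical column-oriented table, standing in for a DataFrames-like type.
module ToyColumnTable
import ..ToyModelInterface: modelmatrix
struct ColumnTable
    columns::Dict{Symbol,Vector{Float64}}
end
# Intercept column followed by the requested predictor columns.
function modelmatrix(t::ColumnTable, predictors::Vector{Symbol})
    n = length(t.columns[first(predictors)])
    return hcat(ones(n), (t.columns[p] for p in predictors)...)
end
end

# Hypothetical row-oriented table, standing in for a TypedTables-like type.
module ToyRowTable
import ..ToyModelInterface: modelmatrix
struct RowTable{T<:NamedTuple}
    rows::Vector{T}
end
# Column 0 of the comprehension is the intercept; the rest are predictors.
function modelmatrix(t::RowTable, predictors::Vector{Symbol})
    return [j == 0 ? 1.0 : Float64(getproperty(row, predictors[j]))
            for row in t.rows, j in 0:length(predictors)]
end
end

# Hypothetical modeling package: depends on the interface only.
module ToyModels
using LinearAlgebra
import ..ToyModelInterface: modelmatrix
# Ordinary least squares via the backslash solver.
lsfit(table, predictors, y) = modelmatrix(table, predictors) \ y
end

x = [0.0, 1.0, 2.0, 3.0]
y = 1.0 .+ 2.0 .* x                                   # exact line: intercept 1, slope 2
cols = ToyColumnTable.ColumnTable(Dict(:x => x))
rows = ToyRowTable.RowTable([(x = xi,) for xi in x])
beta_cols = ToyModels.lsfit(cols, [:x], y)
beta_rows = ToyModels.lsfit(rows, [:x], y)
```

Both calls should recover an intercept near 1 and a slope near 2, since the data lie exactly on that line; the point is only that `ToyModels` never mentions either table type.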



Regards