Performance and Size problem with JLD saving DecisionTree model

Ian Watson

Jan 21, 2016, 3:57:08 PM
to julia-users
I'm using DecisionTree to build a random forest model. The dataset is small: 200 items with 664 predictors each, and the input file is under 1 MB.

I can build a random forest model with 1000 trees in about 8 seconds - great.

@time model=build_forest(yvalues[:,1],features,2,1000,0.5)

Then I tried to save that model for subsequent scoring by writing it to a JLD file.

Writing to an NFS-mounted disk took several minutes and produced a 194 MB (!!) file.

If I write it to /dev/shm instead, it still takes 51 seconds (and the file is still 194 MB):

@time save("/dev/shm/foo.jld","model",model)
 51.406531 seconds (12.01 M allocations: 465.667 MB, 0.38% gc time)

When I do something comparable in R with the same dataset (build the model, then use save() to save the model and the features), the whole process takes about 14 seconds and the result is 2.8 MB on disk. The save() step itself is very fast.

whos() shows

                         model   6884 KB     DecisionTree.Ensemble

so if this is a good estimate of memory, I don't think the problem is with the DecisionTree object.

Am I doing something wrong, or is JLD doing something horrible?

I saw https://github.com/JuliaLang/julia/issues/7893, so perhaps the problems still persist?


Tim Holy

Jan 21, 2016, 6:08:43 PM
to julia...@googlegroups.com
Depends on the internal storage of the random forest model. You might need to
create a custom serializer:
https://github.com/JuliaLang/JLD.jl/blob/master/doc/jld.md#custom-serialization

--Tim
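
A rough sketch of what such a custom serializer could look like, using the JLD.writeas / JLD.readas hooks described in the linked JLD documentation. The types ToyTree, ToyLeaf, ToyNode, ToyForest and FlatForest are hypothetical stand-ins invented for illustration; they are not DecisionTree's actual definitions, and the thread does not establish that node-by-node storage is what makes the file large. The idea is only that a forest of many small linked objects is converted into a few flat arrays before JLD writes it, and rebuilt after reading. The syntax assumes a recent Julia (on Julia 0.4, `struct` would be spelled `immutable` and `abstract type X end` would be `abstract X`).

```julia
using JLD

# Hypothetical toy types, NOT DecisionTree's real ones.
abstract type ToyTree end

struct ToyLeaf <: ToyTree
    label::Float64
end

struct ToyNode <: ToyTree
    featid::Int          # feature index used for the split (always >= 1)
    threshold::Float64
    left::ToyTree
    right::ToyTree
end

struct ToyForest
    trees::Vector{ToyTree}
end

# Flat on-disk form: every node of every tree in preorder.
# featid == 0 marks a leaf, and `value` then holds the leaf label;
# otherwise `value` holds the split threshold.
struct FlatForest
    ntrees::Int
    featid::Vector{Int}
    value::Vector{Float64}
end

function flatten!(featid, value, t::ToyLeaf)
    push!(featid, 0)
    push!(value, t.label)
end

function flatten!(featid, value, t::ToyNode)
    push!(featid, t.featid)
    push!(value, t.threshold)
    flatten!(featid, value, t.left)
    flatten!(featid, value, t.right)
end

function flatten(forest::ToyForest)
    featid = Int[]
    value = Float64[]
    for tree in forest.trees
        flatten!(featid, value, tree)
    end
    return FlatForest(length(forest.trees), featid, value)
end

# Rebuild the subtree starting at entry i.
# Returns (subtree, index of the next unread entry).
function rebuild(flat::FlatForest, i::Int)
    if flat.featid[i] == 0
        return ToyLeaf(flat.value[i]), i + 1
    end
    left, j = rebuild(flat, i + 1)
    right, k = rebuild(flat, j)
    return ToyNode(flat.featid[i], flat.value[i], left, right), k
end

function unflatten(flat::FlatForest)
    trees = ToyTree[]
    i = 1
    for _ in 1:flat.ntrees
        tree, i = rebuild(flat, i)
        push!(trees, tree)
    end
    return ToyForest(trees)
end

# The JLD hooks: what to write instead of a ToyForest,
# and how to turn the stored form back into one after reading.
JLD.writeas(forest::ToyForest) = flatten(forest)
JLD.readas(flat::FlatForest) = unflatten(flat)

forest = ToyForest(ToyTree[ToyNode(3, 0.5, ToyLeaf(1.0), ToyLeaf(2.0)), ToyLeaf(0.0)])
path = tempname() * ".jld"
save(path, "forest", forest)
restored = load(path, "forest")
```

For the real model, the same pattern would need flatten/rebuild functions written against whatever fields DecisionTree's ensemble, node and leaf types actually have, so it is worth inspecting those types before adapting this.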