Extract predictions of random forest

Nima Mehrafshan

Apr 3, 2014, 4:04:35 PM
to jp...@googlegroups.com
Hi,

I'm just getting familiar with PMML and the possibilities for predicting with random forest models estimated in R. The prediction of a random forest is an average of the predictions of an ensemble of decision trees. Now, I'm wondering if it is possible to extract the individual predictions of the trees of a random forest model via jpmml/openscoring?

I would like to do that to calculate a prediction interval using the distribution of predictions of the forest.

Thanks for any advice!

Nima

Villu Ruusmann

Apr 4, 2014, 3:52:30 AM
to jp...@googlegroups.com
Hi Nima,

>
> I'm just getting familiar with PMML and the possibilities for predicting
> with random forest models estimated in R. The prediction of a random forest
> is an average of the predictions of an ensemble of decision trees. Now, I'm
> wondering if it is possible to extract the individual predictions of the
> trees of a random forest model via jpmml/openscoring?
>

Yes, it is possible to retrieve the predicted values of individual
decision trees in the random forest model. You can do so by tweaking
your PMML file. There is absolutely no need to "customize" the source
code of JPMML-Evaluator or any other library.

What you are looking for is the "segmentId" attribute of the
OutputField element (see http://www.dmg.org/v4-2/Output.html).
Basically, in your PMML file, you need to define the Output element
and add an OutputField for every segment in your segmentation model.
An OutputField element should specify three attributes:
1) segmentId - References the decision tree by the id attribute of its
container segment element.
2) feature - Fixed as "predictedValue". If you need any other data
(e.g. associated probabilities) then check the PMML specification for
more constant values.
3) name - The name by which you will identify this OutputField in
scoring results.

A random forest model exported from R contains 500 decision trees by
default. So, you need to modify its Output element to contain 500
additional OutputField elements like this:
<Output>
<OutputField segmentId="1" name="dt_1" feature="predictedValue"/>
<OutputField segmentId="2" name="dt_2" feature="predictedValue"/>
<!-- OutputFields 3 .. 498 omitted for brevity -->
<OutputField segmentId="499" name="dt_499" feature="predictedValue"/>
<OutputField segmentId="500" name="dt_500" feature="predictedValue"/>
</Output>

You can code a small PMML enhancer application using the JPMML-Model
library to automate this task.
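
For illustration, here is a minimal sketch of such an enhancer. It uses
Python's standard xml.etree.ElementTree module as a stand-in, not the
JPMML-Model library mentioned above. The helper names (q,
add_segment_outputs), the "dt_" prefix, the PMML 4.2 namespace and the
skeleton SAMPLE document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the PMML 4.2 namespace; adjust for other versions.
PMML_NS = "http://www.dmg.org/PMML-4_2"


def q(tag):
    """Qualify a tag name with the PMML namespace."""
    return "{%s}%s" % (PMML_NS, tag)


def add_segment_outputs(mining_model, prefix="dt_"):
    """Append one OutputField per Segment and return the new field names."""
    output = mining_model.find(q("Output"))
    if output is None:
        # The PMML schema places Output right after the (required) MiningSchema.
        schema = mining_model.find(q("MiningSchema"))
        position = list(mining_model).index(schema) + 1
        output = ET.Element(q("Output"))
        mining_model.insert(position, output)
    names = []
    segments = mining_model.findall(q("Segmentation") + "/" + q("Segment"))
    for position, segment in enumerate(segments, start=1):
        # Assumption: fall back to the 1-based position if a Segment has no id.
        segment_id = segment.get("id", str(position))
        name = prefix + segment_id
        ET.SubElement(output, q("OutputField"), {
            "segmentId": segment_id,
            "name": name,
            "feature": "predictedValue",
        })
        names.append(name)
    return names


# Skeleton document for demonstration only; it is not a complete PMML model.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <MiningModel functionName="regression">
    <MiningSchema/>
    <Segmentation multipleModelMethod="average">
      <Segment id="1"><True/><TreeModel/></Segment>
      <Segment id="2"><True/><TreeModel/></Segment>
      <Segment id="3"><True/><TreeModel/></Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""

ET.register_namespace("", PMML_NS)  # serialize without an "ns0:" prefix
root = ET.fromstring(SAMPLE)
add_segment_outputs(root.find(q("MiningModel")))
print(ET.tostring(root, encoding="unicode"))
```

For a real model, the document would be loaded with ET.parse and written
back with ElementTree.write instead of using the embedded sample.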

Now, when you deploy this modified PMML file using the openscoring REST
service, you will see 500 additional output field values in the
prediction result. It shouldn't degrade the overall performance of the
service, because JPMML-Evaluator has all this information ready
anyway, and does not need to perform any extra computations.

Please note that support for the "segmentId" attribute was
implemented in JPMML-Evaluator 1.1.0. You won't find it in the 1.0.X
development branch.

> I would like to do that to calculate a prediction interval using the
> distribution of predictions of the forest.
>

If you're very much into PMML hacking then you could define another
two OutputFields at the bottom of your Output element, and use the
built-in functions "min" and "max" to sort out the extreme values (see
http://www.dmg.org/v4-2/BuiltinFunctions.html#min). Something along
those lines:

<OutputField name="min_dt" feature="transformedValue">
<Apply function="min">
<FieldRef field="dt_1"/>
<FieldRef field="dt_2"/>
...
<FieldRef field="dt_499"/>
<FieldRef field="dt_500"/>
</Apply>
</OutputField>

This way your consumer application can look directly for the "min_dt"
and "max_dt" output fields and does not need to implement any
aggregation functionality.
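
Writing 500 FieldRef elements by hand is tedious, so the two aggregate
fields could be generated. The following is a minimal sketch using
Python's standard xml.etree.ElementTree module. The helper name
aggregate_field is made up for this example, and the field names follow
the dt_1 .. dt_500 naming used above. The elements are built without a
namespace, on the assumption that their serialized form is pasted into
the Output element of the PMML file.

```python
import xml.etree.ElementTree as ET


def aggregate_field(name, function, field_names):
    """Build an OutputField that applies a built-in function to other fields."""
    field = ET.Element("OutputField",
                       {"name": name, "feature": "transformedValue"})
    apply_element = ET.SubElement(field, "Apply", {"function": function})
    for field_name in field_names:
        ET.SubElement(apply_element, "FieldRef", {"field": field_name})
    return field


# Names of the per-tree output fields, dt_1 .. dt_500.
tree_fields = ["dt_%d" % i for i in range(1, 501)]

min_field = aggregate_field("min_dt", "min", tree_fields)
max_field = aggregate_field("max_dt", "max", tree_fields)

# Serialized markup, ready to be placed at the bottom of the Output element.
min_xml = ET.tostring(min_field, encoding="unicode")
max_xml = ET.tostring(max_field, encoding="unicode")
```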

I have long thought about starting a blog about PMML use cases. I'll
see how things go, but maybe you just gave me an idea for the first
blog post :-)


VR

Nima Mehrafshan

Apr 4, 2014, 5:57:29 AM
to jp...@googlegroups.com
Hi Villu,

Thanks for the quick reply!!

As you point out, the best approach would be to generate the desired statistics within the PMML. From the distribution of RF predictions I would like to calculate the mean (default output), the standard deviation, and the 5% and 95% percentiles.

However, I'm wondering if it is possible to do so without having to specify a field for each individual tree score. I spent some time reading the PMML specification and found that for multiple models (i.e., ensemble models such as the RF) there is an attribute called multipleModelMethod (default "average") that may be set to "selectAll" (http://www.dmg.org/v4-2/MultipleModels.html). If I get it right, this makes the standard output field carry all predictions (?).

Now, how does the output element have to be specified to get the statistics mentioned above? There is an aggregate element described in the transformations section of the documentation (http://www.dmg.org/v4-2/Transformations.html), but I couldn't figure out how to calculate the statistics other than the mean:

<Output>
  <OutputField name="Mean" optype="continuous" dataType="string" targetField="response" feature="transformedValue">
    <Aggregate field="Predicted_DV" function="mean"/>
  </OutputField>

</Output>

Is this even correct? Do you have a hint on how to calculate standard deviation, and quantiles?

I think a blog about PMML would definitely be appreciated by an increasing number of people, since the use of PMML seems to be taking off!


Thanks again,
Nima

Villu Ruusmann

Apr 4, 2014, 7:14:15 AM
to jp...@googlegroups.com
Hi Nima,

>
> However, I'm wondering if it is possible to do so without having to specify
> a field for each individual tree score. I spent some time reading the PMML
> and found that for multiple models (i.e., ensemble models such as the RF)
> there is an attribute called multipleModelMethod (default "average") that
> may be set to "selectAll" (http://www.dmg.org/v4-2/MultipleModels.html). If
> I get it right, this makes the standard output field carry all
> predictions (?).
>

That's a terrific idea! You seem to know the PMML specification extremely well.

Indeed, if you are looking to do the post-processing yourself then it
makes sense to use multipleModelMethod "selectAll" instead of
"average". This feature is not advertised much, because unlike all
other functions it returns a collection-valued result, not a
single-valued result. For example, in Java terminology,
"selectAll" gives you java.util.List<Double>, whereas "average" gives
you Double.
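
The difference in result shape can be shown with a toy sketch. This is
not JPMML-Evaluator code; the function name combine and the sample
scores are made up for illustration, and only the two methods discussed
here are modelled.

```python
from statistics import mean


def combine(tree_scores, method):
    """Toy model of how the two multipleModelMethod values shape the result."""
    if method == "average":
        # Single-valued result: one number for the whole ensemble.
        return mean(tree_scores)
    if method == "selectAll":
        # Collection-valued result: every member score is passed through.
        return list(tree_scores)
    raise ValueError("unsupported method: %s" % method)


scores = [1.0, 2.0, 3.0]
averaged = combine(scores, "average")    # a single float
selected = combine(scores, "selectAll")  # a list with one entry per tree
```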

At the moment you cannot use "selectAll" with the JPMML-Evaluator
library, because it will always throw an UnsupportedFeatureException.
But it would be very easy to fix. In fact, your request comes in at a
very opportune time, because I was working with class
MiningModelEvaluator just yesterday (per Alex's feature request). So,
if everything goes well, we could have an updated JPMML-Evaluator
library ready and released by Monday morning.

> Now, how does the output element have to be specified to get the statistics
> mentioned above? There is an aggregate element described in the
> transformations section of the documentation
> (http://www.dmg.org/v4-2/Transformations.html), but I couldn't figure out
> how to calculate the statistics other than the mean:
>
> <Output>
> <OutputField name="Mean" optype="continuous" dataType="string"
> targetField="response" feature="transformedValue">
> <Aggregate field="Predicted_DV" function="mean"/>
> </OutputField>
> </Output>
>
> Is this even correct? Do you have a hint on how to calculate standard
> deviation, and quantiles?
>

That's exactly the way to do it. There's only one small detail: you
cannot specify optype="continuous" and dataType="string" together. You
probably meant dataType="double" anyway, because you are dealing with
mean values. I would recommend omitting the optype and dataType
attributes unless you really intend to cast the value from one
datatype to another (e.g. converting from int to double).

As for the mean, standard deviation and percentile functions, the PMML
specification only gives you the mean, under the name "average".
However, the JPMML-Evaluator library gives you a chance to define
additional "user-defined Java-backed functions". Simply implement the
interface org.jpmml.evaluator.Function and register the instance with
the method org.jpmml.evaluator.FunctionRegistry#putFunction(String,
Function). Please remember that this is JPMML-specific functionality
and you will lose portability with other PMML consumer applications if
you do so.

For example, you could create a class PercentileFunction that takes two
arguments: first, the FieldValue instance that contains the result of
the "selectAll" function, and second, the FieldValue instance that
contains the percentile value. This way you could use the same class
for calculating both the 5% and 95% percentiles, by only changing the
second argument of the Apply element. I would recommend registering
this class PercentileFunction under its fully qualified class name, so
that it is plainly obvious to third parties that it is not a PMML
built-in function. Something like this:
<Apply function="de.uni-hamburg.pmml.PercentileFunction">
<FieldRef field="rf_selectAll"/>
<Constant>5</Constant>
</Apply>

Of course, it would be possible to collect more common functions under
the JPMML-Evaluator library (e.g. package
org.jpmml.evaluator.extensions) and deploy them automatically to
FunctionRegistry.
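
The arithmetic that such functions would carry out can be sketched
independently of the Java API. The Python sketch below computes the
statistics asked for (mean, standard deviation, 5% and 95% percentiles)
from a list of member tree scores. The percentile definition is an
assumption: it interpolates linearly between the closest ranks, and
other definitions give slightly different values. The names percentile
and summarize are made up for this example.

```python
import math
from statistics import mean, stdev


def percentile(values, p):
    """Return the p-th percentile (0..100), interpolating between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Fractional position of the percentile in the sorted list.
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(math.floor(rank))
    upper = int(math.ceil(rank))
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(tree_scores):
    """Summarize the distribution of member decision tree scores."""
    return {
        "mean": mean(tree_scores),
        "stdev": stdev(tree_scores),  # sample standard deviation, needs n >= 2
        "p5": percentile(tree_scores, 5),
        "p95": percentile(tree_scores, 95),
    }


summary = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Taking the percentile as the second argument mirrors the Apply element
above: the same function serves both interval bounds.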

> I think a blog about PMML would definitely be appreciated by an increasing
> number of people, since the use of PMML seems to be taking off!
>

You just gave me an idea for the second blog post :-)


VR

Villu Ruusmann

Apr 21, 2014, 5:22:20 AM
to jp...@googlegroups.com
Hi everyone,

I have just completed a two-part blog post about random forest intelligence.

"Random forest intelligence. Part 1: Outputting member decision tree scores":
http://openscoring.io/blog/2014/04/10/randomforest_output_decisiontree_scores/

"Random forest intelligence. Part 2: Estimating the prediction
interval based on member decision tree scores":
http://openscoring.io/blog/2014/04/18/randomforest_estimate_prediction_interval_decisiontree_scores/

I would like to elaborate on some parts of those posts in more detail,
though. If you have any comments or questions regarding this topic
then I would gladly add them to the discussion.


VR