Controlling scientific notation in PMML document


Patrick Hofmann

Oct 6, 2020, 12:12:18 PM
to Java PMML API
Hello,

I've been actively using your PySpark2PMML package to write RF spark models into PMML documents, and was just noticing that sometimes I get scientific notation in the output:

<ScoreDistribution value="0" recordCount="2.3252954E7"/>

Is there a way to control whether scientific notation is used in the output? I'd prefer that it isn't, as my C++ parser isn't written to accept it. Thanks!

Patrick Hofmann

Villu Ruusmann

Oct 6, 2020, 12:50:25 PM
to Java PMML API
Hi Patrick,

>
> I was just noticing that sometimes I get scientific notation in the output:
> <ScoreDistribution value="0" recordCount="2.3252954E7"/>
>
> Is there a way to control whether or not scientific notation is used in the output?
>

The JPMML-SparkML library (which powers the PySpark2PMML Python
package) assigns ScoreDistribution@recordCount attribute values
as reported by the org.apache.spark.mllib.tree.impurity.ImpurityCalculator#stats()
method:
https://github.com/jpmml/jpmml-sparkml/blob/1.6.1/src/main/java/org/jpmml/sparkml/model/TreeModelUtil.java#L142-L147

The return type of this method is an array of Java 'double' primitive
values. These array elements are boxed from primitive values to
java.lang.Double values and then printed during XML marshalling using
the org.dmg.pmml.adapters.NumberAdapter#marshal(Number) method:
https://github.com/jpmml/jpmml-model/blob/1.5.4/pmml-model/src/main/java/org/dmg/pmml/adapters/NumberAdapter.java#L21-L29

As you can see, it calls the
org.dmg.pmml.adapters.NumberUtil#printNumber(Number) utility method,
which delegates the formatting of java.lang.Float and java.lang.Double
values to the JAXB library (javax.xml.bind.DatatypeConverter):
https://github.com/jpmml/jpmml-model/blob/1.5.4/pmml-model/src/main/java/org/dmg/pmml/adapters/NumberUtil.java#L33-L47
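
To illustrate where the exponent comes from, here is a minimal sketch
using only the JDK. It assumes that the JAXB converter ends up with
Java's default double-to-string conversion for finite values (this is
an assumption about the JAXB implementation, not something verified
here). The class name DefaultDoubleRendering is made up for this
example.

```java
class DefaultDoubleRendering {

    // Java's default rendering of a double. It switches to scientific
    // notation when the magnitude is >= 10^7 or < 10^-3.
    static String render(double value) {
        return Double.toString(value);
    }

    public static void main(String[] args) {
        // The record count from the original question
        System.out.println(render(2.3252954E7));
        // Just below the 10^7 threshold: plain decimal notation
        System.out.println(render(9999999.0));
        // Small magnitudes behave the same way
        System.out.println(render(0.001));
        System.out.println(render(0.0001));
    }
}
```

So any record count of ten million or more is expected to show up
with an exponent, which matches the value in the question.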

Now, back to addressing your issue.

There are two conceptual ways of achieving custom (e.g. tightly
controlled) number formatting:
1) Configure your own JAXB adapter class (instead of the default
o.d.p.adapters.NumberAdapter) that uses a custom formatting
pattern for Float and Double values.
2) Modify the in-memory org.dmg.pmml.PMML object, by replacing
ScoreDistribution@recordCount values with a custom java.lang.Number
subclass that uses custom formatting in its #toString() method.
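
A custom Number subclass of the kind mentioned in option 2 could look
roughly like the sketch below. The class name PlainDouble is
hypothetical, and the sketch assumes that the marshaller prints
Number types other than Float and Double via their #toString()
method.

```java
import java.math.BigDecimal;

// Hypothetical wrapper (not part of JPMML) whose toString() never
// uses scientific notation.
final class PlainDouble extends Number {

    private static final long serialVersionUID = 1L;

    private final double value;

    PlainDouble(double value) {
        this.value = value;
    }

    @Override
    public int intValue() {
        return (int) value;
    }

    @Override
    public long longValue() {
        return (long) value;
    }

    @Override
    public float floatValue() {
        return (float) value;
    }

    @Override
    public double doubleValue() {
        return value;
    }

    @Override
    public String toString() {
        // BigDecimal.valueOf throws NumberFormatException for NaN and
        // infinite values; record counts are expected to be finite.
        return BigDecimal.valueOf(value).toPlainString();
    }
}
```

An instance would then be passed to
ScoreDistribution#setRecordCount(Number) in place of the original
java.lang.Double value.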

I'd personally suggest the second option. The workflow is simple and
straightforward using the Visitor API from the JPMML-Model library.

PipelineModel pipelineModel = ..;
org.dmg.pmml.PMML pmml = new PMMLBuilder(schema, pipelineModel).build();
// THIS!
Visitor recordCountTransformer = new org.jpmml.model.visitor.AbstractVisitor(){
    @Override
    public VisitorAction visit(ScoreDistribution scoreDistribution){
        double doubleRecordCount = (scoreDistribution.getRecordCount()).doubleValue();
        // format(double) is a user-supplied helper (not part of JPMML) that
        // returns the value as a plain decimal string, without an exponent
        String properlyFormattedDoubleValue = format(doubleRecordCount);
        // Use java.lang.Long when the record count is an integer-like value;
        // Long.valueOf(String) throws NumberFormatException otherwise
        scoreDistribution.setRecordCount(Long.valueOf(properlyFormattedDoubleValue));
        // Alternatively, use BigDecimal
        // scoreDistribution.setRecordCount(new BigDecimal(properlyFormattedDoubleValue));
        return super.visit(scoreDistribution);
    }
};
recordCountTransformer.applyTo(pmml);
JAXBUtil.marshalPMML(pmml, new StreamResult(System.out));
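
The format(double) call in the snippet is not defined anywhere in the
thread. One possible implementation, using only the JDK, is sketched
below; the class name RecordCountFormat and the choice of BigDecimal
are assumptions made for illustration, not something the JPMML
libraries provide.

```java
import java.math.BigDecimal;

class RecordCountFormat {

    // Renders a finite double as a plain decimal string without an
    // exponent. Trailing zeros are stripped, so an integer-like value
    // such as 2.0E7 becomes "20000000" rather than "20000000.0".
    static String format(double value) {
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        String formatted = format(2.3252954E7);
        System.out.println(formatted);
        // An integer-like result can be parsed as a Long
        System.out.println(Long.valueOf(formatted));
        // A fractional value keeps its decimal part and would need
        // BigDecimal instead of Long
        System.out.println(format(12.5));
    }
}
```

Note that a fractional record count would make Long.valueOf(String)
fail, so the BigDecimal alternative is the safer choice when the
counts are not guaranteed to be whole numbers.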

This is how it works at the Java/Scala API level. There is no
obvious or direct way to connect it to the Python API level
(the PySpark2PMML package)...

Alternatively, you might save the original PMML document (which uses
scientific notation for double values) into a temporary file in
Python, and then write a small command-line Java/Scala application
that re-formats it using the above Visitor API snippet.


VR

Patrick Hofmann

Oct 7, 2020, 10:00:16 AM
to Java PMML API
Awesome, thank you Villu!

Patrick
