Hi Patrick,
>
> I was just noticing that sometimes I get scientific notation in the output:
> <ScoreDistribution value="0" recordCount="2.3252954E7"/>
>
> Is there a way to control whether or not scientific notation is used in the output?
>
The JPMML-SparkML library (which powers the PySpark2PMML Python
package) assigns ScoreDistribution@recordCount attribute values
as reported by the org.apache.spark.mllib.tree.impurity.ImpurityCalculator#stats()
method:
https://github.com/jpmml/jpmml-sparkml/blob/1.6.1/src/main/java/org/jpmml/sparkml/model/TreeModelUtil.java#L142-L147
The return type of this method is an array of Java 'double' primitive
values. These array elements are boxed from primitive values to
java.lang.Double values and then printed during XML marshalling using
the org.dmg.pmml.adapters.NumberAdapter#marshal(Number) method:
https://github.com/jpmml/jpmml-model/blob/1.5.4/pmml-model/src/main/java/org/dmg/pmml/adapters/NumberAdapter.java#L21-L29
As you can see, the
org.dmg.pmml.adapters.NumberUtil#printNumber(Number) utility method
delegates the formatting of java.lang.Float and java.lang.Double values
to the JAXB library (javax.xml.bind.DatatypeConverter):
https://github.com/jpmml/jpmml-model/blob/1.5.4/pmml-model/src/main/java/org/dmg/pmml/adapters/NumberUtil.java#L33-L47
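To illustrate where the "E7" suffix comes from, here is a minimal
JDK-only sketch. It does not call JPMML or JAXB; it assumes that the
JAXB converter prints finite doubles in the same style as
java.lang.Double#toString(double), which matches the output you
posted. The class and method names (NotationDemo, defaultForm,
plainForm) are made up for this illustration.

```java
import java.math.BigDecimal;

// Hypothetical demo class, not part of any JPMML library
final class NotationDemo {

    // Default JDK rendering: switches to scientific notation
    // once the magnitude reaches 10^7
    static String defaultForm(double value) {
        return Double.toString(value);
    }

    // Plain rendering that never uses an exponent
    static String plainForm(double value) {
        return BigDecimal.valueOf(value).toPlainString();
    }

    public static void main(String[] args) {
        double recordCount = 23252954d;
        System.out.println(defaultForm(recordCount)); // 2.3252954E7
        System.out.println(plainForm(recordCount));   // 23252954
    }
}
```

Smaller record counts (below ten million) are printed by the JDK in
plain decimal form, which is why you only see scientific notation
"sometimes".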
Now, back to addressing your issue.
There are two conceptual ways of achieving custom (e.g. tightly
controlled) number formatting:
1) Configure your own JAXB adapter class (instead of the default
o.d.p.adapters.NumberAdapter) that would use a custom formatting
pattern for Float and Double values.
2) Modify the in-memory org.dmg.pmml.PMML object, by replacing
ScoreDistribution@recordCount values with a custom java.lang.Number
subclass that uses custom formatting in its #toString() method.
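For the second option, a custom java.lang.Number subclass could look
roughly like the sketch below. The class name PlainNumber is
hypothetical, and the sketch assumes (as described above) that the
marshaller falls back to #toString() for Number types other than
Float and Double. I have not tested it against a particular
JPMML-Model version.

```java
import java.math.BigDecimal;

// Hypothetical helper class, not part of any JPMML library
final class PlainNumber extends Number {

    private static final long serialVersionUID = 1L;

    private final double value;

    PlainNumber(double value) {
        this.value = value;
    }

    @Override public int intValue() { return (int) value; }
    @Override public long longValue() { return (long) value; }
    @Override public float floatValue() { return (float) value; }
    @Override public double doubleValue() { return value; }

    // Prints the value without an exponent and without a trailing ".0"
    @Override
    public String toString() {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            // BigDecimal cannot represent these, keep the JDK rendering
            return Double.toString(value);
        }
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }
}
```

An instance would then be passed to the setter, for example
scoreDistribution.setRecordCount(new PlainNumber(recordCount)).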
I'd personally suggest the second option. The workflow is simple and
straightforward using the Visitor API from the JPMML-Model library.
    PipelineModel pipelineModel = ..;
    org.dmg.pmml.PMML pmml = new PMMLBuilder(schema, pipelineModel).build();

    // THIS!
    Visitor recordCountTransformer = new org.jpmml.model.visitor.AbstractVisitor(){

        @Override
        public VisitorAction visit(ScoreDistribution scoreDistribution){
            double doubleRecordCount = (scoreDistribution.getRecordCount()).doubleValue();
            // format(double) is not defined here; supply your own helper that
            // returns a plain (non-scientific) string
            String properlyFormattedDoubleValue = format(doubleRecordCount);
            // Use java.lang.Long when the record count is an integer-like value
            scoreDistribution.setRecordCount(Long.valueOf(properlyFormattedDoubleValue));
            // Alternatively, use BigDecimal
            // scoreDistribution.setRecordCount(new BigDecimal(properlyFormattedDoubleValue));
            return super.visit(scoreDistribution);
        }
    };
    recordCountTransformer.applyTo(pmml);

    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out));
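The format(double) call in the snippet above is left undefined. One
possible stand-in, written with the JDK only, is sketched below. The
class name RecordCountFormat is hypothetical, and rounding to a whole
number is my assumption: it suits the java.lang.Long variant, but it
would discard the fractional part of weighted record counts.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper class, not part of any JPMML library
final class RecordCountFormat {

    // Renders a finite double as a whole number in plain decimal
    // notation, so that the result can be parsed by Long.valueOf(String)
    static String format(double value) {
        return BigDecimal.valueOf(value)
            .setScale(0, RoundingMode.HALF_UP)
            .toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(format(2.3252954E7)); // 23252954
        System.out.println(Long.valueOf(format(2.3252954E7)));
    }
}
```

If your record counts can be fractional, drop the setScale(..) call
and use the BigDecimal variant of the setter instead of the Long one.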
This is how things work at the Java/Scala API level. There is no
obvious or direct way to connect it to the Python API level
(PySpark2PMML package)...
Alternatively, you might save the original PMML document (which uses
scientific notation for double values) to a temporary file in Python,
and then write a small command-line Java/Scala application that
re-formats it using the above Visitor API snippet.
VR