Hi Patrick,
> I've been trying to build a PMML document using
> PMMLBuilder in a pyspark Jupyter notebook attached
> to an AWS EMR cluster running spark 2.4.6.
>
Sorry, I'm not qualified to help you with such a multi-layer setup,
but here's how I would approach the situation.
First, upgrade to the latest JPMML-SparkML library version. For the
Apache Spark 2.4.X development line, this should be JPMML-SparkML
1.5.8.
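For example, a minimal sketch of attaching it at session startup (assuming the library is resolvable from Maven Central under these coordinates; on EMR you may prefer pointing "spark.jars" at a local or S3 JAR path instead):

    # Sketch: pull JPMML-SparkML 1.5.8 from Maven Central when the session starts.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.jars.packages", "org.jpmml:jpmml-sparkml:1.5.8") \
        .getOrCreate()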
Second, check Apache Spark's server-side logs to make sure it
really found and picked up the specified JPMML-SparkML library file.
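A quick client-side sanity check is to print the JAR-related settings that the running session actually sees (a sketch; which keys matter depends on how the JAR was attached):

    # Sketch: inspect the jar-related configuration of the live session.
    for key in ("spark.jars", "spark.jars.packages"):
        print(key, "=", spark.sparkContext.getConf().get(key, "<not set>"))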
Third, you don't need to execute a full training run just to verify
whether some Java class is discoverable or not. Simply start a new PySpark
session and type "javaPmmlBuilderClass =
sc._jvm.org.jpmml.sparkml.PMMLBuilder" into it (the variable "sc"
refers to the SparkContext). See this:
https://github.com/jpmml/pyspark2pmml/blob/0.5.1/pyspark2pmml/__init__.py#L12
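Spelled out as a runnable sketch (mirroring the pyspark2pmml check linked above; "spark" is an existing SparkSession):

    # Sketch: if the JPMML-SparkML JAR is missing from the JVM classpath,
    # Py4J typically hands back a JavaPackage placeholder rather than a JavaClass.
    from py4j.java_gateway import JavaClass

    sc = spark.sparkContext
    javaPmmlBuilderClass = sc._jvm.org.jpmml.sparkml.PMMLBuilder
    print(isinstance(javaPmmlBuilderClass, JavaClass))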
If your server side is (still) broken, you should immediately see
the same Py4J error. Iterate on the configuration until it is resolved.
If everything else fails, simply dump the DataFrame schema and the
PipelineModel object into some S3 bucket (in JSON data format), and
perform the conversion manually using the JPMML-SparkML command-line
application, as detailed here:
https://github.com/jpmml/jpmml-sparkml#example-application-1
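A rough sketch of the dumping part (bucket and prefix names are placeholders; "df" is the training DataFrame and "pipelineModel" is the fitted PipelineModel):

    # Sketch: export the two inputs that the command-line converter needs.
    schema_json = df.schema.json()
    # Spark writes a single-partition text "file" as a directory of part files:
    spark.sparkContext.parallelize([schema_json], 1) \
        .saveAsTextFile("s3://my-bucket/pmml-export/schema")

    pipelineModel.write().overwrite() \
        .save("s3://my-bucket/pmml-export/pipelineModel")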
VR