PMMLBuilder issue on AWS EMR cluster


Patrick Hofmann

Nov 10, 2020, 9:37:36 AM
to Java PMML API
Hello,

I've been trying to build a PMML document using PMMLBuilder in a PySpark Jupyter notebook attached to an AWS EMR cluster running Spark 2.4.6. I've configured the Spark application with the following options to add the jpmml-sparkml jar:

%%configure -f
{
    "conf": {
        "spark.jars": "s3://<my_dir>/jpmml-sparkml-executable-1.5.1.jar",
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
    }
}

I then install pyspark2pmml from PyPI and import the package:

from pyspark2pmml import PMMLBuilder

I then run through my code, build a pipeline of transformers, and try to build the PMML document with PMMLBuilder like this:

pmml = PMMLBuilder(sc, df, pipeline_model).buildByteArray()

But I get the following error:

org.jpmml.sparkml.PMMLBuilder does not exist in the JVM
Traceback (most recent call last):
  File "/tmp/1605018025068-0/lib/python3.7/site-packages/pyspark2pmml/__init__.py", line 12, in __init__
    javaPmmlBuilderClass = sc._jvm.org.jpmml.sparkml.PMMLBuilder
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1598, in __getattr__
    raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
py4j.protocol.Py4JError: org.jpmml.sparkml.PMMLBuilder does not exist in the JVM

Anyone have any thoughts on this?  Thank you!

Patrick Hofmann

Villu Ruusmann

Nov 10, 2020, 1:54:50 PM
to Java PMML API
Hi Patrick,

> I've been trying to build a PMML document using
> PMMLBuilder in a pyspark Jupyter notebook attached
> to an AWS EMR cluster running spark 2.4.6.
>

Sorry, I'm not qualified to help you with such a multi-layer setup,
but here's how I would approach the situation.

First, upgrade to the latest JPMML-SparkML library version. For the
Apache Spark 2.4.X development line, this should be JPMML-SparkML
1.5.8.
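
In notebook terms, that would mean just bumping the jar version in the %%configure cell (a sketch, reusing your S3 placeholder; not a tested configuration):

```
%%configure -f
{
    "conf": {
        "spark.jars": "s3://<my_dir>/jpmml-sparkml-executable-1.5.8.jar"
    }
}
```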

Second, check Apache Spark's server-side logs to make sure it really
found and picked up the specified JPMML-SparkML library file.

Third, you don't need to execute a full training run just to verify
whether some Java class is discoverable or not. Simply start a new
PySpark session and type "javaPmmlBuilderClass =
sc._jvm.org.jpmml.sparkml.PMMLBuilder" into it (the variable "sc"
refers to the Spark context). See this:
https://github.com/jpmml/pyspark2pmml/blob/0.5.1/pyspark2pmml/__init__.py#L12
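
That check can also be wrapped in a small helper so it is easy to rerun after every config change (a sketch; "sc" is the live SparkContext, and the broad except clause is only there to cover py4j's Py4JError without importing py4j at the top level):

```python
# Sketch: probe whether a Java class is reachable through the py4j gateway.
def jvm_class_available(sc, fqn="org.jpmml.sparkml.PMMLBuilder"):
    ref = sc._jvm
    try:
        # Walk the dotted name one package segment at a time,
        # which is what py4j does under the hood.
        for part in fqn.split("."):
            ref = getattr(ref, part)
    except Exception:  # py4j raises Py4JError when the class is missing
        return False
    return True
```

If this returns False right after session startup, the jar never made it onto the driver classpath, and there is no point in running the training pipeline yet.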

If your server side is (still) broken, then you should see the same
Py4J error instantly. Iterate on the configuration until it is resolved.

If everything else fails, simply dump the DataFrame schema and the
PipelineModel object into some S3 bucket (in JSON data format), and
perform the conversion manually using the JPMML-SparkML command-line
application, as detailed here:
https://github.com/jpmml/jpmml-sparkml#example-application-1
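
A minimal sketch of that dump step (assumptions: "df" and "pipeline_model" come from the notebook, and dump_schema_json is a hypothetical helper; the Spark calls are shown as comments because they need the live session):

```python
# In the notebook, the Spark-side calls would look roughly like this:
#
#     schema_json = df.schema.json()   # DataFrame schema as a JSON string
#     pipeline_model.write().overwrite().save("s3://<my_dir>/pipeline_model")
#
# Hypothetical helper: write the schema JSON to a local file for upload.
def dump_schema_json(schema_json, path):
    with open(path, "w") as f:
        f.write(schema_json)
```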


VR