Error in using tfidf with dictionary in PMML pipeline

Harshit Karnatak

Dec 13, 2017, 4:04:12 AM

to Java PMML API

Hi VR,
I am trying to use TfidfVectorizer in a PMML pipeline:
tfidfVectorizer1 = TfidfVectorizer(analyzer='word',preprocessor=None,min_df = 100,stop_words='english',vocabulary={'totals','grand'},tokenizer=Splitter(),token_pattern=None,norm=None)
This throws the error shown below, but when I remove the vocabulary parameter it works fine.
Is the vocabulary parameter supported? If not, is there a workaround for including a vocabulary in TF-IDF?
The error thrown is as follows:

java.lang.ClassCastException: java.lang.Integer cannot be cast to numpy.core.Scalar
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:99)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:76)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
at sklearn.Initializer.encodeFeatures(Initializer.java:53)
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:82)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:128)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to numpy.core.Scalar
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:99)
at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:76)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
at sklearn.Initializer.encodeFeatures(Initializer.java:53)
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:82)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:128)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)
Preserved joblib dump file(s): C:\Users\HARSHI~1.000\AppData\Local\Temp\pipeline-3wg7fyiw.pkl.z
Traceback (most recent call last):
File "C:\Users\harshit.karnata.NOTEBOOK436.000\AppData\Roaming\Python\Python36\site-packages\sklearn2pmml\__init__.py", line 216, in sklearn2pmml
subprocess.check_call(cmd)
File "E:\ANACONDA\lib\subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['java', '-cp', 'C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\guava-20.0.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\istack-commons-runtime-3.0.5.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\jaxb-core-2.3.0.jar;C:\\Users\\harshit.karnata.NOTEBOOK436.000\\AppData\\Roaming\\Python\\Python36\\site-packages\\sklearn2pmml\\resources\\slf4j-jdk14-1.7.25.jar;C:\\ai_datasciences_python\\sklearn2pmml-plugin-1.0-SNAPSHOT.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', 'C:\\Users\\HARSHI~1.000\\AppData\\Local\\Temp\\pipeline-3wg7fyiw.pkl.z', '--pmml-output',

Thanks in advance

Villu Ruusmann

Dec 13, 2017, 3:27:42 PM

to Java PMML API

Hi Harshit,

> I am trying to use TFIDF in pmml pipeline ...

> But it throws the error, but when I remove vocabulary parameter, it works fine.

> The error thrown is as follows:
>
> java.lang.ClassCastException: java.lang.Integer cannot be cast to numpy.core.Scalar
> at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:99)
> at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:76)
>

The vocabulary argument is fully supported.

The vocabulary needs to be supplied in the form of a Python dict,
where the key datatype is string, and the value datatype is
numpy.core.Scalar (this is some Scikit-Learn internal convention).
However, in your code you've constructed a Python dict, where the
value datatype is a "raw" int datatype.

In principle, the JPMML-SkLearn library should be more liberal here,
and accept any numeric datatype. Perhaps it may be desirable to use
double datatype in some applications (provided that double values can
be losslessly cast to int values).
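
As an illustration of what "losslessly cast" could mean on the Python side, here is a minimal sketch that normalizes vocabulary values before the pipeline is exported. It is not the JPMML-SkLearn fix itself; the helper names to_int32_index and normalize_vocabulary are hypothetical, and the only assumption is that NumPy is installed.

```python
import numpy as np


def to_int32_index(value):
    """Convert a numeric vocabulary index to np.int32, rejecting lossy casts."""
    as_float = float(value)
    # A value such as 1.0 converts cleanly; 1.5 would lose information.
    if not as_float.is_integer():
        raise ValueError("index %r cannot be losslessly cast to int" % (value,))
    return np.int32(int(as_float))


def normalize_vocabulary(vocabulary):
    """Return a copy of the term -> index dict with np.int32 values."""
    return {term: to_int32_index(index) for term, index in vocabulary.items()}


# Mixed raw int and float indices are both accepted when they are integral.
normalized = normalize_vocabulary({"grand": 0, "totals": 1.0})
```

The resulting dict can then be passed as the vocabulary argument in place of the original one.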

I've opened a GitHub issue, and hope to fix it in the very near future:
https://github.com/jpmml/jpmml-sklearn/issues/61

VR

Harshit Karnatak

Dec 14, 2017, 10:27:49 AM

to Java PMML API

Hi VR,
Thanks for the help. Casting the raw int values to int32 worked for me.
The vocabulary now looks like this:
vocabulary={'key1': np.int32(0), 'key2': np.int32(1)}
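
For a longer term list, the same workaround can be generated rather than typed by hand. This is a minimal sketch, assuming NumPy is installed; build_vocabulary is a hypothetical helper name, and assigning indices in sorted term order is an illustrative choice, not something the converter is known to require.

```python
import numpy as np


def build_vocabulary(terms):
    """Map each distinct term to a consecutive np.int32 index, in sorted order."""
    unique_terms = sorted(set(terms))
    return {term: np.int32(index) for index, term in enumerate(unique_terms)}


# The two terms from the original post.
vocabulary = build_vocabulary(["totals", "grand"])
```

The resulting dict can be passed as vocabulary=vocabulary when constructing the TfidfVectorizer.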
