The tokenizer object (null) is not Splitter


Michael D.

Jun 13, 2017, 10:29:03 AM
to Java PMML API
I just tried to convert a sklearn model stored in pickle format with the "jpmml-sklearn" converter.
I got "java.lang.IllegalArgumentException: The tokenizer object (null) is not Splitter".

I read the comments on github:
https://github.com/jpmml/jpmml-sklearn/issues/28#issuecomment-277728623

As I understand it, this exception occurs because the "tokenizer" field of my "CountVectorizer" is set to "None", and I should use "sklearn2pmml.feature_extraction.text.Splitter" instead. Unfortunately, I don't understand what I should do to convert my model to PMML.

Should I use this "sklearn2pmml.feature_extraction.text.Splitter" while training the model?

Thank you,
M.

Villu Ruusmann

Jun 13, 2017, 11:38:03 AM
to Java PMML API
Hi Michael,

> I just tried to convert a sklearn model stored in pickle format with the "jpmml-sklearn" converter.
> I got "java.lang.IllegalArgumentException: The tokenizer object (null) is not Splitter".
>

The problem is that Scikit-Learn and PMML use different sentence
tokenization modes:
1) In Scikit-Learn, the default mode is "token matching", where a
regexp (the default token_pattern is "(?u)\b\w\w+\b") selects
sequences of 2 or more alphanumeric characters.
2) In PMML, the default mode is "splitting", where a regexp splits
the text by whitespace character(s), and then raw tokens are cleaned
by trimming leading and trailing punctuation characters.
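The difference between the two modes can be sketched in plain Python (a minimal illustration: the first regexp is Scikit-Learn's default token_pattern, while the punctuation-trimming rule only approximates PMML's splitting behaviour):

```python
import re

text = "Don't panic -- it's fine!"

# 1) Token matching (Scikit-Learn default): the regexp selects runs
#    of 2+ alphanumeric characters; single-letter fragments and
#    punctuation are dropped, and contractions are broken apart.
matched = re.findall(r"(?u)\b\w\w+\b", text)
print(matched)  # ['Don', 'panic', 'it', 'fine']

# 2) Splitting (PMML-style, approximated): split on whitespace, then
#    trim leading and trailing non-word characters from each raw token.
raw_tokens = text.split()
split = [t for t in (re.sub(r"^\W+|\W+$", "", token) for token in raw_tokens) if t]
print(split)  # ["Don't", 'panic', "it's", 'fine']
```

Note how the two modes disagree on contractions and single-letter tokens, which is why a model trained with one mode cannot be reproduced under the other.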

In Apache Spark ML you can switch between "token matching" and
"splitting" modes by toggling the "gaps" parameter
(https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/ml/feature/RegexTokenizer.html#setGaps(boolean)).

It is likely that future versions of Scikit-Learn and PMML will
provide similar mode selection functionality. However, today, if you
want to be able to reproduce Scikit-Learn predictions in PMML, then
you must explicitly provide a PMML-compatible tokenizer in the
CountVectorizer or TfidfVectorizer constructor.

The SkLearn2PMML package provides the class
'sklearn2pmml.feature_extraction.text.Splitter' exactly for this
purpose. It prioritizes correctness over performance. So, if you are
unhappy with its performance during model training (e.g. when dealing
with extremely large text bodies), then you may consider writing your
own.

The usage is straightforward:
<python>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml.feature_extraction.text import Splitter

vectorizer = TfidfVectorizer(analyzer = "word", token_pattern = None,
tokenizer = Splitter())
</python>

> Unfortunately I don't understand what I should do to convert my model to PMML.
>

The JPMML-SkLearn project contains a Python script that is used for
generating integration testing resources. There's a special section
dedicated to bag-of-words text classification models:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L303-L329

As you can see, stopwords and word N-grams are also supported.
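For instance, stopword filtering and word bigrams combine with a custom tokenizer as in the sketch below (a plain whitespace-splitting function stands in for Splitter here, purely so the snippet runs without sklearn2pmml installed; for actual PMML conversion you would pass Splitter() as shown earlier):

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_tokens(text):
    # illustrative stand-in for sklearn2pmml.feature_extraction.text.Splitter
    return text.split()

docs = ["the quick brown fox", "the lazy brown dog"]

# English stopwords are removed after tokenization, and word bigrams
# are then built from the remaining tokens.
vectorizer = CountVectorizer(analyzer = "word", token_pattern = None,
    tokenizer = split_tokens, stop_words = "english", ngram_range = (1, 2))
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
```

The resulting vocabulary contains unigrams and bigrams such as "quick brown" and "brown fox", while the stopword "the" is excluded.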


VR