Hi Michael,
> I just tried to convert a sklearn model stored in pickle format with "jpmml-sklearn" converter.
> I got "java.lang.IllegalArgumentException: The tokenizer object (null) is not Splitter".
>
The problem is that Scikit-Learn and PMML use different sentence
tokenization modes:
1) In Scikit-Learn, the default mode is to perform "token matching",
where a regexp selects sequences of two or more alphanumeric characters
(the default token_pattern is r"(?u)\b\w\w+\b").
2) In PMML, the default mode is to perform "splitting", where a regexp
splits the text on whitespace character(s), and the raw tokens are then
cleaned by trimming leading and trailing punctuation characters.
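To make the difference concrete, here is a minimal sketch of the two modes in plain Python. It only illustrates the idea; it is not the actual code of either library. The function names are made up for this example, and using string.punctuation (ASCII only) as the set of trimmed characters is an assumption.

```python
import re
import string

# Scikit-Learn's default token_pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Assumed separator for the splitting mode: one or more whitespace characters
SEPARATOR_PATTERN = re.compile(r"\s+")


def match_tokens(text):
    """Token matching: keep whatever the pattern matches."""
    return TOKEN_PATTERN.findall(text)


def split_tokens(text):
    """Splitting: cut on whitespace, then trim punctuation from both ends."""
    raw_tokens = SEPARATOR_PATTERN.split(text)
    trimmed = (token.strip(string.punctuation) for token in raw_tokens)
    # Drop tokens that were empty or consisted only of punctuation
    return [token for token in trimmed if token]


text = "Hello, world! It's a test."
print(match_tokens(text))  # ['Hello', 'world', 'It', 'test']
print(split_tokens(text))  # ['Hello', 'world', "It's", 'a', 'test']
```

The two modes disagree on one-character words ("a") and on tokens with embedded punctuation ("It's"), so the resulting vocabularies, and hence the predictions, can differ.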
In Apache Spark ML, you can switch between "token matching" and
"splitting" modes by toggling the "gaps" parameter of RegexTokenizer:
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/ml/feature/RegexTokenizer.html#setGaps(boolean)
It is likely that future versions of Scikit-Learn and PMML will
provide similar mode selection functionality. However, today, if you
want to reproduce Scikit-Learn predictions in PMML, then you must
explicitly pass a PMML-compatible tokenizer to the CountVectorizer or
TfidfVectorizer constructor.
The SkLearn2PMML package provides the class
'sklearn2pmml.feature_extraction.text.Splitter' exactly for this
purpose. It prioritizes correctness over performance. So, if you're
unhappy with its performance during model training (e.g. when dealing
with extremely large text bodies), then you may consider writing your
own.
The usage is straightforward:
<python>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml.feature_extraction.text import Splitter
# token_pattern is unused once a tokenizer is supplied; None makes that explicit
vectorizer = TfidfVectorizer(analyzer = "word", token_pattern = None,
    tokenizer = Splitter())
</python>
> Unfortunately I don't understand what I should do to convert my model to PMML.
>
The JPMML-SkLearn project contains a Python script that is used for
generating integration testing resources. There's a special section
dedicated to bag-of-words text classification models:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L303-L329
As you can see, stopwords and word N-grams are also supported.
VR