I'm trying to build a TfidfVectorizer with a custom tokenizer that takes lang as a parameter.
def keyphrases(text, lang):
    # TODO: extract keyphrases for different languages
    ...

vectorizer = TfidfVectorizer(
    lowercase=True, min_df=2, norm='l2', smooth_idf=True,
    stop_words='english', tokenizer=keyphrases(lang),
    sublinear_tf=True)
How can this be achieved?