ngrams and phraser


andreas heiner

Sep 17, 2021, 5:29:50 AM
to Gensim
Hi,

I need to generate ngrams, and I'm not aware of a model that is freely available (for commercial use). 
I followed the example at https://radimrehurek.com/gensim/models/phrases.html, but I can't get it working. This is my approach:
#
# example sentence
sentence = "I like living in New York and London and travel with time travel"
#
# the example suggests it uses lemmas of NOUNs etc. only, so I use spaCy for that
import spacy
nlp = spacy.load("en_core_web_sm")
line = [token.lemma_ for token in nlp(sentence) if token.pos_ in ["ADJ", "NOUN", "PROPN"]]
#
# Load corpus and train a model. 
# (The reason I use a standard corpus is that it lowers the chances of oov words)
from gensim.models.word2vec import Text8Corpus
from gensim.models import Phrases
from gensim.models.phrases import Phraser

file = "./text8.txt"  # from https://deepai.org/dataset/text8
# datapath() resolves names against gensim's bundled test data, so pass
# the local path straight to Text8Corpus
sentences = Text8Corpus(file)
# I also used a large corpus of my own data, sentences = mycorpus
bigram_model = Phrases(sentences)
bigram_phraser = Phraser(bigram_model)

Both bigram_model[line] and bigram_phraser[line] give
['New', 'York', 'London', 'time', 'travel']
not 
['New_York', 'London', 'time_travel']

Where am I going wrong?

thanks,

Andreas



Gordon Mohr

Sep 17, 2021, 9:29:49 PM
to Gensim
The `Phrases` mechanism is purely based on statistical cooccurrences, which means, among other things:

* It will be highly sensitive to the corpus on which it is trained, and to the effective (but tunable) `threshold` & `min_count` values (see the sketch just after this list).
* The results will usually not match what a meaning-aware reader (like a human) might prefer aesthetically, and even extensive tuning may only improve the promotion of some desirable bigrams at the cost of others. So, presenting its results to end-users may often be unappealing. Still, its combinations will often improve the raw text, internally, for info-retrieval or classification purposes, via the addition of bigram tokens that carry more useful 'signal' than the originals.
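
A minimal, self-contained sketch of that sensitivity (the toy corpus and the `min_count` & `threshold` values here are purely illustrative, not recommendations):

from gensim.models import Phrases

# tiny hypothetical corpus: already tokenized & lowercased
toy_sentences = [
    ["new", "york", "subway"],
    ["new", "york", "pizza"],
    ["new", "york", "marathon"],
    ["travel", "to", "new", "york"],
]
# min_count & threshold are the main knobs; these values are chosen just
# so the toy example fires
toy_model = Phrases(toy_sentences, min_count=2, threshold=1.0)
print(toy_model[["new", "york", "travel"]])
# with these counts & settings this should print ['new_york', 'travel'];
# raise threshold a little and 'new york' stops being promoted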

The `text8` corpus is probably a pretty bad set of training data for this purpose. It's not very large (only 100MB), and is just a tiny subset of some old raw Wikipedia text.

Also, all of its text is case-flattened, so no matter how many times `new york` appears in its training data, it can never learn to promote your `['New', 'York']` to `['New_York']`. It might work on `['new', 'york']`, depending on the text8 frequencies & tuning – I haven't tried.
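
Assuming the `bigram_model` and `line` from the original post, one quick check of this point (the result depends entirely on the text8 frequencies & your settings):

lowered = [t.lower() for t in line]   # match text8's case-flattening
print(bigram_model[lowered])
# may print ['new_york', 'london', 'time', 'travel'] -- or leave the
# tokens untouched if the scores don't clear the threshold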

So: you probably want to apply it to your own domain data, as large as possible, whenever you can. If using outside training text, you'd want something larger & more applicable to your data than `text8`. You'll want to remain sensitive to applying the same case-handling, & other preprocessing, to both training and later application data. 
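
The safest pattern is a single preprocessing function shared by both sides. A sketch, where `my_raw_lines` is a stand-in for your own corpus and the parameter values are placeholders:

from gensim.models import Phrases

def preprocess(text):
    # whatever happens here (lowercasing, lemmatizing, ...) must happen
    # identically at training time and at application time
    return [tok.lower() for tok in text.split()]

bigram_model = Phrases((preprocess(s) for s in my_raw_lines),
                       min_count=5, threshold=10.0)
print(bigram_model[preprocess("I like living in New York")])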

`Phraser` (aka `FrozenPhrases` in recent releases) is an optimized alternative that discards some state/flexibility for smaller/faster operation – so while tinkering to get acceptable results, you probably want to work with `Phrases` only. For example, you could adjust its `threshold` to get more or fewer of the bigrams of interest.
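
For example (gensim 4.x API, with the `sentences` & `bigram_model` from the original post), you can inspect the scores the model actually assigned, then pick a threshold relative to them:

# find_phrases() returns {phrase: score} for bigrams that clear the
# current threshold; sort to see what sits near the cutoff
scores = bigram_model.find_phrases(sentences)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1])[:20]:
    print(phrase, score)
# then retrain with a threshold just below the scores of the bigrams you
# want kept -- the 5.0 here is only a placeholder
bigram_model = Phrases(sentences, threshold=5.0)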

(And if you switch to `FrozenPhrases` for later steps, it'll mainly deliver its benefit if you're sure to discard the `Phrases` instance/variable once you're using `FrozenPhrases` instead.)
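
In gensim 4.x that looks something like (assuming the `bigram_model` from the original post):

frozen = bigram_model.freeze()   # returns a FrozenPhrases (aka Phraser)
del bigram_model                 # drop the training-time state so the memory is actually reclaimed
print(frozen[["new", "york"]])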

- Gordon