ngrams and phraser


andreas heiner

Sep 17, 2021, 5:29:50 AM
to Gensim

I need to generate ngrams, and I'm not aware of a model that is freely available (for commercial use). 
I follow the example in ( ), but I can't get it working. This is my approach:
# example sentence
sentence = "I like living in New York and London and travel with time travel"
# the example suggests it uses lemmas of NOUNs etc. only, so I use spaCy for that
import spacy
nlp = spacy.load("en_core_web_sm")
line = [token.lemma_ for token in nlp(sentence) if token.pos_ in ["ADJ", "NOUN", "PROPN"]]
# Load corpus and train a model. 
# (The reason I use a standard corpus is that it lowers the chances of oov words)
import gensim
from gensim.test.utils import datapath
from gensim.models.word2vec import Text8Corpus
from gensim.models import Phrases
from gensim.models.phrases import Phraser

file = "./text8.txt"  # from
sentences = Text8Corpus(datapath(file))
# I also used a large corpus of my own data, sentences = mycorpus
bigram_model = Phrases(sentences)
bigram_phraser = Phraser(bigram_model)

With line = ['New', 'York', 'London', 'time', 'travel'], both bigram_model[line] and bigram_phraser[line] give
['New_York', 'London', 'time_travel']

Where do I make the error?



Gordon Mohr

Sep 17, 2021, 9:29:49 PM
to Gensim
The `Phrases` mechanism is purely based on statistical cooccurrences, which means, among other things:

* It will be highly sensitive to the corpus on which it is trained, and to the effective (but tunable) `threshold` & `min_count` values.
* The results will usually not be aesthetically conformant to what a meaning-aware reader (like a human) might prefer, and even extensive tuning may only improve the promotion of some desirable bigrams at the cost of others. So, presenting its results to end-users may often be unappealing. Still, its combinations will often improve the raw text, internally, for info-retrieval or classification purposes - via the addition of bigram tokens that have more useful 'signal' than the original. 

The `text8` corpus is probably a pretty bad set of training data for this purpose. It's not very large (only 100MB), and it's just a tiny subset of some old raw Wikipedia text.

Also, all of its text is case-flattened, so no matter how many times `new york` might appear in its training data, it could never possibly learn to promote your `['New', 'York']` to `['New_York']`. It might work on `['new', 'york']` depending on the `text8` frequencies & tuning – I haven't tried.

So: you probably want to apply it to your own domain data, as large as possible, whenever you can. If using outside training text, you'd want something larger & more applicable to your data than `text8`. You'll want to remain sensitive to applying the same case-handling, & other preprocessing, to both training and later application data. 

`Phraser` (aka `FrozenPhrases` in recent releases) is an optimized alternative that discards some state/flexibility for smaller/faster operation – so while trying to tinker to get acceptable results, you probably want to work with only `Phrases` for experimentation. For example, you could tamper with its `threshold` to try to get more or less of the bigrams of interest. 

(And if you switch to `FrozenPhrases` for later steps, it'll mainly deliver its benefit if you're sure to discard the `Phrases` instance/variable once you're using `FrozenPhrases` instead.)

- Gordon