Bigram/Trigram model of Word2vec


Aman Kumar

unread,
May 25, 2020, 5:56:20 AM5/25/20
to Gensim
Hi all,

I am new to word2vec. I am working on a domain-specific topic in which most terms are two words (bigrams) or three words (trigrams), and I wanted to know if there is any model available on the internet that contains bigrams and trigrams from the existing Google corpus. Does the current Google word2vec model already contain bigrams/trigrams/quadgrams? If yes, up to how many n-grams does it contain?

If n-grams are not available in the mentioned Google model, how can I create one?

I do have an idea of how to train a new word2vec model on our corpus by following Gensim's documentation. And by running the sentences in our corpus through bigrams = phrases.Phrases(sentences) and bigrams[sentences], we can get vectors for bigrams.

I am looking for an existing google n-gram model.

Looking forward to expert suggestions.

Thanks

Regards,
Aman Kumar

Mueller, Mark-Christoph

unread,
May 25, 2020, 6:25:02 AM5/25/20
to gen...@googlegroups.com

Hi Aman,


Actually, the GoogleNews vectors contain *mostly* n-grams. If you convert the bin format to text (e.g. using the tool from here: https://gist.github.com/ottokart), you can do a grep on the vocabulary.

You'll find that, of the 3 million vectors, more than 2 million contain at least one underscore.
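(A minimal sketch of that count, once the vocabulary is in text form — the toy `vocab` list below is illustrative; with the real GoogleNews file you'd read tokens from the converted text file instead:)

```python
def count_ngram_tokens(vocab):
    """Return (total, with_underscore) counts for a list of vocabulary tokens.
    GoogleNews joins multi-word phrases with underscores, so tokens containing
    '_' are the n-gram entries."""
    with_underscore = sum(1 for token in vocab if '_' in token)
    return len(vocab), with_underscore

# Illustrative stand-in for the real 3-million-entry vocabulary:
vocab = ['New_York', 'dog', 'machine_learning', 'San_Francisco_Bay', 'cat']
total, ngrams = count_ngram_tokens(vocab)
print(total, ngrams)  # → 5 3 (three of the five tokens are multi-word)
```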


Best, Christoph



Mark-Christoph Müller

Research Associate

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg
Germany


Gordon Mohr

unread,
May 26, 2020, 12:10:07 AM5/26/20
to Gensim
As per Christoph's response, the 'GoogleNews' vectors contain plenty of n-grams. But note that:

* those were trained from Google's giant internal corpus of news articles as of ~2013 – if your domain isn't news, or involves terms that are novel or have shifted in sense since then, those vectors may be suboptimal

* while they used some variant of the statistical algorithm that's also implemented in gensim's `Phrases` class, run in multiple passes (as each pass only pairs previously-separate tokens), their exact parameters/tokenization/etc haven't been fully documented, as far as I can tell

If you had your own more-appropriate training texts, you could likely create more up-to-date vectors well-tuned for your domain. 

You could use some statistical method, like that in the `Phrases` class - but note that the generated bigrams/etc may not be aesthetically pleasing, or match your human-level understanding of what the true logical entities/phrases are. Its purely statistical approach tends to miss things you'd like combined, and combine things you'd rather not - and tuning the thresholds only helps up to a point, beyond which gaining some desired phrases loses others, and vice-versa. The resulting processed text, with new bigram/etc tokens, can be very useful for classification/info-retrieval, but won't necessarily "look right" if presented to end-users.
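(For intuition, here is a toy version of the scoring formula behind that statistical approach — the "original scorer" from Mikolov et al. 2013 that gensim's `Phrases` uses by default. This is an illustration only, not gensim's actual implementation, which also handles delimiter choice, pruning, multiple passes, etc.:)

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent word pair with the formula
        score = (count(a,b) - min_count) * vocab_size / (count(a) * count(b))
    Pairs scoring above a chosen threshold would be joined into one token."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter((s[i], s[i + 1]) for s in sentences
                      for i in range(len(s) - 1))
    vocab_size = len(unigrams)
    return {pair: (n - min_count) * vocab_size
                  / (unigrams[pair[0]] * unigrams[pair[1]])
            for pair, n in bigrams.items()}

sentences = [['new', 'york', 'taxi'],
             ['new', 'york', 'pizza'],
             ['brand', 'new', 'shoes']]
scores = score_bigrams(sentences)
# ('new', 'york') co-occurs often relative to its parts, so it scores
# highest; pairs seen only once score zero here with min_count=1.
```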

If you have some other domain-specific way of identifying the key multi-word phrases, like a glossary or other heuristics, you could also use that to preprocess your text before Word2Vec training, to ensure your preferred multi-word phrases get word-vectors.

- Gordon

Aman Kumar

unread,
May 26, 2020, 1:00:57 AM5/26/20
to Gensim
Thank you, Mr. Mark, for your reply.

Mr Gordon, 
As you mentioned, "If you have some other domain-specific way of identifying the key multi-word phrases, like a glossary or other heuristics, you could also use that to preprocess your text before Word2Vec training, to ensure your preferred multi-word phrases get word-vectors."
This is quite interesting. I do have a glossary/vocab list of my domain's words, which I extracted from the index of a textbook. Can I use that? Can you share some resource/StackOverflow link where I can find how to do that? As far as I understand, skip-gram uses sentences to build individual vectors (based on context); without using gensim's Phrases/Phraser, how can I add multi-word phrases to get their vectors?

Gordon Mohr

unread,
May 26, 2020, 1:32:00 AM5/26/20
to Gensim
All that `Word2Vec` expects are texts filled with tokens - it's up to your own preprocessing to create those tokens. As part of whatever tokenization/normalization/etc you might do to training text, you could detect known multiword phrases and combine them. Maybe that'd use `Phrases`, or your own other process. 

A StackOverflow answer I once wrote showing an example of combining unigrams to bigrams, based on some fixed list of preferred combinations, is at: https://stackoverflow.com/questions/58839049/python-connect-composed-keywords-in-texts/58864397#58864397
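(In that spirit, a minimal sketch of glossary-based preprocessing — the function name and glossary here are hypothetical, not from that answer; the idea is just to greedily replace known multi-word phrases with single underscore-joined tokens before Word2Vec training:)

```python
def merge_phrases(tokens, phrases, delimiter='_'):
    """Greedily merge known multi-word phrases (tuples of tokens) into
    single delimiter-joined tokens, trying longer phrases first so
    'natural language processing' beats any shorter overlapping entry."""
    by_length = sorted((tuple(p) for p in phrases), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for phrase in by_length:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(delimiter.join(phrase))
                i += len(phrase)
                break
        else:  # no phrase matched at this position
            out.append(tokens[i])
            i += 1
    return out

glossary = [('machine', 'learning'), ('natural', 'language', 'processing')]
merge_phrases(['i', 'study', 'machine', 'learning'], glossary)
# → ['i', 'study', 'machine_learning']
```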

- Gordon

Aman Kumar

unread,
May 26, 2020, 2:30:14 AM5/26/20
to Gensim
Thank you so much. I got what you explained. I had mistaken your previous sentence to mean that I could use just the keywords to build the vectors. It's clear now.

I have one more question (which may lead to others).

Using the piece of code below, can I add my own corpus's vocabulary to the existing Google News model?
# load a previously-saved full Word2Vec model (not just its vectors)
model = gensim.models.Word2Vec.load(temporary_filepath)
more_sentences = [
    ['Advanced', 'users', 'can', 'load', 'a', 'model',
     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']
]
# add any new words to the vocabulary, then continue training
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

I need some results from a pretrained model, and most from my own corpus. Is FastText recommended over Word2Vec?

Gordon Mohr

unread,
May 26, 2020, 2:58:39 AM5/26/20
to Gensim
The GoogleNews vectors are vectors alone; it's not a full model on which training can be continued. So there's no supported way to load them into a gensim `Word2Vec` model, then continue training as you suggest. 

Even if you were starting with your own full model, you *can* attempt those sorts of steps – updating the model's vocabulary then doing new training with new texts. But it's not a simple matter of "more data with more words makes a better model". The new training is diluting the old, and if old words aren't repeated in the new texts then words from the different eras may drift arbitrarily away from meaningful comparability. So you're out of the realm of straightforward steps, and have to verify that each step, and the final results, make sense.

If you have enough data to train up new vectors for your important words, are you sure you need to complicate things with someone else's old vectors from another domain?

FastText is essentially a superset of Word2Vec. (Picking certain parameters essentially reduces FT to Word2Vec.) FastText's biggest potential advantage is the ability to synthesize fair 'guess' vectors, from word fragments, for words that weren't in training. Whether that's worth the extra training time & model size will depend on your data/goals. 
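(A toy sketch of that 'guess' mechanism — this is an illustration of the idea, not FastText's actual code: an out-of-vocabulary word's vector is built by averaging the vectors of its known character n-grams, which FastText extracts between boundary markers:)

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with FastText-style '<'/'>' boundary
    markers added before slicing."""
    w = '<' + word + '>'
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def guess_vector(word, ngram_vectors, dim=4):
    """Average the vectors of whichever of the word's character n-grams
    were seen in training -- the gist of FastText's handling of words
    absent from the training vocabulary."""
    known = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(ngram_vectors[g][d] for g in known) / len(known)
            for d in range(dim)]
```

So a typo or rare compound still gets a plausible vector as long as it shares fragments with trained words, at the cost of storing vectors for every n-gram bucket.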

- Gordon

Aman Kumar

unread,
May 26, 2020, 9:19:16 PM5/26/20
to Gensim
Thank you again.

I did some reading about this and stumbled upon a link that talks about retrofitting. Could you please shed some light on whether retrofitting could be used to add some vectors, if that is possible?

Gordon Mohr

unread,
May 27, 2020, 1:15:12 PM5/27/20
to Gensim
Is such retrofitting theoretically possible? Sure, the referenced paper probably describes a technique that could be copied. (Similarly, other possible approaches to "fine-tuning" word-vectors for a domain, or incrementally extending vectors.)

But there's no built-in support for specific processes like that in gensim. And whether such processes are a benefit justifying the complication for any particular project would need to be researched by that particular project. 

Why are you interested in such 'retrofitting'? (What problems have you run into for which that might be a solution?)

- Gordon

Aman Kumar

unread,
Jun 1, 2020, 5:32:31 PM6/1/20
to Gensim
The only thing that came to my mind while posting about retrofitting was adding my corpus to the existing one. I was getting expected results (for a few most_similar queries) from the pre-trained news word embeddings, and thought it would help if my corpus (which is quite small compared to the usual ones) were added to the existing one using retrofitting. That might improve my results. I'm not sure to what extent I was thinking in the right direction.