Use pre-trained word embeddings from HistWords

Phil

Oct 6, 2021, 5:04:03 AM

to Gensim

I have downloaded some pre-trained word embeddings from this site:

https://nlp.stanford.edu/projects/histwords/

In particular I have downloaded the "All English (1800s-1990s)" from here:

http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip

The embeddings are in ".npy" files, which I believe are NumPy arrays.

How can I import them into Gensim?

Thank you and best regards.

Gordon Mohr

Oct 11, 2021, 7:43:17 PM

to Gensim

The `.npy` suffix is used for direct numpy array save-files. But such arrays aren't enough, alone, to reconstruct a set of word-vectors, as they don't describe which vectors are in which rows.

Gensim will create such `.npy` files as part of its multi-file saving format - but along with a root file with other info.

From a glance at your links, and especially their "detailed description of the data" link <https://nlp.stanford.edu/projects/histwords/data_description.html>, it appears they've created their own format, and you'll likely want to use the specific classes listed on that page to reload them.

After loading them using that project's purpose-built classes, if you'd prefer them in a Gensim model object or format, you could try either:

* adding them to a new Gensim `KeyedVectors` object - then using/saving/reloading that `KeyedVectors`

* manually writing them to a file in a more-common format - the plain-text format that's read by `KeyedVectors.load_word2vec_format(path, binary=False)` is pretty simple, and you can see source code that writes it in the `KeyedVectors.save_word2vec_format()` method in Gensim.
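
As a rough sketch of the second option, the function below writes words and their vectors in the plain-text word2vec layout: a header line with `count dimensions`, then one `word v1 v2 ...` line per word. The function name `write_word2vec_text`, the toy data, and the choice to skip all-zero rows are assumptions made for illustration; they are not part of Gensim or HistWords. With real data, the word list and the matrix rows would come from whatever that project's loader classes return.

```python
import os
import tempfile


def write_word2vec_text(words, vectors, path):
    """Write (word, vector) pairs in the plain-text word2vec format.

    The first line holds "<count> <dimensions>"; every following line holds
    a word and its vector components, separated by single spaces.
    All-zero rows are skipped (assumption: such rows carry no usable vector).
    Returns the number of vectors written.
    """
    rows = [
        (word, [float(x) for x in vector])
        for word, vector in zip(words, vectors)
    ]
    rows = [(w, v) for w, v in rows if any(x != 0.0 for x in v)]
    if not rows:
        raise ValueError("no non-zero vectors to write")
    dim = len(rows[0][1])
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"{len(rows)} {dim}\n")
        for word, vector in rows:
            if len(vector) != dim:
                raise ValueError(f"vector for {word!r} has the wrong length")
            out.write(word + " " + " ".join(repr(x) for x in vector) + "\n")
    return len(rows)


# Toy stand-ins for a loaded vocabulary list and the rows of a loaded matrix.
toy_words = ["broadcast", "awful", "unseen"]
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.0, 0.0, 0.0]]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_vectors.txt")
written = write_word2vec_text(toy_words, toy_vectors, toy_path)
```

A file written this way should then be readable with `KeyedVectors.load_word2vec_format(toy_path, binary=False)`.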

- Gordon

Andrey Kutuzov

Oct 12, 2021, 11:46:34 AM

to gen...@googlegroups.com

Hi Phil,

As an alternative, if you need time-dependent ("diachronic") word
embeddings for English, you can use the models described in
https://aclanthology.org/W19-4725/
They can be downloaded as a zip archive from
http://vectors.nlpl.eu/repository/20/188.zip

These models are trained on decades of the Corpus of Historical American
English (COHA), from the 1960s to the 2000s, and are already aligned to be
directly comparable. They are provided in the native word2vec binary and
plain-text formats, so it's easy to use them with Gensim.

--
Solve et coagula!
Andrey

Phil

Nov 2, 2021, 11:43:33 AM

to Gensim

@Gordon Mohr thank you very much for your answer. I'll try to load them using their classes, and maybe post some code that can load them into Gensim directly.

@Andrey Thank you very very much.

The word embeddings you posted might indeed be useful.

I remember reading somewhere that, in order to compare word embeddings trained on different corpora (or on different years of the same corpus), you have to normalize them.

Do you have any bibliographical reference, or some code that can normalize such word embeddings using Gensim?

Thank you very much,

Phil

Andrey Kutuzov

Nov 2, 2021, 8:26:46 PM

to gen...@googlegroups.com

Hi Phil,

There are different definitions of "normalizing" word embeddings. If you
mean post-processing steps like mean centering, etc., then you can
probably start with this paper:
https://aclanthology.org/2021.eacl-main.10/

If you mean aligning two word embedding models to make them comparable,
here is some code adapted for recent versions of Gensim:
https://github.com/wadimiusz/diachrony_webvectors/blob/master/algos/procrustes.py
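
To give a feel for the general idea behind Procrustes alignment, here is a minimal NumPy sketch. It is not the linked script: the function names `center_and_normalize` and `procrustes_rotation` are hypothetical, the random matrices stand in for real models, and applying mean centering plus length normalization before the rotation is an assumption about preprocessing. Row i of both matrices must belong to the same word, so with real models you would first restrict both to their shared vocabulary.

```python
import numpy as np


def center_and_normalize(matrix):
    """Mean-center the columns, then scale every row to unit length."""
    centered = matrix - matrix.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing all-zero rows by zero
    return centered / norms


def procrustes_rotation(source, target):
    """Return the orthogonal matrix R that minimizes ||source @ R - target||.

    Both matrices have one row per shared word, in the same order.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Toy data: the "second model" is just a rotated copy of the first one,
# so a perfect alignment exists and the sketch can be checked.
rng = np.random.default_rng(0)
target = center_and_normalize(rng.normal(size=(50, 5)))
random_rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
source = target @ random_rotation

rotation = procrustes_rotation(source, target)
aligned = source @ rotation  # now lives in the same space as `target`
```

Because the rotation is orthogonal, distances and cosine similarities inside the rotated model stay the same; only its orientation relative to the other model changes.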


Phil

Nov 3, 2021, 9:12:46 AM

to Gensim

Hi Andrey,

Yes, I mean "aligning two word embedding models to make them comparable".

Thank you very much for the link and the code.

Do you have some bibliographical references for this second meaning of "aligning two word embedding models to make them comparable"?

Why do we need to align them before comparing them?

Sorry for the newbie question.

Thank you very much,

Phil

Andrey Kutuzov

Nov 3, 2021, 9:50:50 AM

to gen...@googlegroups.com

On 03.11.2021 14:12, Phil wrote:
> Do you have some bibliographical references for this second meaning of
> "aligning two word embedding models to make them comparable"?
> Why do we need to align them before comparing them?

https://aclanthology.org/C18-1117/
Section 3.3

Phil

Nov 3, 2021, 10:28:37 AM

to Gensim

Thank you very much Andrey :)
