Loading BioWordVec pretrained model


seva....@gmail.com

Apr 24, 2019, 9:05:02 AM
to Gensim
Hi everyone, 

I am trying to load a pre-trained model, available here: https://github.com/ncbi-nlp/BioSentVec/wiki. The model I am looking at is the BioWordVec model (https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin), trained with the fastText CLI.

This should be straightforward, but I can't make it work. The code itself doesn't produce an error, but it's not loading anything either; it's as though it's constantly working on loading the bin/model file.

The code is taken from the gensim documentation on gensim's fastText interface (https://radimrehurek.com/gensim/models/fasttext.html).

from gensim.models.fasttext import load_facebook_vectors

gensim_fasttext_model = load_facebook_vectors(root_path + "models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin")


The same thing happens when I try to load it with load_facebook_model.

Any idea what might be happening would be greatly appreciated, as this should be quite straightforward. 

My env is:
Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION 1

Mueller, Mark-Christoph

Apr 24, 2019, 9:11:23 AM
to gen...@googlegroups.com

Hi Seva,


This model is pretty big; how long did you wait for it to finish loading? I also noticed that reading even a much smaller model can take some time.


Can you see memory being occupied, using a system monitor,  while loading?


Best, Christoph


Mark-Christoph Müller

Research Associate

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg
Germany

_________________________________________________
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger

From: gen...@googlegroups.com <gen...@googlegroups.com> on behalf of seva....@gmail.com <seva....@gmail.com>
Sent: Wednesday, April 24, 2019 3:05 PM
To: Gensim
Subject: [gensim:12515] Loading BioWordVec pretrained model
 
--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

seva....@gmail.com

Apr 24, 2019, 9:53:41 AM
to Gensim


On Wednesday, April 24, 2019 at 3:11:23 PM UTC+2, Mueller, Mark-Christoph wrote:

Hi Seva,


This model is pretty big; how long did you wait for it to finish loading? I also noticed that reading even a much smaller model can take some time.


Well, it didn't load. That's the problem.
 

Can you see memory being occupied, using a system monitor,  while loading?


It seems it keeps on loading until it fills the memory, and then produces a system error (process finished with exit code 9). The same behavior can be seen on my Mac (16 GB of memory) and on a server (64 GB of memory).

When I load the model with the fastText interface (from https://github.com/facebookresearch/fastText/tree/master/python), it takes approx. 90 seconds to load. It doesn't use too much RAM but fills the swap memory (circa 30 GB). Unfortunately, I need an interface which supports the __getitem__ method (e.g. getting a representation the old-fashioned way: model[word]); fastText doesn't.

Best,
Jurica


Radim Řehůřek

Apr 24, 2019, 11:00:03 AM
to Gensim
Hi Seva,

it looks like you're already using the latest version of Gensim, right?

If that's the case, do you mind opening a reproducible (with full context) report on Github, https://github.com/RaRe-Technologies/gensim/issues?

We'll try to have a look at why Gensim is behaving so differently compared to FB's fastText.

Cheers,
Radim

Gordon Mohr

Apr 24, 2019, 3:22:24 PM
to Gensim
The link you provided (<https://github.com/ncbi-nlp/BioSentVec/wiki>) shows the pre-trained vectors being loaded as either vectors-only (via `KeyedVectors.load_word2vec_format()`) or as a `sent2vec` model (via `model = sent2vec.Sent2vecModel(); model.load_model('model.bin')`) – `sent2vec` being a separate package unrelated to `gensim`. There's no suggestion there that the files are loadable as a plain FastText model.
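(As an aside: the `.vec` companion file of such releases is the plain word2vec text format, which is what `KeyedVectors.load_word2vec_format()` parses. A toy, self-contained sketch of that layout, with invented numbers, not the real file:)

```python
import io

# The word2vec text format: a "count dims" header line, then one
# "word v1 v2 ..." line per word. Tiny invented example data.
sample = io.StringIO("2 3\nprotein 0.1 0.2 0.3\ncell 0.4 0.5 0.6\n")

count, dims = map(int, sample.readline().split())
vectors = {word: [float(x) for x in rest]
           for word, *rest in (line.split() for line in sample)}

print(count, dims, vectors["cell"])  # → 2 3 [0.4, 0.5, 0.6]
```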

As the file you're trying to load (as viewed at <https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/>) is 26GB on disk, I wouldn't expect any useful loading success, from any library, on a 16GB machine. (Even if it could succeed by using lots of virtual-memory, that could take a lot of time, and then be uselessly slow once loaded.) 
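(For scale, a back-of-the-envelope estimate, assuming float32 storage as gensim uses by default; the vocabulary size below is illustrative, not BioWordVec's actual count. A full fastText model additionally stores millions of n-gram bucket vectors, which is part of why the .bin is far larger than the word vectors alone:)

```python
# Rough RAM estimate for dense word vectors, assuming float32
# storage (4 bytes per value).
def vector_memory_gb(n_vectors, dims, bytes_per_value=4):
    return n_vectors * dims * bytes_per_value / 1024**3

# e.g. a hypothetical 16-million-word vocabulary at 200 dimensions:
print(round(vector_memory_gb(16_000_000, 200), 1))  # → 11.9
```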

- Gordon

Andrey Kutuzov

Apr 24, 2019, 5:08:05 PM
to gen...@googlegroups.com
It also always helps to enable logging and see what is actually happening.


--
Solve et coagula!
Andrey

Radim Řehůřek

Apr 25, 2019, 3:38:13 AM
to Gensim
On Wednesday, April 24, 2019 at 9:22:24 PM UTC+2, Gordon Mohr wrote:
The link you provided (<https://github.com/ncbi-nlp/BioSentVec/wiki>) shows the pre-trained vectors being loaded as either vectors-only (via `KeyedVectors.load_word2vec_format()`) or as a `sent2vec` model (via `model = sent2vec.Sent2vecModel(); model.load_model('model.bin')`) – `sent2vec` being a separate package unrelated to `gensim`. There's no suggestion there that the files are loadable as a plain FastText model.

As the file you're trying to load (as viewed at <https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/>) is 26GB on disk, I wouldn't expect any useful loading success, from any library, on a 16GB machine. (Even if it could succeed by using lots of virtual-memory, that could take a lot of time, and then be uselessly slow once loaded.) 


But this contradicts Seva's report of "When I load the model with the fastText interface (from https://github.com/facebookresearch/fastText/tree/master/python) it takes approx. 90 secs to load."

(which I take at face value; I'm not familiar with this particular model or its format)

-rr

seva....@gmail.com

Apr 25, 2019, 7:39:06 AM
to Gensim


On Wednesday, April 24, 2019 at 5:00:03 PM UTC+2, Radim Řehůřek wrote:
Hi Seva,

it looks like you're already using the latest version of Gensim, right?

If that's the case, do you mind opening a reproducible (with full context) report on Github, https://github.com/RaRe-Technologies/gensim/issues?

Will do, today or tomorrow. 

Do you have any instructions or a link describing what information I should include?
 

We'll try to have a look at why Gensim is behaving so differently compared to FB's fastText.

That would be great! 

seva....@gmail.com

Apr 25, 2019, 7:51:17 AM
to Gensim
On Thursday, April 25, 2019 at 9:38:13 AM UTC+2, Radim Řehůřek wrote:
On Wednesday, April 24, 2019 at 9:22:24 PM UTC+2, Gordon Mohr wrote:
The link you provided (<https://github.com/ncbi-nlp/BioSentVec/wiki>) shows the pre-trained vectors being loaded as either vectors-only (via `KeyedVectors.load_word2vec_format()`) or as a `sent2vec` model (via `model = sent2vec.Sent2vecModel(); model.load_model('model.bin')`) – `sent2vec` being a separate package unrelated to `gensim`.

If you read more closely, you'll see that BioWordVec is trained with fastText, and is available both as a model file (.bin) and as the vectors themselves (.vec).

The BioSentVec model is, indeed, trained with sent2vec; however, that is not the model I am trying to use.

 
There's no suggestion there that the files are loadable as a plain FastText model.

There does not have to be, as the BioWordVec model is trained with Facebook's fastText tool. The first sentence of the paragraph on BioWordVec reads:

We applied fastText to compute 200-dimensional word embeddings.

Additionally, I am able to load the model (.bin) with Facebook's fastText Python interface (0.8.2) and get embedding representations for words both in and out of vocabulary.
 

As the file you're trying to load (as viewed at <https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/>) is 26GB on disk, I wouldn't expect any useful loading success, from any library, on a 16GB machine. (Even if it could succeed by using lots of virtual-memory, that could take a lot of time, and then be uselessly slow once loaded.) 

I can successfully load the model (.bin) with the fastText Python interface (0.8.2). It does use a fair amount of virtual memory (35-ish GB). It is a big model. Loading takes approx. 100 seconds, which is a reasonable time IMHO.
 
But this contradicts Seva's report of "When I load the model with the fastText interface (from https://github.com/facebookresearch/fastText/tree/master/python) it takes approx. 90 secs to load."

(which I take at face value; I'm not familiar with this particular model or its format)

Exactly. I can load the model (.bin file) with Facebook's fastText Python interface, on both the Mac (late 2018, 16 GB RAM) and Linux (64 GB RAM).

When I try to use gensim's interface, the system uses all memory resources before the process gets killed. I did manage to load the .bin file with gensim (load_facebook_vectors) on the server; it took about 35 minutes, using 36 GB of virtual memory and 35 GB of RAM.

It would be great if we could get to the bottom of this! 

 

Gordon Mohr

Apr 25, 2019, 3:57:40 PM
to Gensim
On Thursday, April 25, 2019 at 12:38:13 AM UTC-7, Radim Řehůřek wrote:
On Wednesday, April 24, 2019 at 9:22:24 PM UTC+2, Gordon Mohr wrote:
The link you provided (<https://github.com/ncbi-nlp/BioSentVec/wiki>) shows the pre-trained vectors being loaded as either vectors-only (via `KeyedVectors.load_word2vec_format()`) or as a `sent2vec` model (via `model = sent2vec.Sent2vecModel(); model.load_model('model.bin')`) – `sent2vec` being a separate package unrelated to `gensim`. There's no suggestion there that the files are loadable as a plain FastText model.

As the file you're trying to load (as viewed at <https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/>) is 26GB on disk, I wouldn't expect any useful loading success, from any library, on a 16GB machine. (Even if it could succeed by using lots of virtual-memory, that could take a lot of time, and then be uselessly slow once loaded.) 


But this contradicts Seva's report of "When I load the model with the fastText interface (from https://github.com/facebookresearch/fastText/tree/master/python) it takes approx. 90 secs to load."

(which I take at face value; I'm not familiar with this particular model or its format)


Without evidence that load-by-Facebook's-Fasttext was successfully usable for something, that report is suspect. (What kind of interface loads a model but then can't even return the vectors for individual words, as was also reported?) The attempt may have loaded garbage, or errored in a way that wasn't recognized. 

- Gordon

seva....@gmail.com

Apr 29, 2019, 9:36:55 AM
to Gensim
Hi Gordon, 

On Thursday, April 25, 2019 at 9:57:40 PM UTC+2, Gordon Mohr wrote:

Without evidence that load-by-Facebook's-Fasttext was successfully usable for something, that report is suspect.

What kind of evidence would you prefer/need? Would it not be easier to just try for yourself and skip this back and forth?
 
(What kind of interface loads a model but then can't even return the vectors for individual words, as was also reported?) The attempt may have loaded garbage, or errored in a way that wasn't recognized. 

Where was this reported? Not by me. 

To summarise: 

While using FB's fastText Python library, the BioWordVec embeddings load successfully and work as advertised (i.e. they produce representations of both in- and out-of-vocabulary words). This works on both a MacBook Pro (late 2018, 16 GB of RAM, macOS Mojave 10.14.4) and on a Linux server (CentOS 7) with 64 GB of RAM. Loading takes 100-ish seconds and uses some swap memory (but is quite efficient in terms of virtual/RAM memory).

The code which successfully does this is:

import fastText as fasttext
ftModel = fasttext.load_model('path/to/model/' + 'BioWordVec_PubMed_MIMICIII_d200.bin')

to load the model, and then either
ftModel.get_word_vector(word) to get the representation of a word, or
ftModel.get_input_vector(word_id) to get the representation of the word with the given ID.

When I try to load the same model with gensim, either with load_facebook_model() or load_facebook_vectors(), the code:
1. fails to load the model on the Mac; the process exits as all memory gets eaten away
2. successfully loads on the CentOS server, but only after more than 30 minutes, using 35-ish GB of virtual memory and RAM.

Both the Mac and the CentOS server have the same libs (and their versions) installed:


Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION 1

I have opened an issue on GitHub. Hopefully this can get inspected; if I am doing something wrong, I am happy to be educated.

Best,
Jurica

Radim Řehůřek

Apr 29, 2019, 10:51:05 AM
to Gensim
Thanks for the github issue Jurica! We'll look into this.

Radim

Gordon Mohr

Apr 29, 2019, 2:48:59 PM
to Gensim
On Monday, April 29, 2019 at 6:36:55 AM UTC-7, seva....@gmail.com wrote:
Hi Gordon, 

On Thursday, April 25, 2019 at 9:57:40 PM UTC+2, Gordon Mohr wrote:

Without evidence that load-by-Facebook's-Fasttext was successfully usable for something, that report is suspect.

What kind of evidence would you prefer/need? Would it not be easier to just try for yourself and skip this back and forth?

Downloading a 26GB file, and installing extra software, is much harder than requesting clarification! And since the page you linked, <https://github.com/ncbi-nlp/BioSentVec/wiki>, makes no mention of the files being compatible with plain FastText, and instead only shows examples of loading using an alternate library (`sent2vec`), it was curious to see it even being attempted. Though I believe you if you say it works, giving meaningful vectors (not just any results), it's not uncommon for such software to silently fail, or produce garbage results, when supplied with an almost-but-not-quite-right file format. (And certain kinds of silent failures or partial failures might also explain why a 26GB single file appears to load without problems on a 16GB machine, when in my experience that will more often fail or trigger intolerable levels of swapping.)

(What kind of interface loads a model but then can't even return the vectors for individual words, as was also reported?) The attempt may have loaded garbage, or errored in a way that wasn't recognized. 

Where was this reported? Not by me. 

I misinterpreted your report that "Unfortunately, I need an interface which supports the __getitem__ method (e.g. getting a representation the old-fashioned way: model[word]); fastText doesn't." as implying you couldn't get individual words from that loaded model.

If in fact the only problem with that model is the lack of `__getitem__` method for []-style access, a simple workaround could be to supply your own adapter class with that method, eg something like:

    class FTAdapter:
        def __init__(self, ft):
            self.ft = ft
        def __getitem__(self, word):
            return self.ft.get_word_vector(word)
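A quick self-contained check of that idea (repeating the adapter so it runs standalone, with `FakeFT` as a hypothetical stub standing in for a model loaded via fastText; names invented for illustration):

```python
class FTAdapter:
    """Wraps any object exposing get_word_vector() so model[word] works."""
    def __init__(self, ft):
        self.ft = ft
    def __getitem__(self, word):
        return self.ft.get_word_vector(word)

class FakeFT:
    """Hypothetical stand-in for a real loaded fastText model."""
    def get_word_vector(self, word):
        return [float(len(word))] * 3  # toy 3-dimensional "vector"

model = FTAdapter(FakeFT())
print(model["protein"])  # → [7.0, 7.0, 7.0]
```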

- Gordon

John Dagdelen

May 6, 2020, 9:04:44 PM
to Gensim
I was able to successfully run this model using fastText and get word embeddings from it. I'm still experiencing the same excruciating load times these users reported, even on a system with 124 GB of memory.

-John 

Sean Bethard

May 7, 2020, 12:50:39 AM
to gen...@googlegroups.com

Nice!

Bummer. Which users?

Sean Bethard


John Dagdelen

May 7, 2020, 1:57:57 AM
to gen...@googlegroups.com
The users I'm referring to are the people who were participating in this conversation about a year ago. The issue of slow loading times still seems to be relevant when using the NCBI PubMed embeddings.

-John 

Cindy Wang

May 19, 2020, 2:33:55 PM
to Gensim
Hi all, 

I am wondering whether this problem has been solved. I am trying to load these word vectors using gensim, but I cannot make it work.

Thanks, 
Cindy

Cindy Wang

May 19, 2020, 2:36:55 PM
to Gensim
Hi John, 

Could you please share the code showing how you use fastText to get the word embeddings? I am using
fasttext.load_model
and it doesn't work.

Thanks, 
Cindy