Training Doc2Vec model on huge corpus - using file-based training


Naveed Afzal

Apr 3, 2019, 5:37:01 PM
to Gensim
Hi,
I have over 9 million articles on which I want to train a doc2vec model. Building the vocabulary (using model.build_vocab() on an iterable of TaggedDocument) is itself quite time-consuming. I have a 64-core machine, but training does not make good use of the cores even when I configure it to use all of them. I came across https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb which uses file-based training to make efficient use of all cores during doc2vec model training.

But I am not clear on how to prepare my data for file-based training. Currently, I am using the following:
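from gensim.models.doc2vec import TaggedDocument

class TaggedData(object):
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # Tag each document with its position in the corpus.
        for i, _d in enumerate(self.data):
            yield TaggedDocument(words=clean_doc(_d), tags=[str(i)])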


Here the clean_doc function does some data-specific preprocessing and tokenization. I want to keep using this function and transform my data so that I can pass the result as corpus_file during file-based doc2vec training. Furthermore, the notebook at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb uses: from gensim.utils import save_as_line_sentence, but when I try that import I get the following error message: ImportError: cannot import name 'save_as_line_sentence'
I am using gensim version 3.4.0.

I would really appreciate guidance on solving these issues.

Regards,
Naveed

Gordon Mohr

Apr 4, 2019, 1:33:15 PM
to Gensim
The file-based training mode was added in gensim 3.6.0, released 2018-09-20.


When using the `corpus_file` option, where documents appear one-per-line in a plain text file, you no longer have the option of constructing your own `TaggedDocument` instances. Instead, documents only receive a single tag based on their line-number, per the `Doc2Vec` docs for the `corpus_file` parameter.
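Since the only tag is the line number, any identifiers of your own have to be tracked separately. A minimal sketch of one way to do that, assuming zero-based line numbers; the helper name `write_corpus_with_id_map` and the document IDs are hypothetical, not part of gensim:

```python
import os
import tempfile


def write_corpus_with_id_map(docs, corpus_path):
    """Write one document per line and return a list mapping line number -> doc ID.

    `docs` is an iterable of (doc_id, tokens) pairs. Both names are
    illustrative stand-ins for whatever identifiers your data uses.
    """
    line_to_id = []
    with open(corpus_path, "w", encoding="utf8") as fout:
        for doc_id, tokens in docs:
            fout.write(" ".join(tokens) + "\n")
            line_to_id.append(doc_id)
    return line_to_id


# Tiny made-up corpus, only to show the shape of the data.
docs = [
    ("doc-a", ["diabetes", "post", "operative"]),
    ("doc-b", ["type", "ii", "idiopathic"]),
]
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
line_to_id = write_corpus_with_id_map(docs, corpus_path)
```

After training on `corpus_path`, the doc-vector for line-number tag `n` would then correspond to `line_to_id[n]`.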


- Gordon

Naveed Afzal

Apr 4, 2019, 1:44:44 PM
to Gensim
What is your suggestion for handling such a large corpus for doc2vec training? How should I proceed?
Thanks,
-Naveed

Gordon Mohr

Apr 4, 2019, 7:16:42 PM
to Gensim
It looks like the only doc-tags you're assigning are an integer serial number, starting at 0 - so the file-based training should work OK for you. I suggest you upgrade your gensim and try that. 

(Note that if any docs are more than 10,000 words, they'll be silently truncated. In the classic stream/object based interface, you could split such docs into multiple parts, and assign each the same tag, to get roughly the same effect as one oversized document. With the auto-tagging of the file-based approach, you can't quite do that. But it might not make that much difference, depending on what you ultimately intend to use the model or learned doc-vectors for.)
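For the classic stream/object interface, the splitting described above could be sketched roughly as follows. The function name `split_oversized` is hypothetical; the 10,000-word limit is the figure mentioned in this message. Each yielded pair could be wrapped as `TaggedDocument(words=chunk, tags=tags)`.

```python
def split_oversized(words, tag, max_words=10000):
    """Yield (chunk, [tag]) pairs, each chunk holding at most max_words tokens.

    Every chunk of the same document carries the same tag, so all of its
    text contributes to a single doc-vector.
    """
    for start in range(0, len(words), max_words):
        yield words[start:start + max_words], [tag]


# A made-up 25,000-token document splits into three chunks sharing tag "0".
long_doc = ["w%d" % i for i in range(25000)]
parts = list(split_oversized(long_doc, tag="0"))
chunk_sizes = [len(chunk) for chunk, _tags in parts]
```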

- Gordon

Naveed Afzal

Apr 5, 2019, 12:43:10 AM
to Gensim
Hi Gordon,
I have upgraded the gensim package and now have gensim version 3.7.1.
I have also preprocessed my dataset (over 9 million documents) into the following format (a small sample), with one document per line:

[effect metformin glyburide tablets lactic acid patients type diabetes po outpatient lactate measurements collected two clinical trials fixed combination metformin glyburide tablets one stud.......
exercise randomized receive either placebo glyburide weeks lactate concentrations measured baseline weeks second study patients well controlled .....]

When I used 
from gensim.models.word2vec import LineSentence
docs = LineSentence(docs_lst)
from gensim.utils import save_as_line_sentence
save_as_line_sentence(docs, "my_corpus.txt")

but "my_corpus.txt" does not contain anything. Can you please tell me what I am missing?

I want to use file-based training. Can you please guide me on what steps I have to take next? I tried to follow the example at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb
but it did not work for me.
Thanks

Gordon Mohr

Apr 5, 2019, 11:51:44 AM
to Gensim
What's in `docs_lst`? Can you show a few entries, say `docs_lst[0:2]` and `len(docs_lst)`?

Was the file `my_corpus.txt` non-existent before running your code, then existent-but-empty? Were there any errors displayed?

- Gordon

Naveed Afzal

Apr 5, 2019, 11:57:49 AM
to Gensim
docs_lst is a list that contains one document per entry, and its length is over 9 million.
docs_lst[0:2] will give you the output:

[effect metformin glyburide tablets lactic acid patients type diabetes po outpatient lactate measurements collected two clinical trials fixed combination metformin glyburide tablets one stud.......
exercise randomized receive either placebo glyburide weeks lactate concentrations measured baseline weeks second study patients well controlled .....]


Gordon Mohr

Apr 5, 2019, 12:10:01 PM
to Gensim
A "document per line" doesn't describe what actual Python objects/types it might be, nor does that output appear to be what would actually show for evaluating `docs_lst[0:2]` (or `print(docs_lst[0:2])`). If you're having a problem, it's essential to see exactly what's involved, not a human-interpreted summary.
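One way to report exactly what a variable holds is to print its types and `repr()` instead of describing them. A small sketch, with a hypothetical helper `describe` and stand-in data in place of the real corpus:

```python
def describe(obj, preview=60):
    """Return an exact, short description of a sequence and its first item."""
    first = obj[0]
    return "%s of %d items; first item is %s: %r" % (
        type(obj).__name__, len(obj), type(first).__name__, first[:preview])


# Stand-in data: a plain list of strings, one string per document.
docs_lst = ["diabetes post operative", "type ii idiopathic"]
summary = describe(docs_lst)
print(summary)
```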

- Gordon

Naveed Afzal

Apr 5, 2019, 12:14:18 PM
to Gensim
docs_lst[0:2] gives the following output:

['diabetes post operative hyperglycemia predict complications length stay coronary artery bypass patients diabetes markedly increases risk cardiovascular disease many individuals cvd undiagnosed diabetes also treatment hyperglycemia post operative period cab surgery often sub optimal two hundred patients underwent cab surgery evaluation diabetes status consisting preoperative immediate post operative glucose day fasting glucose data collected age bmi length stay readmission within days death patients previously diagnosed diabetes type type another patients suspected diabetes defined day fasting glucose greater mg dl group one preoperative normal range supporting criterion thus total number patients diagnosed probable diabetes total complications defined readmissions within days death three deaths patients diabetes strongest predictors complications post op glucose gte mg dl diagnosed suspected diabetes previously diagnosed diabetes age bmi predictive complications mean length stay entire group days post op glucose gte mg dl days post op glucose gte mg dl days age strongest predictor length stay post op glucose also predictor length stay independent diabetes diagnosed suspected predictive conclusion results support aggressive approach carefully screening cab patients diabetes developing strategies intensive treatment post operative hyperglycemia goal reduce hospital stay complications mortality',

 'type ii idiopathic macular telangiectasia soft confluent drusen purpose describe simultaneous presentation soft confluent drusen type idiopathic macular telangiectasia eyes one patient methods year old man bilateral metamorphopsia gradual reduction central vision underwent complete ophthalmologic examination results patient fundus biomicroscopy revealed soft confluent drusen cystic appearance within fovea fluorescein angiography showed late dye leakage interestingly indocyanine green angiography showed absence late hypercyanescence spectral domain optical coherence tomography clearly revealed presence bilateral foveal cysts thinning loss normal architecture outer retina well absence retinal thickening within parafoveolar area showing discrete late dye leakage fa based findings patient diagnosed nonexudative age related macular degeneration foveal soft confluent drusen coincident nonproliferative type imt conclusions knowledge previously reported case simultaneous presentation soft confluent drusen type imt report highlights importance icga oct correct diagnosis cases introduction idiopathic juxtafoveal retinal telangiectasis clinical entity distinctly different secondary telangiectasis result various diseases type idiopathic macular telangiectasia common group patients type imt usually present mild blurring vision one eyes fifth sixth decades life fluorescein angiography eyes usually reveals temporal parafoveal telangiectatic vessels intraretinal fluorescein leakage spares foveal center absence cystic macular edema possible presence lamellar hole key distinguishing features entity charbel issa recently confirmed central retinal thickness slightly less patients type imt compared normative data describe presence different conditions eyes one patient knowledge previously reported case simultaneous presentation soft confluent drusen type imt case report year old woman referred department history bilateral metamorphopsia gradual reduction central vision patient signed 
comprehensive consent form according good clinical practice guidelines proceeding examinations otherwise healthy medical evaluations excluded presence diabetes best corrected visual acuity right eye left eye fundus biomicroscopy eyes cystic appearance evident within fovea moreover several soft confluent drusen foveal parafoveal region small hard drusen concentrated perifoveal region fa revealed discrete late dye leakage mainly temporal parafoveal region le indocyanine green angiography showed areas focal hypocyanescence due confluence soft drusen absence late hypercyanescence choroidal neovascularization detected icga spectral domain optical coherence tomography clearly revealed presence bilateral foveal cysts thinning loss normal architecture outer retina well presence small drusenoid pigment epithelium detachments therefore spectralis sd oct confirmed absence cnvs eyes interestingly spectralis sd oct scans demonstrate retinal thickening within parafoveolar area based findings patient diagnosed nonexudative age related macular degeneration foveal soft confluent drusen coincident nonproliferative type imt discussion type imt clinical fluorescein angiographic oct findings unlike macular diseases gass blodi divided type imt subgroups type describing bilateral occult nonexudative form imt found adults common form imt divided stages progressed subtle presence parafoveolar telangiectasias right angle venules hyperplastic retinal pigment epithelium ultimately subretinal neovascularization yannuzzi recently updated classification scheme imt group referred perifoveal telangiectasia entity divided stages nonproliferative characterized perifoveal telangiectasis crystalline deposits subretinal pigment plaques right angle vessels inner lamellar cysts proliferative defined presence subretinal neovascularization report case year old woman simultaneous presentation soft confluent drusen type imt knowledge first description coexistence patient fundus biomicroscopy revealed soft 
confluent drusen cystic appearance within fovea fa showed late dye leakage interestingly icga showed absence late hypercyanescence spectralis sd oct clearly revealed presence bilateral foveal cysts thinning loss normal architecture outer retina well absence retinal thickening within parafoveolar area thus icga oct demonstrated absence cnvs patient showing simultaneous presentation soft confluent drusen type imt due absence cnv decided treat patient anti vascular endothelial growth factor agents report highlights importance complete examination treating leaking lesion sense injecting patient anti vegf agents based isolated oct isolated fa icga cannot considered good strategy case correct diagnosis achieved means icga oct believe patients showing coexistence soft confluent drusen type imt cystic appearance fundus biomicroscopy well discrete late dye leakage fa parafoveal region may erroneously lead diagnosis exudative agerelated macular degeneration icga oct performed conclusion treating patients anti vegf injections complete thorough examination remains mandatory least first examination angiographic examinations still needed references gass jdm stereoscopic atlas macular disease ed st louis mo mosby gass jd oyakawa rt idiopathic juxtafoveal retinal telangiectasis arch ophthalmol gass jd oyakawa rt idiopathic juxtafoveal retinal telangiectasis arch ophthalmol gass jdm blodi ba idiopathic juxtafoveolar retinal telangiectasis update classification follow study ophthalmology gass jdm blodi ba idiopathic juxtafoveolar retinal telangiectasis update classification follow study ophthalmology charbel issa helb hm holz fg scholl hp mactel study group correlation macular function retinal thickness nonproliferative type idiopathic macular telangiectasia ophthalmol charbel issa helb hm holz fg scholl hp mactel study correlation macular function retinal thickness nonproliferative type idiopathic macular telangiectasia ophthalmol gaudric ducos de lg cohen sy massin haouchine 
optical coherence tomography group idiopathic juxtafoveolar retinal telangiectasis arch ophthalmol gaudric ducos de lg cohen sy massin haouchine optical coherence tomography group idiopathic juxtafoveolar retinal telangiectasis arch ophthalmol chan duker js ko th fujimoto jg schuman js normal macular thickness measurements healthy eyes using stratus optical coherence tomography arch ophthalmol chan duker js ko th fujimoto jg schuman js normal macular thickness measurements healthy eyes using stratus optical coherence tomography arch ophthalmol yannuzzi la bardal amc freund kb idiopathic macular telangiectasia arch ophthalmol yannuzzi la bardal amc freund kb idiopathic macular telangiectasia arch ophthalmol']

Gordon Mohr

Apr 5, 2019, 12:27:01 PM
to Gensim
OK, your `docs_lst` is a simple list-of-strings. Per the `save_as_line_sentence()` documentation, it is expecting an "iterable of iterables of strings". A list-of-lists-of-strings would work, like say:

[
  ['diabetes', 'post', 'operative'],
  ['type', 'ii', 'idiopathic'],
]

But, since you've already got usable "lines" (that don't need to be tokenized just to be re-connected to write out), you could just write those lines to a suitable file yourself. You could use the source of that function as an inspiration, just without doing the `' '.join(...)` part.
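A minimal sketch of that idea using only the standard library, not the gensim function itself. The helper name `write_lines` is hypothetical; collapsing whitespace is an added precaution in case a string contains an embedded line break, which would otherwise split one document across two lines:

```python
import os
import tempfile


def write_lines(lines, path):
    """Write each space-separated document string on its own line; return the count."""
    count = 0
    with open(path, "w", encoding="utf8") as fout:
        for line in lines:
            # Normalize runs of whitespace (including newlines) to single spaces.
            fout.write(" ".join(line.split()) + "\n")
            count += 1
    return count


# Stand-in for the real list of 9 million strings.
sample = ["diabetes post operative", "type ii\nidiopathic"]
out_path = os.path.join(tempfile.mkdtemp(), "my_corpus.txt")
written = write_lines(sample, out_path)
```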

- Gordon

Naveed Afzal

Apr 5, 2019, 12:33:51 PM
to Gensim
Do you mean like this:
Corpus_file = 'corpus.txt'

with smart_open(filename, mode='wb', encoding='utf8') as fout:
    for sentence in corpus:
        line = any2unicode(sentence + '\n')
        fout.write(line)

Gordon Mohr

Apr 5, 2019, 2:25:44 PM
to Gensim
That's close enough you should now consult the ultimate and always-available authority: running it in your Python environment to see if it does what you want.

- Gordon

Naveed Afzal

Apr 5, 2019, 2:59:27 PM
to Gensim
Gordon, if I can convert my data into this format (a list containing the tokens of one document per entry):

  ['diabetes', 'post', 'operative'],
  ['type', 'ii', 'idiopathic'],

Then how can I convert such data into an "iterable of iterables of strings"? Can you please provide guidance?
I have seen this example:
import itertools

import gensim.downloader as api
from gensim.parsing.preprocessing import preprocess_string

def processed_corpus():
    raw_corpus = api.load('wiki-english-20171001')
    for article in raw_corpus:
        # concatenate all section titles and texts of each Wikipedia article into a single "sentence"
        doc = '\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))
        yield preprocess_string(doc)

In my case I don't have section_titles and section_texts.

Thanks for guidance.

Gordon Mohr

Apr 5, 2019, 4:47:58 PM
to Gensim
I think you are confusing things that are precise types in Python (like `list`), with things that are loose descriptions of possible roles ('documents'), and with things that are general interfaces (like 'iterables'), and with things that only make sense in the context of files ('lines').

If you want a file that's ready to be passed to the `corpus_file` interface, that file should have space-delimited words and newline-delimited lines, and so would look like (without the `---` to indicate start/end):

---
diabetes post operative
type ii idiopathic
---

If you had a Python list, where each item in the list was also a list, and each item in those lists was a single-word string, then that Python object would print itself as something like:

[
  ['diabetes', 'post', 'operative'],
  ['type', 'ii', 'idiopathic'],
]

But that's just literal-notation for describing that list-of-lists-of-strings; you wouldn't necessarily use that format, with brackets and commas and quotes, as storage. Lists are themselves iterable, so a list-of-lists-of-strings is already also an iterable-of-iterables-of-strings. But if you already have your data in some format that's closer to the file-format above, there's no need to get it into this list-of-lists. Just get it into the right on-disk format. (And that's more a matter of basic Python IO/data-structures, than gensim.)
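To make the distinction concrete, here is a small sketch showing that the on-disk format and the list-of-lists carry the same information. The class name `LineTokens` is hypothetical; it is a re-iterable object that is not a list, yet still serves as an iterable-of-iterables-of-strings:

```python
import os
import tempfile


class LineTokens:
    """Re-iterable view of a one-document-per-line file as lists of tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the object can be iterated repeatedly.
        with open(self.path, encoding="utf8") as fin:
            for line in fin:
                yield line.split()


# Write the two-line example file from the message above.
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(path, "w", encoding="utf8") as fout:
    fout.write("diabetes post operative\ntype ii idiopathic\n")

as_iterable = LineTokens(path)
as_list = [["diabetes", "post", "operative"], ["type", "ii", "idiopathic"]]
same_content = list(as_iterable) == as_list
```

Only the plain text file is needed for `corpus_file`; the in-memory forms are shown purely for comparison.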

- Gordon

Naveed Afzal

Apr 6, 2019, 9:25:29 AM
to Gensim
Hi Gordon,

Thanks for the guidance. I was able to convert my corpus into a corpus file and initiate the doc2vec model. The model built the vocabulary and then crashed during the first iteration of training with "Segmentation fault". What does this error mean in the context of doc2vec model training, and how can it be handled?
Thanks

Gordon Mohr

Apr 7, 2019, 2:58:19 PM
to Gensim
That indicates some sort of raw-addressing error, and isn't an expected failure mode unless there's some bug.

Most interesting to debug would be to test how reproducible the failure is: with INFO logging enabled, does it always happen at the same step, and especially, the same input data? If so, can a minimal reproducible example, with a much smaller corpus, be crafted?
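A rough sketch of the setup for such a test: INFO logging is switched on through the standard `logging` module, and a hypothetical helper `head_corpus` copies the first lines of the corpus file into a smaller file to try training on. The file names and line counts are placeholders:

```python
import logging
import os
import tempfile
from itertools import islice

# INFO-level logging, so progress messages show how far training got.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)


def head_corpus(src_path, dst_path, n_lines):
    """Copy the first n_lines of src_path to dst_path; return the number copied."""
    copied = 0
    with open(src_path, encoding="utf8") as fin, \
            open(dst_path, "w", encoding="utf8") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
            copied += 1
    return copied


# Demonstration on a made-up five-line file.
workdir = tempfile.mkdtemp()
full_path = os.path.join(workdir, "corpus.txt")
small_path = os.path.join(workdir, "corpus_small.txt")
with open(full_path, "w", encoding="utf8") as fout:
    fout.write("".join("doc %d text\n" % i for i in range(5)))
copied = head_corpus(full_path, small_path, 2)
```

Halving the number of lines repeatedly is one way to narrow down whether particular input triggers the crash. Separately, running the script with Python's `-X faulthandler` option prints a Python-level traceback when the process receives a segmentation fault, which may help show where it happened.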

- Gordon