FastText on large corpora (streaming training)?


Jurica Seva

Jul 6, 2017, 8:55:27 AM
to gensim
Hi everyone, 

I am trying to train FastText using gensim on the entire PMC/PubMed dump (roughly 150 GB), but I can't seem to figure out how to initialize the object and/or set up the iterator that submits the sentences to the train method. I tried it with some dummy txt data to check that everything works, and it does. I just can't figure out how to train on a very large corpus that doesn't fit into RAM.

Also, what is the difference between the sentences param in FastText's init() and the corpus_file param in the train() method? Through which param should I submit the iterator to the pipeline?

My code is below:

import os
import spacy

from multiprocessing import cpu_count
from gensim.models.wrappers import FastText

class MySentences(object):

    def __init__(self, dirname):
        self.dirname = dirname
        self.nlp = spacy.load('en')
        self.textFields = ['title', 'full_title', 'abstract']

    def __iter__(self):
        # walk the dump directory and yield one tokenized sentence at a time
        for root, dirs, files in os.walk(self.dirname, topdown=True):
            for filename in files:
                fullpath = os.path.join(root, filename)
                print fullpath
                articles = getArticle(fullpath)  # my helper that parses one dump file

                for article in articles:
                    # join the text fields of interest into a single string
                    text = u'.  '.join([article[x].strip() for x in article if x in self.textFields]).strip()
                    print article
                    print text, type(text)
                    # sentence-split with spaCy
                    tokens = self.nlp(text, parse=True)
                    sentences = [sent.string.strip() for sent in tokens.sents]

                    for line in sentences:
                        print line.split()
                        yield line.split()

sentences = MySentences(r"/home/docClass/files/")

model = FastText(workers=cpu_count(), sentences=sentences, size=300)
# not sure how the iterator and the fasttext binary are supposed to fit together here
trained = model.train(ft_path=HOME + 'tools/fastText/fasttext', model='skipgram')

Best,
J. 

Ivan Menshikh

Jul 7, 2017, 3:32:11 AM
to gensim, jayantj...@gmail.com
Hi Jurica,
I hope Jayant can help you

jayant jain

Jul 7, 2017, 9:13:13 AM
to gensim
Hi Jurica,

Apologies for the confusion in the documentation.

Gensim currently doesn't have a native Python implementation of FastText - it wraps the original FastText binary to train the model. Hence, you should use the train method and pass it the file containing your text instead of an iterator, and it will train and load the model correctly.

from gensim.models.wrappers import FastText

model = FastText.train('/Users/kofola/fastText/fasttext', corpus_file='/path/to/text/file')
print model['forests']  # prints vector for the given word, even if it is out-of-vocabulary
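
If you want to keep the settings from your original snippet, something along these lines should work (a rough sketch - the keyword names model, size and threads are how I recall the wrapper's train() signature, so please double-check against the docs; both paths are placeholders):

from multiprocessing import cpu_count
from gensim.models.wrappers import FastText

# skip-gram, 300-dimensional vectors, one thread per core
model = FastText.train(
    '/path/to/fastText/fasttext',
    corpus_file='/path/to/pubmed_sentences.txt',
    model='skipgram',
    size=300,
    threads=cpu_count(),
)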

Jurica Seva

Jul 10, 2017, 4:14:54 AM
to gensim
Hi Jayant,

thank you for the clarification. Maybe it would be a good idea to stress in the docs that streaming isn't an option at the moment.

The problem in my case is that the offline file doesn't exist yet, and once I create it, it will be tens of GB in size (I'm not sure how big exactly; the entire dump is 150 GB, and I don't know how much smaller it gets once I extract only the fields I use for training).
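
I suppose that, in the meantime, I could reuse the iterator above to stream the extracted sentences into a plain-text file first and point corpus_file at that. An untested sketch of what I have in mind (one whitespace-joined sentence per line; write_corpus_file is just a hypothetical helper of mine):

import io

def write_corpus_file(sentences, out_path):
    # stream tokenized sentences to disk one line at a time,
    # so the corpus never has to fit into RAM
    with io.open(out_path, 'w', encoding='utf-8') as out:
        for tokens in sentences:
            out.write(u' '.join(tokens) + u'\n')

write_corpus_file(MySentences(r"/home/docClass/files/"), '/home/docClass/pubmed_sentences.txt')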

Are there any plans to implement streaming training of FT? 

Best,
J. 

Radim Řehůřek

Jul 10, 2017, 12:57:19 PM
to gensim