Fasttext save vocab in large training

alessandra stampi-bombelli

10 Dec 2021, 06:15:25
to: Gensim
Hello,

I have an extremely large corpus that I have been trying to train FastText embeddings on. I am running on an HPC cluster; the job ran over the maximum time limit (15 days) and got killed. I took a look at the log file, and just building the vocabulary took approximately one week. Unfortunately, I did not save the vocabulary, because I don't know how to.

Would anyone know how to do this? Also, once the vocabulary is built, how would I restart the training by giving it the already built vocabulary?

My code is below, as well as parts of the logfile, which states how much memory the job needs. If anyone has tips on how to make the code more efficient, that would also be very much appreciated and helpful.

Thanks so much!

Best,
Sandra

CODE:

# In[1]:
###################################
#     Modules                   ###
###################################

import numpy as np
# define logging mode
import logging
logging.basicConfig(format='%(asctime)s  %(message)s', datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO, filename="logfile.log")
from psycopg2 import extras
import sys
sys.path.append("/cluster/work/lawecon/Work/goessmann/python_common/")
import database_connection
import uuid


import multiprocessing
workers = multiprocessing.cpu_count()


from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary

import re
from nltk.corpus import stopwords
import string
import pickle as pk
punct = string.punctuation
# importing fastText
from gensim.models import FastText
# for preprocessing
import gensim
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import stem_text
from gensim.parsing.preprocessing import remove_stopwords
import nltk
nltk.download('words')
from nltk.corpus import words
wn_lemmas = set([a.lower() for a in words.words()])
from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile


# print directory listing
import subprocess
print(subprocess.run(['ls', '-l'], capture_output=True, text=True).stdout)



stop_words = ['about','above','after','again','all','and','between','both','during','each','few','for','further','how','into','itself','once','only','over','some','such','that','the','then','this','those','through','too','until','what','when','where','which','while','why']



#######################################################
#     Database Connection and Preprocessing         ###
#######################################################

# ## database connection (Chronicling America)

# Below, the query selects the listed variables from the Chronicling America dataset (and its meta dataset) for the years 1860-1920.
# It does this by connecting to the database (which requires active connection and ETH VPN).
# Preprocessing:
# Then the text is tokenised. Afterwards, some preprocessing is done.
# Namely: punctuation removal, removing "\n", lower-casing, removing 1-, 2-, and (non-WordNet) 3-letter strings, removing stopwords, stemming, and keeping only pages with less than 75% estimated OCR error.
# N.B. Stemming is the process of reducing a word to its word stem (the base form to which affixes attach).


class paragraph_generator(object):
    def __init__(self,test=True,itersize=2500,year=None,state=None):
        self.test=test
        self.itersize=itersize
        self.sql = f"""
        SELECT
            text_id,
            lccn_sn,
            date,
            ed,
            chroniclingamerica_meta.statefp,
            chroniclingamerica_meta.countyfp,
            text_ocr
        FROM
            chroniclingamerica natural join chroniclingamerica_meta
        WHERE date_part('year',date) BETWEEN 1860 AND 1920 """
        if self.test:
            self.sql = self.sql+' limit 10000'   # limit 10000 means it only goes through 10000 rows of the database
        else:
            pass
        print(self.sql)
    def __iter__(self):
        con, cur = database_connection.connect(cursor_type='server')
        cur.itersize = self.itersize
        cur.execute(self.sql)
        for p in cur.fetchall():
            text = stem_text(p[-1])                                                              # Stem the raw page text
            tokens = text.translate(str.maketrans('', '', punct)).replace('\n',' ').lower().split(' ')   # Remove punctuation/newlines, lowercase, split on spaces
            tokens_3 = [a for a in tokens if len(a)==3 if a in wn_lemmas ]                      # For 3-letter words, only keep WordNet recognized tokens
            tokens = gensim.parsing.preprocessing.remove_short_tokens(tokens, minsize=4)        # Remove 1-, 2-, and 3-letter words
            tokens = tokens + tokens_3                                                           # Add back in 3-letter WordNet-recognized tokens
            tokens = gensim.parsing.preprocessing.remove_stopword_tokens(tokens, stopwords=stop_words)    # Remove stopwords in stopword list above
            print("THIS IS THE LENGTH OF TOKENS")
            a=len(tokens)
            print(a)
            if len(tokens)!=0:
                ocr_2 = 1 - (len([a for a in tokens if a in wn_lemmas ])/len(tokens))                       # Generate a measure for proportion of OCR errors in a page
            else:
                ocr_2 = float("nan")
            print("THIS IS OCR")
            print(ocr_2)
            ocr=ocr_2
            if ocr < 0.75 and not np.isnan(ocr):                   # If the estimated OCR-error proportion of the page is below 75%, keep the page and all tokens
                tokens=tokens
            else:
                tokens=[]                       # Otherwise, give it an empty list (i.e. drop the page)                                                                
            yield tokens      
        con.close()



# In[6]:



#######################################################
#     FastText Embeddings Set up and Training       ###
#######################################################

# window = 8 (context words before and after the target word)
# epochs = number of iterations over the corpus
# max_final_vocab = caps the vocabulary at 100k words
# vector_size = dimensionality of the embedding vectors (256)
# workers = number of worker threads used for training on Euler - set to the number of CPU cores requested (see workers above)
       


model = FastText(vector_size=256, window=8, min_count=10, max_final_vocab=100000,  epochs=5, workers=workers)


total_words = model.corpus_total_words   # note: only meaningful after build_vocab() below

# NB: Set test=False when calling paragraph_generator if you want it to run on the full database

# build vocab
model.build_vocab(paragraph_generator(test=False, itersize=2500, year=None, state=None))   # build_vocab updates the model in place and returns None


# In[9]:


total_words = model.corpus_total_words
total_words

# TODO: loop that saves and reloads the model every so often (with a time stamp)
# if time runs out,
# relaunch the script from bash
model.train(paragraph_generator(test=False, itersize=2500, year=None, state=None),
               epochs=5, total_examples=model.corpus_count)


# In[11]:


# saving embedding model
fasttext_allyears = model.wv
fasttext_allyears.save('/cluster/work/lawecon/Projects/Immigration_Discourse/models/fasttext_1860-1920_100k_preprocessed.kv')
model.save('/cluster/work/lawecon/Projects/Immigration_Discourse/models/fasttext_1860-1920_100k_preprocessed.bin')



LOGFILE:

21-11-28 22:33:54  collected 1373909979 word types from a corpus of 51044102133 raw words and 17704173 sentences
21-11-28 22:51:40  FastText lifecycle event {'msg': 'max_final_vocab=100000 and min_count=10 resulted in calc_min_count=17041, effective_min_count=17041', 'datetime': '2021-11-28T22:51:40.481280', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 22:51:40  Creating a fresh vocabulary
21-11-28 23:00:40  FastText lifecycle event {'msg': 'effective_min_count=17041 retains 99997 unique words (0.007278278892244657%% of original 1373909979, drops 1373809982)', 'datetime': '2021-11-28T23:00:40.995271', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 23:00:40  FastText lifecycle event {'msg': 'effective_min_count=17041 leaves 41277623245 word corpus (80.86658697110087%% of original 51044102133, drops 9766478888)', 'datetime': '2021-11-28T23:00:40.996206', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 23:00:41  deleting the raw counts dictionary of 1373909979 items
21-11-28 23:01:16  sample=0.001 downsamples 22 most-common words
21-11-28 23:01:16  FastText lifecycle event {'msg': 'downsampling leaves estimated 39742566097.818306 word corpus (96.3%% of prior 41277623245)', 'datetime': '2021-11-28T23:01:16.675995', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 23:21:28  estimated required memory for 99997 words, 2000000 buckets and 256 dimensions: 2320145456 bytes
21-11-28 23:21:28  resetting layer weights
21-11-28 23:21:37  FastText lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-11-28T23:21:37.553806', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'build_vocab'}
21-11-28 23:21:37  FastText lifecycle event {'msg': 'training model with 128 workers on 99997 vocabulary and 256 features, using sg=0 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2021-11-28T23:21:37.555200', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'train'}

Gordon Mohr

10 Dec 2021, 16:11:26
to: Gensim
You can definitely `model.save()` just after `build_vocab()`, to save the allocated (& frozen-vocabulary) model, then `.load()` it later for a training session. 
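A minimal sketch of that save-then-resume flow (the path here is a placeholder, and `corpus` stands for your `paragraph_generator(...)` iterable):

from gensim.models import FastText

# first job: build the vocabulary once, then persist the whole model
model = FastText(vector_size=256, window=8, min_count=10,
                 max_final_vocab=100000, epochs=5, workers=workers)
model.build_vocab(corpus)
model.save('/some/path/fasttext_vocab_only.model')   # placeholder path

# later job: reload the frozen-vocabulary model and run the training pass
model = FastText.load('/some/path/fasttext_vocab_only.model')
model.train(corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)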

But looking at your approach & the scale of your data, some additional ideas with regard to performance:

* neither stemming nor stopword-removal are especially necessary steps, unless you have other reasons for them
* re-reading the corpus from a (maybe slow?) database query on every iteration (both initial vocabulary-scan and every training epoch) may add a bottleneck: you may want to only iterate over the entire corpus from a database just once, writing the actual training data to a local file
* that local cache of the training data can also be the data *after* any preprocessing/cleanup - preventing that expensive work from being done repeatedly. In particular, some of your preprocessing steps may use expensive regexes, or (like `stem_text()`) redundantly tokenize texts, then re-concatenate into a string, only for them to be re-tokenized again. Even if you can't eliminate some of these inefficiencies, & they might be tolerable in a single pass, repeating them (epochs+1) times can be avoided by saving the interim corpus after one preprocessing pass (see the sketch after this list).
* 1.4 billion unique words after a survey is atypically large, and implies a lot of atypical tokens of little value, such as arbitrary serial numbers or other glitches (like perhaps OCR errors). The log lines showing `calc_min_count` of 17041 imply there are words appearing a massive 17,040 times that are nonetheless being dropped as too rare to be of interest! You probably want to take a closer look at your data, and perform extra cleaning *before* the `build_vocab()` survey occurs - either discarding tokens that are sure to be junk, or performing some sort of probabilistic fixup (such as coercing typo-like suspect tokens into plausible, close-edit-distance context-appropriate corrections). 
* if a machine has 16 or more CPU cores, then training throughput using the traditional Gensim corpus iterable approach tends to be best using a number of `workers` somewhere in the 8-12 range, rather than fully equal to the number of cores, due to thread-contention issues. (The best `workers` value can only be experimentally-determined, with regard to the influence of other parameters like `vector_size`, `window`, & `negative` as well. One approach is looking at the log-reported throughput over several trial-and-error runs, starting from repeated `.load()`s of the same built-vocab model, using different `workers` values. Alternatively, if possible to store the full corpus as one pre-tokenized text file, the alternate `corpus_file` mode can saturate `workers` up to the count of CPU cores.)
* for such a really-large corpus, using a more-aggressive (smaller) `sample` value often helps to both speed training *and* improve end-vector quality on downstream tasks, by discarding more of the mostly-redundant instances of very-common words. A value of `1e-05` or smaller may help your runs a lot. 
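A minimal sketch of the cache-once-then-reuse idea from the points above (paths are placeholders; it assumes your existing `paragraph_generator` yields lists of tokens, and folds in the smaller `sample` suggestion):

from gensim.models import FastText
from gensim.utils import save_as_line_sentence

corpus_path = '/some/scratch/path/corpus_tokens.txt'   # placeholder path on fast local storage

# one pass over the database, doing all preprocessing exactly once,
# writing one space-separated, already-tokenized text per line
save_as_line_sentence(paragraph_generator(test=False, itersize=2500), corpus_path)

# corpus_file mode streams from that file and can keep many more cores busy;
# sample=1e-05 discards more of the redundant instances of very-common words
model = FastText(vector_size=256, window=8, min_count=10,
                 max_final_vocab=100000, sample=1e-05,
                 epochs=5, workers=workers)
model.build_vocab(corpus_file=corpus_path)
model.save('/some/scratch/path/fasttext_built_vocab.model')   # checkpoint right after the vocab scan

model.train(corpus_file=corpus_path,
            total_examples=model.corpus_count,
            total_words=model.corpus_total_words,
            epochs=model.epochs)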

Separately with regard to correctness/quality-of-results:

* after surveying 1.3B words, to then proceed with just 100,000 surviving words also seems a bit odd; retaining another few hundred thousand more would still leave a manageable model size, and perhaps much better coverage of relevant texts' words, & training data for all surviving words.
* your variable-name `wn_lemmas` & surrounding logic implies an intent this be a set of WordNet words, lemmatized. But IIUC `nltk.corpus.words` is more like a UNIX `words` file - neither WordNet nor stemmed/lemmatized in any way. So the logic of checking your (stemmed) texts against this may be misguided.
* removing all 1/2/3-character tokens, but then appending the known 3-letter tokens *to the end* of each text is removing those words from their real neighboring-word contexts - so you'll not be performing real word-to-neighboring-word word-vector training where these words have been removed, or re-inserted. A filtering that retains original positioning would be better (e.g. as sketched below).
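For instance, a position-preserving version of that filter could reuse the `wn_lemmas` set from the original script:

tokens = [t for t in tokens
          if len(t) >= 4 or (len(t) == 3 and t in wn_lemmas)]   # drop 1-2 letter tokens and unrecognized 3-letter tokens, keeping survivors in place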

- Gordon