Fasttext save vocab in large training


alessandra stampi-bombelli

Dec 10, 2021, 6:14:55 AM
to fastText library
Hello,

I have an extremely large corpus on which I have been trying to train fastText embeddings. I am running on an HPC cluster; the job exceeded the maximum time limit (15 days) and was killed. I took a look at the log file, and building the vocabulary alone took approximately one week. Unfortunately, I did not save the vocabulary, because I don't know how to.

Would anyone know how to do this? Also, once the vocabulary is built, how would I restart the training by giving it the already built vocabulary?
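
My rough guess is that something like the sketch below might work: save the whole model right after build_vocab and then, in a second job, load it and call train. I am not sure this is the intended way, though; the checkpoint path is just a placeholder and corpus_iterable stands for the same generator I use in the code further down.

# Save the model as soon as the vocabulary has been built (sketch, not tested)
model.build_vocab(corpus_iterable)
model.save('/path/to/fasttext_vocab_checkpoint.model')   # placeholder path

# ... later, in a separate job: reload the model and continue with training
from gensim.models import FastText
model = FastText.load('/path/to/fasttext_vocab_checkpoint.model')
model.train(corpus_iterable,
            total_examples=model.corpus_count,
            epochs=model.epochs)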

My code is below, along with the part of the logfile that reports how much memory the job needs. If anyone has tips on how to make the code more efficient, that would also be very much appreciated.

Thanks so much!

Best,
Sandra

CODE:

# In[1]:
###################################
#     Modules                   ###
###################################

import numpy as np
# define logging mode
import logging
logging.basicConfig(format='%(asctime)s  %(message)s', datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO, filename="logfile.log")
from psycopg2 import extras
import sys
sys.path.append("/cluster/work/lawecon/Work/goessmann/python_common/")
import database_connection
import uuid


import multiprocessing
workers = multiprocessing.cpu_count()


from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary

import re
from nltk.corpus import stopwords
import string
import pickle as pk
punct = string.punctuation
# importing fastText
from gensim.models import FastText
# for preprocessing
import gensim
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import stem_text
from gensim.parsing.preprocessing import remove_stopwords
import nltk
nltk.download('words')
from nltk.corpus import words
wn_lemmas = set([a.lower() for a in words.words()])
from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile


# print directory listing
import subprocess
print(subprocess.run(['ls', '-l'], capture_output=True, text=True).stdout)



stop_words = ['about','above','after','again','all','and','between','both','during','each','few','for','further','how','into','itself','once','only','over','some','such','that','the','then','this','those','through','too','until','what','when','where','which','while','why']



#######################################################
#     Database Connection and Preprocessing         ###
#######################################################

# ## database connection (Chronicling America)

# Below, the generator SELECTs the listed variables from the Chronicling America dataset (joined with its metadata table) for the years 1860-1920.
# It does this by connecting to the database (which requires an active connection and the ETH VPN).
# Preprocessing:
# The text is then tokenised and preprocessed.
# Namely: punctuation removal, removing "\n", lower casing, removing 1- and 2-letter strings (and unrecognised 3-letter strings), removing stopwords, stemming, and keeping only pages with an estimated OCR error rate below 75%.
# N.B. Stemming reduces a word to its stem by stripping affixes (prefixes and suffixes).


class paragraph_generator(object):
    def __init__(self,test=True,itersize=2500,year=None,state=None):
        self.test=test
        self.itersize=itersize
        self.sql = f"""
        SELECT
            text_id,
            lccn_sn,
            date,
            ed,
            chroniclingamerica_meta.statefp,
            chroniclingamerica_meta.countyfp,
            text_ocr
        FROM
            chroniclingamerica natural join chroniclingamerica_meta
        WHERE date_part('year',date) BETWEEN 1860 AND 1920 """
        if self.test:
            self.sql = self.sql+' limit 10000'   # limit 10000 means it only reads 10000 rows of the database
        else:
            pass
        print(self.sql)
    def __iter__(self):
        con, cur = database_connection.connect(cursor_type='server')
        cur.itersize = self.itersize
        cur.execute(self.sql)
        for p in cur.fetchall():
            text = stem_text(p[-1])                                                              # Stem the raw page text
            tokens = text.translate(str.maketrans('', '', punct)).replace('\n', ' ').lower().split(' ')
            tokens_3 = [a for a in tokens if len(a) == 3 and a in wn_lemmas]                     # For 3-letter words, keep only recognised tokens
            tokens = gensim.parsing.preprocessing.remove_short_tokens(tokens, minsize=4)         # Remove 1-, 2-, and 3-letter words
            tokens = tokens + tokens_3                                                           # Add back the recognised 3-letter tokens
            tokens = gensim.parsing.preprocessing.remove_stopword_tokens(tokens, stopwords=stop_words)    # Remove stopwords in the stopword list above
            print("THIS IS THE LENGTH OF TOKENS")
            a=len(tokens)
            print(a)
            if len(tokens)!=0:
                ocr_2 = 1 - (len([a for a in tokens if a in wn_lemmas ])/len(tokens))                       # Generate a measure for proportion of OCR errors in a page
            else:
                ocr_2 = float("nan")
            print("THIS IS OCR")
            print(ocr_2)
            ocr=ocr_2
            if ocr<0.75 and ~np.isnan(ocr):                        # If the % of OCR in a page is less than 75%, then keep the page and all tokens
                tokens=tokens
            else:
                tokens=[]                       # Otherwise, give it an empty list (i.e. drop the page)                                                                
            yield tokens      
        con.close()



# In[6]:



#######################################################
#     FastText Embeddings Set up and Training       ###
#######################################################

# window = 8 (context words before and after the target word)
# epochs = number of iterations over the corpus
# max_final_vocab = caps the vocabulary at 100k words
# vector_size = dimensionality of the embeddings (256)
# workers = number of worker threads used for training on Euler; set to the number of cores requested (see workers above)
       


model = FastText(vector_size=256, window=8, min_count=10, max_final_vocab=100000,  epochs=5, workers=workers)


total_words = model.corpus_total_words   # not meaningful until build_vocab has run (recomputed below)

# NB: Set test=False when calling paragraph_generator if you want it to run on the full database

# build vocab (build_vocab returns None; the vocabulary is stored on the model itself)
model.build_vocab(paragraph_generator(test=False, itersize=2500, year=None, state=None))


# In[9]:


total_words = model.corpus_total_words
print(total_words)

# idea: a loop that saves and reloads the model every so often (with a timestamp); see the sketch after the train() call below
# if there is time
# relaunch the script from bash
model.train(paragraph_generator(test=False, itersize=2500, year=None, state=None),
            epochs=5, total_examples=model.corpus_count)
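
# The comments above refer to a checkpointing loop I have not implemented yet. As an
# alternative to the single train() call above, I roughly mean the following (untested
# sketch; the checkpoint path is a placeholder, and I am not sure how gensim handles the
# learning-rate schedule across separate train() calls): train one epoch per call and
# save after every pass, so the bash script can be relaunched from the last checkpoint.
checkpoint_path = '/cluster/work/lawecon/Projects/Immigration_Discourse/models/fasttext_checkpoint.model'  # placeholder
for epoch in range(5):
    model.train(paragraph_generator(test=False, itersize=2500, year=None, state=None),
                total_examples=model.corpus_count, epochs=1)
    model.save(checkpoint_path)
    logging.info("finished epoch %d and saved checkpoint", epoch + 1)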


# In[11]:


# saving embedding model
fasttext_allyears = model.wv
fasttext_allyears.save('/cluster/work/lawecon/Projects/Immigration_Discourse/models/fasttext_1860-1920_100k_preprocessed.kv')
model.save('/cluster/work/lawecon/Projects/Immigration_Discourse/models/fasttext_1860-1920_100k_preprocessed.bin')
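
# For a later analysis script, loading the saved vectors back should work roughly like this
# (sketch; the query word is just an example):
from gensim.models import KeyedVectors
wv = KeyedVectors.load('/cluster/work/lawecon/Projects/Immigration_Discourse/models/fasttext_1860-1920_100k_preprocessed.kv')
print(wv.most_similar('immigration', topn=10))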



LOGFILE:

21-11-28 22:33:54  collected 1373909979 word types from a corpus of 51044102133 raw words and 17704173 sentences
21-11-28 22:51:40  FastText lifecycle event {'msg': 'max_final_vocab=100000 and min_count=10 resulted in calc_min_count=17041, effective_min_count=17041', 'datetime': '2021-11-28T22:51:40.481280', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 22:51:40  Creating a fresh vocabulary
21-11-28 23:00:40  FastText lifecycle event {'msg': 'effective_min_count=17041 retains 99997 unique words (0.007278278892244657%% of original 1373909979, drops 1373809982)', 'datetime': '2021-11-28T23:00:40.995271', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 23:00:40  FastText lifecycle event {'msg': 'effective_min_count=17041 leaves 41277623245 word corpus (80.86658697110087%% of original 51044102133, drops 9766478888)', 'datetime': '2021-11-28T23:00:40.996206', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 23:00:41  deleting the raw counts dictionary of 1373909979 items
21-11-28 23:01:16  sample=0.001 downsamples 22 most-common words
21-11-28 23:01:16  FastText lifecycle event {'msg': 'downsampling leaves estimated 39742566097.818306 word corpus (96.3%% of prior 41277623245)', 'datetime': '2021-11-28T23:01:16.675995', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'prepare_vocab'}
21-11-28 23:21:28  estimated required memory for 99997 words, 2000000 buckets and 256 dimensions: 2320145456 bytes
21-11-28 23:21:28  resetting layer weights
21-11-28 23:21:37  FastText lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-11-28T23:21:37.553806', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'build_vocab'}
21-11-28 23:21:37  FastText lifecycle event {'msg': 'training model with 128 workers on 99997 vocabulary and 256 features, using sg=0 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2021-11-28T23:21:37.555200', 'gensim': '4.1.2', 'python': '3.7.4 (default, Oct 16 2019, 13:45:57) \n[GCC 6.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core', 'event': 'train'}