Hello. I am a gensim newbie, please be gentle.
Some background: I have a 1.6 GB line-oriented JSON file in which each line is a document; there are 1,209,533 documents in total. The schema is the following:
root
|-- last_update: string (nullable = true)
|-- id: string (nullable = true)
|-- text: string (nullable = true)
I also have another line-oriented JSON file, also about 1.6 GB. Each line contains data about the author and title of a document. Its schema looks like this:
root
|-- id: string (nullable = true)
|-- author: string (nullable = true)
|-- title: string (nullable = true)
I ultimately want to do similarity queries where a user inputs either a document ID or a string of text from a document and gets back the N most similar documents. I want to do either a count vectorization or a TF-IDF vectorization of the documents, project these vectors onto a smaller, 300-dimensional space using SVD, then compute similarity metrics between each document and every other document. I would ultimately like for this to be a server running on a distributed cluster.
I am looking at the example in tutorial one on how to handle a large amount of data. I am trying to understand the "memory friendly" ways of building a dictionary and corpus.
About 25-30% of my 1,209,533 documents are non-English. I want to skip over these completely: they should be dropped from the "final" corpus, and none of their tokens should end up in the overall vocabulary.
How can I build a Gensim dictionary and corpus from just a subset of my overall data? Here is what I have so far:
def confirm_english(tokens_list):
    # Adapted from: http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/
    # Score each language by how many of its stop words appear in the document.
    # stopwords.fileids() is ['danish', 'dutch', 'english', 'finnish', 'french',
    # 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian',
    # 'spanish', 'swedish', 'turkish']
    languages_ratios = {}
    words_set = set(tokens_list)
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)  # language "score"
    if len(set(languages_ratios.values())) == 1:
        # Every language got the same score (e.g. no stop words found, all zero),
        # so there is no meaningful maximum. Assume English.
        most_rated_language = 'english'
    else:
        most_rated_language = max(languages_ratios, key=languages_ratios.get)
    if most_rated_language == 'english':
        return tokens_list
    else:
        return []
dictionary = corpora.Dictionary(
    confirm_english(simple_preprocess(json.loads(line, strict=False)['text'], min_len=1))
    for line in open('/root/data/docs.json')
)
# TODO: filter stop words from dictionary
# TODO: filter rare words
Is that the best way to do this? And what about the corpus?