How to skip over documents determined to be in a foreign language?


Tony Panza
Dec 16, 2017, 5:12:38 AM
to gensim
Hello. I am a gensim newbie, please be gentle.

Some background: I have a 1.6 GB line-oriented JSON file. Each line is a document. It contains 1209533 documents. The schema is the following:

root
 |-- last_update: string (nullable = true)
 |-- id: string (nullable = true)
 |-- text: string (nullable = true)

I also have another line-oriented JSON file, also about 1.6 GB, where each line contains data about the author and title of a document. Its schema looks like this:

root
 |-- id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)


I ultimately want to do similarity queries where a user could input an ID of a document, or a string of text from a document, and get back the N most similar documents. I want to apply either a count vectorizer or a TF-IDF vectorizer to the documents, project these vectors onto a smaller, 300-dimensional space using SVD, then compute similarity metrics between each document and every other document. I would ultimately like for this to be a server running on a distributed cluster.

I am looking at the example in tutorial one on how to handle a large amount of data. I am trying to understand the "memory friendly" ways of building a dictionary and corpus.

About 25-30% of the 1209533 documents I have are non-English. I want to completely skip over these. I'd like for them to be dropped from the "final" corpus and not have any of their tokens be part of the overall vocabulary.

I found a good example here: http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/ of using stop words for language detection. The idea is that each language has a distinctive set of stop words, so for each document we count the occurrences of each language's stop words and declare the language whose stop words occur most often to be the language of the document.

How can I build a Gensim dictionary and corpus from just a subset of my overall data? Here is what I have so far:

import json
from nltk.corpus import stopwords
from gensim import corpora
from gensim.utils import simple_preprocess

def confirm_english(tokens_list):
    # Adapted from: http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/
    # stopwords.fileids() -> ['danish', 'dutch', 'english', 'finnish', 'french', 'german',
    # 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
    languages_ratios = {}
    words_set = set(tokens_list)
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)  # language "score"

    if len(set(languages_ratios.values())) == 1:
        # Every language got the same score (e.g. no stop words found at all,
        # so every score is 0). There is no meaningful maximum, so assume English.
        most_rated_language = 'english'
    else:
        most_rated_language = max(languages_ratios, key=languages_ratios.get)

    return tokens_list if most_rated_language == 'english' else []


dictionary = corpora.Dictionary(
    confirm_english(simple_preprocess(json.loads(line, strict=False)['text'], min_len=1))
    for line in open('/root/data/docs.json'))

# TODO: filter stop words from dictionary
# TODO: filter rare words




Is that the best way to do this? And what about the corpus?

