Tutorial for TextCorpus


Sergio E. Tobal

Oct 2, 2017, 4:42:37 PM
to gensim
I'm trying to use corpora.textcorpus to process some medical papers and create a word2vec model, but I don't know how to do it and the docs don't explain it.

I wanted to use TextDirectoryCorpus, but it gave me a lot of encoding errors, so in the end I gave up and tried something basic:

tr = gensim.corpora.textcorpus.TextCorpus("./papersMedicos/1413-8123-csc-22-09-2797.pdf")


but again it gives me an error, "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte". If I copy-paste the text into a .txt file and read that instead, and then try to save it with

tr.save_corpus('prueba.mm', MISSING_PARAMETER)


I don't know what to put in the second parameter. Isn't the corpus the tr variable itself?

I tried reading the tutorials on GitHub and some blogs, but they either use a Wikipedia corpus or work with only a few example sentences.

Can someone guide me? Thanks a lot

Kenneth Orton

Oct 3, 2017, 1:47:00 AM
to gensim
You should be able to tokenize text files with many of the Gensim utilities.

The utils.any2unicode(text) might help with the encoding error.
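For example (a minimal sketch; 'paper.txt' is just a placeholder filename, and errors='ignore' simply drops the undecodable bytes):

import gensim

# read raw bytes and decode them leniently; stray bytes like the 0xe2
# one are dropped instead of raising UnicodeDecodeError
with open('paper.txt', 'rb') as f:
    raw = f.read()
text = gensim.utils.any2unicode(raw, encoding='utf8', errors='ignore')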

Here is a script that should work, with some comments.
You'll have to edit the docs_loc variable to point at the directory containing your text docs.

The basic idea is to inherit from TextCorpus and override the get_texts method.
There is a Gensim Jupyter tutorial somewhere that mentions this.

import os
import sys
import re
import multiprocessing
from functools import partial

import gensim
from gensim import utils
from gensim.corpora import Dictionary
from nltk.tokenize import TweetTokenizer

ignore_words = frozenset('the', 'at', 'and', 'if', 'are', 'am', 'be', 'is', 'etc')

def list_to_gen(directory):
    # yield the full path of every file in the directory
    for filename in os.listdir(directory):
        yield os.path.join(directory, filename)

def preprocess_text(lemma, tweet, document):
    # read the document into one string
    with open(document, 'r') as infile:
        text = ' '.join(line.rstrip('\n') for line in infile)

    # convert the string into unicode
    text = gensim.utils.any2unicode(text)

    # remove URLs
    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)

    if lemma:
        # note: utils.lemmatize needs the 'pattern' package installed
        return utils.lemmatize(text, stopwords=ignore_words, min_length=3)

    if tweet:
        # remove symbols, excluding @, # and whitespace
        text = re.sub(r'[^\w@#\s]', '', text)

        # tokenize words using the NLTK Twitter tokenizer
        tknzr = TweetTokenizer()
        text = tknzr.tokenize(text)

        # lowercase, remove words of length < 3 and remove numbers
        text = [word.lower() for word in text if len(word) > 2 and not word.isdigit()]

        # remove stopwords
        return [word for word in text if word not in ignore_words]

    return utils.simple_preprocess(text, deacc=True, min_len=3)

# inherit from the TextCorpus class and override the get_texts method
class DocCorpus(gensim.corpora.TextCorpus):
    def __init__(self, docs_loc, lemmatize, twitterize, dictionary=None, metadata=None):
        self.docs_loc = docs_loc
        self.lemmatize = lemmatize
        self.twitterize = twitterize
        self.metadata = metadata
        if dictionary is None:
            self.dictionary = Dictionary(self.get_texts())
        else:
            self.dictionary = dictionary

    def get_texts(self):
        # preprocess the documents in parallel, leaving one core free
        pool = multiprocessing.Pool(max(1, multiprocessing.cpu_count() - 1))
        func = partial(preprocess_text, self.lemmatize, self.twitterize)
        for tokens in pool.map(func, list_to_gen(self.docs_loc)):
            print(tokens)
            yield tokens
        pool.terminate()

def main():
    lemma = False
    twitterize = True

    docs_loc = '/path/to/dir/containing/text_docs/'

    doc_corpus = DocCorpus(docs_loc, lemma, twitterize)

if __name__ == '__main__':
    sys.exit(main())
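Once you have the corpus you can serialize it to disk in Matrix Market format, something like this (a sketch; 'docs.mm' is just a placeholder filename):

from gensim.corpora import MmCorpus

# stream the bag-of-words vectors from DocCorpus straight to disk
MmCorpus.serialize('docs.mm', doc_corpus)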

Kenneth Orton

Oct 3, 2017, 1:49:23 AM
to gensim
oops... in the last post, this

ignore_words = frozenset('the', 'at', 'and', 'if', 'are', 'am', 'be', 'is', 'etc')

should be this

ignore_words = frozenset(['the', 'at', 'and', 'if', 'are', 'am', 'be', 'is', 'etc'])




Ivan Menshikh

Oct 4, 2017, 1:58:39 AM
to gensim
Hi Sergio,
I don't think TextCorpus is suitable for the PDF format; it works well with plain text files.
For your case you need to write a custom PDF reader, and if you want, you can implement a corpus wrapper around it (similar to the code from Kenneth Orton).
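For example, a minimal sketch of such a reader, assuming the pdfminer.six package is installed (the directory path and function name are just placeholders):

import os
from pdfminer.high_level import extract_text

def iter_pdf_texts(pdf_dir):
    # yield the extracted plain text of every PDF in the directory
    for filename in os.listdir(pdf_dir):
        if filename.endswith('.pdf'):
            yield extract_text(os.path.join(pdf_dir, filename))

You can then feed these strings into a preprocessing function like Kenneth's.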

Sergio E. Tobal

Oct 6, 2017, 11:03:47 AM
to gensim
Sorry for taking so long to answer; sometimes I don't know why I'm doing university, it's too time-consuming and I'm not learning that much.

Thanks a lot for the answers, they helped me a lot; I was completely lost. I thought it was not mandatory to override the get_texts() method, and nothing was working for me.

With Kenneth's code, after copying the text from the .pdf files into .txt files, I could save the corpus using
MmCorpus.serialize('file.mm', doc_corpus)
I was thinking I could then load it the way I did with a pretrained model from Google,
model = gensim.models.KeyedVectors.load('./data/GoogleNews-vectors-negative300.bin.gz', mmap=None)
so I tried to use
mm = MmCorpus.load('file.mm')
but it doesn't work, and I'm not sure what I have there. Is it the same kind of content as in Google's model? Does it have the vectors? It seemed far too small compared to Google's model, and I never said how many dimensions I wanted to use.

Sergio E. Tobal

Oct 6, 2017, 1:46:55 PM
to gensim
I think I was using the library very wrong, or I don't understand yet how it works. I wanted to train a model with medicine papers, and I saw that I have to use gensim.models.word2vec with the text split into lines; I understood this from this tutorial: https://rare-technologies.com/word2vec-tutorial/. So I'm not sure anymore why I needed the TextCorpus, or what the purpose of a corpus is.

Kenneth Orton

Oct 6, 2017, 10:41:28 PM
to gensim
To load an MmCorpus just use:

mm = MmCorpus('file.mm')

TextCorpus yields a vectorized (bag-of-words) corpus from its __iter__ function. You should be able to either override it or edit the source so it only yields tokens, like the wiki corpus's get_texts().
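For example, a sketch of the override, reusing the DocCorpus class from my earlier post:

class TokenCorpus(DocCorpus):
    # yield raw token lists instead of bag-of-words vectors
    def __iter__(self):
        for tokens in self.get_texts():
            yield tokens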

Ivan Menshikh

Oct 9, 2017, 2:48:06 AM
to gensim
Hi Sergio,
if you want, you can ignore corpora entirely: just write all your data to a text file and that's enough.

For w2v, you just pass a word to the model to get its word vector, like model.wv['myword']
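For example, a minimal sketch, assuming corpus.txt holds one preprocessed sentence per line (the filename and parameter values are just placeholders):

import gensim

# stream sentences from a plain text file, one sentence per line
sentences = gensim.models.word2vec.LineSentence('corpus.txt')

# train word2vec; size is the number of vector dimensions you asked about
model = gensim.models.Word2Vec(sentences, size=100, min_count=5)

print(model.wv['myword'])  # the learned 100-dimensional vector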