Author-topic model not running, what am I missing?


Steven McColl

Aug 3, 2017, 11:49:16 AM
to gensim
I am trying to run an author-topic model on some tweets, around 9000 in total, from only 3 authors. I am interested in getting this kind of pipeline, from a DataFrame to a working author-topic model, running consistently. I have mashed this script together from a bunch of tutorials, but I found that the author-topic model tutorial did not clearly explain any alternative ways to get data into the model. In the end, this script produces 3000 unique tokens, 8000 documents, and 3 authors. I tried printing the dictionary (fine), the corpus (fine), and dictionary.id2token (empty), but I read that id2token is only created as needed and should work even if the printed output is empty. Basically, I'm just not sure what is going wrong; any help would be appreciated.

Here is the script:

import sys
import os
import time
import pandas
import numpy
from gensim import corpora
from collections import defaultdict
from stop_words import get_stop_words
from recordlinkage.standardise import clean
from nltk.tokenize import RegexpTokenizer

tweet_df = pandas.read_csv('leadertweets.csv')

a2d = defaultdict(list)
for i, j in zip(tweet_df['name'], tweet_df['tweets/text']):
    a2d[i].append(j)

documents = clean(tweet_df['tweets/text'])

documents = documents.tolist()

stopwords = get_stop_words('en')

tokenizer = RegexpTokenizer(r'\w+')

tokens = []

for text in documents:
    token = tokenizer.tokenize(str(text))
    tokens.append(token)

cleaned_tokens = []

for l in tokens:
    cleaned_tokens.append([i for i in l if i not in stopwords])

dictionary = corpora.Dictionary(cleaned_tokens)

array = numpy.asarray(cleaned_tokens)

max_freq = 0.8
min_wordcount = 5
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)

corpus = [dictionary.doc2bow(word) for word in array]

print('Number of authors: %d' % len(a2d))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

from gensim.models import AuthorTopicModel
%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                author2doc=a2d, chunksize=2000, passes=1, eval_every=0, \
                iterations=1, random_state=1)

Steven M

Aug 4, 2017, 2:09:22 PM
to gensim
I think I have realized my mistake: author2doc should actually be a dictionary with authors as keys and lists of document IDs as values. I could be wrong, but I will make that change and see what happens.

Ivan Menshikh

Aug 9, 2017, 1:47:33 AM
to gensim, olavurm...@gmail.com
Hi Steven,
I hope Ólavur Mortensen can help you with this.

Ólavur Mortensen

Aug 9, 2017, 4:29:22 AM
to gensim

I tried printing the dictionary (fine), corpus (fine), and dictionary.id2token (empty)

As you can see in the author-topic model tutorial, I usually “initialize” the dictionary as _ = dictionary[0]. I have never known why this is necessary, but it might solve this problem for you.

In the end, this script produces 3000 tokens and 8000 documents, and 3 authors.

So 1000 documents are missing? May these documents be missing from your a2d dictionary?

I found that the author-topic model tutorial did not clearly explain any alternative ways to get data into the model.

I’m not sure what you mean. Gensim has a specific way of representing corpora (see the tutorial), as any library does. The additional requirements of the author-topic model are described in the tutorial.

Did this answer your questions?

Steven M

Aug 9, 2017, 9:28:26 AM
to gensim
Hi Ólavur,

I think I solved the issue, which was that I was creating author2doc as a dictionary mapping authors to the full texts, rather than to the indices of the documents in the corpus.
In the docs page for atmodel it says: "author2doc is a dictionary where the keys are the names of authors, and the values are lists of documents that the author contributes to." I think it should say lists of document IDs.
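A minimal sketch of the corrected mapping, using made-up rows in place of the tweet DataFrame: author2doc maps each author to the positions of their documents in the corpus, not to the raw texts.

```python
from collections import defaultdict

# Toy rows mimicking the (name, text) columns of the tweet DataFrame.
rows = [("Alice", "tax plan"), ("Bob", "jobs report"), ("Alice", "health bill")]

# Map each author to the *positions* of their documents in the corpus,
# not to the texts themselves.
author2doc = defaultdict(list)
for doc_id, (name, _text) in enumerate(rows):
    author2doc[name].append(doc_id)

print(dict(author2doc))  # {'Alice': [0, 2], 'Bob': [1]}
```

This only works if the enumeration order matches the order of documents in the corpus, so both should be built from the same iteration over the DataFrame.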

Also, I did add the `dict0 = dictionary[0]` line, and it was indeed necessary.

As for getting data into Gensim, I was just surprised that there is not a tutorial for turning a DataFrame or CSV of already-collected data into the required objects.
I think part of my frustration with this was that I didn't realize that the real problem was with how I created the author2doc.

The 1000 missing documents were simply cut by the cleaning function. They probably consisted only of links or other non-useful characters.
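This kind of silent document loss is exactly what can desynchronize author2doc from the corpus. A sketch of one way to guard against it, with toy data: drop empty documents and rebuild author2doc against the new indices so the two structures stay aligned.

```python
from collections import defaultdict

# Token lists after cleaning; some documents may end up empty
# (e.g. a tweet that was only a link), silently shrinking the corpus.
authors = ["Alice", "Bob", "Alice", "Bob"]
token_docs = [["tax", "plan"], [], ["health"], ["jobs"]]

# Drop empty documents, then rebuild author2doc against the *new*
# document indices so corpus and author2doc stay aligned.
kept = [(a, toks) for a, toks in zip(authors, token_docs) if toks]
author2doc = defaultdict(list)
docs = []
for new_id, (a, toks) in enumerate(kept):
    docs.append(toks)
    author2doc[a].append(new_id)

print(len(docs), dict(author2doc))  # 3 {'Alice': [0, 1], 'Bob': [2]}
```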

Ólavur Mortensen

Aug 9, 2017, 10:49:25 AM
to gensim
Ok, yes that is misleading.

A tutorial for how to convert popular data structures to the Gensim corpora would probably be very useful. That's a good idea.