I am trying to run an author-topic model on some tweets, around 9000 in total, from only 3 authors. I would like to get this kind of pipeline, from a DataFrame to an author-topic model, working consistently. I pieced this script together from several tutorials, but the author-topic model tutorial did not clearly explain alternative ways to get data into the model. In the end, the script produces 3000 tokens, 8000 documents, and 3 authors. I tried printing the dictionary (fine), the corpus (fine), and dictionary.id2token (empty), but I read that id2token is only created as needed and should work fine even if print shows it as empty. I'm just not sure what is going wrong; any help would be appreciated.
import sys
import os
import time
import pandas
import numpy
from gensim import corpora
from collections import defaultdict
from stop_words import get_stop_words
from recordlinkage.standardise import clean
from nltk.tokenize import RegexpTokenizer
tweet_df = pandas.read_csv('leadertweets.csv')
a2d = defaultdict(list)
for i, j in zip(tweet_df['name'], tweet_df['tweets/text']):
    a2d[i].append(j)
documents = clean(tweet_df['tweets/text'])
documents = documents.tolist()
stopwords = get_stop_words('en')
tokenizer = RegexpTokenizer(r'\w+')
tokens = []
for text in documents:
    token = tokenizer.tokenize(str(text))
    tokens.append(token)
cleaned_tokens = []
for l in tokens:
    cleaned_tokens.append([i for i in l if i not in stopwords])
dictionary = corpora.Dictionary(cleaned_tokens)
array = numpy.asarray(cleaned_tokens, dtype=object)  # ragged token lists need dtype=object on recent NumPy
max_freq = 0.8
min_wordcount = 5
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)
corpus = [dictionary.doc2bow(word) for word in array]
print('Number of authors: %d' % len(a2d))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
from gensim.models import AuthorTopicModel
%time model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                               author2doc=a2d, chunksize=2000, passes=1, eval_every=0, \
                               iterations=1, random_state=1)
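One thing that may be worth checking is the shape of a2d. As far as I understand the gensim documentation, author2doc is expected to map each author name to a list of integer document indices (positions in the corpus), whereas the loop above stores the tweet text itself. Below is a minimal sketch of the index-based mapping. The `names` and `texts` lists are hypothetical stand-ins for tweet_df['name'] and tweet_df['tweets/text'], and the sketch assumes no rows are dropped or reordered between the DataFrame and the corpus.

```python
from collections import defaultdict

# Hypothetical stand-ins for tweet_df['name'] and tweet_df['tweets/text'].
names = ["alice", "bob", "alice", "carol"]
texts = ["first tweet", "second tweet", "third tweet", "fourth tweet"]

# Map each author to the positions of their documents, not to the texts.
# Position i here must match corpus[i] in the real pipeline (assumption:
# the corpus is built from the same rows in the same order).
author2doc = defaultdict(list)
for doc_id, (name, _text) in enumerate(zip(names, texts)):
    author2doc[name].append(doc_id)

author2doc = dict(author2doc)
print(author2doc)
```

Separately, the gensim author-topic tutorial accesses an entry such as `dictionary[0]` before passing dictionary.id2token, which appears to be what populates that mapping, so an empty id2token at the time of the model call may also be part of the problem.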