Serialized author-topic-model incorrect document count


Joseph Emmens

Feb 15, 2024, 4:33:38 AM
to Gensim
Dear Gensim team,

I am estimating a serialized author-topic model:

    from gensim.models import AuthorTopicModel
    from gensim.test.utils import temporary_file

    # mm_corpus, inv2doc, dictionary and num_topics are defined earlier.
    with temporary_file("serialized") as s_path:
        model_serialized = AuthorTopicModel(
            mm_corpus, author2doc=inv2doc, id2word=dictionary, num_topics=num_topics,
            serialized=True, serialization_path=s_path, iterations=10, random_state=1234,
            alpha='auto', eta=1/num_topics,
            gamma_threshold=0.0001, eval_every=1, passes=20
        )
        model_serialized.update(mm_corpus, inv2doc)

When I look over the log file, the number of documents, features, iterations, passes, etc. are all correct at first. However, after completing the 20 passes, the model repeats the estimation in full, starting again at pass 0. This time it reports double the number of documents (1009284), but then repeats the estimation on the original number (504642). I attach snapshots from the log file below.

My questions are:
  1. Why does the model repeat the full 20 passes twice?
  2. Why does the word count double during the saving process? Is this a counting error in the logging process, or is the model actually repeating observations?
Thanks,
Joe
Screenshot 2024-02-15 at 10.24.29.png
Screenshot 2024-02-15 at 10.26.29.png
Screenshot 2024-02-15 at 10.26.07.png

Gordon Mohr

Feb 15, 2024, 2:49:12 PM
to Gensim
In general, if you specify a corpus when instantiating a Gensim model class, its initialization *includes* an automatic launch of training with that corpus. So, for example, in the source for AuthorTopicModel, you can see that the end of `__init__()` already performs the same `.update()` call that you make explicitly in your code.


So you likely want either to specify the corpus only in your instantiation (with no explicit `.update()`), or to leave the corpus unspecified when creating the model and then make your explicit `.update()` call. Otherwise, you'll have requested that the same corpus be processed twice.
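
To make the pattern concrete, here is a minimal sketch in plain Python. `ToyModel` is a hypothetical stand-in written for illustration, not Gensim code; it only assumes the behaviour described above, namely that a corpus passed to the constructor triggers training straight away. Its `docs_seen` counter is likewise an invention for the example, and whether Gensim's logged document count doubles for exactly this reason is not something the sketch can confirm.

```python
class ToyModel:
    """Hypothetical stand-in (not Gensim): trains on construction if given a corpus."""

    def __init__(self, corpus=None):
        self.update_calls = 0  # how many times training was launched
        self.docs_seen = 0     # documents processed across all training calls
        if corpus is not None:
            # Assumed behaviour: a corpus in the constructor starts training.
            self.update(corpus)

    def update(self, corpus):
        self.update_calls += 1
        self.docs_seen += len(corpus)


corpus = [["human", "computer"], ["graph", "trees"], ["graph", "minors"]]

# Corpus in the constructor AND an explicit update(): processed twice.
both = ToyModel(corpus)
both.update(corpus)

# Option 1: corpus in the constructor only.
init_only = ToyModel(corpus)

# Option 2: no corpus in the constructor, then one explicit update().
update_only = ToyModel()
update_only.update(corpus)

print(both.update_calls, both.docs_seen)
print(init_only.update_calls, init_only.docs_seen)
print(update_only.update_calls, update_only.docs_seen)
```

Either of the two single-call options processes the corpus once; only the combination of both calls processes it twice.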

- Gordon

Joseph Emmens

Feb 15, 2024, 3:48:17 PM
to gen...@googlegroups.com
That's exactly right. It's my fault for copying the example in the Gensim author-topic model documentation, which includes both calls.

Thank you so much. I've re-run it on test data and it only estimates once.


