Implementation of Correlated Topic Model

Gabriel L

Apr 10, 2014, 8:23:01 AM
to gen...@googlegroups.com
Hello,

For some research work, I would like to use the Correlated Topic Model (CTM), an extension of the Latent Dirichlet Allocation (LDA) model that captures relationships between the discovered topics. As far as I know, CTM is currently not supported by Gensim. I'm considering developing a CTM implementation within Gensim. Would anyone be willing to work on that with me?

Further information (bibliography, C implementation, ...) is available here: https://www.cs.princeton.edu/~blei/ctm-c/
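
For readers unfamiliar with the model: the key difference from LDA is that CTM draws each document's topic proportions from a logistic normal distribution rather than a Dirichlet, so a covariance matrix can encode which topics tend to co-occur. The following is a minimal pure-Python sketch of that single sampling step, not code from the ctm-c package; the function names and the example values of `mu` and `sigma` are made up for illustration.

```python
import math
import random


def cholesky(sigma):
    """Return lower-triangular L with L * L^T == sigma (sigma must be positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def sample_topic_proportions(mu, sigma, rng):
    """Draw eta ~ Normal(mu, sigma), then map it to the simplex with a softmax."""
    L = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    eta = [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]
    peak = max(eta)  # subtract the max for numerical stability
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values only: topics 0 and 1 are positively correlated.
rng = random.Random(0)
mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.8, 0.0],
         [0.8, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
theta = sample_topic_proportions(mu, sigma, rng)
print(theta)  # three positive proportions that sum to 1
```

With a diagonal `sigma` the topics vary independently; the off-diagonal entries are what let CTM express that two topics tend to appear together.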

Regards,
Gabriel.

Radim Řehůřek

Apr 10, 2014, 8:59:43 AM
to gen...@googlegroups.com
Sounds great Gabriel!

I could certainly help with polishing & general advice.

Radim

Gabriel L

Apr 18, 2014, 12:13:41 PM
to gen...@googlegroups.com
The code seems to be working (I've tested the algorithm on testcorpus.mm, and the results looked OK).
I still have to write the function that builds the topic graph, but that should be done quickly.
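
One simple way to derive such a graph from a fitted CTM's topic covariance matrix is to convert it to correlations and keep the pairs above a cutoff. This is only a sketch under that assumption: `topic_graph` and its `threshold` parameter are hypothetical names, the matrix values are invented, and this is not the code described in the post.

```python
import math


def topic_graph(cov, threshold=0.3):
    """Return (i, j, correlation) edges for topic pairs with |correlation| >= threshold."""
    n = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            corr = cov[i][j] / (std[i] * std[j])
            if abs(corr) >= threshold:
                edges.append((i, j, corr))
    return edges


# Invented 3-topic covariance matrix, for illustration only.
cov = [[1.0, 0.8, 0.0],
       [0.8, 1.0, -0.1],
       [0.0, -0.1, 1.0]]
edges = topic_graph(cov)
print(edges)
```

Only topics 0 and 1 are linked here; the weak negative relation between topics 1 and 2 falls below the cutoff.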

As far as I know, there is no online version of this algorithm, but D. Blei advised me to look at https://www.cs.princeton.edu/~blei/papers/PaisleyWangBlei2012.pdf which presents an online improvement of CTM...


--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/xM1JM2VKkHk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Radim Řehůřek

Apr 18, 2014, 4:23:13 PM
to gen...@googlegroups.com, gabriel.l...@m4x.org
This is great, Gabriel!

The next step will be running CTM on some well-known datasets, both to debug it/build intuition, and to serve as user documentation and tutorial.

Did you try the implementation on some "real" corpus? Have you compared its results to Blei's implementation?

I promised Lars to look at the py3 port this weekend, and I'll also have a look at this (assuming all goes well).

Cheers,
Radim

Gabriel L

Apr 19, 2014, 7:29:43 AM
to gen...@googlegroups.com
No, actually, I've just checked that the results look OK (the optimization increases the likelihood bound, the topics found for testcorpus.mm group co-occurring words, the covariance matrix shows that each topic relates to itself, ...), which was already painful to achieve :-p

I'm taking a break this weekend, but will test CTM on real corpora next week. I fear it might be very slow for big corpora, but I'm sure there are easy optimizations of my code that can speed things up (for instance, I'm doing lots of copy.deepcopy calls which could probably be avoided, etc.).

Gabriel


Allan Jabri

Apr 4, 2015, 2:56:24 PM
to gen...@googlegroups.com, gabriel.l...@m4x.org
Was this implementation of CTM for Gensim completed?

By the way, really enjoying gensim so far!
Allan

Allan Jabri

Apr 8, 2015, 2:34:40 PM
to gen...@googlegroups.com, gabriel.l...@m4x.org

Just wanted to double check if this was ever completed... Thanks!

Bhavya Nandana

Apr 29, 2020, 1:39:06 AM
to Gensim
Hello,

Kindly, let me know if the implementation is working, so that we can see if it can be useful for our research work.

Thank you,
Bhavya Kanuboddu

Ron Rogge

Apr 17, 2023, 6:13:55 PM
to Gensim
Hello Gabriel and everyone!

I am new to gensim and topic modeling. Over the weekend, I taught myself LDA and ran it on a corpus of open-ended gender narratives I collected from a large online study (which I oversampled for sexual and gender diversity). This morning, I tried getting chatGPT to help me create a corpus and run CTM to extract 20 topics from my data. The topics from CTM were noticeably more interpretable than those from LDA and chatGPT was surprisingly helpful (at least it was early in the morning before demand picked up). 

In fact, chatGPT gave me the python code it used, starting with the following:

===================================================
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from gensim.corpora import Dictionary
from gensim.models import ctmodel
===================================================
After it preprocessed the corpus and converted it into a bag of words (not showing that code to save space), it used the following code to identify the topics requested.
===================================================
# Apply CTM to identify the top 20 themes
num_topics = 20
ctm_model = ctmodel.CtModel(bow_corpus, num_topics=num_topics)
ctm_model.fit(lemmatized_tokens)
topics = ctm_model.get_topics()
===================================================

Does "ctmodel.CtModel" exist somewhere within gensim?

I have spent the entire day trying to reproduce those results in python on my own, outside of chatGPT (so that I can ensure the stability of the findings, create the necessary tables/figures, and make my de-identified data and syntax available). However, I keep getting an error that gensim does not contain an element like that.  

Did you create a CTM function for gensim? Would you be willing to share it with me? I would be sure to give you the appropriate credit. 

Any help would be greatly appreciated. Now that I've familiarized myself with the gensim code, I'd rather stick with that if at all possible. Of course, if there isn't a ctmodel function in gensim, then I can either use the BERTopic or the topicmodels modules. I just didn't anticipate going down an 8hr rabbit hole trying to get some code to run! 

Thanks!

Ron Rogge
Associate Professor of Clinical Psychology
University of Rochester

Gordon Mohr

Apr 18, 2023, 6:16:20 PM
to Gensim
There's no `gensim.models.ctmodel` module in the latest Gensim, or in the prior versions I checked. Google searches of related tokens imply some other libraries may use a similar class name, but provide no support for the idea that it was ever a Gensim module/class.
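
A quick way to check a claim like this against whatever is installed locally is to ask Python's import machinery whether the module can be located. The sketch below uses only the standard library; `module_exists` is a helper name invented here, and what it prints depends on whether (and which version of) Gensim is installed in your environment.

```python
import importlib.util


def module_exists(dotted_name):
    """Return True if the import system can locate `dotted_name`.

    Parent packages are imported as a side effect; the target module itself is not.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. gensim itself) is missing or fails to import.
        return False


for name in ("gensim.models.ldamodel", "gensim.models.ctmodel"):
    print(name, "->", module_exists(name))
```

Running a check like this before building on generated code takes seconds and would flag a nonexistent module immediately.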

So, this appears to be a bit of confusion/confabulation, from whatever ChatGPT-style tool you're using.

When you say, "chatGPT gave me the python code it used", you may also be confused: at least in its base implementations, ChatGPT models generally don't themselves run code. From what I've seen so far, that requires optional plugins, or embedding ChatGPT in some other larger system. 

In its standalone form, it can *try* to simulate the effects of code that you've provided or it has synthesized – but without using a real Python interpreter, that, too, can be erroneous/misleading in its reports of code effects. (For example, it might show the *intended* results of buggy code, which it has intuited via its more-fuzzy internal processes.)

And generally, if recounting the outputs or difficulties of using a ChatGPT-like tool, it is important to be very precise in describing what tool, version, & advanced options were used, as the offerings vary greatly in their capabilities.

That's true even within a single company's offerings: OpenAI ChatGPT-3.5 vs OpenAI ChatGPT-4.0 vs the 10+ GPT-based models available in OpenAI's playground vs the Bing search assistant based on OpenAI GPT-4 vs Github Copilot based on some other-tuned OpenAI GPT model. 

So: whatever ChatGPT you used may have given you code here that refers to an imaginary Gensim module/class, and that can't run anywhere. If it reports running code, it wasn't *this* (unrunnable) code, and it may have just imagined pleasing results, using its more general/vague/inchoate understandings of the domains you've been describing to it, rather than the actual results of applying a well-characterized topic-modelling algorithm on specific data.

- Gordon

Ron Rogge

Apr 19, 2023, 9:50:59 AM
to gen...@googlegroups.com
Gordon,

Thank you so much for getting back to me. I really appreciate it!

I have to say that was a pretty compelling confabulation on the part of ChatGPT-3.5. In fact, it was compelling enough for me to waste 8 hours of my time trying to replicate it with the code it provided. No wonder I kept hitting error after error! In hindsight, my time would have been far better spent teaching myself some other package/module: Lesson learned!!!

I am glad that I intrinsically knew that I would have to replicate all of that (confabulated) work outside of ChatGPT-3.5 if I were to ever consider publishing it. 

ONE LAST QUESTION: If I could beg just a tiny bit more of your time, I would appreciate it if you could point me in the right direction. As I will now be running natural language processing analyses using something other than gensim.models.ctmodel, what would you recommend? The following details will help clarify what I am asking:

MY DATA: I have open-ended narratives from 1769 online respondents about their gender identities and what gender means to them (taken from a sample in which I oversampled sexual and gender minorities).

POSSIBLE MODELS: I have already run an LDA which yielded 8 topics, but to be honest those LDA results were a bit underwhelming. I was excited by Correlated Topic Modeling (CTM) as another option because I do expect the topics to be correlated. What I really would like is to extract topics in a way that pays attention to how close the words occur to one another in each document (response), in hopes of extracting topics with greater coherence. Would you recommend CTM or is there another approach that might be better?

POSSIBLE MODULES/PACKAGES: Having now familiarized myself with running LDA in R and with preparing a bag-of-words corpus in python, I could theoretically proceed to familiarize myself with the code for CTM (or another approach) in either python or R. I have a slight preference for R, but would like to know what you would suggest. 

IN R: it seems that there are two potential packages I could use: topicmodels & gensimr.
IN PYTHON: it seems that I would need to use one of two packages: topicmodels or BERTopic.

To save myself some time moving forward, I would truly appreciate your guidance on the best way for me to move forward. If you, by chance, had run some topic models of the type you would suggest within the package/module you would recommend, it would be a MASSIVE help if you could provide some sample code for me to use as a template.

Thank you again for your time and support as I stumble through the early stages of topic modeling. I really appreciate it! You ROCK!!!

Ron

Gordon Mohr

Apr 20, 2023, 3:06:26 AM
to Gensim
I can understand why you might prefer techniques that exist over those that are purely imaginary, like `gensim.models.ctmodel`. 

But what might work well is highly dependent on the particulars of your data & the end purposes for which you want topics, so it's hard to make recommendations other than "try a bunch of things & see which does best on your ad-hoc or rigorous evaluations".

That any single LDA run doesn't impress isn't surprising, but there are plenty of parameters that can be experimented with that might offer better results. 1769 distinct documents isn't that many (finding more data, perhaps from a similar domain, might be one thought), but an upside of smaller data is that if each run is quick, you can run lots of comparative experiments, as long as you have some automated way to score the results for desirability. (What made you choose an 8-topic model for your initial test?)
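
As a rough illustration of what such an automated comparison could look like, the sketch below sweeps over candidate topic counts and keeps the best-scoring one. The helper names (`sweep_num_topics`, `lda_umass_score`) are invented for this example, and using Gensim's `u_mass` coherence as the score is an assumption, not a recommendation made in this thread; any other scoring function can be plugged in. Gensim is imported inside the scorer so the sweep logic itself has no dependencies.

```python
def sweep_num_topics(candidates, score_fn):
    """Score each candidate topic count and return (best_k, {k: score})."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)  # assumes higher score is better
    return best_k, scores


def lda_umass_score(tokenized_docs, num_topics, seed=0):
    """Fit one LDA model and return its u_mass coherence (requires gensim)."""
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(
        corpus=bow_corpus,
        id2word=dictionary,
        num_topics=num_topics,
        random_state=seed,  # fixed seed so runs are repeatable
        passes=10,
    )
    coherence = CoherenceModel(
        model=lda, corpus=bow_corpus, dictionary=dictionary, coherence="u_mass"
    )
    return coherence.get_coherence()
```

Given a list of token lists called `docs` (a placeholder name), a call such as `sweep_num_topics([4, 8, 12, 16], lambda k: lda_umass_score(docs, k))` would return the best-scoring topic count and all scores. Because a single run can be unrepresentative, repeating the sweep with several `seed` values and comparing the spread is a sensible next step.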

I'm not familiar with Correlated Topic Modeling, nor do I know about any reliable libraries for it. To the extent it seems a more sophisticated refinement of standard LDA, and the paper introducing it describes using it to model 16k Science articles into 100 topics, I'd be concerned it might be even more data-hungry, in order to give good results, than standard LDA. 

If your 1769 personal narratives are long enough, with natural internal breaks (paragraphs or other distinct sections), another variant worth trying might be to break them into sub-documents. 
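
A minimal sketch of that idea, assuming paragraphs are separated by blank lines: each narrative is split into paragraph-level sub-documents, and the index of the originating narrative is kept so results can be aggregated back per respondent. The function name and the `min_words` cutoff are illustrative choices, not something prescribed here.

```python
import re


def split_into_subdocuments(narratives, min_words=5):
    """Split each narrative on blank lines; return (narrative_index, paragraph) pairs."""
    subdocs = []
    for doc_id, text in enumerate(narratives):
        for paragraph in re.split(r"\n\s*\n", text):
            paragraph = paragraph.strip()
            # Drop fragments too short to carry a topic signal.
            if len(paragraph.split()) >= min_words:
                subdocs.append((doc_id, paragraph))
    return subdocs


example = [
    "one two three four five\n\nshort",
    "a b c d e f\n \nu v w x y z",
]
subdocs = split_into_subdocuments(example)
print(subdocs)
```

The sub-documents can then be tokenized and fed to the topic model in place of the full narratives.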

I have seen a lot of credible testimony of using ChatGPT-based tools to help with coding tasks, but a lot of it mentions having to iteratively prod ChatGPT to correct errors or to try improved approaches. That is, you've still got to be able to review the code, truly run it, & critically evaluate/report its outputs. And, ChatGPT-4 may be far better at coding (& other tasks) than ChatGPT-3.5.

- Gordon  