Add new topic modeling algorithm to Gensim

Emil Rijcken

Jun 4, 2022, 3:17:53 PM
to gen...@googlegroups.com
Hi there, 

My name is Emil Rijcken and I am a PhD candidate at Eindhoven University of Technology. With my group, we have developed a few new topic modeling algorithms. The best one, FLSA-W, outperforms the state of the art (LDA, NMF, ProdLDA, NeuralLDA, LSI, etc.) on various open datasets (M10, BBC News, DBLP and 20NewsGroup) in terms of coherence (c_v), diversity and interpretability scores in most settings.

In this paper, we introduce the algorithm and compare it to LDA:
In this paper, we test and compare the algorithm based on predictive performance (text classification) and topic quality:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9114871/

Furthermore, we will soon submit the results of experiments on open datasets to a journal. Another paper, in which we describe the Python package FuzzyTM (featuring FLSA-W, amongst others), has been accepted at WCCI.

Here is the link to my GitHub repository: https://github.com/ERijck/FuzzyTM. In this Medium article, I discuss the package: https://towardsdatascience.com/fuzzytm-a-python-package-for-fuzzy-topic-models-fd3c3f0ae060

I would love to add FLSA-W (and perhaps the other fuzzy algorithms too) to Gensim. How can I do this?

Best,
Emil 

Radim Řehůřek

Jun 7, 2022, 7:13:13 AM
to Gensim
Hi Emil,

that sounds like a great algorithm. I checked the code and it looks well structured too.

Why do you want to add the algorithms to Gensim? What's your motivation? Especially since I see you already offer OCTIS and FLSA-W as Python packages?

Best,
Radim

Emil Rijcken

Jun 7, 2022, 6:19:00 PM
to gen...@googlegroups.com
Hi Radim, 

Thank you for getting back to me and checking my code. I have always considered Gensim the NLP hub for topic modeling, as you offer various popular algorithms and evaluation metrics*. Given our experimental results, FLSA-W fits well with the algorithms you offer. Instead of using various packages, it is much easier for a programmer to find most algorithms in one central place. Gensim offers this option and is used by many. My primary goal is for the algorithm to have a broad reach. With this goal in mind, it would be an honor to be featured by Gensim. Also, to clarify: the only Python package I offer is FuzzyTM. I have written about OCTIS before, but I did not develop it and do not offer it.
 
Let me know what you think.

Best,
Emil

*Additionally, I am using your Word2Vec implementation for another variant of the FLSA-based algorithms. The details of this algorithm are described in a technical conference paper that I submitted recently. 

Radim Řehůřek

Jun 13, 2022, 9:48:24 PM
to Gensim
Hi Emil,

that sounds good! Are you able to commit to maintaining your contribution in the future, i.e. fixing bugs and answering user questions?

Because we've had contributions in the past that we had to rip out again, due to no author support. Please keep in mind that getting your code into Gensim is an important step, but not the only step.

If you want to go ahead, please check the contributor guide and submit an initial (minimal) PR so we can kickstart the discussion around:

* the new module API (esp. around streaming of the input training data) and its testing
* performance (memory, time, possible parallelization)

The last one is especially important – there are so many different ML algorithms that users experience a research fatigue/overload (I know I do). So a clear TL;DR and pithy differentiation ("why would someone use this?") will be essential, if you want your contribution to have impact.
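
As a rough illustration of the streaming point in the list above: Gensim's existing models accept any iterable that yields one bag-of-words document at a time, so the corpus never has to fit in memory. The sketch below shows that convention using only the standard library. The class name `StreamedCorpus`, the whitespace tokenizer, and the plain-dict vocabulary are assumptions for illustration, not part of Gensim or FuzzyTM.

```python
from collections import Counter


class StreamedCorpus:
    """Hypothetical streamed corpus: one bag-of-words document per line of a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id  # plain dict mapping token -> integer id (assumed prebuilt)

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id  # tokens outside the vocabulary are skipped
                )
                # Gensim-style document: sorted list of (token_id, count) pairs.
                yield sorted(counts.items())
```

Only one line of the file is held in memory at a time, which is the property a new module would need to preserve if it is to handle corpora larger than RAM.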

Thanks,
Radim

Emil Rijcken

Jun 23, 2022, 8:42:36 AM
to gen...@googlegroups.com
Hi Radim, 

That's great! I will commit to maintaining my contribution in the future. I will update the code based on the contributor guide and will get back to you upon finishing.

Cheers,
Emil 


Emil Rijcken

Jun 29, 2022, 7:35:14 AM
to gen...@googlegroups.com
Hi Radim, 

I have updated the FuzzyTM package so that it satisfies the Gensim requirements. To contribute to Gensim with my algorithms, I see two options:
 1. Write a .py file in which we import several FuzzyTM algorithms to Gensim.
 2. Add a copy of the FuzzyTM code to Gensim.

For long-term code maintenance, I prefer the first option. What are your thoughts on this?
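
To make option 1 concrete, here is a minimal sketch of how a thin wrapper module might look up a class from an optional external package at call time, so that the host library still imports cleanly when that package is not installed. The helper name `load_external_model` is hypothetical, and the defaults (`FuzzyTM` as the importable package name, `FLSA_W` as the class it exposes) are assumptions that should be checked against the FuzzyTM documentation.

```python
import importlib


def load_external_model(class_name="FLSA_W", package="FuzzyTM"):
    """Look up a model class in an optional external package.

    The default names are assumptions for illustration only.
    """
    try:
        module = importlib.import_module(package)
    except ImportError as err:
        # Fail with a clear message instead of breaking the host library at import time.
        raise ImportError(
            f"The optional dependency '{package}' is not installed; "
            f"install it to use '{class_name}'."
        ) from err
    try:
        return getattr(module, class_name)
    except AttributeError as err:
        raise ImportError(f"'{package}' does not expose '{class_name}'.") from err
```

With this pattern the algorithm code keeps living in its own repository, and the wrapper only fails when a user actually asks for the external model.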

Best,
Emil


Gordon Mohr

Jun 29, 2022, 1:56:26 PM
to Gensim
If your preference is for the bulk of the FuzzyTM code to live in a separate code repo & PyPI package – which is a pretty good plan for maintenance/support purposes! – what is the benefit of an import into Gensim, as opposed to simply including a mention & *optional* import in some comparative demos? 

If your option (1) is relatively small/compact/straightforward, it'd be easiest to evaluate with a concrete implementation PR to review. 

- Gordon

Emil Rijcken

Jul 17, 2022, 5:30:44 PM
to gen...@googlegroups.com
Thank you for your quick answer, Gordon!

An import into Gensim seems easiest for users:
 - they can use the same variable names/types as in Gensim,
 - they only have to install one package, which contains all the different methods.

Sorry for my late reply. I have been busy writing and revising a few papers and preparing for two conferences (I will present FuzzyTM at WCCI this week). In the meantime, I have written an implementation (FLSAModel) that lets users pass the same variables into the method as they do into LdaModel. I have also added some methods with the same attributes as their LdaModel counterparts, so the models are easy to train.
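
The attached FLSAModel code is not reproduced in this thread, so the following is only a sketch of what an LdaModel-compatible entry point could look like. The class name `FLSAModelSketch`, the `bow_to_tokens` helper, and the idea that the underlying fuzzy model is trained on tokenized documents rather than bag-of-words vectors are all assumptions. Only the `corpus`, `num_topics`, and `id2word` parameter names are taken from Gensim's `LdaModel`.

```python
def bow_to_tokens(corpus, id2word):
    """Expand bag-of-words documents [(token_id, count), ...] back into token lists."""
    return [
        [id2word[token_id] for token_id, count in document for _ in range(int(count))]
        for document in corpus
    ]


class FLSAModelSketch:
    """Hypothetical adapter that accepts LdaModel-style arguments."""

    def __init__(self, corpus=None, num_topics=100, id2word=None):
        if corpus is None or id2word is None:
            raise ValueError("This sketch needs both `corpus` and `id2word`.")
        self.num_topics = num_topics
        self.id2word = id2word
        # Tokenized documents that a fuzzy topic model backend could be trained on;
        # the training call itself is intentionally left out of this sketch.
        self.documents = bow_to_tokens(corpus, id2word)
```

Note that the original word order cannot be recovered from a bag-of-words corpus, so the expanded documents only preserve token counts.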

Let me know what you think.

Best regards,
Emil



Emil Rijcken

Jul 17, 2022, 5:31:35 PM
to gen...@googlegroups.com
This time with the code attached ;) 

FLSAmodel.zip

Gordon Mohr

Jul 18, 2022, 4:52:29 PM
to Gensim
Can you show your proposed changes to Gensim as a GitHub Pull Request? That's the clearest/safest/most-precise way to evaluate & discuss code, in context.

- Gordon