Master 2 -- Internship Proposal: Multi-source Online Topic Modeling
Duration. 6 months
Profile. Master 2 or equivalent in Statistics, Data Science, or Artificial Intelligence.
Location. Centre Inria d’Université Côte d’Azur, 2004, route des Lucioles BP 93 06902 Sophia Antipolis Cedex.
Stipend. Standard internship stipend (~600 euros/month).
Description. Online topic models are unsupervised algorithms that identify latent topics in continuously evolving textual data streams. Although these methods align naturally with real-world scenarios, they have received considerably less attention from the community than their offline counterparts because they raise additional challenges. In [2, 3] we propose SB-SETM, a model that extends the Embedded Topic Model (ETM) [1] to process data streams by merging models trained on successive partial document batches. SB-SETM (i) leverages a truncated stick-breaking construction for the per-document topic distribution, enabling the model to infer automatically from the data the appropriate number of active topics at each time step; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space.
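For readers unfamiliar with the truncated stick-breaking construction mentioned above, the following is a minimal illustrative sketch in Python, not the SB-SETM implementation. It only shows how K-1 stick fractions in (0, 1) map to K topic proportions that sum to one. The function names, the Beta(1, 2) draws used to simulate the fractions, and the 0.01 activity threshold are assumptions made for illustration; in an actual model the fractions would be inferred from the data.

```python
import numpy as np


def truncated_stick_breaking(v):
    """Map stick fractions v (shape (..., K-1), entries in (0, 1))
    to K topic proportions that sum to one along the last axis."""
    v = np.asarray(v, dtype=float)
    # Stick length left after each break: (1-v1), (1-v1)(1-v2), ...
    remaining = np.cumprod(1.0 - v, axis=-1)
    # Stick length available before each break: 1, (1-v1), ...
    ones = np.ones(v.shape[:-1] + (1,))
    available = np.concatenate([ones, remaining[..., :-1]], axis=-1)
    first = v * available          # proportions of topics 1..K-1
    last = remaining[..., -1:]     # truncation: leftover mass goes to topic K
    return np.concatenate([first, last], axis=-1)


def count_active_topics(theta, threshold=0.01):
    """Count topics whose mean proportion over documents exceeds a
    (hypothetical) activity threshold."""
    return int(np.sum(theta.mean(axis=0) > threshold))


# Simulated fractions for 5 documents and a truncation level of K = 10.
# The Beta(1, 2) distribution is an assumption chosen for illustration.
rng = np.random.default_rng(0)
v = rng.beta(1.0, 2.0, size=(5, 9))
theta = truncated_stick_breaking(v)
n_active = count_active_topics(theta)
```

Because each break takes a fraction of what is left, later topics tend to receive less mass, so the number of topics with non-negligible weight can stay below the truncation level K. This is the sense in which the construction lets the data determine the number of active topics.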
The goal of the internship is to extend SB-SETM to a multi-source online setting, where, at each time step, document batches originate from different information sources. Experiments will be conducted on disinformation datasets to analyze how sensitive topics are framed and evolve across heterogeneous information sources over time.
Keywords. Topic Modeling, Natural Language Processing, Disinformation Detection.
[1] A. B. Dieng, F. J. Ruiz, and D. M. Blei. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8:439–453, 2020.
[2] F. Granese, B. Navet, S. Villata, and C. Bouveyron. Merging embedded topics with optimal transport for online topic modeling on data streams. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 290–307. Springer, 2025.
[3] F. Granese, S. Villata, and C. Bouveyron. Stick-breaking embedded topic model with continuous optimal transport for online analysis of document streams. In International Conference on Artificial Intelligence and Statistics, 2026.