The first part of the thesis focuses on community detection in the stochastic block model (SBM), a canonical model for graphs with latent communities. Here, we propose the first parameter-free community recovery algorithm with near-optimal performance and partially resolve conjectures of Van Vu and Emmanuel Abbe on the behavior of standard spectral algorithms (NeurIPS 2023, SODA 2024).
Motivated by the limitations of the SBM in modeling real data, the second part examines graphs arising from single-cell RNA-seq (scRNA-seq) data. These graphs deviate substantially from classical assumptions. We identify a novel organization, the multi-core periphery structure with communities (MCPC), and introduce the concept of relative centrality to extract low-variance representatives of each community. This structure appears consistently across diverse scRNA-seq datasets, and exploiting it improves downstream biological inference (ICLR 2025).
Building on these observations, the final part identifies a novel density–geometry correlation in real datasets with ground-truth labels: dense regions exhibit smoother, more linear geometry, while sparse regions show greater non-linearity and overlap. This phenomenon recurs across scRNA-seq and bulk RNA-seq data, image datasets, self-supervised embeddings, and the learned representation spaces of semi-supervised models. We observe that the performance of a broad class of algorithms correlates strongly with this organization: unsupervised and semi-supervised methods succeed in dense, geometrically stable regions and degrade in sparser layers.
These findings lead to three main applications:
(1) a layer-extraction, clustering, and visualization framework for scRNA-seq that isolates stable transcriptomic states;
(2) a clustering enhancement framework that boosts simple algorithms such as K-Means and HDBSCAN to compete with state-of-the-art manifold methods at far lower cost; and
(3) geometry-aware improvements to CNN-based semi-supervised learning on standard image benchmarks.
Together, these results advance a unified view of how density and geometry govern learnability.