Publication readiness is often an ambiguous decision that requires a lengthy, manual, and convoluted literature search and review. We propose a multi-stage automated solution that provides users with a topic-level readiness assessment. The core questions addressed by this pipeline are “Which topics did the author potentially overlook?” and “Which topics did the author address that are novel?” Our approach combines unsupervised text clustering with state-of-the-art large language model (LLM) reasoning to: (1) describe core topics present in semantically similar published papers that the user may have overlooked, and (2) identify novel topics the user addresses. The pipeline does not provide a binary “ready/not ready” assessment. Instead, the goal is to offer concise, qualitative data to assist the subjective decision-making of publication readiness.
Existing research has begun to explore LLMs' potential for performing literature reviews and peer reviews, but only as separate studies. To the best of our knowledge, this work is the first to connect both aspects. The result summarizes topic comparisons between a user-provided abstract and published works in similar domains. Our proposed research introduces a multi-stage, automated pipeline that optimizes document clustering for domain-specific abstracts, performs topic modelling, and personalizes topic context for the user-provided abstract through specialized LLM prompting.
Clustering of paper abstracts includes hierarchical clustering with dimension reduction to optimize for outlier precision, along with an evaluation of the effectiveness of Jaccard undersampling in producing precise, single-topic clusters. Topics are then extracted from the published documents and the user-provided abstract. To personalize the summaries, topic-enhanced retrieval-augmented generation (RAG) prompts are tested with an LLM reasoning model (e.g., Grok-3) to evaluate topic comprehension between the most similar published abstracts and the user-provided document.
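A minimal sketch of this clustering and topic-extraction stage is shown below, assuming BERTopic's standard Python API with UMAP and HDBSCAN components; the corpus and all parameter values are illustrative placeholders rather than our tuned configuration.

```python
# Sketch: cluster published abstracts with UMAP dimension reduction + HDBSCAN,
# then extract per-cluster topic keywords with BERTopic.
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

abstracts = [  # placeholder: replace with the full corpus of published abstracts
    "Abstract text of a published paper ...",
    "Abstract text of another published paper ...",
]

# Illustrative parameters, not the optimized settings reported in our experiments.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(abstracts)

# BERTopic assigns outliers to topic -1; the share of -1 labels is the outlier
# rate used to compare dimension-reduction choices.
outlier_rate = topics.count(-1) / len(topics)

# Keywords per non-outlier cluster, used later for topic comparison.
cluster_keywords = {t: [word for word, _ in topic_model.get_topic(t)]
                    for t in set(topics) if t != -1}
```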
Preliminary experiments show that pairing Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) with Uniform Manifold Approximation and Projection (UMAP) for dimension reduction is essential for avoiding outliers when clustering domain-specific documents. These experiments show an outlier reduction of over 65%, versus less than 20% with principal component analysis (PCA). We hypothesize that very large clusters may contain several smaller clusters with overlapping topics; our remaining work will test this hypothesis with Jaccard undersampling evaluations. Using BERTopic, keywords and topics have been determined for the clusters of published papers. We will proceed with topic-modelling experiments on a single user-input abstract to optimize matches with published clusters. We can then evaluate topic-enhanced RAG prompts with a reasoning LLM to produce concise and correct summaries, as sketched below.
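The following is a hedged sketch of how such a topic-enhanced RAG prompt could be assembled from the cluster keywords and the most similar published abstracts; the function name, prompt wording, and inputs are assumptions for illustration, not the prompt design we will ultimately evaluate.

```python
# Sketch: build a topic-enhanced RAG prompt comparing a draft abstract against
# the topics and representative abstracts of its most similar published clusters.
def build_topic_rag_prompt(user_abstract: str,
                           user_topics: list[str],
                           cluster_topics: dict[int, list[str]],
                           similar_abstracts: list[str]) -> str:
    published = "\n".join(f"- Cluster {cid}: {', '.join(words)}"
                          for cid, words in cluster_topics.items())
    context = "\n\n".join(similar_abstracts)
    return (
        "You are assessing topic coverage for a draft abstract.\n\n"
        f"Draft abstract:\n{user_abstract}\n\n"
        f"Topics found in the draft: {', '.join(user_topics)}\n\n"
        f"Topics found in the most similar published clusters:\n{published}\n\n"
        f"Representative published abstracts:\n{context}\n\n"
        "1. List topics present in the published clusters but absent from the draft "
        "(potentially overlooked).\n"
        "2. List topics present in the draft but absent from the published clusters "
        "(potentially novel).\n"
        "Answer concisely, referring only to the topic keywords above."
    )
```

The resulting prompt string would then be sent to the reasoning LLM (e.g., Grok-3) through its standard API; the response provides the overlooked-topic and novel-topic summaries returned to the user.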
---