NLP & Genes


Renaud Seigneuric

Mar 6, 2026, 6:01:56 AM
to Europe PMC Developer Forum

Dear all,

I am new to Europe PMC and am currently setting up an NLP approach in Python to detect gene names in titles and abstracts (and possibly in the full article). I came across the article that pointed me to your GitHub repository.


However, could you please tell me where the code for detecting gene names is located, and/or which other resources I should consider for this task? I am not familiar with Java or HTML.


Thank you in advance,

R


Santosh Tirunagari

Mar 6, 2026, 7:12:21 AM
to Europe PMC Developer Forum, renaud.s...@gmail.com
Dear Renaud,

Thank you for your message and for your interest in Europe PMC and our work.

The machine-learning pipeline we use for literature processing is available here:
https://github.com/ML4LitS/CAPITAL

Within this pipeline, the model specifically used for gene name extraction is part of the annotation models repository:
https://github.com/ML4LitS/annotation_models

These repositories contain the components used to train and run the models that identify biological entities such as genes in article titles, abstracts, and other text sources.

If your goal is to perform gene name recognition on titles and abstracts, the annotation models repository should be the most relevant starting point. The CAPITAL pipeline provides additional context on how the models are used within a broader literature-processing workflow using our Annotations API.
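
For readers who would rather start from annotations that already exist than run the models themselves, here is a minimal Python sketch of querying the Annotations API for gene mentions. It is not taken from either repository. The endpoint URL, the `Gene_Proteins` type value, and the response field names (`source`, `extId`, `annotations`, `exact`, `type`) are assumptions based on the public Annotations API and should be checked against its current documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Europe PMC Annotations API; verify before use.
BASE_URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def build_url(article_ids, annotation_type="Gene_Proteins"):
    """Build a request URL for article IDs such as "MED:12345"."""
    query = urlencode({
        "articleIds": ",".join(article_ids),
        "type": annotation_type,  # assumed type name for gene/protein mentions
        "format": "JSON",
    })
    return f"{BASE_URL}?{query}"


def fetch_annotations(article_ids, timeout=30):
    """Call the API and return the decoded JSON payload (needs network access)."""
    with urlopen(build_url(article_ids), timeout=timeout) as response:
        return json.load(response)


def extract_gene_mentions(payload, annotation_type="Gene_Proteins"):
    """Map "SOURCE:ID" to the sorted, de-duplicated gene strings in the payload.

    Assumes the payload is a list of articles, each carrying an "annotations"
    list whose items have an "exact" (matched text) and a "type" field.
    """
    mentions = {}
    for article in payload:
        key = f"{article.get('source')}:{article.get('extId')}"
        names = {
            annotation["exact"]
            for annotation in article.get("annotations", [])
            if annotation.get("type") == annotation_type and "exact" in annotation
        }
        mentions[key] = sorted(names)
    return mentions
```

Calling `extract_gene_mentions(fetch_annotations(["MED:12345"]))`, with a real article ID in place of the placeholder, would return the gene names annotated for that article, provided the response matches the assumed structure.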

Please feel free to reach out if you have further questions or need clarification on any of the components.

Best wishes
Santosh