NLP & Genes

Renaud Seigneuric

Mar 6, 2026, 6:01:56 AM

to Europe PMC Developer Forum

Dear all,

I am new to Europe PMC and am currently setting up an NLP approach in Python to detect gene names in titles and abstracts (and possibly full-text articles). I came across the article that pointed me to your GitHub repository.

However, could you please tell me where the code for detecting gene names is located, and/or which other resources I should consider for this task, as I am not familiar with Java or HTML?

Thank you in advance,

Santosh Tirunagari

Mar 6, 2026, 7:12:21 AM

to Europe PMC Developer Forum, renaud.s...@gmail.com

Dear Renaud,

Thank you for your message and for your interest in Europe PMC and our work.

The machine-learning pipeline we use for literature processing is available here:
https://github.com/ML4LitS/CAPITAL

Within this pipeline, the model specifically used for gene name extraction is part of the annotation models repository:
https://github.com/ML4LitS/annotation_models

These repositories contain the components used to train and run the models that identify biological entities such as genes from article titles, abstracts, and other text sources.
If your goal is to perform gene name recognition from titles and abstracts, the annotation models repository should be the most relevant starting point. The CAPITAL pipeline provides additional context on how the models are used within a broader literature processing workflow using our Annotations API.
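
For readers who only need the gene annotations Europe PMC has already produced, rather than running the models themselves, a minimal Python sketch of querying the Annotations API might look like the following. Please treat the details as assumptions to check against the official API documentation: the endpoint URL, the query parameters (`articleIds`, `type`, `format`), the `Gene_Proteins` type value, and the response fields (`source`, `extId`, `annotations`, `exact`) are written from memory, and the helper names are purely illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Europe PMC Annotations API; verify against the official docs.
BASE_URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def build_url(article_ids, annotation_type="Gene_Proteins"):
    """Build a request URL for article IDs such as 'MED:12345678' (hypothetical helper)."""
    query = urlencode({
        "articleIds": ",".join(article_ids),  # assumed parameter name
        "type": annotation_type,              # assumed type value for genes/proteins
        "format": "JSON",
    })
    return f"{BASE_URL}?{query}"


def extract_gene_mentions(payload):
    """Collect the distinct annotated strings per article from a parsed JSON response.

    Assumes the response is a list of articles, each carrying 'source', 'extId'
    and a list of 'annotations' whose 'exact' field holds the matched text.
    """
    mentions = {}
    for article in payload:
        key = f"{article.get('source')}:{article.get('extId')}"
        names = {a["exact"] for a in article.get("annotations", []) if "exact" in a}
        mentions[key] = sorted(names)
    return mentions


def fetch_gene_mentions(article_ids):
    """Query the API (network access required) and return gene mentions per article."""
    with urlopen(build_url(article_ids), timeout=30) as response:
        return extract_gene_mentions(json.load(response))
```

Calling `fetch_gene_mentions(["MED:<pmid>"])` with a real PubMed ID would then return a dictionary mapping each article to its annotated gene or protein strings. Keeping URL construction and response parsing separate from the network call makes both easy to check offline.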

Please feel free to reach out if you have further questions or need clarification on any of the components.

Best wishes

Santosh
