The management of drug–drug interactions (DDIs) is a critical issue
because of the overwhelming amount of information available on them.
Natural Language Processing (NLP) techniques can provide an interesting
way to reduce the time spent by healthcare professionals on reviewing
biomedical literature. However, the shortage of annotated corpora for
DDI extraction is the main bottleneck in the development of NLP systems
for this area of pharmacovigilance. For this reason, we are
pleased to announce that the DDI corpus, a corpus annotated with
pharmacological substances and drug-drug interactions (DDIs), is now
available at http://labda.inf.uc3m.es/ddicorpus.
The DDI corpus is made up of 792 texts selected from the DrugBank
database and 233 MedLine abstracts on the subject of DDIs. The
corpus was annotated with a total of 18,502 pharmacological substances
and 5,028 DDIs, including both pharmacokinetic (PK) and
pharmacodynamic (PD) interactions. To date, other corpora annotated with
DDIs have focused on PK DDIs, not on PD DDIs.
Annotation guidelines were developed by domain experts in order to
ensure a high-quality, reliable and accurate annotation of the corpus.
Pharmacological substances were classified according to four entity
types: drug (for generic drugs), brand (for trade drugs), group (for
drug classes) and drug_n (for active substances not approved for human
use). DDIs were also classified into four types: mechanism (for DDIs
describing the way the interaction occurs), effect (for DDIs describing
the consequence of the interaction), advice (for DDIs described by a
recommendation or advice) and int (for DDIs without any additional
information). Inter-Annotator Agreement (IAA) was measured to assess the
consistency and quality of the corpus. The agreement was almost perfect
(Kappa up to 0.96 and generally over 0.80), except for the DDIs in the
MedLine database (0.55–0.72).
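To make the agreement figures concrete, here is a minimal Python sketch of how a kappa value can be computed over the four DDI types. It assumes Cohen's kappa for two annotators; the announcement only says "Kappa", so the exact variant used for the corpus is an assumption here. The function name cohen_kappa and the toy label lists are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# DDI type labels as listed in the announcement.
DDI_TYPES = ("mechanism", "effect", "advice", "int")


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the corpus paper may use a different kappa variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[t] * freq_b[t] for t in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations, not taken from the corpus.
annotator_1 = ["effect", "effect", "mechanism", "advice", "int", "effect"]
annotator_2 = ["effect", "mechanism", "mechanism", "advice", "int", "effect"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

For these toy lists the annotators agree on 5 of 6 items, chance agreement is 10/36, and kappa comes to 10/13, roughly 0.77.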
The DDI corpus was developed for the SemEval-2013 DDIExtraction 2013 task
(http://www.cs.york.ac.uk/semeval-2013/task9/),
whose main goal was to provide a common framework for the evaluation of
information extraction techniques applied to the recognition and
classification of pharmacological substances (DrugNER subtask) and the
detection and classification of drug-drug interactions (DDIExtraction
subtask) from biomedical texts. The DDI corpus is a valuable
gold standard for research groups interested in the recognition of
pharmacologically active substances (drugs, groups of drugs, toxins,
etc.) and for those working specifically on DDI relation
extraction.
The DDI corpus is divided into two datasets: training and test.
The training dataset is the same for both subtasks and contains
gold-standard annotations of pharmacological substances and their
interactions. It consists of 714 texts (572 from DrugBank and 142
MedLine abstracts) annotated with pharmacological substances
(13,029 from DrugBank and 1,826 from MedLine) and a total of 4,037 DDIs
(3,805 from DrugBank and 232 from MedLine). The test dataset for the
DrugNER subtask consists of 52 DrugBank texts (annotated with 303
pharmacological substances) and 58 MedLine abstracts (with 382
pharmacological substances). The test dataset for the DDIExtraction
subtask consists of 158 DrugBank texts (annotated with 889 DDIs) and
33 MedLine abstracts (with 95 DDIs).
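As a quick sanity check on the figures above, the following Python sketch tabulates the per-source counts of texts, DDIs and test-set substances and sums them per split. The numbers are copied from this announcement; the dictionary layout and the helper name totals are only illustrative assumptions, not part of the corpus distribution.

```python
# Counts copied from the announcement text, per source and split.
TRAINING = {
    "DrugBank": {"texts": 572, "ddis": 3805},
    "MedLine": {"texts": 142, "ddis": 232},
}
NER_TEST = {
    "DrugBank": {"texts": 52, "substances": 303},
    "MedLine": {"texts": 58, "substances": 382},
}
DDI_TEST = {
    "DrugBank": {"texts": 158, "ddis": 889},
    "MedLine": {"texts": 33, "ddis": 95},
}


def totals(split):
    """Sum each count across the sources of one split (hypothetical helper)."""
    result = {}
    for counts in split.values():
        for name, value in counts.items():
            result[name] = result.get(name, 0) + value
    return result


for label, split in (("training", TRAINING),
                     ("DrugNER test", NER_TEST),
                     ("DDIExtraction test", DDI_TEST)):
    print(label, totals(split))
```

The training sums reproduce the stated 714 texts and 4,037 DDIs, and the breakdown also makes clear how small the MedLine share of annotated DDIs is compared with DrugBank.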
We hope that the release of this dataset will encourage further research on the DDI problem.
A detailed description of the DDI corpus and the DDIExtraction 2013 task can be found in the following articles:
- María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, Thierry
Declerck. The DDI corpus: An annotated corpus with pharmacological
substances and drug–drug interactions. Journal of Biomedical
Informatics, Volume 46, Issue 5, October 2013, Pages 914-920, ISSN
1532-0464,
http://dx.doi.org/10.1016/j.jbi.2013.07.011.
- Isabel Segura-Bedmar, Paloma Martínez, María
Herrero-Zazo. SemEval-2013 Task 9: Extraction of Drug-Drug Interactions
from Biomedical Texts (DDIExtraction 2013). In Proceedings of the 7th
International Workshop on Semantic Evaluation (SemEval 2013).
Contact info:
Isabel Segura-Bedmar (ise...@inf.uc3m.es)