While writing the introduction of the paper, we realized that our
corpus statistics do not match the ones given in the corpus
description [1].
Our numbers are:
Training set
============
* Documents (XML files): 435
* Documents containing at least one drug pair: 399
* Total sentences: 4267
* Sentences with at least one drug pair: 2812
* Total pairs: 23827
* Total entities (drugs): 11260
* Total entities that participate in a pair: 10374
* Avg drugs per doc (considering only docs and sentences with pairs): 26.0
* Avg drugs per sentence (considering only sentences with pairs): 3.69
Test set
========
* Documents (XML files): 144
* Documents containing at least one drug pair: 134
* Total sentences: 1539
* Sentences with at least one drug pair: 965
* Total pairs: 7026
* Total entities (drugs): 3689
* Total entities that participate in a pair: 3398
* Avg drugs per doc (considering only docs and sentences with pairs): 25.36
* Avg drugs per sentence (considering only sentences with pairs): 3.52
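As a sanity check, the averages listed above appear consistent with
dividing the number of entities that participate in a pair by the number
of documents (or sentences) that contain at least one pair. A short
Python check of that arithmetic follows; the dictionary keys and the
function name are illustrative only, and the division is our reading of
how the averages relate to the counts:

```python
# Counts copied from the two lists above.
train = {"entities_in_pairs": 10374, "docs_with_pairs": 399, "sents_with_pairs": 2812}
test = {"entities_in_pairs": 3398, "docs_with_pairs": 134, "sents_with_pairs": 965}


def averages(counts):
    """Return (avg drugs per doc, avg drugs per sentence), rounded to 2 decimals.

    Assumption: both averages divide the entities that participate in a pair
    by the number of docs / sentences that contain at least one pair.
    """
    per_doc = round(counts["entities_in_pairs"] / counts["docs_with_pairs"], 2)
    per_sent = round(counts["entities_in_pairs"] / counts["sents_with_pairs"], 2)
    return per_doc, per_sent


print(averages(train))  # (26.0, 3.69)
print(averages(test))   # (25.36, 3.52)
```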
Our guess is that the stats provided in the corpus description are not
up to date. But maybe we are doing something wrong?
Attached you'll find the Python script that we used to compute these
stats (it takes a single argument: the directory with the documents,
either training or test).
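For readers without the attachment, the counting can be sketched roughly
as below. This is not the attached script, only a minimal illustration.
It assumes that each XML file holds one document whose `sentence`
elements contain `entity` elements and `pair` elements with `e1` and
`e2` attributes referring to entity ids; the function name
`corpus_stats` and the dictionary keys are hypothetical.

```python
import glob
import os
import xml.etree.ElementTree as ET


def corpus_stats(directory):
    """Count documents, sentences, entities and pairs in a directory of XML files.

    Assumed layout (not confirmed by the corpus description): one document
    per file, with <sentence> elements that hold <entity id="..."/> and
    <pair e1="..." e2="..."/> children.
    """
    stats = {
        "documents": 0,
        "documents_with_pairs": 0,
        "sentences": 0,
        "sentences_with_pairs": 0,
        "pairs": 0,
        "entities": 0,
        "entities_in_pairs": 0,
    }
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        root = ET.parse(path).getroot()
        stats["documents"] += 1
        doc_has_pair = False
        for sentence in root.iter("sentence"):
            stats["sentences"] += 1
            entities = sentence.findall("entity")
            pairs = sentence.findall("pair")
            stats["entities"] += len(entities)
            stats["pairs"] += len(pairs)
            if pairs:
                stats["sentences_with_pairs"] += 1
                doc_has_pair = True
                # Collect the distinct entity ids referenced by any pair
                # in this sentence, so each entity is counted once.
                ids_in_pairs = set()
                for pair in pairs:
                    ids_in_pairs.add(pair.get("e1"))
                    ids_in_pairs.add(pair.get("e2"))
                stats["entities_in_pairs"] += len(ids_in_pairs)
        if doc_has_pair:
            stats["documents_with_pairs"] += 1
    return stats
```

Calling `corpus_stats` on the training or the test directory returns a
dictionary of counts that can be compared line by line with the lists
above.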
[1] http://labda.inf.uc3m.es/DDIExtraction2011/dataset.pdf
Thank you.
Best regards,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com