Corpus stats

Santiago M. Mola

Jun 20, 2011, 8:19:54 AM
to ddiextra...@googlegroups.com
Dear organizers,

While writing the introduction of the paper, we realized that our
corpus stats do not match those given in the corpus
description [1].

Our numbers are:

Training set
=========

* Documents (XML files): 435
* Documents containing at least one drug pair: 399
* Total sentences: 4267
* Sentences with at least one drug pair: 2812
* Total pairs: 23827
* Total entities (drugs): 11260
* Total entities that participate in a pair: 10374
* Avg. drugs per doc (considering only docs and sentences with pairs): 26.0
* Avg. drugs per sentence (considering only sentences with pairs): 3.69

Test set
======

* Documents (XML files): 144
* Documents containing at least one drug pair: 134
* Total sentences: 1539
* Sentences with at least one drug pair: 965
* Total pairs: 7026
* Total entities (drugs): 3689
* Total entities that participate in a pair: 3398
* Avg. drugs per doc (considering only docs and sentences with pairs): 25.36
* Avg. drugs per sentence (considering only sentences with pairs): 3.52
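
As a quick consistency check, the averages in both lists can be reproduced from the counts, assuming they are computed as "entities that participate in a pair" divided by the number of documents (or sentences) with at least one pair. That formula is inferred from the numbers themselves, not taken from the attached script, and the dictionary keys below are made up for illustration.

```python
# Counts copied from the two lists above.
counts = {
    "training": {"docs_with_pairs": 399, "sentences_with_pairs": 2812,
                 "entities_in_pairs": 10374},
    "test": {"docs_with_pairs": 134, "sentences_with_pairs": 965,
             "entities_in_pairs": 3398},
}


def averages(c):
    """Return (avg drugs per doc, avg drugs per sentence), rounded to 2 places.

    Assumption: both averages use only entities that participate in a pair.
    """
    per_doc = c["entities_in_pairs"] / c["docs_with_pairs"]
    per_sentence = c["entities_in_pairs"] / c["sentences_with_pairs"]
    return round(per_doc, 2), round(per_sentence, 2)


for name, c in counts.items():
    print(name, averages(c))
# training (26.0, 3.69)
# test (25.36, 3.52)
```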

Our guess is that the stats provided in the corpus description are not
up to date. Or maybe we are doing something wrong?

Attached you'll find the Python script that we used to output these
stats (it takes a single argument: the directory with the documents,
either training or test).
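
For readers without the attachment, here is a minimal sketch of how such counts could be gathered. It is not the attached count.py. It assumes each XML file is one document whose `<sentence>` elements contain `<entity id="...">` and `<pair e1="..." e2="...">` children; those element and attribute names, and the function name `corpus_stats`, are assumptions and may need adjusting to the actual corpus format.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def corpus_stats(directory):
    """Count documents, sentences, pairs and entities in a directory of XML files.

    Assumed (hypothetical) layout: one document per file, with <sentence>
    elements holding <entity id="..."> and <pair e1="..." e2="..."> children.
    """
    stats = dict.fromkeys(
        ["documents", "documents_with_pairs", "sentences",
         "sentences_with_pairs", "pairs", "entities", "entities_in_pairs"], 0)
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(path).getroot()
        stats["documents"] += 1
        doc_has_pair = False
        for sentence in root.iter("sentence"):
            stats["sentences"] += 1
            entities = sentence.findall("entity")
            pairs = sentence.findall("pair")
            stats["entities"] += len(entities)
            stats["pairs"] += len(pairs)
            if pairs:
                doc_has_pair = True
                stats["sentences_with_pairs"] += 1
                # Distinct entity ids referenced by any pair in this sentence.
                paired_ids = {p.get(k) for p in pairs for k in ("e1", "e2")}
                stats["entities_in_pairs"] += len(paired_ids)
        if doc_has_pair:
            stats["documents_with_pairs"] += 1
    return stats


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python this_script.py <directory with the XML documents>
    for key, value in corpus_stats(sys.argv[1]).items():
        print(f"{key}: {value}")
```

Run once on the training directory and once on the test directory, this would print one count per line for each set.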

[1] http://labda.inf.uc3m.es/DDIExtraction2011/dataset.pdf

Thank you.

Best regards,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com

count.py

advaned

Jun 21, 2011, 10:27:52 AM
to DDIExtraction2011
Unfortunately, the stats provided in the corpus description were wrong
(some figures refer to the whole corpus).
We are sorry. Your stats are right.
Thank you, Isabel