Some examples appear to have incorrect annotation spans


John Giorgi

Jun 24, 2019, 10:57:42 AM
to BB 2019
Hi,

My apologies if I am mistaken, but some training examples under the "BB-rel+ner: Entity recognition and relation extraction" dataset appear to have incorrect annotation spans.

I noticed this when I was trying to convert the train partition to CoNLL format using this script. Here are some of the files with issues:
  • 18524407-001
  • 22177851-002
  • 22177851-004
But it appears there are more. I also checked by loading these problematic files into brat, which likewise reported mismatched annotation spans.

As far as I can tell, these mismatched annotation spans arise because of erroneous newlines in the .txt files. I added an example (18524407-001).

If these are actually real errors and not something funky happening on my end, is it possible to have them fixed?


BB-rel+ner-F-18524407-001.txt

Louise Deleger

Jun 24, 2019, 3:14:27 PM
to BB 2019
Hi John,

We apologize for the inconvenience and will investigate the issue.

From what you describe, I can already give you a few pointers about what's going on in these files (although we will double check that there is not anything else):
  • The newlines in the txt files are not "erroneous" per se. By that I mean that they were present in the original documents we collected and annotated, and we have not pre-processed the texts to remove them.
  • Regarding the annotation spans in the .a2 files, these are actually "normalized" spans (newlines are replaced by spaces), because the BioNLP-ST format requires one annotation per line. The given offsets are, however, correct, and they should allow you to reconstruct the original span if need be (see the sketch after this list).
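For instance, here is a minimal, self-contained Python sketch of the offset convention (the sentence and the offsets are made up for illustration, not taken from the corpus):

# Minimal sketch of the BioNLP-ST offset convention (made-up example).
text = "Lactobacillus was isolated from the gut\nof mice."

# Suppose the .a2 file carried the line:
#   T1<TAB>Habitat 36 47<TAB>gut of mice
start, end, surface = 36, 47, "gut of mice"

original_span = text[start:end]                       # "gut\nof mice" (newline kept)
assert original_span.replace("\n", " ") == surface    # the offsets are correct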
I understand that this formatting issue is very inconvenient for the tools and format that you use. We will discuss it and see if we can do something about it.

Best regards,

Louise

John Giorgi

Jun 24, 2019, 4:42:08 PM
to BB 2019
Hi Louise,

Thanks so much for the quick response and the clarification! Knowing that this isn't an error, I think I will just preprocess the problematic files before trying to convert to a CoNLL-like format.

I know this would be extra work for the organizers, but one possible solution is to make the data available in a CoNLL-2004-like format (example here). That way, the data would already be tokenized for task participants.

Thanks,
John.

Robert Bossy

Jun 24, 2019, 4:47:49 PM
to BB 2019
Dear John,

It appears that Louise made the correct diagnosis. We're sorry for this, but please consider that these files are provided exactly as they were downloaded; such is the hard life of NLP production services...

As a workaround, I suggest feeding the conversion script or brat with modified versions of the texts, in which all newlines are replaced with a single space. Something like:

tr '\n' ' ' < file.txt > modified.txt

With respect to annotation spans and rendering, this should not change anything, since the character offsets of the entities remain valid.
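If it is more convenient, a Python sketch of the same workaround (the directory name here is illustrative) could be:

# Rewrite every .txt file with newlines replaced by spaces; offsets are
# unchanged (one newline -> one space), so the .a1/.a2 annotation files
# still apply to the modified texts. The directory name is illustrative.
from pathlib import Path

for txt in Path("BB-rel+ner_train").glob("*.txt"):
    modified = txt.read_text(encoding="utf-8").replace("\n", " ")
    txt.with_name(txt.stem + ".modified.txt").write_text(modified, encoding="utf-8")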

Robert

Robert Bossy

Jun 26, 2019, 4:38:32 AM
to BB 2019
Hi John,

I apologize for the delayed response.

We have been using the BioNLP-ST format since 2011 and have contemplated using other formats. We considered the CoNLL format but ruled it out for several reasons:
  - we have never provided gold tokenization or gold sentence splitting, still don't, and probably never will
  - the BIO model is impractical for representing discontinuous and overlapping entities, and these phenomena cannot be neglected in the BB corpus (see the sketch after this list)
  - relations and normalization are awkward to represent due to the lack of annotation identifiers

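To illustrate the point about the BIO model, here is a small invented example (the sentence and tags are not from the corpus): in BB, a compound term like "gut of mice" and the nested "mice" could both be annotated as Habitat entities.

# Why one BIO tag per token cannot encode overlapping entities.
tokens = ["bacteria", "from", "the", "gut", "of", "mice"]

outer = ["O", "O", "O", "B-Habitat", "I-Habitat", "I-Habitat"]  # "gut of mice"
inner = ["O", "O", "O", "O",         "O",         "B-Habitat"]  # nested "mice"

# A single column must pick ONE tag for "mice": keeping the outer span
# silently drops the nested entity, and vice versa.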
We do provide a tokenization and sentence splitting on the Web site (in the section "Supporting Resources"), though they are not gold: they were computed with a domain-specific set of patterns.

If you happen to succeed at translating the data into CoNLL, and if you agree, then I think it may be a valuable resource to provide to other participants.

Best regards,
Robert

John Giorgi

Jul 3, 2019, 12:27:15 PM
to BB 2019
Hi Robert, Louise,

Thank you both for the explanation and the suggestions! I will use those to convert to the CoNLL 2004 format.

I see now why you ruled out CoNLL. Do you know off-hand what percentage of training examples are discontinuous or contain overlapping entities? My model does not account for these, and I just want to know whether it has any chance of being competitive on this dataset without modification.

If I am able to convert to the CoNLL 2004 format successfully, I will get back to you in case you want to make it available to other participants. In the case of overlapping entities, I will just default to the longest span; in the case of discontinuous spans, I will just break them into separate entities (a rough sketch follows below).
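Concretely, I am thinking of something like the following (the entity representation is my own invention for illustration, not the task format):

# Fallback rules sketch. Entities are represented here as
# (type, [(start, end), ...]): more than one fragment iff discontinuous.

def split_discontinuous(entity):
    # Rule 2: break a discontinuous entity into one entity per fragment.
    etype, fragments = entity
    return [(etype, [frag]) for frag in fragments]

def keep_longest(entities):
    # Rule 1: among overlapping (contiguous) entities, greedily keep the longest.
    kept, spans = [], []
    for e in sorted(entities, key=lambda e: e[1][0][1] - e[1][0][0], reverse=True):
        (s, f), = e[1]
        if all(f <= ks or s >= kf for ks, kf in spans):
            kept.append(e)
            spans.append((s, f))
    return kept

entities = [("Habitat", [(36, 39), (43, 47)]),  # discontinuous
            ("Habitat", [(36, 47)]),            # overlaps the one below
            ("Habitat", [(43, 47)])]
flat = [e for ent in entities for e in split_discontinuous(ent)]
print(keep_longest(flat))  # -> [('Habitat', [(36, 47)])]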

Robert Bossy

Jul 4, 2019, 3:47:58 AM
to BB 2019
Hi,

Here's a quick answer to your question; I'll be back later with a more substantiated one.

Discontinuous entities are a marginal phenomenon; I'd say less than 5% of entities are discontinuous.

However, overlapping entities are very common in this corpus, especially Habitat entities: compound Habitat terms often denote nested habitats. You will find several examples in the annotation guidelines: https://drive.google.com/file/d/1G0po_xlRjQCZ-qxuA_4PLdipXU6rtYTp/view

Best,
Robert

Robert Bossy

Jul 4, 2019, 4:47:06 AM
to bb-...@googlegroups.com
Hi,

I did the actual counting on the union of training and development sets.

3.6% of entities are discontinuous. Discontinuity is significantly rarer for entities of type Microorganism (less than 1%).

17.4% of entities overlap another entity. For Habitat entities, the proportion reaches 26%. Note that when two entities overlap each other, I counted only one of them as overlapping.

As for competitiveness, you should consider:
  1. We expect to receive several predictions from NN architectures, especially [bi-]LSTMs, which won't handle discontinuity and overlaps well. That said, I'm sure some teams will address these problems specifically, with other algorithms.
  2. We are providing a diverse set of metrics that will highlight the strengths of different submissions. We hope to be able to acknowledge the advantages of systems into which people have put some effort.
  3. We will also have a focus on openness: in production and service situations (which we are aiming at with this dataset), the ability to replicate a result may be as important as raw performance.

Best,
Robert