Hello Matteo,
On 26 September 2013 23:32, Matteo Romanello <
matteo.r...@gmail.com> wrote:
>
> I'm new to this list, so please be kind to me and my first message ;)
We will do our best.
> I'm using Brat to annotate text data to perform some domain-specific named
> entity recognition.
>
> So far I've been using the
> https://github.com/nlplab/brat/blob/master/tools/BIOtoStandoff.py script to
> go from BIO to Standoff format. But now I find myself in need to go from
> Standoff to BIO and I don't quite know how to go about that.
>
> Does anyone know if there's any script in the codebase that already does
> that? Or perhaps someone has already implemented something similar?
I think that Sampo Pyysalo has written at least one such script, but I
don't think we ever included anything in the code base. The nasty
thing about going into BIO from stand-off (SO) is of course that you
lose the coupling to the actual text, so the conversion becomes largely
irreversible (oh, the joy of aligning BIO to text in O(n^2)), but for a
sequential labeller it is the way to go. What I would do is the
following.
First, decide on how to tokenise the raw text; you can "steal" some
code from `tokenise.py` if you like Python. Second, decide how to
treat tokens that are not entirely covered by an annotation
("IBM-corporation" as a token with only "IBM" being annotated, for
example). I personally tend to go for expanding the span of the
annotation to cover the entire token. A tempting alternative is to
tokenise using the annotations, but that is a trap: you end up with the
chicken/egg problem of needing to know the location of the entities in
order to tokenise properly and needing the tokenisation to detect the
entities. Third, implement the whole thing (~100 lines of Python?):
read in the SO, iterate over the tokens and check whether each token's
offsets overlap with any annotation (this I find to be the most
error-prone part of working with SO, so do some sanity testing), then
print out some nice tsv-esque BIO.
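To make the third step concrete, here is a minimal sketch. It is not the
script mentioned above and not anything from the brat code base: the
function names (`parse_ann`, `tokenise`, `to_bio`) are made up for this
example, and the whitespace tokeniser is only a stand-in for whatever
tokenisation you settle on. It assumes contiguous, non-overlapping
text-bound annotations, ignores sentence boundaries, and expands a
partially covered token to a whole-token tag.

```python
import re


def parse_ann(ann_text):
    """Collect (start, end, label) from the text-bound "T" lines of a .ann file."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes and notes are ignored here
        _id, type_offsets, _covered = line.split("\t", 2)
        if ";" in type_offsets:
            raise ValueError("discontinuous span not handled: " + line)
        label, start, end = type_offsets.split(" ")
        spans.append((int(start), int(end), label))
    return sorted(spans)


def tokenise(text):
    """Assumption: plain whitespace tokenisation, keeping character offsets."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def to_bio(text, spans):
    """Tag each token B-/I-/O; a token touching a span at all gets that span's label."""
    rows = []
    prev = None  # index of the span that covered the previous token, if any
    for tok_start, tok_end, tok in tokenise(text):
        # Offsets are half-open [start, end), so this is the overlap test.
        hit = next(
            (i for i, (s, e, _) in enumerate(spans) if s < tok_end and tok_start < e),
            None,
        )
        if hit is None:
            tag = "O"
        elif hit == prev:
            tag = "I-" + spans[hit][2]
        else:
            tag = "B-" + spans[hit][2]
        prev = hit
        rows.append((tok, tag))
    return rows


text = "IBM-corporation opened an office in New York ."
ann = "T1\tOrganization 0 3\tIBM\nT2\tLocation 36 44\tNew York\n"
rows = to_bio(text, parse_ann(ann))
print("\n".join(f"{tok}\t{tag}" for tok, tag in rows))
```

Note that "IBM-corporation" comes out as B-Organization even though only
"IBM" was annotated, which is the span expansion described above. The
overlap test is the line to cover with sanity tests.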
Note: I am assuming that you only have a single category or that you
disallow overlapping spans, otherwise we will have to open a whole new
can of worms.
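If you are unsure whether your data meets that assumption, a quick check
along these lines may help. It is only a sketch: `find_overlaps` is a
hypothetical helper, and it expects non-empty spans as (start, end, label)
tuples with half-open character offsets.

```python
def find_overlaps(spans):
    """Return pairs of spans that share at least one character."""
    ordered = sorted(spans)
    clashes = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # sorted by start, so no later span can overlap a
            clashes.append((a, b))
    return clashes


example = [(0, 3, "Organization"), (0, 15, "Organization"), (36, 44, "Location")]
for a, b in find_overlaps(example):
    print("overlap:", a, b)
```

An empty result means a single BIO column is enough; anything else needs
a decision (drop one span, merge them, or use several columns) before
converting.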
I hope the above helps; if not, just ask.
Best regards,
Pontus Stenetorp