Hello,
I am totally new to brat's standoff format, which is based on start/end character offsets.
I mostly use the NLTK library in Python. It provides various corpus readers that support different formats of annotated text.
What I want to do is process the .ann format properly so that it can be handled in NLTK.
For example, take this sentence from the brat tutorial and its annotations:
1 ) Citibank was involved in moving about $100 million for Raul Salinas
de Gortari, brother of a former Mexican president, to banks in
Switzerland.
T1 Organization 418 426 Citibank
T2 Money 456 468 $100 million
T3 Transfer-money 443 449 moving
E1 Transfer-money:T3 Giver-Arg:T1 Money-Arg:T2 Beneficiary-Arg:T4 Recipient-Arg:T6
T4 Person 473 496 Raul Salinas de Gortari
T5 Person 511 535 former Mexican president
T6 Organization 540 545 banks
T7 GPE 549 560 Switzerland
R2 Origin Arg1:T6 Arg2:T7
#1 AnnotatorNotes T2 100000000 USD
R1 Family Arg1:T4 Arg2:T5
A1 Mention T4 Name
A2 Individual T4
A3 Mention T5 Nominal
A4 Individual T5
A5 Confidence E1 High
N1 Reference T5 Wikipedia:64488 Carlos Salinas de Gortari
I want to get the parsed data in IOB format, like this:
Citibank NNP B-ORG
was VBD O
involved VBN O
in IN O
moving VBG O
about RB O
$ $ B-MONEY
100 CD I-MONEY
million CD I-MONEY
for IN O
Raul NNP B-PER
Salinas NNP I-PER
de FW I-PER
Gortari NNP I-PER
, , O
brother NN O
of IN O
a DT O
former JJ B-PER
Mexican JJ I-PER
president NN I-PER
, , O
to TO O
banks NNS B-ORG
in IN O
Switzerland NNP B-GPE
. . O
Of course, POS tagging has to be run separately on the .txt file. Consequently, the tokenized and POS-tagged output will no longer line up with the start/end offsets of the entities in the .ann file.
So how can I locate an entity, say 'Raul Salinas de Gortari', in the POS-tagged text? And once I can, how do I put the IOB tags on the relevant tokens?
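One direction I can imagine, as a rough sketch rather than a confirmed solution: if the tokenizer records each token's character offsets in the original .txt, the offsets in the .ann file never need to change, and each token can be checked against the entity spans directly. The function names `parse_ann_entities` and `to_iob` and the simple regex tokenizer are my own assumptions, not part of brat or NLTK. The sketch also assumes tab-separated .ann lines and contiguous spans, and it keeps brat's labels (e.g. `Organization`) instead of mapping them to short ones like `ORG`.

```python
import re

def parse_ann_entities(ann_text):
    """Collect (label, start, end) from the text-bound 'T' lines of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip events, relations, attributes, notes, normalizations
        fields = line.split("\t")
        if len(fields) < 2 or ";" in fields[1]:
            continue  # assumption: discontinuous spans are ignored in this sketch
        label, start, end = fields[1].split()
        entities.append((label, int(start), int(end)))
    return entities

def to_iob(text, entities):
    """Tokenize text while keeping character offsets, then assign IOB tags."""
    rows = []
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for label, start, end in entities:
            # A token belongs to an entity if it lies fully inside the span.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((match.group(), tag))
    return rows

if __name__ == "__main__":
    # Made-up miniature example, not the tutorial file itself.
    text = "Citibank moved $100 million to banks in Switzerland."
    ann = (
        "T1\tOrganization 0 8\tCitibank\n"
        "T2\tMoney 15 27\t$100 million\n"
        "T3\tGPE 40 51\tSwitzerland\n"
    )
    for token, tag in to_iob(text, parse_ann_entities(ann)):
        print(token, tag)
```

If this is on the right track, the POS column could presumably be added by passing the token strings to `nltk.pos_tag`, since tagging a list of tokens does not alter the offsets recorded for them. An NLTK tokenizer that offers `span_tokenize` might replace the regex.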
I believe converting to IOB format is needed for the NER task. If I have misunderstood something, please give me some advice. Thank you in advance.