Identify sentences in standoff format


Shikhar Misra

May 21, 2020, 8:46:09 PM
to brat-users
Hi,

I was wondering if it is possible to identify, from the standoff format, which labels belong to a single sentence. Can the sentences be separated so that the labels for each one are retrieved separately?

Thanks,
Shikhar.

Goran Topic

May 22, 2020, 7:24:46 AM
to brat-...@googlegroups.com
Directly in brat, no. However, it is possible to hack it to get the necessary information and do it yourself.

That said, I am not sure how to answer that question, because there are many corner cases; they may not be of consequence to you, because of your specific data and/or experiment setting, but I can't make a general decision that I would always be happy with. Specifically, what of spans that annotate across a sentence border, and of relationships or events that exist between spans that are in different sentences?

The code below is a bare-bones demo that you can extend to your needs; it prints span IDs sorted into sentences by their start offset (regardless of where they end).

brat_dir = '/PATH/TO/BRAT'  # change this

import os
import sys
from collections import OrderedDict

document_path = sys.argv[1]  # without extension!

# make brat's server modules importable
sys.path.append(brat_dir)
sys.path.append(os.path.join(brat_dir, 'server', 'src'))

from document import _document_json_dict

json_dict = _document_json_dict(document_path)
sentence_offsets = json_dict['sentence_offsets']
spans = json_dict['triggers'] + json_dict['entities']

# one (initially empty) list of span IDs per sentence, in document order
sentence_spans = OrderedDict((sentence_offset, [])
                             for sentence_offset in sentence_offsets)

for span_id, label, span_offsets in spans:
    for sentence_offset, span_list in sentence_spans.items():
        # assign the span to the sentence that contains its start offset
        if sentence_offset[0] <= span_offsets[0][0] < sentence_offset[1]:
            span_list.append(span_id)
            break  # a span starts in exactly one sentence

print(list(sentence_spans.values()))
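
The corner case mentioned earlier, spans that cross a sentence border, can be spotted from the same kind of data. What follows is only a minimal sketch, assuming sentence offsets are (start, end) pairs and each span is (id, label, [(start, end), ...]) as in the demo; the helper name `crossing_spans` and the sample data are made up for illustration and are not part of brat.

```python
def crossing_spans(sentence_offsets, spans):
    """Return IDs of spans that start in one sentence but end beyond it."""
    crossing = []
    for span_id, _label, span_offsets in spans:
        start = span_offsets[0][0]   # start of the first fragment
        end = span_offsets[-1][1]    # end of the last fragment
        for sent_start, sent_end in sentence_offsets:
            if sent_start <= start < sent_end:
                if end > sent_end:
                    crossing.append(span_id)
                break  # the start offset lies in exactly one sentence
    return crossing


# Hypothetical data: two sentences; T2 runs across the border between them.
sentences = [(0, 10), (11, 25)]
spans = [
    ('T1', 'Person', [(0, 4)]),
    ('T2', 'Event', [(6, 15)]),
    ('T3', 'Place', [(12, 20)]),
]
print(crossing_spans(sentences, spans))
```

What to do with such spans (drop them, duplicate them, or merge the sentences) depends on the experiment, so the sketch only reports them.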



Hope that helps.

Goran


--

---
You received this message because you are subscribed to the Google Groups "brat-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to brat-users+...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/brat-users/80846f70-1996-47b7-b59e-254b58f04f29%40googlegroups.com.

Shikhar Misra

May 22, 2020, 3:13:18 PM
to brat-users
Thanks for the quick reply. That was really helpful. The above code works. Is there also a way to detect the sentences together with the tag IDs when I have the text data and the corresponding tags in two columns of a Pandas dataframe? If not, I will try to work with the above solution. Thanks for your help.

Shah Kash

May 22, 2020, 3:19:00 PM
to brat-...@googlegroups.com
Hi, how can I use brat for the Arabic language?


Goran Topic

May 23, 2020, 8:04:26 PM
to brat-...@googlegroups.com
Dear Shikhar Misra,

Can you please make an MCVE? I execute code better than I read text :)
I.e. send a sample `txt` and `ann` file, the input Pandas frame, and the desired output, and I might be able to puzzle something out.

Goran




Goran Topic

May 23, 2020, 8:04:33 PM
to brat-...@googlegroups.com
Dear Shah Kash,

You need to use a recent version (cloned from GitHub, not downloaded as a zip file) and set the text direction in `visual.conf` as described in this comment:

Goran

Shikhar Misra

May 25, 2020, 12:01:11 AM
to brat-users
Hi Goran,

I have attached an image of the pandas dataframe, with just one column holding the paragraph and its corresponding predicted labels, which were obtained in the brat format from a trained model.

demo1.PNG


I have uploaded the sample .txt and .ann files. One alternative is to split the paragraph into sentences and get the annotation predictions sentence by sentence, but that takes a lot of time for a large dataset. I was wondering if I can directly separate demodf['labels'] into sentences using the information in demodf['paragraph']. The desired output could be a 2D list, with each inner list holding the annotations from one sentence, as shown in the screenshot below:

demo2.PNG



Thank you.
temp_00002.ann
temp_00002.txt

Goran Topic

May 25, 2020, 1:06:55 AM
to brat-...@googlegroups.com
Yeah, that looks like it should be pretty straightforward, assuming you have nothing but entity annotations.
However, I can't execute images :P and that is a bit too much to retype.
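
For the entity-only case, a minimal sketch of the grouping could look like the following. It is not tested against the attached files and rests on loud assumptions: sentences are taken to be newline-delimited (brat's own splitter may differ), the labels are a single string of standoff lines, and only text-bound `T` annotations are handled. The function names and the sample row are hypothetical.

```python
def sentence_offsets(text):
    """(start, end) offsets of each newline-delimited sentence (an assumption)."""
    offsets, start = [], 0
    for line in text.split('\n'):
        end = start + len(line)
        offsets.append((start, end))
        start = end + 1  # skip the newline itself
    return offsets


def labels_by_sentence(text, ann):
    """Group text-bound annotation lines (T...) by the sentence they start in."""
    sentences = sentence_offsets(text)
    grouped = [[] for _ in sentences]
    for line in ann.splitlines():
        if not line.startswith('T'):
            continue  # relations, events, attributes, notes are ignored
        fields = line.split('\t')
        # fields[1] is "Type 0 4" or, for discontinuous spans, "Type 0 4;6 10"
        span_start = int(fields[1].split(' ')[1])
        for index, (sent_start, sent_end) in enumerate(sentences):
            if sent_start <= span_start < sent_end:
                grouped[index].append(line)
                break
    return grouped


# Hypothetical row: one paragraph and its predicted labels.
paragraph = "John lives in Tokyo.\nMary visited Paris."
labels = "T1\tPerson 0 4\tJohn\nT2\tCity 14 19\tTokyo\nT3\tPerson 21 25\tMary"
print(labels_by_sentence(paragraph, labels))
```

With a dataframe, something like `demodf.apply(lambda row: labels_by_sentence(row['paragraph'], row['labels']), axis=1)` should give one such 2D list per row, assuming both columns hold plain strings.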

Goran


Shah Kash

May 25, 2020, 3:14:54 AM
to brat-...@googlegroups.com