Identify sentences in standoff format

Shikhar Misra

May 21, 2020, 8:46:09 PM

to brat-users

Hi,

I was wondering if it is possible to identify the labels that belong to a single sentence from the standoff format. Is it possible to separate the sentences and get the labels for each of them separately?

Thanks,
Shikhar.

Goran Topic

May 22, 2020, 7:24:46 AM

to brat-...@googlegroups.com

Directly in brat, no. However, it is possible to hack it to get the necessary information and do it yourself.

That said, I am not sure how to answer that question, because there are many corner cases; they may not be of consequence to you, because of your specific data and/or experiment setting, but I can't make a general decision that I would always be happy with. Specifically, what of spans that annotate across a sentence border, and of relationships or events that exist between spans that are in different sentences?

The code below is a bare-bones demo that you can extend to your needs; it will print span IDs sorted into sentences by their start offset (regardless of where they end).

brat_dir = '/PATH/TO/BRAT'  # change this

import os
import sys
from collections import OrderedDict

document_path = sys.argv[1]  # without extension!

# Make brat's server modules importable
sys.path.append(brat_dir)
sys.path.append(os.path.join(brat_dir, 'server', 'src'))
from document import _document_json_dict

json_dict = _document_json_dict(document_path)
sentence_offsets = json_dict['sentence_offsets']
spans = json_dict['triggers'] + json_dict['entities']

# Map each sentence's offsets to the IDs of the spans that start in it
sentence_spans = OrderedDict((sentence_offset, [])
                             for sentence_offset in sentence_offsets)

for span_id, label, span_offsets in spans:
    for sentence_offset, span_list in sentence_spans.items():
        if sentence_offset[0] <= span_offsets[0][0] < sentence_offset[1]:
            span_list.append(span_id)
            break  # a span starts in exactly one sentence

print(list(sentence_spans.values()))
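
One of the corner cases mentioned above, a span that runs across a sentence border, could be flagged rather than silently assigned to the sentence it starts in. The following is only a sketch with made-up offsets and labels; `assign_spans` is a hypothetical helper, not part of brat, and it expects the same `(id, label, offsets)` triples as the demo.

```python
def assign_spans(sentence_offsets, spans):
    """Group span IDs by sentence and collect spans that cross a border.

    sentence_offsets: list of (start, end) character offsets, one per sentence.
    spans: list of (span_id, label, [(start, end), ...]) triples.
    """
    per_sentence = [[] for _ in sentence_offsets]
    crossing = []
    for span_id, label, offsets in spans:
        start = offsets[0][0]
        end = max(frag_end for _, frag_end in offsets)  # handles discontinuous spans
        for i, (s_start, s_end) in enumerate(sentence_offsets):
            if s_start <= start < s_end:
                per_sentence[i].append(span_id)
                if end > s_end:
                    crossing.append(span_id)  # ends beyond its start sentence
                break
    return per_sentence, crossing


# Hypothetical data: two sentences; T3 runs across the border between them.
sentences = [(0, 20), (21, 40)]
spans = [
    ('T1', 'Person', [(0, 5)]),
    ('T2', 'Place', [(25, 30)]),
    ('T3', 'Event', [(15, 28)]),
]
per_sentence, crossing = assign_spans(sentences, spans)
print(per_sentence)  # [['T1', 'T3'], ['T2']]
print(crossing)      # ['T3']
```

What to do with the flagged spans (drop them, duplicate them, merge the sentences) depends on the experiment.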

Hope that helps.

Goran

--

---
You received this message because you are subscribed to the Google Groups "brat-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to brat-users+...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/brat-users/80846f70-1996-47b7-b59e-254b58f04f29%40googlegroups.com.

Shikhar Misra

May 22, 2020, 3:13:18 PM

to brat-users

Thanks for the quick reply. That was really helpful. The above code works. Is there also a way to detect the sentences with the tag IDs when I have the text data and the corresponding tags in two columns of a pandas dataframe? If not, I will try to work with the above solution. Thanks for your help.

Shah Kash

May 22, 2020, 3:19:00 PM

to brat-...@googlegroups.com

Hi, how can I use brat for the Arabic language?

Goran Topic

May 23, 2020, 8:04:26 PM

to brat-...@googlegroups.com

Dear Shikhar Misra,

Can you please make an MCVE? I execute code better than I read text :)

https://stackoverflow.com/help/minimal-reproducible-example

I.e. send a sample `txt` and `ann` file, the input Pandas frame, and the desired output, and I might be able to puzzle something out.

Goran

Goran Topic

May 23, 2020, 8:04:33 PM

to brat-...@googlegroups.com

Dear Shah Kash,

You need to use a recent version (cloned from GitHub, not downloaded as a zip file) and set the text direction in `visual.conf` as described in this comment:

https://github.com/nlplab/brat/issues/774#issuecomment-195273592

Goran

Shikhar Misra

May 25, 2020, 12:01:11 AM

to brat-users

Hi Goran,

I have attached an image of the pandas dataframe, which has just one column holding the paragraph along with its predicted labels; the labels come from a trained model and are in the brat format.

I have uploaded the sample .txt and .ann files. One alternative is to split the paragraph into sentences and get the annotation predictions sentence by sentence, but that takes a lot of time for a large dataset. I was wondering if I can directly split demodf['labels'] into sentences using information from demodf['paragraph']. The desired output can be a 2D list, with each inner list holding the annotations from one sentence, as shown in the screenshot below:

Thank you.

temp_00002.ann

temp_00002.txt

Goran Topic

May 25, 2020, 1:06:55 AM

to brat-...@googlegroups.com

Yeah, that looks like it should be pretty straightforward, assuming you have nothing but entity annotations.

However, I can't execute images :P and that is a bit too much to retype.

https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question
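
For the entity-only case, a rough sketch that works on plain strings without brat's server code might look like the following. Everything here is an assumption for illustration: the sample text and annotations are invented, the function names are hypothetical, and the regex sentence splitter is a naive stand-in that will not match brat's own sentence splitting in every case.

```python
import re


def sentence_offsets(text):
    """Return (start, end) character offsets of sentences.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace,
    or at a newline. This is a naive stand-in for a real splitter.
    """
    offsets = []
    start = 0
    for match in re.finditer(r'(?<=[.!?])\s+|\n+', text):
        if match.start() > start:
            offsets.append((start, match.start()))
        start = match.end()
    if start < len(text):
        offsets.append((start, len(text)))
    return offsets


def parse_entities(ann_text):
    """Parse text-bound ('T') standoff lines into (id, label, start, end)."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith('T'):
            continue  # skip relations, events, attributes, notes
        fields = line.split('\t')
        label, offsets = fields[1].split(' ', 1)
        # Discontinuous spans look like "0 5;16 23"
        fragments = [tuple(map(int, frag.split())) for frag in offsets.split(';')]
        entities.append((fields[0], label, fragments[0][0], fragments[-1][1]))
    return entities


def labels_by_sentence(text, ann_text):
    """Return one list of (id, label) pairs per sentence, by span start offset."""
    offsets = sentence_offsets(text)
    grouped = [[] for _ in offsets]
    for ann_id, label, start, _end in parse_entities(ann_text):
        for i, (s_start, s_end) in enumerate(offsets):
            if s_start <= start < s_end:
                grouped[i].append((ann_id, label))
                break
    return grouped


# Invented sample data in standoff format
text = "Sony is in Tokyo. It was founded by Ibuka.\nHe was an engineer."
ann = ("T1\tOrganization 0 4\tSony\n"
       "T2\tLocation 11 16\tTokyo\n"
       "T3\tPerson 36 41\tIbuka\n")
print(labels_by_sentence(text, ann))
```

If the dataframe columns hold the paragraph text and the standoff annotations as strings, something like `demodf.apply(lambda row: labels_by_sentence(row['paragraph'], row['labels']), axis=1)` could apply this row by row.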

Goran

Shah Kash

May 25, 2020, 3:14:54 AM

to brat-...@googlegroups.com

Thanks
