Pattern or setup for sentence breaking


Berg Oliveira

Aug 19, 2025, 8:33:19 AM
to incepti...@googlegroups.com
Good morning to the Community.
Once again, I'd like to begin by thanking Richard and the Inception community for the excellent tool and continued support.

This time, I have a question about importing files into INCEpTION.
I work with legal files in Portuguese and import them as plain text (*.txt).
INCEpTION often breaks sentences even though they're on the same line in the original file.

For example, in Portuguese:
- "os ADVG(O)s. das vítimas..." (the victims' lawyers)
- "... Enquanto os EMBDO.(A/S) se manter ilesos" (while the defendants remained unharmed)
  ** "ADVG(O)s." and "EMBDO.(A/S)" are abbreviations for lawyers and appellant, respectively.

In both cases, depending on the file, the text sometimes breaks and sometimes does not. For example:
"ADVG
(O)
s."

I couldn't find a pattern in how the sentences are broken, nor any INCEpTION setting that controls this.
This break sometimes makes it difficult to annotate tokens (in my case, NER), requiring manual work.

Could you give me some guidance?
Thank you again for your patience and support, and I'm available for any help needed.

Best regards

--
---
Berg

Richard Eckart de Castilho

Aug 19, 2025, 11:30:24 AM
to inception-users
Hi Berg,

> On 19. Aug 2025, at 14:33, Berg Oliveira <file...@gmail.com> wrote:
>
> I couldn't find a pattern way to break sentences, nor any Inception settings to determine this.
> This break sometimes makes it difficult to annotate tokens (in my case, NER), requiring manual work.

If you provide your input files in such a way that each sentence is on its own line (and does not span multiple lines)
then you can import the texts in the format "Plain text (one sentence per line)".
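
As a rough illustration of preparing such a file, here is a minimal Python sketch. It assumes the source text already has one sentence per line and only needs its line endings normalized and blank lines removed. The function names and file paths are hypothetical and are not part of INCEpTION.

```python
from pathlib import Path


def to_one_sentence_per_line(text: str) -> str:
    """Trim every line and drop empty ones, so each remaining line is one sentence."""
    # splitlines() handles "\n", "\r\n" and "\r" line endings (among others).
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line) + "\n"


def convert_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Read src, normalize it, and write the result to dst (both paths are placeholders)."""
    raw = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(to_one_sentence_per_line(raw), encoding=encoding)
```

The resulting file would then be imported with the "Plain text (one sentence per line)" format rather than with "Plain text".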

Otherwise, if you have programming skills, you could prepare your text in XMI format using the dkpro-cassis
python library and use a sentence/token splitter of your choice.
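
A minimal sketch of that route, under stated assumptions: each non-blank line is treated as one sentence, and tokens are split on whitespace only, so abbreviations such as "ADVG(O)s." stay in one piece. The helper names are hypothetical, and the whitespace splitter is just a stand-in for whatever sentence/token splitter you prefer. The cassis calls assume a recent dkpro-cassis release (`pip install dkpro-cassis`).

```python
import re
from typing import List, Tuple

# DKPro Core segmentation type names.
SENTENCE_TYPE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
TOKEN_TYPE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


def line_spans(text: str) -> List[Tuple[int, int]]:
    """Return (begin, end) offsets of every non-blank line, trimmed of outer whitespace."""
    spans = []
    for match in re.finditer(r"[^\n]+", text):
        line = match.group()
        stripped = line.strip()
        if stripped:
            begin = match.start() + (len(line) - len(line.lstrip()))
            spans.append((begin, begin + len(stripped)))
    return spans


def whitespace_token_spans(text: str, begin: int, end: int) -> List[Tuple[int, int]]:
    """Return (begin, end) offsets of whitespace-separated tokens inside text[begin:end]."""
    return [(begin + m.start(), begin + m.end())
            for m in re.finditer(r"\S+", text[begin:end])]


def build_cas(text: str):
    """Build a CAS with one Sentence per line and whitespace-delimited Tokens."""
    # Imported here so the offset helpers above work without cassis installed.
    from cassis import Cas, load_dkpro_core_typesystem

    typesystem = load_dkpro_core_typesystem()
    Sentence = typesystem.get_type(SENTENCE_TYPE)
    Token = typesystem.get_type(TOKEN_TYPE)

    cas = Cas(typesystem=typesystem)
    cas.sofa_string = text
    for s_begin, s_end in line_spans(text):
        cas.add(Sentence(begin=s_begin, end=s_end))
        for t_begin, t_end in whitespace_token_spans(text, s_begin, s_end):
            cas.add(Token(begin=t_begin, end=t_end))
    return cas
```

Usage would look like `build_cas(open("input.txt", encoding="utf-8").read()).to_xmi("input.xmi")`, where both file names are placeholders. The XMI file can then be imported using one of the UIMA CAS XMI formats instead of plain text.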

Finally, there is very experimental functionality in INCEpTION to adjust sentence and token boundaries.
If you want to help testing this, let me know and I can tell you how to activate it. However, please
note that it is really not well tested and changing sentence/token boundaries may have unexpected and
so far unknown side-effects. So best only test this and do not use it for serious work.

Cheers,

-- Richard

Berg Oliveira

Aug 19, 2025, 1:37:14 PM
to incepti...@googlegroups.com
Hi Richard...

On Tue, Aug 19, 2025 at 12:30, Richard Eckart de Castilho <richard...@gmail.com> wrote:
> If you provide your input files in such a way that each sentence is on its own line (and does not span multiple lines)
> then you can import the texts in the format "Plain text (one sentence per line)".

I'm attaching two screenshots. The first shows the txt file used for the import, with one sentence per line. The second shows how it was imported, with my NER annotations.

> Otherwise, if you have programming skills, you could prepare your text in XMI format using the dkpro-cassis
> python library and use a sentence/token splitter of your choice.
>
> Finally, there is very experimental functionality in INCEpTION to adjust sentence and token boundaries.
> If you want to help testing this, let me know and I can tell you how to activate it. However, please
> note that it is really not well tested and changing sentence/token boundaries may have unexpected and
> so far unknown side-effects. So best only test this and do not use it for serious work.

Yes, I have some coding skills. If you can send them privately, I can test them.
It's always good to have more than one option, I believe.

--
You received this message because you are subscribed to the Google Groups "inception-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to inception-use...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/inception-users/D57809C5-CA96-4D64-8FBA-2A1CC7F57A7A%40gmail.com.

Thx so much!
Berg
--
---
Berg
Captura de tela de 2025-08-19 14-35-32.png
Captura de tela de 2025-08-19 14-30-23.png

Richard Eckart de Castilho

Aug 19, 2025, 3:08:49 PM
to inception-users
Hi Berg,

are you sure you imported with "Plain text (one sentence per line)"?

I have extracted the text from your screenshot and imported it with one-sentence-per-line to INCEpTION.
The attached screenshot shows it renders properly as one-sentence-per-line.
I have also attached the text file for you to reproduce.

-- Richard



berg.txt
Screenshot 2025-08-19 at 21.04.41.png

Richard Eckart de Castilho

Aug 19, 2025, 3:19:22 PM
to incepti...@googlegroups.com
Hi,

> On 19. Aug 2025, at 19:36, Berg Oliveira <file...@gmail.com> wrote:
>
> If you can send them privately, I can test them.
> It's always good to have more than one option, I believe.

Oh, it's not secret ;) Just highly experimental.

https://inception-project.github.io/releases/37.4/docs/admin-guide.html#sect_settings_segmentation

When this setting is activated, you can "add" (not create) the "Sentence" layer to your project.
I believe it is added as a disabled layer, so you also have to enable it.
Then you can open your document and switch to a presentation mode that is not based on sentences,
e.g. "brat (wrap at 120)".
Now you should see sentence annotations.
You should be able to resize sentences using the orange handles left/right of an annotation label
as usual.
In order to split a sentence, you would shift-click at the split position.
When you delete a sentence, the prior or following sentence will be extended to cover the part of
the text that belonged to the deleted sentence.

Have fun :)

-- Richard

Berg Oliveira

Aug 19, 2025, 3:28:47 PM
to incepti...@googlegroups.com
Thanks for the answers.
I will verify which import option was used, and test the new format soon.
Cheers

---
Berg


Richard Eckart de Castilho

Aug 19, 2025, 3:38:55 PM
to incepti...@googlegroups.com


> On 19. Aug 2025, at 21:28, Berg Oliveira <file...@gmail.com> wrote:
>
> I will verify the option import used, and test the new format soon.

In the documents panel of the project settings, you should be able to see which format was used to import your documents.

-- Richard

Screenshot 2025-08-19 at 21.35.14.png

Berg Oliveira

Aug 19, 2025, 3:43:19 PM
to incepti...@googlegroups.com
Dear Richard and all members.
Thanks for the support again.
I made a mistake when importing the files, using only the "Plain text" option.
I tested now, and it works.
Now, I will think (a lot) about how I'll handle the other files that I have already finished annotating ;-)
Thanks a lot again.
Berg



--
---
Berg