Sentence segmentation failed

alek...@gmail.com

Sep 19, 2022, 2:05:13 PM
to incepti...@googlegroups.com
Hi all,

INCEpTION has a problem with sentence segmentation when the sentences in a document start with **: it does not split the text into sentences at all.

I have a project with news documents in which sentences are listed using **. Each sentence is shown on one line when using the line-oriented view. Unfortunately, it was discovered too late (after the annotation) that the sentence segmentation does not work at all. I use the CAS XMI format, and no sentence is annotated as such. Probably there is no way to fix it now, and I must segment the documents in post-processing?

Best regards, Aleksandra

alek...@gmail.com

Sep 19, 2022, 2:14:27 PM
to incepti...@googlegroups.com
My description was not precise enough: the whole text is merged into one sentence even though each sentence ends with a dot.

Imran Hassan

Sep 20, 2022, 12:28:51 AM
to incepti...@googlegroups.com
There is a warning in INCEpTION saying that it is not meant for production use, only for testing. We have 500,000 tokens to tag. Should we use it or not? Can anybody elaborate on that?

Best regards,
Imran Hassan

--
You received this message because you are subscribed to the Google Groups "inception-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to inception-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/inception-users/CANnApAHLqaGuN9-hX4ukhNYnHLMO05j1PtvesWjfrGNS__X7qA%40mail.gmail.com.

Richard Eckart de Castilho

Sep 20, 2022, 12:49:29 AM
to inception-users
Hi,

> On 20. Sep 2022, at 06:28, Imran Hassan <110...@stud.uot.edu.pk> wrote:
>
> There is a warning in INCEpTION saying that it is not meant for production use, only for testing. We have 500,000 tokens to tag. Should we use it or not? Can anybody elaborate on that?

That warning probably says that you are currently using the embedded database. For production use, a MariaDB/MySQL database is recommended instead. There have been reports in the past of people losing data with the embedded database, e.g. after a power failure of the system.

That said: in any case, always make sure to keep full backups, e.g. by regularly exporting your projects and by regularly backing up the entire INCEpTION installation, including the full INCEpTION application folder and the entire database. Make sure that the backups can actually be restored as well: an incomplete backup that cannot be restored does not help.

Cheers,

-- Richard

Richard Eckart de Castilho

Sep 20, 2022, 12:53:09 AM
to inception-users
On 19. Sep 2022, at 20:14, alek...@gmail.com wrote:
>
> My description was not precise enough: the whole text is merged into one sentence even though each sentence ends with a dot.

Do you remember which format you chose when importing the text?

-- Richard

Imran Hassan

Sep 20, 2022, 3:00:35 AM
to incepti...@googlegroups.com
I have exported my data set in CoNLL-U and the other CoNLL formats. I also tried the CoreNLP CoNLL format, but none of them work.

Best regards,
Imran Hassan



alek...@gmail.com

Sep 20, 2022, 3:13:10 AM
to incepti...@googlegroups.com
Hi Richard,

I am importing the documents as plain text. 

Aleks

Richard Eckart de Castilho

Sep 20, 2022, 4:44:00 AM
to incepti...@googlegroups.com
On 20. Sep 2022, at 08:36, Imran Hassan <110...@stud.uot.edu.pk> wrote:
>
> I have exported my data set in CoNLL-U and the other CoNLL formats. I also tried the CoreNLP CoNLL format, but none of them work.

Here is an overview of the supported formats, which built-in layers/features they support, and whether they support custom layers:

https://inception-project.github.io/releases/24.3/docs/user-guide.html#sect_formats

For example, if you use the built-in "Named Entity" and "Part of Speech" layers, then an export in the CoreNLP CoNLL-like format should include your annotations.

If you have defined custom layers for annotation, then they are only included in generic formats like UIMA CAS XMI.

-- Richard

Richard Eckart de Castilho

Sep 20, 2022, 4:49:06 AM
to incepti...@googlegroups.com
Hi Aleks

> On 20. Sep 2022, at 09:12, alek...@gmail.com wrote:
>
> I am importing the documents as plain text.

There are at least three "plain text" formats:

1) Plain Text
2) Plain Text (one sentence per line)
3) Plain Text (pretokenized)

With 1), sentences and tokens should be auto-detected.
With 2), only token boundaries are auto-detected, while line breaks are treated as sentence boundaries.
With 3), token boundaries are determined simply by whitespace, and line breaks are treated as sentence boundaries.
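
To illustrate the difference between modes 2) and 3), here is a rough Python sketch. It is not INCEpTION's actual importer: the function names are made up for illustration, and skipping blank lines is an assumption. Mode 1) is left out because it depends on the built-in sentence detector.

```python
def split_one_sentence_per_line(text):
    """Mode 2 (sketch): every non-blank line is treated as one sentence.

    Tokenization inside each sentence would still be auto-detected,
    which is not modelled here.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_pretokenized(text):
    """Mode 3 (sketch): lines are sentences, whitespace separates tokens."""
    return [line.split() for line in text.splitlines() if line.strip()]


# Hypothetical sample in the style described in this thread.
sample = "** First sentence .\n** Second sentence ."

sentences = split_one_sentence_per_line(sample)
tokens = split_pretokenized(sample)

print(sentences)
print(tokens)
```

In both of these modes the "**" prefix never reaches a sentence detector, because the line break alone decides where a sentence ends.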

If you used 1) and no sentence breaks were detected, the dots are possibly not plain ASCII dots but special Unicode characters that the sentence boundary detector could not deal with.
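
One quick way to check for such characters is to list every non-ASCII character in the text together with its Unicode name. This is a minimal sketch using only the Python standard library; the function name and the sample string are made up for illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Return (offset, character, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Hypothetical sample: the first "dot" is U+3002, not an ASCII full stop.
sample = "First sentence\u3002 Second sentence."

hits = find_non_ascii(sample)
for offset, ch, name in hits:
    print(offset, repr(ch), name)
```

An empty result means the text is plain ASCII, so unusual punctuation can be ruled out as the cause.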

Can you send me an example file exported as UIMA CAS XMI (1.0) directly to my mail (off-list)?

Cheers,

-- Richard

alek...@gmail.com

Sep 20, 2022, 8:52:24 AM
to incepti...@googlegroups.com
Hi Richard,

I will send you the files in a minute.  

The text was unfortunately imported as plain text and not as one sentence per line. When I remove ** and upload the text one more time as simple plain text, then the segmentation works.

Best Aleksandra


Richard Eckart de Castilho

Sep 20, 2022, 9:45:11 AM
to incepti...@googlegroups.com
Hi,

> On 20. Sep 2022, at 14:52, alek...@gmail.com wrote:
>
> The text was unfortunately imported as plain text and not as one sentence per line. When I remove ** and upload the text one more time as simple plain text, then the segmentation works.

INCEpTION uses the default tokenizer/sentence detector that the Java language provides. It looks like that one doesn't like the "**".

How many annotated documents with the problematic sentence boundaries do you have?

-- Richard

alek...@gmail.com

Sep 20, 2022, 11:19:38 AM
to incepti...@googlegroups.com
Hi Richard,

I have 64 documents containing about 25 sentences on average. I can fix the problem in the post-processing, but if there is any other cleaner way, please let me know.
In the future I will pay more attention to the potential segmentation problems and the import modes.

Aleks


Richard Eckart de Castilho

Sep 20, 2022, 11:25:08 AM
to incepti...@googlegroups.com
Hi,

> On 20. Sep 2022, at 17:19, alek...@gmail.com wrote:
>
> I have 64 documents containing about 25 sentences on average. I can fix the problem in the post-processing, but if there is any other cleaner way, please let me know.
> In the future I will pay more attention to the potential segmentation problems and the import modes.

At the moment, the import modes are the only option, unless you run your own full pre-processing, generate XMI files, and import those into INCEpTION.
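
For the pre- or post-processing route, the core step is to turn line breaks into sentence offsets. Below is a minimal Python sketch, assuming each sentence sits on its own line as described in this thread. The function name is made up, and writing the resulting spans into the XMI as Sentence annotations is left to whatever CAS library you use (dkpro-cassis would be one candidate for Python).

```python
def line_sentence_spans(text):
    """Return (begin, end) character offsets, one pair per non-blank line.

    Leading and trailing whitespace is excluded from each span.
    Offsets are Python string indices; tools that count UTF-16 code
    units may need adjusted offsets for characters outside the BMP.
    """
    spans = []
    offset = 0
    for line in text.splitlines(keepends=True):
        content = line.rstrip("\r\n")
        stripped = content.strip()
        if stripped:
            begin = offset + content.index(stripped)
            spans.append((begin, begin + len(stripped)))
        offset += len(line)
    return spans


# Hypothetical document text in the style described in this thread.
text = "** First sentence.\n** Second sentence.\n"

spans = line_sentence_spans(text)
for begin, end in spans:
    print(begin, end, repr(text[begin:end]))
```

Since the spans index into the unchanged document text, existing annotations keep their offsets and only the sentence layer has to be added.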

Some experimental work has gone into making sentences (and tokens) editable annotations, but this is still in early stages. There are bugs, but there are also conceptual questions like what it means if different annotators start having different sentence boundaries or token counts.

The question of being able to configure alternative sentence/token splitters also comes up occasionally - but has not been looked into yet.

-- Richard

alek...@gmail.com

Sep 20, 2022, 11:33:11 AM
to incepti...@googlegroups.com
ok, understood. Thanks a lot for your help!
