Issue with Word Selection for Uzbek Language in Inception Annotation Tool

Sanatbek Matlatipov

Jun 12, 2024, 3:27:51 PM

to inception-users

Hello Inception Support Group,

I am currently using the Inception annotation tool to annotate text in the Uzbek language. However, I have encountered an issue with the tool's tokenization process. Specifically, when a word contains the character "ʻ" (as in "Oʻ" or "Gʻ"), the tool does not select the entire word, but only up to the character "ʻ".

This problem significantly affects the accuracy and efficiency of my annotation work. I have tried to modify the token boundaries by adding this character to the constraints, but I am unsure how to properly implement this change.

It is important to note that in the Uzbek Latin alphabet, these two-character letters (such as "Oʻ" and "Gʻ") are considered single letters. There is no separate Unicode character for the combined forms, even though they are officially recognized letters of the Uzbek language. Therefore, accurate tokenization should treat each combination as a single unit.

To help illustrate the issue, I have attached a screenshot showing how the tool currently selects only part of the word up to the "ʻ" character.

Could you please provide guidance on how to adjust the tokenizer settings to ensure that words containing the character "ʻ" are correctly recognized and selected as whole words? If there are specific steps or configurations I need to follow, I would greatly appreciate detailed instructions.

Thank you very much for your assistance. I look forward to your response.

Best regards,

issue-word-selection.jpg

Richard Eckart de Castilho

Jun 12, 2024, 3:37:30 PM

to incepti...@googlegroups.com

Hi,

> On 12. Jun 2024, at 08:47, Sanatbek Matlatipov <sanatbek....@gmail.com> wrote:
>
> I am currently using the Inception annotation tool to annotate text in the Uzbek language. However, I have encountered an issue with the tool's tokenization process. Specifically, when a word contains the character "ʻ" (as in "Oʻ" or "Gʻ"), the tool does not select the entire word, but only up to the character "ʻ".
> This problem significantly affects the accuracy and efficiency of my annotation work. I have tried to modify the token boundaries by adding this character to the constraints, but I am unsure how to properly implement this change.

The segmentation process in INCEpTION is currently not configurable.

Can you go to the "Layers" pane in the project settings, select your layer, and switch the granularity to "Characters"?

That should allow you to place annotations anywhere. You might also want to enable the "Allow crossing sentence boundaries" option if you also find that sentence boundaries are not well detected.

Cheers,

-- Richard

Sanatbek Matlatipov

Jun 13, 2024, 5:58:56 AM

to incepti...@googlegroups.com

Hello Richard,

Thank you very much for the reply.

Could you please help me understand why the granularity setting is disabled, as shown in the screenshot? Apologies for the simple question :), but I could not find the reason in the relevant part of the documentation.

Cheers,

Sanatbek

disabled-hover.jpg

Richard Eckart de Castilho

Jun 13, 2024, 3:00:02 PM

to incepti...@googlegroups.com

Hi Sanatbek,

> On 13. Jun 2024, at 11:58, Sanatbek Matlatipov <sanatbek....@gmail.com> wrote:
>
> Could you please help me understand why the granularity setting is disabled, as shown in the screenshot? Apologies for the simple question :), but I could not find the reason in the relevant part of the documentation.

Hm, I understand your problem.

You want to work with CoNLL-U data. The CoNLL-U support in INCEpTION requires that you use the built-in layers such as Lemma, Part-of-Speech, Dependency, etc. because the format only knows these layers.

However, these layers also impose certain restrictions. For example, the built-in Lemma, Part-of-Speech and Morphological features layers are bound to token boundaries, and CoNLL-U is mainly a token-oriented format. So, unfortunately, you cannot change the granularity of these layers.

The best thing would be - as you noted in your initial mail - to adjust the token boundaries.
Within INCEpTION, you can neither modify the tokenization procedure nor modify the tokens
themselves (yet). So there would only be the option of making adjustments outside of INCEpTION.

There are several ways in which you could import pre-segmented data into INCEpTION.
The simplest may be to choose "Plain text (space-separated tokens, one sentence per line)" as the import format - provided that you can supply text files where all tokens are separated
by a space character and all tokens belonging to a sentence are on the same line.
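
As a rough illustration of preparing such a file outside of INCEpTION, here is a minimal Python sketch. It assumes the input is already split into sentences and that U+02BB (ʻ) and U+02BC (ʼ) are the only apostrophe-like characters that should stay inside a word; the names `tokenize` and `to_space_separated` are made up for this example and are not part of INCEpTION.

```python
import re

# Assumption: U+02BB (as in O'/G' letters) and U+02BC (tutuq belgisi)
# are the only apostrophe-like characters that belong inside a word.
WORD_INTERNAL = "\u02BB\u02BC"

# A token is either a run of word characters (including the marks above)
# or a single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(rf"[\w{WORD_INTERNAL}]+|[^\w\s]")


def tokenize(sentence: str) -> list[str]:
    """Split one sentence into tokens, keeping the marks above inside words."""
    return TOKEN_RE.findall(sentence)


def to_space_separated(sentences: list[str]) -> str:
    """Return one sentence per line with all tokens separated by one space."""
    return "\n".join(" ".join(tokenize(s)) for s in sentences)


if __name__ == "__main__":
    print(to_space_separated(["O\u02BBzbekiston go\u02BBzal.", "Salom, dunyo!"]))
```

Saving the result as a UTF-8 text file and importing it with the space-separated format should keep each word with "ʻ" as one token. If your texts use other apostrophe variants (for example U+2018 or the ASCII apostrophe), the character set above would need to be extended, with the caveat that quotation marks may then be absorbed into words.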

If you want to be able to *not* have spaces between all tokens, you might consider preparing your
texts as unannotated CoNLL-U files (i.e. only with text and whitespace information).
If you have no way of converting your texts to CoNLL-U outside of INCEpTION, you might
import the texts into INCEpTION, then export them again into CoNLL-U, then edit the CoNLL-U
files in a text editor to fix tokens and then import them back again into INCEpTION.
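
If you can run a script outside of INCEpTION, an unannotated CoNLL-U file could be produced along the following lines. This is only a sketch: it assumes pre-split sentences, treats U+02BB and U+02BC as word-internal characters, fills every annotation column with `_`, and records missing whitespace as `SpaceAfter=No` in the MISC column as the CoNLL-U format describes. The function names are hypothetical, and I have not checked this exact output against the importer.

```python
import re

# Assumption: U+02BB and U+02BC are treated as part of a word.
TOKEN_RE = re.compile(r"[\w\u02BB\u02BC]+|[^\w\s]")


def sentence_to_conllu(sentence: str) -> str:
    """Render one sentence as an unannotated CoNLL-U block."""
    lines = [f"# text = {sentence}"]
    for idx, match in enumerate(TOKEN_RE.finditer(sentence), start=1):
        end = match.end()
        # The token is "glued" if the next character exists and is not whitespace.
        glued = end < len(sentence) and not sentence[end].isspace()
        misc = "SpaceAfter=No" if glued else "_"
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(idx), match.group(), "_", "_", "_", "_", "_", "_", "_", misc]
        lines.append("\t".join(cols))
    # Each sentence block is terminated by a blank line.
    return "\n".join(lines) + "\n\n"


def to_conllu(sentences: list[str]) -> str:
    """Concatenate the CoNLL-U blocks of all sentences."""
    return "".join(sentence_to_conllu(s) for s in sentences)


if __name__ == "__main__":
    print(to_conllu(["Bu go\u02BBzal."]), end="")
```

In contrast to the space-separated format, the original spacing can be recovered from such a file, because punctuation attached to a word is marked rather than split off with an extra space.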

Does any of that sound viable to you?

-- Richard
