Hello Inception Support Group,
I am currently using the Inception annotation tool to annotate text in the Uzbek language. However, I have encountered an issue with the tool's tokenization process. Specifically, when a word contains the character "ʻ" (as in "Oʻ" or "Gʻ"), the tool does not select the entire word, but only up to the character "ʻ".
This problem significantly affects the accuracy and efficiency of my annotation work. I have tried to modify the token boundaries by adding this character to the constraints, but I am unsure how to properly implement this change.
Note that in the Uzbek Latin alphabet, the two-character combinations "Oʻ" and "Gʻ" each count as a single letter. Unicode has no precomposed character for them, even though they are officially recognized letters of the Uzbek alphabet. Accurate tokenization should therefore keep each combination together as a single unit inside the word.
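To make the expected behaviour concrete, here is a minimal Python sketch of the tokenization I have in mind. It is only an illustration, not INCEpTION's tokenizer or any of its settings. The names `APOSTROPHES`, `WORD`, `describe_marks`, and `tokenize` are hypothetical, and the list of apostrophe-like characters is an assumption, since text typed on different keyboards may contain U+02BB, U+02BC, U+2018, U+2019, or a plain ASCII apostrophe.

```python
import re
import unicodedata

# Assumption: apostrophe-like characters that may appear inside Uzbek words.
# U+02BB (MODIFIER LETTER TURNED COMMA) is the one quoted in this message.
APOSTROPHES = "\u02bb\u02bc\u2018\u2019'"

# A word starts with a letter and continues with letters or apostrophe-like marks.
# [^\W\d_] matches any Unicode letter (word character that is not a digit or "_").
WORD = re.compile(r"[^\W\d_](?:[^\W\d_]|[" + re.escape(APOSTROPHES) + r"])*")


def describe_marks(text):
    """List the code point and Unicode name of each apostrophe-like character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ch in APOSTROPHES
    ]


def tokenize(text):
    """Split text into words, keeping apostrophe-like marks inside the word."""
    return WORD.findall(text)


if __name__ == "__main__":
    sample = "O\u02bbzbekiston tog\u02bblari"
    print(describe_marks(sample))
    print(tokenize(sample))
```

Running `describe_marks` on a word copied from the project shows which code point the text really contains, which may matter because a quotation mark such as U+2018 is likely to be treated differently from the modifier letter U+02BB. One limitation of the sketch: a closing quotation mark directly after a word would be kept as part of that word.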
To help illustrate the issue, I have attached a screenshot showing how the tool currently selects only part of the word up to the "ʻ" character.
Could you please provide guidance on how to adjust the tokenizer settings to ensure that words containing the character "ʻ" are correctly recognized and selected as whole words? If there are specific steps or configurations I need to follow, I would greatly appreciate detailed instructions.
Thank you very much for your assistance. I look forward to your response.
Best regards,
Could you please help me understand why the granularity hover is disabled, as shown in the screenshot? Apologies for the simple question :), but I could not find the reason in the relevant part of the documentation.
Cheers,
Sanatbek