Tokenization issue in AntWordProfiler

Ngawang Trinley

Dec 25, 2022, 12:00:13 AM

to AntWordProfiler-Discussion

Hi Laurence,

I've been trying to use AntWordProfiler for Tibetan, without success. The word lists load properly, but the input text doesn't get segmented correctly.

Input text: "ཉིན་མོ་ བདེ་ལེགས་ མཚན་ བདེ་ལེགས་ །"

Reference Lists:
------------- Level 1 ---------------

"བདེ་ལེགས་<tab>བདེ་ལེགས་"
"ཉིན་མོ་<tab>ཉིན་མོ་"
"།<tab>།"

------------- Level 2 ---------------

"མཚན་མོ་<tab>མཚན་མོ་<tab>མཚན་"

"བཀྲ་ཤིས་ <tab>བཀྲ་ཤིས་<tab>བཀྲ་ཤིས་པ་"

The lists seem to load properly, judging from the number of entries and from the fact that exporting them gives me back the same content. However, the text isn't tokenized properly and the analysis fails. Here's what I get when looking at the types:

Screenshot 2022-12-25 123749.png

...when I'm expecting to get:
1. ཉིན་མོ་
2. བདེ་ལེགས་
3. མཚན་
4. །

I think the issue is that Unicode Punctuation (། and ་ are missing) and Marks (ཉིན is split into ཉ and ན with ི missing; ལེགས is split into ལ and གས with ེ missing) aren't being handled properly. AntConc has the same issue by default, which can be fixed in the Token Definition Settings.

unnamed (3).png
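
To make the difference concrete, here is a minimal Python sketch of the behaviour I'm describing. It is not how AntWordProfiler or AntConc actually tokenize (I don't know their internals); the `tokenize` function and its `classes` parameter are hypothetical, and I'm assuming a token is simply a maximal run of characters whose Unicode general category falls in the allowed classes.

```python
import unicodedata


def tokenize(text, classes=("L",)):
    """Split text into maximal runs of characters whose Unicode general
    category starts with one of `classes` (L = Letter, M = Mark,
    P = Punctuation). Hypothetical helper, for illustration only."""
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in classes:
            current.append(ch)
        elif current:
            # Any other character ends the current token and is dropped.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


text = "ཉིན་མོ་ བདེ་ལེགས་ མཚན་ བདེ་ལེགས་ །"

# Letters only: vowel signs (Marks) and the tsheg/shad (Punctuation)
# act as separators, so syllables are broken apart.
letters_only = tokenize(text)

# Letters + Marks + Punctuation: only whitespace separates tokens.
with_marks_punct = tokenize(text, classes=("L", "M", "P"))

print(letters_only)
print(with_marks_punct)
```

With letters only, the vowel signs and the tsheg are treated as separators and disappear, which matches the fragments in the screenshot. With Marks and Punctuation included, the text splits on whitespace alone and yields the four types I'm expecting.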

Looking at this issue, a couple of questions come to mind.
1. Is there a reason Unicode Punctuation and Marks aren't enabled by default? These days it's fair to assume that most corpora are in Unicode, so disabling Unicode Punctuation and Marks should be the exception.
2. If there is a reason not to enable Unicode Punctuation and Marks by default, would it be possible to add the Token Definition Settings to AntWordProfiler, just as they appear in the corpus manager for AntConc?

Thanks a lot for your work!
NT
