Input text: "ཉིན་མོ་ བདེ་ལེགས་ མཚན་ བདེ་ལེགས་ །"
...when I'm expecting to get:
1. ཉིན་མོ་
2. བདེ་ལེགས་
3. མཚན་
4. །
I think the issue is that Unicode Punctuation and Marks are not being handled properly: the punctuation characters ། and ་ are missing, and the vowel marks are dropped, so ཉིན is split into ཉ and ན (with ི missing) and ལེགས is split into ལ and གས (with ེ missing). AntConc has the same issue by default, but there it can be fixed in the Token Definition Settings.
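To illustrate what I suspect is happening, here is a minimal Python sketch. It is not AntWordProfiler's actual code. It assumes a token is a run of characters whose Unicode general category belongs to an allowed set, and the `tokenize` function and its `allowed` parameter are hypothetical names used only for this example. With letters only, the vowel signs (category Mn) and the tsheg and shad (category Po) act as separators. With Letters, Marks and Punctuation all allowed, only the spaces separate tokens.

```python
import unicodedata


def tokenize(text, allowed=("L",)):
    """Split text into runs of characters whose Unicode major category is allowed.

    Hypothetical helper for illustration only: "L" = Letters, "M" = Marks,
    "P" = Punctuation. Any character outside the allowed set ends the token.
    """
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in allowed:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


text = "ཉིན་མོ་ བདེ་ལེགས་ མཚན་ བདེ་ལེགས་ །"

# Letters only: vowel signs and punctuation break the syllables apart.
letters_only = tokenize(text, allowed=("L",))

# Letters, Marks and Punctuation: only whitespace separates tokens.
with_marks_punct = tokenize(text, allowed=("L", "M", "P"))

# Unique types in order of first appearance, as a word list would show them.
types = list(dict.fromkeys(with_marks_punct))

print(letters_only)
print(with_marks_punct)
print(types)
```

The second setting produces the four types I listed above, while the first reproduces the kind of fragments I am seeing.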
Looking at this issue, a couple of questions come to mind.
1. Is there a reason why Unicode Punctuation and Marks are not enabled by default? These days it's fair to assume that most corpora are in Unicode, so disabling Unicode Punctuation and Marks should be the exception.
2. If there is a reason not to enable Unicode Punctuation and Marks by default, would it be possible to add the Token Definition Settings to AntWordProfiler, just as they appear in the corpus manager for AntConc?
Thanks a lot for your work!
NT