Hi,
Is there a way to turn on case-sensitive indexing?
I'm testing AntConc with legacy-encoded Devanagari text. When I used the Word tab/feature to generate a full word list, I noticed the results weren't legible Devanagari. It turns out that every character in a token that has a lower-case equivalent is being converted to it. For this corpus, that means characters are lost!
Example:
Öõý
-> öõý // first char is getting converted to lower-case
है। -> ैंै। // the legacy encoding maps the upper- and lower-case code points to different Devanagari characters, so a different word gets stored in the index and the original is lost!
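To show the effect outside AntConc, here is a minimal Python sketch. It is not AntConc's actual code; the `word_list` function and its `case_sensitive` flag are made up for illustration. It only demonstrates how lower-casing before counting merges tokens that are different words in a legacy encoding.

```python
from collections import Counter


def word_list(tokens, case_sensitive=True):
    """Count tokens; fold to lower case first unless case_sensitive is True.

    Hypothetical helper for illustration only, not an AntConc function.
    """
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)


# Two different legacy-encoded words that differ only in the "case"
# of their first character.
tokens = ["Öõý", "öõý", "Öõý"]

folded = word_list(tokens, case_sensitive=False)  # both words collapse into one entry
exact = word_list(tokens, case_sensitive=True)    # both words keep their own entry

print(len(folded), len(exact))
```

With case folding the list has one entry and the original "Öõý" is gone; with case-sensitive counting both words are kept.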
If case-sensitive indexing could be switched on, this tool would work on legacy-encoded documents as well. AntConc is a very good tool for running a number of tests on these legacy documents and for converting them to their Unicode equivalents.
Thanks for any feedback you can provide!