Hi David,
Character encodings are a very important consideration when designing a corpus. The simplest procedure these days is to save all data as UTF-8. The advantage of UTF-8 is that ASCII characters (a-z, A-Z, 0-9, punctuation, etc.) are stored unchanged, so a UTF-8 file containing only English text will usually open fine on any system with any software, even software that assumes plain ASCII. For non-English texts, UTF-8 can also encode all the characters correctly, but in that case the software must be told that the file is UTF-8 encoded.
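Just to illustrate the point at the byte level (a small Python sketch, not anything corpus-specific):

```python
# Pure ASCII text produces byte-for-byte identical output
# whether you encode it as ASCII or as UTF-8.
text = "Hello, corpus! 123"
assert text.encode("ascii") == text.encode("utf-8")

# Non-English text still round-trips safely through UTF-8, but the
# bytes are no longer plain ASCII, so the reading software must be
# told (or must guess) that the file is UTF-8.
japanese = "言語"
data = japanese.encode("utf-8")
assert data.decode("utf-8") == japanese
```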
Unfortunately, there are in effect two flavors of UTF-8: UTF-8 with a BOM and UTF-8 without a BOM. BOM means byte order mark; it is a special (invisible) character saved at the start of a text file that tells software the file is Unicode encoded. The official Unicode policy is basically that a BOM should not be included in UTF-8. However, on most Windows systems, software like Word and Notepad saves the BOM by default. If you open such files with Notepad or Word, the software detects the BOM and opens the file as UTF-8, hiding the BOM. This is convenient. However, if you use a tool that is *not* expecting UTF-8, it will often assume the file is plain ASCII and open it as such, and then the BOM can suddenly become visible, because its bytes are non-ASCII characters. (I think this is the main reason the official guidance is to leave the BOM out of UTF-8.)
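You can see the effect directly in Python. (I use Latin-1 below to stand in for the "dumb" tool, because a strict ASCII decoder would simply raise an error instead of showing the stray characters.)

```python
import codecs

# The UTF-8 BOM is the single character U+FEFF, stored as three bytes.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

data = codecs.BOM_UTF8 + "Hello".encode("utf-8")

# A BOM-aware reader (Python's "utf-8-sig" codec) hides the BOM...
assert data.decode("utf-8-sig") == "Hello"

# ...but a tool that treats every byte as a one-byte character
# shows the three BOM bytes as weird characters at the front.
assert data.decode("latin-1") == "ï»¿Hello"
```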
So, for your project, the best advice is to save the data as UTF-8 with a BOM. If you do this, you will be following the most common international text data standard (UTF-8), the files will open nicely on Windows systems (because of the BOM), and as more tools adapt to work with UTF-8, your files will open on them, too. You should still tell users what the encoding is. You should also be prepared for some users contacting you to ask why some of the files have a few weird characters at the front. The answer would be that they are opening a UTF-8 + BOM file as ASCII.
Note that as long as you have saved your data in some flavor of Unicode (e.g., UTF-16 LE), it can easily be converted automatically to UTF-8 (with or without the BOM). I might write a small script to do this, because so many people run into the problem.
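Such a script could look something like this (just a Python sketch of the idea, not the actual tool; it relies on the BOM to identify the Unicode flavor and assumes BOM-less input is already UTF-8):

```python
import codecs

# BOMs mapped to the encoding they signal. The UTF-32 BOMs must be
# checked before UTF-16, because the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),   # utf-8-sig strips the BOM itself
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def to_utf8(raw: bytes, add_bom: bool = False) -> bytes:
    """Convert bytes in any BOM-marked Unicode flavor to UTF-8."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            if encoding != "utf-8-sig":
                raw = raw[len(bom):]  # strip the BOM before decoding
            text = raw.decode(encoding)
            break
    else:
        text = raw.decode("utf-8")  # no BOM: assume BOM-less UTF-8
    out = text.encode("utf-8")
    return (codecs.BOM_UTF8 + out) if add_bom else out
```

For example, `to_utf8(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))` gives the plain bytes `b"hi"`, and passing `add_bom=True` prepends the three UTF-8 BOM bytes.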
There is one other problem. Currently, AntConc makes no assumptions about file encodings and simply attempts to open files in the encoding set in the global settings. If you try to open a UTF-8 file with the ASCII setting, all the non-ASCII characters (including any BOM) will appear as mojibake (unrendered characters). If you set the encoding correctly, the file will render correctly. Luckily, the default encoding in AntConc 3.3.5 and all future versions is UTF-8, so it will also read UTF-8 + BOM files without problem.
I should also note that very, very few people understand the problems of encodings and so I am now considering introducing a character encoding guessing algorithm into AntConc (the same as that used in your web browser to guess the encoding of foreign web pages). This will mean that for most cases the user will not need to know the encoding at all because AntConc will guess it correctly. In the few cases where it guesses wrongly, the user will have to go to the global settings and set the encoding explicitly. Of course, for the guessing algorithm to work well, having a BOM at the front of a UTF-8 file is very helpful.
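To give a feel for why the BOM helps, here is a minimal guessing heuristic in Python. (This is only an illustration, not AntConc's or any browser's actual algorithm; real detectors also use statistical models of byte frequencies.)

```python
def guess_encoding(raw: bytes) -> str:
    """Guess a file's encoding: a BOM is decisive; otherwise try
    strict UTF-8; otherwise fall back to a one-byte encoding."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"          # UTF-8 with BOM: unambiguous
    if raw.startswith(b"\xff\xfe") or raw.startswith(b"\xfe\xff"):
        return "utf-16"             # UTF-16 BOM (either byte order)
    try:
        raw.decode("utf-8")         # valid UTF-8 is rarely accidental
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"            # last-resort one-byte fallback
```

With a BOM the guess is trivially correct; without one, the heuristic can only say the bytes are *consistent* with UTF-8, which is why BOM-less files are occasionally misidentified.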
I hope that helps!
By the way, did you manage to solve the problem of matching results in AntConc and AntWordProfiler?
Laurence.