Dear Marlene and all,
Here is the update on the issue of token definitions in AntConc and WordSmith.
First, AntConc:
Before processing any text, the user must specify the character encoding of the raw corpus files under the Global Settings -> Language (character) encoding option. The default is 'iso-8859-1', commonly called Latin 1, which is widely used to encode English and other European-language texts. Texts in other languages will be encoded either in a legacy encoding (e.g., the popular Shift-JIS encoding here in Japan) or in a Unicode encoding (e.g., the international UTF-8 standard). I stress: the encoding *must* be specified before accurate processing can be done.
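To see why specifying the encoding matters, here is a minimal Python sketch (my own illustration, not AntConc's code) showing that the same raw bytes decode to different text depending on the encoding the reader assumes:

```python
# The same raw bytes, read under two different encoding assumptions.
raw = "résumé".encode("latin-1")   # the bytes a Latin-1 (iso-8859-1) file would contain

as_latin1 = raw.decode("latin-1")                   # correct assumption
as_utf8 = raw.decode("utf-8", errors="replace")     # wrong assumption: 0xE9 is not valid UTF-8 here

print(as_latin1)  # résumé
print(as_utf8)    # r�sum� (replacement characters where decoding failed)
```

If the wrong encoding is assumed, every accented character is silently corrupted, and so is every word count built on top of it.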
Within AntConc, the files are then converted to an internal Unicode standard representation. (Some people think this is UTF-8 but it's not really, and an explanation would get *very* complicated).
The token definition in AntConc is completely determined by the token definition settings under the Global Settings menu option. A sequence of token-definition characters delimited by non-token-definition characters will be treated as a 'word'. In AntConc, the default setting is the "Letter" class of the Unicode standard. In other words, any character in the Unicode standard that has a "Letter" character property will be included. See here for a description:
http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf
Other classes, such as the Unicode "number" or "punctuation" classes, can be added to the token definition, or the user can type characters directly to build up a 'user-defined' token definition.
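A rough sketch of what tokenizing on the Unicode "Letter" class means, using Python's standard unicodedata module as a stand-in for the Unicode property tables (this is my own illustration of the idea, not AntConc's implementation):

```python
import unicodedata

def is_letter(ch):
    # True for any character whose Unicode general category starts with 'L'
    # (Lu, Ll, Lt, Lm, Lo) -- the "Letter" class of the Unicode standard.
    return unicodedata.category(ch).startswith("L")

def tokenize(text):
    # Runs of 'Letter' characters, delimited by anything else, count as words.
    tokens, current = [], []
    for ch in text:
        if is_letter(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("don't stop 2 words"))  # ['don', 't', 'stop', 'words']
```

Note that with a pure 'Letter' definition, the apostrophe splits "don't" into two tokens and the digit "2" disappears entirely; this detail becomes important below.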
Note that in Version 3.3.0, I am thinking of changing the default from Latin 1 to either the UTF-8 international standard, or perhaps ANSI on Windows (to match WordSmith - see below) and UTF-8 on OS X and Linux.
Next WordSmith:
WordSmith is built only for Windows computers. Because of this, it makes the same assumptions about the encoding of texts as Windows itself does.
On Windows computers, when a user saves a text file (e.g., using Notepad), the default encoding will be something called 'ANSI'. On a Windows system, ANSI is just a poor term for the system's default 'code page', which in turn is just another term for the system's default legacy encoding. Unfortunately, the code page of a Windows system differs depending on *the language* of the OS. So the ANSI encoding on most European Windows systems is cp1252 ('Latin 1'); on Japanese Windows it is cp932 (Shift-JIS); and on Chinese Windows, Korean Windows, etc., it is different again.
Compounding the problem are the poor terms Windows uses when a user tries to save a file with a non-default encoding. If a file contains non-'ANSI' characters, Windows will suggest that the user save the text as 'Unicode', another poor term that really means the UTF-16LE encoding. The user can also choose encodings that Windows calls 'Unicode Big Endian', which really means UTF-16BE, or 'UTF-8', which really means UTF-8 BOM. UTF-8 BOM is UTF-8 with an extra character (a Byte Order Mark) added to the beginning of the file to mark it explicitly as being in UTF-8.
When WordSmith starts to process corpus files, it will assume the files have the ANSI encoding (i.e., the legacy 'code-page' encoding of the system). It can also auto-detect if the file is encoded in UTF-16LE or UTF-8 BOM (but probably not UTF-16BE). It then converts the file to an internal Unicode representation (based on UTF-16LE).
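The BOM-sniffing step can be sketched as follows. This is only my illustration of the general logic such a tool might use; the function name and the cp1252 fallback are my own assumptions, not WordSmith's actual code:

```python
import codecs

def sniff_encoding(data: bytes, default: str = "cp1252") -> str:
    # Guess the encoding from a byte-order mark at the start of the file;
    # fall back to the system 'ANSI' code page otherwise (cp1252 here is
    # just an example -- the real fallback depends on the OS language).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return default

print(sniff_encoding("corpus".encode("utf-8-sig")))  # utf-8-sig
print(sniff_encoding(b"plain ANSI bytes"))           # cp1252
```

A plain UTF-8 or UTF-16BE file carries no marker this simple check can detect, which is exactly why such files get misread as 'ANSI'.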
So, if you create your own corpus on a Windows system, or receive a corpus that was created on a Windows system with the same ANSI setting, you really don't have to worry about encodings with WordSmith, because it will process everything in the correct way, just as Notepad does. Things will get complicated, however, if you use files created on Windows systems that don't use the same ANSI encoding, and of course, files created on non-Windows systems.
For example, if you are on a Windows system in the UK and save a corpus that contains some characters with French accents, or some German umlauts, etc., using the ANSI setting, the corpus will be saved in the Latin-1 encoding. Then, if you send this file to somebody in Japan, those characters will become garbled, because their system will try to read the file using the Shift-JIS encoding. The same will happen if a Japanese person tries to make and send a corpus to someone in the UK. (And of course, the files will be garbled in different ways if users in China and Korea try to exchange files, because the ANSI encodings in those countries are different again!)
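The UK-to-Japan scenario can be reproduced in a few lines of Python (a sketch of the failure mode, not of either program's internals):

```python
# Bytes saved as 'ANSI' (Latin-1) on a UK/European Windows system...
data = "Müller, café".encode("latin-1")

# ...then read as 'ANSI' (cp932, i.e. Shift-JIS) on a Japanese Windows system.
garbled = data.decode("cp932", errors="replace")

print(garbled)  # the accented characters come out as wrong kana/kanji or replacement marks
```

The bytes themselves are never changed in transit; only the two systems' conflicting assumptions about what the bytes mean cause the corruption.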
In short, WordSmith processes texts in exactly the same way that Notepad does, because it uses the same underlying assumptions. As most corpus users in the world are using Windows systems based on the 'ANSI' setting of Latin 1, things will work without problems. People working on Windows in non-Latin-1 countries, however, need to be careful. In these cases, it is obviously safer to use a Unicode encoding (UTF-16LE or UTF-8 BOM). In fact, I would recommend always using UTF-8 BOM when creating a corpus (because UTF-8 is the standard that most corpus linguists around the world use), and, when you send a corpus to anyone, always tell them that you used UTF-8 BOM.
Now that the encoding issue is dealt with, we can finally understand what WordSmith does with a corpus file internally after it has converted it to a Unicode representation. Here, I can quote Mike Scott directly:
###
WordSmith reads in the corpus files and then for each file, determines for each character whether it's classified as a 'letter', a 'number', a 'space', or a 'punctuation' character based on the Unicode (Microsoft version) standard. A sequence of only 'letter' characters delimited by space or punctuation will be treated as a 'word' and counted separately. Any other space-delimited sequence including a number will be treated as a 'number' and optionally included as a 'word' or else (the default) counted together with the total appearing under the # label. All 'space' characters and 'punctuation' characters will be ignored (unless the characters are specifically added by the user).
###
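Mike's description above can be sketched roughly as follows, again using Python's unicodedata as a stand-in for the (Microsoft-flavoured) Unicode tables; the function is my own paraphrase of the rule, not WordSmith's code:

```python
import unicodedata

def classify_tokens(text):
    # Split on spaces, then label each chunk: all-'Letter' chunks are words;
    # chunks containing a 'Number' character are lumped together under the
    # '#' label (WordSmith's default); punctuation is stripped and ignored.
    words, number_count = [], 0
    for chunk in text.split():
        core = "".join(ch for ch in chunk
                       if not unicodedata.category(ch).startswith("P"))
        if not core:
            continue
        if any(unicodedata.category(ch).startswith("N") for ch in core):
            number_count += 1    # counted together under the '#' label
        elif all(unicodedata.category(ch).startswith("L") for ch in core):
            words.append(core)
    return words, number_count

print(classify_tokens("In 1990, sales rose 5% to 12 units."))
# → (['In', 'sales', 'rose', 'to', 'units'], 3)
```

So by default the three number-containing chunks would not appear as separate entries in the word list; they would all surface as a single '#' item with a count of 3.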
So, you can see that both AntConc and WordSmith use the same 'Letter' class. This gives me hope that the frequency counts should be *exactly the same*, provided that the encodings all match.
The only real complication is that WordSmith has a feature that allows users to specify non-'Letter' characters *within* a string of 'Letter' characters (for example, an apostrophe ['] so that words like "don't" are treated as single words). AntConc currently does not have this feature, so the frequency counts will differ when this option is used. And this comes back to Marlene's original question. The frequency list she used was generated in WordSmith with this feature activated, i.e., the word-list entries included apostrophes, hyphens and other non-'Letter' characters. The biggest problem was that numbers in WordSmith are all conflated into a single 'word' labeled #, which, of course, is also not a 'Letter' character in AntConc.
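The effect of that one setting on the counts can be shown with a small sketch (my own illustration, with an optional set of user-added token-internal characters):

```python
import unicodedata

def tokenize(text, extra=""):
    # Letters, plus any user-specified extra characters (e.g. an apostrophe),
    # count as token-internal; everything else delimits.
    def in_token(ch):
        return ch in extra or unicodedata.category(ch).startswith("L")
    tokens, current = [], []
    for ch in text:
        if in_token(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("don't stop"))             # ['don', 't', 'stop']  (pure 'Letter' definition)
print(tokenize("don't stop", extra="'"))  # ["don't", 'stop']     (apostrophe added as token-internal)
```

With the apostrophe added, the same sentence yields two tokens instead of three, so word lists generated under the two settings can never match exactly.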
Sorry for the very long posting. I hope this clarifies all the issues surrounding encodings and token definitions.
Laurence.