Keyword list - Antconc vs Wordsmith


Marlene

unread,
Nov 9, 2011, 5:58:41 AM11/9/11
to AntConc-discussion
Dear Mr Anthony,

I have a few questions regarding the keyword list tool of AntConc.
I'm working as a student assistant at a university, and my professor
would like to adopt AntConc for her ESP genre analysis course. She is
currently using WordSmith Tools, but that programme has the obvious
disadvantage that students cannot use it at home or for their future
research projects without buying it.

The problems are now the following:

The reference corpus:
In order to be able to compare the students' own corpora with a
reference corpus, I am trying to prepare two options, for British and
American English respectively. For American English I have downloaded
a 500,000-word unlemmatised frequency list from COCA and have
formatted it in the correct AntConc format (RANK FREQUENCY WORD). When
I try to generate a keyword list, the following error messages appear:
1. "One or more of the rank column values is not a number."
2. "One or more of the frequency values is not a number."
3. "One or more of the word column values is not defined."
However, regardless of these error messages, AntConc does generate a
keyword list.
Thus my first question is whether these error messages actually
affect the accuracy of the analysis, and how I could resolve the
issues.
I have tried to check the word list manually (in a very superficial
manner, though) and I found the rank numbers to be accurate (compared
to the Excel line numbers). Some of the lexical entries contain special
characters such as \ or (); could this be the problem behind error
message 3?

For the British English reference list I would like to use the BNC,
but exactly which raw frequency list to use is a bit of a tricky
issue. So far I have found no better option than using the WordSmith
wordlist (as Kilgarriff's lists don't open and the ones by Leech,
Rayson & Wilson can't be formatted properly). I have saved the BNC
wordlist from WordSmith as a txt file, again formatted it the right
way for AntConc in Excel, and saved it again as a txt file. I tried it
the same way I did with COCA, but here too an error message appears,
though only error 3, "One or more of the word column values is not
defined". In this case could the error be due to the use of the
symbol # in one of the lines?
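[For readers who want to reproduce these checks: the three error messages above can be mimicked with a short script. This is only a sketch based on the RANK FREQUENCY WORD format described in the thread, not AntConc's actual parser; the name check_reference_list is invented for illustration.]

```python
def check_reference_list(path, encoding="utf-8"):
    """Report lines in a RANK FREQUENCY WORD reference list that would
    trigger AntConc-style error messages (sketch, not AntConc's parser)."""
    problems = []
    with open(path, encoding=encoding) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():  # blank or tab-only lines confuse parsers
                problems.append((lineno, "empty line"))
                continue
            fields = line.split()
            if len(fields) < 3:
                problems.append((lineno, "missing column"))
                continue
            rank, freq, word = fields[0], fields[1], fields[2]
            if not rank.isdigit():
                problems.append((lineno, "rank is not a number"))       # cf. error 1
            if not freq.isdigit():
                problems.append((lineno, "frequency is not a number"))  # cf. error 2
            if not any(ch.isalpha() for ch in word):
                problems.append((lineno, "word contains no letters"))   # cf. error 3
    return problems
```

Entries such as "\" or "()" or "#" would be flagged by the letter check, which matches the guess about error 3.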

Comparison of results:
The second issue is that I finally tried to compare the results of
WordSmith and AntConc by generating a keyword list with both, using
the same reference list (the BNC World wordlist) and the same corpus
of 51 company profiles. I have read in one of your articles (published
in IWLeL 2004) that the keyword tool of AntConc basically works the
same way as the one in WordSmith.
However, apparently it does not, because the keyword lists generated
by the two programmes differ markedly. In the AntConc list, a lot of
words I would consider specific to the genre of company descriptions
appear at the top. The difference could of course be due to the
problem with the BNC reference list in AntConc described above, but it
is still quite astonishing.
I will include the beginning of each keyword list below so that you
can see what I mean, but basically my question is how the keyword tool
in AntConc actually works. Where could the differences come from?

I know these are a lot of questions, and I thank you in advance for
your patience. I am relatively new to corpus linguistics, so please
forgive me if I have made some fundamental mistakes in my reasoning or
procedures.
Furthermore, the issue is not pressing, as the course will only start
next term, but I would appreciate it a lot if you could help!

Marlene Schwarz

Keyword list generated by AntConc:

1 770 3385.020 the
2 759 3182.206 and
3 496 1994.045 of
4 433 1432.479 to
5 438 1424.764 in
6 196 1399.418 our
7 173 1060.175 we
8 169 881.670 is
9 153 789.418 with
10 291 784.080 a
11 96 634.201 has
12 99 565.705 are
13 147 564.065 for
14 72 502.655 customers
15 90 437.537 business
16 67 404.940 services
17 75 404.730 company
18 62 394.848 its
19 55 390.785 solutions
20 56 354.941 products
21 91 328.484 as
22 71 321.745 service
23 72 293.960 that
24 45 291.811 also
25 56 284.579 this
26 62 283.397 from
27 40 278.722 companies
28 35 267.183 nextel
29 41 253.956 their
30 38 241.402 clients

Keyword list generated by WordSmith: (sorry about the layout, but
that's what the txt file output looks like...)

WordSmith Tools 4.0 -- 8.11.2011

Key word Freq. % RC. Freq. RC. % Keyness P Lemmas Set
1 OUR 196 1,18 93455 0,09 633,33 0,0000000000
2 NEXTEL 35 0,21 0 609,02 0,0000000000
3 CUSTOMERS 72 0,43 6698 457,73 0,0000000000
4 AHOLD 25 0,15 0 435,00 0,0000000000
5 SOLUTIONS 55 0,33 2541 425,14 0,0000000000
6 BUSINESS 90 0,54 35127 0,04 323,74 0,0000000000
7 PRODUCTS 56 0,34 10587 0,01 278,63 0,0000000000
8 SERVICES 67 0,40 24866 0,02 247,23 0,0000000000
9 SERVICE 71 0,43 30252 0,03 243,72 0,0000000000
10 GLOBAL 37 0,22 3527 233,38 0,0000000000
11 INTERNET 19 0,11 97 227,18 0,0000000000
12 CLIENTS 38 0,23 4828 218,35 0,0000000000
13 COMMUNICATIONS 34 0,21 3475 209,81 0,0000000000
14 TM 18 0,11 119 206,64 0,0000000000
15 CUSTOMER 35 0,21 4348 202,64 0,0000000000
16 AND 759 4,58 2624341 2,64 199,62 0,0000000000
17 SCO 21 0,13 417 196,98 0,0000000000
18 RIBEYE 11 0,07 0 191,39 0,0000000000
19 WE 172 1,04 300833 0,30 181,08 0,0000000000
20 COMPANY 61 0,37 35947 0,04 173,15 0,0000000000
21 SEMICONDUCTORS 14 0,08 75 166,16 0,0000000000
22 UK 44 0,27 16534 0,02 161,29 0,0000000000
23 MOBILE 23 0,14 1672 157,27 0,0000000000
24 ECOLAB 9 0,05 0 156,59 0,0000000000
25 WARCRAFT 9 0,05 0 156,59 0,0000000000
26 DIGITAL 23 0,14 1915 151,15 0,0000000000
27 LSC 9 0,05 1 150,09 0,0000000000
28 STASYS 9 0,05 4 140,55 0,0000000000
29 FOODSERVICE 10 0,06 15 140,35 0,0000000000
30 COMPANIES 40 0,24 17766 0,02 134,17 0,0000000000

Laurence Anthony

unread,
Nov 9, 2011, 7:17:54 AM11/9/11
to ant...@googlegroups.com
Hi Marlene,

It's really nice to get this sort of question. As you point out, the difference in the results is startling, but I imagine that the reason is simply some setting difference. However, it does show how much the choice of software (and the settings chosen) affects results. Rather than trying to trace the cause via a series of emails, can you send me the following in a zip file (assuming you have no license problems with the corpus) using my normal email address?

1) The settings file used for AntConc. You can create this by just going to the File menu and choosing "Export Settings File".
2) The BNC reference word list that you used.
3) Your target corpus of 51 company profiles.

I'll try and replicate the results you generated using my versions of the software. (Please let me know if you changed any of the default settings for WordSmith tools).

Regarding the errors that you get when importing keyword reference lists, they relate to the token definition that you are using. AntConc uses a very transparent token definition based on Unicode character classes. Probably the word list that you are importing includes 'words' that don't match the definition (e.g. numbers, hyphens, etc.). This will produce the errors and ultimately create (small) differences in the two sets of results. (If anybody knows what token definition is used by WordSmith Tools, I would be very interested to know.) But the two sets of results should be largely the same. I'll see if I can identify the problem with the imported lists, too.

Let me know if you have any license problems sending your target corpus. I can probably find a workaround if you do.

Best regards,
Laurence.

Warren Tang

unread,
Nov 9, 2011, 3:51:28 PM11/9/11
to ant...@googlegroups.com
Marlene, Laurence,
The results do not look right. 'The' (and all the other function words) should not be that high, so there is something wrong with the calculation in AntConc. I do feel the older versions (3.2.x) are more stable and reliable, so maybe the same data should be tested on those again.

My two cents ...


Regards,
Warren

Laurence Anthony

unread,
Nov 9, 2011, 7:28:35 PM11/9/11
to ant...@googlegroups.com
Hi Warren,

It's nice to see other people commenting here.

However, your statement, "so there is something wrong with the calculation in Antconc," is going to worry a lot of people. As far as I can see, there is nothing wrong with the AntConc keyness calculation based on log likelihood, which Marlene is using. I have checked it against several sources including the easy to use log likelihood calculator at Lancaster University (http://ucrel.lancs.ac.uk/llwizard.html) and the results match perfectly. As I wrote yesterday, the problem is almost certainly related to the BNC frequency list that Marlene is using. I'm investigating this now.
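[For readers following the statistics: the log-likelihood keyness measure referred to here, as implemented in the Lancaster wizard, can be sketched in a few lines. This is an illustrative reimplementation of the published formula, not AntConc's source code.]

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Keyness by log-likelihood: compare a word's observed frequencies in a
    target and a reference corpus with the frequencies expected if the word
    were distributed evenly (LL = 2 * sum of O * ln(O / E))."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll
```

A word that is proportionally equally frequent in both corpora scores 0; the score grows as the proportions diverge, which is why genre-specific words rise to the top of a keyword list.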

Also, you wrote "I do feel the older versions (3.2.x) are more stable and reliable".

This is also worrying, but confusing too. The latest version *is* a 3.2.x version, i.e., 3.2.4. Which versions do you find less stable and reliable? Also, could you explain what aspects you find less stable or reliable? I certainly hope to make AntConc better with a new release, not worse! So, more details would be very much appreciated.

Looking forward to more of your comments and suggestions.

Laurence.

Dr CK Jung

unread,
Nov 10, 2011, 1:22:35 AM11/10/11
to ant...@googlegroups.com
Hi Warren

If you want to say "there is something wrong with the calculation in
Antconc", you must show us evidence.

Regards
CK


Laurence Anthony

unread,
Nov 10, 2011, 2:19:37 AM11/10/11
to ant...@googlegroups.com
Dear All,

I've now finished investigating the issue raised by Marlene. I'm happy to report that the Keyword Tool in AntConc is working fine. Assuming a correctly formatted reference list is being used, the results perfectly match those of both WordSmith Tools and the Lancaster Univ. Log-Likelihood calculator. I should add that the keyness values generated by the chi-squared measure are also correct and match those from various sources. Here's a brief summary.

First, the difference in Keyness scores that Marlene observes is directly related to the BNC list that she is importing into AntConc. AntConc assumes that the reference list is formatted in a certain way (see the readme file). If AntConc finds something wrong, it stops reading the file, reports that there are missing values, and then continues to calculate the results on what it has managed to process. In Marlene's data, there is an error generated on line 7 of the BNC file. That's why the results are so different! (I have explained to Marlene directly how she can reformat her data.)

In hindsight, I need to stop AntConc proceeding to calculate the results when the reference list is not properly formatted. I'll introduce this in the next upgrade.

One other area of confusion is that AntConc does not convert the reference list to lowercase, even when the 'treat all data as lowercase' option is selected. The whole issue of when and where the 'treat all data as lowercase' option is applied has caused confusion for a very long time. In the next release, I'm going to try to resolve this confusion by removing the option from each individual tool, putting it in one place in the global settings, and making it apply to everything. This should simplify things considerably.

Please remember that I'm really happy to get feedback, both positive and negative. I look forward to further feedback from you all.

Happy concordancing!

Laurence.

Warren Tang

unread,
Nov 10, 2011, 6:55:25 AM11/10/11
to ant...@googlegroups.com
Laurence, Dr Jung, Marlene,
I was assuming you were using identical statistical techniques.

But from a quick look at the WordSmith Tools results (a piece of software I don't use), they look like MI3 results. Since AntConc uses log-likelihood and chi-squared, they would naturally be different. Both log-likelihood and MI3 give some weight to grammatical words, with the latter giving much more. For a nice summary of this, see Paul Baker's Using Corpora in Discourse Analysis (2006).

Another problem is the settings in WordSmith Tools, as Laurence mentioned. Numbers are counted as words in WordSmith, while in AntConc they are not. If your file has a lot of numbers, then the counts will be skewed.

Otherwise, my other guess was that a typo during programming would also give these strange numbers. After all, that is not hard to do.


Warren

Laurence Anthony

unread,
Nov 10, 2011, 8:15:40 AM11/10/11
to ant...@googlegroups.com
Hi Warren,

Thank you for the clarification. But, as I wrote in my earlier report, the results of WordSmith and AntConc do match when you use the correctly formatted BNC reference list. WordSmith is using Log-likelihood correctly. Actually, WordSmith was one of the first pieces of easy-to-use software to calculate Log-Likelihood, and I am only following Mike Scott's great work by adding it to AntConc.

There is an issue about the token definition used by WordSmith, and I plan to ask Mike about this shortly.

Do you have any comments on the other thing you said:  "I do feel the older versions (3.2.x) are more stable and reliable".

Laurence.

Warren Tang

unread,
Nov 10, 2011, 2:44:13 PM11/10/11
to ant...@googlegroups.com
Hi Laurence,
Sorry, I didn't see the earlier comment and sorry for not replying to your question.

As you know, I am doing my PhD at the moment. This means remaining with version 3.2.1 until completing it, for the sake of consistency.

When I said 'stable' and 'reliable', I had been talking about 3.3.x, which I thought had already been released. I was jumping ahead, so I must apologize for the misleading comment. I have no complaints and all praise for 3.2.x, and I know that quality is what you do, Laurence. AntConc is a quality piece of software.

Please do talk to Mike about the word definition issue. I would like to see that resolved so that the two pieces of software can produce matching lists.


Warren

Laurence Anthony

unread,
Nov 11, 2011, 3:09:29 AM11/11/11
to ant...@googlegroups.com
Dear Warren,

Thank you for the clarification. I was wondering if you were referring to version 3.3.0.

As you know, AntConc 3.3.0 is going to be released officially very soon; for now it just appears on my site as a 'development' release. This allows me to show everybody the direction I'm going with the software and helps me find some of the subtler problems before the official version is finalized.

Best regards,
Laurence


Marlene

unread,
Nov 22, 2011, 5:34:23 AM11/22/11
to AntConc-discussion

Dear all,

After being busy at work with a lot of other things, I'm now back
to working on my AntConc reference list problems.
Again, Laurence, thank you so much for your help! Your immediate and
thorough replies made the problem much clearer to me and helped me a
lot.
Unfortunately, I still don't have an idea how to really solve it,
because I don't have access to full versions of the BNC and COCA.
I tried to use the BNC Baby, but it's XML of course, and there are no
txt files on the CD-ROM that I could use with AntConc.
So now I have tried a different line of thought: what if, rather than
adapting the reference list to AntConc, I try to adapt the token
definition of AntConc according to the reference list?
I know there is such a possibility in AntConc (by the way, I'm using
3.2.4) and I just wondered if that could help.
It might of course be a nonsensical idea, but please have patience
with me; as I said, I'm only a beginner in corpus linguistics.
Does anybody have thoughts on the issue? Any input would be greatly
appreciated!

Best,
Marlene

Marlene

unread,
Nov 22, 2011, 7:22:55 AM11/22/11
to AntConc-discussion
Hi again,

I just wanted to add that I have now tried to use both the BNC and
COCA lists with a user-defined token definition in AntConc, and that
of course didn't work, because "-" is then counted as a word, for
instance. I should have tried it first before posting here.
However, the only thing I did was substitute the # in BNC line 7 with
XX, and then the é-' entries did not create any more error messages!
For COCA I found out that two of the error messages (the rank and
frequency ones) were due to an additional tab sign in an empty line
at the end of the list, which seemed to confuse AntConc.
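[The two fixes described here (dropping the stray tab-only line and substituting WordSmith's # entry) can be scripted. This is only a sketch of the manual edits, using the same XX placeholder; clean_wordsmith_list is an invented name.]

```python
def clean_wordsmith_list(text, number_placeholder="XX"):
    """Drop blank/whitespace-only lines from a tab-separated
    RANK FREQUENCY WORD list and replace a bare '#' word (WordSmith's
    bucket for all numbers) with a placeholder token."""
    cleaned = []
    for line in text.splitlines():
        if not line.strip():  # e.g. a trailing tab-only line
            continue
        fields = line.split("\t")
        if len(fields) == 3 and fields[2] == "#":
            fields[2] = number_placeholder
        cleaned.append("\t".join(fields))
    return "\n".join(cleaned) + "\n"
```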

Laurence, I know you advocated creating my own lists with a more
rigorous definition of a word, because even if I substitute a special
character like #, the total frequencies are still going to be based on
assumptions like "#" being a word. However, the results I now get are
very close to WordSmith's results (including the keyness measures),
and since the lists will only be used for an ESP course, which
introduces students to genre analysis, I think slight deviations will
not actually be a great issue. This is, of course, for my professor to
decide, and in the meantime I would like to thank you for your
support!

Marlene

Laurence Anthony

unread,
Nov 22, 2011, 8:10:37 AM11/22/11
to ant...@googlegroups.com
Dear Marlene,

Thanks for the update.

I've recently had an interesting chat with Mike Scott about the token definition used in WordSmith. The actual story is quite complex so I'll need a little time to post a summary. Also, I've made some changes to AntConc so that the WordSmith word lists can be mostly used without change. I've also added some error messages to give users an idea of where differences might be found. These are all coming up in the Release Candidate 2 of Version 3.3.0. It's sitting right in front of me!

I agree that your little hack shouldn't make too much difference to the results.

Best regards,
Laurence.


Laurence Anthony

unread,
Nov 27, 2011, 10:51:04 PM11/27/11
to ant...@googlegroups.com
Dear Marlene and all,

Here is the update on the issue of token definitions in AntConc and WordSmith.

First, AntConc:

Before processing any text, the user must specify the character encoding of the raw corpus files under the Global Settings->Language (character) encoding option. The default is 'iso-8859-1', which is commonly called Latin 1 and is commonly used to encode English and European language texts. If other languages are to be used, they will either be encoded in a legacy encoding (e.g. the popular Shift-JIS encoding here in Japan) or perhaps a Unicode encoding (e.g. the international UTF-8 standard). I stress: *the encoding must be specified before accurate processing can be done*.

Within AntConc, the files are then converted to an internal Unicode standard representation. (Some people think this is UTF-8 but it's not really, and an explanation would get *very* complicated).

The token definition in AntConc is completely determined by the token definition settings under the Global Settings menu option. A sequence of token-definition characters delimited by non-token-definition characters will be treated as a 'word'. In AntConc, the default setting is the "Letter" class of the Unicode standard. In other words, any character in the Unicode standard that has a "Letter" character property will be included. See here for a description:
http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf
Other classes, such as the Unicode "number" or "punctuation" classes, can be added to the token definition, or the user can type characters directly to build up a 'user-defined' token definition.
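[As an illustration only, not AntConc's actual code: a Unicode-'Letter'-class tokenizer can be approximated in Python, where the character class [^\W\d_] matches exactly the characters with a Unicode Letter property.]

```python
import re

# Approximation of the default token definition described above: runs of
# Unicode "Letter" characters; everything else acts as a delimiter.
LETTER_RUN = re.compile(r"[^\W\d_]+", re.UNICODE)

def tokenize(text):
    return LETTER_RUN.findall(text)
```

Note that under this definition "don't" splits into two tokens, which is exactly the apostrophe issue discussed later in relation to WordSmith.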

Note that in Version 3.3.0, I am thinking of changing the default from Latin 1 to either the UTF-8 international standard, or perhaps ANSI on Windows (to match WordSmith - see below) and UTF-8 on OS X and Linux.

Next WordSmith:

WordSmith is built only for Windows computers. Because of this, it uses the same assumptions about the encoding of texts as Windows itself does.

On Windows computers, when a user saves a text file (e.g. using Notepad), the default encoding will be something called 'ANSI'. In a Windows system, ANSI is just a poor term used to describe the system's default 'code page', which is really just another term for the default legacy encoding of the system. Unfortunately, the 'code page' for a Windows system will be different depending on *the language* of the OS. So, the ANSI encoding on most European Windows systems is cp1252 ('Latin 1'). On Japanese Windows, it will be cp932 (Shift-JIS), and on Chinese Windows, Korean Windows, etc. it will be different again.

Compounding the problem are the poor terms used by the Windows system when a user tries to save a file with a non-default encoding. If a file contains non-'ANSI' characters, Windows will suggest that the user save the text as 'Unicode', which is another poor term that really means the Unicode UTF-16LE encoding. The user can also access other encodings that Windows terms 'Unicode Big Endian', which really means UTF-16BE, or 'UTF-8', which really means UTF-8 BOM. UTF-8 BOM means UTF-8 with an extra character (the Byte Order Mark) added to the beginning of the file to mark the file explicitly as being in UTF-8.

When WordSmith starts to process corpus files, it will assume the files have the ANSI encoding (i.e., the legacy 'code-page' encoding of the system). It can also auto-detect if the file is encoded in UTF-16LE or UTF-8 BOM (but probably not UTF-16BE). It then converts the file to an internal Unicode representation (based on UTF-16LE).

So, if you create your own corpus on a Windows system or receive a corpus that was created on a Windows system with the same ANSI setting, you really don't have to worry about encodings with WordSmith, because it will process everything in the correct way, as does NotePad. Things will get complicated, however, if you use files created on Windows systems that don't use the same ANSI encoding, and of course, files created on non-Windows systems.

For example, if you are on a Windows system in the UK and try to save a corpus that contains some characters with French accents, or some German umlauts, etc., using the ANSI setting, the corpus will be saved in the Latin-1 encoding. Then, if you send this file to somebody in Japan, those characters will become garbled, because their system will try to read the file using the Shift-JIS encoding. The same will happen if a Japanese person tries to make and send a corpus to someone in the UK. (And of course, the files will be garbled in different ways if users in China and Korea try to exchange files, because the ANSI encodings in these countries are different again!)

In short, WordSmith processes texts in exactly the same way that Notepad does, because it uses the same underlying assumptions. As most corpus users in the world are using Windows systems that are based on the 'ANSI' setting of Latin 1, things will work without problem. People working on Windows in non-Latin-1 countries, however, need to be careful. In these cases, it is obviously safer to use a Unicode encoding (UTF-16LE or UTF-8 BOM). In fact, I would recommend always using UTF-8 BOM when creating a corpus (because UTF-8 is the standard that most corpus linguists around the world use), and when you send a corpus to anyone, always tell them that you used UTF-8 BOM.
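[The BOM-based detection described above can be sketched as follows. This is an illustration rather than WordSmith's actual logic, and the cp1252 fallback is just one example of an 'ANSI' code page.]

```python
import codecs

def sniff_encoding(path, ansi_fallback="cp1252"):
    """Guess a text file's encoding from its Byte Order Mark, falling back
    to a legacy 'ANSI' code page (cp1252 on most European Windows systems)."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # Windows' "UTF-8"; this codec strips the BOM
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # Windows' "Unicode" / "Unicode Big Endian"
    return ansi_fallback    # no BOM: assume the legacy code page
```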

Now that the encoding issue is dealt with, we can finally understand what WordSmith does with a corpus file internally after it has converted it to a Unicode representation. Here, I can quote Mike Scott directly:

###
WordSmith reads in the corpus files and then for each file, determines for each character whether it's classified as a 'letter', a 'number', a 'space', or a 'punctuation' character based on the Unicode (Microsoft version) standard. A sequence of  only 'letter' characters delimited by space or punctuation will be treated as a 'word' and counted separately. Any other space-delimited sequence including a number will be treated as a 'number' and optionally included as a 'word' or else (the default) counted together with the total appearing under the # label. All 'space' characters and 'punctuation' characters will be ignored (unless the characters are specifically added by the user).
###

So, you can see that both AntConc and WordSmith use the same 'Letter' class. This gives me hope that the frequency counts should be *exactly the same*, provided that the encodings all match.

The only real complication is that WordSmith has a feature that allows users to specify non-'Letter' characters *within* a string of 'Letter' characters (for example, an apostrophe ['] to allow words like "don't" to be treated as single words). AntConc currently does not have this feature, so the frequency counts will differ when this option is used. And this comes back to Marlene's original question. The frequency list that she used was generated in WordSmith with this feature activated, i.e., the word list included apostrophes, hyphens and other non-'Letter' characters. The biggest problem was that numbers in WordSmith are all conflated into a single 'word' labeled #, which of course is also not a 'Letter' character in AntConc.
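[A rough sketch of that WordSmith option (letters optionally joined by a mid-word apostrophe or hyphen), again illustrative only:]

```python
import re

# Runs of Unicode letters, optionally joined by single apostrophes or
# hyphens *inside* the word -- a sketch of WordSmith's mid-word
# character option, not its actual implementation.
WORD_WITH_JOINERS = re.compile(r"[^\W\d_]+(?:['\-][^\W\d_]+)*", re.UNICODE)

def tokenize_with_joiners(text):
    return WORD_WITH_JOINERS.findall(text)
```

With this definition "don't" and "re-use" survive as single tokens, while a bare # or a trailing apostrophe is still excluded.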

Sorry for the very long posting. I hope this clarifies all the issues surrounding encodings and token definitions.

Laurence.