AntProfiler/AntConc - How can I ensure identical user-defined token definitions?


David King

Jun 6, 2013, 4:57:16 AM
to ant...@googlegroups.com
Hi

Thanks again, Laurence, for your reply to my previous post (regarding the issue of TTR and STTR). I've since run my files through AntProfiler, which gives the very convenient option of exporting the results as tab-separated text and pasting them into Excel, where I can just select the type column and token column - brilliant!

However, I've found that AntProfiler gives me very different type counts and token counts than AntConc. So, I went into AntProfiler's global settings to try to make the token setting match the one I have in AntConc, but to no avail.

In AntConc, my user-defined token class is: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'&0123456789. I can't figure out how to get the user-defined setting in AntProfiler to accept the apostrophe (') and ampersand (&) as part of the definition. (I'm guessing that [a-zA-Z0-9]+ would give me all letters and numbers.) I visited the Unicode pages recommended in the readme file, but I can't make head or tail of them!

Does anyone know how I can set AntProfiler to take into account these two symbols?  Many thanks in advance!

David

Laurence Anthony

Jun 6, 2013, 12:16:55 PM
to ant...@googlegroups.com
Hi David,

I think the following will work fine:

[a-zA-Z0-9'&]

If this doesn't work, please let me know and I'll look into it properly.

Laurence.
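For anyone who wants to check how this character class behaves before setting it in the tools, here is a small illustrative sketch. It is in Python rather than the Perl the tools use internally, and the sample sentence is made up; it only demonstrates what the class matches when applied repeatedly (i.e. as [a-zA-Z0-9'&]+):

```python
import re

# Illustrative only: this mimics a token definition of [a-zA-Z0-9'&]+,
# so apostrophes and ampersands count as word characters, while
# punctuation such as the final period is treated as a separator.
text = "It isn't Calvin & Hobbes."
tokens = re.findall(r"[a-zA-Z0-9'&]+", text)
print(tokens)  # ['It', "isn't", 'Calvin', '&', 'Hobbes']
```

Note that "isn't" survives as a single token because the apostrophe is inside the class.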





David King

Jun 6, 2013, 2:16:32 PM
to ant...@googlegroups.com
Hi Laurence

Thanks again for your assistance with this! Your suggestion was also my first thought, but when I tried it earlier I got the discrepancies, so I assumed some sort of special coding was needed. I've now tried it again and, as you can see from the attached spreadsheet, a lot of the time there isn't a difference, but sometimes the difference is noticeable. To make it easier to see, I've highlighted the cells where the sums don't match. Almost always it's with the type count, with AntConc returning a higher type count than AntProfiler. On the few occasions when the token count doesn't agree, it's AntProfiler that returns the higher figure. It's very perplexing, given that my token definitions in AntConc and AntProfiler now agree.

Kind regards
David
AntConc AntProfiler comparison for Type & Token counts 060613.xlsx

Laurence Anthony

Jun 9, 2013, 12:38:46 AM
to ant...@googlegroups.com
Hi David,

Let's keep the discussion here since this is where it started.

Have you looked at which words are not being counted correctly?

Perhaps you can just send me your files 9 and 10 directly. These should be enough to identify the problem. If your token definitions really are the same, the results should also match exactly because both tools use Perl as the underlying programming language. The version of Perl I use in the two tools is slightly different, but this should only really affect the counts for texts in a Unicode encoding. Plain English texts should be unaffected.

I suspect the problem can be identified and resolved in just a couple of minutes.

Laurence.


David King

Jun 10, 2013, 5:43:42 AM
to ant...@googlegroups.com
Hi Laurence

Thanks for your speedy response!  I've attached file 10 as well as a spreadsheet showing how AntConc and AntProfiler read the file differently.  The token definition setting I use in AntConc is:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'&0123456789  (copied and pasted directly from AntConc settings)

The setting I use in AntProfiler is:

[a-zA-Z0-9'&]+  (copied and pasted directly from AntProfiler settings)

As you can see from the spreadsheet, AntProfiler is not recognising the apostrophe, because the t from isn't and the s from that's are being counted separately (as are the isn from isn't and the that from that's). AntProfiler is also not counting calvin or hobbes, and has not listed the words seeks or euphoria.

I thought that, because euphoria was followed by a full stop and a closing double quote, AntProfiler might have had trouble recognising it, but then I couldn't figure out why this didn't cause AntConc any problems, or why seeks wouldn't be picked up by AntProfiler when it sits right in the middle of a sentence. As for Calvin and Hobbes, I can't hazard a guess.

Once again, thanks so much for your time - greatly appreciated!

David
0010.txt
AntConc AntProfiler results for txt 10.xlsx

Laurence Anthony

Jun 10, 2013, 7:13:36 AM
to ant...@googlegroups.com
Hi David,

I've had a look at your file and I can confirm that AntConc and AntWordProfiler are both working correctly.

There are two issues. First, your file is not encoded in UTF-8, which is required for AntWordProfiler to work correctly with non-ASCII texts. This doesn't matter for the current file, because you are only trying to identify words containing A-Za-z0-9 and '&. However, if you wanted to process the closing double quote mark, you would have a problem. The first quote is the one from ASCII ("), but the second is an unusual double quote, probably from some non-English text (Japanese?). If I copy it here it looks like this (・), which you can see is strange. It should be (").
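Spotting these odd characters by eye is hard, because many editors silently render them as normal quotes. A quick way to audit a file is to list every non-ASCII character with its position. This is an illustrative Python sketch, not part of either tool:

```python
def find_non_ascii(path, encoding="utf-8"):
    """Return (line, column, char) for every non-ASCII character in a text file.

    Undecodable bytes are substituted (errors="replace") so the scan never
    crashes on a file in an unexpected encoding.
    """
    hits = []
    with open(path, encoding=encoding, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ord(ch) > 127:
                    hits.append((lineno, col, ch))
    return hits
```

Running this over a file containing a curly closing quote would report its line, column, and code point, making it easy to find and fix.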

The second issue is the *way* you are using AntWordProfiler. AntWordProfiler relies on counting the tokens in a target file against so-called baseword lists. The default lists included in the software are the most frequent words in English. But, they *do not* contain all the words in the file you are trying to process. So, words like "Calvin" and "Hobbes" don't get counted because they are not in the baseword lists.

To count all the words in your files (with the results matching the output from AntConc), you just need to do the following.

1) Create a word list in AntConc (for your complete corpus) and save it (removing the header information). This will serve as the new baseword list for AntWordProfiler. (For testing purposes, you can just load in the word list for file 0010.txt).
2) Open AntWordProfiler, clear the default baseword lists, and load in your saved word list from AntConc.
3) Open your target file into AntWordProfiler and create a profile making sure the word types option is checked.

If the token definitions in both tools match, the results will also match exactly.
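As a rough sketch of why the counts must line up: once the same token definition is applied to the same text, the type and token totals are fully determined. The following Python illustration is an assumption about how the counting works, not the tools' actual Perl code, and the case-folding choice simply mirrors a typical word-list setting:

```python
import re
from collections import Counter

# The shared user-defined token class from this thread.
TOKEN_RE = re.compile(r"[a-zA-Z0-9'&]+")

def type_token_counts(text):
    """Return (type count, token count) under the shared token definition.

    Lower-casing mimics a case-insensitive word list; adjust if your
    AntConc case option differs (this is an assumption, not the tools' code).
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return len(set(tokens)), len(tokens)

types, tokens = type_token_counts("The cat & the other cat aren't here.")
```

Here the 8 tokens collapse to 6 types because "The"/"the" and the repeated "cat" merge. Any two tools implementing this same definition should agree.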

Try it out and let us know if everything works. I'm sure others will be interested, too.

Laurence.

David King

Jun 10, 2013, 9:35:21 AM
to ant...@googlegroups.com
Hi Laurence

Oh!  Obviously I've not understood what the baseword lists were about - sorry!

OK, so I loaded the AntConc word list (without the headings) as the baseword list (I've attached a screenshot of the profiler and the word list from AntConc). I then chose txt10 and ran it through AntProfiler again using [a-zA-Z'&0-9]+ as my token definition. But AntProfiler still disagreed with AntConc (see the attached Excel file). It's still not picking up the apostrophes, though it does now recognise Calvin, Hobbes, seeks, and euphoria. AntProfiler has also now returned x94 as part of the results, but I can't see that anywhere in the txt itself or in the AntConc baseword list.

When you mentioned that the closing double quote was odd (maybe a Japanese character), I wasn't sure what you meant. The text was copied and pasted from a Word file into Notepad. It was all in English. Sometimes when I was copying and pasting into Notepad, there would be strange little things that didn't get copied consistently, but when that happened I manually corrected them. I don't recall anything odd happening with quotation marks, though. Do you think there might be something similarly odd about the apostrophes themselves?

I'm really grateful for the amount of time and energy you've put into this, thank you again!

All the best
David  
screenshot of antprofiler.docx
antconc_results_txt10_baseword_list.txt
antprofiler txt10 using antconc wordlist as baseword list.xlsx

Laurence Anthony

Jun 10, 2013, 9:54:11 AM
to ant...@googlegroups.com
Dear David,

I think you are running into the classic problem of encodings. Microsoft Word is a horrible program to use for creating text files. It introduces all kinds of non-ASCII characters, with "curly quotes" being the most common example. When you save a Word file as text, it will save the file as something called "Unicode", which really means UTF-16LE. This is horrible to work with. The international standard is UTF-8. Word does allow you to save as UTF-8, but you have to select that save option.
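For cleanup purposes, the "smart" punctuation that Word inserts can be mapped back to plain ASCII. Here is an illustrative Python sketch; the mapping covers only the most common offenders and is not exhaustive:

```python
# Common Word "smart" characters mapped to their plain ASCII equivalents.
SMART = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}

def straighten(text):
    """Replace curly quotes and dashes with their ASCII counterparts."""
    return text.translate(str.maketrans(SMART))
```

After this normalisation, an apostrophe in "isn't" is the ASCII one that the token definition [a-zA-Z0-9'&]+ actually matches.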

I have attached your 0010 text saved as UTF-8, and also included the AntConc generated wordlist for the same file. I have tested these and they definitely produce the same results for AntConc and AntWordProfiler.

First, test these files and make sure you can generate the correct results. (I think you probably just forgot to activate the user-defined token setting in AntWordProfiler earlier, when isn't was being split.) Once you get past this hurdle, see if you can generate the same files on your own. If you can, you can then proceed to clean up your corpus and process the whole thing. (Actually, the cleanup process is something you should probably do anyway.)
0010.txt
wordlist_complete.txt

David King

Jun 12, 2013, 3:50:54 AM
to ant...@googlegroups.com
Hi Laurence

Thanks so much for your time and the advice. This has been a real learning experience for me! I had no idea that I needed to know about encoding. As I think I mentioned, I'm using AntConc for my MA dissertation, and that is what all my queries here have been about, but I also use it at work (I'm a lecturer at University of the Arts London), where we are compiling a corpus of EAP writing specific to the fields of Art & Design. So far we've got about half a million tokens, but the original sources came as .doc, .docx, and .pdf files, and we just saved them all as .txt files. I know none of us working on this project had even the slightest notion of what kind of encoding was being applied. It now makes me wonder just how many oddities are going to pop up once we start data mining it. Not looking forward to that! Oh well, live and learn!

Thanks again for everything - you've been of more assistance than my dissertation supervisor!

Kind regards
David 

Laurence Anthony

Jun 12, 2013, 4:41:11 AM
to ant...@googlegroups.com
Hi David,

Character encodings are a very important consideration when designing a corpus. Basically, the simplest procedure to follow these days is to save all data as UTF-8. The advantage of UTF-8 is that ASCII characters (a-z, A-Z, 0-9, punctuation, etc.) are unchanged, so UTF-8 files often open perfectly well on any system with any software, assuming the text is in English. Even software expecting ASCII will open a UTF-8 file perfectly if the text is just ASCII. For non-English texts, UTF-8 can also encode all the characters correctly, but in that case the software must be told that the file is UTF-8 encoded.

Unfortunately, there are in effect two flavors of UTF-8: UTF-8 with a BOM and UTF-8 without a BOM. BOM means byte order mark, and it's a special (invisible) sequence of non-ASCII bytes saved at the start of a text file that tells software the encoding is UTF-8. The official Unicode policy is basically that a BOM should not be included. However, on most Windows systems, software like Word and Notepad saves the BOM characters by default. If you open such files with tools like Notepad and Word, the software detects the BOM and opens the file as UTF-8, hiding the BOM characters. This is convenient. However, if you use tools that are *not* expecting UTF-8, the tool will often assume the file is plain ASCII and open it as such, and then the BOM characters can suddenly become visible because they are non-ASCII characters. (I think this is the main reason why the official guideline is not to include the BOM in UTF-8.)
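Concretely, the UTF-8 BOM is the three bytes EF BB BF at the very start of the file. A quick illustrative check in Python (not part of either tool):

```python
import codecs

def has_utf8_bom(path):
    """True if the file starts with the UTF-8 byte order mark (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

A tool that opens such a file as plain ASCII would show those three bytes as the stray characters described above.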

So, for your project, the best advice is to save the data as UTF-8 with a BOM. If you do this, it will follow the most common international text data standard (UTF-8), the files will open nicely on Windows systems (because of the BOM), and as more tools adapt to work with UTF-8, your files will open on them, too. You should still tell users what the encoding is. You should also be prepared for some users contacting you to ask why some of the files have a few weird characters at the front. The answer would be that they are opening a UTF-8 + BOM file as ASCII.

Note that as long as you have saved your data in some flavor of Unicode (e.g. UTF-16LE), it can easily be converted automatically to UTF-8 (with or without the BOM). I might write a small script to do this, because so many people experience the problem.
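The kind of script mentioned here can be very short. This is an illustrative Python sketch, not the actual script: it decodes the source file (Python's "utf-16" codec honours the BOM that Word writes) and re-saves it as UTF-8, optionally with a BOM. Adjust the source encoding if your files differ; that default is an assumption about the data.

```python
def convert_to_utf8(src, dst, src_encoding="utf-16", add_bom=False):
    """Re-save a text file as UTF-8, with or without a BOM.

    src_encoding="utf-16" reads BOM-prefixed UTF-16 files such as those
    Word produces; "utf-8-sig" on output writes the UTF-8 BOM.
    """
    with open(src, encoding=src_encoding) as f:
        text = f.read()
    out_encoding = "utf-8-sig" if add_bom else "utf-8"
    with open(dst, "w", encoding=out_encoding) as f:
        f.write(text)
```

Because the text is decoded to Unicode in between, curly quotes and other non-ASCII characters survive the round trip intact.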

There is one other problem. Currently, AntConc makes no assumptions about file encodings and attempts to open them in the encoding set in the global settings. If you try to open a UTF-8 file with the ASCII setting, it will reveal all the non-ASCII characters (including any BOMs) as mojibake (garbled characters). If you set the encoding correctly, it will render the file correctly. Luckily, the default encoding for AntConc 3.3.5 and all future versions is UTF-8, and so it will also read UTF-8 + BOM files without problem.

I should also note that very, very few people understand the problems of encodings and so I am now considering introducing a character encoding guessing algorithm into AntConc (the same as that used in your web browser to guess the encoding of foreign web pages). This will mean that for most cases the user will not need to know the encoding at all because AntConc will guess it correctly. In the few cases where it guesses wrongly, the user will have to go to the global settings and set the encoding explicitly. Of course, for the guessing algorithm to work well, having a BOM at the front of a UTF-8 file is very helpful.

I hope that helps!

By the way, did you manage to solve the problem of matching results in AntConc and AntWordProfiler?

Laurence.