How to create a word list that will be accepted by AntConc?

Theresa Hochhold

Mar 7, 2018, 2:01:48 PM
to AntConc-Discussion
Hi everybody!

I am writing my master's thesis on physiotherapy master's abstracts and have collected a corpus of such abstracts. To evaluate keyness etc., I wanted to use AntConc. After I realized that it is expensive to get the data from COCA, I looked for alternatives. I found the 5k lemma list from Word frequency data (https://www.wordfrequency.info/free.asp). I copied the data into Excel and rearranged the columns into the order rank, frequency, word. Then I copied the data out of the table into Word, and from there into WordPad, to get a .txt file. After that I opened it in Notepad++ and saved it with the UTF-8-BOM encoding selected. In the end, I did get some results, but I would say weird ones... see attached files.
Then I thought I would use the word list suggested by Mr. Anthony on his homepage, so I took the written part of the BNC. However, I got another error message:
One or more of the rank column values is not a number. First error on line: 0 Value: #word types: 334660
Finally, I tried the AntFileConverter on my data with the same weird results.

I'm quite confused by now...
Any ideas?

Best regards,
Theresa

Weird results.docx

Laurence Anthony

Mar 7, 2018, 9:31:53 PM
to ant...@googlegroups.com
Hi Theresa,

I just tested what you did and had no problem at all.

I used AntConc 3.5.2 (the latest version) and the BNC_WRITTEN_wordlist.txt file in the BNC zip.

Everything worked as expected.

I suggest the following:

1) Create a new folder on your desktop.
2) Place AntConc 3.5.2 and the BNC_WRITTEN_wordlist.txt file in that folder. (This guarantees that you are not accidentally including an old settings file.)
3) Launch AntConc and try to load the BNC_WRITTEN_wordlist.txt as a keyword list.

If you get problems at this point, report back here.

Note also that you never need the BOM with UTF-8 files in AntConc.
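
As a minimal sketch of the point about the BOM: if a file has already been saved as UTF-8-BOM (for example from Notepad++), the three BOM bytes at the start can be dropped with a few lines of Python. The names `remove_bom` and `resave_without_bom` are hypothetical helpers for illustration, not part of AntConc.

```python
import codecs
from pathlib import Path


def remove_bom(data: bytes) -> bytes:
    """Return the bytes unchanged, except that a leading UTF-8 BOM is dropped."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def resave_without_bom(src: str, dst: str) -> None:
    """Copy the file at src to dst, dropping a leading UTF-8 BOM if present."""
    Path(dst).write_bytes(remove_bom(Path(src).read_bytes()))
```

Working on raw bytes keeps the rest of the file exactly as it was; only the marker at the very start is touched.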

Laurence.



###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################

--
You received this message because you are subscribed to the Google Groups "AntConc-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+unsubscribe@googlegroups.com.
To post to this group, send email to ant...@googlegroups.com.
Visit this group at https://groups.google.com/group/antconc.
For more options, visit https://groups.google.com/d/optout.

Theresa Hochhold

Mar 8, 2018, 2:31:33 AM
to AntConc-Discussion
Hi Anthony,

I did as you suggested and I have no idea what went wrong before, but this time it worked with the BNC list from your homepage. Thanks a lot!!! However, with the list I created from the 5k lemma list it is still not working. I guess that has to do with the fact that it does not have the additional information at the beginning? (#Word Types: 334660, #Word Tokens: 85887272, #Search Hits: 0)

Best regards,
Theresa

Laurence Anthony

Mar 8, 2018, 2:42:29 AM
to ant...@googlegroups.com
Hi Theresa,

The information at the beginning of the list is ignored. I suspect the problem is that the format is wrong or that the encoding is incorrect. 
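
One way to look for a format problem is to check each line of the list directly. The sketch below assumes the layout mentioned earlier in the thread (rank, frequency, word), assumes the columns are separated by tabs, and skips header lines that begin with `#`. The tab separator and the helper name `find_format_problems` are assumptions made for illustration; they are not taken from the AntConc documentation.

```python
def find_format_problems(lines):
    """Return (line_number, message) pairs for lines that break the assumed layout.

    Assumed layout (an assumption, see above): rank<TAB>frequency<TAB>word.
    Blank lines and header lines starting with '#' are skipped.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(
                (number, f"expected 3 tab-separated fields, found {len(fields)}")
            )
            continue
        rank, freq, word = fields
        if not rank.strip().isdecimal():
            problems.append((number, f"rank is not a number: {rank!r}"))
        if not freq.strip().isdecimal():
            problems.append((number, f"frequency is not a number: {freq!r}"))
        if not word.strip():
            problems.append((number, "word column is empty"))
    return problems
```

For a real file, the lines could be passed in with something like `find_format_problems(open("my_list.txt", encoding="utf-8-sig"))`, where `my_list.txt` is a placeholder name; the `utf-8-sig` codec also tolerates a leading BOM.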

Laurence.




Theresa Hochhold

Mar 8, 2018, 4:10:26 AM
to AntConc-Discussion
Hi Laurence,

But is there no way for me to find the problem so that I can use the list?

Theresa

Laurence Anthony

Mar 8, 2018, 5:39:52 AM
to ant...@googlegroups.com
Can you send me your file that is causing problems to my personal email account (listed below)?

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


Theresa Hochhold

Mar 9, 2018, 3:17:34 AM
to AntConc-Discussion
Hello everybody,

Thanks to Laurence's help, I now know what caused my problems - it was not AntConc ;-) If you use the 5k lemma word list from COCA (https://www.wordfrequency.info/free.asp) as a reference word list to find the keywords of your own corpus, you will get weird results. The reason is that the list is tagged (POS tags = part-of-speech tags are added), so the same word can appear several times, once for each part of speech, each with its own frequency. If you want to use the list, you have to add the frequencies of the respective words together.
By the way, does anybody know if such an un-tagged list exists somewhere on the web?
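
Adding the frequencies together can be done in a spreadsheet, but a short script is less error-prone. The following is only a sketch under assumptions: it expects the tagged list as (word, POS tag, frequency) triples, which may not match the column order of the downloaded file, and it writes lines as rank, frequency, word separated by tabs, which is also an assumption. The function names are hypothetical.

```python
from collections import Counter


def merge_pos_entries(entries):
    """Sum the frequencies of entries that share a word, ignoring the POS tag.

    entries: iterable of (word, pos_tag, frequency) triples.
    Returns (rank, frequency, word) tuples, most frequent word first.
    """
    totals = Counter()
    for word, _pos, freq in entries:
        totals[word] += int(freq)
    # Sort by descending frequency; ties are broken alphabetically.
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, freq, word) for rank, (word, freq) in enumerate(ordered, start=1)]


def to_wordlist_lines(merged):
    """Format merged entries as tab-separated rank, frequency, word lines."""
    return [f"{rank}\t{freq}\t{word}" for rank, freq, word in merged]
```

With made-up numbers, the entries `("work", "n", 30)` and `("work", "v", 25)` collapse into a single `work` entry with frequency 55. Ranks are recomputed after merging because the original ranks no longer apply. Words are compared exactly as written, so `Work` and `work` would stay separate unless they are lowercased first.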

Again, thanks to Laurence.
Theresa