Creating reference corpus from a word list

Annalena

Apr 22, 2022, 7:31:55 AM

to AntConc-Discussion

Hello,

I am having trouble creating a reference corpus from a COCA word list that I downloaded from https://www.wordfrequency.info/samples.asp

After downloading the file and reformatting it as a text file, I tried to load it into AntConc as a reference corpus in the form of a word list. However, when I choose the word list option, I am asked to provide three separate files (corpus information, word list information, and word list data file). My problem is that I only have one of these files (the word list data file). Do I need to create the other two files? If yes, what do they need to contain? Or do I need to do something completely different with that word list?

I searched the help section but I couldn't find any information that would answer my question (or maybe I didn't see it).

I would be very thankful if anyone could point me in the right direction.

Best,

Annalena

Laurence Anthony

Apr 22, 2022, 9:51:43 AM

to ant...@googlegroups.com

Hi Annalena,

This is a great question. The detailed answer is written in the help page that you can find under the Help menu. Here's the most important part:

[To understand the various formats required, load one of the existing pre-built corpora, generate a word list, and then save the database tables via the File menu "Save Current Tab Datatables" option. In the saved .zip file, you will find examples of the table formats that you need to use. Alternatively, open the corpus database in an SQLite database reader (e.g., https://sqlitebrowser.org/) and view the different tables. An SQLite database reader should allow you to export the tables as a CSV/TSV files that you can use as templates for your own wordlist corpus creation.]
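As a rough illustration of the SQLite route described in the help text, the sketch below dumps every table of an SQLite database to its own TSV file using only the Python standard library. The function name `export_tables_as_tsv` and the paths passed to it are hypothetical. Nothing here assumes particular AntConc table or column names; the script simply reports whatever it finds in the database it is given.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_as_tsv(db_path, out_dir):
    """Write every user table in an SQLite database to <out_dir>/<table>.tsv.

    Returns a dict mapping each table name to its list of column names.
    Note: sqlite3.connect() creates an empty database if db_path does not exist.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    headers = {}
    con = sqlite3.connect(db_path)
    try:
        # List the tables, skipping SQLite's internal bookkeeping tables.
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )]
        for name in names:
            # Quote the identifier so unusual table names still work.
            quoted = '"' + name.replace('"', '""') + '"'
            cur = con.execute(f"SELECT * FROM {quoted}")
            columns = [d[0] for d in cur.description]
            headers[name] = columns
            with open(out / f"{name}.tsv", "w", newline="", encoding="utf-8") as fh:
                writer = csv.writer(fh, delimiter="\t")
                writer.writerow(columns)  # header row shows the expected layout
                writer.writerows(cur)     # one line per database row
    finally:
        con.close()
    return headers
```

Calling `export_tables_as_tsv("my_corpus.db", "templates")` (both names are placeholders) would leave one TSV file per table in the `templates` folder, and the header row of each file shows the column layout to copy.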

Here is some more background information on why AntConc 4 adopts the new three-file format:

AntConc 4 now includes new statistics that can use range information about words in the corpus (i.e. the number of files that each word is found in). Traditional word lists have usually not contained this information. Also, in order to use a word list properly as a reference list, it is important to know information about the corpus from which it was created (e.g. the total number of words and the total number of files it contains). Again, many word lists don't contain this information. So, AntConc 4 now uses a universal file format (called MSTV) for information about the corpus, the word list (as a whole), and the words in the word list. If you create a simple corpus (or use one of the pre-built ones), generate a word list, and export it, you will get the three files that you need. Just open these and edit the information accordingly. Once you load the three files into AntConc 4, it will generate a single-file corpus that you can use as a reference corpus from then on. There will be no need to use those three files anymore.

As for the COCA lists, I see that only the top 5000 words can be downloaded. Do you know the total size of the corpus and the total number of files it contains? If so, it should be very easy to convert the list into the correct format. Also, I see the following: "If you re-post the list on the web, you must clearly indicate www.wordfrequency.info as the source of the data." When you create the tables, it is possible to add this information to the corpus information, which means that you should then be able to distribute the word list corpus to others.
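To make the point about corpus-level totals concrete, here is a minimal sketch that counts types and sums frequencies in a frequency list. It assumes a tab-separated file with the word in the first column, an integer frequency in the second, and one header row; that layout and the function name `summarize_frequency_list` are assumptions for illustration, not the AntConc format. The token total only equals the corpus size if the list is complete, so a truncated list (such as a top-5000 sample) will undercount, and the number of files cannot be recovered from a frequency list at all.

```python
import csv


def summarize_frequency_list(path, delimiter="\t", has_header=True):
    """Count types and sum token frequencies in a word<TAB>frequency file."""
    types = 0
    tokens = 0
    with open(path, newline="", encoding="utf-8") as fh:
        # QUOTE_NONE keeps words that contain quote characters intact.
        reader = csv.reader(fh, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the header row if present
        for row in reader:
            # Ignore blank or incomplete lines rather than failing on them.
            if len(row) < 2 or not row[0].strip():
                continue
            tokens += int(row[1])
            types += 1
    return {"types": types, "tokens": tokens}
```

The two numbers are the kind of values that would then be typed into the edited template files by hand.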

If you need more help doing the above, please let me know.

Can I also ask why you would want to use a word list that is clearly broken (in the sense that the frequencies of only a few of the words in the corpus are provided)?

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


Annalena

Apr 22, 2022, 10:57:41 AM

to AntConc-Discussion

Thank you for the reply, Laurence!

Ok, I think I got it. I'm not sure if I'll be able to try and work it out today. If I have further issues converting the data and creating the files I'll get back to you. Thank you for that!

To answer your last question: it's not so much a question of wanting to use this word list as a question of having to use it. I want to compare data from my own specialised corpus (transcribed spoken data from mostly US American speakers) with data from a corpus of general spoken American English (since I doubt it would make sense to compare my data with the spoken part of the BNC, for example). Currently I don't have access to any other data that would be appropriate. (I am, however, consulting with the chair where I am writing my paper, and there might be a possibility that I can get access to a full word list of COCA.)

Annalena

Laurence Anthony

Apr 22, 2022, 11:08:54 AM

to ant...@googlegroups.com

Hi again Annalena,

AntConc 4 now comes with several built-in corpora. Why not use the AmE06 1 million word corpus of 2006 American English? As a reference corpus for your project, it would probably be fine, especially considering that it contains 52,456 types.

Laurence.


Annalena

Apr 25, 2022, 4:34:34 AM

to AntConc-Discussion

Hello again,

and thank you for your help!

I now have access to another platform and to COCA, where I am able to create full word lists (and also word lists of its subcorpora).

I created the corpus info and word list info files according to the templates. However, when I try to create my reference corpus, I get the following error message:

"The followin corpus info file could not be read.

See the error report below.

'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>"

So, I guess the error is on my part. Yet I can't seem to fix the issue. Do you have any idea what I did wrong? I have attached the lists I created in case that helps.

Annalena

corpus_info_coca spoken 2010.xlsx

wordlist_info_coca spoken 2010.xlsx

Laurence Anthony

Apr 25, 2022, 5:01:58 AM

to ant...@googlegroups.com

Hi Annalena,

The error suggests a character encoding problem. But, looking at the two tables you sent, you seem to have put CSV data into an Excel file.

My guess is that you've opened the original CSV files in Excel, edited them, and then forgotten to save them back as CSV.
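One way to check for this kind of mix-up before loading a file is sketched below. An .xlsx workbook is a zip container rather than plain text, so the standard library's `zipfile.is_zipfile` can flag it; otherwise the sketch tests whether the bytes decode as UTF-8. The function name `describe_file` and its return labels are made up for illustration, and treating UTF-8 as the expected encoding is an assumption here rather than something stated in this thread.

```python
import zipfile


def describe_file(path):
    """Roughly classify a file as 'zip', 'utf-8', or 'not-utf-8'.

    'zip' covers .xlsx workbooks, which are zip containers and cannot be
    read as CSV/TSV text even if the cells hold comma-separated values.
    """
    if zipfile.is_zipfile(path):
        return "zip"
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Plain bytes, but saved in some other encoding (e.g. a legacy
        # Windows code page); re-saving as UTF-8 text may be needed.
        return "not-utf-8"
    return "utf-8"
```

A result of `"zip"` for a file that is supposed to be a table points to a workbook that still needs to be exported as CSV or TSV text from the spreadsheet program.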

Laurence.

