Hi Annalena,
This is a great question. The detailed answer is written in the help page that you can find under the Help menu. Here's the most important part:
[To understand the various formats required, load one of the existing pre-built corpora, generate a word list, and then save the database tables via the File menu "Save Current Tab Datatables" option. In the saved .zip file, you will find examples of the table formats that you need to use. Alternatively, open the corpus database in an SQLite database reader (e.g., https://sqlitebrowser.org/) and view the different tables. An SQLite database reader should allow you to export the tables as CSV/TSV files that you can use as templates for your own wordlist corpus creation.]
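If you would rather script the export than use a graphical SQLite reader, something like the following Python sketch might work. It uses only the standard library and reads the table names from the database itself, so it assumes nothing about what the tables are called. The function name export_tables_to_tsv and the paths in the usage comment are placeholders of mine, not part of AntConc.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_tsv(db_path, out_dir):
    """Write every table in an SQLite database to its own .tsv file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    conn = sqlite3.connect(db_path)
    try:
        # Ask SQLite which tables exist; no table names are assumed here.
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name")]
        for table in tables:
            # Quote the identifier in case a table name contains odd characters.
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted}")
            target = out_dir / f"{table}.tsv"
            with open(target, "w", newline="", encoding="utf-8") as handle:
                writer = csv.writer(handle, delimiter="\t")
                # Header row taken from the column names of the table.
                writer.writerow([col[0] for col in cursor.description])
                writer.writerows(cursor)
            written.append(target)
    finally:
        conn.close()
    return written


# Hypothetical usage (replace both paths with your own):
# export_tables_to_tsv("path/to/corpus_database", "exported_tables")
```

The resulting .tsv files can then be opened in a spreadsheet or text editor and used as templates in the same way as the files from the "Save Current Tab Datatables" option.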
Here is some more background information on why AntConc 4 adopts the new three-file format:
AntConc 4 now includes new statistics that can use range information about words in the corpus (i.e., the number of files that each word is found in). Traditional word lists have usually not contained this information. Also, in order to use a word list properly as a reference list, it is important to know information about the corpus from which it was created (e.g., the total number of words and the total number of files it contains). Again, many word lists don't contain this information. So, AntConc 4 now uses a universal file format (called MSTV) for information about the corpus, the word list (as a whole), and the words in the word list. If you create a simple corpus (or use one of the pre-built ones), generate a word list, and export it, you will get the three files that you need. Just open these and edit the information accordingly. Once you load the three files into AntConc 4, it will generate a single-file corpus that you can use as a reference corpus from then on. There will be no need to use those three files anymore.
As for the COCA lists, I see that only the top 5000 words can be downloaded. Is it clear what the total corpus size and the total number of files in the corpus are? If so, it should be very easy to convert the list into the correct format. Also, I see the following: "If you re-post the list on the web, you must clearly indicate www.wordfrequency.info as the source of the data." When you create the tables, you can add this information to the corpus information, which means that you should then be able to distribute the word list corpus to others.
If you need more help doing the above, please let me know.
Can I also ask why you would want to use a word list that is clearly broken (in the sense that the frequencies of only a few of the words in the corpus are provided)?
Laurence.
###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################