Word frequency list format for determining keyness in Italian


Mark Dobson

Jun 10, 2020, 8:31:20 AM
to AntConc-Discussion
Hi, I would like to be able to rank terms in an Italian corpus by keyness and am trying to understand how I would go about that.

I tried using the BE06_wordlist.txt I found at https://www.laurenceanthony.net/software/antconc/ on an English corpus and managed to get those results ok, but unfortunately for me I don't see a similar ready-made Italian equivalent.

I have found this for Italian, but couldn't see a file that I could use right away with AntConc:

http://www.corpusitaliano.it/it/contents/description.html

I tried using the raw corpus, but I think it must have been much too large. I suppose I would need to process the data into a format that resembles the following, but before embarking on such a project, I'd like to check that I have a good chance of managing it!

#Word Types: 43370
#Word Tokens: 1007769
#Search Hits: 0
1    59163    the   
2    30733    of

Do I need to provide accurate figures for the top three lines (and if so, I'm not sure that I know how to determine them)? Do I need to have all the words in strict frequency order in the format rank(-tab-)frequency(-tab-)word?

Many thanks,
Mark

Laurence Anthony

Jun 10, 2020, 7:01:47 PM
to ant...@googlegroups.com
Hi Mark,

The only thing you need is the following:
1    59163    the   
2    30733    of 

The order doesn't matter either. In fact, the 'rank' column is completely ignored.

As you say, though, the corpus you reference is too big to process as raw text in the current version of AntConc, as it's 250 million words. The size of the resulting word list is not actually the problem, and if you had a word list, you could use that. The problem is that AntConc will try to *display* the complete list when you create it from the corpus, and this is likely to use up all your memory. If you can generate the word/freq counts another way, you can load that into AntConc as a reference list and it should be fine.
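For anyone wanting to generate the word/freq counts outside AntConc, here is a minimal Python sketch. It assumes the corpus is available as a single plain-text UTF-8 file, which may not match how the Italian corpus is actually distributed. The tokenizer (runs of letters, lowercased) and the function names are my own illustration, not AntConc's tokenization rules, so the counts will differ from what AntConc would produce.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters (accented ones included), lowercased.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Count tokens line by line so the whole corpus never sits in memory."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def format_wordlist(counts):
    """Return rank<TAB>frequency<TAB>word lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        f"{rank}\t{freq}\t{word}"
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


def build_wordlist(corpus_path, out_path):
    """Read a plain-text corpus and write the word list file."""
    with open(corpus_path, encoding="utf-8") as corpus:
        counts = count_words(corpus)
    # Plain "utf-8" writes no byte order mark.
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        out.write("\n".join(format_wordlist(counts)) + "\n")
```

Calling `build_wordlist("corpus.txt", "italian_wordlist.txt")` (both file names are placeholders) would write one line per word type. Only the counter of word types is held in memory, not the running text.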

I hope that helps!

Laurence.

 
###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


--
You received this message because you are subscribed to the Google Groups "AntConc-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antconc/2daa3f9e-1a44-4d30-80b6-7d66ca042fcdo%40googlegroups.com.

Mark Dobson

Jun 11, 2020, 4:05:01 AM
to AntConc-Discussion
Hi,

Thank you. Yes, that makes me feel much more optimistic about managing it!

So, just to be absolutely sure, I can skip the # lines entirely, rather than putting in dummy figures?
And if it ignores the first column I guess I don't need to worry about using unique identifiers either, so that's easier.

Colgo l'occasione (as Italians often do at the ends of messages!) to say thank you for putting all this software together and offering it as freeware, certainly not a given these days!

Mark

Laurence Anthony

Jun 11, 2020, 5:20:26 AM
to ant...@googlegroups.com
Hi Mark,

The # lines are just comments. They are completely ignored by the system.
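To make the format concrete, here is a small Python sketch of a reader that behaves the way described in this thread: `#` lines are skipped and the rank column is read but not used. This is only an illustration of the file layout, not AntConc's actual loading code, and `parse_wordlist` is a name made up for the example.

```python
def parse_wordlist(lines):
    """Return {word: frequency} from rank<TAB>frequency<TAB>word lines."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment header
        # Split on any whitespace; the rank value is read but never used.
        _rank, freq, word = line.split(None, 2)
        freqs[word] = int(freq)
    return freqs
```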

>And if it ignores the first column I guess I don't need to worry about using unique identifiers either, so that's easier.

That's true, but I would recommend you use something like Excel and just add a column and put a row id in there.

Prego!

Laurence.


Mark Dobson

Jun 11, 2020, 5:22:15 AM
to AntConc-Discussion
Brilliant, thank you.

Mark Dobson

Jun 24, 2020, 6:45:05 AM
to AntConc-Discussion
Hello again,

I'm not sure what's going wrong for me. I tend to get this error message:

---------------------------
Message
---------------------------
One or more of the rank column values is not a number.
First error on line: 0
Value: 1
---------------------------
OK  
---------------------------

But the file looks to me to be in the right format. In fact, I think I managed to get it to work once, but failed every time afterwards.

My first ten rows look like this:

1    di    16520744   
2    il    15591226   
3    in    6667272   
4    essere    5972216   
5    e    5550860   
6    che    3354704   
7    al    3296016   
8    da    3254354   
9    a    3167841   
10    un    2442451   

The file is in UTF-8 format, with tabs between the columns and also at the ends of lines, as in the BE06_wordlist.txt file that I tested before.

Laurence Anthony

Jun 24, 2020, 7:23:06 AM
to ant...@googlegroups.com
Hi,

Can you check that your file is saved as UTF-8 (and not UTF-8 with BOM)? Notepad++ is the best editor on Windows to do this.
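If Notepad++ is not to hand, the same check can be sketched in Python. A UTF-8 byte order mark is three bytes at the very start of the file, so it sits directly in front of the first rank value and would presumably stop that value from being read as a plain number. The helper names below are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data
```

To use it on a file, read the bytes with `open(path, "rb").read()`, pass them through `strip_utf8_bom`, and write the result back in binary mode. Python's `utf-8-sig` codec does the same thing when reading text.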

Laurence.




Mark Dobson

Jun 24, 2020, 8:10:55 AM
to AntConc-Discussion
As far as I can make out, yes. It came out from Excel with tabs as ANSI so I had to convert it, and I used the UTF-8 without BOM option. Opening it again in Notepad++ it tells me that the encoding is UTF-8 at the bottom right.

Laurence Anthony

Jun 24, 2020, 12:43:51 PM
to ant...@googlegroups.com
Very odd. Can you send me the file?

Laurence.


Mark Dobson

Jun 25, 2020, 3:04:12 AM
to AntConc-Discussion

Mark Dobson

Jun 25, 2020, 3:09:01 AM
to AntConc-Discussion

Laurence Anthony

Jun 25, 2020, 3:11:05 AM
to ant...@googlegroups.com
Hi,

I got the file. It seems that you have rank, word, frequency ordering instead of rank, frequency, word.
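A file in that state can be repaired without retyping anything. The Python sketch below swaps the second and third columns of each line. It assumes tab-separated columns with an optional trailing tab, as described earlier in the thread, and `swap_word_and_frequency` is just an illustrative name.

```python
def swap_word_and_frequency(line):
    """Turn rank<TAB>word<TAB>frequency into rank<TAB>frequency<TAB>word."""
    stripped = line.rstrip("\r\n")
    if not stripped.strip() or stripped.startswith("#"):
        return stripped  # leave blank and comment lines alone
    # Drop any trailing tab or space, then split into exactly three columns.
    rank, word, freq = stripped.rstrip("\t ").split("\t")
    int(freq)  # raises ValueError if the last column is not a number
    return f"{rank}\t{freq}\t{word}"
```

Applying it to every line of the file and writing the results out as UTF-8 without a BOM should give the rank, frequency, word ordering. The `int(freq)` check makes the function fail loudly on a line that is already in the right order, rather than silently swapping it back.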

Laurence.


Mark Dobson

Jun 25, 2020, 3:43:13 AM
to AntConc-Discussion
How embarrassing to have stumbled over something so basic. Problem abundantly solved, and thank you again.