Headwords/Grouping difficulties


ruyang li

Feb 11, 2022, 3:23:43 AM
to AntConc-Discussion
Hi Dr. Anthony,

Thank you so much for continuing to update AntConc. It's very useful, and I like it a lot!

I am a college student in Shanghai who is teaching myself corpus studies. When using AntConc, I have had difficulties handling my corpora, especially with the headword/grouping list. I would be very grateful if you could take a look at my steps.

The difficulty is that I cannot combine a lexeme and its variants into one entry. For instance, in my word list, words like "have", "had", and "has" are listed separately. In previous versions, we could load a lemma list to do this. However, there are some changes in version 4.0, and my attempt failed.

Materials: four corpus txt files. The content is collected from Corporate Social Responsibility reports in English.

My steps are:
1. I combined my four corpus txt files into one txt file and ran it through TagAnt, choosing "Word+pos+lemma" as the Display information, which gave me a tagged txt file;
WechatIMG14.jpeg
2. I opened AntConc 4.0.3, opened the Corpus Manager, and clicked raw files in Target Corpus;
3. Under Corpus files, I added the four corpus txt files, chose "simple_word_pos_headword_indexer" as the Indexer, and set the Encoding to UTF-8;
4. Under Headword/Grouping List, I added the tagged file;
WechatIMG17.jpeg
5. I created the corpus.

Then I went to the "Word" tool and clicked Start, but words like "are", "was", and "were" were still listed separately in the word list.

WechatIMG19.jpeg

So I think there might be a mistake somewhere in my process. Could you please help me find it? Thanks again!

Warm Regards,
Ruyang

Laurence Anthony

Feb 11, 2022, 4:09:17 AM
to ant...@googlegroups.com
Hi Ruyang,

All your steps seem fine. The only thing you need to do now is to choose the "Headwords" option in the Word tool list preferences to group the words by headword. In the settings, you'll also see an option to show the headword column, which is useful for regular word lists.

Let me know if that addresses the problem you have.

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



ruyang li

Feb 11, 2022, 6:02:01 AM
to AntConc-Discussion
Many thanks! I think the problem is partly solved!

I did the following steps:
1. I went into "Tool Settings", found "Word", and changed the "List Type" to Headword list;
WechatIMG2036.jpeg
2. I returned to the "Word" interface, and the result was this:
1644576771901.jpg

We can see that the frequencies of the BE forms "is", "are", and "were" are still counted separately, whereas "be" and "was" are counted together. Is this how it is supposed to be? Or might there be a mistake in my process?

Thanks.

Laurence Anthony

Feb 11, 2022, 11:08:10 AM
to ant...@googlegroups.com
Hi,

First, in the tool settings, turn on the "Show Headwords" option so you can see what is being grouped under what headword. Also, in the global settings, turn on the "Show POS values" so you can see what POS values are being used. My guess is that the words being grouped are different parts of speech.

Laurence.



ruyang li

Feb 11, 2022, 11:31:35 PM
to AntConc-Discussion
Hi Dr. Anthony,
Thanks for your advice. I've turned on "Show Headwords" in the Tool Settings and also "Show Tags (if available)" in the Global Settings. However, the result is this:
WechatIMG2038.jpeg

The POS line is empty. What might be the reason for this? Is it because of a mistake in the TagAnt step, or something else?

Many thanks for helping and explaining all these things to me.

Warm regards,
Ruyang

Laurence Anthony

Feb 12, 2022, 1:52:22 AM
to ant...@googlegroups.com
Yes, it seems that you haven't tagged your data properly. You need to have the following format for each token.

WORD_POS_HEADWORD

Laurence.
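
As an illustration of the WORD_POS_HEADWORD format mentioned above, here is a minimal Python sketch of how tokens in that shape can be split and counted by headword. It assumes tokens are separated by whitespace and that the three fields are joined with underscores; the function names and the sample line are made up for this example and are not part of AntConc or TagAnt.

```python
from collections import Counter, defaultdict


def parse_token(token):
    """Split one WORD_POS_HEADWORD token into (word, pos, headword)."""
    # Split from the right so a word that itself contains "_" stays intact.
    parts = token.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"not in WORD_POS_HEADWORD form: {token!r}")
    word, pos, headword = parts
    return word, pos, headword


def headword_frequencies(tagged_text):
    """Count tokens per headword and record the word forms seen under each."""
    counts = Counter()
    forms = defaultdict(set)
    for token in tagged_text.split():
        word, _pos, headword = parse_token(token)
        counts[headword.lower()] += 1
        forms[headword.lower()].add(word.lower())
    return counts, forms


# Made-up sample line (an assumption, not taken from the corpus in this thread).
sample = "We_PRON_we are_AUX_be sure_ADJ_sure it_PRON_it was_AUX_be fair_ADJ_fair"
counts, forms = headword_frequencies(sample)
print(counts["be"], sorted(forms["be"]))
```

A token that lacks one of the three fields raises a ValueError here, which is one quick way to spot a file that was not tagged in the expected format.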


ruyang li

Feb 12, 2022, 2:38:56 AM
to AntConc-Discussion
Hi Dr. Anthony,

It seems that I am on the right track now... finally, haha. But there is still one problem in my created corpus.
"are" is still separated from the other BE words. I checked the tagged file, in which "are" appears as "are AUX be", which should be correct.

WechatIMG2041.jpeg
WechatIMG2040.jpeg

So what might be the reason for this?

Thanks. Ruyang

Laurence Anthony

Feb 12, 2022, 3:32:16 AM
to ant...@googlegroups.com
Look at the headword... it's "ares".

Laurence.


ruyang li

Feb 12, 2022, 3:54:55 AM
to AntConc-Discussion
Oh, right! I found that this happened because I had added another headword list file to the created corpus. Now the problem has been solved!

I finally understand how to use TagAnt... It makes POS tagging and lemmatization convenient. The changes in version 4.0 are great!

Anyway, many thanks!

Ruyang
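
For anyone who runs into the same "ares" symptom, the check below is a rough Python sketch of how two word-to-headword mappings could be compared to find entries that disagree. It works on plain dictionaries only; both mappings are hypothetical stand-ins, and the sketch does not read AntConc's own headword list file format.

```python
def find_headword_conflicts(primary, extra):
    """Return words that the two mappings assign to different headwords."""
    conflicts = {}
    for word, headword in extra.items():
        if word in primary and primary[word] != headword:
            conflicts[word] = (primary[word], headword)
    return conflicts


# Hypothetical mappings: one derived from the tagged file, one from an extra list.
from_tagged_file = {"is": "be", "are": "be", "was": "be", "were": "be"}
from_extra_list = {"are": "ares", "was": "be"}

conflicts = find_headword_conflicts(from_tagged_file, from_extra_list)
for word, (expected, other) in conflicts.items():
    print(f"{word}: {expected!r} vs {other!r}")
```

With these made-up mappings, only "are" is reported, because it is the only word given two different headwords.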

Laurence Anthony

Feb 12, 2022, 10:34:54 AM
to ant...@googlegroups.com
That's great! I'm glad you could navigate the issues. In the end, I think AntConc 4 makes things simpler than in the past, in the sense that you just load your corpus with the correct indexer and then everything is available as you want. What do you think?

Laurence.


ruyang li

Feb 13, 2022, 9:49:46 AM
to AntConc-Discussion
Yes, indeed! It's simpler :) And I also love the changes to the reference corpus. AME and BE are already there!

Btw, just out of curiosity, in the keywords list, why does "likelihood" take the place of "keyness"? Is there any reason for that?

Ruyang

Laurence Anthony

Feb 13, 2022, 10:47:03 AM
to ant...@googlegroups.com
Hi,

>why does the "likelihood" take the place of "keyness"? 

That's a great question. The term "keyness" is a bit problematic as it describes the interpretation of the measure, but not the measure itself. Traditionally, "keyness" has been measured with a likelihood measure (e.g. chi-squared or log-likelihood), but recently, people have been suggesting an effect size measure would be better. AntConc 4.0 offers both measures, so I need to keep them separate and don't want to conflate them under a single heading. Strictly speaking, raw frequency is also a "keyness" measure, as is "range". So, I've basically started to use the proper terminology. 

Does that make sense?

Laurence.
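
To make the distinction between a likelihood measure and an effect size measure concrete, here is a small Python sketch. It uses the common two-term form of log-likelihood and the log ratio of relative frequencies as the effect size; these are assumptions chosen for illustration, and the formulas and settings AntConc itself applies may differ. The function names and the frequencies are made up.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-term log-likelihood (G2) for one word across two corpora."""
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def log_ratio(freq_target, size_target, freq_ref, size_ref):
    """Binary log of the ratio of relative frequencies (an effect size).

    Both frequencies must be greater than zero in this simple version.
    """
    return math.log2((freq_target / size_target) / (freq_ref / size_ref))


# Made-up counts: the word is twice as frequent in the target corpus in both
# cases, but the second pair of corpora is ten times smaller.
large = (100, 100_000, 50, 100_000)
small = (10, 10_000, 5, 10_000)

for label, args in (("large", large), ("small", small)):
    print(label, round(log_likelihood(*args), 2), round(log_ratio(*args), 2))
```

In this sketch the log ratio is the same for both pairs of corpora, while the log-likelihood value grows with the amount of data. That difference is one reason for keeping the two kinds of measure under separate headings.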

