Searching for derivational nominalisations in AntConc


James Comben

Jun 11, 2024, 10:18:02 AM
to AntConc-Discussion
Hello, thanks for admitting me to the discussion forum.

I would like to ask a question about my methodology in a study I am conducting, where I have a corpus made from reading comprehension examination questions. I have 'cleaned' the text (removing questions in the examinations that were not overtly testing expository text comprehension), converted the files from doc to UTF-8 plain text, and split the files up according to the year in which the text was used and the subject. I created my corpus from these files (200 or so) and am now ready to start an analysis.

I want to see how derivational nominalisation has trended over time and whether or not the frequency of this kind of nominalisation is reflected in more general corpora (or the 'academic' sub-corpus in COCA).
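
One way to make per-year counts comparable, both across years of different sizes and against a reference corpus such as COCA, is to normalise raw hits to a rate per million tokens. A minimal Python sketch follows; the function name and all figures are hypothetical and only illustrate the arithmetic, they are not taken from the corpus described here.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Normalised frequency: hits per one million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits / corpus_tokens * 1_000_000


# Hypothetical counts: year -> (nominalisation hits, total tokens that year)
counts_by_year = {
    2000: (150, 12_000),
    2010: (240, 15_000),
}

rates = {
    year: per_million(hits, total)
    for year, (hits, total) in counts_by_year.items()
}

for year in sorted(rates):
    print(year, round(rates[year], 1))
```

The same function can be applied to counts from a reference corpus, so that both sets of figures are on one scale before any comparison is made.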

To search for the nominalisations I was planning on using a selection of productive nominalising derivational suffixes (-ion, -ment, -ity, etc), then running an advanced search in the 'word' tab. I add my wildcard+suffixes to the search query list so it looks something like this:

*ty
*ties
*ance
*ances
*tion
*tions
*ence
*ences
*ency
*encies

This is only a small sample of the list I was intending to use, but as you can imagine, the list of hits I get is very large and a great many of the results are not examples of derivational nominalisation (nation, cities, majority, etc.).
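
The over-matching described above can be reproduced outside AntConc with a rough Python sketch. It assumes that the `*` wildcard stands for zero or more letters, which only approximates AntConc's behaviour; the sample sentence and the function name are invented for illustration.

```python
import re
from collections import Counter

# Suffixes from the search list above; "*tion" is read as "any letters, then tion".
SUFFIXES = ["ty", "ties", "ance", "ances", "tion", "tions",
            "ence", "ences", "ency", "encies"]
SUFFIX_RE = re.compile(r"[a-z]*(?:" + "|".join(SUFFIXES) + r")", re.IGNORECASE)


def suffix_hits(text: str) -> Counter:
    """Count lower-cased word forms that end in one of the suffixes."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(w.lower() for w in words if SUFFIX_RE.fullmatch(w))


sample = "The nation values the formation of cities and the majority view."
hits = suffix_hits(sample)
print(hits)
```

All four matches (nation, formation, cities, majority) share a listed ending, but only "formation" is built from a base plus a nominalising suffix, so the suffix alone cannot separate the cases.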

I am very much a novice with corpora and AntConc, and there might be a neater way to go about this. I was considering running my data through the CLAWS tagging facility, so that I could at least limit my search to return nouns only (although many of these would still not be the right 'kind' of noun). Is this feasible? Alternatively, I could order the results by frequency and go through them one by one until I have a set of, say, 50 derived nouns. This would allow me to confirm which suffixes are most productive and would give me a word list that I could then use against a reference corpus. The larger the list of derived nominalisations I can generate, the more valid I think any conclusions drawn from my analysis would be, but I don't have all that much time to pick my way through it all.

I am not even sure if looking for the most frequent nominalisations is the right way for me to go anyway, because these tokens are more likely to be familiar to second language learners and I am interested in how nominalisations in academic texts make comprehension more challenging.

Sorry for the long post and thanks for reading. If anyone has any suggestions or tips for me, I would really appreciate it.

James

Laurence Anthony

Jun 11, 2024, 10:35:58 AM
to ant...@googlegroups.com
Hi James,

I would say that using a tagged corpus is going to help you refine your searches. Have you tried using my TagAnt tool? The tagger in that is state-of-the-art, and you can also choose to download even bigger (more advanced) models from the model library inside TagAnt.

Once you load your tagged corpus into AntConc, I think you'll find searching for what you want much easier.

(Just remember to use the simple_word_pos_headword indexer in the Corpus Manager when you load your data).

I hope that helps.

Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################



James Comben

Jun 25, 2024, 3:22:45 AM
to ant...@googlegroups.com
Sorry, may I add one more question?

If I have a tagged corpus I can search for parts of speech like the following:

*_NN (to find all tokens tagged as a noun)

If I want to find all nouns ending in a derivational suffix, such as *ion, how might I combine these two search elements?

Thank you,

James


James Comben

Jun 25, 2024, 3:22:45 AM
to ant...@googlegroups.com
Hi Laurence,

Thank you very much for your help with this.

I have a couple more questions, if you have a moment?

1. I downloaded the largest English model in the model manager in TagAnt. What are the differences between this larger model and the smaller one that comes pre-installed? Is there information about this I can read up on? (This is for an academic paper and I should mention it in my methodology.)
2. After tagging all of the files, I loaded them into AntConc and selected the simple_word_pos_headword indexer as suggested. But doing this then 'greys out' the 'show token definition settings' button. If I create the corpus and run a 'word' search, it returns punctuation as below:

image.png

Presumably instances of full stops, brackets, and so on will contribute to the overall token count if they are appearing in this list? My files also contain some Japanese characters which I do not want included in my analyses.

Is there a way to make use of the tagging functionality, while retaining the option to manually define the token settings?
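
Independently of AntConc's settings, the effect of punctuation and Japanese characters on the token count can be checked with a small Python sketch. It does not reproduce AntConc's token definition; the helper name, the character ranges chosen (hiragana, katakana, and the main CJK ideograph block), and the sample tokens are assumptions made for illustration.

```python
import re

# Unicode blocks for hiragana, katakana, and the main CJK ideograph range.
JAPANESE_RE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def is_countable_word(token: str) -> bool:
    """True if the token has at least one letter and no Japanese characters."""
    if JAPANESE_RE.search(token):
        return False
    return any(ch.isalpha() for ch in token)


# Hypothetical token list, as it might look before filtering.
tokens = ["Nominalisation", ".", "(", "increases", "問題", "density", ")", "2024"]
words = [t for t in tokens if is_countable_word(t)]
print(len(tokens), len(words))
```

Comparing the two lengths gives a rough idea of how much the unfiltered items would inflate the overall token count.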

Thank you very much,

James


James Comben

Jun 25, 2024, 3:22:46 AM
to ant...@googlegroups.com
Sorry again. I think I got it:

*ion_NN
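
What a combined query like this matches can be sketched in Python. The sketch assumes tokens in the two-part `word_TAG` form used in the example above and Penn Treebank-style tags, in which plural nouns are tagged `NNS`; the actual tagset and token layout depend on the TagAnt settings. The helper `wildcard_to_regex` is hypothetical and treats `*` as any run of non-space characters.

```python
import re


def wildcard_to_regex(query: str) -> re.Pattern:
    """Turn a query such as '*ion_NN' into a regex; '*' matches any non-space run."""
    parts = [re.escape(part) for part in query.split("*")]
    return re.compile(r"\S*".join(parts))


# Hypothetical tagged sentence in word_TAG form.
tagged = "The_DT formation_NN of_IN nations_NNS shapes_VBZ opinion_NN ._."
tokens = tagged.split()

singular = [t for t in tokens if wildcard_to_regex("*ion_NN").fullmatch(t)]
plural = [t for t in tokens if wildcard_to_regex("*ions_NNS").fullmatch(t)]
print(singular, plural)
```

Under these assumptions the singular query does not pick up plural forms, so each suffix may need a second query for its plural (for example `*ions_NNS`).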

