Hello, thanks for admitting me to the discussion forum.
I would like to ask a question about the methodology of a study I am conducting, for which I have built a corpus from reading comprehension examination questions. I have 'cleaned' the text (removing examination questions that were not overtly testing expository text comprehension), converted the files from .doc to UTF-8 plain text, and split them up by subject and by the year in which the text was used. I created my corpus from these files (200 or so) and am now ready to start an analysis.
I want to see how derivational nominalisation has trended over time and whether or not the frequency of this kind of nominalisation is reflected in more general corpora (or the 'academic' sub-corpus in COCA).
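Since my yearly sub-corpora and the reference corpora differ a lot in size, I assume I would need to compare normalised rather than raw frequencies. A minimal sketch of what I have in mind, in Python; the function names and the figures in the usage note are made up for illustration and are not from my data:

```python
def per_thousand(hits, total_tokens):
    """Normalised frequency: hits per 1,000 running words."""
    if total_tokens == 0:
        return 0.0
    return 1000 * hits / total_tokens


def trend_by_year(counts_by_year):
    """Map each year to its normalised frequency, in chronological order.

    counts_by_year maps year -> (nominalisation_hits, total_tokens).
    """
    return {
        year: per_thousand(hits, total)
        for year, (hits, total) in sorted(counts_by_year.items())
    }
```

With invented counts, `trend_by_year({2010: (30, 6000), 2005: (12, 4000)})` would give 3.0 per thousand words for 2005 and 5.0 for 2010, and the same per-thousand figure could be computed for a reference corpus.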
To search for the nominalisations, I was planning to use a selection of productive nominalising derivational suffixes (-ion, -ment, -ity, etc.) and run an advanced search in the 'word' tab. I add my wildcard + suffix patterns to the search query list, so it looks something like this:
*ty
*ties
*ance
*ances
*tion
*tions
*ence
*ences
*ency
*encies
This is only a small sample of the list I was intending to use, but as you can imagine, the list of hits I get is very large and there are a great deal of results that are not examples of derivational nominalisation (nation, cities, majority, etc).
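For anyone who prefers to see it outside AntConc, the same wildcard search could be sketched in Python roughly as below. This is only an illustration of the idea, not my actual procedure: the tokenisation is deliberately crude (lowercase alphabetic strings only), and the exclusion list is a hypothetical, hand-built set that would have to grow as false hits turn up.

```python
import re
from collections import Counter

# Suffixes mirroring the wildcard queries above (singular and plural forms).
SUFFIXES = ["ty", "ties", "ance", "ances", "tion", "tions",
            "ence", "ences", "ency", "encies"]

# Hypothetical, hand-built list of false hits to discard.
EXCLUDE = {"nation", "nations", "city", "cities", "majority"}

# A token matches if it is at least one letter followed by one of the suffixes.
SUFFIX_RE = re.compile(r"^[a-z]+(?:" + "|".join(SUFFIXES) + r")$")


def candidate_counts(text):
    """Count tokens that end in a listed suffix and are not excluded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(
        t for t in tokens if SUFFIX_RE.match(t) and t not in EXCLUDE
    )
```

Calling `candidate_counts(text).most_common()` would then give the candidates ordered by frequency, which still leaves the manual job of deciding which ones are genuinely derived.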
I am very much a novice with corpora and AntConc, and there might be a neater way to go about this. I was considering running my data through the CLAWS tagging facility, so that I could at least limit my search to nouns (although many of these would still not be the right 'kind' of noun). Is this feasible? Alternatively, I could order the results by frequency and go through them one by one until I have a set of, say, 50 derived nouns. This would allow me to confirm which suffixes are most productive and would give me a word list that I could then run against a reference corpus. The larger the list of derived nominalisations I can generate, the more valid I think any conclusions drawn from my analysis would be, but I don't have all that much time to pick my way through it all.
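To make the tagging idea concrete, here is a rough Python sketch of how I imagine filtering tagged text. It rests on two assumptions that I have not verified against my own data: that the tagger output is in the horizontal word_TAG format (e.g. government_NN1), and that common-noun tags begin with NN. The function name is my own invention.

```python
from collections import Counter


def noun_suffix_counts(tagged_text, suffixes):
    """Count noun-tagged tokens that end in one of the given suffixes.

    Assumes whitespace-separated word_TAG items and that common-noun
    tags start with "NN" (an assumption about the tagset in use).
    """
    counts = Counter()
    for item in tagged_text.split():
        word, sep, tag = item.rpartition("_")
        if not sep:
            continue  # skip anything that is not in word_TAG form
        word = word.lower()
        if tag.startswith("NN") and word.endswith(tuple(suffixes)):
            counts[word] += 1
    return counts
```

On a made-up line such as `The_AT0 government_NN1 will_VM0 implement_VVI the_AT0 agreement_NN1 ._PUN` with the suffix `ment`, this would keep "government" and "agreement" but drop the verb "implement". It would not, of course, separate derived nouns from underived ones.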
I am not even sure that looking for the most frequent nominalisations is the right way for me to go anyway, because these tokens are more likely to be familiar to second language learners, and I am interested in how nominalisations in academic texts make comprehension more challenging.
Sorry for the long post and thanks for reading. If anyone has any suggestions or tips for me, I would really appreciate it.
James