Reference corpora


Feb 9, 2021, 12:01:59 PM
to WordSmith Tools
Hello, I am doing a corpus linguistics study for my dissertation. I have a corpus of about 4 million words from a specific discipline. It is a collection of scholarly research articles from the field of interdisciplinary evaluation.

I am trying to figure out the best reference corpus. There are several key issues to contend with here:

1: I am looking for what distinguishes this scholarly literature from other scholarly literatures, so I am thinking the reference corpus should be constructed from other scholarly sources. Right now I'm using COCA as the reference, and my keyword analysis seems to be surfacing terms that are used across academia rather than terms specific to my field of interest.

2: My corpus is international in nature. 60% of the articles are from UK publishers and 30% are from the US. The remaining 10% are from Africa, Canada, and Australasia. This raises questions about the international makeup of a reference corpus, as well as how to handle lemmatization.

Given these issues listed, I am thinking my best options are as follows:

1: Find an international corpus of scholarly literature that I can use. I have tried a couple of channels, including Cambridge's Academic English Corpus, but have not heard back. Does anyone know of another corpus that might fit the bill?

2: Build my own corpus. I could use something like AntCorGen to do this, but I would rather not if I don't have to. If I do take that route, any suggestions for how to build a reference corpus with the properties I'm looking for?

Lastly, what do I do about the lemma issue when working with international data? For instance, do I treat "programme" and "program" as belonging to the same word family? Is there an international lemma list in existence?

Thanks for any help,


Mike Scott

Feb 9, 2021, 12:30:15 PM
to WordSmith Tools
Aaron, hi

Interesting questions. 

I see the keyword (KW) procedure as excellent for provoking ideas which need chasing up and which you wouldn't otherwise identify easily.

If your reference corpus (RC) is very similar to your study corpus (SC) in lexis (though bigger, maybe better stratified and presumably more representative), the KW procedure should throw up specifics of the topics covered in SC. If RC is scholarly lit but not mainly from the SC's field, you should get lexis of that field too. Besides that, it will (usually) throw up some lexis you cannot predict now, e.g. negatives, time words, or words like "although".
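
To make the SC/RC comparison concrete, here is a minimal sketch of a keyness calculation using log-likelihood (G2), one commonly used keyness statistic. It is an illustration only, not a description of the settings WordSmith applies; the function names, the `min_freq` threshold and the toy counts are all assumptions.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(study, reference, min_freq=3):
    """Return (word, G2) pairs for words relatively more frequent in the
    study corpus, highest G2 first. min_freq is an arbitrary cut-off."""
    c = sum(study.values())
    d = sum(reference.values())
    rows = []
    for word, a in study.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)
        # Keep only positive keywords (over-represented in the study corpus).
        if a / c > b / d:
            rows.append((word, log_likelihood(a, b, c, d)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy counts, invented for illustration only.
study = Counter({"evaluation": 40, "the": 60})
reference = Counter({"evaluation": 1, "the": 99})
for word, score in keywords(study, reference):
    print(word, round(score, 2))
```

Swapping the `reference` counts for a general corpus, an academic corpus, or a corpus from a neighbouring field is what changes which words come out as key.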

You might think it useful to consider all three possibilities as maybe offering avenues for exploration in your dissertation. Often the third type is the best....

I wouldn't lemmatise, personally. And I wouldn't worry about color/colour/programme/program. International ac. English is not usually restricted to UK/US or any other variant and is written by experts from many language backgrounds.
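
For anyone who still wants to check whether spelling variants affect their results, a minimal sketch of folding variants in a frequency list before comparison follows. The mapping is a hypothetical, hand-made example rather than an existing international lemma list, and folding can be lossy (British texts also use "program" for software).

```python
from collections import Counter

# Hypothetical hand-made mapping of variant spellings to one chosen form.
VARIANTS = {
    "programme": "program",
    "programmes": "programs",
    "colour": "color",
}


def fold_variants(counts, variants=VARIANTS):
    """Merge the counts of variant spellings under a single form."""
    folded = Counter()
    for word, n in counts.items():
        folded[variants.get(word, word)] += n
    return folded


counts = Counter({"programme": 3, "program": 2, "colour": 1})
print(fold_variants(counts))
```

Running the keyword comparison once with and once without folding shows whether the variants matter for a given corpus.
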
The BNC has written academic sections. Not sure how relevant they'd be for your needs.
I'd avoid building an RC. You might (?) persuade owners of corpora which are not downloadable to let you have a (free?) plain text word list of them. I have just been handling this need in WS8, having got (paid) access to some general word lists in various languages based on billions of words of text.
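
As an illustration of working from a word list rather than from full texts, here is a minimal sketch that reads a plain text word list into frequency counts. The layout (one `word<TAB>frequency` pair per line, `#` for comments) is an assumption; real word lists from corpus owners or from WordSmith may be laid out differently.

```python
from collections import Counter


def read_word_list(lines):
    """Parse lines of 'word<TAB>frequency' into a Counter.
    Blank lines, '#' comments and malformed lines are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        try:
            # Lower-case so that "The" and "the" are counted together.
            counts[parts[0].lower()] += int(parts[1])
        except ValueError:
            continue
    return counts


sample = ["# word\tfreq", "the\t100", "The\t5", "evaluation\t7", "bad line"]
print(read_word_list(sample))
```

An open file object can be passed directly, e.g. `read_word_list(open("reference_wordlist.txt", encoding="utf-8"))`, where the file name is hypothetical.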

Cheers -- Mike

Feb 9, 2021, 8:41:37 PM
to WordSmith Tools
Thanks, Mike. Very helpful. 

I see that the BNC can be purchased, so I may end up doing that if it can indeed be restricted to just the academic subcorpus.

Feb 9, 2021, 10:33:52 PM
to WordSmith Tools
What do you think about the academic portion of COCA? It's meant to be American English, but as you said ac. English is not standardized and tends to be international anyhow. 