Hello, I am doing a corpus linguistics study for my dissertation. I have a corpus of about 4 million words from a specific discipline. It is a collection of scholarly research articles from the field of interdisciplinary evaluation.
I am trying to figure out the best reference corpus. There are several key issues to contend with here:
1: I am looking for what distinguishes this scholarly literature from other scholarly literatures, so I am thinking the reference corpus should be constructed from other scholarly sources. Right now I'm using COCA as the reference, and as a result my keyword analysis is surfacing terms that are used across academia rather than terms specific to my field of interest.
2: My corpus is international in nature. 60% of the articles are from UK publishers and 30% are from the US. The remaining 10% are from Africa, Canada, and Australasia. This raises questions about the international makeup of a reference corpus, as well as how to handle lemmatization.
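To make issue 1 concrete, here is a minimal sketch of the kind of keyness calculation I mean, assuming the usual log-likelihood measure for comparing a word's frequency in a study corpus against a reference corpus. The function names and all of the counts are invented purely for illustration; they are not figures from my data or from COCA.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a, b: frequency of the word in the study / reference corpus
    c, d: total token counts of the study / reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def positive_keywords(study, reference):
    """Words relatively more frequent in the study corpus, strongest first."""
    c, d = sum(study.values()), sum(reference.values())
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only words over-represented in the study corpus
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored]

# Toy counts (hypothetical): "hypothesis" is common to all academic writing,
# "evaluation" is specific to the field.
study = {"evaluation": 50, "hypothesis": 20, "the": 930}
general_ref = {"evaluation": 5, "hypothesis": 2, "the": 993}
academic_ref = {"evaluation": 5, "hypothesis": 20, "the": 975}

print(positive_keywords(study, general_ref))
print(positive_keywords(study, academic_ref))
```

With a general reference, the generic academic word comes out as key alongside the field-specific one; with an academic reference, only the field-specific word remains. That is the effect I am hoping a better reference corpus would give me.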
Given these issues listed, I am thinking my best options are as follows:
1: Find an international corpus of scholarly literature that I can use. I have followed up through a couple of channels, including Cambridge's Academic English Corpus, but have not heard back. Does anyone know of any other corpus that might fit the bill?
2: Build my own corpus. I could use something like AntCorGen to do this, but I would rather not if I don't have to. If I do take that route, does anyone have suggestions for how to build a reference corpus with the properties I'm after (scholarly and international)?
Lastly, what do I do about the lemma issue when working with international data? For instance, do I treat "programme" and "program" as belonging to the same word family? Is there an international lemma list in existence?
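In case it helps frame the question, one workaround I can imagine is folding spelling variants onto a single form before counting or lemmatizing. This is only a sketch: the variant table below is a hand-made placeholder, not an existing resource, and I realize a blanket mapping could wrongly merge cases where the two spellings carry different senses.

```python
from collections import Counter

# Hypothetical, hand-made table mapping UK spellings to US ones.
# A real study would need a much fuller list.
SPELLING_VARIANTS = {
    "programme": "program",
    "programmes": "programs",
    "organisation": "organization",
    "behaviour": "behavior",
}

def normalize(tokens, variants=SPELLING_VARIANTS):
    """Lowercase each token and map known spelling variants to one form."""
    return [variants.get(t.lower(), t.lower()) for t in tokens]

tokens = "The programme and the program share one organisation".split()
counts = Counter(normalize(tokens))
print(counts["program"])  # both spellings are now counted together
```

The same normalization would have to be applied to the reference corpus, otherwise the spelling difference itself would show up as keyness.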
Thanks for any help,
Aaron