Ultimately, I would like to know how many references there are in the corpus (could be integral or non-integral citations), taking the occurrence of a year of publication in round brackets as an indicator. Having the four digits of the year as hits, I can then look at the complete citation via KWIC, and I could export the rows to another programme to code the type of citation, for instance.
The string 19\d\d\)|\d\d\) gets me many 'good' results, but it also returns hits such as 37) in "p = .37)", 06) in "(p. 2206)" or 1988 in "he counted 1988 instances of",
and it does not find 1986 in "(Fowler 1986: 127-46; Simpson 1990)" or 1936 in "(e.g. 1936, 1967)".
I'm looking for a string that reduces the problem of precision and recall further.
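To make "precision and recall" concrete, a candidate string can be scored against a few hand-labelled examples. The following is only a minimal sketch in Python: the function name precision_recall and the labels are hypothetical (the labels are one reading of the examples above, not output from any tool), and Python's re module may behave differently from the regex engine in a concordancer.

```python
import re

# Hand-labelled examples taken from the post: text -> years that SHOULD be found.
# These labels are an assumption about what counts as a citation year.
labelled = {
    "p = .37)": [],
    "(p. 2206)": [],
    "he counted 1988 instances of": [],
    "(Fowler 1986: 127-46; Simpson 1990)": ["1986", "1990"],
    "(e.g. 1936, 1967)": ["1936", "1967"],
}

def precision_recall(pattern, labelled):
    """Score a pattern (without capturing groups) against the labelled examples."""
    regex = re.compile(pattern)
    true_pos = false_pos = false_neg = 0
    for text, expected in labelled.items():
        # Strip brackets so that "1990)" and "1990" count as the same hit.
        hits = [h.strip("()") for h in regex.findall(text)]
        true_pos += sum(1 for h in hits if h in expected)
        false_pos += sum(1 for h in hits if h not in expected)
        false_neg += sum(1 for e in expected if e not in hits)
    found = true_pos + false_pos
    wanted = true_pos + false_neg
    precision = true_pos / found if found else 0.0
    recall = true_pos / wanted if wanted else 0.0
    return precision, recall

print(precision_recall(r"19\d\d\)|\d\d\)", labelled))
```

Any other candidate string can be passed in the same way, which makes it easier to see whether a change improves precision, recall, or both on the sample.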
I just tried this: \(*19\d\d|19\d\d\) and wonder if that gets me any closer?
In order to also get publication years from the 2000s, I tried this: \(*19\d\d|19\d\d\)|\(*20\d\d|20\d\d\)
Does that make it clearer now? Do you think the last string would capture everything I'm looking for?
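One point that can be checked outside the concordancer: \(* means "zero or more opening brackets", so \(*19\d\d also matches a year with no bracket at all. The sketch below tests the last string in Python and compares it with a hypothetical two-step alternative (first take each round-bracket group, then pull the years out of it). The names bracket_group, year and years_in_brackets are made up for illustration, and Python's re module is assumed to behave like the concordancer's engine, which may not be the case.

```python
import re

# The last string from above, unchanged.
last_string = re.compile(r"\(*19\d\d|19\d\d\)|\(*20\d\d|20\d\d\)")

# Hypothetical two-step alternative (not part of the original attempt):
# step 1 grabs the text inside each pair of round brackets (no nesting),
# step 2 finds 19xx/20xx years that are not part of a longer number.
bracket_group = re.compile(r"\(([^()]*)\)")
year = re.compile(r"\b(?:19|20)\d\d(?!\d)")  # also accepts forms like 1990a

def years_in_brackets(text):
    """Return every 19xx/20xx year that occurs inside round brackets."""
    found = []
    for group in bracket_group.findall(text):
        found.extend(year.findall(group))
    return found

samples = [
    "p = .37)",
    "(p. 2206)",
    "he counted 1988 instances of",
    "(Fowler 1986: 127-46; Simpson 1990)",
    "(e.g. 1936, 1967)",
]

for text in samples:
    print(text)
    print("  last string:", last_string.findall(text))
    print("  two-step:   ", years_in_brackets(text))
```

In this sketch the last string still picks up 1988 in "he counted 1988 instances of", while the two-step version does not. The two-step version has its own limits: a page reference such as (p. 1990) would still be counted as a year, and nested brackets are not handled.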
I'm aware that I will still miss out on citations where the writer does not include a year in the sentence, but that's a different issue.
To count references in the hard sciences, which often use numbers in square brackets, I tried the following:
\[\d\d?\d?\] finds [5] [13] [234]
\[*\,\d\d?\d?\] finds ,5] ,13] ,234] in [7,5] [9,13] [23,77,234]
\[\d\d?\d?, finds [7, [9, [23, in [7,5] [9,13] [23,77,234]
\[\d\d?\d?– finds [9– in [9–13] and [5– in [5–13, 45] (a range covers several references without giving each one its own number => manual correction required)
But how do I find 77 in [23,77,234] or in [23,77-78,234] without also matching any other digits in the text that fit the pattern ,dd, or ,dd- ?
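One possible way around this is to stop matching the numbers one by one and instead match the whole square-bracket group first, then read the numbers out of it. The following is a minimal sketch in Python, not a tested recipe for any particular concordancer: the names citation_group, item and cited_numbers are hypothetical, and the pattern assumes that a citation bracket contains only one- to three-digit numbers separated by commas, hyphens or en dashes.

```python
import re

# Step 1: a bracket that contains nothing but 1-3 digit numbers,
# separated by commas, hyphens or en dashes (optional spaces allowed).
citation_group = re.compile(r"\[(\d{1,3}(?:\s*[,\-–]\s*\d{1,3})*)\]")

# Step 2: inside such a bracket, a single number or a range like 77-78.
item = re.compile(r"(\d{1,3})(?:\s*[\-–]\s*(\d{1,3}))?")

def cited_numbers(text, expand_ranges=False):
    """Return the reference numbers cited in square brackets.

    With expand_ranges=True, a range such as 9–13 is counted as 9, 10, 11, 12, 13.
    """
    found = []
    for group in citation_group.findall(text):
        for start, end in item.findall(group):
            if end and expand_ranges:
                found.extend(range(int(start), int(end) + 1))
            else:
                found.append(int(start))
                if end:
                    found.append(int(end))
    return found

print(cited_numbers("[23,77,234]"))
print(cited_numbers("[23,77-78,234]"))
print(cited_numbers("[5–13, 45]", expand_ranges=True))
print(cited_numbers("values 12,77,80 and 3,77-78 outside brackets"))
```

Because the numbers are only read from inside a complete bracket group, digits that merely look like ,dd, or ,dd- elsewhere in the text are ignored. The expand_ranges option is one way to reduce the manual correction needed for ranges, assuming that every number in a range stands for one reference.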
Thanks for your patience!
Stefanie