I'd like to extract citations as precisely and exhaustively as
possible from a manuscript using regex in R (so I don't have to do it
manually).
Arguably the most defining feature of a citation is the co-occurrence of
an author's name, identifiable by its initial upper-case letter, and a
year given in parentheses, e.g., Name (2020). But there are numerous
variants of this basic pattern.
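As a starting point, the basic "Name (2020)" pattern can be sketched as follows. This is a minimal illustration only: it uses base R's regmatches() and gregexpr() with perl = TRUE instead of stringr, and the test string x is made up for the example.

```r
# Basic pattern: a capitalised word, a space, and a four-digit year in parentheses.
x <- "a method used by Partington (2007) and later by Smith (2015); see also example (3)."

basic <- "\\b[A-Z][a-z]+ \\(\\d{4}\\)"

# gregexpr() locates every match, regmatches() pulls the matched substrings out.
basic_hits <- regmatches(x, gregexpr(basic, x, perl = TRUE))[[1]]
basic_hits
```

The short parenthesis "(3)" is skipped because it does not contain four digits. None of the variants (two authors, "et al.", possessives, years inside the parentheses together with the name) are covered by this sketch.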
Here's a sample text featuring a hopefully near-complete inventory of actual citation variants found in manuscripts, plus fake variants (such as parentheses containing fewer than four digits):
samp <- c("Irony closely co-occurs with laughter (Norrick 2003). Blahblah
concordances of laughter episodes, a method used by Partington (2007)
Written Academic Language Corpus (T2K-SWAL) and adopting a Searlian
framework, McAllister (2015). For example, the Narrative Corpus
(Rühlemann & O’Donnell 2012) blahblah (MICASE), which blah
and also Author (forthcoming) and blahblah Peter & Paul (in preparation)
for some speech acts (cf. Maynard & Leicher 2007) blahblah
most frequent ones in English (Carter et al. 2000: 179).blah
include evaluative prosody (e.g., Partington 2015), vagueness (O’Keeffe 2004),
and deixis (e.g., Rühlemann & O’Donnell 2012). blahblah
7 Brian: °E:rm yeah°
8 (1.7)
9 UNK: ( )
utterance made by a non-present speaker:
(3)
I mean I've been in two shops blah most influential has been Searle’s (1975)
and Xyz et al.'s (1999) taxonomy; (see also Kok 2017; Sperber & Wilson 1986)
7 Ena: and I'd always been sorry that my dad
8 >my dad< never <<taught us ^you know>>
(0.5)
9 Alan: I’ve been trying to learn it, but I haven't got very far
(BNC KB0: 218-223; corrected transcription)")
The regex I've tried so far is this:
library(stringr)
str_extract_all(samp, "([A-Z][a-z].*)?\\(\\w.*[^A-Z)]\\)")
But the matching is far from perfect; the imperfect matches are commented on in the output:
[[1]]
[1] "Irony closely co-occurs with laughter (Norrick 2003)" # only "(Norrick 2003)" should match
[2] "Partington (2007)"
[3] "McAllister (2015)"
[4] "(Rühlemann & O’Donnell 2012)"
[5] "Author (forthcoming) and blahblah Peter & Paul (in preparation)" # should be 2 matches: "Author (forthcoming)" and "Peter & Paul (in preparation)"
[6] "(cf. Maynard & Leicher 2007)"
[7] "English (Carter et al. 2000: 179)"
[8] "(e.g., Partington 2015), vagueness (O’Keeffe 2004)" # should be 2 matches: "(e.g., Partington 2015)" and "(O’Keeffe 2004)"
[9] "(e.g., Rühlemann & O’Donnell 2012)"
[10] "(1.7)" # should not match
[11] "Searle’s (1975)"
[12] "Xyz et al.'s (1999) taxonomy; (see also Kok 2017; Sperber & Wilson 1986)" # should be two matches: "Xyz et al.'s (1999)" and "(see also Kok 2017; Sperber & Wilson 1986)"
[13] "(0.5)" # should not match
[14] "(BNC KB0: 218-223; corrected transcription)" # should not match
Help as to how to improve the regex is much appreciated!
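The over-long matches in the output above appear to come from the greedy `.*`, which runs on to the last closing parenthesis it can reach on the line. A small sketch of the difference between a greedy wildcard and a negated character class, using base R with perl = TRUE and a made-up test string:

```r
# Two parenthesised citations on one line.
x <- "(e.g., Partington 2015), vagueness (Keeffe 2004)"

greedy  <- "\\(\\w.*\\)"      # .* runs to the LAST ")" on the line
bounded <- "\\(\\w[^)]*\\)"   # [^)]* stops at the FIRST ")"

greedy_hits  <- regmatches(x, gregexpr(greedy,  x, perl = TRUE))[[1]]
bounded_hits <- regmatches(x, gregexpr(bounded, x, perl = TRUE))[[1]]

greedy_hits    # one match spanning both citations
bounded_hits   # two separate matches
```

Replacing `.*` with `[^)]*` (or a lazy `.*?`) inside the parentheses is therefore one likely first step towards separating adjacent citations.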
Ok, this was sort of a fun challenge. The following yields the extractions your annotations suggest you’re looking for:
str_extract_all(samp, "(\\b[A-Z][^ ,.]+(\\b[A-Z][^ ,.]+|[ ,.&]+|'s|and|et al)* )?\\([^\\)]*(\\d{4}|\\bforthcoming\\b|in prep(aration)?)[^\\)]*\\)")
Rather convoluted, as natural language text regex extractions often become. Perhaps a little too tuned to the sample? Maybe not. Tried to still keep it generalizable, while returning what you were looking for. Output:
[[1]]
[1] "(Norrick 2003)" "Partington (2007)"
[3] "McAllister (2015)" "(Rühlemann & O’Donnell 2012)"
[5] "Author (forthcoming)" "Peter & Paul (in preparation)"
[7] "(cf. Maynard & Leicher 2007)" "English (Carter et al. 2000: 179)"
[9] "(e.g., Partington 2015)" "(O’Keeffe 2004)"
[11] "(e.g., Rühlemann & O’Donnell 2012)" "Searle’s (1975)"
[13] "Xyz et al.'s (1999)" "(see also Kok 2017; Sperber & Wilson 1986)"
Which is what you were looking for, yes?
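For readability, the same pattern can be assembled from named pieces. The sketch below is only a restatement: the variable names are invented for illustration, the composed string is identical to the one-liner above, and the quick check uses base R with perl = TRUE on a made-up sentence rather than stringr on samp.

```r
# Pieces of the pattern (names are illustrative only).
name   <- "\\b[A-Z][^ ,.]+"                                  # capitalised token
glue   <- "[ ,.&]+|'s|and|et al"                             # what may join names
prefix <- paste0("(", name, "(", name, "|", glue, ")* )?")   # optional author(s) before "("
year   <- "(\\d{4}|\\bforthcoming\\b|in prep(aration)?)"     # year or status word
parens <- paste0("\\([^\\)]*", year, "[^\\)]*\\)")           # parenthesis containing it

pattern <- paste0(prefix, parens)

# Quick check on a made-up sentence.
x <- "as shown by Partington (2007) and in line (8)"
pattern_hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
pattern_hits
```

Splitting the pattern this way does not change what it matches; it only makes it easier to see which part to adjust when a new citation variant turns up.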
Robin
--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CALFCMoXf_yxRdGcuxB6-CH78JjOU7pAxFWt5YZT65EVSQvj6Ug%40mail.gmail.com.
str_extract_all(samp, "\\([A-Za-z][^)]*\\d{4}.*?\\)|\\b[A-Z][a-z].*\\([^A-Za-z)]\\w.*?\\)|\\b[A-Z][a-z].*\\(forthcoming\\)|\\b[A-Z][a-z].*\\(in preparation\\)")
[[1]]
[1] "(Norrick 2003)" "Partington (2007)"
[3] "McAllister (2015)" "(Rühlemann & O’Donnell 2012)"
[5] "Author (forthcoming)" "Peter & Paul (in preparation)"
[7] "(cf. Maynard & Leicher 2007)" "(Carter et al. 2000: 179)"
[9] "(e.g., Partington 2015)" "(O’Keeffe 2004)"
[11] "(e.g., Rühlemann & O’Donnell 2012)" "Abcd & Xyz's (1777)"
[13] "Searle’s (1975)" "Xyz et al.'s (1999)"
[15] "(see also Kok 2017; Sperber & Wilson 1986)"
The solution is exhaustive, matching all citations, and precise,
matching just the citations and nothing else (except the parentheses, of
course, and stuff like 'e.g.,' etc.; but that can be attended to in
post-editing).
Both your regex and my own are rather verbose. I'd therefore be grateful for suggestions as to how to improve them.
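One modest way to tame the verbosity, without changing what is matched, is to keep each alternative as a named element and join them with "|". A sketch under that assumption (the element names are invented; the joined string is the same as the pattern used above):

```r
# Shared opening: a capitalised word followed by anything up to the parenthesis.
lead <- "\\b[A-Z][a-z].*"

alternatives <- c(
  in_parens   = "\\([A-Za-z][^)]*\\d{4}.*?\\)",           # (Name 2020), (cf. Name 2020: 12)
  name_year   = paste0(lead, "\\([^A-Za-z)]\\w.*?\\)"),   # Name (2020)
  forthcoming = paste0(lead, "\\(forthcoming\\)"),        # Name (forthcoming)
  in_prep     = paste0(lead, "\\(in preparation\\)")      # Name (in preparation)
)

# One alternation, assembled from the named parts.
citation_pattern <- paste(alternatives, collapse = "|")
citation_pattern
```

New variants can then be added or removed as single vector elements, and each alternative can be tested on its own.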
Best
Chris
EDIT:
Here's a solution that (i) also splits multiple citations in parentheses (e.g., "(Kok 2017; Peter 2018; etc.)") and (ii) post-processes the results:
cit <- str_extract_all(samp, "\\([A-Za-z][^)]*\\d{4};|;\\s[A-Za-z][^)]*\\d{4}\\)|\\([A-Za-z][^)]*\\d{4}.*?\\)|\\b[A-Z][a-z].*\\([^A-Za-z)]\\w.*?\\)|\\b[A-Z][a-z].*\\(forthcoming\\)|\\b[A-Z][a-z].*\\(in preparation\\)|\\([A-Za-z][^);]*\\d{4}|(?<=;\\s)[A-Za-z][^);]*\\d{4}")
It sure hasn't gotten less unwieldy but it's efficient:
cit
[[1]]
[1] "(Norrick 2003)" "Partington (2007)"
[3] "McAllister (2015)" "(Rühlemann & O’Donnell 2012)"
[5] "Author (forthcoming)" "Peter & Paul (in preparation)"
[7] "(cf. Maynard & Leicher 2007)" "(Carter et al. 2000: 179)"
[9] "(e.g., Partington 2015)" "(O’Keeffe 2004)"
[11] "(e.g., Rühlemann & O’Donnell 2012: 11-22)" "Abcd & Xyz's (1777)"
[13] "Searle’s (1975)" "Xyz et al.'s (1999)"
[15] "(see also Kok 2017;" "Sperber & Wilson 1986"
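An alternative to adding dedicated alternations for semicolon-separated citations would be to extract whole parentheses first and split them afterwards. A minimal base R sketch, where hits is a hypothetical stand-in for the extraction result:

```r
# Hypothetical extraction result with one multi-citation parenthesis.
hits <- c("(see also Kok 2017; Sperber & Wilson 1986)", "Partington (2007)")

# Split at each semicolon (plus any following whitespace), then flatten the list.
parts <- unlist(strsplit(hits, ";\\s*"))
parts
```

The leftover parentheses on the split pieces would still need to be removed in the clean-up step.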
Now we can clean the results by removing anything that is not strictly part of the citation:
cit_clean <- gsub("\\(|\\)|:\\s\\d+(-\\d+)?|(e\\.g\\.,|see also|cf\\.)\\s|'s|’s|;", "", unlist(cit))
cit_clean
[1] "Norrick 2003" "Partington 2007" "McAllister 2015"
[4] "Rühlemann & O’Donnell 2012" "Author forthcoming" "Peter & Paul in preparation"
[7] "Maynard & Leicher 2007" "Carter et al. 2000" "Partington 2015"
[10] "O’Keeffe 2004" "Rühlemann & O’Donnell 2012" "Abcd & Xyz 1777"
[13] "Searle 1975" "Xyz et al. 1999" "Kok 2017"
[16] "Sperber & Wilson 1986"
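The same clean-up can also be written as a sequence of small, named steps, which may be easier to extend than one long alternation. This is a sketch only: the step names and the helper clean_citation() are invented for illustration, the input vector is made up, and the typographic apostrophe variant (’s) is left out to keep the example ASCII-only.

```r
# Each step removes one kind of non-citation material.
clean_steps <- c(
  parens     = "[()]",                              # opening and closing parentheses
  pages      = ":\\s\\d+(-\\d+)?",                  # page references such as ": 179" or ": 11-22"
  lead_in    = "(e\\.g\\.,|see also|cf\\.)\\s",     # introductory phrases
  possessive = "'s",                                # possessive marker
  semicolon  = ";"                                  # separators left over from splitting
)

clean_citation <- function(x) {
  for (p in clean_steps) x <- gsub(p, "", x)   # apply the steps in order
  trimws(x)                                    # drop any stray outer whitespace
}

raw <- c("(e.g., Partington 2015)", "Xyz et al.'s (1999)",
         "(Carter et al. 2000: 179)", "(cf. Maynard & Leicher 2007)",
         "(see also Kok 2017;")
cleaned <- clean_citation(raw)
cleaned
```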
And finally we can remove the duplicates and order the unique results alphabetically:
cit_unique <- sort(unique(cit_clean))
cit_unique
[1] "Abcd & Xyz 1777" "Author forthcoming" "Carter et al. 2000"
[4] "Kok 2017" "Maynard & Leicher 2007" "McAllister 2015"
[7] "Norrick 2003" "O’Keeffe 2004" "Partington 2007"
[10] "Partington 2015" "Peter & Paul in preparation" "Rühlemann & O’Donnell 2012"
[13] "Searle 1975" "Sperber & Wilson 1986" "Xyz et al. 1999"
I've used read.table and these arguments:
manuscript <- read.table([my path], header = FALSE, sep = "\n", fill = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
and I've used paste0 to fuse it all together:
paste0(manuscript$V1, collapse = "")
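For comparison, a plain-text manuscript can also be read with readLines() and joined with a single space, so that the last word of one line and the first word of the next do not run together. This is a hedged sketch: the temporary file stands in for the real manuscript, whose path is not known here.

```r
# Temporary file as a stand-in for the manuscript (hypothetical content).
path <- tempfile(fileext = ".txt")
writeLines(c("Irony co-occurs with laughter", "(Norrick 2003). Blah."), path)

# Read line by line, trim surrounding whitespace, and join with one space.
lines <- readLines(path, encoding = "UTF-8")
manuscript_text <- paste(trimws(lines), collapse = " ")
manuscript_text
```

Joining with a space rather than an empty string matters here because a citation such as "laughter (Norrick 2003)" may be split across two lines in the file.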