Regex: How to match citations in manuscripts

Christoph Ruehlemann

Mar 13, 2020, 7:51:30 AM
to corplin...@googlegroups.com
Hi all,

The problem I have should be familiar to you:

I'd like to extract citations as precisely and exhaustively as possible from a manuscript using regex in R (so I don't have to do it manually). Arguably the most defining feature of citations is the co-occurrence of an author's name identifiable via the upper case letter with which it starts as well as the year given in parentheses, e.g., Name (2020). But there are numerous variants to this basic pattern.

Here's a sample text featuring a (hopefully near-complete) inventory of genuine citation variants found in manuscripts, along with fake ones (such as parentheses containing fewer than four digits):

samp <- c("Irony closely co-occurs with laughter (Norrick 2003). Blahblah
          concordances of laughter episodes, a method used by Partington (2007)
          Written Academic Language Corpus (T2K-SWAL) and adopting a Searlian 
          framework, McAllister (2015). For example, the Narrative Corpus 
          (Rühlemann & O’Donnell 2012) blahblah (MICASE), which blah
          and also Author (forthcoming) and blahblah Peter & Paul (in preparation)
          for some speech acts (cf. Maynard & Leicher 2007) blahblah
          most frequent ones in English (Carter et al. 2000: 179).blah
          include evaluative prosody (e.g., Partington 2015), vagueness (O’Keeffe 2004), 
          and deixis (e.g., Rühlemann & O’Donnell 2012). blahblah

          7 Brian:  °E:rm yeah° 
          8             (1.7)
          9 UNK:    ( )
          utterance made by a non-present speaker:
          (3)    
          I mean I've been in two shops blah most influential has been Searle’s (1975)
          and Xyz et al.'s (1999) taxonomy; (see also Kok 2017; Sperber & Wilson 1986)

          7 Ena:    and I'd always been sorry that my dad 
          8     >my dad< never <<taught us ^you know>>
          (0.5)
          9 Alan:   I’ve been trying to learn it, but I haven't got very far
          (BNC KB0: 218-223; corrected transcription)")

The regex I've tried so far is this:

str_extract_all(samp, "([A-Z][a-z].*)?\\(\\w.*[^A-Z)]\\)")

But the matching is far from perfect; the imperfect matches are commented on in the output:

[[1]]
 [1] "Irony closely co-occurs with laughter (Norrick 2003)" # only "(Norrick 2003)" should match                 
 [2] "Partington (2007)"                                                       
 [3] "McAllister (2015)"                                                       
 [4] "(Rühlemann & O’Donnell 2012)"                                            
 [5] "Author (forthcoming) and blahblah Peter & Paul (in preparation)" #  should be 2 matches: "Author (forthcoming)" and "Peter & Paul (in preparation)"       
 [6] "(cf. Maynard & Leicher 2007)"                                            
 [7] "English (Carter et al. 2000: 179)"                                   
 [8] "(e.g., Partington 2015), vagueness (O’Keeffe 2004)"   # should be 2 matches: "(e.g., Partington 2015)" and "(O’Keeffe 2004)"                 
 [9] "(e.g., Rühlemann & O’Donnell 2012)"                                      
[10] "(1.7)"                       # should not match                                       
[11] "Searle’s (1975)"                                                         
[12] "Xyz et al.'s (1999) taxonomy; (see also Kok 2017; Sperber & Wilson 1986)" # should be two matches: "Xyz et al.'s (1999)" and "(see also Kok 2017; Sperber & Wilson 1986)"
[13] "(0.5)"      # should not match                                                             
[14] "(BNC KB0: 218-223; corrected transcription)" # should not match

Help as to how to improve the regex is much appreciated!
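
One likely reason for the fused matches above is that `.*` is greedy: it runs on to the last closing parenthesis it can reach. The following is only a minimal sketch of that effect, using base R (`gregexpr`/`regmatches` with `perl = TRUE`) rather than stringr so that it runs without extra packages. The sentence and the helper name `extract_all` are made up for illustration.

```r
# Illustrative sentence (not taken from the manuscript), ASCII only.
txt <- "see Partington (2007) and later Norrick (2003) for details"

# Hypothetical helper: return all matches of `pattern` in `x` (base R, PCRE).
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Greedy: `.*` runs to the LAST ")" in the string, fusing two citations.
greedy <- extract_all(txt, "[A-Z][a-z]+ \\(.*\\)")

# Bounded: `[^)]*` cannot cross a ")", so each citation is matched on its own.
bounded <- extract_all(txt, "[A-Z][a-z]+ \\([^)]*\\)")

print(greedy)
print(bounded)
```

The greedy version returns one long match, the bounded version returns the two citations separately.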

Best
Chris

--
Albert-Ludwigs-Universität Freiburg
Projekt-Leiter DFG-Projekt "Analyse multimodaler Interaktion im Geschichtenerzählen"
ἰχθύς

Robin Melnick

Mar 13, 2020, 11:27:13 AM
to corplin...@googlegroups.com

Ok, this was sort of a fun challenge. The following yields the extractions your annotations suggest you’re looking for:

 

str_extract_all(samp, "(\\b[A-Z][^ ,.]+(\\b[A-Z][^ ,.]+|[ ,.&]+|'s|and|et al)* )?\\([^\\)]*(\\d{4}|\\bforthcoming\\b|in prep(aration)?)[^\\)]*\\)")

 

Rather convoluted, as natural language text regex extractions often become. Perhaps a little too tuned to the sample? Maybe not. Tried to still keep it generalizable, while returning what you were looking for. Output:

 

[[1]]
 [1] "(Norrick 2003)"                             "Partington (2007)"
 [3] "McAllister (2015)"                          "(Rühlemann & O’Donnell 2012)"
 [5] "Author (forthcoming)"                       "Peter & Paul (in preparation)"
 [7] "(cf. Maynard & Leicher 2007)"               "English (Carter et al. 2000: 179)"
 [9] "(e.g., Partington 2015)"                    "(O’Keeffe 2004)"
[11] "(e.g., Rühlemann & O’Donnell 2012)"         "Searle’s (1975)"
[13] "Xyz et al.'s (1999)"                        "(see also Kok 2017; Sperber & Wilson 1986)"

 

Which is what you were looking for, yes?
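
For readers who find the one-line pattern hard to parse, here is a sketch that assembles the same pattern from named pieces. The variable names (`name`, `connector`, `author_prefix`, `year`, `paren`) are labels invented for this illustration, and the test sentence is made up. It uses base R with `perl = TRUE` instead of stringr so that it runs without extra packages.

```r
# The pattern exactly as posted above.
original <- "(\\b[A-Z][^ ,.]+(\\b[A-Z][^ ,.]+|[ ,.&]+|'s|and|et al)* )?\\([^\\)]*(\\d{4}|\\bforthcoming\\b|in prep(aration)?)[^\\)]*\\)"

# A capitalized word: an upper-case letter plus anything up to space/comma/dot.
name <- "\\b[A-Z][^ ,.]+"
# Things allowed between names: separators, possessive 's, "and", "et al".
connector <- "[ ,.&]+|'s|and|et al"
# Optional author part in front of the parenthesis, ending in a space.
author_prefix <- paste0("(", name, "(", name, "|", connector, ")* )?")
# What must occur somewhere inside the parentheses.
year <- "\\d{4}|\\bforthcoming\\b|in prep(aration)?"
# A parenthesis containing a year (or "forthcoming"/"in prep") and no ")".
paren <- paste0("\\([^\\)]*(", year, ")[^\\)]*\\)")

pattern <- paste0(author_prefix, paren)
stopifnot(identical(pattern, original))  # same pattern, just built in steps

# Made-up ASCII test sentence.
txt <- "irony co-occurs with laughter (Norrick 2003); see also Partington (2007) and the pause (1.7)"
hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
print(hits)
```

Building the pattern this way makes it easier to add, say, another "year" alternative in one place without touching the rest.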

 

Robin

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/corpling-with-r/CALFCMoXf_yxRdGcuxB6-CH78JjOU7pAxFWt5YZT65EVSQvj6Ug%40mail.gmail.com.

Christoph Ruehlemann

Mar 13, 2020, 12:04:11 PM
to corplin...@googlegroups.com
Pretty cool! Thanks a lot! The regex works perfectly.

In the meantime I've come up with my own solution which yields exactly the same result as yours:
str_extract_all(samp, "\\([A-Za-z][^)]*\\d{4}.*?\\)|\\b[A-Z][a-z].*\\([^A-Za-z)]\\w.*?\\)|\\b[A-Z][a-z].*\\(forthcoming\\)|\\b[A-Z][a-z].*\\(in preparation\\)")


[[1]]
 [1] "(Norrick 2003)"                             "Partington (2007)"
 [3] "McAllister (2015)"                          "(Rühlemann & O’Donnell 2012)"
 [5] "Author (forthcoming)"                       "Peter & Paul (in preparation)"
 [7] "(cf. Maynard & Leicher 2007)"               "(Carter et al. 2000: 179)"
 [9] "(e.g., Partington 2015)"                    "(O’Keeffe 2004)"
[11] "(e.g., Rühlemann & O’Donnell 2012)"         "Abcd & Xyz's (1777)"
[13] "Searle’s (1975)"                            "Xyz et al.'s (1999)"
[15] "(see also Kok 2017; Sperber & Wilson 1986)"

The solution is exhaustive, matching all citations, and precise, matching just the citations and nothing else (except the parentheses, of course, and stuff like 'e.g.,' etc.; but that can be attended to in post-editing).

Both your regex and my own are rather verbose. I'd therefore be grateful for suggestions on how to simplify them.


Best

Chris


Christoph Ruehlemann

Mar 13, 2020, 12:06:16 PM
to corplin...@googlegroups.com
Ah, no, I just realized that your regex also matches "English (Carter et al. 2000: 179)", which it should not, as "English" is not an author. I've solved this problem by not matching a capitalized word before a parenthesis when the opening parenthesis is immediately followed by a letter.

;)

Chris


Stefan Th. Gries

Mar 13, 2020, 12:07:27 PM
to CorpLing with R
I was playing around with something similar (I had added "to appear")
- ultimately, one might need to use Unicode blocks for author names
etc. because of umlauts, accented characters etc.
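
As a rough sketch of that idea: Unicode property classes such as `\p{Lu}` (any upper-case letter) and `\p{L}` (any letter) can stand in for `[A-Z]` and `[a-z]`. The example below is an illustration under assumptions, not a tested replacement for the patterns in this thread. It uses base R with `perl = TRUE`, and the text is made up.

```r
# Made-up text; non-ASCII letters are written as \u escapes so the string
# is UTF-8 no matter how this snippet is saved.
txt <- "R\u00fchlemann (2012) and \u00c9douard (1999) disagree; pause (1.7)"

# \p{Lu} = any upper-case letter, \p{L} = any letter (Unicode properties).
# In base R these need perl = TRUE.
pattern <- "\\p{Lu}\\p{L}+\\s\\(\\d{4}\\)"

hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
print(hits)
```

Names containing apostrophes or hyphens (O’Keeffe, for instance) would still need extra characters in the class, so this only covers the umlaut and accent cases.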

Christoph Ruehlemann

Mar 13, 2020, 3:09:16 PM
to corplin...@googlegroups.com
just to follow up on this thing:

EDIT:

Here's a solution that (i) also splits multiple citations in parentheses (e.g., "(Kok 2017; Peter 2018; etc.)") and (ii) post-processes the results:

cit <- str_extract_all(samp, "\\([A-Za-z][^)]*\\d{4};|;\\s[A-Za-z][^)]*\\d{4}\\)|\\([A-Za-z][^)]*\\d{4}.*?\\)|\\b[A-Z][a-z].*\\([^A-Za-z)]\\w.*?\\)|\\b[A-Z][a-z].*\\(forthcoming\\)|\\b[A-Z][a-z].*\\(in preparation\\)|\\([A-Za-z][^);]*\\d{4}|(?<=;\\s)[A-Za-z][^);]*\\d{4}")

It sure hasn't gotten any less unwieldy, but it does the job:

cit
[[1]]
 [1] "(Norrick 2003)"                            "Partington (2007)"                        
 [3] "McAllister (2015)"                         "(Rühlemann & O’Donnell 2012)"             
 [5] "Author (forthcoming)"                      "Peter & Paul (in preparation)"            
 [7] "(cf. Maynard & Leicher 2007)"              "(Carter et al. 2000: 179)"                
 [9] "(e.g., Partington 2015)"                   "(O’Keeffe 2004)"                          
[11] "(e.g., Rühlemann & O’Donnell 2012: 11-22)" "Abcd & Xyz's (1777)"                      
[13] "Searle’s (1975)"                           "Xyz et al.'s (1999)"                      
[15] "(see also Kok 2017;"                       "Sperber & Wilson 1986" 

Now we can clean the results by removing anything that is not strictly part of the citation:

cit_clean <- gsub("\\(|\\)|:\\s\\d+(-\\d+)?|(e\\.g\\.,|see also|cf\\.)\\s|'s|’s|;", "", unlist(cit))
cit_clean
 [1] "Norrick 2003"                "Partington 2007"             "McAllister 2015"            
 [4] "Rühlemann & O’Donnell 2012"  "Author forthcoming"          "Peter & Paul in preparation"
 [7] "Maynard & Leicher 2007"      "Carter et al. 2000"          "Partington 2015"            
[10] "O’Keeffe 2004"               "Rühlemann & O’Donnell 2012"  "Abcd & Xyz 1777"            
[13] "Searle 1975"                 "Xyz et al. 1999"             "Kok 2017"                   
[16] "Sperber & Wilson 1986"

And finally we can remove the duplicates and order the unique results alphabetically:

cit_unique <- sort(unique(cit_clean))
cit_unique
 [1] "Abcd & Xyz 1777"             "Author forthcoming"          "Carter et al. 2000"         
 [4] "Kok 2017"                    "Maynard & Leicher 2007"      "McAllister 2015"            
 [7] "Norrick 2003"                "O’Keeffe 2004"               "Partington 2007"            
[10] "Partington 2015"             "Peter & Paul in preparation" "Rühlemann & O’Donnell 2012" 
[13] "Searle 1975"                 "Sperber & Wilson 1986"       "Xyz et al. 1999"


Christoph Ruehlemann

Mar 14, 2020, 7:00:42 AM
to corplin...@googlegroups.com
Hi all,

One more question related to the recent discussion on how to extract citations from manuscripts. While the regex shown was successful on the sample, I'm having trouble using it on the actual manuscript (which is obviously much larger and may have a more complex layout or formatting structure). The trouble, I think, is not with the regex but that I'm unable to read in the manuscript in such a way that the regex can process it properly.

So here's the question: how can I read in the manuscript, or post-process it, in such a way that the whole text is just **a single large character string**?

I've tried it with read.table and these arguments:
read.table([my path], header = F,  sep = "\n", fill = F, stringsAsFactors = F, strip.white = T)

and I've used paste to fuse it all together:

paste0(manuscript$V1, collapse = "")

But there are still deep divisions that survive these transformations and prevent the regex from searching the manuscript without interruption from top to bottom.

Any suggestions as to how that problem can be solved?

Many thanks in advance!

Chris



Konrad Juszczyk

Mar 14, 2020, 10:09:30 AM
to corplin...@googlegroups.com
Hi all,


It's an interesting case, far more exciting than the current news. Where exactly do the divisions appear? Have you tried cleaning the text before extraction? Remove all non-alphabetic characters, but keep parentheses and four-digit numbers. If that doesn't help, try splitting the string into smaller pieces where the divisions appear and loop the regex extraction over them.

On the other hand, there are some approaches on the web: the JabRef website lists useful tools for bibliographies.

After all, using so many variations should be simplified, so machines would extract citations instead of us. But somehow, we keep the variety. 

Good luck

Konrad Juszczyk 


Stefan Th. Gries

Mar 14, 2020, 10:09:35 AM
to CorpLing with R
I don't know what "deep divisions" are - what do you mean by that? And
for putting it all into one string, I'd probably do

paste(manuscript$V1, collapse = " ")

i.e., put a space between parts, but this might depend on the content
of V1, obviously
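
A possible alternative to `read.table`, sketched under assumptions: `readLines` reads the file as raw lines without interpreting quotes or comment characters. One guess (not confirmed in this thread) is that the "divisions" come from `read.table`'s defaults, which treat `'` and `"` as quote characters and `#` as a comment character. The file below is a made-up stand-in written to a temporary path.

```r
# Write a small stand-in "manuscript" to a temporary file. The apostrophe and
# the "#" are included on purpose: read.table() treats them as quote and
# comment characters by default, whereas readLines() does not.
path <- tempfile(fileext = ".txt")
writeLines(c("Searle's taxonomy (1975) is influential",
             "",
             "as item #3 shows (Kok 2017)."), path)

# Read raw lines, join them with a space, then squeeze runs of whitespace.
lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
ms <- paste(lines, collapse = " ")
ms <- trimws(gsub("\\s+", " ", ms))

print(length(ms))  # number of strings in the result
print(ms)
```

If the manuscript is a .docx or .pdf rather than plain text, it would need converting to text first; that step is outside this sketch.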

Serge Heiden

Mar 14, 2020, 10:41:46 AM
to corplin...@googlegroups.com
Hi,

On 14/03/2020 at 15:09, Konrad Juszczyk wrote:
> After all, using so many variations should be simplified, so machines would extract citations instead of us. But somehow, we keep the variety.

For a machine learning approach to this, you may be interested in GROBID : https://github.com/kermitt2/grobid

Best,
Serge

--
Dr. Serge Heiden, s...@ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon - IHRIM UMR5317
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

Christoph Ruehlemann

Mar 17, 2020, 9:57:09 AM
to corplin...@googlegroups.com

Hi all,

Just in case there's still some interest in the question of how to extract citations exhaustively from manuscripts: I've made some progress and cobbled together some code that extracts citations of every kind found in an actual manuscript of moderate length (8k words). Obviously, other manuscripts may contain variants that are not yet accounted for, but these could be covered by additions or amendments to the code. So I think it's a fair beginning, and I'd like to share the code so that others can use it, improve it, or adapt it to their own purposes.

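### STEP 1: Load manuscript and paste it together into a single character string:
library(stringr)
# "manuscript.txt" is a placeholder; point this at your own plain-text file
ms <- readLines("manuscript.txt", encoding = "UTF-8")
ms <- paste0(ms, collapse = " ")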

### STEP 2: define patterns for types of citations:

# pattern 1: match citations completely enclosed in parentheses, e.g., "(Kok 2017 etc.)"
p_1 <- "\\((Mc|O’)?[A-Za-zé][^)]*\\d{4}\\)"

# pattern 2: match citations where only year etc. is enclosed in parentheses, e.g., "Kok (2017 etc.)"
p_2 <- "(Mc|O’)?[A-Z][a-zé]+\\b\\s\\(\\d{4}(: \\d+)?\\)"

# pattern 3: match citations where name is followed by stuff before parenthesis, e.g., "Kok et al.'s (2017: 1-12)"
p_3 <- "(Mc|O’)?[A-Z][a-zé]+(\\set al\\.)?(’s)?\\s\\(\\d{4}(: \\d+)?\\)"

# pattern 4: match citations with 2 names before parentheses, e.g., "Kok & Kik's (2017: 1-12)"
p_4 <- "(Mc|O’)?[A-Z][a-zé]+\\b\\s&\\s(Mc|O’)?[A-Z][a-zé]+\\b(’s)?\\s\\(\\d{4}(: \\d+)?\\)"

# pattern 5: match  citations with 2 names enclosed in parentheses, e.g., "(Kok & Kik 2017: 1-12)"
p_5 <- "\\((Mc|O’)?[A-Z][a-zé]+\\b\\s&\\s(Mc|O’)?[A-Z][a-zé]+\\b\\s\\d{4}(: \\d+)?\\)"

# pattern 6: match citations enclosed in parentheses and preceded by stuff,  e.g., "(e.g., Kok & Kik 2017: 1-12)"
p_6 <- "\\((cf\\.\\s|e\\.g\\.,\\s)?(Mc|O’)?[A-Z][a-zé]+(\\set al\\.)?\\s\\d{4}(: \\d+)?\\)"

# pattern 7: match multi-citations in parentheses, e.g., "(cf. Kik & Kok’s 2018; Pit 2008: 23; Joe 2017)"
p_7 <- "\\((cf\\.\\s|e\\.g\\.,\\s)?(Mc|O’)?[A-Z][a-z][^)]*\\d{4}(: \\d+)?;(\\scf\\. also)?\\s(Mc|O’)?[A-Z][a-z][^)]*\\d{4}(: \\d+)?\\)"

# pattern 8: match citations in square brackets, e.g., "(but see Kik & Kok’s [2018]; cf. also [Pet 2008: 23])"
p_8 <- "(Mc|O’)?[A-Z][a-zé]+\\b\\s&\\s(Mc|O’)?[A-Z][a-zé]+\\b(’s)?\\s\\[\\d{4}(: \\d+)?\\]|\\[(Mc|O’)?[A-Z][a-zé]+\\b\\s\\d{4}(: \\d+)?\\]"


### STEP 3: combine patterns and apply them to manuscript:
# combine:
allpatterns <- paste(c(p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8), collapse = "|")

# extract using `str_extract_all`:
str_extract_all(ms, allpatterns)

### STEP 4: post-process result
# save:
cit <- str_extract_all(ms, allpatterns)

# split multi-citations, eg, "(A 2000; B 1999; ...)":
cit_split <- unlist(str_split(unlist(cit), ";\\s"))

# clean up:
cit_clean <- gsub("\\(|\\)|\\[|\\]|:\\s\\d+(-\\d+)?|(e\\.g\\.,|see also|cf\\.(\\salso)?)\\s|'s|’s|;", "", cit_split)

# order unique citations alphabetically :
cit_unique <- sort(unique(cit_clean))

That's it. You should now have an alphabetically ordered list of all unique citations in the manuscript!

Best
Chris


Alex Perrone

Mar 17, 2020, 11:25:30 AM
to corplin...@googlegroups.com
If you turn these steps into a couple of functions and add a few basic tests, it would be very easy to make a small package out of it and put it up on GitHub so people can download and use it, and you can modify it over time (if you wish). Here's a free book that's a great guide for how to do this: http://r-pkgs.had.co.nz
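
To give an idea of what that could look like, here is a minimal sketch. The function names `extract_citations` and `clean_citations` are hypothetical, and the two patterns are simplified stand-ins for p_1 to p_8 above, not the full set. It uses base R only so it runs without stringr.

```r
# Hypothetical helper: find citations and split multi-citations.
extract_citations <- function(text, patterns) {
  combined <- paste(patterns, collapse = "|")
  hits <- regmatches(text, gregexpr(combined, text, perl = TRUE))[[1]]
  # split multi-citations such as "(A 2000; B 1999)"
  unlist(strsplit(hits, ";\\s", perl = TRUE))
}

# Hypothetical helper: strip non-citation material, deduplicate, sort.
clean_citations <- function(cit) {
  # drop parentheses, page references, lead-ins ("cf.", "e.g.,") and 's
  junk <- "[()]|:\\s\\d+(-\\d+)?|(e\\.g\\.,|see also|cf\\.)\\s|'s"
  sort(unique(gsub(junk, "", cit, perl = TRUE)))
}

# Simplified stand-in patterns (assumptions, not the thread's p_1 ... p_8).
patterns <- c("\\([A-Za-z][^)]*\\d{4}(: \\d+(-\\d+)?)?\\)",  # "(Kok 2017: 5)"
              "[A-Z][a-z]+('s)?\\s\\(\\d{4}\\)")             # "Kok's (2017)"

# A basic test on a made-up sentence.
txt <- "as Kok (2017) shows (cf. Kok 2017; Abel 1999: 12-15), pause (1.7)"
result <- clean_citations(extract_citations(txt, patterns))
stopifnot(identical(result, c("Abel 1999", "Kok 2017")))
print(result)
```

In a package, the `stopifnot` line would typically move into a test file, with one test per citation variant.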
