Search string - RegEx


pstef...@googlemail.com

May 16, 2016, 10:24:37 AM
to AntConc-discussion
Hello again,

this time I'm looking for a suitable RegEx to find, in analogy to Thompson & Tribble (2001), citations in my own academic articles corpus:

"In this instance we used a simple "catch-all" search string 19??)/??), that is, search for any five
character string beginning with 19 -- remember this is a pre-21st century corpus -- and ending with a
closing bracket, and any three character string ending with a closing bracket to catch other forms."

It should return hits like the following (taken from Thompson & Tribble 2001):

  1. Christine Brooke-Rose (1958),
  2. F.R. Leavis (e.g. 1936, 1967)
  3. Co-operative Principle (from Grice 1975)
  4. David Newnham, in the Guardian (24 July 1990)
  5. point of view (Fowler 1986: 127-46; Simpson 1990)

Using this string, AntConc tells me it's not a valid RegEx (maybe because they used WordSmith, which handles search patterns differently?)

I tried the following in AntConc (I'm new to using RegEx...):

19\d\d??\)
and
20\d\d??\)

The results look promising, but I'm not sure if everything would indeed be found. I have a feeling that "Fowler 1986" in ex. 5 would be left out.
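
A note on why the quoted string is rejected: in wildcard-style search syntax `?` stands for any single character, whereas in a regular expression `?` is a quantifier and a literal `)` has to be escaped. The sketch below uses Python's `re` module purely as a stand-in (AntConc's own regex engine may differ in details) to show the error, a literal translation of the wildcard string, and what the `\d??` in the attempt above actually does. The sample sentence is made up for illustration.

```python
import re

# Taken as a regex, the wildcard string from the quote has an unescaped ")".
try:
    re.compile(r"19??)/??)")
    wildcard_is_valid_regex = True
except re.error:
    wildcard_is_valid_regex = False

# Literal translation: wildcard "?" -> regex "." (any one character),
# the "/" between the two alternatives -> "|", and ")" escaped as "\)".
translated = re.compile(r"19..\)|..\)")

# In 19\d\d??\) the "\d??" is a lazy *optional* digit, so the pattern
# also accepts a three-digit number such as "195)".
attempt = re.compile(r"19\d\d??\)")

# Invented sample text, for illustration only.
sample = "Brooke-Rose (1958), see item 195) and (Fowler 1986: 127-46)"

print(wildcard_is_valid_regex)
print(translated.findall(sample))
print(attempt.findall(sample))
```

Neither pattern reaches "1986" here, because that year is followed by a colon rather than a closing bracket.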

Any help is much appreciated!

Stefanie

Laurence Anthony

May 16, 2016, 12:27:42 PM
to ant...@googlegroups.com
Your explanation of what you want to find is a bit confusing.

>search for any five character string beginning with 19 and ending with a closing bracket

So is 19xy) valid?

If you mean 19 followed by any two digits followed by a closing bracket, then this works:
19\d\d\)

>any three character string ending with a closing bracket to catch other forms

So is xy) valid?

If you mean any two digits followed by a closing bracket, then this works:
\d\d\)

Combining them simply becomes:
19\d\d\)|\d\d\)

But, as I say, I'm not sure what you want to find.

Laurence.
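
For readers who want to try the combined pattern outside AntConc, here is a minimal sketch in Python (an assumption of convenience; the thread itself only uses AntConc). It runs `19\d\d\)|\d\d\)` over the five example citations from the first message and prints what is matched in each.

```python
import re

# The five example citations quoted in the first message.
examples = [
    "Christine Brooke-Rose (1958),",
    "F.R. Leavis (e.g. 1936, 1967)",
    "Co-operative Principle (from Grice 1975)",
    "David Newnham, in the Guardian (24 July 1990)",
    "point of view (Fowler 1986: 127-46; Simpson 1990)",
]

# 19 + two digits + ")"  OR  any two digits + ")"
combined = re.compile(r"19\d\d\)|\d\d\)")

# Map each example to the list of strings the pattern matches in it.
hits = {text: combined.findall(text) for text in examples}

for text, found in hits.items():
    print(f"{found!r:12} <- {text}")
```

Only years that sit directly before the closing bracket are found, so "1936" and "1986" are not among the hits.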




###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


pstef...@googlemail.com

May 16, 2016, 12:43:31 PM
to AntConc-discussion
OK, let me try again.

I am trying to retrieve all citations that involve a year given in brackets, including, for instance, the following examples:

  1. Christine Brooke-Rose (1958),
  2. F.R. Leavis (e.g. 1936, 1967)
  3. Co-operative Principle (from Grice 1975)
  4. David Newnham, in the Guardian (24 July 1990)
  5. point of view (Fowler 1986: 127-46; Simpson 1990)
It may be straightforward as in the first example for which 19\d\d\) would work: 1958), and it would also find 1967), 1975) and 1990).
But what about 1936 and 1986 in examples 2 and 5? Here, there is an opening bracket, followed by any number of characters (letters, punctuation marks), then the four digits. How can I find those?


P.S. My starting point was the string suggested by Thompson and Tribble (see the quote), which, however, didn't work.

Laurence Anthony

May 16, 2016, 8:29:07 PM
to ant...@googlegroups.com
I'm still not sure what you are trying to find:

For these examples:
Christine Brooke-Rose (1958),
F.R. Leavis (e.g. 1936, 1967)
Co-operative Principle (from Grice 1975)
David Newnham, in the Guardian (24 July 1990)
point of view (Fowler 1986: 127-46; Simpson 1990)

Are you trying to extract the following?:
(1958)
(e.g. 1936, 1967)
(from Grice 1975)
(24 July 1990)
(Fowler 1986: 127-46; Simpson 1990)

Or?:
1958
1936
1967
1975
1990
1986
1990

Or something else?:

Laurence.




pstef...@googlemail.com

May 17, 2016, 6:07:12 AM
to AntConc-discussion
Ultimately, I would like to know how many references there are in the corpus (could be integral or non-integral citations), taking the occurrence of a year of publication in round brackets as an indicator. Having the four digits of the year as hits, I can then look at the complete citation via KWIC, and I could export the rows to another programme to code the type of citation, for instance.

19\d\d\)|\d\d\) gets me many 'good' results, but also returns hits such as 37) in p = .37) or 06) in (p. 2206) or 1988 in he counted 1988 instances of
and does not find 1986 in (Fowler 1986: 127-46; Simpson 1990) or 1936 in (e.g. 1936, 1967)

I'm looking for a string that reduces the problem of precision and recall further.
I just tried this: \(*19\d\d|19\d\d\) and wonder if that gets me any closer?
In order to also get publication years from the 2000s, I tried this: \(*19\d\d|19\d\d\)|\(*20\d\d|20\d\d\)

Does that make it clearer now? Do you think the last string would capture everything I'm looking for?
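
One detail that may matter here: `\(*` means "zero or more opening brackets", so the alternative `\(*19\d\d` also matches a bare 19xx with no bracket anywhere near it. A small sketch in Python (used only as a convenient test bench, with an invented sample sentence built from the cases mentioned above) shows the effect:

```python
import re

# The four-way pattern from the message above.
own = re.compile(r"\(*19\d\d|19\d\d\)|\(*20\d\d|20\d\d\)")

# Invented sample: one real citation, one plain count, one page number.
sample = (
    "point of view (Fowler 1986: 127-46; Simpson 1990) "
    "he counted 1988 instances of (p. 2006)"
)

hits = own.findall(sample)
print(hits)
```

The pattern does pick up "1986", but it also returns "1988" and the page number "2006", so recall improves at the cost of precision.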

I'm aware that I will still miss out on citations where the writer does not include a year in the sentence, but that's a different issue.


To count references in the hard sciences, which often use numbers in square brackets, I tried the following:

\[\d\d?\d?\] finds [5] [13] [234]

\[*\,\d\d?\d?\] finds 5,] 13,] 234] in [7,5] [9,13] [23,77,234]

\[\d\d?\d?, finds [7, [9, [23, in [7,5] [9,13] [23,77,234]

\[\d\d?\d?– finds [9– in [9–13] and [5–13, 45] (several references listed without giving a number to each => manual correction required)

But how to find 77 in [23,77,234] or in [23,77-78,234] and not any digits with the pattern ,dd, or ,dd- ?
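
One possible way around this, sketched in Python rather than in a single AntConc search string, is to work in two steps: first find whole bracket groups that contain nothing but digits and separators, then pull the numbers out of each group. The names `GROUP`, `NUMBER` and `numbers_in_brackets` are made up for this illustration.

```python
import re

# A bracket group containing only digits, commas, spaces, hyphens or en dashes.
GROUP = re.compile(r"\[[\d,\s\-–]+\]")
NUMBER = re.compile(r"\d+")


def numbers_in_brackets(text):
    """Return the numbers found inside numeric square-bracket groups only."""
    found = []
    for group in GROUP.findall(text):
        found.extend(NUMBER.findall(group))
    return found


# Invented sample: two citation groups and one unrelated number with commas.
sample = "as shown in [23,77,234] and [23,77-78,234], with 1,500,000 samples"
print(numbers_in_brackets(sample))
```

Because the numbers are only searched for inside a bracket group, a figure such as 1,500,000 elsewhere in the text is not picked up.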


Thanks for your patience!

Stefanie




JFlorian

May 17, 2016, 8:04:37 AM
to ant...@googlegroups.com

Note:  I am a non-expert beginner learner, so readers should weigh my comments accordingly.

I'm just thinking aloud...perhaps it will lead you to a new idea?

 Is there anything that would identify the hyphen used in page number citations to exclude those numbers so they don't confuse your 'year' results?

Would some command help to identify-exclude based on one side of a bracket [ or ], such as for [1] or [4] as in-text references pointing to a footnote or biblio, since those don't typically contain years?  

Or to identify-include only the closing half of the parenthesis, i.e. the closing symbol ), as in (White, 1976)? I'm thinking of this because in HTML webmaking tasks, "Find" or "Find-Replace" in webmaking programs often won't find HTML code if I search for the full <some text> (text) or [text] unless I leave out one side, either the opening or closing symbol. But it will find a search term such as 1976] or <1, etc. So I'm wondering if the same principle would help you in some manner?

Would using a 'repetition modifier' help?


Does this post help any?   Quoted:

"Reference:

\d - any digit character same as [0-9]
\D - any non digit character
+ - repetition modifier which means one or more times
| - means OR
() creates a group of characters which will be separated in the matches
..... "
(end post)
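
To make the quoted reference concrete, here is a short sketch in Python (chosen only for convenience; the symbols behave the same way in most regex flavours) applying each item to one invented string:

```python
import re

# Invented example string.
text = "Grice 1975, p. 45"

digits = re.findall(r"\d", text)              # \d : each single digit
runs = re.findall(r"\d+", text)               # +  : one or more digits in a row
non_digits = re.findall(r"\D+", text)         # \D : runs of non-digit characters
either = re.findall(r"1975|45", text)         # |  : one alternative OR the other
grouped = re.findall(r"(\w+) (\d{4})", text)  # () : groups returned separately

print(digits, runs, non_digits, either, grouped, sep="\n")
```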



Laurence Anthony

May 17, 2016, 8:20:26 AM
to ant...@googlegroups.com
Hi again,

You now write:
>[I] would like to know how many **references** there are in the corpus (emphasis added)

Why are you looking at citations then? Why not count the actual *references* at the end of the papers (probably formatted in an easy way to extract the information)?

Or, am I right in saying that you are trying to find the *complete citations* as follows?:
(1958)
(e.g. 1936, 1967)
(from Grice 1975)
(24 July 1990)
(Fowler 1986: 127-46; Simpson 1990)

Or, are you trying to find the *number of individual citations* referred to via the years they are cited as follows?:
1958
1936
1967
1975
1990
1986
1990

I still think you are trying to find the third option above. If so, I think there is a cool way to do this, but I don't want to think of a way until I know this is what you actually want.

Laurence.






pstef...@googlemail.com

May 17, 2016, 9:16:41 AM
to AntConc-discussion

Thanks 'Lifes' and Laurence for your replies.

I am interested in the number of individual in-text citations, not the number of references as in the bibliography/list of references. Should have employed terminology correctly ;-)

I am trying to find out how many citations in total there are in Corpus 1 (Applied Linguistics) as opposed to Corpus 2 (Physics), so that I can calculate the average number of citations per article in the two disciplines and compare them.
The citation conventions in the two journals I am currently looking at are quite different (author-date system vs. reference number in square brackets), so I need two different search strings for the two corpora to get a list of all citations; at least, that's what I've tried to do via RegEx.

In a second step, I would like to look at the citations returned by the search(es) above in closer detail (e.g. if there's a reporting verb used but also other aspects). My idea was that I could export the concordance lines to a different programme and then analyse/code the citations further.
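
The per-article average described above could be computed along the following lines. This is only a sketch in Python under stated assumptions: the function name `citations_per_article`, the file names and the two-article mini corpus are invented, and the pattern `19\d\d\)` is simply the one already discussed in this thread, standing in for whichever search string is finally chosen.

```python
import re
from statistics import mean


def citations_per_article(articles, pattern):
    """Return (hits per article, mean hits) for a dict of {name: text}."""
    regex = re.compile(pattern)
    counts = {name: len(regex.findall(text)) for name, text in articles.items()}
    return counts, mean(counts.values())


# Hypothetical two-article mini corpus, for illustration only.
applied_linguistics = {
    "article_1.txt": "Brooke-Rose (1958), and later Leavis (e.g. 1936, 1967)",
    "article_2.txt": "the Co-operative Principle (from Grice 1975)",
}

counts, average = citations_per_article(applied_linguistics, r"19\d\d\)")
print(counts, average)
```

Running the same function over the second corpus with its own pattern would give the two averages to compare.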





Laurence Anthony

May 17, 2016, 12:06:15 PM
to ant...@googlegroups.com
Hi,

Try this:
(?<=[( ])\d{4}(?=[,:)])

I haven't checked with other data, but it will find all the years in the examples you gave.

Here is my test data:
test Christine Brooke-Rose (1958), F.R. Leavis (e.g. 1936, 1967) Co-operative in 1983 Principle (from Grice 1975) David Newnham, in the Guardian (24 July 1990) point of view (Fowler 1986: 127-46; Simpson 1990) test

I hope it helps.

Laurence.
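
The pattern and test data above can also be checked outside AntConc. A minimal sketch in Python (an assumption of convenience; its `re` module supports the same lookbehind `(?<=...)` and lookahead `(?=...)` syntax):

```python
import re

# Four digits, preceded by "(" or a space, followed by "," ":" or ")".
# The lookarounds check the context without including it in the hit.
years = re.compile(r"(?<=[( ])\d{4}(?=[,:)])")

test_data = (
    "test Christine Brooke-Rose (1958), F.R. Leavis (e.g. 1936, 1967) "
    "Co-operative in 1983 Principle (from Grice 1975) David Newnham, "
    "in the Guardian (24 July 1990) point of view "
    "(Fowler 1986: 127-46; Simpson 1990) test"
)

hits = years.findall(test_data)
print(hits)
```

All seven citation years are returned as bare four-digit strings, and the uncited "1983" is skipped because it is followed by a space.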



pstef...@googlemail.com

May 17, 2016, 12:29:53 PM
to AntConc-discussion
Hi Laurence,

many thanks for your help!

In my Applied Linguistics Corpus, your string: (?<=[( ])\d{4}(?=[,:)]) returns 1741 hits whereas mine: \(*19\d\d|19\d\d\)|\(*20\d\d|20\d\d\) returns 3358
I have attached a screen shot of a passage where the string misses out on a few years.

Do you have any idea how I could count the citations in the Physics corpus? So far, I have tried the following:

\[\d\d?\d?\] finds [5] [13] [234]

\[*\,\d\d?\d?\] finds 5,] 13,] 234] in [7,5] [9,13] [23,77,234]

\[\d\d?\d?, finds [7, [9, [23, in [7,5] [9,13] [23,77,234]

\[\d\d?\d?– finds [9– in [9–13] and [5–13, 45] (several references listed without giving a number to each => manual correction required)

But how to find 77 in [23,77,234] or in [23,77-78,234] and not any digits with the pattern ,dd, or ,dd- ?
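
For the counting itself, ranges such as [9–13] raise the question of whether they count as one citation or as five cited items. The sketch below, in Python, assumes the second reading and expands each range; that choice, the function name `count_cited_items` and the sample sentence are all assumptions made for illustration, not something settled in this thread.

```python
import re

# Capture the inside of bracket groups made of digits and separators only.
GROUP = re.compile(r"\[([\d,\s\-–]+)\]")


def count_cited_items(text):
    """Count reference numbers in bracket groups, expanding ranges like 9–13."""
    total = 0
    for body in GROUP.findall(text):
        for part in body.split(","):
            part = part.strip()
            if not part:
                continue
            # Split "9–13" or "77-78" into its two end points.
            bounds = re.split(r"\s*[-–]\s*", part)
            if len(bounds) == 2 and all(b.isdigit() for b in bounds):
                start, end = int(bounds[0]), int(bounds[1])
                total += max(end - start + 1, 1)
            elif part.isdigit():
                total += 1
    return total


sample = "see [5], [9–13] and [5–13, 45]"
print(count_cited_items(sample))
```

If each bracket entry should count once regardless of ranges, the range branch can simply add 1 instead.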


There are quite a number of corpus-based research studies on citations (frequency, form, function), but unfortunately people don't mention which search string they used.... :-(  Would make life much easier if they did.

Stefanie




Search for citations_LA string.JPG

Laurence Anthony

May 17, 2016, 1:21:40 PM
to ant...@googlegroups.com
Ah... just add a semicolon to the lookahead after the four-digit part:

(?<=[( ])\d{4}(?=[,:;)])

For the Physics citations, use square brackets in the lookbehind and lookahead parts of the regex and match one or more digits with \d+

(I'm assuming you're familiar with regex).

Laurence.
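
The effect of adding the semicolon can be seen by running the earlier and the revised pattern side by side. A minimal sketch in Python, with an invented sample sentence:

```python
import re

v1 = re.compile(r"(?<=[( ])\d{4}(?=[,:)])")   # earlier version
v2 = re.compile(r"(?<=[( ])\d{4}(?=[,:;)])")  # semicolon added to the lookahead

# Invented sample with several sources separated by semicolons.
sample = "as argued before (Fowler 1986; Simpson 1990; Grice 1975: 45)"

print(v1.findall(sample))
print(v2.findall(sample))
```

The earlier version finds only the year followed by a colon, while the revised one also finds the two years followed by a semicolon.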



pstef...@googlemail.com

May 17, 2016, 3:23:30 PM
to AntConc-discussion
Hi again,

your revised version of the string now returns 2090 hits (before: 1741). Browsing through the text files I detected a few citation years that were not found (see attached files).


Would it look like this for the Physics part then?

[?<=[( ])\d+(?=[,:;)]]


I'm only a bit familiar with RegEx, and lookahead/lookbehind is not something I'm familiar with. I only started to work with RegEx recently, but I want to learn (and once I try to do something, I can't stop thinking about the problem and trying...)

Thank you so much for your support - this is unique!!


Stefanie


Search for citations_LA string V2_1.JPG
Search for citations_LA string V2_2.JPG

Laurence Anthony

May 17, 2016, 7:42:14 PM
to ant...@googlegroups.com
Looks like you have years starting at the beginning of lines:
(?<=[\n( ])\d{4}(?=[,:;)])

But this might introduce some false positives. I recommend you replace all new lines with spaces. Then the earlier version will work better.
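
Replacing the line breaks could be done in any text editor or script before loading the files. As one possibility, a minimal Python sketch (the helper name `normalise_whitespace` and the sample text are made up here):

```python
import re

# The version of the pattern without "\n" in the lookbehind.
years = re.compile(r"(?<=[( ])\d{4}(?=[,:;)])")


def normalise_whitespace(text):
    """Collapse every run of whitespace, including line breaks, to one space."""
    return re.sub(r"\s+", " ", text)


# Invented sample in which both years start a new line.
raw = "as noted by several authors (Fowler\n1986; Simpson\n1990)"

print(years.findall(raw))
print(years.findall(normalise_whitespace(raw)))
```

Before normalisation neither year is found, because each is preceded by a line break rather than a space; afterwards both are.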

For physics:

(?<=[\[])\d+(?=[\]])


as you don't need to look for anything except brackets.

Laurence.
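
A quick check of the physics pattern, written with balanced parentheses as `(?<=[\[])\d+(?=[\]])`, suggests it finds single-number citations such as [5] but not the members of lists such as [7,5]. The sketch below, in Python, compares it with a variant that also accepts a comma, hyphen or en dash on either side; that variant is an assumption added here for illustration, not something proposed in the thread.

```python
import re

# A number directly between "[" and "]".
single = re.compile(r"(?<=[\[])\d+(?=[\]])")

# Assumed extension: "[", ",", "-" or "–" before the number and
# "]", ",", "-" or "–" after it.
listed = re.compile(r"(?<=[\[,\-–])\d+(?=[\],\-–])")

# Invented sample sentence.
sample = "measured in [5] and [13], see also [7,5] and [23,77-78,234]"

print(single.findall(sample))
print(listed.findall(sample))
```

The looser variant would also match the middle of a figure like 1,500,000 and would miss numbers preceded by a space, as in [5–13, 45], so its hits would still need checking by hand.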

