Regular Expressions

Gustavo A. Rodriguez

Apr 11, 2023, 4:54:34 AM
to AntConc-Discussion
Hi,

I would like to use a regular expression to find repetitions of any word within an n-word span. What would that expression be? Thanks. 

Daniel HENKEL

Apr 14, 2023, 4:58:23 AM
to ant...@googlegroups.com, Gustavo A. Rodriguez

Dear Gustavo,

If your corpus is formatted as follows:

Token_POS_lemma Token_POS_lemma Token_POS_lemma ... (with 1 space between each word and the next)

then a regex like this one, which uses the backreference "\1" to find a repetition of the lemma with 0-9 words in between, will do the trick:

\b\S+_\S+_(\w+)( \w\S+){0,9} \S+_\S+_\1

This regular expression makes certain assumptions: that the corpus is perfectly formatted, that the underscore is never used for anything except tags, etc.
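For anyone who wants to try the tagged pattern outside AntConc first, here is a minimal sketch using Python's re module. The sample line, its tags, and the helper name lemma_repetitions are made up for illustration, and Python's regex engine is not guaranteed to behave exactly like the one in AntConc.

```python
import re

# Daniel's pattern for a Token_POS_lemma corpus, copied unchanged.
TAGGED = re.compile(r"\b\S+_\S+_(\w+)( \w\S+){0,9} \S+_\S+_\1")

def lemma_repetitions(line):
    """Return (lemma, matched span) pairs found in one line of tagged text."""
    return [(m.group(1), m.group(0)) for m in TAGGED.finditer(line)]

# Hypothetical sample; the tags and lemmas are invented for this sketch.
sample = "The_DT_the dogs_NNS_dog chased_VBD_chase another_DT_another dog_NN_dog"

for lemma, span in lemma_repetitions(sample):
    # Prints: dog -> dogs_NNS_dog chased_VBD_chase another_DT_another dog_NN_dog
    print(lemma, "->", span)
```

Note that {0,9} counts the tokens between the two occurrences, and that ( \w\S+) only accepts intervening tokens that start with a word character and are at least two characters long.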

If the corpus is not tagged, then something like this should work:

\b(\w\S*)( \S+){0,9} \1
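One thing to watch with the untagged pattern: nothing anchors the end of "\1", so a word can also match as the prefix of a longer word ("the" inside "then"). The Python sketch below shows that behaviour next to a stricter variant. The variant's lookahead (?!\S) is an assumption added here, not part of the original suggestion, and whether a given AntConc version accepts it has not been checked.

```python
import re

# Daniel's pattern for untagged text, copied unchanged.
UNTAGGED = re.compile(r"\b(\w\S*)( \S+){0,9} \1")

# Assumed variant (not from the thread): (?!\S) requires the repeated
# word to be followed by whitespace or the end of the text.
UNTAGGED_STRICT = re.compile(r"\b(\w\S*)( \S+){0,9} \1(?!\S)")

def repetitions(pattern, text):
    """Return every matched span, from the first occurrence to its repetition."""
    return [m.group(0) for m in pattern.finditer(text)]

# The original pattern accepts "the" as the start of "then".
print(repetitions(UNTAGGED, "the cat sat then left"))          # ['the cat sat the']
# The stricter variant rejects that prefix match...
print(repetitions(UNTAGGED_STRICT, "the cat sat then left"))   # []
# ...but still finds a genuine repetition.
print(repetitions(UNTAGGED_STRICT, "the cat sat on the mat"))  # ['the cat sat on the']
```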

In any case, backreferences are surely the best way to find repetitions.

Be aware that, unless you narrow down the query a bit, you're going to find lots of repetitions of articles and other grammatical words that probably don't interest you.
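One possible way to narrow the query is a negative lookahead that refuses to start a match on a grammatical word. The Python sketch below is only an illustration: the stop list is hypothetical, the lookahead is an addition to the untagged pattern rather than something proposed in this thread, and it has not been tested in AntConc itself.

```python
import re

# Hypothetical stop list; extend it with the grammatical words you want to skip.
STOP_WORDS = ["the", "a", "an", "of", "to", "and", "in"]
stop_alt = "|".join(re.escape(w) for w in STOP_WORDS)

# Daniel's untagged pattern, copied unchanged.
PLAIN = re.compile(r"\b(\w\S*)( \S+){0,9} \1")

# Same pattern plus an assumed negative lookahead: the first word may not be
# one of the stop words (compared case-insensitively via the scoped (?i:...)).
NARROWED = re.compile(r"\b(?!(?i:" + stop_alt + r")\b)(\w\S*)( \S+){0,9} \1")

text = "the dog saw the dog"

# Without the filter, the article is reported and hides the noun repetition.
print([m.group(0) for m in PLAIN.finditer(text)])     # ['the dog saw the']
# With the filter, the match starts at the first content word instead.
print([m.group(0) for m in NARROWED.finditer(text)])  # ['dog saw the dog']
```

Putting the filter inside the regex, rather than discarding hits afterwards, matters because matches do not overlap: a hit on "the" would otherwise consume the text in which a content word repeats.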

In my correspondence with Laurence, he has expressed legitimate reservations about my potentially hazardous use of \S+, but I can tell you from experience that if the corpus is properly formatted, my "risky" regex will work.

One more thing: these regular expressions will work in AntConc 3.5.9, but versions 4 and later can't always handle regex at sentence level. I'm not sure this kind of search can be performed in 4+ (pending confirmation from Laurence), so you may need to experiment a bit.

Best regards,

Daniel




Laurence Anthony

Apr 14, 2023, 5:02:48 AM
to ant...@googlegroups.com, Gustavo A. Rodriguez
Thanks for the comment, Daniel.

As you say, regex in AntConc 4.x works at the word level. To find repetitions of a word within a certain span, I think the context search in the advanced search settings would work better: you can set the word and the span size there, and everything should work fine. Perhaps start with a toy example of just a few lines of text and see if you can get the hits that you want. If you can't, send me that toy example, and I can have a look here.

Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################

