Dear Gustavo,
If your corpus is formatted as follows:
Token_POS_lemma Token_POS_lemma Token_POS_lemma ... (with 1 space between each word and the next)
then a regex like this one, which uses the backreference "\1" to find a repetition of the lemma with 0-9 words in between, will do the trick:
\b\S+_\S+_(\w+)( \w\S+){0,9} \S+_\S+_\1\b
This regular expression makes certain assumptions: that the corpus is perfectly formatted, that the underscore is never used for anything except tags, etc.
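For anyone who wants to check the tagged pattern outside AntConc, here is a minimal sketch using Python's re module. Python is only an assumption for illustration, and the function name and the sample sentence are made up. The pattern ends in \b so that a lemma cannot match just the beginning of a longer lemma (for example "be" inside "bear").

```python
import re

# Tagged-corpus pattern: Token_POS_lemma tokens separated by single spaces.
# The trailing \b keeps the backreference from matching only a prefix
# of a longer lemma.
TAGGED = re.compile(r"\b\S+_\S+_(\w+)( \w\S+){0,9} \S+_\S+_\1\b")


def find_lemma_repetitions(text):
    """Return (lemma, matched span) pairs for each non-overlapping match."""
    return [(m.group(1), m.group(0)) for m in TAGGED.finditer(text)]


# Made-up sample line in the Token_POS_lemma format.
sample = "The_DT_the dogs_NNS_dog chased_VBD_chase a_DT_a dog_NN_dog"
for lemma, span in find_lemma_repetitions(sample):
    print(lemma, "->", span)
```

On this sample the only hit is the lemma "dog", with a span running from dogs_NNS_dog to dog_NN_dog. Note that "The" and "the" are not treated as the same string, because the match is case-sensitive.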
If the corpus is not tagged, then something like this should work:
\b(\w\S*)( \S+){0,9} \1(?!\S)
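Since the original question asks about an n-word span, the limit in {0,9} can be made a parameter. A rough Python sketch follows, assuming Python's re module rather than AntConc. The function name repetition_pattern and the max_gap parameter are hypothetical, and the trailing (?!\S) is there so that the repeated word must end at a space or at the end of the text.

```python
import re


def repetition_pattern(max_gap):
    """Compile the untagged pattern, allowing at most max_gap words
    between the first occurrence of a word and its repetition."""
    if max_gap < 0:
        raise ValueError("max_gap must be zero or greater")
    return re.compile(r"\b(\w\S*)( \S+){0,%d} \1(?!\S)" % max_gap)


text = "the cat saw the dog"
wide = repetition_pattern(2).search(text)    # two words may intervene
narrow = repetition_pattern(1).search(text)  # only one word may intervene
print(wide.group(0) if wide else None)
print(narrow.group(0) if narrow else None)
```

With a gap of 2 the search finds "the cat saw the"; with a gap of 1 it finds nothing, because two words separate the two occurrences of "the".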
In any case, backreferences are surely the best way to find
repetitions.
Be aware that, unless you narrow down the query a bit, you're
going to find lots of repetitions of articles and other
grammatical words that probably don't interest you.
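One possible way of narrowing the query, sketched in Python: a negative lookahead at the start of the pattern skips any word that is on a stoplist. The seven-word STOPWORDS list is a made-up placeholder, and the sketch is written for Python's re module; whether it carries over to AntConc unchanged is an assumption.

```python
import re

# Hypothetical, deliberately tiny stoplist; a real one would be much longer.
STOPWORDS = ["the", "a", "an", "of", "and", "to", "in"]


def content_word_repetitions(text, max_gap=9):
    """Return spans where a non-stoplist word is repeated with at most
    max_gap words in between (case-insensitive)."""
    stop = "|".join(re.escape(word) for word in STOPWORDS)
    pattern = re.compile(
        # (?!(?:...)(?!\S)) rejects a first word that is exactly a stopword.
        r"\b(?!(?:%s)(?!\S))(\w\S*)( \S+){0,%d} \1(?!\S)" % (stop, max_gap),
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]


print(content_word_repetitions("the dog saw the dog"))
```

Putting the exclusion inside the pattern, rather than filtering the hits afterwards, matters here: a hit on "the ... the" would otherwise consume the text in which "dog ... dog" occurs, and that repetition would never be reported.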
In my correspondence with Laurence, he has expressed legitimate reservations about my potentially hazardous use of \S+, but I can tell you from experience that if the corpus is properly formatted, my "risky" regex will work.
Just one more thing: these regular expressions work in AntConc 3.5.9, but versions 4+ can't always deal with regex at sentence level, so I'm not sure that this kind of search can be performed in 4+ (pending confirmation from Laurence). You may need to experiment a bit.
Best regards,
Daniel
Hi,
I would like to use a regular expression to find repetitions of any word within an n-word span. What would that expression be? Thanks.
--
To view this discussion on the web visit https://groups.google.com/d/msgid/antconc/6b80fcf8-7103-4b72-8d21-86111499a6d1n%40googlegroups.com.