Regex difficulties in AntConc 4


R. Williams

Jan 2, 2022, 7:25:47 PM
to AntConc-Discussion
Laurence, Happy New Year!

[1]  AntConc v4.0 is a great way to usher in a new year of corpus-based research.  Without a doubt, AntConc has changed the way that I conduct my recent research.  Indeed, I could not do my research without it and the regexes that I use.  This posting acknowledges the hard work that you put into AntConc and in answering our questions.  There are numerous positive features in v4.

[2]  I also must say that I have encountered serious difficulties with using my tried-and-true regexes in the newest AntConc version.

[3]  As a political theorist, I use AntConc as the main means to access the ideas of W.E.B. Du Bois within the 230+ documents of his that compose my corpus.  My corpus contains only a minor portion of the 2000-plus pieces that he published and a fraction of the several thousand more unpublished letters and drafts of published and unpublished writings.

[4]  My research process involves both locating Du Bois's ideas expressed via words and also locating the words that compose the multiple variant ways in which he conveyed, defined, and applied his ideas.  I am not seeking collocations in the linguistic sense.  To accomplish my research I create what could be called proximity-oriented regexes.  I typically search for word combinations 20-50 words apart [using word tokens consisting of, for example, ``\W+(?:\w+\W+)``], or else 30, 50, 100, or even 200 characters apart [using the dot metacharacter].  Accordingly, I am seeking concepts by means of words that may, and often do, cross sentence and paragraph boundaries.  Indeed, the context in which I delimit the possible range of meanings spans not only the document itself, but also -- intertextually -- encompasses the corpus as a whole.

[5]  As a starting point for my comparison of AntConc v3.5.9 and v4, let me establish a base-line.  For both versions:
(a) I have "Regex" checked and no other Search Query box checked.
(b) I am using the same UTF-8 files for both versions, although in v4 I created a corpus as per the new version.
(c) I use the same regexes in v4.0 that worked in v3.5.9, but many do not work as expected or else do not work at all in v4.

[6]  Regexes that are not working as expected in v4 (but did so in v3.5.9) include the following.  For example, ``Consciousness`` yields both lower-case and upper-case matches, when I only wanted upper-case.
(a) The mode modifier ``(?i)`` does not seem to work in v4.
(b) Even ``[A-Z]onsciousness`` matches both lower-case and upper-case "c".
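
For comparison, the behaviour expected in [6] can be reproduced in a conventional regex engine. The following is a minimal sketch using Python's `re` module as a stand-in; it is not AntConc's own engine, and the sample sentence is invented for illustration.

```python
import re

# Invented sample text, not taken from the Du Bois corpus.
sample = "Consciousness and double-consciousness; a Conscious effort."

# Case-sensitive by default: only the capitalised form matches.
upper_only = re.findall(r"Consciousness", sample)

# The inline modifier (?i) makes the whole pattern case-insensitive.
any_case = re.findall(r"(?i)consciousness", sample)

# A class limited to A-Z excludes the lower-case "c" in "double-consciousness".
class_based = re.findall(r"[A-Z]onsciousness", sample)

print(upper_only)
print(any_case)
print(class_based)
```

In this engine case sensitivity is carried by the pattern itself, which is the behaviour the v3.5.9 searches relied on.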

[7]  Regexes that are not working at all in v4 (but did so in v3.5.9), for example, include:
    {rgx-1}  (?i)sel(?:.){0,50}?conscious|conscious(?:.){0,50}?sel
    {rgx-2}  (?i)sel[\w]*\W+(?:\w+\W+){0,20}?conscious[\w]*|conscious[\w]*\W+(?:\w+\W+){0,20}?sel[\w]*
I am looking for "self-conscious" and all variant expressions and multi-word units. This is important because in my study of Du Bois's concept of consciousness I wish to find as many of its variations and synonyms as possible.
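
How {rgx-2} behaves on running text can be checked outside AntConc. The sketch below runs the pattern unchanged through Python's `re` module, which is an assumption made for illustration (AntConc's engine may differ); the helper name `proximity_hits` is hypothetical, and the passages are the ones quoted in [8].

```python
import re

# {rgx-2} from the post, split over two lines for readability only.
RGX_2 = re.compile(
    r"(?i)sel[\w]*\W+(?:\w+\W+){0,20}?conscious[\w]*"
    r"|conscious[\w]*\W+(?:\w+\W+){0,20}?sel[\w]*"
)

def proximity_hits(text):
    """Return every non-overlapping stretch of text matched by {rgx-2}."""
    return [m.group(0) for m in RGX_2.finditer(text)]

passages = [
    "a world which yields him no true self-consciousness",
    "double-consciousness, this sense of always looking at one's self",
    "more deeply self-critical, more conscious of its power",
]
for passage in passages:
    print(proximity_hits(passage))
```

The lazy quantifier `{0,20}?` keeps each hit as short as possible, so the match ends at the first qualifying word rather than the last one within range.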

[8]  With {rgx-1} I find 77 matches in v3.5.9.  With {rgx-2} I locate 74 matches in v3.5.9.  For example, I am able to locate in the corpus:
(a) "self-conscious" and "self-consciousness"
(b) "double-consciousness, this sense of always looking at one's self"
(c) Matches like "more deeply self-critical, more conscious of its power" help me to examine self-critical reflection in relation to the awareness of humans to the power of their concerted actions [one of my research areas.]
(d) However, for {rgx-1} and {rgx-2} v4 indicates:  "No hits found!".

[9]  In addition, I often utilize "one-directional" regexes, such as:
    {rgx-3}  (?i)sel(?:.){0,50}?conscious
    {rgx-4}  (?i)conscious(?:.){0,50}?sel
    {rgx-5}  (?i)sel[\w]*\W+(?:\w+\W+){0,20}?conscious[\w]*

[10]  AntConc v3.5.9 finds various matches for all three regexes.
(a) However, v4 finds one match for {rgx-3}: "selfconscious" [no space, dash, or hyphen].  I believe that this is a typographical error in the document.
(b) For {rgx-4} and {rgx-5} AntConc v4 informs me:  "No hits found!".

[11]  Moreover, in v4 the following regex (with a blank space between the words)
    {rgx-6}  self conscious  
matches "self conscious", "self consciousness", "self-conscious", and "self-consciousness" with both a space and a dash separating the two words.  In v3.5.9 {rgx-6} does not match "self-conscious" because the node word contains a dash.  

[12]  The regex
    {rgx-7}  (?i)self[-\s]conscious
works as expected in v3.5.9 and finds 32 hits with either a dash or a space character.  In AntConc v4.0: "No hits found!" (with or without the case-insensitive mode modifier).  

Is there something that I am doing wrong?  Is there something about the newest AntConc version that I do not understand with regard to applying regular expressions?

I greatly appreciate your assistance.  Have a great day.
Robert Williams

www.webdubois.org 

Laurence Anthony

Jan 2, 2022, 7:45:33 PM
to ant...@googlegroups.com
Hi Robert,

Thank you for the comments. What is clear is that I need to improve the documentation when it comes to regex.

Perhaps all your issues can be addressed by understanding the following points. These relate to your baseline in [5].

1. Regex in AntConc 4 is fundamentally different from that in AntConc 3, but all searches can be achieved with the exception of some rare backreference searches.

2. Regex in AntConc 4 works at the token level. This is the most important thing to understand and is probably the main cause of the differences you find.

3. Regex case flags are now set by the case option in the search box, which is why it doesn't become disabled as it did in AntConc 3. My guess is that this is the second main cause of your differences.

Can you consider the above, try your regex again, and see if you continue to have problems?

Note that in almost all cases, the changes to your regex actually make them much simpler.
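
Points 2 and 3 can be pictured with a toy model. The sketch below is only an illustration in Python of what "regex at the token level" could mean: the letters-only token definition, the per-token `search`, and the function names are all assumptions, not AntConc's actual implementation.

```python
import re

def tokenize(text):
    """Keep runs of letters as tokens; spaces, hyphens and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", text)

def token_level_search(text, query, ignore_case=True):
    """Apply the regex to each token separately and return the matching tokens.

    ignore_case stands in for a case option set outside the pattern.
    """
    rx = re.compile(query, re.IGNORECASE if ignore_case else 0)
    return [tok for tok in tokenize(text) if rx.search(tok)]

text = "He grew self-conscious, even selfconscious, about co-operation."
print(tokenize(text))
print(token_level_search(text, r"self[-\s]conscious"))    # no token holds a hyphen or space
print(token_level_search(text, r"sel.{0,50}?conscious"))  # only the unhyphenated token
print(token_level_search(text, r"conscious"))
```

Under this model a pattern that needs a hyphen, a space, or a gap of several words can never succeed, because no single token contains those characters. That is consistent with the "No hits found!" results and the lone "selfconscious" hit reported earlier in the thread.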

Regards,

Laurence







Laurence Anthony

Jan 2, 2022, 8:11:43 PM
to ant...@googlegroups.com
Robert,

Just one more comment. I notice that many of your regex searches are incorrect and will produce hits that are unintended. In particular, your searches don't have any boundary markers. You need these in AntConc 4, and especially AntConc 3, where the entire file is the target of the search.
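
The effect of boundary markers can be shown with a short sketch. It uses Python's `re` module on an invented sentence, so it illustrates the general regex behaviour rather than anything specific to AntConc.

```python
import re

# Invented sample sentence.
text = "He spoke of himself, of selfless work, and of the self."

# Without boundary markers the pattern matches inside longer words.
unbounded = re.findall(r"\w*sel\w*", text)

# \b anchors the match to word edges, so only the standalone word is found.
bounded = re.findall(r"\bself\b", text)

print(unbounded)
print(bounded)
```

Whether the wider set of hits is unintended depends on the purpose of the search; the unbounded form is the one to use when variants such as "himself" are wanted.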

Laurence




R. Williams

Jan 4, 2022, 10:14:18 PM
to AntConc-Discussion
Laurence,
[1]  Thank you for your replies.  I do appreciate the time that you have taken to address my comments.  Herein, I also will address your comments on my lack of word boundaries in my examples of regexes.

[2]  A Question: Unfortunately, I am not certain about the "token level" that you mention in your first response.  Does this refer to a word as a whole?

[3]  I am sorry to say that I am still encountering serious difficulties with regexes in AntConc v4.  My experience with regex creation is based on various books, including one that I consult often: Regular Expressions Cookbook, 2nd Edition, by Jan Goyvaerts & Steven Levithan (Sebastopol, CA: O'Reilly Media, 2012).  Moreover, I test my regexes with the software RegexBuddy (coded by Goyvaerts) and they work in numerous regex engines.  As a consequence, I am not certain what to do.  In the spirit of improving, allow me to provide several examples of regexes that I have used.


Regex for a Single Word with Different Spellings 
[4]  In my corpus of Du Bois texts, "cooperation" has 3 correct spellings: "cooperation", "co-operation", and the older "coöperation".  In v3.5.9 this regex finds all 3 spellings:
    {rgx-1}  (?i)co-?[oö]peration
But in v4, using {rgx-1}, only "cooperation" and "coöperation" are matched, not "co-operation".
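
The v3.5.9 result for {rgx-1} is what a whole-text regex engine would be expected to give. A minimal sketch in Python follows; the NFC normalisation step is an added assumption, included because a decomposed "ö" (an "o" followed by a combining diaeresis) would not be matched by the character class. The helper name `spelling_variants` is hypothetical.

```python
import re
import unicodedata

# {rgx-1} from the post; \u00f6 is the precomposed letter "ö".
RGX_1 = re.compile(r"(?i)co-?[o\u00f6]peration")

def spelling_variants(text):
    """Return every spelling of the word found in the text."""
    # Normalise to NFC so that "ö" is a single code point the class can match.
    text = unicodedata.normalize("NFC", text)
    return [m.group(0) for m in RGX_1.finditer(text)]

print(spelling_variants("Cooperation, co-operation and co\u00f6peration all occur."))
```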

[5]  In v4 I tried a regex using alternation:
    {rgx-2}  co-operation|cooperation|coöperation
which in v4 still does not locate "co-operation" with the dash.  In my Du Bois corpus there are over 160 instances of "co-operation".

[6]  The only regex that worked in v4 was
    {rgx-3}  co operation
without the dash, which was able to locate "co-operation" with the dash.  As I expected, {rgx-3} did not locate "cooperation" or "coöperation".

[7]  A regex that works in AntConc v4 is:
    {rgx-4}  self conscious
This regex without the dash will find "self-conscious" and "self-consciousness", both with the dash, as well as "self conscious" itself.

[8]  A Question: Is that what you mean by token in the regex implementation in AntConc 4?  Namely, that the two search terms [node words] are matched regardless of the punctuation or blank space that may lie between them?
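
One reading of the behaviour in [6] and [7] is that a query containing a space is split into parts, and each part is matched against consecutive tokens, whatever separated them in the source text. The Python sketch below models that reading only; the letters-only token definition and the function names are assumptions, and nothing here is confirmed as AntConc's implementation.

```python
import re

def tokenize(text):
    """Letters-only tokens: hyphens, spaces and punctuation are not kept."""
    return re.findall(r"[^\W\d_]+", text)

def phrase_search(text, query):
    """Match each space-separated query part against consecutive tokens."""
    parts = [re.compile(p, re.IGNORECASE) for p in query.split()]
    toks = tokenize(text)
    hits = []
    for i in range(len(toks) - len(parts) + 1):
        window = toks[i:i + len(parts)]
        if all(rx.search(tok) for rx, tok in zip(parts, window)):
            hits.append(" ".join(window))
    return hits

print(phrase_search("self-conscious and self consciousness, but cooperation", "self conscious"))
print(phrase_search("co-operation and cooperation", "co operation"))
```

In this model "co operation" finds the hyphenated form because the hyphen was discarded during tokenisation, and it misses "cooperation" because that spelling is a single token.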


Proximity-Oriented Regular Expressions 
[9]  I now move to proximity-oriented regexes, which are a necessary component of my political-theoretical projects.  With proximity regexes I am seeking the manifold variation of Du Bois's ideas as he potentially ramified them across the hundreds of his writings.  I am not seeking collocations.

[10]  A Request: In AntConc 4, how would I locate, for example, "self" and its variants ("ourselves", "himself") in relation to "conscious" and its variants ("unconscious", "consciously") over a gap of 20-100 words or of 1-400 characters?  Might you please provide a sample regex?  
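
The request in [10] can also be stated as a token-distance problem. The following Python sketch works on raw text outside AntConc and is not an AntConc 4 query; the function name `proximity_pairs`, the letters-only token definition, and the default gap of 20 tokens are assumptions chosen for illustration.

```python
import re

def proximity_pairs(text, left, right, max_gap=20):
    """Return (left_token, right_token, gap) for pairs at most max_gap tokens apart.

    The two tokens may occur in either order; gap counts the tokens between them.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    left_rx = re.compile(left, re.IGNORECASE)
    right_rx = re.compile(right, re.IGNORECASE)
    left_idx = [i for i, t in enumerate(tokens) if left_rx.search(t)]
    right_idx = [i for i, t in enumerate(tokens) if right_rx.search(t)]
    pairs = []
    for i in left_idx:
        for j in right_idx:
            gap = abs(i - j) - 1
            if i != j and gap <= max_gap:
                pairs.append((tokens[i], tokens[j], gap))
    return pairs

passage = (
    "If you count yourselves as something more than your money, why may not I?\n"
    "\n"
    "To induce, then, in men a consciousness of the humanity of all men,"
)
print(proximity_pairs(passage, r"sel", r"conscious"))
print(proximity_pairs("more deeply self-critical, more conscious of its power", r"sel", r"conscious"))
```

Because the distance is counted in tokens, sentence and paragraph breaks play no part in the comparison.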

[11]  Any regex would need to be able to find the designated search terms [above] in the following passages from my corpus (such as I can do in v3.5.9):
    {quot-1}  "a world which yields him no true self-consciousness"
    {quot-2}  "more deeply self-critical, more conscious of its power"
    {quot-3}   {¶ } [....] If you count yourselves as something more than your money, why may not I?  {¶}  To induce, then, in men a consciousness of the humanity of all men,[....]
I can provide other examples from my interpretive research. 

[12]  The passage presented in {quot-3}, which crosses both sentence and paragraph boundaries, is salient to my research: who has consciousness and what is the content of that consciousness.  It could not be matched by boundaries set to locate only "self" or "self-conscious".  The "then" connects the two paragraphs, and thereby relates the essay's audience to the idea of the humanity embodied in all people, which is (I would argue) Du Bois's goal of the essay.

[13]  The following proximity regexes work in v3.5.9 and will locate those quotations listed above:
    {rgx-5}  (?i)sel(?:.){0,100}?conscious
    {rgx-6}  (?i)sel[\w]*\W+(?:\w+\W+){0,20}?conscious[\w]*
These regexes appeared in my initial posting.
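
One difference between the two patterns shows up when a passage crosses a paragraph break, as {quot-3} does. The sketch below uses Python's `re` module and assumes the break is stored as newline characters; other engines, including the one in v3.5.9, may treat the dot differently.

```python
import re

RGX_5 = r"(?i)sel(?:.){0,100}?conscious"
RGX_6 = r"(?i)sel[\w]*\W+(?:\w+\W+){0,20}?conscious[\w]*"

# {quot-3} with the paragraph break written as a blank line (an assumption).
quot_3 = (
    "If you count yourselves as something more than your money, why may not I?\n"
    "\n"
    "To induce, then, in men a consciousness of the humanity of all men,"
)

dot_default = re.search(RGX_5, quot_3)         # "." stops at the line break
dot_all = re.search(RGX_5, quot_3, re.DOTALL)  # "." may now match "\n"
word_gap = re.search(RGX_6, quot_3)            # \W+ already matches "\n"

print(dot_default)
print(dot_all.group(0))
print(word_gap.group(0))
```

In Python the character-gap pattern needs the DOTALL flag to span the break, while the word-gap pattern spans it without any flag.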


Word Boundaries
[14]  Herein arises the importance of the strategic presence and absence of boundary markers in my regexes.  In my iterative research process, the early steps involve capacious search terms.  I will KWICly (but maybe not so quickly [sorry!]) examine the lists of matches to understand the range of possible words by which Du Bois expressed himself.  I also believe in research serendipity.  

[15]  With regard to "sel" in relation to "conscious", I am not only looking for "self" but also for other possibilities, such as "ourselves", "himself", "selfless", etc.  In the proximity-oriented regex {rgx-6}, ``sel``, which is to be located within 20 words of ``conscious``, seeks to match "self", "itself", "yourselves", "himself", etc.  Thus, ``sel`` or even ``sel[\w]*`` is an intentional part of my research strategy involving proximity regexes.

[16]  Moreover, in order to examine what Du Bois wrote in relation to "conscious" or "consciousness" I do not want initially to exclude "unconscious", "half-conscious", or "consciously".

[17]  As a next step in my research flow, I may then refine my new regex searches accordingly -- perhaps to include boundary markers to frame the search \bword\b.  In short, I include, or not, word boundaries as a part of a larger interpretive strategy to understand the plenitude and nuances of Du Bois's voices in the words and ideas of his writings.


Matching a String within Sentences
[18]  As an example of another regex useful for my research: sometimes I wish to identify the entire sentence in which a match is found.  To that end the following regex works in v3.5.9, but there are "No hits found!" in v4.
    {rgx-7}  (?i)[^\.?!]+self[\s-]conscious[\w]*.*?[\.?!]
I do not specify word boundaries because this regex, when applied to my corpus, only matches the desired phrases.  I would like to perform this regex search in v4.
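
For reference, {rgx-7} behaves as follows in a whole-text engine. This is a minimal sketch with Python's `re` module and an invented sample; the helper name `sentences_with_match` is hypothetical.

```python
import re

# {rgx-7} from the post, unchanged.
RGX_7 = re.compile(r"(?i)[^\.?!]+self[\s-]conscious[\w]*.*?[\.?!]")

def sentences_with_match(text):
    """Return each sentence containing the phrase, with surrounding space trimmed."""
    return [m.group(0).strip() for m in RGX_7.finditer(text)]

# Invented sample text.
text = "He waited. The nation grew self-conscious and proud! Then it rested."
print(sentences_with_match(text))
```

The pattern treats every ".", "?" or "!" as a sentence end, so an abbreviation such as "W.E.B." in the same sentence would cut the match short. It also requires at least one character before "self", so a sentence at the very start of the text that begins with the phrase would be missed.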


In Closing
[19]  Another Request: Perhaps you could point me to documentation on the regex flavor used in AntConc 4?  In addition, sample regexes greatly aid my comprehension and would help me in crafting regexes appropriate to v4.

[20]  I thank you for your work on AntConc and for your assistance.  My posting is long, because proximity regexes figure extensively in my academic research.  Indeed, I presented at 3 academic conferences in 2021 utilizing AntConc v3.5.9 and various regexes as part of my process of interpreting Du Bois's ideas (at <www.webdubois.org>).  I am planning to publish academic writings -- hopefully sooner rather than later -- in which I wish to foreground AntConc and my proximity regexes by applying them to my corpus of Du Bois writings.

Ciao.
Robert

R. Williams

Jan 4, 2022, 10:15:17 PM
to AntConc-Discussion
Laurence, 
As you indicated, I did not insert boundary markers in most of the regexes presented in my initial posting (on 1-2-22).

I discussed my use of the strategic presence and absence of boundary markers in my response posting of 1-4-22 (U.S. East Coast date).

Ciao.
Robert

Emma Goldsmith

Feb 6, 2022, 9:28:32 AM
to AntConc-Discussion
Hi Laurence,
I've arrived here trying to find out how to locate a hyphenated word in my corpus: non-valvular.
In legacy version 3.5.9, I get hits for non-valvular by checking either Word or Regex. In version 4, I get "No hits found" with the same settings. The only way to get results is by entering non valvular, but then I get everything (non valvular, non-valvular and nonvalvular).
How can I find non-valvular and nothing else?
Thanks,
Emma

Laurence Anthony

Feb 6, 2022, 11:06:04 AM
to ant...@googlegroups.com
Hi Emma,

This is another change in AntConc 4.0, which is related to the database backend. In the new release (for now), you can only search for tokens in the corpus (not non-tokens). So, if you want to find hyphens, you need to include them as part of the token definition when building the corpus.

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################


Emma Goldsmith

Feb 6, 2022, 2:47:08 PM
to AntConc-Discussion
Many thanks, Laurence. I've now added hyphens to the token definition and I'm getting results with hyphenated words. I also added apostrophes but that seems to return other rogue characters as part of the token. Not to worry, though, I'll go on testing. 
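
The effect of changing the token definition can be sketched in a few lines of Python. This is a toy model, not AntConc's tokeniser: the function `tokenize`, its `extra_chars` parameter, and the sample sentence are hypothetical. The last call shows one way in which allowing apostrophes might attach unwanted characters to tokens, namely when the same character is used as a quotation mark; that is a guess, not a confirmed cause of the result described above.

```python
import re

def tokenize(text, extra_chars=""):
    """Tokens are runs of letters plus any extra characters allowed by the definition."""
    allowed = r"[^\W\d_]"
    if extra_chars:
        allowed += "|[" + re.escape(extra_chars) + "]"
    return re.findall("(?:" + allowed + ")+", text)

# Invented sample sentence.
text = "Patients with non-valvular AF, nonvalvular AF or 'non valvular' disease."

print(tokenize(text))        # letters only: "non" and "valvular" are separate tokens
print(tokenize(text, "-"))   # hyphen allowed: "non-valvular" is one token
print(tokenize(text, "-'"))  # apostrophe allowed: quote marks cling to tokens
```

With the hyphen in the definition, a search for the exact token "non-valvular" no longer returns "non valvular" or "nonvalvular".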