regular expressions

Agnieszka Leńko-Szymańska

Mar 23, 2015, 6:24:57 PM
to antword...@googlegroups.com
Dear Laurence and others,

I have just started to work with AntWordProfiler, so first of all a big thank you to Laurence for making it available to the research community.

I need some help with regular expressions in the settings. My customised wordlists (based on unlemmatised BNC frequency lists) contain hyphenated words, words with an apostrophe, abbreviations with a full stop at the end and occasionally alternatives such as and/or. I understand I need to change the settings for the token definition, but my knowledge of Unicode regular expressions is not sufficient. Could anyone advise me on how I can change the default settings to include the characters I mentioned above?

Thanks in advance.

Agnieszka

Laurence Anthony

Mar 23, 2015, 6:33:33 PM
to antword...@googlegroups.com
Hi,

From what you say, the following would probably work:

[a-zA-Z-'./]+

If you have AntConc, you can load a text into it and then use the File View tool to view the text and test out regular expressions to make sure they highlight what you hope to find.
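For readers without AntConc to hand, the same kind of check can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses Python's `re` module, which may not behave identically to the regex engine inside AntWordProfiler, and the `tokenize` helper and the sample sentence are made up for the example.

```python
import re

# Token definition suggested above. The hyphen sits between a range (A-Z)
# and a literal apostrophe, so Python's re module reads it as a literal hyphen.
TOKEN = re.compile(r"[a-zA-Z-'./]+")


def tokenize(text):
    """Return every substring matched by the token definition (hypothetical helper)."""
    return TOKEN.findall(text)


# Made-up sample covering hyphens, a slash, an apostrophe, and an abbreviation.
sample = "A well-known and/or don't e.g. case"
print(tokenize(sample))
```

Each hyphenated word, the and/or alternative, the contraction, and the abbreviation come back as single tokens, which is the behaviour the File View check is meant to confirm.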

I hope that helps.

Laurence.

--
You received this message because you are subscribed to the Google Groups "AntWordProfiler-Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antwordprofil...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Agnieszka Leńko-Szymańska

Apr 6, 2015, 7:13:54 PM
to antword...@googlegroups.com
Laurence,

Apologies for a delayed reply.

Thank you for your suggestion. It worked fine with two exceptions. The slash seemed to upset the process of reading the baselists (they were all reported as containing 0 items). But since the slash was not that important in my baselists, I just got rid of it. The full stops did not quite work (this was my logical error) because once they were allowed as possible characters within a word, all words at the ends of sentences were interpreted as separate types.
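The full-stop effect described above can be reproduced with a small Python sketch. It is an illustration only: Python's `re` module stands in for AntWordProfiler's own engine, the `count_types` helper and the sample text are invented, and lowercasing tokens before counting is an assumption made for the example.

```python
import re
from collections import Counter


def count_types(pattern, text):
    """Count word types for a given token definition (hypothetical helper).

    Tokens are lowercased before counting; that is an assumption of this sketch.
    """
    return Counter(token.lower() for token in re.findall(pattern, text))


text = "I like cats. Cats like me."

# Token definition that allows a full stop inside a word (slash already dropped).
with_stop = count_types(r"[a-zA-Z-'.]+", text)

# Same definition without the full stop.
without_stop = count_types(r"[a-zA-Z-']+", text)

print(sorted(with_stop))     # 'cats' and 'cats.' are counted as different types
print(sorted(without_stop))  # a single 'cats' type
```

With the full stop allowed, the sentence-final "cats." is a different type from "cats", so it would not match a baselist entry for "cats".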

I managed to find a way around these problems by modifying my baselists and used AntWordProfiler successfully for my research. I am just reporting them for other readers who may face the same questions in their research.

One suggestion (if I may) for a future update.

I generated batch results for 120 files and used your suggestion from an earlier discussion about copying results into Excel. What made my life really difficult was the fact that the levels with zero hits don't get reported in the statistics. I could not use a simple 'copy and paste' function and had to reintroduce the missing levels manually in order to produce a table with all the results.
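Until zero-hit levels are reported by the tool itself, the gaps can be filled in before pasting into a spreadsheet. The Python sketch below is only one possible workaround and rests on assumptions: the level names, file names, and counts are all hypothetical, and it presumes the per-file counts have already been read into dictionaries, since the thread does not describe the output format.

```python
# Hypothetical level names; replace with the levels of your own baselists.
ALL_LEVELS = ["level_1", "level_2", "level_3", "off_list"]


def fill_missing_levels(counts, all_levels=ALL_LEVELS):
    """Return one count per level, using 0 for any level missing from counts."""
    return [counts.get(level, 0) for level in all_levels]


# Made-up per-file results in which zero-hit levels are simply absent.
results = {
    "file_001.txt": {"level_1": 120, "level_3": 4},
    "file_002.txt": {"level_1": 98, "level_2": 15, "off_list": 7},
}

table = {name: fill_missing_levels(counts) for name, counts in results.items()}

# Tab-separated rows with the same columns for every file.
print("file", *ALL_LEVELS, sep="\t")
for name, row in table.items():
    print(name, *row, sep="\t")
```

Because every row now has the same columns in the same order, the output can be copied into a single table without reinserting levels by hand.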


Many thanks again for creating AntWordProfiler and making it available to us.


Best

Agnieszka

Laurence Anthony

Apr 11, 2015, 7:26:19 AM
to antword...@googlegroups.com
Hi,

Glad to see you managed to get things working.

>What made my life really difficult was the fact that the levels with zero hits don't get reported in the statistics. 

Good point! I should address this. I'm currently working on a major new release of AntWordProfiler. I can certainly make sure this 'feature' is included.

Laurence.



syarifuddin al menaki

Mar 20, 2017, 5:33:13 AM
to AntWordProfiler-Discussion
This regex may be helpful:

\b[a-zA-Z0-9]+(?:[-'.]?[a-zA-Z0-9]+)*\b

which is described as:
\b                  # word boundary
[a-zA-Z0-9]+        # one or more letters or digits
(?:                 # start of non-capturing group
  [-'.]?            # zero or one hyphen, apostrophe, or full stop
  [a-zA-Z0-9]+      # one or more letters or digits
)*                  # end of non-capturing group, repeated zero or more times
\b                  # word boundary
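
The commented form above maps directly onto a verbose-mode pattern. The following Python sketch is an illustration, assuming Python's `re` module rather than AntWordProfiler's own engine; the `TOKEN` name and the sample sentence are invented for the example.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments can stay inline.
TOKEN = re.compile(
    r"""
    \b                  # word boundary
    [a-zA-Z0-9]+        # one or more letters or digits
    (?:                 # start of non-capturing group
        [-'.]?          #   zero or one hyphen, apostrophe, or full stop
        [a-zA-Z0-9]+    #   one or more letters or digits
    )*                  # end of group, repeated zero or more times
    \b                  # word boundary
    """,
    re.VERBOSE,
)

sample = "The well-known e.g. case doesn't end."
print(TOKEN.findall(sample))
```

Because every match must end in a letter or digit, a sentence-final full stop is left out of the token, which avoids the separate-types problem reported earlier in the thread. The trade-off is that an abbreviation such as "e.g." is matched without its final full stop, and the slash is not covered at all, so "and/or" would be split into two tokens.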