Help with regular expressions

Grafstrom

Dec 19, 2001, 12:51:59 PM

Hi,

I'm having some problems with the following regsub expression. Given a list
of words, e.g. "the a an of and", I wish to remove these words from a
sentence. How should I do it?

For example, "the quick brown fox and the lazy dog" will become "quick brown fox
lazy dog" after the substitution.

I tried

regsub -all -nocase "( (the|a|an|of) )" $text $replaced

but it did not work for all expressions. Words at the beginning and at the
end of the sentence would not be replaced. Also, if two of the words
appeared side by side, like "and the" in the above sentence, only one word
gets replaced.

Thanx for any help.

Regards,
Grafstrom

miguel sofer

Dec 19, 2001, 1:18:23 PM

Try this:

set re {(?:^|\s)(?:the|a|an|of)(?:$|\s)}
regsub -all -nocase $re $text { } newText
set newText [string trim $newText]

You were requesting that the word to be substituted have a space both
before and after; this here requests (either a space or string start)
before the word, and also (either a space or string end) afterwards. The
selected words are replaced by a space, which may add a space at the
start and another one at the end - hence the trimming.
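
As a quick check of the idea above, here is a minimal self-contained sketch. The proc name removeWords and the sample sentence are made up for illustration; only the pattern comes from the lines above.

```tcl
# Hypothetical helper; the name is an assumption, not part of the thread.
proc removeWords {text} {
    # A listed word preceded by string start or whitespace and
    # followed by string end or whitespace.
    set re {(?:^|\s)(?:the|a|an|of)(?:$|\s)}
    # Each match, delimiters included, is replaced by a single space.
    regsub -all -nocase -- $re $text { } newText
    # The replacement can leave a space at either end, so trim it.
    return [string trim $newText]
}

puts [removeWords "The quick brown fox of"]   ;# prints: quick brown fox
```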

Punctuation and line breaks may also have to be taken care of ...

Hope this helps
Miguel

Grafstrom

Dec 19, 2001, 3:42:19 PM

Thanx for the answer, but there's still one more problem with the solution.

Let's say $text = "fox and the lazy dog"

Using the RE you suggested, only "and" will be removed, and the resultant
$text becomes "fox the lazy dog". How may the RE be changed such that both
"and" and "the" are removed?
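
The behaviour can be reproduced with the short script below. It assumes "and" has been added to the word list (the RE as posted does not contain it), and the variable names are only for illustration. A likely explanation is that regsub -all never rescans text it has already consumed: the space after "and" is part of the first match, so "the" is no longer preceded by whitespace or by the start of the string.

```tcl
# Assumption: same pattern as suggested, with "and" added to the word list.
set re {(?:^|\s)(?:the|a|an|of|and)(?:$|\s)}
set text "fox and the lazy dog"

# The first match is " and " (both spaces included). Scanning resumes at
# "the", which has neither whitespace nor the string start in front of it.
set count [regsub -all -nocase -- $re $text { } newText]
set newText [string trim $newText]

puts $count     ;# prints: 1
puts $newText   ;# prints: fox the lazy dog
```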

Regards,
Grafstrom

"miguel sofer" <m...@utdt.edu> wrote in message
news:3C20D9EF...@utdt.edu...

rand mair fheal

Dec 19, 2001, 7:48:18 PM

In article <191C91BDFE8ED411B844...@pfs21.ex.nus.edu.sg>,

Laurent Riesterer

Dec 20, 2001, 2:29:46 AM

> regsub -all -nocase "( (the|a|an|of) )" $text $replaced
> but it did not work for all expressions. Words at the beginning and at the
> end of the sentence would not be replaced. Also, if two of the words
> appeared side by side, like "and the" in the above sentence, only one word
> gets replaced.

Use the special escape characters to anchor the beginning and end of a
word:

regsub -all -- {\m()\M} $text {} result

You may need an additional step to collapse duplicate spaces:

regsub -all -- {\s+} $result { } result

To play with regexp and help you find the "magical expression", you can
use my little VisualREGEXP tool (http://laurent.riesterer.free.fr/regexp)

Laurent.

Laurent Riesterer

Dec 20, 2001, 2:35:56 AM

> regsub -all -- {\m()\M} $text {} result

I just forgot the pattern to match ...

regsub -all -- {\m(an?|the|of)\M} $text {} result
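
Put together, the word-boundary approach might look like the sketch below. The proc name stripWords is hypothetical, and "and" plus the -nocase switch are additions taken from the original question, not from the pattern above.

```tcl
# Hypothetical helper; the name and the extended word list are assumptions.
proc stripWords {text} {
    # \m and \M match at the start and end of a word without consuming
    # any characters, so adjacent listed words are matched independently.
    regsub -all -nocase -- {\m(?:a|an|and|the|of)\M} $text {} result
    # Collapse each run of whitespace left behind into a single space.
    regsub -all -- {\s+} $result { } result
    # Drop the space that may remain at either end.
    return [string trim $result]
}

puts [stripWords "the quick brown fox and the lazy dog"]
;# prints: quick brown fox lazy dog
```

Because the boundaries are zero-width, the spaces around each word stay in the string, which is why the whitespace clean-up is done as a separate step here.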

Michael A. Cleverly

Dec 26, 2001, 12:59:18 AM

Laurent Riesterer points out the special escape characters to anchor at
the beginning and ending of a word; it can be combined with a trick to
remove the extra white space in one fell swoop:

% set RE {(\s)?\s*\m(?:the|of|a|and?)\M\s*}
(\s)?\s*\m(?:the|of|a|and?)\M\s*
% set text "the quick brown fox and the lazy dog"

the quick brown fox and the lazy dog

% regsub -all -nocase -- $RE $text {\1} replacement
3
% set replacement