gensub() question

mss

Jan 13, 2011, 1:06:07 PM

How does one delete duplicate adjacent words
(leaving only one occurrence of the doubled word)
using regex in gawk?

before: "the the"

after: "the"

These fail:

echo "the the" | gawk '{print gensub(/\<([a-z]+) +\\1\>/, "\\1", "")}'

echo "the the" | gawk '{print gensub(/\<([a-z]+) +\1\>/, "\\1", "")}'

--
later on,
Mike

http://www.topcat.hypermart.net/index.html

Janis Papanagnou

Jan 13, 2011, 1:33:27 PM

On 13.01.2011 19:06, mss wrote:
> How does one delete duplicate adjacent words
> (leaving only one occurrence of the doubled word)
> using regex in gawk?
>
>
> before: "the the"
>
> after: "the"
>
>
> These fail:
>
> echo "the the" | gawk '{print gensub(/\<([a-z]+) +\\1\>/, "\\1", "")}'
>
> echo "the the" | gawk '{print gensub(/\<([a-z]+) +\1\>/, "\\1", "")}'
>
>

According to the gawk manual, components of the regexp may be referenced
in the _replacement_ text, so back-references within the regexp itself do
not seem to exist in gawk.

Note, BTW, that back-references, even though they are supported by some
"regexp" parser implementations, exceed the class of regular expressions.

A workaround is to iterate over the fields and compare adjacent ones:

{ for (i=1; i<NF; i++) if ($i == $(i+1)) $i = "" ; print }

with the usual caveat that you change the white space characters that
way, so the above may need some more tweaks. The same problem arises if
you print the fields (conditionally); the white space between fields may
then change depending on the data.
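
A minimal sketch of that field-comparison idea, assuming it is acceptable for runs of white space to collapse to single spaces. The function name dedupe_fields is made up for illustration. It rebuilds the line from its fields and skips any field equal to the one before it, so no empty fields are left behind:

```shell
# Hypothetical helper: drop a field when it equals the previous field.
# Assumes default field splitting; output fields are joined by one space.
dedupe_fields() {
  awk '{
    out = ""
    prev = ""
    for (i = 1; i <= NF; i++) {
      if (i > 1 && $i == prev)
        continue                      # adjacent duplicate: skip it
      out = (out == "" ? $i : out " " $i)
      prev = $i
    }
    print out
  }'
}

echo "the the cat sat sat sat down" | dedupe_fields
# prints: the cat sat down
```

Comparison is exact, so "The the" and "the, the" are not treated as duplicates here.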

Janis

Ed Morton

Jan 13, 2011, 2:31:41 PM

How do you define a "word", how do you define "adjacent", and does
case matter?

For example:

a) Are "it's", "ill-mannered", and "Ke$ha" each one word, two words, or
something else?
b) Are "the-the", "the, the", and "The THE" examples of adjacent
"the"s or something else?

Ed.

mss

Jan 13, 2011, 3:59:26 PM

Janis Papanagnou wrote:

> According to the gawk manual, components of the regexp may be referenced
> in the _replacement_ text, so back-references within the regexp itself do
> not seem to exist in gawk.
>
> Note, BTW, that back-references, even though they are supported by some
> "regexp" parser implementations, exceed the class of regular expressions.
>
> A workaround is to iterate over the fields and compare adjacent ones:
>
> { for (i=1; i<NF; i++) if ($i == $(i+1)) $i = "" ; print }
>
> with the usual caveat that you change the white space characters that
> way, so the above may need some more tweaks. The same problem arises if
> you print the fields (conditionally); the white space between fields may
> then change depending on the data.

Thanks for the help Janis, appreciate it.

Yes it seems the replacement parameter of
gensub() can make use of backreferences.

mss

Jan 13, 2011, 4:17:34 PM

Ed Morton wrote:

> How do you define a "word", how do you define "adjacent", and does
> case matter?
>
> For example:
>
> a) Are "it's", "ill-mannered", and "Ke$ha" each one word, two words, or
> something else?
> b) Are "the-the", "the, the", and "The THE" examples of adjacent
> "the"s or something else?

Well, those are good questions Ed. I honestly haven't yet defined
what all a word might encompass...

I've been studying Jeffrey Friedl's 'Mastering Regular Expressions'
and was hoping to apply the lessons contained therein to gawk
(understanding on this end of course that some modification would
be necessary), and one of the examples tackled doubled-words like
'the the'.

It's easy enough to do in native awk by simply comparing fields
(as Janis also notes):

if ($2 == $1) {etc...}

But then the question of backreferences came to mind.

Ed Morton

Jan 13, 2011, 8:47:16 PM

Got it. Yes, unfortunately awk doesn't support backreferences in the matching RE
so in non-gawk you'd have to do something like:

match($0,/[a-z]+/)

then use substr() to save the matched word, truncate $0 with substr()
again, call match() on the remainder, and so on.
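
A rough sketch of that match()/substr() loop in POSIX awk, assuming a "word" is a run of lowercase letters and "adjacent" means separated only by spaces (the same assumptions as the regex in the original post). The function name dedupe_words and the variables rest, gap, word and prev are made up for illustration. Unlike the field-based approach, it leaves all other white space and punctuation untouched:

```shell
# Hypothetical helper: walk the line word by word with match() and substr().
dedupe_words() {
  awk '{
    rest = $0; out = ""; prev = ""
    while (match(rest, /[a-z]+/)) {
      gap  = substr(rest, 1, RSTART - 1)     # text before the word
      word = substr(rest, RSTART, RLENGTH)   # the word itself
      rest = substr(rest, RSTART + RLENGTH)  # truncate for the next pass
      if (word == prev && gap ~ /^ +$/)
        continue                             # doubled word: drop gap and word
      out = out gap word
      prev = word
    }
    print out rest
  }'
}

echo "say the  the, cat" | dedupe_words
# prints: say the, cat
```

Because only spaces count as a gap between duplicates, "the, the" is left alone.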

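In a newer version of gawk you could get cute with the split() function's new
gawk-specific argument that's an array of field separators. Something like this:

# With the words as "separators", words[] receives each matched word
c = split($0, betweenWords, /[a-z]+/, words)
# words[] has c-1 elements, so the last pair to compare is (c-2, c-1)
for (i = 1; i < c - 1; i++)
    if (words[i] == words[i+1])
        print "found dup", words[i]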

Regards,

Ed.

Sam Trenholme

Jan 13, 2011, 9:01:21 PM

>How does one delete duplicate adjacent words
>(leaving only one occurrence of the doubled word)
>using regex in gawk?

I believe that awk's regex machine always compiles to what is called a
"finite state machine", and what you are asking for cannot be done with
an FSM (it's almost identical to the textbook "match on 'ab', 'aabb',
'aaabbb', 'aaaabbbb', etc." example of what an FSM cannot match against).

- Sam

--
#Sam Trenholme http://samiam.org -- Usenet user since September 1993#
######## My email address is at http://samiam.org/mailme.php ########
# The following script works around an annoyance in the Nano Editor #
cat | awk '{a=a $0 "\n";if($0 ~ /[a-zA-Z0-9]/){printf("%s",a);a=""}}'

Aharon Robbins

Jan 14, 2011, 7:15:56 AM

Both the dfa and regex code that gawk uses could support this feature
(they do for grep), but it's an extension that I chose not to add.

Technically, the point is correct, but the code can implement backrefs
for matching if one sets the right bit(s) when configuring the matchers.
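
For comparison, a small sketch of the grep side of this, assuming a grep that supports the widely available -w (whole-word) option. POSIX basic regular expressions allow \1 inside the pattern itself, so grep can at least detect a doubled word. The function name has_doubled_word is made up for illustration:

```shell
# Hypothetical helper: succeed if the argument contains a doubled lowercase word.
# \( \) captures a word, \{1,\} requires one or more spaces, \1 repeats the capture.
has_doubled_word() {
  printf '%s\n' "$1" | grep -qw '\([a-z][a-z]*\) \{1,\}\1'
}

if has_doubled_word "the the"; then
  echo "doubled word found"
fi
```

The -w option keeps "the them" from being reported as a doubled "the".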

In article <igoah2$fh2$1...@Milagro.leafnode.foo>,

--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Sam Trenholme

Jan 14, 2011, 9:06:07 PM

>Both the dfa and regex code that gawk uses could support this feature
>(they do for grep), but it's an extension that I chose not to add.
>
>Technically, the point is correct, but the code can implement backrefs
>for matching if one sets the right bit(s) when configuring the matchers.

You know, I wonder how specific POSIX is about regular expressions and
what regular expressions are OK or not (yes, I know, these days the
slightly older POSIX is readily available for anyone on the web to read
over at opengroup, but you have probably already gone over it).

Speaking of which, I would love to see POSIX updated. When I last
looked at it, it didn't have any networking beyond UUCP (which isn't
quite dead; I actually found an ISP that still offers UUCP--but they no
longer offer Usenet-over-UUCP). I would love to see POSIX commands for
modern TCP/IP networking, as well as a well defined interface for
configuring a firewall (Linux has changed this. Twice), connecting to a
wireless network [1], and what not.

In terms of AWK, I would like to see POSIX extend AWK with a
built-in sort [2], a defined interface for matching [A-Z] in a non-C
locale [3], and perhaps real multi-dimensional associative arrays if
we're really ambitious [4].

But POSIX, alas, is moribund. [5] I don't think we will ever see any
changes made to the AWK language become part of the standard at this
point.

- Sam

[1] The Linux command line interface is one way for WEP, and
something completely different for WPA. God help you if you want to
configure a wireless access point from the Linux command line.

[2] The correct way to specify this interface is to use the variable
WHINY_USERS. Any other name would be wrong.

[3] I still contend that [A-Z] should only match on uppercase in all AWK
implementations, or, barring that, a backwards-compatible command that
can be put in BEGIN that says "make the locale C", such as "C_LOCALE=1"

[4] People with problems that need multi-dimensional AAs to solve should
probably be using Perl or Python, though.

[5] The same, for better or for worse, can be said of Usenet, which is
slowly dying--many groups I loved in the mid-1990s and early 2000s are
now ghost towns. Very sad.

pk

Jan 15, 2011, 7:18:44 AM

On Sat, 15 Jan 2011 02:06:07 +0000 (UTC) Sam Trenholme
<sam-reads...@samiam.org> wrote:

> Speaking of which, I would love to see POSIX updated.

It was last updated in 2008.

> When I last looked at it, it didn't have any networking beyond UUCP
> (which isn't quite dead; I actually found an ISP that still offers
> UUCP--but they no longer offer Usenet-over-UUCP). I would love to see
> POSIX commands for modern TCP/IP networking, as well as a well defined
> interface for configuring a firewall (Linux has changed this. Twice),
> connecting to a wireless network [1], and what not.
>
> In terms of AWK, I would like to see POSIX extend AWK with a
> built-in sort [2], a defined interface for matching [A-Z] in a non-C
> locale [3], and perhaps real multi-dimensional associative arrays if
> we're really ambitious [4].
>
> But POSIX, alas, is moribund. [5]

Is it? By the looks of email volume and active bugs, it wouldn't seem so:

https://www.opengroup.org/sophocles/show_archive.tpl?CALLER=index.tpl&source=L&listname=austin-group-l

http://austingroupbugs.net/view_all_bug_page.php

> [3] I still contend that [A-Z] should only match on uppercase in all AWK
> implementations, or, barring that, a backwards-compatible command that
> can be put in BEGIN that says "make the locale C", such as "C_LOCALE=1"

You can just say LC_ALL=C awk etc.
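
As a small illustration from a POSIX shell, the variable assignment applies to that one awk invocation only. The function name upper_only and the sample input are made up for this sketch:

```shell
# Hypothetical helper: run awk in the C locale so [A-Z] means exactly A through Z.
upper_only() {
  LC_ALL=C awk '/^[A-Z]+$/'
}

printf 'ABC\nabc\nAbC\n' | upper_only
# prints: ABC
```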

Kenny McCormack

Jan 15, 2011, 8:42:13 AM

In article <igs4bc$nus$1...@speranza.aioe.org>, pk <p...@pk.invalid> wrote:
...

>> [3] I still contend that [A-Z] should only match on uppercase in all AWK
>> implementations, or, barring that, a backwards-compatible command that
>> can be put in BEGIN that says "make the locale C", such as "C_LOCALE=1"
>
>You can just say LC_ALL=C awk etc.

From the C:\> prompt?

I think the previous poster's point is valid - that there should be a way
to do it in the AWK language itself. Not a big deal, of course, as there
are always workarounds - but valid nevertheless.

--
Just for a change of pace, this sig is *not* an obscure reference to
comp.lang.c...