grep multiple characters in any order

jrya...@gmail.com

unread,

May 19, 2012, 1:47:30 PM5/19/12

to

Say I have a list of words.
I want to find all the words that contain _all_ the characters 'i n v e d t r'.
These characters can be in any order.

How would I use grep to accomplish this ?

anon...@coward.org

unread,

May 19, 2012, 3:11:10 PM5/19/12

to

Alan Curry

unread,

May 19, 2012, 4:33:27 PM5/19/12

to

In article <jp8r8e$r9p$1...@speranza.aioe.org>, <anon...@coward.org> wrote:
>On 2012-05-19, jrya...@gmail.com <jrya...@gmail.com> wrote:
>> Say I have a list of words.
>> I want to find all the words that contain _all_ the characters 'i n v
>e d t r'.
>> These characters can be in any order.
>>
>> How would I use grep to accomplish this ?
>
>Grep is not the best tool for the job. But it can be done. Something like:
>
> $ cat /usr/share/dict/words | tr ' ' '\n' | grep i | grep n | grep v |
> grep e | grep d | grep t | grep r

What's with the tr? Do you have multiple words per line in your wordlist?
I don't think that's normal.

Anyway, the multiple greps can be consolidated.

grep -Ev '^[^i]*$|^[^n]*$|^[^v]*$|^[^e]*$|^[^d]*$|^[^t]*$|^[^r]*$'

>
>awk can do it with a single expression. E.g. something like:
>
> $ cat /usr/share/dict/words | tr ' ' '\n' |
> awk '/i/ && /n/ && /v/ && /e/ && /d/ && /t/ && /r/'
>

Surprisingly, my double-reverse grep actually runs faster than the
simpler-looking awk script.

original-awk 0.34s user 0.01s system 98% cpu 0.355 total
gawk 0.20s user 0.01s system 97% cpu 0.212 total
mawk 0.08s user 0.01s system 94% cpu 0.093 total
grep (GNU) 0.04s user 0.02s system 90% cpu 0.062 total

This post started as an obfuscated-code joke, but I guess it ended up being
sort of a serious improvement.

--
Alan Curry

Kaz Kylheku

unread,

May 19, 2012, 5:29:31 PM5/19/12

to

On 2012-05-19, Alan Curry <pac...@kosh.dhis.org> wrote:
> In article <jp8r8e$r9p$1...@speranza.aioe.org>, <anon...@coward.org> wrote:
>>On 2012-05-19, jrya...@gmail.com <jrya...@gmail.com> wrote:
>>> Say I have a list of words.
>>> I want to find all the words that contain _all_ the characters 'i n v
>>e d t r'.
>>> These characters can be in any order.
>>>
>>> How would I use grep to accomplish this ?
>>
>>Grep is not the best tool for the job. But it can be done. Something like:
>>
>> $ cat /usr/share/dict/words | tr ' ' '\n' | grep i | grep n | grep v |
>> grep e | grep d | grep t | grep r
>
> What's with the tr? Do you have multiple words per line in your wordlist?
> I don't think that's normal.
>
> Anyway, the multiple greps can be consolidated.
>
> grep -Ev '^[^i]*$|^[^n]*$|^[^v]*$|^[^e]*$|^[^d]*$|^[^t]*$|^[^r]*$'

You can factor out that ^ $. The branches do not have to be individually
anchored; we just need one global instance of ^ $ wrapping to counteract the
regex search semantics of grep.

grep -Ev '^([^i]*|[^n]*|[^v]*|[^e]*|[^d]*|[^t]*|[^r]*)$'

>>
>>awk can do it with a single expression. E.g. something like:
>>
>> $ cat /usr/share/dict/words | tr ' ' '\n' |
>> awk '/i/ && /n/ && /v/ && /e/ && /d/ && /t/ && /r/'
>>
>
> Surprisingly, my double-reverse grep actually runs faster than the
> simpler-looking awk script.

The awk script has to try up to seven different regular expressions,
at various positions in each input line.

Grep has to search only with a single expression, and the anchoring may be
optimized; i.e. if every branch of the regex is anchored to ^, there
is the obvious optimization that you don't need to bother searching,
since the regex will fail at every position other than 0.

Janis Papanagnou

unread,

May 19, 2012, 7:00:32 PM5/19/12

to

In case you have the freedom to use something else than grep, want to avoid
one or many external processes, and use ksh as your shell, you can do all
within shell using ksh patterns[*], @(pat1&pat2&pat3&...), or specifically
for your sample, @(*i*&*n*&*v*&*e*&*d*&*t*&*r*)

while read -r word
do
case $word in
(@(*i*&*n*&*v*&*e*&*d*&*t*&*r*)) echo $word: match ;;
(*) echo $word: no match ;;
esac
done

Janis

[*] The @(...&...&...) patterns *might* be available in bash with extglob
option enabled; haven't checked bash, though.

Thomas 'PointedEars' Lahn

unread,

May 21, 2012, 1:51:05 PM5/21/12

to

Kaz Kylheku wrote:

> On 2012-05-19, Alan Curry <pac...@kosh.dhis.org> wrote:
>> In article <jp8r8e$r9p$1...@speranza.aioe.org>, <anon...@coward.org>
>> wrote:
>>>On 2012-05-19, jrya...@gmail.com <jrya...@gmail.com> wrote:
>>>> Say I have a list of words.
>>>> I want to find all the words that contain _all_ the characters 'i n v
>>>e d t r'.
>>>> These characters can be in any order.
>>>>
>>>> How would I use grep to accomplish this ?

>> […]

>> grep -Ev '^[^i]*$|^[^n]*$|^[^v]*$|^[^e]*$|^[^d]*$|^[^t]*$|^[^r]*$'
>
> You can factor out that ^ $. The branches do not have to be individually
> anchored; we just need one global instance of ^ $ wrapping to counteract
> the regex search semantics of grep.

s/can/must/
s/do not have to/must not/
s/just//

--
PointedEars

Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Alan Curry

unread,

May 21, 2012, 4:12:48 PM5/21/12

to

In article <2252105.u...@PointedEars.de>,

The original works fine. You *can* anchor the alternatives individually.

$ echo hello | grep -E '^h.*o$|^x.*x$|^y.*y$'
hello
$ echo hello | grep -E '^x.*x$|^h.*o$|^y.*y$'
hello
$ echo hello | grep -E '^x.*x$|^y.*y$|^h.*o$'
hello

Whether you *should* is not obvious. The parenthesized version is easier
to read. The reason I didn't use it originally is that I assumed it would be
slower because it implies creation of the \1 capture variable. Testing it
now, it runs exactly as fast as the equivalent without the parentheses, so
maybe the capturing is optimized away.

Anyway, you're wrong about the inability to anchor alternatives. I did
compare the output of the grep to the output of the awk before I compared
their speeds, so I knew that it worked when I posted it.

--
Alan Curry

Thomas 'PointedEars' Lahn

unread,

May 23, 2012, 6:10:52 PM5/23/12

to

Alan Curry wrote:

> In article <2252105.u...@PointedEars.de>,
> Thomas 'PointedEars' Lahn <use...@PointedEars.de> wrote:
>>Kaz Kylheku wrote:
>>> On 2012-05-19, Alan Curry <pac...@kosh.dhis.org> wrote:
>>> You can factor out that ^ $. The branches do not have to be individually
>>> anchored; we just need one global instance of ^ $ wrapping to counteract
>>> the regex search semantics of grep.
>>
>>s/can/must/
>>s/do not have to/must not/
>>s/just//

That was a mistake; I did not see the double negation.

Those expressions are very different from the suggested one. Your point
being?

> Whether you *should* is not obvious. The parenthesized version is easier

> to read. […]

Without having tested that, it also strikes me as being more efficient. In
the version where each alternative is anchored, backtracking for the entire
string has to be performed for the whole string for each alternative, while
in the single-anchored version we can stop with backtracking *and* trying to
match whenever a character is encountered that is not matched by either
negated character class.

> Anyway, you're wrong about the inability to anchor alternatives. […]

I did not say so.