
Full text Boolean search


David

Oct 29, 2003, 9:34:25 AM
Hi all,

I have an interesting requirement I'm not sure how to solve -
searching a large directory (150,000+ files) for two search terms,
returning only the filename as the result.

A positive match equals both search terms in the same file, regardless
of how many times each term appears in that file. The key is that
both are there.

I've tried cat * | grep <term1> | grep <term2>, which gave me the "too
many arguments" error because of the directory size. Can anyone
suggest a command I could use to do this search?

Thanks!
Dave Saunders

Philip Edward Lewis

Oct 29, 2003, 10:30:20 AM
I believe the answer may be found in "find"

ummm try:
find . -exec grep -c search1 {} \; -exec grep -c search2 {} \; -print

hmmm.. that prints out a bunch of numbers (counts of occurrences)
perhaps you can filter that output with another grep

find . -exec grep -c search1 {} \; -exec grep -c search2 {} \; -print \
| grep '^\./'

you could also run grep -l twice, using the file list printed by the
first as the list of files to search in the second.

hope this helps
--
be safe.
flip
Verso l'esterno! Verso l'esterno! Deamons di ignoranza.


Dale Talcott

Oct 29, 2003, 11:16:38 AM
dss...@hotmail.com (David) writes:

>I have an interesting requirement I'm not sure how to solve -
>searching a large directory (150,000+ files) for two search terms,
>returning only the filename as the result.

>A positive match equals both search terms in the same file, regardless
>of how many times each term appears in that file. The key is that
>both are there.

If you are doing this only a few times, you can use:

find . -type f -exec grep -q search1 {} \; -exec grep -q search2 {} \; -print

For efficiency, put the least frequent search term in the first grep, since
the second grep runs only on files that matched the first. Still, this is
going to execute grep at least 150,000 times, once for each file.
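
The reason term order matters is that find evaluates its tests left to
right and skips the second -exec whenever the first one fails. A minimal
sketch of that behavior, using true and false in place of grep; the helper
name short_circuit_demo is made up for illustration, and mktemp is assumed
to be available:

```shell
# Hypothetical helper: $1 is the first test command, "true" or "false".
short_circuit_demo() {
    d=$(mktemp -d) && touch "$d/f"
    # The second -exec runs only if the first command exits with status 0.
    find "$d" -type f -exec "$1" \; -exec echo "second test ran" \;
    rm -r "$d"
}

short_circuit_demo false   # prints nothing
short_circuit_demo true    # prints: second test ran
```

With the rarer term first, most files fail the first grep and never reach
the second one.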

So, you can get a slight improvement with:

find . -type f -print | xargs -n 50 grep -l search1 | xargs grep -l search2
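
The saving comes from xargs handing grep a batch of file names per
invocation instead of one. A small sketch that shows the batching by
echoing the commands rather than running them; the helper name
show_batches and the file names a through e are placeholders:

```shell
# Hypothetical helper: $1 is the batch size passed to xargs -n.
show_batches() {
    # echo prints each grep command line that xargs would have run.
    printf '%s\n' a b c d e | xargs -n "$1" echo grep -l search1
}

show_batches 2
# prints:
#   grep -l search1 a b
#   grep -l search1 c d
#   grep -l search1 e
```

With -n 50 and 150,000 files, the first stage would start roughly 3,000
grep processes instead of 150,000.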

If the files have names that might contain spaces or other special
characters, you need to use the GNU or BSD versions of the utilities, with
special flags:

<set PATH to find GNU or BSD tools first>
find . -type f -print0 \
| xargs -0 -n 50 grep -lZ -- search1 \
| xargs -0 grep -l -- search2
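
To see what the NUL-delimited flags protect against, here is a small
sketch comparing how xargs splits a name containing a space. The helper
names and the file name "my notes.txt" are made up for illustration, and
-0 is assumed to be supported (it is a GNU/BSD extension, not POSIX):

```shell
# Newline-delimited input: xargs splits on the space and passes two arguments.
split_plain() { printf 'my notes.txt\n' | xargs -n 1 echo; }

# NUL-delimited input read with -0: the name arrives as one argument.
split_nul() { printf 'my notes.txt\0' | xargs -0 -n 1 echo; }

split_plain   # prints "my" and "notes.txt" on separate lines
split_nul     # prints "my notes.txt"
```

In the pipeline above, -print0 and grep -lZ produce the NUL-terminated
names that each xargs -0 then reads.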

The alternate commands are available from IBM's Linux Toolkit site:
<http://www-1.ibm.com/servers/aix/products/aixos/linux/download.html>

If you need to do many searches, look into using the "glimpse" package
to build an index of the files: <http://webglimpse.net>. Glimpse is
now commercial software. I think there used to be a free version you
might be able to find in some net archive.

I'm sure there are other content indexers that could also be used.

--
Dale Talcott, IT Research Computing Services, Purdue University
a...@quest.cc.purdue.edu http://quest.cc.purdue.edu/~aeh/
