How to Write grep in Emacs Lisp (tutorial)

Xah Lee

Feb 8, 2011, 1:31:51 AM
new little elisp tutorial.

〈How to Write grep in Emacs Lisp〉
http://xahlee.org/emacs/elisp_grep_script.html

--------------------------------------------------
How to Write grep in Emacs Lisp

Xah Lee, 2011-02-07

This page shows a real-world example of an emacs lisp script that
searches files, similar to unix grep. If you don't know elisp, first
take a look at Emacs Lisp Basics.

----------------------------------------
The Problem

Summary

I want to write an elisp script that reports all files in a dir that
contain a string n times. The script is expected to search thru 5
thousand files.

----------------------------------------
Detail

Why can't i just use grep? Because:

• Often, the string i need to search is long, containing 300 hundred
chars or more. You could put your search string in a file with grep,
but it is not convenient.

• Unix grep is not very robust with unicode. Especially so if you are
calling it inside emacs on Windows, because it has to go thru 2 layers
of interface: ① the ported unix grep program. ② the Windows OS. In the
process, the char encoding in the stream can be messed up.

• grep isn't robust with various encodings. You have to deal with
“locale” and it's a headache. With emacs, you don't have to think
about file encoding at all.

• grep can't really deal with directories recursively. (there's -r,
but then you can't specify file pattern such as “*\.html” (maybe it is
possible, but i find it quite frustrating to trial error man page loop
with unix tools.))

• unix grep and associated tool bag (sort, wc, uniq, pipe, sed, awk,
…) is not flexible. When your need is slightly more complex, unix
shell tool bag can't handle it. For example, suppose you need to find
a string in HTML file only if the string happens inside another tag.
(extending the limit of unix tool bag is how Perl was born in 1987.)

When writing a script in perl or python, you can always write it so
the script works as a command line script that takes options. Or, you
can leave the script raw. When you need to run the script, you open it
with an editor, modify the parameters, save, then run it.

I always prefer the latter. Because, that way i can give and edit the
options much more comfortably in an editor than on the command line. I
can also view whatever doc the script has in the header, instead of
doing some confusing “-help” or “-h” or “--help” in the command line.
And with emacs, i can run the script by a press of a key, with many
other conveniences. Basically, a command line is nice if you are using
other's code because it's a blackbox with a (somewhat) standardized
command line interface. But for my custom text processing needs, i
find that writing my own is much more convenient.

So, with my own script for grep (may it be elisp or Python Find &
Replace or Perl Find & Replace ), i can make the script do exactly
what i need, instead of fighting the confusing and bewildering unix
options, which may not even be able to do what i need.

----------------------------------------
Solution

The solution is quite simple actually. Here's a script i've been using
for close to a year. I use it almost every day, on 5 thousand files.

Typically, i press one button to open the script. Edit the parameters
i want to search. (the input dir, file extension filter, search
string, plain text or regex, number of occurrences, etc.) Then, save
the script. Press another button to run it.

;; -*- coding: utf-8 -*-
;; 2010-03-27
;; print file names of files that have n occurrences of a string, in a given dir

;; input dir
(setq inputDir "~/web/xahlee_org/")

;; add an ending slash if not there
;; in elisp, dir path should end with a slash
(when (not (string= "/" (substring inputDir -1)))
  (setq inputDir (concat inputDir "/")))

(defun my-process-file (fpath)
  "Process the file at full path FPATH ..."
  (let (mybuffer (ii 0) searchStr)

    (when t
      ;; (and (not (string-match "/xx" fpath)) ) ; exclude some dir

      ;; create a temp buffer. Work in temp buffer. Faster.
      (setq mybuffer (get-buffer-create " myTemp"))
      (set-buffer mybuffer)
      (insert-file-contents fpath nil nil nil t)

      (setq searchStr "(2) ") ; search string here

      ;; count the occurrences of the search string
      (goto-char (point-min))
      (while (search-forward searchStr nil t)
        (setq ii (1+ ii)))

      ;; report the file if the string occurs at least once;
      ;; change the test (e.g. to (= ii 3)) to require exactly n occurrences
      (if (not (= ii 0))
          (princ (format "this many: %d %s\n" ii fpath)))

      (kill-buffer mybuffer))))

;; traverse the dir

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah occur output*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))

The code is pretty simple. At the bottom, the code visits every file
in a dir. For each file, it calls (my-process-file fpath). The
“my-process-file” creates a temp buffer, pastes the file content into
it, then searches inside the temp buffer. We do this because it's
faster. (with a temp buffer, emacs doesn't do font-locking (which is
rather resource intensive), and there's no “undo”, or any other thing
emacs normally does when opening a file for interactive edit.)

To run the file, you can call “eval-buffer” or “load-file”. (i have
“eval-buffer” aliased to just “eb”. ((defalias 'eb 'eval-buffer))
Actually, i just press a button to run the current file. See: Elisp
Lesson: Execute/Compile Current File.)

The techniques used in this script have been explained a few times in
different places in this site. If you are not familiar, please review
at: Text Processing with Emacs Lisp Batch Style.

On 5k files, the script takes 30 seconds on my machine.

Emacs is fantastic!

Xah ∑ http://xahlee.org/

Tassilo Horn

Feb 8, 2011, 3:22:02 AM
Xah Lee <xah...@gmail.com> writes:

Hi Xah,

> • Often, the string i need to search is long, containing 300 hundred
> chars or more. You could put your search string in a file with grep,
> but it is not convenient.

Well, you seem to encode the search string in your script, so I don't
see how that is better than relying on your shell history, which is
managed automatically, searchable, editable...

> • grep can't really deal with directories recursively. (there's -r,
> but then you can't specify file pattern such as “*\.html” (maybe it is
> possible, but i find it quite frustrating to trial error man page loop
> with unix tools.))

You can rely on shell globbing, so that grep gets a list of all files in
all subdirectories. For example, I can grep all header files of the
linux kernel using

% grep FOO /usr/src/linux/**/*.h

However, on older systems or on windows, that may produce a too long
command line. Alternatively, you can use the -R option to grep a
directory recursively, and specify an include globbing pattern (or many,
and/or one or many exclude patterns).

% grep -R FOO --include='*.h' /usr/src/linux/

You can also use a combination of `find', `xargs' and `grep' (with some
complications for allowing spaces in file names [-print0 to find]), or,
when using zsh, you can use

% zargs /usr/src/linux/**/*.h -- grep FOO

which does all relevant quoting and stuff for you.
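
A minimal sketch of the `find', `xargs' and `grep' combination mentioned above, assuming a find and xargs that understand -print0 and -0 (the GNU and BSD versions do). The function name grep_headers is made up for illustration:

```shell
# Hypothetical helper: search all *.h files under a directory for a
# fixed string.
# usage: grep_headers DIR STRING
grep_headers() {
    # -print0 and -0 pass file names separated by NUL bytes, so names
    # containing spaces or newlines survive intact.
    # -F treats STRING as a fixed string rather than a regex.
    # /dev/null is an extra (empty) file argument, so grep labels every
    # match with its file name even if only one real file is passed.
    find "$1" -type f -name '*.h' -print0 | xargs -0 grep -F -- "$2" /dev/null
}
```

Called as `grep_headers /usr/src/linux/ FOO`, this should behave much like the `grep -R --include' form above.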

> • unix grep and associated tool bag (sort, wc, uniq, pipe, sed, awk,
> …) is not flexible. When your need is slightly more complex, unix
> shell tool bag can't handle it. For example, suppose you need to find
> a string in HTML file only if the string happens inside another tag.
> (extending the limit of unix tool bag is how Perl was born in 1987.)

There are many things you can also do with a plain shell script. I'm
always amazed how good and concise you can do all sorts of file/text
manipulation using `zsh' builtins.

Bye,
Tassilo

Xah Lee

Feb 8, 2011, 7:54:05 AM
hi Tass,

Xah wrote:
〈How to Write grep in Emacs Lisp〉
http://xahlee.org/emacs/elisp_grep_script.html

On Feb 8, 12:22 am, Tassilo Horn <tass...@member.fsf.org> wrote:
> Hi Xah,
>
> > • Often, the string i need to search is long, containing 300 hundred
> > chars or more. You could put your search string in a file with grep,
> > but it is not convenient.
>
> Well, you seem to encode the search string in your script, so I don't
> see how that is better than relying on your shell history, which is
> managed automatically, searchable, editable...

not sure what you meant above. I made a mistake above. I meant to say
my search string is few hundred chars. Usually a snippet of html code
that may contain javascript code and also unicode chars.

e.g.

<div class="chtk"><script type="text/
javascript">ch_client="thoucm";ch_width=550;ch_height=90;ch_type="mpu";ch_sid="Chitika
Default";ch_backfill=1;ch_color_site_link="#00C";ch_color_title="#00C";ch_color_border="#FFF";ch_color_text="#000";ch_color_bg="#FFF";</
script><script src="http://scripts.chitika.net/eminimalls/amm.js"
type="text/javascript"></script></div>

> > • grep can't really deal with directories recursively. (there's -r,
> > but then you can't specify file pattern such as “*\.html” (maybe it is
> > possible, but i find it quite frustrating to trial error man page loop
> > with unix tools.))
>
> You can rely on shell globbing, so that grep gets a list of all files in
> all subdirectories.  For example, I can grep all header files of the
> linux kernel using
>
>   % grep FOO /usr/src/linux/**/*.h

say, i want to search in the dir
~/web/xahlee_org/

but no more than 2 levels deep, and only files ending in “.html”. This
is not a toy question. I actually need to do that.

> However, on older systems or on windows, that may produce a too long
> command line.  Alternatively, you can use the -R option to grep a
> directory recursively, and specify an include globbing pattern (or many,
> and/or one or many exclude patterns).
>
>   % grep -R FOO --include='*.h' /usr/src/linux/
>
> You can also use a combination of `find', `xargs' and `grep' (with some
> complications for allowing spaces in file names [-print0 to find]), or,
> when using zsh, you can use
>
>   % zargs /usr/src/linux/**/*.h -- grep FOO
>
> which does all relevant quoting and stuff for you.

problem with find xargs is that they spawn grep for each file, which
becomes too slow to be usable.
To not use xargs but “find ... -exec” instead is possible of course
but i always have problems with the syntax...

> > • unix grep and associated tool bag (sort, wc, uniq, pipe, sed, awk,
> > …) is not flexible. When your need is slightly more complex, unix
> > shell tool bag can't handle it. For example, suppose you need to find
> > a string in HTML file only if the string happens inside another tag.
> > (extending the limit of unix tool bag is how Perl was born in 1987.)
>
> There are many things you can also do with a plain shell script.  I'm
> always amazed how good and concise you can do all sorts of file/text
> manipulation using `zsh' builtins.

never really got into bash for shell scripting... sometimes tried but
the ratio power/syntax isn't tolerable. Knowing perl well pretty much
killed any possible incentive left.

... in late 1990s, my thought was that i'd just learn perl well and
never need to learn another lang or shell for any text processing and
sys admin tasks for personal use. The thinking was that it'd be
efficient, in the sense of not having to waste time learning multiple
langs for doing the same thing. (not counting job requirements in a
company) So i wrote a lot of perl scripts for find & replace and file
management stuff, and tried to make them as general as possible. lol.
But what turned out is that, over the years, for one reason or another,
i just learned python, php, then in 2007 elisp. Maybe the love for
languages inevitably won over my one-powerful-coherent-system
efficiency obsession. But also, i ended up rewriting many of my text
processing scripts in each lang. I guess part of it is exercise when
learning a new lang.

... anyway, i guess i'm babbling randomly, but one thing i learned is
that for misc text processing scripts, the idea of writing a generic
flexible powerful one once and for all just doesn't work, because the
coverage is too wide and the tasks that need to be done at any one
time are too specific. (and i think this makes sense, because the idea
of one language or one generic script for all is mostly from ideology,
not really out of practical need. If we look at the real world, it's
almost always a disparate mess of components and systems.)

my text processing scripts ended up being a mess. There are like
several versions in different langs. A few are general, but most are
basically used once or in a particular year only. (many of them do
more or less the same thing). When i need to do some particular task,
i find it easier just to write a new one in whatever lang is
currently in my brain memory than to spend time fishing out and
revisiting old scripts.

some concrete example...

e.g. i wrote this general script in 2000, intended to be one-stop for
all find/replace needs

〈Perl: Find & Replace on Multiple Files〉
http://xahlee.org/perl-python/find_replace_perl.html

in 2005, while i was learning python, i wrote (several) versions in
python. e.g.

〈Python: Find & Replace Strings in Unicode Files〉
http://xahlee.org/perl-python/find_replace_unicode.html

it's not a port of the perl code. The python version doesn't have as
many features as the perl one. But for some reason, i have stopped
using the perl version. I didn't need all of the perl version's
features, and when i do need them, i have several other python scripts
that each address a particular need. (e.g. one for unicode, one for
multiple pairs in one shot, one for regex, one for plain text, one for
find only, one for find+replace, several for find/replace only if a
particular condition is met, etc.)

then in 2006, i fell into the emacs hole and started to learn elisp. In
the process, i realized that elisp for text processing is more
powerful than perl or python. Not due to lisp the lang, but more due
to emacs the text-editing environment and system. I tried to explain
this in a few places but mostly here:

〈Text Processing: Emacs Lisp vs Perl〉
http://xahlee.org/emacs/elisp_text_processing_lang.html

so, all my new scripts for text processing are in elisp. A few of my
python scripts i still use, but almost everything is now in elisp.

also, sometime in 2008, i grew a shell script that processes weblogs
using the unix tool bag of cat, grep, awk, sort, uniq. It's about 100
lines. You can see it here:

http://xahlee.org/comp/weblog_process.sh

at one time i wondered, why am i doing it. Didn't i think that perl
would replace all shell scripts? I gave it a little thought, and i
think the conclusion is that for this task, the shell script is
actually more efficient and simpler to write. Possibly if i had started
with perl for this task, i might have ended up with well structured
code that's not necessarily less efficient... but you know, things in
life aren't all planned. It began when i just needed a few lines of
grep to see something in my web log. Then, over the years, i added
another line, another line, then another, all need based. If at any of
those times i had thought “let's scratch this and restart with perl”,
that'd have been wasting time. Besides that, i have some doubt that
perl would do a better job for this. With shell tools, each line just
does one simple thing with piping. To do it in perl, one'd have to
read in the huge log file, then maintain some data structure and try
to parse it... too much memory and thinking would be involved. If i
coded perl by emulating the shell code line-by-line, then it makes no
sense to do it in perl, since it's just the shell bag in perl.

Also note, this shell script can't be replaced by elisp, because elisp
is not suitable when the file size is large.

well, that's my story — extempore! ☺

Xah Lee

Petter Gustad

Feb 8, 2011, 7:51:54 AM
Xah Lee <xah...@gmail.com> writes:

> problem with find xargs is that they spawn grep for each file, which
> becomes too slow to be usable.

find . -maxdepth 2 -name '*.html' -print0 | xargs -0 grep whatever

will call grep with a list of filenames given by find, only a single
grep process will run.

//Petter
--
.sig removed by request.

Tim Bradshaw

Feb 8, 2011, 8:35:22 AM
On 2011-02-08 12:51:54 +0000, Petter Gustad said:

> find . -maxdepth 2 -name '*.html' -print0 | xargs -0 grep whatever
>
> will call grep with a list of filenames given by find, only a single
> grep process will run.

... and you'd better hope you don't have any odd filenames.

Zach Beane

Feb 8, 2011, 8:37:52 AM
Tim Bradshaw <t...@tfeb.org> writes:

Are you aware of what -print0 and -0 mean?

Zach

Tim Bradshaw

Feb 8, 2011, 8:40:08 AM

I take this back (I had confused -n and -0), sorry. I don't think
you're correct that only one grep will run though - depends how many
files you have.

Tim Bradshaw

Feb 8, 2011, 8:41:08 AM
On 2011-02-08 13:37:52 +0000, Zach Beane said:

> Are you aware of what -print0 and -0 mean?

Yes, see other reply :-). I was derailed by the "only one grep will
run" thing.

Petter Gustad

Feb 8, 2011, 9:03:02 AM
Tim Bradshaw <t...@tfeb.org> writes:

> I take this back (I had confused -n and -0), sorry. I don't think
> you're correct that only one grep will run though - depends how many
> files you have.

This is true. It will split the list so as not to overflow the argv
buffer. It will typically process several thousand files at a time,
depending upon the length of the filenames.
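
The batching can be made visible on a small scale by forcing a tiny limit with -n. This is only a sketch of the idea (the function name show_batches is invented here); in normal use the limit comes from the system's maximum command line size rather than from -n:

```shell
# Feed five names to xargs but allow at most two arguments per command.
# echo then runs three times, printing one line per invocation:
# two names, two names, and finally the one left over.
show_batches() {
    printf '%s\n' a b c d e | xargs -n 2 echo
}
```

Note the last invocation gets a single argument, which is the case that matters for grep later in this thread.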

Icarus Sparry

Feb 8, 2011, 12:32:05 PM

This is getting off-topic for the listed newsgroups and into
comp.unix.shell (although the question was originally posed in a MS
windows context).

The 'modern' way to do this is
find . -maxdepth 2 -name '*.html' -exec grep whatever {} +

The key thing which makes this 'modern' is the '+' at the end of the
command, rather than '\;'. This causes find to execute the grep once per
group of files, rather than once per file.
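
The difference between the two terminators can be seen by counting how often the command is started. A small sketch, assuming a find that supports the '+' form; the function names are invented here, and `sh -c 'echo run'` merely stands in for grep:

```shell
# Count how many times find starts the command for the files under DIR.
# The inline script prints one line per invocation and ignores the file
# names that find passes after the placeholder "sh" ($0) argument.
count_runs_per_file() {
    find "$1" -type f -exec sh -c 'echo run' sh {} \; | wc -l
}
count_runs_batched() {
    find "$1" -type f -exec sh -c 'echo run' sh {} + | wc -l
}
```

For a directory holding three files, the first function should report 3 and the second 1.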

Petter Gustad

Feb 8, 2011, 12:55:45 PM
Icarus Sparry <i.spa...@gmail.com> writes:

> The 'modern' way to do this is
> find . -maxdepth 2 -name '*.html' -exec grep whatever {} +

Agree, I've noticed that recent versions of find have the + option. I
remember in the old days the exec method was considered bad since it
would fork grep for each file, so I've got used to using xargs. I
always used to quote "{}" as well, but this does not seem to be
required in later versions of find.

In terms of the number of forks the above will be similar to xargs as
they both have to make sure that they don't overflow the command
length.

Tim Bradshaw

Feb 8, 2011, 4:15:29 PM
On 2011-02-08 14:03:02 +0000, Petter Gustad said:

> This is true. It will split the list so as not to overflow the argv
> buffer. It will typically process several thousand files at a time,
> depending upon the length of the filenames.

Yes. I also apologise for the article of mine you're responding to
sounding like I was trying not to be completely wrong: I really was
just wrong. The whole source of confusion (which bites me in real life
as well) is that I make great use of a command which treats, for
instance, -3 as a shorthand for -n 3, and this has caused some kind of
distortion in my memory for the find | xargs pattern, such that I now
often type find ... | xargs -1 when I mean find ... | xargs -n 1.

Xah Lee

Feb 8, 2011, 5:30:53 PM

Nice. When was the + introduced?

Xah

Icarus Sparry

Feb 8, 2011, 7:02:17 PM
On Tue, 08 Feb 2011 14:30:53 -0800, Xah Lee wrote:

> On Feb 8, 9:32 am, Icarus Sparry <i.sparry...@gmail.com> wrote:

[snip]


>> The 'modern' way to do this is
>> find . -maxdepth 2 -name '*.html' -exec grep whatever {} +
>>
>> The key thing which makes this 'modern' is the '+' at the end of the
>> command, rather than '\;'. This causes find to execute the grep once
>> per group of files, rather than once per file.
>
> Nice. When was the + introduced?

Years ago! The posix spec for find lists it in the page which has a
copyright of 2001-2004.

http://pubs.opengroup.org/onlinepubs/009695399/utilities/find.html

Using google, I have come up with this reference from 2001

https://www.opengroup.org/sophocles/show_mail.tpl?CALLER=show_archive.tpl&source=L&listname=austin-group-l&id=3067

in which David Korn reports writing the code in 1987.

Hugh Aguilar

Feb 8, 2011, 7:31:24 PM
On Feb 7, 11:31 pm, Xah Lee <xah...@gmail.com> wrote:
> How to Write grep in Emacs Lisp

On a related note --- is there a CL implementation of regular
expressions?

Zach Beane

Feb 8, 2011, 7:39:16 PM
Hugh Aguilar <hughag...@yahoo.com> writes:

cl-ppcre is the most popular:

http://weitz.de/cl-ppcre/

Zach

Hugh Aguilar

Feb 8, 2011, 8:58:01 PM
On Feb 8, 5:39 pm, Zach Beane <x...@xach.com> wrote:

Thanks; I'll look at that. I'm actually implementing a regex in Forth,
but I may be able to learn a lot about the subject from that Lisp
program.

I'm pretty new to regex, so this may be a dumb question: What is the
deal with the standards? Are any of the following statements true?
1.) There are *two* POSIX standards.
2.) GREP is one of the POSIX standards.
3.) Perl is its own standard.
4.) PCRE is a subset of Perl intended to be integrated into scripting
languages.
5.) Python uses PCRE.

Right now, I'm working from Sedgewick's book "Algorithms." I realize
that this is a pretty basic regex, but it is helping me to understand
the concepts. I will likely have to learn PCRE in order to make my
package comparable to what is offered in the major scripting languages
--- especially Python.

Xah Lee

Feb 8, 2011, 11:41:29 PM

if you are implementing regex, i'd highly suggest
Parsing Expression Grammar (PEG) instead.

It's not more difficult to implement, but gives you much more power.
In particular, matching nested pattern. In fact, afaik some newer lang
implement regex as a pattern of PEG. (was it Lua? ...)

i wrote up how i discovered it here:

〈Pattern Matching vs Lexical Grammar Specification〉
http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html

Xah

Petter Gustad

Feb 9, 2011, 5:19:27 AM
Tim Bradshaw <t...@tfeb.org> writes:

> On 2011-02-08 14:03:02 +0000, Petter Gustad said:
>
>> This is true. It will split the list so as not to overflow the argv
>> buffer. It will typically process several thousand files at a time,
>> depending upon the length of the filenames.
>
> Yes. I also apologise for the article of mine you're responding to

No apology needed.

> sounding like I was trying not to be completely wrong: I really was
> just wrong. The whole source of confusion (which bites me in real
> life as well) is that I make great use of a command which treats, for
> instance, -3 as a shorthand for -n 3, and this has caused some kind of
> distortion in my memory for the find | xargs pattern, such that I now
> often type find ... | xargs -1 when I mean find ... | xargs -n 1.

Idioms can be problematic at times as they get stuck in your fingers.
I still use find | xargs even though I was shown by a colleague that I
could do -exec + long time ago. I also still use find | cpio for
copying files to keep permissions, owner, group and symbolic links,
even though recent versions of cp will do the same.

Rob Warnock

Feb 9, 2011, 8:10:48 AM
Petter Gustad <newsma...@gustad.com> wrote:
+---------------

| Tim Bradshaw <t...@tfeb.org> writes:
| > The whole source of confusion (which bites me in real
| > life as well) is that I make great use of a command which treats, for
| > instance, -3 as a shorthand for -n 3, and this has caused some kind of
| > distortion in my memory for the find | xargs pattern, such that I now
| > often type find ... | xargs -1 when I mean find ... | xargs -n 1.
+---------------

I will confess to having also done that occasionally.

+---------------


| Idioms can be problematic at times as they get stuck in your fingers.
| I still use find | xargs even though I was shown by a colleague that I
| could do -exec + long time ago.

+---------------

There are still *very* good performance reasons for preferring
"find | xargs" to "find -exec" in most common usages, though
"find | xargs -n 1" is almost as bad as "find -exec". The one
advantage of the latter is that you can use the "{}" more than
once, e.g.:

find path... -name '*pattern*' -exec mv {} '{}.bak' \;

To do that with "find | xargs" you generally need to write an
auxiliary script for "xargs" to fire off.
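
One common way to avoid a separate script file is to hand "xargs" an inline "sh -c" script that loops over its arguments. A sketch only, assuming find and xargs versions with -print0 and -0; the name backup_matching is made up for illustration:

```shell
# Hypothetical helper: rename every file under DIR whose name matches
# the glob PATTERN to NAME.bak, so each file name is used twice.
# usage: backup_matching DIR PATTERN
backup_matching() {
    # The trailing "sh" fills $0 of the inline script; the file names
    # appended by xargs become "$@" and are renamed one by one.
    find "$1" -type f -name "$2" -print0 |
        xargs -0 sh -c 'for f in "$@"; do mv -- "$f" "$f.bak"; done' sh
}
```

This keeps the batching of "xargs" (few shells started) while still letting each name appear more than once.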

+---------------


| I also still use find | cpio for copying files to keep permissions,
| owner, group and symbolic links, even though recent versions of cp
| will do the same.

+---------------

"rsync -auv" is your friend! Works both locally and remotely.


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

Pingouin

Feb 9, 2011, 9:40:00 AM

What am I missing?

> find . -maxdepth 2 -name '*.tex' -exec grep "Rejoint Non" {} +

find: missing Parameter for `-exec'

Thanks,

Gérald

Harald Hanche-Olsen

Feb 9, 2011, 11:07:33 AM
[Icarus Sparry <i.spa...@gmail.com>]

> The 'modern' way to do this is
> find . -maxdepth 2 -name '*.html' -exec grep whatever {} +

Actually, I think it should be

find . -maxdepth 2 -name '*.html' -exec grep whatever /dev/null {} +

because grep behaves differently when given only one filename as opposed
to several.

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell

Harald Hanche-Olsen

Feb 9, 2011, 11:09:54 AM
[Pingouin <geral...@dgag.ca>]

> What am I missing?
>
>> find . -maxdepth 2 -name '*.tex' -exec grep "Rejoint Non" {} +
>
> find: missing Parameter for `-exec'

It's missing a \; at the end.

Harald Hanche-Olsen

Feb 9, 2011, 11:10:19 AM
[Icarus Sparry <i.spa...@gmail.com>]

> The 'modern' way to do this is
> find . -maxdepth 2 -name '*.html' -exec grep whatever {} +

Actually, I think it should be

find . -maxdepth 2 -name '*.html' -exec grep whatever /dev/null {} + \;

because grep behaves differently when given only one filename as opposed
to several.

--

Tassilo Horn

Feb 9, 2011, 2:48:44 PM
Xah Lee <xah...@gmail.com> writes:

>> You can rely on shell globbing, so that grep gets a list of all files in
>> all subdirectories.  For example, I can grep all header files of the
>> linux kernel using
>>
>>   % grep FOO /usr/src/linux/**/*.h
>
> say, i want to search in the dir
> ~/web/xahlee_org/
>
> but no more than 2 levels deep, and only files ending in “.html”. This
> is not a toy question. I actually need to do that.

% grep FOO ~/web/xahlee_org/*{,/*}.html

That'll grep files like ~/web/xahlee_org/bla.html as well as
~/web/xahlee_org/bla/bla.html, but not any deeper.

>> However, on older systems or on windows, that may produce a too long
>> command line.  Alternatively, you can use the -R option to grep a
>> directory recursively, and specify an include globbing pattern (or many,
>> and/or one or many exclude patterns).
>>
>>   % grep -R FOO --include='*.h' /usr/src/linux/
>>
>> You can also use a combination of `find', `xargs' and `grep' (with some
>> complications for allowing spaces in file names [-print0 to find]), or,
>> when using zsh, you can use
>>
>>   % zargs /usr/src/linux/**/*.h -- grep FOO
>>
>> which does all relevant quoting and stuff for you.
>
> problem with find xargs is that they spawn grep for each file, which
> becomes too slow to be usable.

I can see no speed difference in find | xargs grep or grep with glob...

> To not use xargs but “find ... -exec” instead is possible of course
> but i always have problems with the syntax...

Yeah, there are so many ways. ;-)

>> There are many things you can also do with a plain shell script.  I'm
>> always amazed how good and concise you can do all sorts of file/text
>> manipulation using `zsh' builtins.
>
> never really got into bash for shell scripting... sometimes tried but
> the ratio power/syntax isn't tolerable. Knowing perl well pretty much
> killed any possible incentive left.

Yeah, perl is a swiss army knife, but I never got comfortable with it.

> Also note, this shell script can't be replaced by elisp, because elisp
> is not suitable when the file size is large.

You could chunk the file and handle the parts separately, in order to
not have everything in an emacs buffer and thus run out of RAM.

Bye,
Tassilo

Rob Warnock

Feb 9, 2011, 10:39:08 PM
Harald Hanche-Olsen <han...@math.ntnu.no> wrote:
+---------------

| [Icarus Sparry <i.spa...@gmail.com>]
| > The 'modern' way to do this is
| > find . -maxdepth 2 -name '*.html' -exec grep whatever {} +
|
| Actually, I think it should be
| find . -maxdepth 2 -name '*.html' -exec grep whatever /dev/null {} +
| because grep behaves differently when given only one filename as opposed
| to several.
+---------------

Yup. This is why it's also important to include that "/dev/null"
when using "find | xargs", too:

find . -maxdepth 2 -name '*.html' -print | xargs grep whatever /dev/null

Years & years ago, right after I learned about "xargs", I got burned
several times on "find | xargs grep pat" when the file list was long
enough that "xargs" fired up more than one "grep"... and the last
invocation was given only one arg!! IT FOUND THE PATTERN, BUT DIDN'T
TELL ME WHAT !@^%!$@#@! FILE IT WAS IN!! :-{

The trailing "/dev/null" fixes that. ;-}
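
The effect is easy to reproduce. A minimal sketch (the two function names are made up for illustration):

```shell
# With a single file argument grep omits the file name from its output.
# usage: grep_one PATTERN FILE
grep_one() {
    grep -- "$1" "$2"
}

# Adding /dev/null as a second, always empty, file makes grep label
# each match with the file it came from.
# usage: grep_one_labelled PATTERN FILE
grep_one_labelled() {
    grep -- "$1" /dev/null "$2"
}
```

For a file x.html containing the line "hello whatever", the first prints just the line, the second prints "x.html:hello whatever".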

Rob Warnock

Feb 9, 2011, 10:47:20 PM
Harald Hanche-Olsen <han...@math.ntnu.no> wrote:
+---------------
| [Icarus Sparry <i.spa...@gmail.com>]
| > The 'modern' way to do this is
| > find . -maxdepth 2 -name '*.html' -exec grep whatever {} +
|
| Actually, I think it should be
| find . -maxdepth 2 -name '*.html' -exec grep whatever /dev/null {} + \;
| because grep behaves differently when given only one filename as opposed
| to several.
+---------------

Oh, wow! I just learned from this thread about the new (to me)
"{} +" option to "find"! That wasn't in "find" until relatively
recently, it seems. [At least, it wasn't in FreeBSD 4.6, though
it seems to be in FreeBSD 6.x and later...]

Thanks, guys!!

Petter Gustad

Feb 10, 2011, 1:52:34 AM
rp...@rpw3.org (Rob Warnock) writes:

> invocation was given only one arg!! IT FOUND THE PATTERN, BUT DIDN'T
> TELL ME WHAT !@^%!$@#@! FILE IT WAS IN!! :-{

Sounds frustrating, but grep -H will always print the filename, even
when given a single filename on the command line.

Tim Bradshaw

Feb 10, 2011, 3:15:04 AM
On 2011-02-09 13:10:48 +0000, Rob Warnock said:

> There are still *very* good performance reasons for preferring
> "find | xargs" to "find -exec" in most common usages, though
> "find | xargs -n 1" is almost as bad as "find -exec". The one
> advantage of the latter is that you can use the "{}" more than
> once, e.g.:

My xargs -n 1 case is actually (something like) this:

echo *.zip | xargs -n 1 unzip -q

because I don't know how to get unzip to take many files as arguments
(don't tell me, also don't tell me that this suffers from the kind of
filename fragility I was (wrongly) pointing out up-thread: I know).

Björn Lindberg

Feb 10, 2011, 4:41:08 AM
Tim Bradshaw <t...@tfeb.org> writes:

> On 2011-02-09 13:10:48 +0000, Rob Warnock said:
>
>> There are still *very* good performance reasons for preferring
>> "find | xargs" to "find -exec" in most common usages, though
>> "find | xargs -n 1" is almost as bad as "find -exec". The one
>> advantage of the latter is that you can use the "{}" more than
>> once, e.g.:
>
> My xargs -n 1 case is actually (something like) this:
>
> echo *.zip | xargs -n 1 unzip -q

Not that you asked for it, but in that particular situation I do this:

for z in *.zip; do unzip -q "$z"; done


Björn Lindberg

Petter Gustad

Feb 10, 2011, 4:29:19 AM
rp...@rpw3.org (Rob Warnock) writes:

> Petter Gustad <newsma...@gustad.com> wrote:
> +---------------

> | Idioms can be problematic at times as they get stuck in your fingers.
> | I still use find | xargs even though I was shown by a colleague that I
> | could do -exec + long time ago.
> +---------------
>
> There are still *very* good performance reasons for preferring
> "find | xargs" to "find -exec" in most common usages, though

They seem to be very similar with respect to the number of forks:

$ cat /tmp/dummy
#!/bin/sh
echo "this is dummy pid $$, args $#" >> /tmp/dummy.out
$ rm /tmp/dummy.out ; time find . -type f -print0 | xargs -0 /tmp/dummy ; wc -l /tmp/dummy.out
$ rm /tmp/dummy.out ; time find . -type f -exec /tmp/dummy {} + ; wc -l /tmp/dummy.out

results in 3.8s/190lines and 4.6s/190lines respectively (many of the
files being cached as I ran both commands several times). xargs is a
little faster, but I can't see why as the number of forks are the
same? The number of arguments appears to be in the range of 2000 for
both with my pathnames.

> +---------------
> | I also still use find | cpio for copying files to keep permissions,
> | owner, group and symbolic links, even though recent versions of cp
> | will do the same.
> +---------------
>
> "rsync -auv" is your friend! Works both locally and remotely.

I know, I just can't seem to get out of the find|cpio habit... I also
use find/cpio over rsh/ssh for remote copies as well.

Tim Bradshaw

Feb 10, 2011, 5:06:20 AM
On 2011-02-10 09:41:08 +0000, Björn Lindberg said:

>
> Not that you asked for it, but in that particular situation I do this:
>
> for z in *.zip; do unzip -q "$z"; done

That's what I used to do as well. Not sure why I changed to the xargs
version other than hack value (in the case I care about the filenames
are known to be well-behaved).

Rob Warnock

Feb 10, 2011, 8:39:06 PM
Tim Bradshaw <t...@tfeb.org> wrote:
+---------------
+---------------

And if the filenames are known to be ill-behaved, there's always
"find ... -print0 | xargs -0 ...".


-Rob

p.s. (deadp (beat horse))? ;-}

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <http://rpw3.org/>
San Mateo, CA 94403

Hugh Aguilar

Feb 10, 2011, 11:53:10 PM

Thanks for the tip --- that looks *very* interesting. Since I'm
implementing from scratch, I might as well go with the latest and
greatest technology. :-) I've never really liked regex all that much,
but didn't know of any other way to do it, except writing a program
for the pattern match (that can actually be more readable than a
regex, which looks a lot like line-noise).

Lua has a pretty basic pattern-matching facility; they were trying to
keep the code-size and memory usage as low as possible. I think there
are some regex add-ons though. A lot of people are writing add-on
packages for Lua --- it is really taking off.

tange

Feb 12, 2011, 4:37:42 PM

I do:

parallel unzip -q ::: *.zip

It often works faster.

Watch the intro video to learn more about GNU Parallel at
http://www.youtube.com/watch?v=OpaiGYxkSuQ

/Ole
