
Performance of sed -f


Noob

Oct 24, 2014, 12:17:27 PM
Hello,

Consider a large number of sed replacement commands, stored in a file.

sed s@aaaa@bbb@
sed s@cccccc@ddd@
...

I have ~3000 such commands in command_file.
and I'm running sed -f command_file input_file
(input_file is ~500 KB)

The sed command takes a long time (around 5 minutes).

I was expecting sed to populate some kind of hash table,
and perform the look-ups in O(1) but I must be mistaken,
considering the large run-time.

So, I'm wondering...

1) how does sed handle a large number of replacement commands?
2) is there a better approach for this type of operation?

Regards.

Richard Kettlewell

Oct 24, 2014, 12:45:31 PM
Noob <ro...@127.0.0.1> writes:
> Consider a large number of sed replacement commands, stored in a file.
>
> sed s@aaaa@bbb@
> sed s@cccccc@ddd@
> ...
>
> I have ~3000 such commands in command_file.
> and I'm running sed -f command_file input_file
> (input_file is ~500 KB)
>
> The sed command takes a long time (around 5 minutes).
>
> I was expecting sed to populate some kind of hash table,
> and perform the look-ups in O(1) but I must be mistaken,
> considering the large run-time.
>
> So, I'm wondering...
>
> 1) how does sed handle a large number of replacement commands?

The same way it handles any other sequence of commands: it works through
them one by one. It’s a general purpose stream editing tool.

> 2) is there a better approach for this type of operation?

If you want to get high performance for this kind of operation you need
something that can turn your entire set of search strings into a single
finite automaton that can tell you what the first matching substring is
in a single pass of a line.

Once you’ve got a match, a constant-time lookup of the substitution
becomes possible.

http://swtch.com/~rsc/regexp/regexp1.html suggests Tcl and Awk contain
such things, though it's dated 2007 so other language implementations
may have caught up since then.
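That single-automaton idea can be sketched in Python (a rough illustration only; the two-entry table is hypothetical and would really be parsed from the 3000-rule command_file):

```python
import re

# Hypothetical replacement table, standing in for the real rule file.
table = {"aaaa": "bbb", "cccccc": "ddd"}

# Combine all literal search strings into one alternation.  Longer
# keys come first so the longest match wins, and re.escape() is used
# because the keys are literals, not regexps.
pattern = re.compile("|".join(re.escape(k)
                              for k in sorted(table, key=len, reverse=True)))

def translate(line):
    # The matched substring itself is the constant-time lookup key.
    return pattern.sub(lambda m: table[m.group(0)], line)
```

(CPython's re engine is backtracking rather than a true DFA, so this is not literally a single-pass automaton, but the constant-time lookup of the substitution after a match works as described.)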

--
http://www.greenend.org.uk/rjk/

Kaz Kylheku

Oct 24, 2014, 3:47:30 PM
On 2014-10-24, Noob <ro...@127.0.0.1> wrote:
> Hello,
>
> Consider a large number of sed replacement commands, stored in a file.
>
> sed s@aaaa@bbb@
> sed s@cccccc@ddd@
> ...
>
> I have ~3000 such commands in command_file.
> and I'm running sed -f command_file input_file
> (input_file is ~500 KB)
>
> The sed command takes a long time (around 5 minutes).

If you want a blazing fast solution, and the set of replacements does
not have to be easy to modify, you can turn it into a Lex program.
(Compiled to C with a lex implementation such as GNU Flex.)

Lex turns all of its pattern rules into one giant regular expression.
The pattern rules trigger snippets of C code called actions. These
actions can perform output.

The action triggered is the one whose rule matches the longest possible
text at any given input position. If there is a tie, then the earlier
rule wins.

If none of the rules match, then a default rule kicks in which matches a single
character, and sends it to standard output, then advances the input to the next
character. (If you write a one-character rule explicitly, it shadows that
rule.)

Here is an example Lex program which translates "foo" to "xyzzy" and "bar"
to "quux".

%{
#include <stdio.h>
%}

/* regex "macros" can be defined here: symbolic names for regex fragments */

%%

foo { fputs("xyzzy", stdout); }
bar { fputs("quux", stdout); }

%%

int main(void)
{
yylex();
return 0;
}


I saved the above in a file called "lex.l" and built it on an Ubuntu Linux
system like this:

$ lex lex.l # this produces a file called "lex.yy.c"
$ cc lex.yy.c -ll # we must link in the flex library "libflex"

Test run:

$ ./a.out
foo
xyzzy
foo bar
xyzzy quux
Hey foo, I have a bar for you.
Hey xyzzy, I have a quux for you.

The Unix command for compiling a Lex program is "lex"
and the run-time support library is linked with "-ll". These work
fine on the Ubuntu system and other systems that have the right
symbolic links:

$ ls -l /usr/lib/i386-linux-gnu/lib{fl,l}.a
-rw-r--r-- 1 root root 4304 Nov 7 2011 /usr/lib/i386-linux-gnu/libfl.a
lrwxrwxrwx 1 root root 7 Nov 7 2011 /usr/lib/i386-linux-gnu/libl.a -> libfl.a

$ ls -l /usr/bin/*lex
-rwxr-xr-x 1 root root 288384 Nov 7 2011 /usr/bin/flex
lrwxrwxrwx 1 root root 4 Nov 7 2011 /usr/bin/lex -> flex

Without these symlinks, you would have to use "flex" and link with "-lfl".

Barry Margolin

Oct 24, 2014, 5:41:08 PM
In article <m2du2e$glv$1...@dont-email.me>, Noob <ro...@127.0.0.1> wrote:

> Hello,
>
> Consider a large number of sed replacement commands, stored in a file.
>
> sed s@aaaa@bbb@
> sed s@cccccc@ddd@
> ...
>
> I have ~3000 such commands in command_file.
> and I'm running sed -f command_file input_file
> (input_file is ~500 KB)
>
> The sed command takes a long time (around 5 minutes).
>
> I was expecting sed to populate some kind of hash table,
> and perform the look-ups in O(1) but I must be mistaken,
> considering the large run-time.

Probably 90% of sed invocations have just 1 replacement, and 99% have
fewer than 10. Optimizing multiple commands as you expect would be way
overkill for most uses.

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Kaz Kylheku

Oct 24, 2014, 6:39:17 PM
On 2014-10-24, Kaz Kylheku <k...@kylheku.com> wrote:
> On 2014-10-24, Noob <ro...@127.0.0.1> wrote:
>> The sed command takes a long time (around 5 minutes).
>
> If you want a blazing fast solution [ ... ]

[ ... ]

> $ cc lex.yy.c -ll # we must link in the flex library "libflex"

... then use your compiler's optimization options!

Eric Sosman

Oct 24, 2014, 8:28:49 PM
Hmmm: If we count the total "solution time" as the sum of
compilation and execution, would high optimization be a good idea?

- Pro: Using -OInf will generate smaller/tighter/faster code,
which will (since the O.P. has told us the input to be processed
is lengthy) be significantly more efficient than an unoptimized
version.

- Con: Optimizing what amounts to a `switch' statement is a
waste of time; what really needs to be optimized is the pattern
matcher in the already-compiled libflex, not the dumb-simple
driver program that calls upon it!

I truly don't know which argument should prevail (although
I lean slightly toward the latter): Does anyone have measurements?

--
eso...@comcast-dot-net.invalid

Kaz Kylheku

Oct 24, 2014, 9:03:49 PM
On 2014-10-25, Eric Sosman <eso...@comcast-dot-net.invalid> wrote:
> On 10/24/2014 6:39 PM, Kaz Kylheku wrote:
>> On 2014-10-24, Kaz Kylheku <k...@kylheku.com> wrote:
>>> On 2014-10-24, Noob <ro...@127.0.0.1> wrote:
>>>> The sed command takes a long time (around 5 minutes).
>>>
>>> If you want a blazing fast solution [ ... ]
>>
>> [ ... ]
>>
>>> $ cc lex.yy.c -ll # we must link in the flex library "libflex"
>>
>> ... then use your compiler's optimization options!
>
> Hmmm: If we count the total "solution time" as the sum of
> compilation and execution, would high optimization be a good idea?

It's hard to tell. The break-even point is shifted elsewhere;
every second spent optimizing is a second not spent crunching through
some amount of data, but increases the rate.

Another problem is the I/O ceiling. At some point, the program becomes
I/O bound, so further optimizations don't knock down any more seconds
of wall time (but make the processor somewhat more available).

If the data set is large, we can't count on it being pre-cached.

> - Pro: Using -OInf will generate smaller/tighter/faster code,
> which will (since the O.P. has told us the input to be processed
> is lengthy) be significantly more efficient than an unoptimized
> version.
>
> - Con: Optimizing what amounts to a `switch' statement is a
> waste of time; what really needs to be optimized is the pattern
> matcher in the already-compiled libflex, not the dumb-simple
> driver program that calls upon it!

Here is what libflex contains! These two functions:

int yywrap (void)
{
return 1;
}

int main(argc, argv)
int argc;
char *argv[];
{
while (yylex () != 0);
return 0;
}

That's it!

The pattern matcher is generated by flex from the patterns in the flex file,
and is compiled into the code in your lex.yy.c. That contains all the code to
do the buffering of the input, and accumulation of tokens and so on. If that
module is not optimized, its performance will suffer.

By the way, I had no idea that libflex contains so little!

What, and to think that on a previous job I wasted time tweaking the Flex build
system to properly cross-compile this stupid thing for the target,
while keeping the compilation of flex program for the build machine where the
toolchain runs ... argh!

If the program already has a main, that libflex main is useless, and anyone can
just stick that dummy yywrap into their flex program. Argh!

Eric Sosman

Oct 24, 2014, 10:35:20 PM
On 10/24/2014 9:03 PM, Kaz Kylheku wrote:
> On 2014-10-25, Eric Sosman <eso...@comcast-dot-net.invalid> wrote:
>>[...]
>> - Con: Optimizing what amounts to a `switch' statement is a
>> waste of time; what really needs to be optimized is the pattern
>> matcher in the already-compiled libflex, not the dumb-simple
>> driver program that calls upon it!
>
> Here is what libflex contains! These two functions:
>[...]
> That's it!
>
> The pattern matcher is generated by flex from the patterns in the flex file,
> and is compiled into the code in your lex.yy.c. That contains all the code to
> do the buffering of the input, and accumulation of tokens and so on. If that
> module is not optimized, its performance will suffer.

Sonuvagun. Who'd a thunk it? (On the evidence, I most clearly
did not thunk it; "it stands to reason" that the pattern-matching would
be in the big ol' library, not in the generated code. Just another
example of how "it stands to reason" can be unreasonably wrong ...)

Thanks for the research!

--
eso...@comcast-dot-net.invalid

Nobody

Oct 24, 2014, 11:04:51 PM
On Fri, 24 Oct 2014 22:35:12 -0400, Eric Sosman wrote:

> Sonuvagun. Who'd a thunk it? (On the evidence, I most clearly
> did not thunk it; "it stands to reason" that the pattern-matching would be
> in the big ol' library, not in the generated code. Just another example
> of how "it stands to reason" can be unreasonably wrong ...)

The programs generated by flex aren't supposed to have any dependencies
(other than the standard C library) beyond those specifically chosen by
the user.

libfl is just a utility library so that the flex source file can hold the
entire program, without needing a separate .c file for two simple
functions.

Nobody

Oct 24, 2014, 11:32:08 PM
On Fri, 24 Oct 2014 18:17:22 +0200, Noob wrote:

> Consider a large number of sed replacement commands, stored in a file.
>
> sed s@aaaa@bbb@
> sed s@cccccc@ddd@
> ...

What are the leading "sed"s doing there? Is this a sed file or a shell
script?

> I was expecting sed to populate some kind of hash table, and perform the
> look-ups in O(1) but I must be mistaken, considering the large run-time.
>
> So, I'm wondering...
>
> 1) how does sed handle a large number of replacement commands?

Sequentially.

Bear in mind:

1. sed has other commands besides "s". It even has nested blocks.

2. sed doesn't stop after the first substitution. Each substitution will
modify the current pattern space on which subsequent commands ("s" or
otherwise) operate.

3. The patterns in the replacement aren't string literals, they're
regexps. I'm not sure how you would match the input against a sequence of
regexps using a hash table.

IOW: sed scripts aren't limited to a sequence of mutually-exclusive
substitutions where the patterns are all literals, and sed isn't
optimised for that specific case.

> 2) is there a better approach for this type of operation?

As others have suggested, lex will be more efficient. And a lot more work.

If you can find a good regex library, you could do the same thing without
having to generate a lex file, compile it to C, compile and link the C.

What you want is something which can match alternatives (a|b|c|...) and
give you the index of the one which matched. This would be relatively
simple to implement, but I haven't found one which does this.
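For what it's worth, Python's `re` module can report this: wrap each alternative in its own capturing group and the match object's `lastindex` gives the (1-based) number of the group that matched. A sketch with made-up alternatives:

```python
import re

alts = ["foo", "bar", "baz"]          # hypothetical search strings

# One capturing group per alternative; m.lastindex then identifies
# which alternative matched.
rx = re.compile("|".join("(%s)" % re.escape(a) for a in alts))

m = rx.search("a bar of soap")
which = m.lastindex - 1               # 0-based index into alts
```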

Paul

Oct 24, 2014, 11:53:08 PM
In a group like this, wouldn't it be better to define
the problem first, rather than dictate the solution ?

Someone might spot a property of your problem, that
makes a particular language "the best" for it.

As well, sometimes when you're forced to write out
a problem, the answer becomes self-evident. I used to
find, for the problems I was called on to solve at
work, that by the time I'd followed some rote solution
path, I'd have a "moment of clarity", a "Doh" moment,
and realize the problem had a much simpler solution.

1) The problem could be a compression problem. 3000
substitutions sounds like a lot, otherwise.

2) Are the substitutions to be applied to whole tokens ?

cat doc at catcher

That would be four tokens. A substitution for "cat"
would only process the first one, if you were looking
for an exact match for the token.

3) Is the substitution to be applied over and over again,
until no instances of the string exist ? Or just the first
time you spot it ? With 3000 substitutions, there could be
some overlap.

s@aaa@aa@

Input: aaaaa
Output (applied until no more matches are evident): aaaa --> aaa --> aa

Input: aaaaa
Output (replace first instance, no further match): aaaa
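The two behaviours can be sketched in Python (illustrative only, using plain string replacement rather than regexps):

```python
def replace_first(s, old, new):
    # Replace only the first occurrence, then stop.
    return s.replace(old, new, 1)

def replace_fixpoint(s, old, new):
    # Re-apply the substitution until no occurrence remains.
    # (Never terminates if `new` still contains `old`.)
    while old in s:
        s = s.replace(old, new)
    return s
```

With the rule s@aaa@aa@, replace_first("aaaaa", "aaa", "aa") stops at "aaaa", while replace_fixpoint keeps going until only "aa" remains.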

There are some languages that have associative arrays, but
they quickly lose their utility if the problem set isn't
the right one.

Paul

Noob

Oct 25, 2014, 2:59:51 AM
On 25/10/2014 05:32, Nobody wrote:

> On Fri, 24 Oct 2014 18:17:22 +0200, Noob wrote:
>
>> Consider a large number of sed replacement commands, stored in a file.
>>
>> sed s@aaaa@bbb@
>> sed s@cccccc@ddd@
>> ...
>
> What are the leading "sed"s doing there? Is this a sed file or a shell
> script.

Good catch. command_file contains only replacement rules:

s@aaaa@bbb@
s@cccccc@ddd@
...

>> I was expecting sed to populate some kind of hash table, and perform the
>> look-ups in O(1) but I must be mistaken, considering the large run-time.
>>
>> So, I'm wondering...
>>
>> 1) how does sed handle a large number of replacement commands?
>
> Sequentially.
>
> Bear in mind:
>
> 1. sed has other commands beside "s". It even has nested blocks.
>
> 2. sed doesn't stop after the first substitution. Each substitution will
> modify the current pattern space on which subsequent commands ("s" or
> otherwise) operate.
>
> 3. The patterns in the replacement aren't string literals, they're
> regexps. I'm not sure how you would match the input against a sequence of
> regexps using a hash table.
>
> IOW: sed scripts aren't limited to a sequence of mutually-exclusive
> substitutions where the patterns are all literals, and sed isn't
> optimised for that specific case.
>
>> 2) is there a better approach for this type of operation?
>
> As others have suggested, lex will be more efficient. And a lot more work.
>
> If you can find a good regex library, you could do the same thing without
> having to generate a lex file, compile it to C, compile and link the C.
>
> What you want is something which can match alternatives (a|b|c|...) and
> give you the index of the one which matched. This would be relatively
> simple to implement, but I haven't found one which does this.

I'd like to find a solution that doesn't require a compiler, because it is
intended to be used by non-technical people on different platforms.

I have very little experience with text-processing languages (sed, awk, perl)
so I'm keeping an open mind about suggestions.

I know that bash (among others) implements associative arrays. I was thinking
I could build my dictionary using that. What do you think?
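A rough sketch of that idea in bash 4+ (assuming literal, whole-token replacements only; the two rules here are hypothetical):

```shell
declare -A repl=( [aaaa]=bbb [cccccc]=ddd )   # hypothetical rule table

translate() {
    local out="" word
    for word in $1; do                  # word-splitting: whole tokens only
        out+="${repl[$word]:-$word} "   # O(1) hash lookup, fall back to word
    done
    printf '%s\n' "${out% }"
}

translate "x aaaa y cccccc"   # -> x bbb y ddd
```

Note this only handles whole whitespace-delimited tokens; substring replacement, which is what sed's s command performs, would need a different approach.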

Regards.

Noob

Oct 25, 2014, 3:07:03 AM
Hello Kaz,

On 24/10/2014 21:47, Kaz Kylheku wrote:

> If you want a blazing fast solution, and the set of replacements does
> not have to be easy to modify, you can turn it into a Lex program.
> (Compiled to C with lex implementation such as GNU Flex.)

I've used lex and yacc to build toy compilers and parsers.

As I wrote else-thread, I am looking for a scripted solution, because it
is intended to be used by non-technical people on different platforms.
(I fear requiring compilation would raise the bar too much.)

I will try to provide a high-level overview of what I'm actually trying
to accomplish. Maybe someone will come up with a brilliant idea.

Regards.

Eric Sosman

Oct 25, 2014, 8:35:57 AM
You could write a front-end that digests the users' input, then
writes, compiles, and runs a lex engine behind the scenes. Depending
on how much coddling the users need, writing and testing such a thing
might take only an hour or so.

... and could provide a bonus, too: If someday you determine that
lex just isn't right for the job, you could write a purpose-built
gizmo to sit behind the same user-facing tool. They'd get the benefit
of the improved engine without the confusion of learning a new UI.

--
eso...@comcast-dot-net.invalid

Martin Vaeth

Oct 25, 2014, 9:40:10 AM
Noob <ro...@127.0.0.1> wrote:
>> What you want is something which can match alternatives (a|b|c|...) and
>> give you the index of the one which matched. This would be relatively
>> simple to implement, but I haven't found one which does this.
>
> I'd like to find a solution that doesn't require a compiler

In perl you could do this easily:
You could use a hash for the replacement like:

#!/usr/bin/perl -p
BEGIN {
$r{a}='x';
$r{b}='y';
$r{c}='z';
};
s/(a|b|c)/$r{$1}/g;

Kaz Kylheku

Oct 25, 2014, 11:04:54 AM
On 2014-10-25, Noob <ro...@127.0.0.1> wrote:
> Hello Kaz,
>
> On 24/10/2014 21:47, Kaz Kylheku wrote:
>
>> If you want a blazing fast solution, and the set of replacements does
>> not have to be easy to modify, you can turn it into a Lex program.
>> (Compiled to C with lex implementation such as GNU Flex.)
>
> I've used lex and yacc to build toy compilers and parsers.
>
> As I wrote else-thread, I am looking for a scripted solution, because it
> is intended to be used by non-technical people on different platforms.
> (I fear requiring compilation would raise the bar too much.)

This can be scripted. In fact, you can make it a self-compiling hash-bang
script; see the C solution here:

rosettacode.org/wiki/Multiline_shebang#C

With some languages you can't even tell if things are being compiled or
interpreted. Some Lisp implementations compile every expression they read to
machine code; some interpret unless compile or compile-file is used.

Many scripting languages these days compile their inputs to some abstract
syntax tree at least; some to a byte code.

For practical purposes, "compiling is scripting".

Nicolas George

Oct 25, 2014, 11:30:23 AM
Kaz Kylheku, in message <201410250...@kylheku.com>, wrote:
> This can be scripted. In fact, you can make it a self-compiling hash-bang
> script; see the C solution here:

Where do you find your multi-target C compiler written in shell?

Kaz Kylheku

Oct 25, 2014, 1:03:02 PM
I don't know where you're getting this "written in shell" requirement from.

If you want to run programs in language X, you generally need an
implementation of language X.

See the link for how it works.

The shell is used only as a shim because the C compiler cannot be *directly*
used as a #! interpreter; it doesn't have a convenient command line option for
this like awk's "-f" or whatever.

Nicolas George

Oct 25, 2014, 1:11:37 PM
Kaz Kylheku, in message <201410250...@kylheku.com>, wrote:
> I don't know where you're getting this "written in shell" requirement from.

I read other people's messages. You should try it sometime, it is
instructive.

In this instance, it was this part of sentence:

"intended to be used by non-technical people on different platforms."

Non-technical people tend to use, if they use Unix at all, apple's expensive
plastic imitation, or at best Ubuntu. And guess what? They do not have gcc
and libc6-dev installed, or whatever the apple equivalent is called. Nor lex
or flex, for that matter.

> See link for how it it works.

I do not need to see a link to see how it works, thank you very much.

Barry Margolin

Oct 25, 2014, 1:12:38 PM
In article <201410250...@kylheku.com>,
Kaz Kylheku <k...@kylheku.com> wrote:

> On 2014-10-25, Nicolas George <nicolas$geo...@salle-s.org> wrote:
> > Kaz Kylheku, in message <201410250...@kylheku.com>, wrote:
> >> This can be scripted. In fact, you can make it a self-compiling hash-bang
> >> script; see the C solution here:
> >
> > Where do you find your multi-target C compiler written in shell?
>
> I don't know where you're getting this "written in shell" requirement from.

The OP seems to want a solution that could be used on systems without a
C compiler.

Noob

Oct 25, 2014, 2:38:20 PM
On 25/10/2014 19:12, Barry Margolin wrote:

> The OP seems to want a solution that could be used on systems without a
> C compiler.

(Thinking out loud...)

I think a handful of people might want to run my tool on Windows.
If my solution relies on bash, sed, awk, perl, I still need to
provide the run-time somehow (cygwin?)

Maybe there is no simple way to support Windows, and I'll just
write for Linux, and direct Windows users to Cygwin...

What do you guys think?

(I still need to explain what I'm trying to do, coming soon.)

Regards.

Melzzzzz

Oct 25, 2014, 2:50:16 PM
On Sat, 25 Oct 2014 20:38:13 +0200
Noob <ro...@127.0.0.1> wrote:

> On 25/10/2014 19:12, Barry Margolin wrote:
>
> > The OP seems to want a solution that could be used on systems
> > without a C compiler.
>
> (Thinking out loud...)
>
> I think a handful of people might want to run my tool on Windows.
> If my solution relies on bash, sed, awk, perl, I still need to
> provide the run-time somehow (cygwin?)
>
> Maybe there is no simple way to support Windows, and I'll just
> write for Linux, and direct Windows users to Cygwin...
>
> What do you guys think?

What about writing it in C and compiling it for Windows/Linux/OSX
and just providing binaries...

--
Manjaro all the way!
http://manjaro.org/

Nicolas George

Oct 25, 2014, 3:32:56 PM
Noob, in message <m2gqmi$4em$1...@dont-email.me>, wrote:
> What do you guys think?

I think that you should do what was suggested a long time ago, which is
expose the actual frigging problem you want to solve and not a half-baked
solution.

Volker Birk

Oct 25, 2014, 4:12:43 PM
Noob <ro...@127.0.0.1> wrote:
> (I still need to explain what I'm trying to do, coming soon.)

Without knowing that, it will be difficult to give advice. From what
you've said so far, I'm assuming a scripting language could help,
one which is available on each platform you want to support.

Yours,
VB.
--
“Before Snowden, that was a blog of conspiracy theories.
After Snowden, it is a security news ticker.
With the content unchanged...”
Marc Stibane on Fefe's blog

Kaz Kylheku

Oct 25, 2014, 7:41:24 PM
On 2014-10-25, Nicolas George <nicolas$geo...@salle-s.org> wrote:
> Kaz Kylheku, in message <201410250...@kylheku.com>, wrote:
>> I don't know where you're getting this "written in shell" requirement from.
>
> I read other people's messages. You should try it sometime, it is
> instructive.
>
> In this instance, it was this part of sentence:
>
> "intended to be used by non-technical people on different platforms."
>
> Non-technical people tend to use, if they use Unix at all, apple's expensive
> plastic imitation, or at best Ubuntu. And guess what? They do not have gcc
> and libc6-dev installed, or whatever the apple equivalent is called. Nor ler
> or flex, for that matter.

Nobody said anything about handing the non-technical people an incompletely
delivered solution with dependencies that they must hunt down themselves.

That you think that, for a given solution, those dependencies are unacceptable
for whatever reason (size ...) is a different matter.

Kaz Kylheku

Oct 25, 2014, 7:44:39 PM
On 2014-10-25, Barry Margolin <bar...@alum.mit.edu> wrote:
> In article <201410250...@kylheku.com>,
> Kaz Kylheku <k...@kylheku.com> wrote:
>
>> On 2014-10-25, Nicolas George <nicolas$geo...@salle-s.org> wrote:
>> > Kaz Kylheku, in message <201410250...@kylheku.com>, wrote:
>> >> This can be scripted. In fact, you can make it a self-compiling hash-bang
>> >> script; see the C solution here:
>> >
>> > Where do you find your multi-target C compiler written in shell?
>>
>> I don't know where you're getting this "written in shell" requirement from.
>
> The OP seems to want a solution that could be used on systems without a
> C compiler.

That may be so, but in my posting there was not such requirement.
A "self-compiling hash-bang script written in C" does not require
a "multi-target C compiler written in shell".

Kaz Kylheku

Oct 25, 2014, 7:58:28 PM
On 2014-10-25, Noob <ro...@127.0.0.1> wrote:
> On 25/10/2014 19:12, Barry Margolin wrote:
>
>> The OP seems to want a solution that could be used on systems without a
>> C compiler.
>
> (Thinking out loud...)
>
> I think a handful of people might want to run my tool on Windows.
> If my solution relies on bash, sed, awk, perl, I still need to
> provide the run-time somehow (cygwin?)

The MinGW environment has all these tools. (I just checked.)
So it is possible to do without Cygwin.

I haven't done a brand new install in some time so it might not
be the latest, but my MinGW installation has perl 5.8.8 and gawk 3.1.8.

(It also has a C compiler and GNU flex. That silly -lfl library is not
in the right path in MinGW. It's in MinGW's /usr/lib, which is not
searched by the linker! -L/usr/lib -lfl works for me.)

> Maybe there is no simple way to support Windows, and I'll just
> write for Linux, and direct Windows users to Cygwin...

If they are non-technical, a more turnkey solution is probably appropriate.

You know your users best, of course.

Nicolas George

Oct 26, 2014, 5:04:03 AM
Kaz Kylheku, in message <201410251...@kylheku.com>, wrote:
> Nobody said anything about handing the non-technical people an incompletely
> delivered solution with dependencies that they must hunt down themselves.

You suggested to give them a shell script bundling the dependencies, which,
given the amount of dependencies your suggested solution requires (C
compiler and build environment for multiple platforms), is even more
ridiculous.

EOT for me.

Jorgen Grahn

Oct 26, 2014, 5:18:11 AM
Cool! I've used Perl for 15 years, and never suspected you could do
that -- i.e. have a replacement that's evaluated at runtime, using
more than just $1 and so on. Was it always like that, or is it a
newer feature?

I note though that it's not quite the same as the original sed script,
where each line goes through 3000 separate replacement passes.

salix:~% cat /tmp/z
#!/usr/bin/perl -p
BEGIN {
$r{a}='b';
$r{b}='c';
$r{c}='z';
};
s/(a|b|c)/$r{$1}/g;
salix:~% perl /tmp/z
abc
bcz

salix:~% sed 's/a/b/g; s/b/c/g; s/c/z/g'
abc
zzz

It seems likely, though, that he really wants your result in such a
case.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Richard Kettlewell

Oct 26, 2014, 5:33:48 AM
Jorgen Grahn <grahn...@snipabacken.se> writes:
> Martin Vaeth wrote:
>> You could use a hash for the replacement like:
>>
>> #!/usr/bin/perl -p
>> BEGIN {
>> $r{a}='x';
>> $r{b}='y';
>> $r{c}='z';
>> };
>> s/(a|b|c)/$r{$1}/g;
>
> Cool! I've used Perl for 15 years, and never suspected you could do
> that -- i.e. have a replacement that's evaluated at runtime, using
> more than just $1 and so on. Was it always like that, or is it a
> newer feature?

It’s had that feature for a very long time.

It was the first approach I thought of. However the question was about
performance, not just how to translate into another language, and
whether it’s significantly faster than the original sed implementation
depends on the details of how Perl’s RE implementation works.

--
http://www.greenend.org.uk/rjk/

Jorgen Grahn

Oct 26, 2014, 5:57:31 AM
The OP will have to do the benchmarking (I'm too lazy) but I'd expect
it to be so fast that I/O becomes the bottleneck. One pass over each
line, instead of 3000. I also expect Perl to generate optimal code for
(a|b|c|...) even if there are 3000 elements inside it -- REs are Perl's
killer feature, after all.

(Then there's the question about what he's really trying to do, as
others wrote. And what he wants to do about overlaps.)

Richard Kettlewell

Oct 26, 2014, 6:09:00 AM
Jorgen Grahn <grahn...@snipabacken.se> writes:
> On Sun, 2014-10-26, Richard Kettlewell wrote:

>> It was the first approach I thought of. However the question was about
>> performance, not just how to translate into another language, and
>> whether it's significantly faster than the original sed implementation
>> depends on the details of how Perl's RE implementation works.
>
> The OP will have to do the benchmarking (I'm too lazy) but I'd expect
> it to be so fast that I/O becomes the bottleneck. One pass over each
> line, instead of 3000. I also expect Perl to generate optimal code for
> (a|b|c|...) even if there are 3000 elements inside it -- REs are Perl's
> killer feature, after all.

Have a look at http://swtch.com/~rsc/regexp/regexp1.html then...

--
http://www.greenend.org.uk/rjk/

Jorgen Grahn

Oct 26, 2014, 7:08:54 AM
On Sun, 2014-10-26, Richard Kettlewell wrote:
> Jorgen Grahn <grahn...@snipabacken.se> writes:
>> On Sun, 2014-10-26, Richard Kettlewell wrote:
>
>>> It was the first approach I thought of. However the question was about
>>> performance, not just how to translate into another language, and
>>> whether it's significantly faster than the original sed implementation
>>> depends on the details of how Perl's RE implementation works.
>>
>> The OP will have to do the benchmarking (I'm too lazy) but I'd expect
>> it to be so fast that I/O becomes the bottleneck. One pass over each
>> line, instead of 3000. I also expect Perl to generate optimal code for
>> (a|b|c|...) even if there are 3000 elements inside it -- REs are Perl's
>> killer feature, after all.
>
> Have a look at http://swtch.com/~rsc/regexp/regexp1.html then...

Ok, I guess I'll have to stop being lazy, then. Experiment:

% cp /usr/share/dict/words .
% grep -v "'" words | grep -A2999 gayness >3k

Then I wrote a sed script which replaced the 3000 words in '3k' with
"foo" (OP's approach) and another one in Perl (the approach above).
Then I applied them on 'words'.

Sed : 225.10s
Perl: 0.23s

I.e. the Perl solution was a thousand times faster.

gamo

Oct 26, 2014, 9:28:36 AM
On 26/10/14 at 11:08, Richard Kettlewell wrote:
Interesting, but the perl example that generates a core dump
does not pose a problem for a modern perl, and the time spent
is negligible.

--
http://www.telecable.es/personales/gamo/

Martin Vaeth

Oct 26, 2014, 11:09:28 AM
Richard Kettlewell <r...@greenend.org.uk> wrote:
>
> Have a look at http://swtch.com/~rsc/regexp/regexp1.html then...

Here it was a sort of urban legend that perl converts to a DFA if possible,
that is, if no particular features like backreferences are used.
I am shocked to read that this is not the case.
This matches my previous experience, where a single /ABC|ACD/
in perl took considerably longer than two explicit string searches.
This really changes my opinion about perl...

Kaz Kylheku

Oct 26, 2014, 12:12:39 PM
I didn't write anything about bundling via a shell script, or about combining
deliverables for multiple platforms into a single package.

Nobody

Oct 28, 2014, 12:17:19 AM
On Sun, 26 Oct 2014 11:08:51 +0000, Jorgen Grahn wrote:

> I.e. the Perl solution was a thousand times faster.

... because the Perl solution is a solution to a much simpler problem than
the one the OP actually posed.

It neither confirms nor refutes Jorgen's statement that:

> I also expect Perl to generate optimal code for (a|b|c|...) even if
> there are 3000 elements inside it

Perl's REs may be "good enough" in this case, but historically its
implementation hasn't been "optimal".

Assuming that the Perl solution solves the OP's actual problem (i.e. each
match will be replaced exactly once; the replacement will not undergo any
substitutions), it's possible to do somewhat better in sed. For a start,
replacing

s/foo/bar/

with
/foo/ { s//bar/ ; b }

will cause a match to skip all remaining commands and immediately begin
the next cycle. However, I believe that this will still test each case
sequentially.

Jorgen Grahn

Oct 28, 2014, 11:50:56 AM
On Tue, 2014-10-28, Nobody wrote:
> On Sun, 26 Oct 2014 11:08:51 +0000, Jorgen Grahn wrote:
>
>> I.e. the Perl solution was a thousand times faster.
>
> ... because the Perl solution is a solution to a much simpler problem than
> the one the OP actually posed.
>
> It neither confirms nor refutes Jorgen's statement that:
>
>> I also expect Perl to generate optimal code for (a|b|c|...) even if
>> there are 3000 elements inside it
>
> Perl's REs may be "good enough" in this case, but historically its
> implementation hasn't been "optimal".

All that might be true, but I tried to look at the practical side of
this: the OP had a program which took five minutes to run, and his
users suffered. If they don't need the features of the original sed
script (and that seems rather likely) that suffering can go away.

I might be overestimating the quality of Perl's RE engine, but
personally I'm not all that interested in that. The way I've used it,
it has always been fast enough not to be an issue (and I often process
extremely large files).

Rainer Weikusat

Oct 28, 2014, 12:53:19 PM
Jorgen Grahn <grahn...@snipabacken.se> writes:
> On Tue, 2014-10-28, Nobody wrote:
>> On Sun, 26 Oct 2014 11:08:51 +0000, Jorgen Grahn wrote:
>>> I.e. the Perl solution was a thousand times faster.
>>
>> ... because the Perl solution is a solution to a much simpler problem than
>> the one the OP actually posed.
>>
>> It neither confirms nor refutes Jorgen's statement that:
>>
>>> I also expect Perl to generate optimal code for (a|b|c|...) even if
>>> there are 3000 elements inside it
>>
>> Perl's REs may be "good enough" in this case, but historically its
>> implementation hasn't been "optimal".
>
> All that might be true, but I tried to look at the practical side of
> this: the OP had a program which took five minutes to run, and his
> users suffered. If they don't need the features of the original sed
> script (and that seems rather likely) that suffering can go away.
>
> I might be overestimating the quality of Perl's RE engine,

[...]

JFTR: According to 'sources on the web' (eg,
http://taint.org/2006/07/07/184022a.html), it does support creating
search tries from sets of 'static' alternatives.