Need a shell script

Geico Caveman

Nov 7, 2008, 12:18:37 PM

Hello

I am looking for a shell script that will do the following :

1. Scan a specified text file.
2. Find all occurrences of certain string patterns and replace them as follows :

a) \added[<text1>]{<text2>} by text2.
b) \replaced[<text1>]{<text2>}{<text3>} by text2.
c) \deleted[<text1>]{<text2>} by blank.

The reason I pose this question is that I am not a sed / awk expert (the
syntax always daunted me).

Thanks.

PS : Yes, the above is LaTeX markup.

pk

Nov 7, 2008, 12:42:13 PM

On Friday 7 November 2008 18:18, Geico Caveman wrote:

> Hello
>
> I am looking for a shell script that will do the following :
>
> 1. Scan a specified text file.
> 2. Find all occurrences of certain string patterns and replace them as follows :
>
> a) \added[<text1>]{<text2>} by text2.
> b) \replaced[<text1>]{<text2>}{<text3>} by text2.
> c) \deleted[<text1>]{<text2>} by blank.

sed 's|\\added\[<text1>\]{<text2>}|text2|g
s|\\replaced\[<text1>\]{<text2>}{<text3>}|text2|g
s|\\deleted[<text1>]{<text2>}||g' file.tex

in case "text2" above is not a fixed string, but should be the same text
that appears inside {< >} in the pattern, then

sed 's|\\added\[<text1>\]{<\(text2\)>}|\1|g
s|\\replaced\[<text1>\]{<\(text2\)>}{<text3>}|\1|g
s|\\deleted[<text1>]{<text2>}||g' file.tex

but of course, if you posted an actual sample input with expected output it
would be better.

pk

Nov 7, 2008, 12:45:27 PM

On Friday 7 November 2008 18:42, pk wrote:

> s|\\deleted[<text1>]{<text2>}||g' file.tex

should be

s|\\deleted\[<text1>\]{<text2>}||g' file.tex

and

> s|\\deleted[<text1>]{<text2>}||g' file.tex

should be

s|\\deleted\[<text1>\]{<text2>}||g' file.tex

Geico Caveman

Nov 7, 2008, 2:54:30 PM

pk wrote:

Thanks.

Anything in <> is not a fixed string (just following common syntactical
conventions - characters "<" and ">" do not actually occur, I am just using
them to indicate that the textn arguments are variable strings of
unpredictable lengths - all you know is that they will either be delimited
by {} or by [] - see syntax).

Here is a sample input :

\replaced[GC]{the increase of}{the decrease of}

which should output "the increase of" (without the quotes of course).

Huibert Bol

Nov 7, 2008, 3:22:30 PM

Geico Caveman wrote:

> I am looking for a shell script that will do the following :
>
> 1. Scan a specified text file.
> 2. Find all occurrences of certain string patterns and replace them as follows :
>
> a) \added[<text1>]{<text2>} by text2.
> b) \replaced[<text1>]{<text2>}{<text3>} by text2.
> c) \deleted[<text1>]{<text2>} by blank.

perl -0777 -pe 's/\\added\[.*?\]{(.*?)}/\1/g;
s/\\replaced\[.*?\]{(.*?)}{.*?}/\1/g;
s/\\deleted\[.*?\]{.*?}//g'

(You can drop the -0777 if there are no newlines in <text?>.)

caveat:

1. <text1> cannot contain ]'s
2. neither <text2> nor <text3> can contain }'s

--
Huibert
"Hey! HEY! Curious cat, here!" -- Krosp I (GG)

pk

Nov 8, 2008, 6:57:35 AM

On Friday 7 November 2008 20:54, Geico Caveman wrote:

>>> a) \added[<text1>]{<text2>} by text2.
>>> b) \replaced[<text1>]{<text2>}{<text3>} by text2.
>>> c) \deleted[<text1>]{<text2>} by blank.
>

> Anything in <> is not a fixed string (just following common syntactical
> conventions - characters "<" and ">" do not actually occur, I am just
> using them to indicate that the textn arguments are variable strings of
> unpredictable lengths - all you know is that they will either be delimited
> by {} or by [] - see syntax).
>
> Here is a sample input :
>
> \replaced[GC]{the increase of}{the decrease of}
>
> which should output "the increase of" (without the quotes of course).

Ok, then:

sed 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g
s|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g
s|\\deleted[[^]]*]{[^}]*}||g' file.tex

This assumes that no } appears inside {..}, and no ] inside [..] (not even
escaped or in any other form). If you need that, then it becomes much more
complicated.

pk

Nov 8, 2008, 6:58:44 AM

On Friday 7 November 2008 21:22, Huibert Bol wrote:

> perl -0777 -pe 's/\\added\[.*?\]{(.*?)}/\1/g;
> s/\\replaced\[.*?\]{(.*?)}{.*?}/\1/g;
> s/\\deleted\[.*?\]{.*?}//g'

\1 better written as $1 at -e line 1.

Huibert Bol

Nov 8, 2008, 7:33:43 AM

pk <p...@pk.invalid> wrote:

(Line 2 as well.)

Oops, too much SED. (I.e., force of habit.)

Geico Caveman

Nov 10, 2008, 2:49:09 PM

pk wrote:

Thanks. I will try this out and see if it works. I was wise to avoid
figuring this syntax out by myself :)

pk

Nov 10, 2008, 3:19:20 PM

On Monday 10 November 2008 20:49, Geico Caveman wrote:

>> Ok, then:
>>
>> sed 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g
>> s|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g
>> s|\\deleted[[^]]*]{[^}]*}||g' file.tex
>>
>> This assumes that no } appears inside {..}, and no ] inside [..] (not
>> even escaped or in any other form). If you need that, then it becomes
>> much more complicated.
>
> Thanks. I will try this out and see if it works. I was wise to avoid
> figuring this syntax out by myself :)

btw, there is an error in the last one, should be

s|\\deleted\[[^]]*\]{[^}]*}||g' file.tex
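For anyone who wants to try the three rules together, here is a minimal sketch with the brackets of the \deleted rule escaped. The function name strip_changes is made up for illustration, and the same assumptions apply as before: no ] inside [..], no } inside {..}, and each macro sits on a single line.

```shell
# Sketch only: strip \added, \replaced and \deleted markup from stdin.
# Assumes no ] inside [...], no } inside {...}, and one-line macros.
strip_changes() {
    sed -e 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g' \
        -e 's|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g' \
        -e 's|\\deleted\[[^]]*\]{[^}]*}||g'
}

# Sample input taken from the thread; printf '%s\n' keeps the backslash literal.
printf '%s\n' '\replaced[GC]{the increase of}{the decrease of}' | strip_changes
# prints: the increase of
```

To process a whole file, something like strip_changes < file.tex > clean.tex should do, where clean.tex is just a placeholder name.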

Pete

Nov 10, 2008, 10:03:25 PM

In article <gf26dm$kvj$1...@aioe.org>,

Geico Caveman <spammer...@spam.invalid> wrote:
>pk wrote:
>
>> On Friday 7 November 2008 18:18, Geico Caveman wrote:
>>
>>> Hello
>>>
>>> I am looking for a shell script that will do the following :
>>>
>>> 1. Scan a specified text file.
>>> 2. Find all occurrences of certain string patterns and replace them as follows :
>>>
>>> a) \added[<text1>]{<text2>} by text2.
>>> b) \replaced[<text1>]{<text2>}{<text3>} by text2.
>>> c) \deleted[<text1>]{<text2>} by blank.
>>

>> [sed scripts snipped]

>
>Anything in <> is not a fixed string (just following common syntactical
>conventions - characters "<" and ">" do not actually occur, I am just using
>them to indicate that the textn arguments are variable strings of
>unpredictable lengths - all you know is that they will either be delimited
>by {} or by [] - see syntax).
>
>Here is a sample input :
>
>\replaced[GC]{the increase of}{the decrease of}
>
>which should output "the increase of" (without the quotes of course).

To be contrarian (:-)), I would use a little program I wrote a few years
ago for this rather than sed. Its main advantage is that it's not line-bound,
so the above text fragments could contain newlines. It uses Rob Pike's old
"Structural Regular Expression" scanning, reading text from the source file
as it needs to, rather than a line at a time. (Regular expression scanning
in perl, ruby, and so on can easily handle newlines too, but a single
command line -- even if massive! -- can be more convenient.)

The program is called 'matt' (< http:www.goodeveCA.net/matt/ > if you're
interested... (:-)). The command to do the above is a single -- long! --
line rather than the split sed regexs but the idea is similar:

matt -vas '\\added\[[^\]]*\]{([^}]*)}|\\replaced\[[^\]]*\]{([^}]*)}{[^}]*}|\\deleted\[[^\]]*\]{[^}]*}' -o '$1$2' file.tex

The main regular expression is obvious, containing all three alternatives,
with sub-expressions to be extracted marked by parens in the usual way.
The '-o...' parameter says to output the first or second subexpression
if they have been matched (but not any other part of the overall match).
In '-vas', the 'v' means to output unmatched parts unchanged, 'a' means
to match newlines -- in this case with the '[^}]' patterns, and 's' is
"shortest".

The regular expressions it handles are not quite as extensive as perl
or ruby (I didn't extend the code much, although it is now C++), but
they are perfectly adequate for most tasks. No doubt it's a matter of
familiarity, but I've made a lot of use of it -- for rewriting HTML,
or whatever -- over the years. Be nice if someone else found it
useful too!
-- Pete --

--
============================================================================
The address in the header is a Spam Bucket -- don't bother replying to it...
(If you do need to email, replace the account name with my true name.)
============================================================================

Geico Caveman

Nov 11, 2008, 1:52:07 PM

pk wrote:

Got that. I am now trying to extend this script to do a few additional
things -

- replace all instances of \marginpar{<text1>} by blank.

I added the following code fragment to the script :

s|\\marginpar{.*}||g

This is not working out (some unmatched sections of the text are
disappearing). I have some familiarity with the :1,$ s/text1/text2/gc
functionality of vim, so was just trying to see if that would work.

What am I doing wrong above ?

pk

Nov 11, 2008, 2:52:00 PM

On Tuesday 11 November 2008 19:52, Geico Caveman wrote:

>> s|\\deleted\[[^]]*\]{[^}]*}||g' file.tex
>
> Got that. I am now trying to extend this script to do a few additional
> things -
>
> - replace all instances of \marginpar{<text1>} by blank.
>
> I added the following code fragment to the script :
>
> s|\\marginpar{.*}||g
>
> This is not working out (some unmatched sections of the text are
> disappearing). I have some familiarity with the :1,$ s/text1/text2/gc
> functionality of vim, so was just trying to see if that would work.
>
> What am I doing wrong above ?

You have hit the so-called "greedy" behavior of many regular expression
engines. Greedy, in this context, means that when an expression can match
in many ways, the longest match is always taken.

What that means in practice is that if you do

echo 'abc{blah} def{foobar} ghi' | sed 's/{.*}//'

thinking that it will delete just "{blah}", it will instead delete the
whole "{blah} def{foobar}" (which is longer than "{blah}").
In this (and your) case, if we assume that } cannot appear inside { .. },
the fix is easy:

echo 'abc{blah} def{foobar} ghi' | sed 's/{[^}]*}//'

that is, select {, all non-} characters that follow, and } (if you look at
the solutions that were proposed to you, they all use that trick, subject
to the same condition).
There may be other cases where the fix is not as easy (for example, when the
pattern ends with more than a single character).
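A side-by-side sketch of the two behaviors on a \marginpar line. The sample line and the two function names are invented for illustration, and it assumes, as above, that } never appears inside the braces.

```shell
# Sketch only: greedy ".*" versus a negated character class.
strip_marginpar_greedy() { sed 's|\\marginpar{.*}||g'; }
strip_marginpar()        { sed 's|\\marginpar{[^}]*}||g'; }

line='a\marginpar{note} b {\em c} d'

# Greedy: .* runs up to the LAST } on the line, so " b {\em c" is lost too.
printf '%s\n' "$line" | strip_marginpar_greedy
# prints: a d

# Negated class: [^}]* stops at the FIRST }, so only the macro is removed.
printf '%s\n' "$line" | strip_marginpar
# prints: a b {\em c} d
```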

Huibert Bol

Nov 11, 2008, 2:52:02 PM

Geico Caveman wrote:

> - replace all instances of \marginpar{<text1>} by blank.
>
> I added the following code fragment to the script :
>
> s|\\marginpar{.*}||g

> This is not working out (some unmatched sections of the text are
> disappearing). I have some familiarity with the :1,$ s/text1/text2/gc
> functionality of vim, so was just trying to see if that would work.
>
> What am I doing wrong above ?

The '*' operator is greedy. In the example below, .* will match the
underlined text.

\marginpar{<text1>} some {\em more} text
           ^^^^^^^^^^^^^^^^^^^^^^^

You will want to limit the expression to non-} characters:

s|\\marginpar{[^}]*}||g

--
Huibert
"The Commercial Channel! All commercials all the time.
An eternity of useless products to rot your skeevy little mind, forever!"
-- Mike the TV (Reboot)

Bill Marcum

Nov 11, 2008, 3:08:00 PM

On 2008-11-11, Geico Caveman <spammer...@spam.invalid> wrote:
> pk wrote:
>
>

> s|\\marginpar{.*}||g
>
> This is not working out (some unmatched sections of the text are
> disappearing). I have some familiarity with the :1,$ s/text1/text2/gc
> functionality of vim, so was just trying to see if that would work.
>
> What am I doing wrong above ?

Maybe you are getting bitten by "greedy" regular expression matching.
Try this:

s|\\marginpar{[^}]*}||g

--
They can't stop us... we're on a mission from God!
-- The Blues Brothers

Geico Caveman

Nov 11, 2008, 9:33:29 PM

pk wrote:

Ok. I now have a case where there are {} pairs inside the {...}. What would
be a good place to start ?

:)

(Thanks for your continued help).

pk

Nov 12, 2008, 3:26:51 AM

On Wednesday 12 November 2008 03:33, Geico Caveman wrote:

>> sed 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g
>> s|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g
>> s|\\deleted[[^]]*]{[^}]*}||g' file.tex
>>
>> This assumes that no } appears inside {..}, and no ] inside [..] (not
>> even escaped or in any other form). If you need that, then it becomes
>> much more complicated.
>
> Ok. I now have a case where there are {} pairs inside the {...}. What
> would be a good place to start ?

Please paste an actual example of that case. There are some possible
solutions to that one (including using another tool!), but they all depend
on the actual input data. For example, in your case it would also be good
to know if the inner {} are "bare" or somehow escaped, or whether an
arbitrary number of {} can be nested (like eg {...{...{..}...}...} etc.).

For the very simple case in which you can have only a single {} couple
inside an outer {} pair, then this should work:

{[^{]*{[^}]*}[^}]*}

(in short, it's the same old trick, used two times).
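As a rough sketch, this is how that pattern could be dropped into the \marginpar rule. The function name and the sample line are hypothetical, and it assumes exactly one inner {} pair per macro, all on one line; a \marginpar with no inner pair is not matched by this rule and still needs the simpler one.

```shell
# Sketch only: remove \marginpar{...} containing exactly one inner {...} pair.
strip_marginpar_nested() {
    sed 's|\\marginpar{[^{]*{[^}]*}[^}]*}||g'
}

printf '%s\n' 'before \marginpar{see \cite{paper1} again} after' |
    strip_marginpar_nested
# prints: before  after   (two spaces remain where the macro was)

# A macro without an inner pair is left untouched by this rule.
printf '%s\n' '\marginpar{plain} after' | strip_marginpar_nested
# prints: \marginpar{plain} after
```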

Geico Caveman

Nov 12, 2008, 5:56:07 PM

pk wrote:

Here is an example :

\replaced[<text1>]{<text2a>\cite{<text2b>}<text2c>}
{<text3a>\cite{<text3b>}<text3c>}

More concrete example :

\replaced[GC]{Recent results\cite{paper2} illustrate difficulty of
interpreting old data by Doe et. al.\cite{paper1} using Method X.}{This has
been demonstrated\cite{paper1} in the past.}

pk

Nov 12, 2008, 6:26:46 PM

On Wednesday 12 November 2008 23:56, Geico Caveman wrote:

>> For the very simple case in which you can have only a single {} couple
>> inside an outer {} pair, then this should work:
>>
>> {[^{]*{[^}]*}[^}]*}
>>
>> (in short, it's the same old trick, used two times).
>
> Here is an example :
>
> \replaced[<text1>]{<text2a>\cite{<text2b>}<text2c>}
> {<text3a>\cite{<text3b>}<text3c>}
>
> More concrete example :
>
> \replaced[GC]{Recent results\cite{paper2} illustrate difficulty of
> interpreting old data by Doe et. al.\cite{paper1} using Method X.}{This
> has been demonstrated\cite{paper1} in the past.}

I seem to infer from this that you have patterns like this:

{...{..}...{..}...........}

ie, only one level of nesting, but there can be a variable number of inner
{} pairs. This *should* match such strings, but I haven't tested it
thoroughly:

{\([^{}]*{[^{}]*}\)*[^{}]*}

I've included both { and } in the negated character classes, to avoid
matching too much when there are multiple matches per line.
Parsing things like these is something that gets close to sed's (and other
tools) regex engine capability limits.
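A possible way to plug that pattern into the \replaced rule, as a sketch only. The function name and the shortened sample are made up; it assumes one level of nesting, no ] inside [..], and that the whole macro sits on a single line (sed works line by line, so a macro wrapped over several lines would not match).

```shell
# Sketch only: keep the first argument of \replaced when both arguments
# may contain any number of non-nested inner {...} pairs.
strip_replaced_nested() {
    # \1 is the outer group around the whole first argument; the inner
    # group \([^{}]*{[^{}]*}\)* consumes "text{inner}" units one by one.
    sed 's|\\replaced\[[^]]*\]{\(\([^{}]*{[^{}]*}\)*[^{}]*\)}{\([^{}]*{[^{}]*}\)*[^{}]*}|\1|g'
}

printf '%s\n' '\replaced[GC]{new\cite{p2} text\cite{p1}.}{old\cite{p1}.} tail' |
    strip_replaced_nested
# prints: new\cite{p2} text\cite{p1}. tail
```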