I am looking for a shell script that will do the following :
1. Scan a specified text file.
2. Find all occurrences of the following string patterns and replace them as follows :
a) \added[<text1>]{<text2>} by text2.
b) \replaced[<text1>]{<text2>}{<text3>} by text2.
c) \deleted[<text1>]{<text2>} by blank.
The reason I pose this question is that I am not a sed / awk expert (the
syntax always daunted me).
Thanks.
PS : Yes, the above is LaTeX markup.
> Hello
>
> I am looking for a shell script that will do the following :
>
> 1. Scan a specified text file.
> 2. Find all occurences of a string patterns and replaced them as :
>
> a) \added[<text1>]{<text2>} by text2.
> b) \replaced[<text1>]{<text2>}{<text3>} by text2.
> c) \deleted[<text1>]{<text2>} by blank.
sed 's|\\added\[<text1>\]{<text2>}|text2|g
s|\\replaced\[<text1>\]{<text2>}{<text3>}|text2|g
s|\\deleted[<text1>]{<text2>}||g' file.tex
in case "text2" above is not a fixed string, but should be the same text
that appears inside {< >} in the pattern, then
sed 's|\\added\[<text1>\]{<\(text2\)>}|\1|g
s|\\replaced\[<text1>\]{<\(text2\)>}{<text3>}|\1|g
s|\\deleted[<text1>]{<text2>}||g' file.tex
but of course, if you posted an actual sample input with expected output it
would be better.
> s|\\deleted[<text1>]{<text2>}||g' file.tex
should be
s|\\deleted\[<text1>\]{<text2>}||g' file.tex
and
> s|\\deleted[<text1>]{<text2>}||g' file.tex
should be
s|\\deleted\[<text1>\]{<text2>}||g' file.tex
Thanks.
Anything in <> is not a fixed string (just following common syntactical
conventions - characters "<" and ">" do not actually occur, I am just using
them to indicate that the textn arguments are variable strings of
unpredictable lengths - all you know is that they will either be delimited
by {} or by [] - see syntax).
Here is a sample input :
\replaced[GC]{the increase of}{the decrease of}
which should output "the increase of" (without the quotes of course).
> I am looking for a shell script that will do the following :
>
> 1. Scan a specified text file.
> 2. Find all occurences of a string patterns and replaced them as :
>
> a) \added[<text1>]{<text2>} by text2.
> b) \replaced[<text1>]{<text2>}{<text3>} by text2.
> c) \deleted[<text1>]{<text2>} by blank.
perl -0777 -pe 's/\\added\[.*?\]{(.*?)}/\1/g;
s/\\replaced\[.*?\]{(.*?)}{.*?}/\1/g;
s/\\deleted\[.*?\]{.*?}//g'
(You can drop the -0777 if there are no newlines in <text?>.)
caveat:
1. <text1> cannot contain ]'s
2. neither <text2> nor <text3> can contain }'s
--
Huibert
>>> a) \added[<text1>]{<text2>} by text2.
>>> b) \replaced[<text1>]{<text2>}{<text3>} by text2.
>>> c) \deleted[<text1>]{<text2>} by blank.
>
> Anything in <> is not a fixed string (just following common syntactical
> conventions - characters "<" and ">" do not actually occur, I am just
> using them to indicate that the textn arguments are variable strings of
> unpredictable lengths - all you know is that they will either be delimited
> by {} or by [] - see syntax).
>
> Here is a sample input :
>
> \replaced[GC]{the increase of}{the decrease of}
>
> which, should output "the increase of" (without the quotes of course).
Ok, then:
sed 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g
s|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g
s|\\deleted[[^]]*]{[^}]*}||g' file.tex
This assumes that no } appears inside {..}, and no ] inside [..] (not even
escaped or in any other form). If you need that, then it becomes much more
complicated.
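For reference, here is a minimal sketch of the three substitutions wrapped in a shell function and run on the sample input. The function name strip_changes is made up for illustration. The \deleted expression is written with both brackets escaped (\[ and \]), which is my assumption about what was intended. The same restriction applies: no ] inside [..] and no } inside {..}.

```shell
#!/bin/sh
# strip_changes (hypothetical name): read LaTeX text on stdin and
# resolve \added, \replaced and \deleted markup, line by line.
strip_changes() {
    sed -e 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g' \
        -e 's|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g' \
        -e 's|\\deleted\[[^]]*\]{[^}]*}||g'
}

# printf with %s keeps the backslashes in the sample literal.
printf '%s\n' '\replaced[GC]{the increase of}{the decrease of}' | strip_changes
# prints: the increase of
```

To run it on a file, something like "strip_changes < file.tex > file-clean.tex" should do.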
> perl -0777 -pe 's/\\added\[.*?\]{(.*?)}/\1/g;
> s/\\replaced\[.*?\]{(.*?)}{.*?}/\1/g;
> s/\\deleted\[.*?\]{.*?}//g'
\1 better written as $1 at -e line 1.
(Line 2 as well.)
Oops, too much SED. (I.e., force of habit.)
Thanks. I will try this out and see if it works. I was wise to avoid
figuring this syntax out by myself :)
>> Ok, then:
>>
>> sed 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g
>> s|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g
>> s|\\deleted[[^]]*]{[^}]*}||g' file.tex
>>
>> This assumes that no } appears inside {..}, and no ] inside [..] (not
>> even escaped or in any other form). If you need that, then it becomes
>> much more complicated.
>
> Thanks. I will try this out and see if it works. I was wise to avoid
> figuring this syntax out by myself :)
btw, there is an error in the last one, should be
s|\\deleted\[[^]]*\]{[^}]*}||g' file.tex
To be contrarian (:-)), I would use a little program I wrote a few years
ago for this rather than sed. Its main advantage is that it's not line-bound,
so the above text fragments could contain newlines. It uses Rob Pike's old
"Structural Regular Expression" scanning, reading text from the source file
as it needs to, rather than a line at a time. (Regular expression scanning
in perl, ruby, and so on can easily handle newlines too, but a single
command line -- even if massive! -- can be more convenient.)
The program is called 'matt' (< http://www.goodeveCA.net/matt/ > if you're
interested... (:-)). The command to do the above is a single -- long! --
line rather than the split sed regexes but the idea is similar:
matt -vas '\\added\[[^\]]*\]{([^}]*)}|\\replaced\[[^\]]*\]{([^}]*)}{[^}]*}|\\deleted\[[^\]]*\]{[^}]*}' -o '$1$2' file.tex
The main regular expression is obvious, containing all three alternatives,
with sub-expressions to be extracted marked by parens in the usual way.
The '-o...' parameter says to output the first or second subexpression
if they have been matched (but not any other part of the overall match).
In '-vas', the 'v' means to output unmatched parts unchanged, 'a' means
to match newlines -- in this case with the '[^}]' patterns, and 's' is
"shortest".
The regular expressions it handles are not quite as extensive as perl
or ruby (I didn't extend the code much, although it is now C++), but
they are perfectly adequate for most tasks. No doubt it's a matter of
familiarity, but I've made a lot of use of it -- for rewriting HTML,
or whatever -- over the years. Be nice if someone else found it
useful too!
-- Pete --
Got that. I am now trying to extend this script to do a few additional
things -
- replace all instances of \marginpar{<text1>} by blank.
I added the following code fragment to the script :
s|\\marginpar{.*}||g
This is not working out (some unmatched sections of the text are
disappearing). I have some familiarity with the :1,$ s/text1/text2/gc
functionality of vim, so was just trying to see if that would work.
What am I doing wrong above ?
>> s|\\deleted\[[^]]*\]{[^}]*}||g' file.tex
>
> Got that. I am now trying to extend this script to do a few additional
> things -
>
> - replace all instances of \marginpar{<text1>} by blank.
>
> I added the following code fragment to the script :
>
> s|\\marginpar{.*}||g
>
> This is not working out (some unmatched sections of the text are
> disappearing). I have some familiarity with the :1,$ s/text1/text2/gc
> functionality of vim, so was just trying to see if that would work.
>
> What am I doing wrong above ?
You have hit the so-called "greedy" behavior of many regular expression
engines. Greedy, in this context, means that when an expression can match
in many ways, the longest match is always taken.
What that means in practice is that if you do
echo 'abc{blah} def{foobar} ghi' | sed 's/{.*}//'
thinking that it will delete just "{blah}", it will instead delete the
whole "{blah} def{foobar}" (which is longer than "{blah}").
In this (and your) case, if we assume that } cannot appear inside { .. },
the fix is easy:
echo 'abc{blah} def{foobar} ghi' | sed 's/{[^}]*}//'
that is, select {, all non-} characters that follow, and } (if you look at
the solutions that were proposed to you, they all use that trick, subject
to the same condition).
There may be other cases where the fix is not as easy (for example, when the
pattern ends with more than a single character).
> - replace all instances of \marginpar{<text1>} by blank.
>
> I added the following code fragment to the script :
>
> s|\\marginpar{.*}||g
> This is not working out (some unmatched sections of the text are
> disappearing). I have some familiarity with the :1,$ s/text1/text2/gc
> functionality of vim, so was just trying to see if that would work.
>
> What am I doing wrong above ?
The '*' operator is greedy. In the example below, .* will match the
underlined text.
\marginpar{<text1>} some {\em more} text
^^^^^^^^^^^^^^^^^^^^^^^^
You will want to limit the expression to non-} characters:
s|\\marginpar{[^}]*}||g
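A quick way to see the difference is to run both expressions on that very line (a sketch; the variable names are only for illustration):

```shell
#!/bin/sh
line='\marginpar{<text1>} some {\em more} text'

# Original attempt: .* swallows everything up to the last } on the line.
greedy=$(printf '%s\n' "$line" | sed 's|\\marginpar{.*}||g')

# Limited to non-} characters: stops at the first }.
fixed=$(printf '%s\n' "$line" | sed 's|\\marginpar{[^}]*}||g')

printf '[%s]\n' "$greedy"   # prints: [ text]
printf '[%s]\n' "$fixed"    # prints: [ some {\em more} text]
```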
--
Huibert
Maybe you are getting bitten by "greedy" regular expression matching.
Try this:
s|\\marginpar{[^}]*}||g
Ok. I now have a case where there are {} pairs inside the {...}. What would
be a good place to start ?
:)
(Thanks for your continued help).
>> sed 's|\\added\[[^]]*\]{\([^}]*\)}|\1|g
>> s|\\replaced\[[^]]*\]{\([^}]*\)}{[^}]*}|\1|g
>> s|\\deleted[[^]]*]{[^}]*}||g' file.tex
>>
>> This assumes that no } appears inside {..}, and no ] inside [..] (not
>> even escaped or in any other form). If you need that, then it becomes
>> much more complicated.
>
> Ok. I now have a case where there are {} pairs inside the {...}. What
> would be a good place to start ?
Please paste an actual example of that case. There are some possible
solutions to that one (including using another tool!), but they all depend
on the actual input data. For example, in your case it would also be good
to know if the inner {} are "bare" or somehow escaped, or whether an
arbitrary number of {} can be nested (like eg {...{...{..}...}...} etc.).
For the very simple case in which you can have only a single {} couple
inside an outer {} pair, then this should work:
{[^{]*{[^}]*}[^}]*}
(in short, it's the same old trick, used two times).
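As a sketch, here is that pattern applied to the \marginpar case. The function name and the sample text are invented for illustration. Note that written this way it only matches when an inner {} pair is actually present.

```shell
#!/bin/sh
# strip_marginpar_nested (hypothetical name): delete \marginpar{...}
# whose argument contains exactly one inner {...} pair.
strip_marginpar_nested() {
    sed 's|\\marginpar{[^{]*{[^}]*}[^}]*}||g'
}

printf '%s\n' 'keep \marginpar{see {\em this} note} rest' | strip_marginpar_nested
# prints: keep  rest   (two spaces, the text around the match is untouched)
```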
Here is an example :
\replaced[<text1>]{<text2a>\cite{<text2b>}<text2c>}
{<text3a>\cite{<text3b>}<text3c>}
More concrete example :
\replaced[GC]{Recent results\cite{paper2} illustrate difficulty of
interpreting old data by Doe et. al.\cite{paper1} using Method X.}{This has
been demonstrated\cite{paper1} in the past.}
>> For the very simple case in which you can have only a single {} couple
>> inside an outer {} pair, then this should work:
>>
>> {[^{]*{[^}]*}[^}]*}
>>
>> (in short, it's the same old trick, used two times).
>
> Here is an example :
>
> \replaced[<text1>]{<text2a>\cite{<text2b>}<text2c>
> {<text3a>\cite{<text3b>}<text3c>}
>
> More concrete example :
>
> \replaced[GC]{Recent results\cite{paper2} illustrate difficulty of
> interpreting old data by Doe et. al.\cite{paper1} using Method X.}{This
> has been demonstrated\cite{paper1} in the past.}
I seem to infer from this that you have patterns like this:
{...{..}...{..}...........}
ie, only one level of nesting, but there can be a variable number of inner
{} pairs. This *should* match such strings, but I haven't tested it
thoroughly:
{\([^{}]*{[^{}]*}\)*[^{}]*}
I've included both { and } in the negated character classes, to avoid
matching too much when there are multiple matches per line.
Parsing things like these gets close to the limits of what the regex
engines in sed (and other tools) can do.
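A minimal sketch of how that pattern might be plugged into the \replaced substitution. The function name is made up. The contents of the first {...} are wrapped in an extra \( \) so that \1 refers to the whole argument (the repeated inner group then becomes \2). It assumes one level of nesting at most, and that the whole \replaced construct sits on a single line, since sed works line by line; input wrapped over several lines would have to be joined first.

```shell
#!/bin/sh
# strip_replaced_nested (hypothetical name): replace
# \replaced[..]{A}{B} by A, where A and B may contain any number of
# non-nested inner {..} pairs such as \cite{key}.
strip_replaced_nested() {
    sed 's|\\replaced\[[^]]*\]{\(\([^{}]*{[^{}]*}\)*[^{}]*\)}{\([^{}]*{[^{}]*}\)*[^{}]*}|\1|g'
}

printf '%s\n' '\replaced[GC]{Recent results\cite{paper2} and Doe\cite{paper1} using Method X.}{This has been demonstrated\cite{paper1} in the past.}' |
    strip_replaced_nested
# prints: Recent results\cite{paper2} and Doe\cite{paper1} using Method X.
```

The un-nested case still works with this version, since the inner group may repeat zero times.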