GREP: Replace variable string with title case or leave unchanged?

Be

May 21, 2012, 4:17:37 PM

to bbe...@googlegroups.com

Hi!

I'm a newbie to GREP. I have a GREP search that looks like this:

<p class=\"topic\"><a name=\"[\s\S]*?\"></a>[\s\S]*?</p>

I want to replace the "p" tags with "div", leave the first variable string unchanged, and replace the second variable string with a Title Case version of itself. The second string has a variable number of words and spaces, and may or may not contain line breaks and digits.

Changing the tags is no problem, of course. It's the variable strings that have me stumped.

Any help is appreciated. Thanks in advance!

Be

May 22, 2012, 12:46:56 PM

to bbe...@googlegroups.com

More details:

I'm starting out with this:

<p class="topic"><a name="topic-anchor-name"></a>TOPIC TITLE &amp; IN ALL CAPS EXCEPT FOR THE AND ENTITY</p>

As mentioned previously, there may also be line breaks or extra spaces at any point in this, and the title may also include digits.

I want to change the line to this:

<div class="topic"><a name="topic-anchor-name"></a>Topic Title &amp; In All Caps Except For The And Entity</div>

I have thousands of these across hundreds of pages. I'm trying to do it in one step rather than two or more because I have other <p></p> sets that should remain untouched. However, I'm sure there's a way to work around that problem. I just haven't figured out what it is yet.

Be

May 22, 2012, 5:33:34 PM

to bbe...@googlegroups.com

Grep is a beautiful thing. I realize now how awful my original attempt was.

I've now replaced the "p" tags with "div" tags using this:

FIND:

<p class=\"topic\"><a name=\"(\w+?)\"></a>([\s\S]*?)</p>

REPLACE:

<div class=\"topic\"><a name=\"\1\"></a>\2</div>

Voila!
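
For readers who want to try the same capture-group replacement outside BBEdit, here is a minimal sketch in Python rather than BBEdit grep. It assumes the class name "topic" from the desired output shown earlier in the thread; the sample text and the variable names are made up for illustration.

```python
import re

# Same pattern as the FIND field. In a Python raw string the quotes
# need no backslash escaping.
FIND = re.compile(r'<p class="topic"><a name="(\w+?)"></a>([\s\S]*?)</p>')

# \1 is the anchor name, \2 is the title text; both are copied unchanged.
REPLACE = r'<div class="topic"><a name="\1"></a>\2</div>'

# Hypothetical sample: one topic paragraph and one ordinary paragraph.
sample = (
    '<p class="topic"><a name="intro"></a>TOPIC\nTITLE 2</p>\n'
    '<p>Ordinary paragraph, left alone.</p>'
)

converted = FIND.sub(REPLACE, sample)
print(converted)
```

One thing to watch: \w matches letters, digits, and underscores only, so an anchor name containing hyphens (such as topic-anchor-name) would not match (\w+?). Something like ([^"]+) would be needed in that case.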

What I need now is something that says, "Look between <div class="topic"> and the first </div> that follows it. Find instances where an uppercase character is followed by another uppercase character. Turn these instances into title case."

I'm not going to worry about turning "&amp;" into "&Amp;". I can fix that easily enough once the title case script finishes running.

Any ideas?

Ronald J Kimball

May 22, 2012, 6:25:35 PM

to bbe...@googlegroups.com

On Tue, May 22, 2012 at 02:33:34PM -0700, Be wrote:
> What I need now is something that says, "Look between <div class="topic">
> and the first </div> that follows it. Find instances where an uppercase
> character is followed by another uppercase character. Turn these instances
> into title case."

This is not possible with a simple grep, because you need to do a loop
inside a loop. You'll need a script of some kind.

Here's how you might do it in Perl:

#!perl -p0

s{(<div class="topic">)(.*?)(</div>)}
< { ($x = $2) =~ s!([A-Z]+)!\L\u$1!g } "$1$x$3" >sge;

__END__

I have to run, but I can send a followup later explaining what this is
doing. Sorry!

Ronald

Be

May 22, 2012, 6:30:51 PM

to bbe...@googlegroups.com

Thanks so much, Ron!

I know nothing about Perl, so this will take some time for me to figure out. If you get time to elucidate, that would be great, but don't sweat it. You've already helped a great deal. :)

Ronald J Kimball

May 23, 2012, 4:34:02 PM

to bbe...@googlegroups.com

On Tue, May 22, 2012 at 06:25:35PM -0400, Ronald J Kimball wrote:
> #!perl -p0
>
> s{(<div class="topic">)(.*?)(</div>)}
> < { ($x = $2) =~ s!([A-Z]+)!\L\u$1!g } "$1$x$3" >sge;
>
> __END__

I'm back! Here's what's going on with this script:

#!perl -p0

-p wraps the script in an implicit loop that reads line-by-line into $_,
executes your code, and then prints $_.

-0 (zero) sets the input record separator to the null character, instead
of "\n", so that the whole file is read in one chunk (assuming it doesn't
contain any null characters), in case the HTML we want to match is split
across multiple lines.

s{(<div class="topic">)(.*?)(</div>)}
< { ($x = $2) =~ s!([A-Z]+)!\L\u$1!g } "$1$x$3" >sge;

This performs a substitution, of course.

The pattern is enclosed in {}, but the replacement is enclosed in <>
because it uses all the other paired delimiters inside.

(<div class="topic">)(.*?)(</div>)

The pattern matches the opening div tag, some text, and the first closing
div tag, and captures those three pieces in $1, $2, and $3.

{ ($x = $2) =~ s!([A-Z]+)!\L\u$1!g } "$1$x$3"

The replacement copies the matched text to $x (because $2 is read-only),
and then performs a substitution on it - this is the loop inside the
loop. It then returns a string containing the values of $1, $x, and $3.

The inner substitution is inside a block so that it doesn't clobber $1
and $2.
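
The same "loop inside a loop" idea can be sketched in Python, where the outer replacement is a function and the inner substitution runs on the captured text. This is only an illustrative analogue of the Perl above, not a drop-in replacement; the function names and the sample string are hypothetical.

```python
import re

# Outer pattern: opening div tag, body text, first closing div tag.
# re.DOTALL lets "." match newlines, like Perl's /s flag.
DIV = re.compile(r'(<div class="topic">)(.*?)(</div>)', re.DOTALL)

# Inner pattern: any run of uppercase ASCII letters.
UPPER_RUN = re.compile(r'[A-Z]+')


def title_case_topic(match):
    """Rebuild one matched div with its uppercase runs title-cased."""
    opening, body, closing = match.groups()
    # Inner substitution: "TOPIC" becomes "Topic".
    body = UPPER_RUN.sub(lambda m: m.group(0).capitalize(), body)
    return opening + body + closing


def title_case_topics(html):
    """Apply the inner substitution inside every topic div."""
    return DIV.sub(title_case_topic, html)


sample = (
    '<div class="topic"><a name="intro"></a>TOPIC TITLE\n&amp; 2 MORE</div>'
    '<p>LEAVE ME</p>'
)
result = title_case_topics(sample)
print(result)
```

Because the match object carries its own groups, there is no need to copy anything to a scratch variable or to shield the outer captures from the inner substitution, which is what the extra block does in the Perl version.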

s!([A-Z]+)!\L\u$1!g

The inner substitution finds all sequences of uppercase letters, and
converts them to title-case. \L lowercases subsequent characters, and \u
uppercases the first character. (I guess \u\L would be more correct, but
Perl "does what I mean" with \L\u, treating it as \u\L anyway.)

If you wanted to convert something like AbC to Abc, you could change
[A-Z]+ to [A-Z][a-zA-Z]*
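
To see the difference between the two patterns, here is a small Python sketch (an analogue, not the Perl itself). The helper name title_case and the test string are invented for the example.

```python
import re


def title_case(text, pattern):
    """Capitalize the first letter of each match and lowercase the rest."""
    return re.sub(pattern, lambda m: m.group(0).capitalize(), text)


# [A-Z]+ only sees runs of uppercase letters, so the lowercase "b"
# splits "AbC" into two one-letter matches that stay as they are.
runs_only = title_case("AbC DEF", r"[A-Z]+")

# [A-Z][a-zA-Z]* takes the whole word once it starts with an uppercase letter.
whole_words = title_case("AbC DEF", r"[A-Z][a-zA-Z]*")

print(runs_only)    # AbC Def
print(whole_words)  # Abc Def
```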

sge;

These substitution flags specify single-line mode (. matches newline),
global (find and replace all matches), and eval (execute the replacement
string as Perl code).
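
As a rough illustration of why single-line mode matters here, the following Python sketch uses re.DOTALL, which plays the role of Perl's s flag. The sample text is made up. In Python, re.sub already replaces every match by default, and passing a function as the replacement is the closest counterpart to the e flag.

```python
import re

# A title that is split across two lines, as the original poster described.
text = '<div class="topic">TOPIC\nTITLE</div>'
pattern = r'<div class="topic">(.*?)</div>'

# Without DOTALL, "." stops at the newline, so the pattern cannot span it.
without_s = re.search(pattern, text)

# With DOTALL, "." matches the newline too and the whole title is captured.
with_s = re.search(pattern, text, re.DOTALL)

print(without_s)                 # None
print(repr(with_s.group(1)))     # 'TOPIC\nTITLE'
```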

Hope that helps!

Ronald

Be

May 24, 2012, 10:24:00 PM

to bbe...@googlegroups.com

Sorry about the delay in responding.

Your explanation is really helpful. I'm under the gun right now to complete a project, but as soon as I'm done, I'll learn enough Perl to implement this. It's something I need to do anyway.

Thanks so much for taking the time to go through it step-by-step. These real-world examples are the best way to learn a new language, IMHO.

Maybe I'll learn enough to pay it forward! :)
