Find <h2> followed by many lines of arbitrary HTML through next <h2> but exclude the second <h2>

Sonic Purity

Sep 14, 2021, 6:17:42 PM
to BBEdit Talk
My fiction writing workflow initially produces one HTML document with the entire novel’s content. Each chapter starts with <h2>Exciting Chapter Title Here</h2> then many paragraphs of story text with arbitrary HTML markup. I split each chapter into its own HTML page, containing everything from that first <h2> with the chapter title through the end of the chapter, which is always immediately before the subsequent opening <h2> for the following chapter (in the original un-split document).

Working manually, i’ve been using the Grep Find:

<h2>([\s\S]+?)<h2>

This works perfectly, other than it includes the <h2> at the start of the next chapter in the selection i’m about to cut or copy into a new HTML document. I manually back off the selection to include everything found minus that ending <h2>. I would like to better automate my workflow, but can’t with the need for this manual adjustment.

Re-reading the Grep help file with BBEdit, i thought lookahead might help. I tried:

<h2>([\s\S]+?)(?<h2>)

but that just finds the first <h2> and one character immediately following it. Noticing that BBEdit is highlighting the < for that second <h2>, i tried escaping it:

<h2>([\s\S]+?)(?\<h2>)

This throws a PCRE error: unrecognized character after (? or (?- (12)

Can anyone suggest a search string that will accomplish my goal?

(BBEdit 11.6.8 running under macOS 10.12.6 Sierra.)

Thanks!

Tom Robinson

Sep 14, 2021, 7:03:49 PM
to BBEdit Talk
Try this, using positive lookahead and lookbehind assertions:

(?<=<h2>).+(?=</h2>)

Cheers

Sonic Purity

Sep 14, 2021, 9:15:41 PM
to BBEdit Talk
Thank you, but all that does for me is select that one entire chapter heading, not the entire chapter heading plus all the paragraphs of text below. In other words in my example that string selects Exciting Chapter Title Here, but nothing else.

Tom Robinson

Sep 15, 2021, 12:46:43 AM
to BBEdit Talk
Misread that part.  Try this:

(?<=<h2>)([\s\S]+?)(?=<h2>)

Cheers
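
A quick check of how the lookbehind variant differs from the original pattern, sketched with Python's `re` module as a stand-in for BBEdit's PCRE engine (an assumption; the sample text and variable names are made up for illustration). Because `(?<=<h2>)` only asserts that `<h2>` precedes the match, the opening tag itself is left out of the selection, whereas a literal `<h2>` at the start keeps it:

```python
import re

# Hypothetical two-chapter sample; real chapters would be much longer.
sample = "<h2>One</h2>\n<p>a</p>\n<h2>Two</h2>\n<p>b</p>\n"

# Lookbehind + lookahead: neither <h2> is part of the match.
lookaround = re.search(r"(?<=<h2>)([\s\S]+?)(?=<h2>)", sample)

# Literal opening tag + lookahead: the chapter's own <h2> is kept,
# and the next chapter's <h2> is still excluded.
with_heading = re.search(r"<h2>([\s\S]+?)(?=<h2>)", sample)

print(repr(lookaround.group(0)))
print(repr(with_heading.group(0)))
```

For cutting a whole chapter, heading included, the second form is probably the one that fits the workflow described in the first message.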

ctfishman

Sep 15, 2021, 1:51:18 AM
to BBEdit Talk
I tried doing this with just a regular expression but couldn't figure out how. I was, however, able to do it quite easily with a text filter. The following Perl example works for me: it splits the text, then creates and saves an individual file for each chapter.

Save the following in your text filters folder and run it against your document.

--------------------------

#!/usr/bin/perl

# Read each line into a scalar, then print it back

my $fullstring;

while (<>) {
    $fullstring .= $_;
    print;
}

# split the scalar into an array

my @h2s = split( /<h2>/, $fullstring );

# Delete the first item of the array, which will be empty because our text starts with "<h2>"

shift @h2s;

# add back the "<h2>" at the start of each array element
# which was removed when we did the split

foreach $string (@h2s) {
    $string = "<h2>" . $string;
}

# Now the array contains each of your chapters, one per element.
# The following will create a new directory on your desktop called "Chapters"
# (if it doesn't exist already) and save a new document with the text from each
# chapter/array element. The original document will be the same as when it started,
#  because we printed each line back out after we read it.

my $counter = 1;

print `mkdir -p ~/Desktop/Chapters/`;

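for (@h2s) {
    # Perl's open() does not expand "~" (only the shell does),
    # so build the path from $ENV{HOME} and report any failure.
    open( CHAPTER, ">", "$ENV{HOME}/Desktop/Chapters/chapter$counter.html" )
        or die "Cannot write chapter $counter: $!";
    print CHAPTER $_;
    close(CHAPTER);
    $counter++;
}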
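
For comparison, here is a minimal sketch of the same split-and-save idea in Python. It is not the filter above, and the function names `split_chapters` and `write_chapters` are hypothetical. One thing to watch in any language: `~` is expanded by the shell, not by file-opening calls, so the usage note below expands it explicitly.

```python
import re
from pathlib import Path


def split_chapters(html):
    """Return each chunk that starts at an <h2> and runs to the next one."""
    # Split at the zero-width position just before every <h2>
    # (splitting on empty matches needs Python 3.7 or later).
    parts = re.split(r"(?=<h2>)", html)
    # Anything before the first <h2> is not a chapter, so drop it.
    return [part for part in parts if part.startswith("<h2>")]


def write_chapters(html, out_dir):
    """Write chapter1.html, chapter2.html, ... into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chapter in enumerate(split_chapters(html), start=1):
        path = out_dir / f"chapter{number}.html"
        path.write_text(chapter, encoding="utf-8")
        paths.append(path)
    return paths
```

Assuming the whole novel is in a file called `novel.html` (a made-up name), it could be run as `write_chapters(Path("novel.html").read_text(encoding="utf-8"), Path("~/Desktop/Chapters").expanduser())`.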

Christopher Stone

Sep 15, 2021, 6:26:11 AM
to BBEdit-Talk
On Sep 14, 2021, at 16:57, Sonic Purity <sonic...@gmail.com> wrote:

Re-reading the Grep help file with BBEdit, i thought lookahead might help. I tried:

<h2>([\s\S]+?)(?<h2>)



Hey There,

You miswrote your lookahead-assertion.

This:

<h2>([\s\S]+?)(?<h2>)


Should look like this:


<h2>([\s\S]+?)(?=<h2>)


This is fine, except it will exclude your last chapter.


Try this instead:

(?s)<h2>.+?(?=<h2>|\Z)


--
Best Regards,
Chris
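
To see why the three forms behave so differently, here is a small sketch using Python's `re` module as a stand-in for BBEdit's PCRE engine (an assumption; the sample text is invented). In PCRE, `(?<h2>)` is parsed as an empty named capture group called `h2`, not as a lookahead, which would explain why the original attempt matched `<h2>` plus just one character. Python spells named groups `(?P<h2>)`, so the sketch uses that spelling to reproduce the effect.

```python
import re

# Hypothetical sample text standing in for the un-split novel.
sample = "<h2>One</h2>\n<p>a</p>\n<h2>Two</h2>\n<p>b</p>\n"

# Mis-written form: the trailing group is an empty *named* group, not a
# lookahead, so the lazy quantifier stops after a single character.
# (PCRE spells this (?<h2>); Python spells it (?P<h2>).)
miswritten = re.search(r"<h2>([\s\S]+?)(?P<h2>)", sample).group(0)

# Lookahead only: the last chapter has no following <h2>, so it is skipped.
lookahead_only = re.findall(r"<h2>[\s\S]+?(?=<h2>)", sample)

# Lookahead with an end-of-text alternative: every chapter is found.
chapters = re.findall(r"(?s)<h2>.+?(?=<h2>|\Z)", sample)

print(repr(miswritten))
print(len(lookahead_only), len(chapters))
```

One caveat: PCRE's `\Z` can also match just before a final newline, whereas Python's matches only at the very end, so the last match may differ by a trailing newline between the two engines.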

Christopher Stone

Sep 15, 2021, 6:35:18 AM
to BBEdit-Talk
On Sep 15, 2021, at 00:51, ctfishman <mfis...@casciac.org> wrote:

I tried doing this with just a regular expression but couldn't figure out how.


Hey There,

Yeah, you couldn't automate the whole process with regex alone.

I was however able to do it quite easily with a text filter...

--------------------------

#!/usr/bin/perl

# Read each line into a scalar, then print it back

my $fullstring;

while (<>) {
    $fullstring .= $_;
    print;
}

Looks good, although I'd shortcut the above with:



#!/usr/bin/env perl -0777 -nsw

print;



Now the entire string is in $_ and ready to process.


--
Best Regards,
Chris

Sonic Purity

Sep 15, 2021, 2:15:04 PM
to BBEdit Talk
Thank you to everyone who’s replied. My issue has been solved. For anyone else who may be interested, i report my findings below.

On Wednesday, September 15, 2021 at 3:26:11 AM UTC-7 listmei...@gmail.com wrote:
You miswrote your lookahead-assertion.

Oh boy i sure did. Thank you for pointing that out.
 
Try this instead:

(?s)<h2>.+?(?=<h2>|\Z)

^ This is spectacular, and what i’ll be using. After testing it, i made myself sit down and re-read the Grep Help to understand what each part of the expression is doing. I’d entirely missed the section on the (?s) ability to allow . to include \r as well. This knowledge alone will help improve a number of my other regular expressions. I’ve not used positional assertions like \Z in the past, hence they don’t come to mind—something else learned—thanks!

The Perl filters (original and Chris’ modification both tested) failed for me: each created the folder on the Desktop, but it was empty. The original document did remain intact. But that’s OK because i’ve not evolved to the point of doing exactly that yet. No one should spend any more time on this Perl filter on my behalf. This is both a learning experience and a practical getting-things-done activity for me, so it’s best for me to clunk along on training wheels with Text Factories full of Replace All clauses and likely AppleScript until i’m ready to learn more and get into something like Perl.

Appreciatively,
))Sonic((