Find & Replace: finding and replacing a range of characters or lines surrounding a word

Pierre-Olivier Bonin

Oct 11, 2016, 11:33:06 AM
to TextWrangler Talk
Hi everyone,

I have a document in which there are lots of rows comprising unnecessary information that I would like to delete. It looks approximately like this:

[Title: irrelevant information]

[irrelevant information]

[irrelevant information]

[irrelevant information]

[irrelevant information]

[irrelevant information]

[irrelevant information]

[irrelevant information]

[irrelevant information]

[Database: irrelevant information]

[relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text relevant text  --> for multiple paragraphs]



Now, what I would like to do is select everything from the first word ("Title") to the last word ("Database") of the irrelevant section, and delete it.

I have managed a somewhat decent job in my word processor by using the "Character in Range" wildcard, represented by the asterisk (*), together with a given range, so that I could find and replace a range of characters. I would like to know whether there is a more systematic way to do it in TextWrangler. Also, is there a way to select a range not only after a word, but also before a word?

Thank you very much for your feedback!

Regards,
Pierre-Olivier

Steve

Oct 11, 2016, 1:56:42 PM
to TextWrangler Talk
One way that I do this is to use a look-ahead on each line, checking for the presence of "Database". If it's not found, you match the line and loop again.

https://regex101.com/r/Jm0cMs/1 shows an example of this:

    Title:.*\r?\n(?:(?!.*Database).*\r?\n)+.*\r?\n
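For anyone who wants to try the pattern outside TextWrangler, here is a minimal sketch in Python. It assumes Python's `re` module behaves like TextWrangler's grep engine for this pattern, and the sample text is hypothetical, without the square brackets used as placeholders in the question.

```python
import re

# Steve's pattern, unchanged. Python's `re` is only a stand-in here for
# TextWrangler's grep engine.
PATTERN = re.compile(r"Title:.*\r?\n(?:(?!.*Database).*\r?\n)+.*\r?\n")

# Hypothetical sample shaped like the document in the question.
sample = (
    "Title: irrelevant\n"
    "irrelevant\n"
    "irrelevant\n"
    "Database: irrelevant\n"
    "relevant text\n"
)

# The look-ahead lets the group consume every line that does not contain
# "Database"; the trailing .*\r?\n then consumes the "Database" line itself.
cleaned = PATTERN.sub("", sample)
print(cleaned)
```

On this small sample only the relevant line is left.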

The problem I've encountered, however, is that large amounts of text on each line will overwhelm the regex stack, and it will say that it's out of memory and can't do the replacement.

When that happens, I capture the "Title" line (and put it back into the text with \1 in the Replace field), and then match only a handful of lines at a time. https://regex101.com/r/Jm0cMs/2

    (Title:.*\r?\n)(?:(?!.*Database).*\r?\n){1,10}

The trick here is that you use a low enough number of lines as to NOT overwhelm the regex stack, and you just run that replacement a bunch of times.

In essence:

  Title
  Remove
  Remove
  Not removed
  Not removed
  Database

This will start at "Title" and then remove only some of the lines below it, re-inserting the "Title" text into the replacement:

  Title
  Not removed
  Not removed
  Database

That way, when you re-run the regex, it will again find "Title" as its starting point and match "a few more lines" until it reaches "Database". You do it again and again until you're left only with "Title" and "Database" lines, and you can easily remove those lines.

If the first regex works because your text is short enough, you don't need to do anything else. But if your text happens to overwhelm it, this is a decent fall-back method.
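The "run it a bunch of times" fall-back can be sketched as a loop. This is a rough illustration in Python rather than something TextWrangler does for you: `strip_in_chunks` is a hypothetical helper name, the sample text is made up, and the final clean-up pattern is an assumption, since the post only says the leftover lines are easy to remove.

```python
import re

# Steve's chunked pattern: keep the Title line (group 1) and drop up to
# ten following lines that do not contain "Database".
CHUNK = re.compile(r"(Title:.*\r?\n)(?:(?!.*Database).*\r?\n){1,10}")

# Assumed clean-up pattern for the leftover Title/Database pair.
LEFTOVER = re.compile(r"Title:.*\r?\nDatabase.*\r?\n")


def strip_in_chunks(text: str) -> str:
    """Re-run the chunked replacement until a pass changes nothing."""
    while True:
        text, count = CHUNK.subn(r"\1", text)
        if count == 0:
            return text


# Hypothetical sample: 25 junk lines between the two marker lines.
sample = (
    "Title: x\n"
    + "".join(f"junk {i}\n" for i in range(25))
    + "Database: y\nrelevant\n"
)

after_chunks = strip_in_chunks(sample)
final = LEFTOVER.sub("", after_chunks)
print(final)
```

Each pass removes at most ten lines per "Title" block, so 25 junk lines take three passes before only the "Title" and "Database" lines remain.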

Patrick Woolsey

Oct 11, 2016, 3:15:49 PM
to textwr...@googlegroups.com
On 10/11/16 at 9:47 AM, pierre-oli...@hotmail.com
(Pierre-Olivier Bonin) wrote:

[...example elided...]
>
>Now, what I would like to do is to select everything ranging
>from the first word ("Title") and the last word ("Database") of
>the irrelevant section, and delete it.


Though Steve's suggested patterns will certainly work, I suggest
you instead just use brute force :-) as such a pattern is
simpler to write and you needn't worry about recursion.

If you want to match and delete all the text starting with
"Title" through the end of the line that starts with "Database",
then try:

Find: ^Title(?s).+?\nDatabase.+?$
Replace: # leave this field empty

For reference:
This pattern starts by matching the string "Title" at the start
of a line ^ and continues with the modifier (?s) which allows
the following ".+", meaning 'one or more instances "+" of any
character "."', to match across linebreaks, followed by "?" to
make that match _non-greedy_ and then wrapping up by matching a
literal line break \n followed by the string "Database" and
another non-greedy run of characters through the end of that
line $.
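To see what this pattern matches, here is a minimal sketch in Python with a hypothetical sample. Two things are assumptions of the sketch: recent Python versions reject (?s) in the middle of a pattern, so the modifier is passed as re.DOTALL instead, and re.MULTILINE is used so that ^ and $ act as the line anchors described above.

```python
import re

# Same pattern as in the Find field, with (?s) expressed as re.DOTALL and
# line-anchored ^ / $ expressed as re.MULTILINE.
FIND = re.compile(r"^Title.+?\nDatabase.+?$", re.MULTILINE | re.DOTALL)

# Hypothetical sample text.
sample = (
    "Title: irrelevant\n"
    "irrelevant\n"
    "Database: irrelevant\n"
    "relevant text\n"
)

# An empty replacement, as with the empty Replace field.
cleaned = FIND.sub("", sample)
print(repr(cleaned))
```

In this sketch the match stops just before the line break that ends the "Database" line, so that line break stays behind as an empty line above the relevant text.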


If instead you want to delete all the text from "Title" down to
but _not_ including the line that starts with "Database", then:

Find: ^Title(?s).+?\nDatabase
Replace: \nDatabase

This pattern works very much like the preceding, except it ends
at a plain string match and then instead of just deleting all
the matched text, does that and then inserts a hard line break
followed by the string "Database".
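The same kind of sketch for the second pattern, again in Python with a hypothetical sample and with (?s) passed as re.DOTALL. The look-ahead variant at the end is an assumed variation, not something from the post.

```python
import re

# Second pattern from the post: stop at the string "Database".
FIND = re.compile(r"^Title.+?\nDatabase", re.MULTILINE | re.DOTALL)

# Hypothetical sample text.
sample = (
    "intro\n"
    "Title: irrelevant\n"
    "irrelevant\n"
    "Database: kept\n"
    "relevant text\n"
)

# Same replacement as in the post: a line break plus the string "Database".
kept_database = FIND.sub("\nDatabase", sample)

# Assumed variation (not from the post): a look-ahead leaves "Database"
# unmatched, so the replacement can stay empty.
LOOKAHEAD = re.compile(r"^Title.+?\n(?=Database)", re.MULTILINE | re.DOTALL)
no_blank_line = LOOKAHEAD.sub("", sample)

print(repr(kept_database))
print(repr(no_blank_line))
```

In this sketch the first result has an empty line ahead of "Database", because the line break before "Title" is not part of the match and the replacement adds another one. The look-ahead variant avoids that.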


On a more general note :-), please see Chapter 8 of the included
PDF manual (which you can open at any time via Help -> User
Manual) for complete details on using grep patterns with
TextWrangler's search & replace commands.


Regards,

Patrick Woolsey
==
Bare Bones Software, Inc. <http://www.barebones.com/>

Christopher Stone

Oct 13, 2016, 2:39:21 AM
to TextWrangler-Talk
On Oct 11, 2016, at 08:47, Pierre-Olivier Bonin <pierre-oli...@hotmail.com> wrote:
>I have a document in which there are lots of rows comprising unnecessary information that I would like to delete. It looks approximately like this:


Hey Pierre-Olivier,

A simple text-filter might do the trick.

#!/usr/bin/env bash

sed -E '/\[Title:.+\]/,/\[Database:.+\]/d; /^$/d'

What I'm doing is deleting from the title line through the database line and then deleting any remaining empty lines.
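For readers who do not use sed, the same range-delete logic can be sketched line by line in Python. This is only an illustration of what the filter does, under the assumption that the square brackets are literally present in the file. `filter_lines` is a hypothetical name and the sample text is made up.

```python
import re

# Same two addresses as in the sed command.
START = re.compile(r"\[Title:.+\]")
END = re.compile(r"\[Database:.+\]")


def filter_lines(text: str) -> str:
    """Drop every Title..Database range, then drop empty lines."""
    kept = []
    in_block = False
    for line in text.splitlines():
        if in_block:
            # Inside the range: drop the line, and leave the range once
            # the closing "[Database: ...]" line has been seen.
            if END.search(line):
                in_block = False
            continue
        if START.search(line):
            in_block = True
            continue
        if line == "":
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical sample with the blank lines shown in the question.
sample = (
    "[Title: irrelevant]\n\n"
    "[irrelevant]\n\n"
    "[Database: irrelevant]\n\n"
    "relevant one\n\n"
    "relevant two\n"
)

filtered = filter_lines(sample)
print(filtered)
```

Like the sed version, this also removes the blank lines between the relevant paragraphs.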

--
Best Regards,
Chris

Pierre-Olivier Bonin

Oct 19, 2016, 1:54:48 PM
to TextWrangler Talk
Wow, you guys are great! Thanks everyone for the advice. It works perfectly.

