omit literal text from otherwise matching search results


GWied

Nov 10, 2025, 12:58:50 PM
to BBEdit Talk
Another "occasional user" question - with not enough time to learn all the cool tools BBEdit and regex offer that would solve my problem.
Context: I have about 500 KB of text (43 .txt files) of video transcripts to edit/refine. All files have the timecode removed.

I want to find all instances of doubled words, but omit/ignore a subset of those matches, i.e., search for doubled words in a video transcript, but EXCLUDE "many, many" and "very, very". In effect, this will reduce instances of stuttering in a video transcript, but leave the intentional repeats intact. 

This search string finds doubled words separated by a comma and a space, which satisfies most of the instances of doubled words:
(\b[A-Za-z]+\b),\s\1 

replace with
\1 

e.g., find "what, what" and replace with "what" in the string:
So when we talk about the structure of data that describes what, what identifies our columns

But do not replace "very, very" in the string:
or even the greater distance away from zero, is very, very small. 
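For reference, here is the behavior of that base pattern, sketched in Python's re module (my own illustration, not part of the original question) - it shows that the pattern collapses every comma-separated doubled word, including the intentional "very, very", which is exactly the problem being asked about:

```python
import re

# Glenn's pattern: a word, a comma and a space, then the same word again.
pattern = r"(\b[A-Za-z]+\b),\s\1"

text = ("So when we talk about the structure of data that describes "
        "what, what identifies our columns. "
        "The distance away from zero is very, very small.")

cleaned = re.sub(pattern, r"\1", text)
print(cleaned)
# The stutter "what, what" is collapsed - but so is the
# intentional "very, very", which is what needs excluding.
```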


Thank you for any hints on doing this.

Glenn

 

GP

Nov 10, 2025, 8:13:26 PM
to BBEdit Talk
One way to accomplish that is to set up a text factory with three Canonize steps.

The first Canonize step, say a DoublesWordsSanitizerIgnores.txt file, will hold the one or more patterns for the phrases you want to exclude from the double-word finding-and-fixing step, supplementing each with a fix-up "ignore" add-on marker.

The second Canonize step, say a DoublesWordsSanitizer.txt file, will hold the one or more patterns that find the double words you want to replace with single words, while skipping over the fixed-up excludes from the first Canonize step.

The third and final Canonize step, say a DoubleWordSanitizerCleanup.txt file, will hold the pattern that finds all the fixed-up ignores/excludes from the first Canonize step and removes the fix-up add-ons, restoring those bits of text to the original.

For the first Canonize list of ignore/exclude patterns, it would probably be best to start out with a fairly simple list of double word patterns. Starting out simple helps in debugging and easily understanding what you're ignoring/excluding. Something simple like:

(\s)(many,\smany)(\s) \1%%\2%%\3
(\s)(very,\svery)(\s) \1%%\2%%\3

where I'm using %% as the fix-up add-ons to ignore/exclude those double word occurrences from the next double word sanitizing step.

(I'm capturing the leading and trailing white space to handle edge cases like line feeds, and I'm not using any word boundaries, \b, to avoid having to handle non-word characters adjacent to word characters.)

For the second Canonize list of patterns to find double words and replace them with single words, use one or more grep patterns like:

(\s)(\w+)\s\2 \1\2
(\s)(\w+),\s\2 \1\2

For the final Canonize step to clean up the fix-up add-ons added in the first ignore/exclude step, a grep pattern like:

%%(.+)%% \1
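The three steps chain together like so. Here is a Python sketch simulating the three Canonize passes with re.sub (an illustration only - in BBEdit these would be the three canon files run from a text factory; the sample sentence is my own):

```python
import re

text = ("So when we talk about what, what identifies our columns. "
        "That distance from zero is very, very small.")

# Step 1: mark the ignore/exclude phrases with %% fix-up add-ons.
for phrase in (r"many,\smany", r"very,\svery"):
    text = re.sub(r"(\s)(" + phrase + r")(\s)", r"\1%%\2%%\3", text)

# Step 2: collapse the remaining doubled words. The %% markers keep
# the excluded phrases from matching, because there is no longer
# whitespace immediately before the protected word.
text = re.sub(r"(\s)(\w+)\s\2", r"\1\2", text)
text = re.sub(r"(\s)(\w+),\s\2", r"\1\2", text)

# Step 3: strip the %% markers, restoring the protected phrases.
text = re.sub(r"%%(.+)%%", r"\1", text)

print(text)
# "what, what" is collapsed to "what"; "very, very" survives intact.
```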

I suggest you start out with sample text and individually run each Canonize step (from Text -> Canonize...) on it, to check that the patterns in each specific canon file do what they're supposed to do. Then, after that, combine them into a text factory and recheck the combined operation.

Bruce Van Allen

Nov 10, 2025, 8:52:54 PM
to bbe...@googlegroups.com
I wonder how many different doubled words there are in your docs.

If you did a multi-file search - but not replace - with your pattern, BBEdit would give you a list of the instances. If scanning them shows that there are actually not many variations, then you might consider handling them with search/replace one at a time: "the, the" => "the"; "but, but" => "but"; etc.

That may take less time than composing and testing a more fully-automated approach. GP's suggestion shows the power of BBEdit, but if you're only cleaning those files this one time, maybe more than you need.

I say this as a coder who will spend a couple hours working out a script to handle some task, and when it's ready after testing it will take 10 seconds to actually run, and I could have done the work by hand in 20 mins :-).

With any approach, best to work on copies of the originals until you're sure you have it.

Also, your pattern is fairly restrictive - always exactly a comma and then a single space between the two words. Are your docs that consistent?

HTH,

— Bruce

_bruce__van_allen__santa_cruz_ca_

jj

Nov 11, 2025, 7:13:47 AM
to BBEdit Talk
Hi Glenn,

Use a negative lookahead assertion like so:

    (?i)\b(?!well|very|so|really|okay|now?|many|long|far|et ?cetera)([\p{L}]+),\s\1\b
   
\b(?!terms|to|be|excluded) means: a word boundary not followed by any of the terms in the alternation.
In practice, the regex will skip over any of those terms.
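The same idea can be tried out in Python's re module - with one adaptation: re has no \p{L} support, so this sketch substitutes [A-Za-z]; in BBEdit's PCRE-style grep, use the pattern as written above:

```python
import re

# Jean's negative-lookahead pattern, with [\p{L}] adapted to [A-Za-z]
# for Python's re module. The lookahead skips the listed terms.
pattern = (r"(?i)\b(?!well|very|so|really|okay|now?|many|long|far"
           r"|et ?cetera)([A-Za-z]+),\s\1\b")

text = ("data that describes what, what identifies our columns, "
        "and the distance from zero is very, very small.")

result = re.sub(pattern, r"\1", text)
print(result)
# "what, what" is collapsed; "very, very" is left alone.
```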

For more info, see BBEdit's Help menu > BBEdit Help > Quick Reference > Grep Reference.

HTH,

Jean Jourdain
