how to find a repeated 200-character sentence in a large text file

samar

Apr 25, 2022, 11:42:09 AM
to BBEdit Talk
Hi all

While copyediting a text for a scholarly book (500+ pages when printed), I noticed that the author wrote exactly the same long sentence (= an identical string of 337 characters) once on page 23 and once on page 326. No doubt this happened because the author copied and pasted some text from his notes, unaware that he had already copied and pasted the same text earlier. I thought it would be a good idea to find out whether this has happened to the author more than once in his 1,000,000-character book, so that I can alert him (to give him a chance to omit the repetition).

And so I turned to BBEdit. The text of the whole book is now in a txt file. When I search for the sentence that in the Word document is on page 23, I can find it in BBEdit both in paragraph 117 and in paragraph 7831. What regular expression can I use to find other such repetitions?

I tried using the following string:

(?s)(.{200}).*?\1

This is what I understand it to mean (roughly):

(?s): search across paragraphs
(.{200}).*?: search for, and capture, a string of 200 characters, optionally followed by any characters
\1: stop the search as soon as you reach a second instance of the captured string

The string does what I need if I replace 200 with a shorter number, such as 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of course). Given that the sentence I have in mind is more than 300 characters long I should even have been able to use 300 instead of just 200.

Unfortunately, however, something seems to be amiss: BBEdit kept on searching and searching without finding anything, my notebook's fan started running, and after about 20 minutes it became clear that nothing would happen and that I could do nothing but Force Quit BBEdit.

So my question is, what's wrong with the above string? How else can I find a repeated 200-character sentence in a large text file?

Thanks
Sam

MediaMouth

Apr 25, 2022, 11:59:29 AM
to bbe...@googlegroups.com
Samar, first off this is a really cool challenge.

Personally I'd use JS as the tool of choice, if only to have a lot of control over the length of the string, reporting, real-time feedback and error-checking, fine-tuning, handling multiple docs, and the ability to stop easily when the code errs. (To be fair, JS is always my preferred go-to, so there's that. People who can make this work with regex and regex alone always impress.) If you go JS, lmk.

Thanks for sharing. Following.




bva

Apr 25, 2022, 12:41:21 PM
to bbe...@googlegroups.com
That search pattern would start at the beginning of the text, grab the first 200 characters and search the rest of the text for that, starting with the very next (201st) character, and trying every 200-char sequence from there to the end. Then it would progress one character forward to character 2, grab it and the next 199, and repeat that search. 

Moving ahead one character at a time through the million characters until it finds a match or until the text no longer has 200 characters left would certainly take some processing time!
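For comparison, here is a minimal Python sketch (not something from this thread) of how the same question can be answered in a single pass instead of rescanning the rest of the text for every starting position. It remembers where each 200-character window was first seen and reports any window that turns up again. The names `find_repeated_windows` and `width` are made up for illustration, and the sketch assumes the whole text fits in memory as one string.

```python
def find_repeated_windows(text, width=200):
    """Return (first_pos, repeat_pos) pairs where the same `width`-character
    window occurs at two different offsets in `text`."""
    seen = {}   # hash of a window -> offset where that window was first seen
    hits = []
    for pos in range(len(text) - width + 1):
        window = text[pos:pos + width]
        # Store only the hash to keep memory low, then confirm a real match
        # by comparing the actual characters (guards against hash collisions).
        first = seen.setdefault(hash(window), pos)
        if first != pos and text[first:first + width] == window:
            hits.append((first, pos))
    return hits


if __name__ == "__main__":
    sample = "abcXabc"
    print(find_repeated_windows(sample, width=3))
```

A repetition longer than `width` shows up as a run of consecutive hits, one per starting offset, so in practice you would look at the first hit of each run.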

Are the long strings always single sentences? If so, your pattern would be slightly optimized if it didn’t accept the end-of-sentence character (“period”, “full stop”, “dot”).

(?s)([^.]{200,}).*\1

(Inside the character class brackets ‘.’ just means dot, not “any character”.)

Given that the 200 is an arbitrary parameter (that is, you’re not looking for a string you already know to be exactly that length), note that the string captured by the above does NOT include the end-of-sentence character.

Assuming standard English/European writing practice, the sentences could probably also be expected to start with an upper-case alpha character after a whitespace character, so the pattern would be faster as:

(?s)(\s[A-Z][^.]{200,}).*\1

But the above suggestions won’t help much if you’re searching for strings with multiple sentences.

HTH

_bruce__van_allen__santa_cruz_ca_
_831_429_1688_p_
_831_332_3649_c_


samar

Apr 25, 2022, 1:40:01 PM
to BBEdit Talk
Thank you. I probably shouldn't have mentioned "sentences" because it may well be that the author adjusted punctuation and capitalisation when copying a string of words from his note file into the main file.

In the sample text file, I have therefore replaced all punctuation with a space, and then replaced consecutive spaces with one space. (There are lots of footnotes in the file, which I have deleted as well because they are not relevant for my search.) Also, my search was not case sensitive. I simply want to make sure no erroneous and embarrassing repetitions of the same 4-line-text or so occur.
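For anyone who wants to reproduce that clean-up outside BBEdit, a rough Python sketch of the same normalisation could look like this. The function name `normalize` is hypothetical, the footnote removal is not covered, and "punctuation" is assumed to mean any character that is neither a word character nor whitespace (so underscores are kept).

```python
import re


def normalize(text):
    """Lower-case `text`, turn every punctuation character into a space,
    and collapse runs of whitespace into a single space."""
    text = text.lower()                    # make the comparison case-insensitive
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    text = re.sub(r"\s+", " ", text)       # consecutive whitespace -> one space
    return text.strip()


if __name__ == "__main__":
    print(normalize("Hello,  World! It's  fine."))
```

Running the search on the normalised text means that a passage still counts as a repetition even if the author changed a comma or a capital letter when pasting it.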

So maybe regex isn't the right tool for this? If so I need to stop right here since I have, regrettably, neither the knowledge nor the tools to work with other languages.

Sam Birch

Apr 25, 2022, 2:38:09 PM
to BBEdit Talk

This sounds like a case of the Longest repeated substring problem. Regular expressions are not the right tool for the job, unfortunately.

There’s an online demo that might do what you need.

If you want to find all long repeated substrings, you can take an iterative approach: find the longest, remove the duplicates from the source text, and again find the longest.
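A minimal Python sketch of that iterative approach follows. It is not the algorithm behind the online demo (which is not shown here); it swaps in a simpler technique, a binary search on the repeat length combined with a table of already-seen windows. All function names, the `max_length` cap and the `min_length` threshold are assumptions made for illustration.

```python
def repeated_at_length(text, length):
    """Return (first, second) offsets of some substring of `length` characters
    that occurs twice (occurrences may overlap), or None if there is none."""
    seen = {}  # hash of a window -> offset where that window was first seen
    for pos in range(len(text) - length + 1):
        piece = text[pos:pos + length]
        first = seen.setdefault(hash(piece), pos)
        if first != pos and text[first:first + length] == piece:
            return first, pos
    return None


def longest_repeated_substring(text, max_length=5000):
    """Binary-search the length of the longest repeat, capped at `max_length`.
    This works because any repeat of length n contains a repeat of length n-1."""
    lo, hi = 0, min(len(text) - 1, max_length)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        hit = repeated_at_length(text, mid)
        if hit is None:
            hi = mid - 1
        else:
            lo = mid
            best = text[hit[0]:hit[0] + mid]
    return best


def long_repeats(text, min_length=200):
    """Find the longest repeat, delete its later copies, and look again,
    until nothing of at least `min_length` characters is left."""
    found = []
    while True:
        repeat = longest_repeated_substring(text)
        if len(repeat) < min_length:
            return found
        found.append(repeat)
        keep_until = text.find(repeat) + len(repeat)
        trimmed = text[:keep_until] + text[keep_until:].replace(repeat, "")
        if trimmed == text:  # only overlapping copies remain; stop, don't loop
            return found
        text = trimmed


if __name__ == "__main__":
    print(longest_repeated_substring("banana"))  # prints "ana"
```

Deleting the later copies can occasionally glue two unrelated pieces of text together, so each reported repeat is best checked against the original file.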

Hope this helps,
-sam

Jeffrey Jones

Apr 25, 2022, 8:33:27 PM
to bbe...@googlegroups.com
How about splitting the text so that each sentence is one line. Then use Text > Process Duplicate Lines…
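The same idea can be sketched in Python for readers who want to try it outside BBEdit. This is only an approximation of the menu command, not its implementation: the sentence split is deliberately naive (it breaks after ".", "!" or "?" followed by whitespace), and the name `duplicate_sentences` and the `min_length` parameter are invented for the example.

```python
import re
from collections import Counter


def duplicate_sentences(text, min_length=200):
    """Return every sentence of at least `min_length` characters that
    occurs more than once in `text`."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)]
    counts = Counter(s for s in sentences if len(s) >= min_length)
    return [s for s, n in counts.items() if n > 1]


if __name__ == "__main__":
    sample = ("First one. Repeated sentence here. Another. "
              "Repeated sentence here. Last!")
    print(duplicate_sentences(sample, min_length=10))
```

This only catches repetitions that start and end on sentence boundaries and are identical down to the punctuation, so it will miss a pasted passage whose punctuation or capitalisation was adjusted afterwards.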

samar

Apr 26, 2022, 2:15:06 AM
to BBEdit Talk
The online demo you linked to, Sam, does exactly what I need! I've found a few other doublets in the file that clearly should not be there. Thank you all for your thoughts and inputs.

jj

Apr 30, 2022, 2:44:55 PM
to BBEdit Talk
Here is a BBEdit Text Filter that will scan the frontmost document's selected text (or the whole document if there is no selection) for the longest repeated substring and log a regular expression in the 'Unix Script Output.log' that lets you find the repetition.

It works by replacing every character that is neither alphanumeric nor an underscore with a regular expression, thus comparing only the 'text' and ignoring anything else: whitespace, punctuation, math operators, etc.

As a consequence, it's more useful on textual content than on code.
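To illustrate the kind of regular expression meant here, the following Python sketch turns a snippet into a pattern that matches the same words regardless of what separates them. It is not taken from the filter's Go source: the function name `tolerant_pattern` and the choice of `\W+` as the replacement are assumptions for the sake of the example.

```python
import re


def tolerant_pattern(snippet):
    """Build a pattern that matches the words of `snippet` in order,
    with any run of non-word characters allowed between them."""
    words = re.findall(r"\w+", snippet)   # keep letters, digits, underscores
    return r"\W+".join(re.escape(word) for word in words)


if __name__ == "__main__":
    pattern = tolerant_pattern("Hello, world! Good-bye.")
    print(pattern)
    print(bool(re.search(pattern, "hello   WORLD; good bye", re.IGNORECASE)))
```

A pattern built this way grows with the length of the repetition, which fits the warning about very long generated expressions in the steps that follow.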

It is derived from the Go version of Ukkonen’s suffix tree construction and uses the gorun utility, which lets you execute a Go source file as a shell script.

 1. To install Go

  • from Go's installer: https://go.dev/doc/install
   
  • or with Homebrew at the terminal:
   
        % brew install go
   
 2. To install gorun at the terminal:

        % go install github.com/erning/gorun@latest
       
 3. Copy the file find_longest_repeated_substring.go to ~/Library/Application Support/BBEdit/Text Filters

 4. Use the text filter from the Text menu > Apply Text Filter > find_longest_repeated_substring

 5. In case the 'Unix Script Output.log' doesn't show up, use the Menu Go > Commands... panel to find the 'Unix Script Output.log'

 6. In the 'Unix Script Output.log', select the logged regular expression and copy it to the find window with <Command-Shift-E>.
    (Warning: Do not copy the trailing return character as it is not part of the regular expression)
 
 7. Activate your document and use the find window to search for the occurrences of the repetition.
    (Warning: if the repetition is very long, the generated regular expression might not compile. Just select a shorter portion from its beginning to some sensible length.)


HTH,

Jean Jourdain
