Removing duplicate words


ce gm

Oct 30, 2024, 8:21:46 AM
to BBEdit Talk
I haven't found a thread on this, but apologies if one exists! 

I am new to BBEdit, and am using it to clean .txt files prior to text mining. I am converting files to .txt from PDF to ensure R reads the files in correctly (I've had issues with the R PDF reader). When I do this conversion, words are often duplicated, appearing like "to to" or "finally finally" throughout the text. These get flagged for grammar in TextEdit and Word, but fixing them requires going through the entire document manually. I have thousands of pages to go through - if I ever want to finish my dissertation, I can't do that.

I tried the Process Duplicate Lines command in BBEdit, but it did not remove duplicates of words within lines. Does anyone know if there is a way to get BBEdit to identify duplicate words, then automatically delete one of them?

(or if not BBEdit, then Word or TextEdit?)

Thanks!

Jim Straus

Oct 30, 2024, 8:35:56 AM
to bbe...@googlegroups.com
If you’re going to ingest the text, you could turn all the spaces into line breaks and then remove duplicate lines.  If you don’t care about punctuation you could remove that too.

Or create a script to do it.

--
This is the BBEdit Talk public discussion group. If you have a feature request or believe that the application isn't working correctly, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Mastodon: <https://mastodon.social/@bbedit>

Roger Moffat

Oct 30, 2024, 8:53:14 AM
to bbe...@googlegroups.com
That was my first thought too, but I quickly realised it won’t work: in the end the document would contain only one instance of each word in the original.

The duplicates to be removed need to be adjacent, not just appearing anywhere else in the document.

Roger
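
A small Python sketch (not from the thread) makes Roger's point concrete. It assumes the duplicate-line removal keeps the first occurrence of each line and drops later ones; the function name dedupe_as_lines and the sample sentence are made up for illustration.

```python
def dedupe_as_lines(text):
    """Simulate putting each word on its own line, then removing
    duplicate lines while keeping the first occurrence (an assumption)."""
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


# Hypothetical sample: "to to" is the only adjacent duplicate.
sample = "we went to to the store and then to the park"
result = " ".join(dedupe_as_lines(sample))
print(result)  # we went to the store and then park
```

The stray "to to" is fixed, but the legitimate later "to the" is lost as well, which is why only adjacent repeats should be targeted.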



Patrick Woolsey

Oct 30, 2024, 9:38:43 AM
to bbe...@googlegroups.com
May I ask whether these duplicate words are arbitrary or do they e.g. mainly consist of articles? Also, do these words contain any accented characters or numerals?

(I expect a suitable grep search & replace could clean up quite a bit of these, though an example file would be helpful.)


Regards,

Patrick Woolsey
==
Bare Bones Software, Inc. <https://www.barebones.com/>

Patrick Woolsey

Oct 30, 2024, 9:47:35 AM
to bbe...@googlegroups.com
Also, for reference:

> On Oct 30, 2024, at 02:21, ce gm <german.c...@gmail.com> wrote:
> I tried the Process Duplicate Lines command in BBEdit, but it did not remove duplicates of words within lines.

That is as expected; BBEdit's line processing commands apply only to hard-wrap delineated lines, not words or paragraphs, etc.


> Does anyone know if there is a way to get BBEdit to identify duplicate words, then automatically delete one of them?

Though BBEdit does not contain any commands explicitly designed for this purpose, you may be able to accomplish the desired task (or at least a good deal of it) using its search & replace capabilities per my prior post.

Alberto Gutiérrez

Oct 30, 2024, 10:46:04 AM
to BBEdit Talk
Hi, if you only need to catch and replace duplicated words in a text in BBEdit, you can use this GREP search:

Find: \b(\w+) +\1\b
Replace: \1

Cheers.
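
For anyone who would rather script the cleanup outside BBEdit (for example, over a folder of converted files), the same pattern can be sketched in Python. This is an illustration, not something posted in the thread: the function name remove_adjacent_duplicates is hypothetical, and the loop is an addition so that triples such as "the the the" collapse to a single word.

```python
import re

# Same pattern as the Find field above: a whole word, one or more
# spaces, then the same word again, ending at a word boundary.
ADJACENT_DUPLICATE = re.compile(r"\b(\w+) +\1\b")


def remove_adjacent_duplicates(text):
    """Collapse adjacent repeated words, e.g. "to to" becomes "to".

    A single substitution pass turns "the the the" into "the the",
    so the substitution is repeated until the text stops changing.
    """
    previous = None
    while previous != text:
        previous = text
        text = ADJACENT_DUPLICATE.sub(r"\1", text)
    return text


print(remove_adjacent_duplicates("I went to to the store"))
```

A few caveats under these assumptions: the match is case-sensitive as written, so "The the" is left alone; the pattern only spans spaces, so a word repeated across a line break is not caught; and intentional repeats such as "that that" are removed too, so spot-checking the output is advisable. In Python, \w matches accented letters and digits in ordinary strings.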

ce gm

Oct 30, 2024, 3:30:38 PM
to BBEdit Talk
Thanks Alberto, that is exactly what I needed!

Thanks all for your advice!
