Search and replace based on a custom dictionary file


Roland Küffner

Oct 27, 2011, 7:31:40 AM
to bbe...@googlegroups.com
Hi, everybody,

maybe someone already has a solution to this and is willing to share:

More often than I thought I find myself having to replace a bunch of terms in a text file with new text. Doing it by hand means doing several search-replace-actions one after another. Putting together a Text Factory would be a solution but very often the whole replace task is a single one - no need to repeat the exact searches ever again.

My idea was to do it with some kind of dictionary file. In it each line would contain a single search replacement pair separated by tabs. Just like:

old term<tab>new term
some other random old text<tab>another replacement
...

Both the search and the replacement text are often unsystematic, so doing it with regular expressions is no solution. I tried to fumble together a script that reads such a file (from the desktop, maybe) and processes it line by line, searching my current front document for each old term and replacing it with the corresponding new value.
But I failed. I'm afraid my capabilities as a scripter do not even classify as "Beginner".

Does anyone perchance have some code scaffolding or hints on how to do this - I don't mind if it's AppleScript, Perl or Python - I suck at all of them :-(

Regards,
Roland
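
Since Python is one of the languages mentioned, the idea described above can be sketched roughly as follows. This is only a minimal illustration, assuming the dictionary uses one "old<tab>new" pair per line and that plain literal (non-regex) replacement is wanted; the function names load_pairs and replace_all are made up for this sketch.

```python
def load_pairs(dictionary_text):
    """Parse 'old<TAB>new' lines into a list of (old, new) pairs."""
    pairs = []
    for line in dictionary_text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        old, new = line.split("\t", 1)  # split on the first tab only
        if not old:
            continue  # an empty search term would match everywhere
        pairs.append((old, new))
    return pairs


def replace_all(text, pairs):
    """Apply each literal replacement in order, one after another."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    dictionary = (
        "old term\tnew term\n"
        "some other random old text\tanother replacement\n"
    )
    print(replace_all("An old term and some other random old text.",
                      load_pairs(dictionary)))
```

Because the pairs are applied one after another, a later pair can rewrite text produced by an earlier one, so the order of the dictionary lines matters.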

Bruce Van Allen

Oct 27, 2011, 9:32:56 AM
to bbe...@googlegroups.com
On 2011-10-27, Roland Küffner wrote:
>My idea was to do it with some kind of dictionary file. In it
>each line would contain a single search replacement pair
>separated by tabs. Just like:
>
>old term<tab>new term
>some other random old text<tab>another replacement

I picture a simple and flexible Perl script, but first, two questions:

1. Would it be the case that you would replace EVERY instance of
one of your target terms with its replacement term?

2. Typically, how many terms would be replaced -- order of
magnitude: 10, 100, 1000?

Script coming later this morning, unless someone beats me to it
(calling you out JD :-)

- Bruce

_bruce__van_allen__santa_cruz_ca_

Bruce Van Allen

Oct 27, 2011, 11:04:48 AM
to bbe...@googlegroups.com
On 2011-10-27, Roland Küffner wrote:
>My idea was to do it with some kind of dictionary file. In it
>each line would contain a single search replacement pair
>separated by tabs. Just like:
>
>old term<tab>new term
>some other random old text<tab>another replacement

OK, I went ahead and made a script that will handle most situations.

Put this in ~/Library/Application Support/BBEdit/Text Filters
(see NOTES below):

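#! /usr/bin/perl
use strict;

# Load the dictionary from the lines below __END__: one
# "search<tab>replacement" pair per line, split on the first tab only.
my %dictionary = map {
    chomp($_);
    $_ ? (split /\t/ => $_, 2) : ()
} (<DATA>);

# Filter the input line by line, applying every dictionary entry.
while (<>) {
    for my $search_term (keys %dictionary) {
        s/\b$search_term\b/$dictionary{$search_term}/ge;
    }
    print;
}
__END__
this	THAT
some	MANY
my	YOUR
up	DOWN
right	LEFT
north	SOUTH
wanted more	NEEDED LESS
wanted	GAVE AWAY
on	OFF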


## end of text filter script (don't include this line in script)

NOTES:
1. This will work on most text files. If the body of your text
is large, you might get noticeable delay while it works, with
the number of replacements also a factor.

2. The dictionary consists of the lines below the '__END__'
line. You may have as many as you want. In this sample, I
capitalized the replacement terms only to make them visible in
my text while testing.

3. As written, this only replaces case-sensitively; if you want
replacement regardless of capitalization, add an 'i' after the
'ge' at the end of the substitution line:
s/\b$search_term\b/$dictionary{$search_term}/gei;

4. This has a little bit of error protection:
a. When the dictionary terms are split, it only splits on the
FIRST tab, in case your replacement term has tabs in it (see the
'2' in the split expression).
b. The '\b' before and after the $search_term in the
substitution line protects against replacing your search term
when it happens to be part of a larger word.
c. If your dictionary has any blank lines (after the end-of-line
is chomped off), they will be ignored.

5. You could have more than one of these dictionary text
filters, each with its own set of terms and replacements. Just
put each one in its own file and save with a memorable name to
your text filters folder. The only thing you would change would
be what's below the '__END__' line.

6. My own preference would be to use a colon (':') rather than a
tab between the search term and its replacement. And I would
allow for but not require spaces on either side of the colon. In
that case, the script would look like this:

#! /usr/bin/perl
use strict;
my %dictionary = map {
    chomp($_);
    $_ ? (split /\s*:\s*/ => $_, 2) : ()
} (<DATA>);
while (<>) {
    for my $search_term (keys %dictionary) {
        s/\b$search_term\b/$dictionary{$search_term}/ge;
    }
    print;
}
__END__
this:THAT
some:MANY
my:YOUR
up: DOWN
right:LEFT
north:SOUTH
wanted more:NEEDED LESS
wanted : GAVE AWAY
on:OFF

- Bruce

_bruce__van_allen__santa_cruz_ca_
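
For comparison, here is a rough Python sketch of the points in the notes above: splitting on the first colon with optional spaces, ignoring blank lines, whole-word matching with '\b', and an optional case-insensitive mode. It is an illustration rather than a port of the Perl script. In particular, it escapes the search terms with re.escape, which the Perl script as posted does not do, and the names parse_colon_dictionary and replace_whole_words are invented for this example.

```python
import re


def parse_colon_dictionary(dictionary_text):
    """Parse 'old : new' lines, splitting on the first colon only."""
    pairs = {}
    for line in dictionary_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        parts = re.split(r"\s*:\s*", line.strip(), maxsplit=1)
        if len(parts) == 2 and parts[0]:
            pairs[parts[0]] = parts[1]
    return pairs


def replace_whole_words(text, pairs, ignore_case=False):
    """Replace each term only where it stands as a whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    for old, new in pairs.items():
        pattern = r"\b" + re.escape(old) + r"\b"
        # A function replacement keeps backslashes in 'new' literal.
        text = re.sub(pattern, lambda m, new=new: new, text, flags=flags)
    return text


if __name__ == "__main__":
    pairs = parse_colon_dictionary("on:OFF\nup: DOWN\n\nwanted : GAVE AWAY\n")
    print(replace_whole_words("Turn on the button and look up.", pairs))
```

The word "button" is left alone because the "on" inside it is not surrounded by word boundaries.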

James Marks

Oct 27, 2011, 9:41:35 AM
to bbe...@googlegroups.com
Oops, left a test 'print_r()' in there. It would be this:

#!/usr/bin/php
<?php

// Set up the terms
$terms = array(
    'this' => 'that',
    'red' => 'blue',
    'stop' => 'go',
);

// Break out the find and replace lists
$find = array_keys($terms);
$replace = array_values($terms);

// Get the document contents
$filename = "/usr/local/something.txt";
$filehandle = fopen($filename, "r");
$document = fread($filehandle, filesize($filename));
fclose($filehandle);

// Perform the find and replace
str_replace($find, $replace, $document);
?>


James Marks

Oct 27, 2011, 10:56:57 AM
to bbe...@googlegroups.com
And one last correction (sorry, just woke up and haven't had my coffee yet):

#!/usr/bin/php
<?php

// Set up the terms
$terms = array(
    'this' => 'that',
    'red' => 'blue',
    'stop' => 'go',
);

// Break out the find and replace lists
$find = array_keys($terms);
$replace = array_values($terms);

// Get the document contents
$filename = "/usr/local/something.txt";
$filehandle = fopen($filename, "r");
$document = fread($filehandle, filesize($filename));
fclose($filehandle);

// Perform the find and replace

$document = str_replace($find, $replace, $document);

// Update the document
$filehandle = fopen($filename, "w");
fwrite($filehandle, $document);
fclose($filehandle);

?>
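
The same read, replace, write-back cycle can be sketched in Python for anyone who would rather avoid PHP. This is a minimal illustration only: the function name replace_in_file is made up, the file is assumed to be UTF-8, and the demo writes to a temporary file rather than to a real document.

```python
import tempfile
from pathlib import Path

# Same sample terms as in the PHP script above.
TERMS = {"this": "that", "red": "blue", "stop": "go"}


def replace_in_file(path, terms):
    """Read the file, apply each replacement in order, write it back."""
    path = Path(path)
    document = path.read_text(encoding="utf-8")
    for old, new in terms.items():
        document = document.replace(old, new)
    path.write_text(document, encoding="utf-8")
    return document


if __name__ == "__main__":
    # Demo on a throwaway file; point 'sample' at a real path to use it.
    with tempfile.TemporaryDirectory() as folder:
        sample = Path(folder) / "something.txt"
        sample.write_text("stop at this red light\n", encoding="utf-8")
        print(replace_in_file(sample, TERMS))
```

As with the PHP version, this overwrites the file in place, so it is worth trying on a copy first.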


On Oct 27, 2011, at 4:31 AM, Roland Küffner wrote:

James Marks

Oct 27, 2011, 9:39:26 AM
to bbe...@googlegroups.com
Something like this should work (in PHP):

#!/usr/bin/php
<?php


$terms = array(
    'this' => 'that',
    'red' => 'blue',
    'stop' => 'go',
);

$find = array_keys($terms);
$replace = array_values($terms);

print_r($replace);

$filename = "/usr/local/something.txt";
$filehandle = fopen($filename, "r");
$document = fread($filehandle, filesize($filename));
fclose($filehandle);

str_replace($find, $replace, $document);
?>


On Oct 27, 2011, at 4:31 AM, Roland Küffner wrote:

Webmaster

Oct 27, 2011, 12:09:15 PM
to bbe...@googlegroups.com
Hello Bruce,

Thank you for your very useful and practical PERL script. I am not sure when or how I will use it, but these are the kind of golden nuggets that I love to save from this list for possible future reference.

While I don't participate often on the list, I do appreciate all of the constructive comments and suggestions which have been made here, and the many code snippets which have been freely shared here.

It never ceases to amaze me how powerful BBEdit really is, once one begins to understand the power that is under the hood, and I personally have barely scratched the surface, due to my limited knowledge of PERL, PHP, etc.

I do have a grep question for you and others here.

As I have mentioned before on the list, I have the entire KJV Bible broken down into 31,102 individual HTML files as a part of the Google search engine queries feature on our main website.

In order to further refine user search queries, what I would like to do is this:

Each file obviously has its own unique title tag, such as this:

<title>Obadiah 1:4 - KJV ( King James Version ) Bible Verse</title>

Each file also has a keywords tag, except that at this time they are NOT unique. In other words, all 31,102 look like this:

<meta name="keywords" content="kjv,bible,verse,king james">

Each file also has a comment string right above the page anchor. They are NOT unique. They all look the same like this:

<!-- kjv,bible,verse,king james -->

So, what I want to do is change BOTH the keywords tag, as well as the comment string, so that they also include the actual verse reference that is found in the title tag.

So, if we use the example above...

<meta name="keywords" content="kjv,bible,verse,king james">

Would change to this...

<meta name="keywords" content="Obadiah 1:4,kjv,bible,verse,king james">

And this…

<!-- kjv,bible,verse,king james -->

Would change to this…

<!-- Obadiah 1:4,kjv,bible,verse,king james -->

You will notice that with both the keywords tag and the comment string, I wish to place the verse reference at the beginning, followed by a comma, as that will give the reference more weight than the other included keywords.

I guess the tricky part is that the title tag contains more than just the verse reference itself. It also contains the " - KJV ( King James Version ) Bible Verse" which doesn't need to be copied to the keywords tag, or to the comment string. So I am assuming that two steps might be necessary: first add the full title, and then delete the part of the title that I don't need there.

So, how would I be able to make these two changes in each of the 31,102 files, which each contain a different verse reference in the title tag?

Thanks to anyone who can assist me with this. I imagine that this is probably a piece of cake for some of you, but it is beyond my current abilities to figure out.

WW
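
One possible way to approach the question above, sketched in Python rather than as a BBEdit grep pattern: capture the verse reference from the title tag, then prepend it to the keywords tag and to the comment string. This is only an illustration under the assumption that every file uses exactly the title suffix, keywords value and comment shown in the post; the function name add_verse_reference is made up.

```python
import re

# Captures everything in the title before the fixed suffix.
TITLE_RE = re.compile(
    r"<title>(.+?) - KJV \( King James Version \) Bible Verse</title>"
)
OLD_KEYWORDS = "kjv,bible,verse,king james"


def add_verse_reference(html):
    """Prepend the verse reference from <title> to keywords and comment."""
    match = TITLE_RE.search(html)
    if match is None:
        return html  # leave files without the expected title untouched
    new_keywords = match.group(1) + "," + OLD_KEYWORDS
    html = html.replace('content="' + OLD_KEYWORDS + '"',
                        'content="' + new_keywords + '"')
    html = html.replace("<!-- " + OLD_KEYWORDS + " -->",
                        "<!-- " + new_keywords + " -->")
    return html


if __name__ == "__main__":
    page = (
        "<title>Obadiah 1:4 - KJV ( King James Version ) Bible Verse</title>\n"
        '<meta name="keywords" content="kjv,bible,verse,king james">\n'
        "<!-- kjv,bible,verse,king james -->\n"
    )
    print(add_verse_reference(page))
```

Because the capture stops before " - KJV", no second step is needed to trim the title. Running the function on an already updated file changes nothing, since the original keywords string no longer appears right after content=" or the comment opener.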

Webmaster

Oct 27, 2011, 12:23:37 PM
to bbe...@googlegroups.com
Hello James Marks,

Thanks for your PHP script. Does it do exactly the same thing as Bruce's PERL script?

Exactly how would I use your script?

Can the file to be worked on be placed in any location, and then the path changed for the "$filename" variable, or does the file have to be in /usr/local?

I don't use OS X's built-in PHP installation. I use MAMP Pro for my server setup, which includes the latest versions of Apache, mySQL and PHP.

Would I still be able to use your script? If so, how?

Thanks, and sorry for so many questions.

WW

James Marks

Oct 27, 2011, 2:39:56 PM
to bbe...@googlegroups.com
On Oct 27, 2011, at 9:23 AM, Webmaster wrote:


> Thanks for your PHP script. Does it do exactly the same thing as Bruce's PERL script?

I'm not as adept at PERL as I am at PHP, so I couldn't say exactly how they compare, but I'm guessing they're similar. (I find PHP to be a bit more readable than PERL but that doesn't mean it runs any better or worse.)


> Exactly how would I use your script?

See below.


> Can the file to be worked on be placed in any location, and then the path changed for the "$filename" variable, or does the file have to be in /usr/local?

Yes. If you provide an absolute path to the document then it should work no matter where the PHP script resides.


> I don't use OS X's built-in PHP installation. I use MAMP Pro for my server setup, which includes the latest versions of Apache, mySQL and PHP.
>
> Would I still be able to use your script? If so, how?

You'd invoke it from the (terminal) command line, from the directory in which the PHP script resides, by typing "php nameofyourphpscript.php" and hitting "enter". PHP should be configured to work from the command line by default. The documentation is here: http://php.net/manual/en/features.commandline.php


> Thanks, and sorry for so many questions.

No problem and good luck.

James

Ronald J Kimball

Oct 27, 2011, 3:02:33 PM
to bbe...@googlegroups.com
On Thu, Oct 27, 2011 at 01:31:40PM +0200, Roland Küffner wrote:

> My idea was to do it with some kind of dictionary file. In it each line would contain a single search replacement pair separated by tabs. Just like:
>
> old term<tab>new term
> some other random old text<tab>another replacement
> ...

I would take a slightly different approach from Bruce's, because I see some
drawbacks to doing a separate search/replace for each term.

First, I think it could be much slower if you have a lot of text and a big
dictionary.

Second, you could end up modifying the same piece of text multiple times,
depending on the order in which the search & replaces happen. Further,
when that order is based on keys in a hash, you can't reliably predict
which outcome you'll get.


For example, if you have these entries in your dictionary:

house<tab>home
my home<tab>where I live

with the text 'my house', the result could be either 'where I live' or 'my
home'.


Or if you had these entries in your dictionary:

dog<tab>cat
cat<tab>dog

with the text 'dog cat', the result would be either 'dog dog' or 'cat cat'
but not the desired 'cat dog'.


So, instead I would create a single regex that matches all the old terms,
sorted in descending order by length, in case one term is a prefix of
another.

#!perl

use strict;

my %dict;

while (<DATA>) {
    chomp;
    /\t/ or next;
    my ($old, $new) = split /\t/, $_;
    $dict{$old} = $new;
}

my $re =
    '\b(' . join('|', sort { length $b <=> length $a } keys %dict) . ')\b';

while (<>) {
    s/$re/$dict{$1}/g;
    print;
}

__END__
house	home
my home	where I live

dog	cat
cat	dog


Here I'm loading the dictionary with a while loop, but Bruce's map approach
is perfectly fine as well. I also like his suggestion to use a colon with
optional spaces instead of a tab; that way you can line up the terms in
nice columns.


Ronald
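
The difference between the two approaches can also be illustrated in Python. The sketch below is not a translation of either Perl script; it just contrasts one replacement pass per term with a single combined pattern whose terms are sorted longest first. It assumes the terms are plain words without regex metacharacters, and the function names are invented for the example.

```python
import re


def replace_sequential(text, dictionary):
    """One search/replace pass per term; later passes see earlier output."""
    for old, new in dictionary.items():
        text = re.sub(r"\b" + old + r"\b", lambda m, new=new: new, text)
    return text


def replace_single_pass(text, dictionary):
    """One combined pattern; each stretch of text is replaced at most once."""
    terms = sorted(dictionary, key=len, reverse=True)  # longest first
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b")
    return pattern.sub(lambda m: dictionary[m.group(1)], text)


if __name__ == "__main__":
    swap = {"dog": "cat", "cat": "dog"}
    print(replace_sequential("dog cat", swap))
    print(replace_single_pass("dog cat", swap))
```

In the sequential version the second pass rewrites what the first pass produced, so the swap cannot work. In the single-pass version the replacement text is never scanned again.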

Roland Küffner

Oct 27, 2011, 4:11:45 PM
to bbe...@googlegroups.com
On 27.10.2011, at 17:04, Bruce Van Allen wrote:

> On 2011-10-27, Roland Küffner wrote:
>> My idea was to do it with some kind of dictionary file. In it each line would contain a single search replacement pair separated by tabs. Just like:
>>
>> old term<tab>new term
>> some other random old text<tab>another replacement

A million thanks to Bruce and James,

Bruce's script works pretty well for normal text. That covers about 90% of my use cases, so I'm already a happy camper. It might get a little tricky when the search terms contain characters that upset the s/... line in the PERL script. I tried one example where I had a URI to be replaced and the script choked on it. But maybe I will be able to puzzle this out myself after getting nudged in the right basic direction. I will also ponder the PHP version a bit, as my PHP is slightly better than my PERL and I might be able to tweak it more easily.

Meanwhile I tried something totally different and can provide a third solution. But beware, this is kind of an ugly hack: lacking scripting fu, I like Text Factories and use them quite a lot. Setting them up is a little tedious, as it requires a lot of clicking and only a little typing. So my idea was: why not build a Text Factory that can build Text Factories (I know, it is madness!!) ...
But seriously: if you open a Text Factory in another text editor (TextEdit, if you must) you'll see that factories are really plist files, and hence XML. A simple "Replace All" step from a factory takes roughly the following form in plain text:

<dict>
    <key>ComponentArguments</key>
    <dict>
        <key>CaseSensitive</key>
        <false/>
        <key>MatchWords</key>
        <false/>
        <key>ReplaceString</key>
        <string>\2</string>
        <key>SearchString</key>
        <string>\1</string>
        <key>UseGrep</key>
        <false/>
    </dict>
    <key>ComponentName</key>
    <string>ReplaceAll</string>
</dict>

I created a new (meta-)Text Factory with a Replace All step. Using grep, I simply search for
^(.+?)\t(.+)
to match my dictionary items (see OP).
The replacement contains the snippet above (complete with the \1 and \2 placeholders). This step translates every line in my dictionary text file into a Text Factory step.
The last two steps in my meta-factory add the header and the footer of the plist file. Simply add two further "Replace All" steps. The first searches for \A (matches the very beginning of a text file), the second for \Z (matches EOF). Copy the replacement text from a sample factory file (it's a little too long to post here - but you still have it open in TextEdit, haven't you?).
Finally, add a Translate Text to HTML step at the very beginning of the meta-factory. This will escape any < > and & characters in your dictionary file that would otherwise mess up an XML file.

The resulting meta-factory turns a tab-delimited dictionary file into a working Text Factory at the press of one shortcut. Just save your processed dictionary file as a new file and give it the extension .textfactory - when you close and reopen it, it will automatically open as a factory, ready to be unleashed upon your text files.

With this approach you can even set up a dictionary file whose search/replace pairs contain grep patterns (of course you must set the UseGrep entry in the above snippet to true) - imagine the possibilities ...

Enjoy,
Roland
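
The per-line translation step described above could also be done by a small script instead of a meta-factory. The Python sketch below is an illustration only: it produces just the "Replace All" step fragments, using the structure of the snippet quoted in the post, and does not attempt the plist header and footer, which are not shown in the thread. The names STEP_TEMPLATE, make_replace_step and make_steps are invented for the example.

```python
from xml.sax.saxutils import escape

# Structure copied from the "Replace All" step quoted in the post; only
# the two <string> values are filled in per dictionary line.
STEP_TEMPLATE = """<dict>
    <key>ComponentArguments</key>
    <dict>
        <key>CaseSensitive</key>
        <false/>
        <key>MatchWords</key>
        <false/>
        <key>ReplaceString</key>
        <string>{replace}</string>
        <key>SearchString</key>
        <string>{search}</string>
        <key>UseGrep</key>
        <false/>
    </dict>
    <key>ComponentName</key>
    <string>ReplaceAll</string>
</dict>"""


def make_replace_step(search, replace):
    """Fill the template, escaping &, < and > so the XML stays valid."""
    return STEP_TEMPLATE.format(search=escape(search), replace=escape(replace))


def make_steps(dictionary_text):
    """Turn each 'old<TAB>new' line into one Replace All step fragment."""
    steps = []
    for line in dictionary_text.splitlines():
        search, tab, replace = line.partition("\t")
        if tab and search and replace:  # mirrors ^(.+?)\t(.+)
            steps.append(make_replace_step(search, replace))
    return "\n".join(steps)


if __name__ == "__main__":
    print(make_steps("old term\tnew term\nAT&T\t<b>AT&amp;T</b>\n"))
```

The escape call plays the role of the Translate Text to HTML step: it keeps < > and & in the dictionary from breaking the XML.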


Ronald J Kimball

Oct 27, 2011, 4:15:55 PM
to bbe...@googlegroups.com
On Thu, Oct 27, 2011 at 03:02:33PM -0400, Ronald J Kimball wrote:

> my $re =
> '\b(' . join('|', sort { length $b <=> length $a } keys %dict) . ')\b';

I just realized I forgot one thing. This should be:

my $re =
    '\b(' .
    join('|', map "\Q$_\E", sort { length $b <=> length $a } keys %dict) .
    ')\b';

in case the terms contain special characters. "\Q$_\E" will escape the
special characters with backslashes.

Ronald
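
To see why the escaping matters, here is a small Python illustration in which re.escape plays the role of "\Q$_\E". It is a sketch only: the URLs are made-up examples and the function name build_pattern is invented.

```python
import re


def build_pattern(terms, escape_terms=True):
    """Combine the terms, longest first, into one whole-word pattern."""
    ordered = sorted(terms, key=len, reverse=True)
    if escape_terms:
        ordered = [re.escape(term) for term in ordered]
    return re.compile(r"\b(" + "|".join(ordered) + r")\b")


if __name__ == "__main__":
    # Hypothetical URLs, used only to show the effect of escaping.
    dictionary = {"http://example.com/a?b=1": "http://example.org/new"}
    text = "see http://example.com/a?b=1 now"

    escaped = build_pattern(dictionary)
    print(escaped.sub(lambda m: dictionary[m.group(1)], text))

    # Without escaping, '?' and '.' are treated as regex operators.
    unescaped = build_pattern(dictionary, escape_terms=False)
    print(unescaped.search(text))
```

Without escaping, the "?" in the URL makes the preceding "a" optional instead of matching a literal question mark, so the term no longer matches itself. One caveat of keeping '\b' around the group: a term that begins or ends with a non-word character (a trailing slash, say) only matches where a word character sits next to it.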
