Grep find/replace numbers

severdia

May 22, 2021, 12:39:51 PM
to BBEdit Talk
Hi,

I can't seem to figure out a way to find and replace some numbers using Grep. This is what I have.

<loc>2.2</loc><loc2>2.2.93</loc2>

I have many cases where there are 3 numbers separated by two periods wrapped in <loc> (like this: <loc>2.2.309</loc>) as well as the example above with 2 numbers separated by 1 period wrapped in <loc>. I want to find where there are only two numbers and delete that <loc> element. For example I tried:

<loc>[0-9]\.[0-9]</loc><loc2>[0-9]\.[0-9]\.[0-9]</loc2>

But that seems to match only single digits, and I can't replace values. Example values in <loc> are:

<loc>1.13.333</loc>
<loc>5.3.28</loc>
etc.

The first number will be in the range of 1-5, the second will be 1-50, and the third will be 1-2300 (those ranges aren't as important, just for the number of digits to factor in). I imagine I could do all permutations of what I have above with single digits, but it seems there's surely a better way to find the <loc> elements containing only two numbers vs. three.

Any guidance would be appreciated! Maybe the BBEdit shortcut menu for grep characters could add something like this?

Thanks!

Ron



jj

May 22, 2021, 1:42:18 PM
to BBEdit Talk
Hi Ron,

Here is a regular expression that should match your pattern:

<(loc2?)>[1-5](?:\.\d+)+</\1>

The same but commented:

(?x)                (?# allow comments and whitespace)
<                   (?# start of opening tag)
(                   (?# start of capturing parenthesis)
    loc             (?# loc )
    2?              (?# optional 2 )
)                   (?# end of capturing parenthesis. Will match and store 'loc' or 'loc2' in capture \1)
>                   (?# end of opening tag)
[1-5]               (?# one digit in the range 1..5)
(?:                 (?# start of non-capturing parenthesis)
    \.              (?# a dot. It has to be escaped otherwise it will be considered a one character wildcard.)
    \d+             (?# one or more digits)
)                   (?# end of non-capturing parenthesis)
+                   (?# one or more occurrence of the non-capturing parenthesis)
</                  (?# start of closing tag)
\1                  (?# captured value of first capturing parenthesis)
>                   (?# end of closing tag)
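
For anyone who wants to try this pattern outside BBEdit, here is a minimal sketch using Python's re module, which accepts the same syntax for this particular expression. The helper name find_loc_elements and the sample string are made up for illustration, and BBEdit's own grep engine may differ from Python on other constructs.

```python
import re

# Same pattern as above, written with Python's verbose flag.
LOC_PATTERN = re.compile(
    r"""
    <(loc2?)>      # opening tag; captures 'loc' or 'loc2' as group 1
    [1-5]          # one digit in the range 1..5
    (?:\.\d+)+     # one or more groups of: a literal dot, then digits
    </\1>          # closing tag that repeats the captured tag name
    """,
    re.VERBOSE,
)


def find_loc_elements(text):
    """Return every substring matched by LOC_PATTERN (hypothetical helper)."""
    return [m.group(0) for m in LOC_PATTERN.finditer(text)]


sample = "<loc>2.2</loc><loc2>2.2.93</loc2>"
print(find_loc_elements(sample))
```

Because the trailing + allows any number of dot-plus-digits groups, the pattern matches both the two-number and the three-number element in this sample.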

The BBEdit Help has a very good Grep Reference:

menu Help > Quick Reference > Grep Reference

HTH,

Jean Jourdain

Christopher Stone

May 22, 2021, 7:08:36 PM
to BBEdit-Talk
On 05/22/2021, at 10:50, severdia <seve...@gmail.com> wrote:
I can't seem to figure out a way to find and replace some numbers using Grep. This is what I have.

<loc>2.2</loc><loc2>2.2.93</loc2>

I have many cases where there are 3 numbers separated by two periods wrapped in <loc> (like this: <loc>2.2.309</loc>) as well as the example above with 2 numbers separated by 1 period wrapped in <loc>. I want to find where there are only two numbers and delete that <loc> element. For example I tried:


Hey Ron,

Explanations are welcome, but when asking for assistance with data manipulation it's best to provide concrete, real-world data samples and the expected results.

Words nearly always include assumptions and opportunities for error.

A good test case (or three) along with the expected result(s) makes it much easier for people to experiment and get the job right the first time.

If I'm understanding your task correctly then this should work:

Find:

<loc>(\d+\.\d+)</loc>

Replace:

\1
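
To see what this find/replace pair does to the sample line, here is a rough Python equivalent. It assumes Python's re module treats this simple pattern the same way BBEdit's grep does; the variable names are only for illustration.

```python
import re

# Find pattern from the post: a <loc> element holding exactly two numbers.
FIND = r"<loc>(\d+\.\d+)</loc>"
# Replace pattern from the post: keep only the captured number pair.
REPLACE = r"\1"

sample = "<loc>2.2</loc><loc2>2.2.93</loc2>"
result = re.sub(FIND, REPLACE, sample)
print(result)
```

With \1 as the replacement, the tags are stripped but the two numbers stay in the text. Replacing with an empty string instead would delete the whole element.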


--
Best Regards,
Chris

severdia

May 22, 2021, 8:50:50 PM
to BBEdit Talk
Hi Jean,

Thank you for the detailed explanation! This seems to find every three-number permutation wrapped in <loc> or <loc2>, but I'm looking to remove two-number permutations. I figured out that moving the second + symbol does the trick.

Thanks again!

severdia

May 22, 2021, 8:50:50 PM
to BBEdit Talk
Hi Chris,

Yes, you're right. I got it working, but for future reference here's what I was trying to do...I have many of these elements:

<work>Macbeth<loc>5.1</loc><loc2>5.1.64</loc2></work>

Both <loc> and <loc2> contain act/scene/line number info, and one or the other is incomplete (the one with two numbers). I want to keep the one that's complete (with three numbers). My objective was to delete either <loc> or <loc2> if it had only two numbers in it.

Thanks!

Christopher Stone

May 24, 2021, 2:19:18 AM
to BBEdit-Talk
On 05/22/2021, at 19:22, severdia <seve...@gmail.com> wrote:
Yes, you're right. I got it working, but for future reference here's what I was trying to do...I have many of these elements:

<work>Macbeth<loc>5.1</loc><loc2>5.1.64</loc2></work>

Both <loc> and <loc2> contain act/scene/line number info, and one or the other is incomplete (the one with two numbers). I want to keep the one that's complete (with three numbers). My objective was to delete either <loc> or <loc2> if it had only two numbers in it.


Hey Ron,

Explaining your actual task is quite helpful – but I always want to see examples of the start condition and the desired outcome in black and white.  😎

Like so:

Data Sample 01:

<work>Macbeth<loc>5.1</loc><loc2>5.1.64</loc2></work>

Desired Outcome 01:

<work>Macbeth<loc2>5.1.64</loc2></work>

You wouldn't believe how often people leave out or otherwise mangle instructions.  The "a picture is worth a thousand words" rule is highly relevant with this sort of task.

Here's what I would do:

Find:

<loc(\d?)>\d+\.\d+</loc\1>

Replace with nothing.

With <loc(\d?)> I'm finding a digit that may or may not exist, and I'm using that capture in the closing tag </loc\1>.  This covers a plain <loc> tag as well as <loc0> through <loc9>.

This is a trifle simpler and more direct (if there are only <loc> and <loc2> tags).

<loc2?>\d+\.\d+</loc2?>
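
For reference, here is a small Python sketch that applies both find patterns to the Macbeth sample, with an empty string as the replacement. It assumes Python's re module handles these patterns the same way BBEdit's grep does; drop_incomplete_loc is a hypothetical helper name, not anything from BBEdit.

```python
import re

# Both find patterns from the post; the replacement is the empty string.
BACKREF_PATTERN = re.compile(r"<loc(\d?)>\d+\.\d+</loc\1>")
SIMPLE_PATTERN = re.compile(r"<loc2?>\d+\.\d+</loc2?>")


def drop_incomplete_loc(text, pattern=BACKREF_PATTERN):
    """Delete loc elements that hold only two numbers (hypothetical helper)."""
    return pattern.sub("", text)


sample = "<work>Macbeth<loc>5.1</loc><loc2>5.1.64</loc2></work>"
print(drop_incomplete_loc(sample))
print(drop_incomplete_loc(sample, SIMPLE_PATTERN))
```

Both calls print <work>Macbeth<loc2>5.1.64</loc2></work>. The simpler pattern would also accept a mismatched pair such as <loc>5.1</loc2>, which the backreference version rejects; on well-formed data the two behave the same.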

While we're on the topic of data massage with regex, let me recommend this site to you:

https://regex101.com/r/WtIAmE/3 (This link contains your example problem.)

When I'm working on complex regular expressions I always start with BBEdit, but if I have substantial problems I have some regex visualizer apps (RegExRx in particular) and RegEx101.com to fall back on.

I can't use BBEdit 13's new Pattern Playground on my main Mac, because Sierra isn't supported.

One reason to look forward to getting new hardware...

--
Best Regards,
Chris
