GREP pattern to replace the first element with a tab

samar

Mar 20, 2021, 9:47:23 AM

to BBEdit Talk

Hi

This is the first time I have posted a message here.

I'm looking for something which I assume is pretty easy to accomplish with GREP but I fail to see how.

I have a large text file with entries sorted in this way:

<tab1.png>

That is, each line has two elements (represented here by A\d and B\d) separated by one tab character. The length of the two elements is between 1 and 100 characters each.

Now I need a GREP Find/Replace string to make it look like this:

<tab2.png>

That is, the first element is replaced by a tab character *if* the first element is a repetition of the first element of the previous line(s). The sort order remains unchanged.

Sounds simple. And yet all I manage to GREP for is this:

<Screenshot 2021-03-20 at 12.26.11.png>

The result looks promising but is in no way satisfactory:

<Screenshot 2021-03-20 at 12.22.46.png>

The tab characters are inserted correctly, but the problem has to do with repetition ... Does anyone see how this can be resolved?

Thanks

samar

John Delacour

Mar 20, 2021, 11:41:36 AM

to bbe...@googlegroups.com

On 20 Mar 2021, at 11:30, samar <arne...@bluewin.ch> wrote:

> I'm looking for something which I assume is pretty easy to accomplish with GREP but I fail to see how.
>
> I have a large text file with entries sorted in this way:
>
> ...That is, the first element is replaced by a tab character *if* the first element is a repetition of the first element of the previous line(s). The sort order remains unchanged.

Create a text filter "~/Library/Application Support/BBEdit/Text Filters/SAMAR.pl" and run it from the Text Filters palette (opened from Window menu). The script will affect the whole file, or just the selection if one is made:

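#! /usr/bin/perl
my $a;    # first element of the last line that printed it
while (<>) {
    my ($col1, $col2) = split "\t", $_;    # $col2 keeps the trailing newline
    if ($col1 eq $a) {
        print "\t$col2";    # repeated first element: print only the tab and column 2
    } else {
        print "$col1\t$col2";
        $a = $col1;
    }
}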

samar

Mar 20, 2021, 2:25:43 PM

to BBEdit Talk

Thank you! This is helpful, and the result better than mine was.

However, when I run the text filter, not all occurrences of the first element get replaced when necessary:

<tab3.png>

Here a tab should also replace A1 in line 5, A2 in line 12, and A3 in line 18.

Thanks

samar

Christian Boyce

Mar 20, 2021, 3:13:30 PM

to bbe...@googlegroups.com

You asked for GREP, and this isn’t GREP, but it still solves your problem. I used AppleScript.

Note: this script is written in order to be as easy to follow as possible. It could be improved of course.

Basically I’m going down your document a line at a time, comparing what’s on the left side of the tab to the value that was in Column One the last time it changed. If it’s the same as that, I don’t write it— I just write tab and Column Two to a variable called newText. If it’s different, I write Column One and Column Two to newText. Finally I set the text of the document to the contents of the variable newText.

Watch out for the line that sets AppleScript’s text item delimiters to a tab. The line actually looks like this before compiling:

set AppleScript's text item delimiters to {"\t"}

But when you compile, the slash-t is changed to a tab, which is invisible.

--

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

tell application "BBEdit"
    set myDoc to document 1
    set originalText to text of myDoc
    set newText to ""
    set ColumnOne to ""
    set ColumnTwo to ""
    --
    set theParagraphs to paragraphs of originalText as list
    -- Save the delimiters once, before the loop, so that the original value is what gets restored.
    set saveTID to AppleScript's text item delimiters
    -- Type the delimiter as "\t"; after compiling it shows as an invisible tab between the quotes.
    set AppleScript's text item delimiters to {"\t"}
    repeat with aParagraph in theParagraphs
        set thisColumnOne to text item 1 of aParagraph
        set thisColumnTwo to text item 2 of aParagraph
        if thisColumnOne is ColumnOne then
            -- Same Column One as last time: write only the tab and Column Two.
            set thisOutput to tab & thisColumnTwo & return
        else
            set thisOutput to thisColumnOne & tab & thisColumnTwo & return
            set ColumnOne to thisColumnOne
        end if
        set newText to newText & thisOutput
    end repeat
    set AppleScript's text item delimiters to saveTID
    set text of myDoc to newText
end tell
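
For comparison, the line-by-line comparison described in this post can be sketched in a few lines of Python, which may be handy for trying the logic outside BBEdit. This is only an illustration, not part of any script posted in the thread: the function name `blank_repeated_first_column` is made up, and passing lines without a tab through untouched is an assumption of the sketch.

```python
def blank_repeated_first_column(text: str) -> str:
    """Drop column one wherever it repeats the most recent column-one value."""
    previous = None  # column-one value the last time it changed
    out = []
    for line in text.splitlines():
        if "\t" not in line:
            # Assumption of this sketch: lines without a tab pass through untouched.
            out.append(line)
            continue
        col1, col2 = line.split("\t", 1)
        if col1 == previous:
            out.append("\t" + col2)  # repeated value: keep only the tab and column two
        else:
            out.append(line)
            previous = col1
    # Keep the trailing newline only if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


sample = "A1\tB1\nA1\tB2\nA2\tB1\nA2\tB2\n"
print(blank_repeated_first_column(sample), end="")
```

Collecting the output lines in a list and joining them once at the end avoids growing one long string inside the loop, which matters more as the file gets larger.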


—

Socially distantly yours,

Christian Boyce

John Delacour

Mar 20, 2021, 4:48:26 PM

to bbe...@googlegroups.com

On 20 Mar 2021, at 16:33, samar <arne...@bluewin.ch> wrote:

> ...when I run the text filter, not all occurrences of the first element get replaced when necessary:
>
> <tab3.png>
>
> Here a tab should also replace A1 in line 5, A2 in line 12, and A3 in line 18.

That makes no sense to me. If the current col1 is identical to col1 of the previous line, then the value will not be printed; that is the clear logic of the routine. I cannot reproduce your error.

Christopher Stone

Mar 21, 2021, 4:47:34 AM

to BBEdit-Talk

On 03/20/2021, at 15:48, John Delacour <johnde...@gmail.com> wrote:

> That makes no sense to me. If the current col1 is identical to col1 of the previous line, then the value will not be printed; that is the clear logic of the routine. I cannot reproduce your error.

Hey Folks,

I'm with JD on this one. I ran his script on a sample file, and it performed as expected.

Samar – I'd want a zipped copy of your test file that's failing to test. Also - you said you're working with a large file. How large? It makes a difference in how one approaches the problem.

I like JD's Perl solution for this. It's neat, clean, fast, and will handle big files with ease.

If I was to use AppleScript I'd go this route:

-----------------------------------------------------------
# Auth: Christopher Stone <script...@thestoneforge.com>
# dCre: 2021/03/21 02:56
# dMod: 2021/03/21 02:56
# Appl: BBEdit
# Task: Massage Text of Columns A and B of a Table.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @BBEdit
-----------------------------------------------------------

set colAStr to missing value
set {oldTIDS, AppleScript's text item delimiters} to {AppleScript's text item delimiters, tab}

tell application "BBEdit" to ¬
    set paragraphList to contents of lines of front document where its contents is not ""

repeat with i in paragraphList
    if colAStr = text item 1 of i then
        set contents of i to tab & text item 2 of i
    else
        set colAStr to text item 1 of i
    end if
end repeat

set AppleScript's text item delimiters to linefeed
set paragraphList to paragraphList as text
set AppleScript's text item delimiters to oldTIDS

tell application "BBEdit" to ¬
    set text of front document to paragraphList

-----------------------------------------------------------

Although it could bog down with very big files.

--

Best Regards,

Chris

samar

Mar 21, 2021, 8:42:08 AM

to BBEdit Talk

The reason you cannot reproduce the error with your file may be that the second column is limited to three different texts (B1, B2, and B3) whereas in mine there are more (up to B7 here, but the script should also work with more than seven):

A1    B1
A1    B2
A1    B3
A1    B4
A1    B5
A1    B6
A1    B7
A2    B1
A2    B2
A2    B3
A2    B4
A2    B5
A2    B6
A3    B1
A3    B2
A3    B3
A3    B4
A3    B5
A3    B6
A3    B7

samar

Mar 21, 2021, 8:42:16 AM

to BBEdit Talk

Wow, thank you, Christian, that works exactly as expected! (I only needed to change the closing quotation mark for the script to run.)

This is very helpful – my 8,000-line file was magically modified within 8 seconds.

samar

Kaveh Bazargan

Mar 21, 2021, 8:42:16 AM

to bbe...@googlegroups.com

Here is a two-stage regex solution, assuming there are no bullets (•) in your file.

Search: ^([^\t]+)\t([^\t]+)\r(?=\1)

Replace: \1\t\2\r•

Search: •[^\t]+

Replace: (leave empty)

--
This is the BBEdit Talk public discussion group. If you have a feature request or need technical support, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
---
You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbedit+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bbedit/B75B0684-484C-4034-AF69-D0731A9FE590%40gmail.com.

--

Kaveh Bazargan PhD

Director

River Valley Technologies

Accelerating the Communication of Research

samar

Mar 21, 2021, 8:42:16 AM

to BBEdit Talk

Hi Chris

(For some reason my messages here occur only several hours after I've sent them ...)

Thank you very much indeed – your AppleScript works perfectly well here (8 seconds when applied to my file of 8,000 lines), the result is the same as when I run Christian's script. So I can even choose between the two!

samar

Mar 21, 2021, 3:04:43 PM

to BBEdit Talk

Very clever! I had not thought of using positive lookahead, but this indeed makes GREP work here. The only thing I had to add was a tab character before the closing parenthesis in the first search so that similar A1 values (those beginning in the same way) will be excluded:

^([^\t]+)\t([^\t]+)\r(?=\1\t)
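
To see what the extra tab in the lookahead changes, here is a rough Python sketch of the two-stage search and replace. It is an adaptation, not the BBEdit patterns verbatim: Python's `re` is used instead of BBEdit's grep, `\n` stands in for `\r`, line breaks are excluded from the character classes, and the helper name `two_stage` is made up. The bullet is assumed not to occur in the data.

```python
import re

BULLET = "\u2022"  # marker character; assumed not to occur in the data


def two_stage(text: str, lookahead: str) -> str:
    # Stage 1: when the next line starts with the same first field,
    # insert a bullet at the start of that next line.
    stage1 = re.compile(
        r"^([^\t\n]+)\t([^\t\n]+)\n(?=" + lookahead + ")", re.MULTILINE
    )
    marked = stage1.sub(lambda m: m.group(0) + BULLET, text)
    # Stage 2: delete the bullet and the first field after it, keeping the tab.
    return re.sub(BULLET + r"[^\t\n]+", "", marked)


sample = "A1\tB1\nA1\tB2\nA10\tB1\nA10\tB2\n"
loose = two_stage(sample, r"\1")     # lookahead as first posted
strict = two_stage(sample, r"\1\t")  # lookahead with the trailing tab
print(repr(loose))
print(repr(strict))
```

With the loose lookahead, the first A10 line is treated as a repetition of A1 because A10 begins with A1, so its first field is removed as well. With the trailing tab, the lookahead only succeeds when the whole first field is identical.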

Thanks much!

samar

Kaveh Bazargan

Mar 21, 2021, 3:12:59 PM

to bbe...@googlegroups.com

Glad it helped, Samar. Yes, I am generally afraid of lookaheads, and sometimes I am not sure why we can't just match a pattern and replace it verbatim, but this is a good case for one, because you have to ensure the grep search does not select the start of the next line.

The thing I find hardest is remembering the syntax for lookaheads and lookbehinds, so on the Mac I use TypeIt4Me with shortcuts.

Regards
Kaveh


Christopher Stone

Mar 21, 2021, 10:37:08 PM

to BBEdit-Talk

On 03/21/2021, at 03:39, samar <arne...@bluewin.ch> wrote:

> The reason you cannot reproduce the error with your file may be that the second column is limited to three different texts (B1, B2, and B3) whereas in mine there are more (up to B7 here, but the script should also work with more than seven):

Hey Samar,

Nyet. The logic of JD's script is quite clear if you know how to read Perl, and that shouldn't happen.

My best guess is your test file is different in structure than what you've given us as an example, and that's throwing off JD's script – but I can't tell without being able to reproduce the problem.

I created a 10,000 line test file with 1000 A-groups, and JD's script works perfectly in about 1/4 of a second on my old Mid-2010 17" 2.66 GHz Intel Core i7 MacBook Pro.

When asking for text processing it's best to give real-world data if at all possible, because the devil is in the details – and small anomalies can cause big problems.

I've rewritten JD's script to hopefully reduce any chance of error. It's a bit terse, but hopefully the commenting overcomes that.

Why wait 8 seconds when instantaneous is available?

--

Best Regards,

Chris

#!/usr/bin/env perl -sw
# -----------------------------------------------------------------------------------------
# Auth: Christopher Stone <script...@thestoneforge.com>
# dCre: 2012/11/27 08:12
# dMod: 2021/03/21 21:13
# Task: Create Indented Structure from a 2 Column tab-delimited table.
# Tags: @ccstone, @Shell, @Script, @Indented, @Column, @Table
# -----------------------------------------------------------------------------------------

use v5.010;

my $column1StartStr = ""; # Initialize variable

while (<>) { # Process Lines one at a time.

    m!(^[^\t]+)(\t.+)!m; # Matches Column 1 & [\t]Column 2 in the current line to $1 & $2.

    if ($1 eq $column1StartStr) { # Equality test – start-state of Col A & Col A of current line:
        say "$2"; # If equal; print only Col B with \t delim.
    } else {
        say "$1$2"; # If NOT equal; print both columns.
        $column1StartStr = $1; # If NOT equal; reset Col A start string.
    }

} # End of the while loop.

# -----------------------------------------------------------------------------------------

Bruce Van Allen

Mar 21, 2021, 11:30:01 PM

to BBEdit-Talk

On 21 Mar 2021, at 19:37, Christopher Stone wrote:

> On 03/21/2021, at 03:39, samar <arne...@bluewin.ch> wrote:
>
>> The reason you cannot reproduce the error with your file may be that
>> the second column is limited to three different texts (B1, B2, and
>> B3) whereas in mine there are more (up to B7 here, but the script
>> should also work with more than seven):
>
> Hey Samar,
>
> Nyet. The logic of JD's script is quite clear if you know how to read
> Perl, and that shouldn't happen.

Hmm. When I first read Samar’s post - the one with the enclosed screen
shots - I thought he wanted it to go from something like this:

A1[tab]B1
[tab]B2
[tab]B3
[tab]B4
[tab]B5
A2[tab]B1
[tab]B2
[tab]B3
[tab]B4
A3[tab]B1
[tab]B2
[tab]B3
[tab]B4
[tab]B5
[tab]B6

to something like this:

A1[tab]B1
A1[tab]B2
A1[tab]B3
A1[tab]B4
A1[tab]B5
A2[tab]B1
A2[tab]B2
A2[tab]B3
A2[tab]B4
A3[tab]B1
A3[tab]B2
A3[tab]B3
A3[tab]B4
A3[tab]B5
A3[tab]B6

If this is correct, JD’s script could be easily changed to do it.

Note that to get the first $x ($a in JD’s script), you have to parse &
print the first line before looping over the rest.

And this assumes that the first line has a value in both columns.

#!/usr/bin/perl
my ($x, $y) = split "\t", <>;
print "$x\t$y";

while (<>) {
my ($col1, $col2) = split "\t", $_;

if ($col1 and $col1 ne $x) {
$x = $col1;
}
print "$x\t$col2";
}
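
The same fill-down idea, sketched in Python for comparison. This is an illustrative sketch only: the function name `fill_down_first_column` is made up, and lines without a tab are assumed to pass through unchanged. Unlike the Perl version above, it does not need to treat the first line separately, because the carried value simply starts out empty.

```python
def fill_down_first_column(text: str) -> str:
    """Copy the last non-empty column-one value into lines whose column one is empty."""
    current = ""  # most recent non-empty column-one value
    out = []
    for line in text.splitlines():
        col1, sep, col2 = line.partition("\t")
        if not sep:
            # Assumption of this sketch: lines without a tab pass through unchanged.
            out.append(line)
            continue
        if col1:
            current = col1
        out.append(current + "\t" + col2)
    # Keep the trailing newline only if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


sample = "A1\tB1\n\tB2\n\tB3\nA2\tB1\n\tB2\n"
print(fill_down_first_column(sample), end="")
```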

But maybe I misunderstood the goal.

HTH,

- Bruce

_bruce__van_allen__santa_cruz__ca

Tim A

Mar 22, 2021, 8:50:31 AM

to BBEdit Talk

Neat.

Is there any way to single-step through Find/Replace to watch the grep pattern progress instead of doing a Replace All?

If I do a Next and Replace it doesn't work properly.

Thanks

Kaveh Bazargan

Mar 22, 2021, 9:14:08 AM

to bbe...@googlegroups.com

Try removing the ^ at the start of the search. Once you have placed the bullet, the next character is no longer at the start of a line, so you will just be searching from the bullet forwards. So the search is now:

([^\t]+)\t([^\t]+\r)(?=\1)

I also put the last \r inside the second bracketed text. Works here.
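
The difference between the anchored and unanchored search can be simulated roughly in Python. The sketch rests on an assumption: that stepping through replacements resumes each search in the already-modified text, immediately after the bullet just inserted. That is a reading of the explanation above, not documented BBEdit behaviour. The helper name `replace_step_by_step` is made up, and `\n` stands in for `\r`.

```python
import re

BULLET = "\u2022"  # marker character; assumed absent from the data


def replace_step_by_step(text: str, pattern: str) -> str:
    """Replace one match at a time, resuming just after each replacement."""
    rx = re.compile(pattern, re.MULTILINE)
    pos = 0
    while True:
        match = rx.search(text, pos)
        if match is None:
            return text
        replacement = match.group(0) + BULLET  # same effect as \1\t\2\r followed by a bullet
        text = text[:match.start()] + replacement + text[match.end():]
        pos = match.start() + len(replacement)  # resume right after the new bullet


sample = "A1\tB1\nA1\tB2\nA1\tB3\n"
anchored = replace_step_by_step(sample, r"^([^\t\n]+)\t([^\t\n]+)\n(?=\1)")
unanchored = replace_step_by_step(sample, r"([^\t\n]+)\t([^\t\n]+\n)(?=\1)")
print(anchored.count(BULLET), unanchored.count(BULLET))
```

Under that assumption the anchored pattern marks only the second line: after the first replacement, the search resumes behind the bullet, where ^ can no longer match, so the second line is never used as a starting point and the third line is never marked. The unanchored pattern marks both. One caveat with dropping the anchor: the first group may then start in the middle of a first field, so a value that is a suffix of the previous line's value (for example A1 after XA1) could be marked as a repetition.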


John Delacour

Mar 25, 2021, 8:26:05 AM

to bbe...@googlegroups.com

On 21 Mar 2021, at 08:39, samar <arne...@bluewin.ch> wrote:

> The reason you cannot reproduce the error with your file may be that the second column is limited to three different texts (B1, B2, and B3) whereas in mine there are more (up to B7 here, but the script should also work with more than seven):

The script works with all files that have tab-delimited values, whatever those values may be. Your supposition is quite unfounded; the logic of the script is quite clear.

A1    B1
A1    B2
A1    B3
A1    B4
A1    B5

...

but if you separate the values with spaces, as you have done here, then the file does not meet your own criteria. As Christopher has remarked, while asking you to send your file, the script works with all valid files, and does the work not in eight seconds (!!) but in the twinkling of an eye.

If anybody can point out a real flaw in my script, I shall be happy to correct it. If you will send me your file, I shall be glad to explain why you are having the problem.
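
A quick way to check whether a file really meets the one-tab-per-line criterion is to list the lines that do not. The following Python sketch is only an illustration; the function name `lines_not_tab_delimited` is made up, and empty lines are assumed to be acceptable.

```python
def lines_not_tab_delimited(text: str) -> list:
    """Return (line number, line) for each non-empty line lacking exactly one tab."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line and line.count("\t") != 1
    ]


# The second line uses spaces instead of a tab, so it is reported.
sample = "A1\tB1\nA1    B2\n\nA1\tB3\n"
for number, line in lines_not_tab_delimited(sample):
    print(number, repr(line))
```

Any line reported here would be handled differently by the scripts in this thread, since they all split on the tab character.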

JD
