Capturing multiple variables in regular expression

57 views
Skip to first unread message

Doug Pinkerton

unread,
Mar 22, 2023, 11:54:52 AM3/22/23
to BBEdit Talk
I need to convert data formatted for humans into data formatted for a database. The actual document is subject to privacy regulation. The following is a mockup to illustrate the task. I need to convert this:

ART 200-2D Art I
First Level Requests
Aardvark, Amy 10
Badger, Billy 11
Cat, Carl 12

ART 200-2D Art I
Alternate Request
Dove, David 11
Fox, Fred 11
Gill, Gertrude 10

ART 215-2D Art II, Honors
First Level Requests
Kite, Kevin 11
Lizard, Larry 12
Mongoose, Marvin 11

into this:

ART 200 First Aardvark, Amy 10
ART 200 First Badger, Billy 11
ART 200 First Cat, Carl 12
ART 200 Alternate Dove, David 11
ART 200 Alternate Fox, Fred 11
ART 200 Alternate Gill, Gertrude 10
ART 215 First Kite, Kevin 11
ART 215 First Lizard, Larry 12
ART 215 First Mongoose, Marvin 11

So, I need to capture the course ID from one line, the request type from the next line, and then insert each of those into the beginning of each subsequent line, until the occurrence of a new course ID, at which point the course ID and request type are updated for insertion into the next block of lines...
... and so on for about 3,000 lines.

My grep skills are not up to the challenge. Can anyone help?

Thanks,
dp

Fletcher Sandbeck

unread,
Mar 22, 2023, 12:36:27 PM3/22/23
to bbe...@googlegroups.com
I feel like this would be easier with a script of some sort. Nevertheless the idea I had was to do the following.

First, use Shift Right to push all the data in two tab stops. And make sure there's an empty line at the start and end of the file.

ART 215-2D Art II, Honors
First Level Requests
Kite, Kevin 11
Lizard, Larry 12
Mongoose, Marvin 11

Then use a couple search/replaces to pull the ART and First/Alternate lines forward and fix the data to what you want at the end.

Find: ^\t\t([A-Z]+ [0-9]+).*$
Replace: \1

Find: \t\tFirst Level Requests
Replace: \tFirst

Find: \t\tAlternate Request
Replace: \tAlternate

Now, the data looks like this:

ART 200
Alternate
Dove, David 11
Fox, Fred 11
Gill, Gertrude 10

Collapse the first three rows in each group with this search replace. Make sure you start at the top of the file and only run it once.

Find: \r(.+?)\r\t(.+?)\r\t\t(.+?)\r
Replace: \r\1\t\2\t\3\r

ART 215 First Kite, Kevin 11
Lizard, Larry 12
Mongoose, Marvin 11

And then fill the data down into the remaining rows by running this replacement several times.

Find: \r(.+?)\t(.+?)\t(.+?)\t(.+?)\r\t\t(.+?)\t(.+?)\r
Replace: \r\1\t\2\t\3\t\4\r\1\t\2\t\5\t\6\r

ART 200 First Aardvark, Amy 10
ART 200 First Badger, Billy 11
ART 200 First Cat, Carl 12

Remove Blank Lines and the data should be importable.

[fletcher]

--
This is the BBEdit Talk public discussion group. If you have a feature request or need technical support, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
---
You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbedit+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bbedit/446f907e-e61f-45c4-946a-261fafa0fa7an%40googlegroups.com.

Doug Pinkerton

unread,
Mar 22, 2023, 3:57:28 PM3/22/23
to BBEdit Talk
Thank you, Fletcher.
The first few steps worked perfectly. The data now look like this:


ART 200
     Alternate
          Dove, David 11
          Fox, Fred 11
          Gill, Gertrude 10

But when I run this step:

\r(.+?)\r\t(.+?)\r\t\t(.+?)\r
it acts on most of the student names, but not all of them. I'm not able to detect a pattern in the sequence of the skips, or any difference in the textual structure of the skipped records. The records are now structured enough that I can drop them into a spreadsheet and do some manual fill down, but I would like to figure this out, since this is a recurring project.

dp

Fletcher Sandbeck

unread,
Mar 22, 2023, 5:09:07 PM3/22/23
to bbe...@googlegroups.com
I'd check for "Gremlins" if you haven't already.

In any case that step could be cleaned up a little. The version below doesn't rely on the return at the start of the file and doesn't really care about how the name/number part of the line is formatted.

Find: (\r|^)(.+?)\r\t(.+?)\r\t\t
Replace: \1\2\t\3\t\4

A similar cleanup on the final step to pull the headers down into the rows would look like this.

Find: (\r|^)(.+?)\t(.+?)\t(.+?)\r\t\t
Replace: \1\2\t\3\t\4\r\2\t\3\t

There's a tradeoff between making these strict or flexible depending on how structured the data coming in is.

[fletcher]

Christopher Stone

unread,
Mar 22, 2023, 5:58:07 PM3/22/23
to BBEdit-Talk
On Mar 22, 2023, at 10:40, Doug Pinkerton <dplist...@gmail.com> wrote:

I need to convert data formatted for humans into data formatted for a database. The actual document is subject to privacy regulation. The following is a mockup to illustrate the task. I need to convert this:


Hey Doug,

Here's a Perl filter that will do the job.

-Chris

#!/usr/bin/env perl -0777 -nsw
# ------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2023/03/22 16:51
# dMod: 2023/03/22 16:51
# Task: Reformat Data for Use in a Database.
# Tags: @ccstone, @Shell, @Script, @Reformat, @Data, @Database
# ------------------------------------------------------------
use v5.12;

$_ =~ s!\A\s+|\s+\Z!!g;

my @recordArray = split(/\n\n/, $_);

foreach my $record (@recordArray) {
my @record = split(/\n/, $record);
my $line1 = $record[0];
$line1 =~ s!^(.+)\h?-.+!$1!;
my $line2 = $record[1];
$line2 =~ s!^(\w+)\h.+!$1!;
my $prefix = $line1 . "\t" . $line2;
my $recordLength = scalar(@record) - 1;
my @newRecord = @record[2 .. $recordLength];

foreach my $recordItem (@newRecord) {
say $prefix . "\t" . $recordItem;
}

}

Doug Pinkerton

unread,
Mar 23, 2023, 9:21:25 AM3/23/23
to BBEdit Talk
Wow. Problem resolved. The script is bulletproof.
Thanks very much to both of you. Your gracious expertise has saved me a lot of time.

dp

Doug Pinkerton

unread,
Mar 23, 2023, 6:18:23 PM3/23/23
to BBEdit Talk
On second thought, since that went better than I imagined possible, it makes me bold to think that perhaps the entire mess can be cleaned up without any manual work. The attached file is a sample of the real thing, except that I’ve replaced all of the students’ names with “Lastname, Firstname”. This is copied from a screen print, which is the only way to get these particular records out of the SIS. I can grep the removal of sex, grade, campus, and totals, which are irrelevant for this project. And with Chris’s script, I can now make the stacked data tabular.

What remains is the column wrapping issue. Some courses consist of the header information followed by a short student roster. These present no problem. But if the roster is too long to fit on a page, the report creates a second column next to the first, without regard for alignment. So, the first record in the second column often ends up next to the course name. In each case where such wrapping occurs, I need to move the second column to the bottom of the first column, but can’t think of any way to do it other than manually. And that’s even more tedious than it seems. The sample is artificially clean. In the real document, the variation in the lengths of the student names creates a ragged edge to the first column, such that the second column weaves in and out. It’s not possible to do a rectangular text selection.

Can anyone help with this?
sample.txt
Reply all
Reply to author
Forward
0 new messages