Substitute pattern over multiple lines

John Cordes

Dec 23, 2020, 4:49:07 PM
to vim...@googlegroups.com
I'm seeking help with editing a GEDCOM (genealogy) file. For
this I'm using Vim 8.2 in Windows. Here is a segment of text from
the file (the language doesn't make sense since I've deleted
some internal lines in the NOTEs which aren't relevant to the
question):

=======================
1 EVEN
2 TYPE tngnote
2 NOTE I have included the children William, Charles, Alice, and
with his parents in 1881, and with his widowed mother in 1
3 CONC 891 (e.g. see my online transcription of the 1891 Smiths
with James Moser, son of Henry Moser and Mary Henneberry, and his
wife Margaret Woodin; however
3 CONC , I have not yet taken this step.
1 BIRT
=======================

The 2 lines beginning with ^3 CONC are Continuation (CONC=Concatenation) lines.

I want to surround the text of the NOTE with a 'div' tag, so that
the final result should look like this:

=======================
1 EVEN
2 TYPE tngnote
2 NOTE <div class="xxx">I have included the children William,
Charles, Alice, and with his parents in 1881, and with his widowed
mother in 1891 (e.g. see my online transcription of the 1891
Smiths with James Moser, son of Henry Moser and Mary Henneberry,
and his wife Margaret Woodin; however, I have not yet taken this
step.</div>
1 BIRT
=======================

The complete GEDCOM file (which may have 850,000 or so lines) may
have NOTE tags with 0, 1, 2, or 3 CONC tags (probably no more than
that) following.

It is this variable number of continuation lines which I find
most difficult to deal with.

For the NOTE tags where there are no continuation lines I believe
this is working:

:g/^2 TYPE tngnote/+1s/^2 NOTE\(.*\)/2 NOTE <div class="xxx">\1 <\/div>/

but when there are 1 or more CONC tags following the NOTE I get stuck.

I tried:
:g/^2 TYPE tngnote/+1s/^2 NOTE\(.*\n\(3 CONC \(.*\)\)*\)/2 NOTE <div class="xxx">\1\3<\/div> /

which 'almost' works if there is just 1 CONC tag (though it
leaves "3 CONC" in place which I don't want). So it's pretty bad!


I realize this is pretty messy looking but I'm hoping one of the
experts who so generously contribute to this group may be able to
give me a pointer for how to deal with this.

Thanks,
John Cordes


Tim Chase

Dec 23, 2020, 6:08:47 PM
to John Cordes, vim...@googlegroups.com
I'd start with this ugly monstrosity:

:%s/^2 \u\{3,} \zs\(.*\n\(\%(\D\|3 CONC \).*\n\)\+\)/\='<div
class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '',
'g'), '\n', '', 'g')."<\/div>\n"

(all one line in case it breaks in the mail)

If you only want it to do "2 NOTE" lines, you can change that initial

2 \u\{3,} \zs

(which does any item that has continuations) to

2 NOTE \zs

This does join *all* the lines and doesn't re-wrap them, so you'd
then want a second pass to do the wrapping

:set tw=70
:g/<div [^>]*>.*<\/div>$/norm gqq

Hope this gives you some ideas to work with.

-tim



John Cordes

Dec 23, 2020, 7:39:20 PM
to vim...@googlegroups.com
Yes indeed Tim -- an excellent idea. Thanks very much.
I will attempt to deconstruct your 'monstrosity' somewhat later,
but I've been trying to get things to work with my situation.

It's a bit more complicated than I first explained. Two aspects:
a) I *do* need to search on the "2 NOTE" lines, since there are
various other chunks of lines with the CONC lines; and
b) Sometimes the line "2 TYPE tngnote" has a line between it and
the "2 NOTE". The intervening line can look like this

2 DATE 18 AUG 1776
or this
2 _SDATE 1802

So the lines to change could look like this:

===================
1 EVEN
2 TYPE tngnote
2 _SDATE 1802
2 NOTE The surname of John's wife is not positively established.
However, it is certain that her given name is Elizabeth; evidence
for this comes first from the baptismal records for Rebecca and
Eliza Catherine; these children were born while th
3 CONC e family was in London so the records are available in the
London Metropolitan Archives (the other two children were born in
Sheffield). Henry's baptismal record in Sheffield also has his
parents being John (a skinner) and Elizabeth. The id
3 CONC entification of John's wife specifically with Elizabeth
Coxsey is somewhat tentative, however.
1 EVEN
===================

This search pattern
/^2 TYPE tngnote.*\n*\(\_^2 .*DATE.*\)*\n\_^2 NOTE

works to find all 3 possibilities: no DATE line, an _SDATE line
or a DATE line.

I thought I would be able to combine that with your pattern like so:

:%s/^2 TYPE tngnote.*\n*\(\_^2 .*DATE.*\)*\n\_^2 NOTE \zs\(.*\n\(\%(\D\|3 CONC \).*\n\)\+\)/\='<div class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '', 'g'), '\n', '', 'g')."<\/div>\n"

but that is not working. Here's an example of one small chunk of
lines which were transformed by that command:

1 EVEN
2 TYPE tngnote
2 DATE 18 AUG 1776
2 NOTE <div class="xxx">2 DATE 18 AUG 1776</div>
1 EVEN

The command is eliminating the content which had been in the NOTE tags altogether.

I will keep trying, but more help would be terrific!

Thanks,
John

George Dinwiddie

Dec 23, 2020, 8:04:37 PM
to vim...@googlegroups.com
Why not use

:%s/\n3 CONC //

to concatenate all the continuations and then use

:%s/\(2 NOTE \)\(.*\)/\1<div class="xxx">\2<\/div>/

to turn all the NOTE lines into <div> blocks? Or am I misunderstanding
something about the transformation you need?

- George
--
----------------------------------------------------------------------
* George Dinwiddie * http://blog.gdinwiddie.com
Software Development http://www.idiacomputing.com
Consultant and Coach https://pragprog.com/titles/gdestimate/
----------------------------------------------------------------------

John Cordes

Dec 23, 2020, 8:17:01 PM
to vim...@googlegroups.com
On Wed, Dec 23, 2020 at 9:04 PM George Dinwiddie <li...@idiacomputing.com> wrote:
Why not use

:%s/\n3 CONC //

to concatenate all the continuations and then use

:%s/\(2 NOTE \)\(.*\)/\1<div class="xxx">\2<\/div>/

to turn all the NOTE lines into <div> blocks? Or am I misunderstanding
something about the transformation you need?

  - George

 One big problem with the first part is that I *only* want to concatenate the continuation lines when they appear immediately following a "2 NOTE..." tag, AND that "2 NOTE" tag must be either the next or next but one line after "2 TYPE tngnote". 
 
 I neglected to make it clear earlier that I need to first search on  "2 TYPE tngnote" since there are other "2 TYPE" tags where I don't want to change anything.

 John 

Tim Chase

Dec 23, 2020, 8:31:27 PM
to vim...@googlegroups.com
On 2020-12-23 20:39, John Cordes wrote:
>> I'd start with this ugly monstrosity:
>>
>> :%s/^2 \u\{3,} \zs\(.*\n\(\%(\D\|3 CONC \).*\n\)\+\)/\='<div
>> class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '',
>> 'g'), '\n', '', 'g')."<\/div>\n"
>
> I will attempt to deconstruct your 'monstrosity' somewhat later,

Tweaking it so that it only does NOTE items, not generic
continuations:

:%s/^2 NOTE \zs\(.*\n\%(\%(\D\|3 CONC \).*\n\)\+\)/\='<div
class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '',
'g'), '\n', '', 'g')."<\/div>\n"

Breaking it down so hopefully you can swap parts as you see fit:

:%s/^2 NOTE \zs       On every line starting with "2 NOTE "
                      start our replacement here (\zs)
\(                    start capturing the note
                      this will be submatch(1) later
 .*                   everything else on that line
 \n                   and the newline
 \%(                  a non-capturing group for another line that
  \%(\D               starts with either a non-digit
  \|                  or
  3 CONC              a literal "3 CONC "
  \)                  (end of this OR of things marking a continuation)
  .*\n                followed by the rest of the line
 \)                   (end of this continuation-line)
 \+                   we can have 1 or more continuation lines
\)                    end the capturing
/                     replace it with
\=                    the result of evaluating this expression
'<div class="xxx">'   the literal opening tag
.                     and then the results of
substitute(           remove all the newlines from the results of
 substitute(          removing from
  submatch(1),        the whole set of continuation stuff
  '\n3 CONC ',        the literal newline-followed-by-"3 CONC "
  '',                 and replace them with nothing
  'g'                 everywhere
 ),                   and in that "\n3 CONC "-less text, replace
 '\n',                newlines with
 '',                  nothing
 'g')                 everywhere
.                     and then tack on
"<\/div>\n"           the literal closing </div> followed by a newline

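For anyone who wants to check the join logic outside Vim, here is a minimal Python sketch of the same two-step idea. It is not part of the command above: the function name wrap_note and the sample text are made up for illustration, and Python's re module has no \zs, so the "2 NOTE " prefix is captured and put back instead.

```python
import re

# Rough counterpart of the Vim pattern: a "2 NOTE " line followed by one or
# more continuation lines (starting with a non-digit or with "3 CONC ").
NOTE_BLOCK = re.compile(
    r'^(2 NOTE )(.*\n(?:(?:[^\d\n]|3 CONC ).*\n)+)', re.MULTILINE)


def wrap_note(match):
    """Join the captured lines and wrap them in a div (illustrative helper)."""
    body = match.group(2)
    body = body.replace('\n3 CONC ', '')  # inner substitute(): drop CONC markers
    body = body.replace('\n', '')         # outer substitute(): drop other newlines
    return match.group(1) + '<div class="xxx">' + body + '</div>\n'


sample = (
    "1 EVEN\n"
    "2 TYPE tngnote\n"
    "2 NOTE his widowed mother in 1\n"
    "3 CONC 891; however\n"
    "3 CONC , I have not yet taken this step.\n"
    "1 BIRT\n"
)
result = NOTE_BLOCK.sub(wrap_note, sample)
print(result)
```

The order of the two replacements matters for the same reason as in the nested substitute() calls: the "3 CONC " markers have to go while the newline in front of them is still there to identify them.
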
> It's a bit more complicated than I first explained. Two aspects:
> a) I *do* need to search on the "2 NOTE" lines, since there are
> various other chunks of lines with the CONC lines; and
> b) Sometimes the line "2 TYPE tngnote" has a line between it and
> the "2 NOTE". The intervening line can look like this
>
> 2 DATE 18 AUG 1776
> or this
> 2 _SDATE 1802

Given the substitution command above, it should only touch "2 NOTE"
lines with subsequent "3 CONC" lines. It does *every* "2 NOTE" so if
you need to limit them to just those that immediately follow "2 TYPE
tngnote" (assuming there aren't any "2 TYPE tngnote" that *don't*
have a NOTE immediately following them), you can tweak that command,
changing that initial "%" to

:g/^2 TYPE tngnote//2 NOTE /s/^2 NOTE \zs…

This looks for all the "2 TYPE tngnote" lines, searches forward
(skipping over any DATE/_SDATE lines or other intervening stuff) for
the "2 NOTE " line following it, and then only performs the
substitution on those particular lines.

I suspect that the problem snuck in by using \(…\) in your added
conditions which captured that as submatch(1). So you can either
make it non-capturing by adding that "%" before the open-paren:

\%(\_^2 .*DATE.*\)

or change the "submatch(1)" to "submatch(2)"

> Here's an example of one small chunk of
> lines which were transformed by that command:
>
> 1 EVEN
> 2 TYPE tngnote
> 2 DATE 18 AUG 1776
> 2 NOTE <div class="xxx">2 DATE 18 AUG 1776</div>
> 1 EVEN

Note that the content here is what you captured in the first group.
:-)

Hope this helps get you on the right path,

-tim




John Cordes

Dec 23, 2020, 9:08:10 PM
to vim...@googlegroups.com
 This is amazing looking, Tim -- thanks so much! There is a lot for a nearly 80-year old to unpack here -- it's going to take me a while. :)
  It looks as though you have covered all the bases I want to deal with. 

 Thank you again,
 John
    

John Cordes

Dec 23, 2020, 9:34:48 PM
to vim...@googlegroups.com
On Wed, Dec 23, 2020 at 10:07:26PM -0400, John Cordes wrote:
>
> On Wed, Dec 23, 2020 at 9:31 PM Tim Chase <v...@tim.thechases.com> wrote:
>
> On 2020-12-23 20:39, John Cordes wrote:
> >> I'd start with this ugly monstrosity:
> >>
> >> :%s/^2 \u\{3,} \zs\(.*\n\(\%(\D\|3 CONC \).*\n\)\+\)/\='<div
> >> class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '',
> >> 'g'), '\n', '', 'g')."<\/div>\n"

> :g/^2 TYPE tngnote//2 NOTE /s/^2 NOTE \zs…
>
> Hope this helps get you on the right path,
>
> -tim
>
> This is amazing looking, Tim -- thanks so much! There is a lot for a nearly
> 80-year old to unpack here -- it's going to take me a while. :)
> It looks as though you have covered all the bases I want to deal with.
>
> Thank you again,
> John

Just a quick report to say that following your suggestion above leads to:

:g/^2 TYPE tngnote//2 NOTE /s/^2 NOTE \zs\(.*\n\(\%(\D\|3 CONC \).*\n\)\+\)/\='<div class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '', 'g'), '\n', '', 'g')."<\/div>\n"

which as far as I can tell at the moment is working perfectly,
handling all situations the way I wanted. I will check further and
also test on another GEDCOM file when I'm fresher.

Thanks again Tim; I have learned a lot. Now if it would only stick...

John

John Cordes

Dec 23, 2020, 10:18:46 PM
to vim...@googlegroups.com
Tim,

I hate to trouble you further about this, but possibly while it
is still reasonably fresh in your mind...

The last ":g..." command I listed above is working correctly
when there are continuation lines (i.e. at least one "3 CONC" tag
following the "2 NOTE" tag), but I think it seems to be skipping by
the "2 NOTE" tags which do *not* have a CONC / Continuation tag.
I thought the pattern would be allowing for no CONC tags but I'm
not seeing what is wrong.
At least I *think* that's what I am seeing.

John


Tim Chase

Dec 23, 2020, 10:57:18 PM
to vim...@googlegroups.com
On 2020-12-23 23:18, John Cordes wrote:
>> :g/^2 TYPE tngnote//2 NOTE /s/^2 NOTE \zs\(.*\n\(\%(\D\|3 CONC
>> \).*\n\)\+\)/\='<div
>> class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '',
>> 'g'), '\n', '', 'g')."<\/div>\n"
>
> The last ":g..." command I listed above is working correctly
> when there are continuation lines (i.e. at least one "3 CONC" tag
> following the "2 NOTE" tag), but I think it seems to be skipping by
> the "2 NOTE" tags which do *not* have a CONC / Continuation tag.

Ah, while I'm not positive (so shooting from the hip here) I think you
want to change the

\+

(one or more continuation lines) to just

*

(zero or more continuation lines) to produce

:g/^2 TYPE tngnote//2 NOTE /s/^2 NOTE \zs\(.*\n\%(\%(\D\|3 CONC \).*\n\)*\)/\='<div class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '', 'g'), '\n', '', 'g')."<\/div>\n"

(I also snuck in an extra "%" in the inner \(…\) which I missed when
transcribing it earlier, but shouldn't impact the results)
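
The effect of the two quantifiers can be seen in isolation with a small Python sketch. This is only an illustration in Python's re syntax, where "+" and "*" play the roles of Vim's "\+" and "*"; the patterns are simplified to "3 CONC " continuations and the sample note is invented.

```python
import re

# Same pattern twice; only the quantifier on the continuation group differs.
one_or_more = re.compile(r'^2 NOTE (.*\n(?:3 CONC .*\n)+)', re.MULTILINE)
zero_or_more = re.compile(r'^2 NOTE (.*\n(?:3 CONC .*\n)*)', re.MULTILINE)

short_note = "2 NOTE A note with no continuation lines.\n1 BIRT\n"

# "+" demands at least one "3 CONC " line, so this note is skipped entirely.
print(one_or_more.search(short_note))  # None

# "*" also accepts a NOTE that stands on its own.
print(repr(zero_or_more.search(short_note).group(1)))
```
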

-tim



John Cordes

Dec 23, 2020, 11:07:01 PM
to vim...@googlegroups.com
 I think that did it - on a quick check.
I had tried changing that "\+" to "\=" thinking that would allow for 0 or 1 but something went wrong - can't remember exactly what right now. I should have just tried * - can't think why I didn't.

 Thanks again!
 John


  

Tim Chase

Dec 23, 2020, 11:52:57 PM
to John Cordes, vim...@googlegroups.com
On 2020-12-23 22:34, John Cordes wrote:
> :g/^2 TYPE tngnote//2 NOTE /s/^2 NOTE \zs\(.*\n\(\%(\D\|3 CONC
> \).*\n\)\+\)/\='<div
> class="xxx">'.substitute(substitute(submatch(1), '\n3 CONC ', '',
> 'g'), '\n', '', 'g')."<\/div>\n"
>
> which as far as I can tell at the moment is working perfectly,
> handling all situations the way I wanted.

The only glaring edge-case is a situation in which a "2 TYPE tngnote"
section is followed by *no* NOTE, followed by a section that *isn't*
a "2 TYPE tngnote" that *does* have a NOTE that shouldn't be touched
such as

2 TYPE tngnote
9 TIM FAKE ANNOTATION this tngnote has no NOTE
2 TYPE granola
2 NOTE Don't touch granola-type notes or
3 CONC rewrap their content or add <div>s!

In such a case, it will wrap the NOTE because it's the first NOTE
after a "2 TYPE tngnote", even though that NOTE belongs to a different
TYPE section that shouldn't be touched.

So that's where I'd focus my checking :-)

-tim



John Cordes

Dec 24, 2020, 9:57:51 AM
to vim...@googlegroups.com
Thanks Tim. The GEDCOM file, exported from my desktop genealogy
program TMG, *shouldn't* have a case like that (a tngnote tag
which isn't followed by its own Note), but... Hey, it's produced
by a large, complex, software program which is no longer supported
and does have a few known bugs. So obviously one can never
guarantee what will actually happen in practice.

I will certainly keep a lookout for this edge case -- it would
indeed lead to very undesirable results. Presumably I should be
able to do a search for two successive "2 TYPE tngnote" entries
which don't have an intervening "2 NOTE " tag. Not sure how, but
I'll give it a try. :-)
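
One possible way to run that check outside Vim is a short Python scan. This is a rough sketch under stated assumptions, not a tested tool: the function name tngnotes_without_note is invented, and it assumes that a tngnote section ends at the next level-0 or level-1 line or at the next "2 TYPE" line.

```python
def tngnotes_without_note(lines):
    """Return 1-based line numbers of "2 TYPE tngnote" lines that are not
    followed by a "2 NOTE " line before their section ends."""
    missing = []
    pending = None  # line number of a tngnote still waiting for its NOTE
    for number, line in enumerate(lines, start=1):
        # Assumption: these prefixes mark the start of a new section.
        if line.startswith(('0 ', '1 ', '2 TYPE ')):
            if pending is not None:
                missing.append(pending)
            pending = number if line.startswith('2 TYPE tngnote') else None
        elif line.startswith('2 NOTE '):
            pending = None
    if pending is not None:
        missing.append(pending)
    return missing


sample = [
    "1 EVEN",
    "2 TYPE tngnote",
    "2 DATE 18 AUG 1776",
    "2 NOTE A normal tngnote with its own note.",
    "1 EVEN",
    "2 TYPE tngnote",
    "2 TYPE granola",
    "2 NOTE Don't touch granola-type notes or",
    "3 CONC rewrap their content or add <div>s!",
]
print(tngnotes_without_note(sample))
```

Any line numbers it reports are the places where the ":g" command would reach forward into a later section, so those are the ones to inspect by hand before running the substitution.
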

Thanks for the heads-up on this,
John

Steve Litt

Dec 24, 2020, 2:43:40 PM
to vim...@googlegroups.com
On Wed, 23 Dec 2020 17:08:32 -0600
Tim Chase <v...@tim.thechases.com> wrote:


> 2 NOTE \zs
>
> This does join *all* the lines and doesn't re-wrap them, so you'd
> then want a second pass to do the wrapping
>
> :set tw=70
> :g/<div [^>]*>.*<\/div>$/norm gqq

His destination is HTML so he doesn't need to wrap them: The browser
will wrap them for him.

SteveT

Steve Litt
Autumn 2020 featured book: Thriving in Tough Times
http://www.troubleshooters.com/thrive

Steve Litt

Dec 24, 2020, 3:01:35 PM
to vim...@googlegroups.com
On Wed, 23 Dec 2020 21:16:20 -0400
John Cordes <john....@dal.ca> wrote:


> One big problem with the first part is that I *only* want to
> concatenate the continuation lines when they appear immediately
> following a "2 NOTE..." tag, AND that "2 NOTE" tag must be either the
> next or next but one line after "2 TYPE tngnote".
>
> I neglected to make it clear earlier that I need to first search on
> "2 TYPE tngnote" since there are other "2 TYPE" tags where I don't
> want to change anything.

Personally I'd do this as an AWK program (not an AWK one-liner). Have a
variable that gets incremented once when you hit "2 TYPE tngnote", gets
incremented again when you hit a "2 NOTE" 1 or 2 lines below, and
incremented again when you hit "3 CONC". Once it has been incremented
twice like this (a tngnote followed by its NOTE), you remove the
"3 CONC" from the beginning of each "3 CONC" line and output it. At the
end of the continuations, you put a </div>. This requires that you put
the corresponding <div> at the start of the NOTE text when you output
the "2 NOTE" line.

If, at any time, you hit a line that forecloses the possibility of such
line-grafting, you drop the variable back to its original value.

It would also be very easy in Python. Python's advantage is that it can
easily store lines and "look back" before printing them. AWK can do
that, but it's more difficult.

I know this is offtopic on this list, but I think any Vim or ex
solution that can be made will be fragile and difficult to understand.
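
A minimal Python sketch of that kind of line-by-line approach might look like the following. It is one possible reading of the description rather than a finished program: the function name wrap_tngnotes and the counter lines_since_type are invented, the sample is abbreviated from the thread, and it assumes each GEDCOM record sits on one physical line.

```python
def wrap_tngnotes(lines, css_class="xxx"):
    """Wrap the NOTE of each "2 TYPE tngnote" section in a div, joining its
    "3 CONC " continuation lines. Other lines pass through unchanged."""
    out = []
    lines_since_type = None  # None means: not just after a tngnote line
    note = None              # NOTE text being collected, or None

    def flush():
        nonlocal note
        if note is not None:
            out.append('2 NOTE <div class="%s">%s</div>' % (css_class, note))
            note = None

    for line in lines:
        if note is not None and line.startswith('3 CONC '):
            note += line[len('3 CONC '):]  # graft the continuation on
            continue
        flush()  # any other line ends the NOTE being collected
        if line.startswith('2 TYPE tngnote'):
            lines_since_type = 0
            out.append(line)
        elif lines_since_type is not None and line.startswith('2 NOTE '):
            note = line[len('2 NOTE '):]
            lines_since_type = None
        else:
            # Allow at most one intervening line (e.g. DATE or _SDATE).
            if lines_since_type is not None:
                lines_since_type += 1
                if lines_since_type > 1:
                    lines_since_type = None
            out.append(line)
    flush()
    return out


sample = [
    "1 EVEN",
    "2 TYPE tngnote",
    "2 _SDATE 1802",
    "2 NOTE these children were born while th",
    "3 CONC e family was in London. The id",
    "3 CONC entification is tentative.",
    "1 EVEN",
    "2 TYPE granola",
    "2 NOTE Don't touch granola-type notes or",
    "3 CONC rewrap their content or add <div>s!",
]
for line in wrap_tngnotes(sample):
    print(line)
```

Because the NOTE must appear on the next line or the next but one after "2 TYPE tngnote", a tngnote with no NOTE of its own cannot cause a later section's NOTE to be wrapped.
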

John Cordes

Dec 24, 2020, 3:11:18 PM
to vim...@googlegroups.com
On Thu, Dec 24, 2020 at 3:43 PM Steve Litt <sl...@troubleshooters.com> wrote:
On Wed, 23 Dec 2020 17:08:32 -0600
Tim Chase <v...@tim.thechases.com> wrote:


>   2 NOTE \zs
>
> This does join *all* the lines and doesn't re-wrap them, so you'd
> then want a second pass to do the wrapping
>
>   :set tw=70
>   :g/<div [^>]*>.*<\/div>$/norm gqq

His destination is HTML so he doesn't need to wrap them: The browser
will wrap them for him.

  Correct.  Clearly Tim's initial response was intended to deal with the precise format I said I wanted the output to be in.

 Things are working very well now thanks to the excellent help from Tim.

 John Cordes 

Tim Chase

Dec 24, 2020, 3:35:38 PM
to vim...@googlegroups.com
On 2020-12-24 14:43, Steve Litt wrote:
> On Wed, 23 Dec 2020 17:08:32 -0600
> Tim Chase <v...@tim.thechases.com> wrote:
>> 2 NOTE \zs
>>
>> This does join *all* the lines and doesn't re-wrap them, so you'd
>> then want a second pass to do the wrapping
>>
>> :set tw=70
>> :g/<div [^>]*>.*<\/div>$/norm gqq
>
> His destination is HTML so he doesn't need to wrap them: The browser
> will wrap them for him.

However he also wrote

"""
I want to surround the text of the NOTE with a 'div' tag, so that
the final result should look like this:

=======================
1 EVEN
2 TYPE tngnote
2 NOTE <div class="xxx">I have included the children William,
Charles, Alice, and with his parents in 1881, and with his widowed
mother in 1891 (e.g. see my online transcription of the 1891
Smiths with James Moser, son of Henry Moser and Mary Henneberry,
and his wife Margaret Woodin; however, I have not yet taken this
step.</div>
1 BIRT
=======================
"""

which included the wrapping (even if the HTML rendering engine would
do that for him) in the example desired output, so I included the
fairly straight-forward means by which one could do that if needed.

-tim


jcordes

Dec 24, 2020, 5:29:38 PM
to vim_use
 Steve,

  I do understand. I am quite sure that if I had asked my son for help with this we would have ended up with an AWK script. That has happened before for at least one vaguely similar sort of job (in the sense of storing lines and checking back). I just really like using Vim, even though my skills for the more advanced techniques are sadly lacking.
 I have intended for ages to learn Python (I know that it is generally said to be very easy to learn) but it hasn't happened - not sure it ever will.
 
 John
