Using gsub to replace & in xml documents

Victor Kirk

Jan 20, 2002, 2:55:01 PM

Hi,

I am creating some XML files from a bunch of files that contain
occurrences of & which I want to replace with &amp;. No problem
with that; the problem is that the files also contain a
number of XML entities, e.g. &nbsp;, for which I don't want to replace
the &.

Example.

This is some &text; that contains an &, &and; &another &.

To end up like

This is some &text; that contains an &amp;, &and; &another &amp;.

Is it possible to do this with a single gsub? I am currently
doing the replacement in a loop and checking each one but this
doesn't seem the best way.

The regex for the entities is "&[a-zA-Z]+;".

Thanks,

Vic

roland rodrigus

Jan 21, 2002, 8:59:49 AM

Victor Kirk wrote:

gsub("&[^a-zA-Z]", "\\&amp ")

HTH

--
roland

Perique des Palottes

Jan 21, 2002, 10:09:11 AM

roland rodrigus wrote:

>
> Victor Kirk wrote:
>
> > The regex for the entities is "&[a-zA-Z]+;".
>

> gsub("&[^a-zA-Z]", "\\&amp ")

And '&#number;' ??

--
All true believers shall break their eggs at the convenient end.
news:soc.culture.catalan FAQ at http://www.gea.cesca.es/~ipa/SCC/

Victor Kirk

Jan 20, 2002, 5:07:51 PM

roland rodrigus wrote:

>>The regex for the entities is "&[a-zA-Z]+;".
>>
>

> gsub("&[^a-zA-Z]", "\\&amp ")
>

My example didn't cover all possibilities; my input files may
contain text like

asd dfgh &qwerty &iop;

where the &qwerty needs to be changed to &amp;qwerty because
it doesn't end with a ';'

i.e. asd dfgh &amp;qwerty &iop;

Any suggestions?

Vic

Perique des Palottes

Jan 21, 2002, 10:16:05 AM

Victor Kirk wrote:
>
> I am creating some XML files from a bunch of files that contain
> occurrences of & which I want to replace with &amp;. No problem
> with that; the problem is that the files also contain a
> number of XML entities, e.g. &nbsp;, for which I don't want to replace
> the &.

> ...

> Is it possible to do this with a single gsub? I am currently
> doing the replacement in a loop and checking each one but this
> doesn't seem the best way.

I'd say it is not possible in awk using a single pass.
But a possible kludgy solution might be:

{
    # tag all 'amp' to be kept in real entities
    # notice the & back reference in replacement
    gsub(/&([a-zA-Z]+|#[0-9]+);/, "_FOO_AND_&")
    # replace ALL 'amp' anywhere by 'amp entity'
    # notice the \& is escaped in replacement
    gsub(/&/, "\\&amp;")
    # now undo 'amp entity' for previously tagged 'amp'
    # notice the \& is escaped in replacement
    gsub(/_FOO_AND_&amp;/, "\\&")
    print

}
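
One caveat with the tagging approach is that it relies on the marker text never appearing in the input. A minimal sketch of the same three passes follows, wrapped in a hypothetical shell function named escape_amp and using the control character \001 as the marker. Both the function name and the choice of marker are assumptions for illustration, on the premise that \001 cannot occur in the source files.

```shell
# Hypothetical helper: escape bare ampersands on stdin, keep real references.
escape_amp() {
    awk '
    BEGIN { tag = "\001" }   # assumed-safe marker; not part of the original post
    {
        # Pass 1: put the marker in front of every well-formed reference.
        gsub(/&([a-zA-Z]+|#[0-9]+);/, tag "&")
        # Pass 2: escape every ampersand, tagged or not.
        gsub(/&/, "\\&amp;")
        # Pass 3: drop the marker and undo the escape on tagged ampersands.
        gsub(tag "&amp;", "\\&")
        print
    }'
}

printf '%s\n' 'asd dfgh &qwerty &iop;' | escape_amp
# prints: asd dfgh &amp;qwerty &iop;
```

Only POSIX awk features are used here, so it should not depend on gawk extensions.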

> The regex for the entities is "&[a-zA-Z]+;".

So what about the good &#number; ones?

Victor Kirk

Jan 20, 2002, 5:52:23 PM

Perique des Palottes wrote:

> roland rodrigus wrote:
>
>>Victor Kirk wrote:
>>
>>
>>>The regex for the entities is "&[a-zA-Z]+;".
>>>
>>gsub("&[^a-zA-Z]", "\\&amp ")
>>
>
> And '&#number;' ??
>

I believe that the regex for entity references in XML is (assuming you
ignore unicode):

&[a-zA-Z_][a-zA-Z0-9._:-]*;

The regex I mentioned earlier was simplified (not that it's
complicated) to my requirements, i.e. I only have chars to deal
with.
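
A note on writing a name pattern like this as an awk bracket expression: a hyphen placed between two characters, as in `.-_`, is read as a range (everything from `.` to `_`, which leaves out `-` itself), so the usual habit is to put the hyphen last. A small sketch that checks single strings against the ASCII-only pattern; the function name is_entity_ref is hypothetical:

```shell
# Hypothetical helper: print "yes" if $1 is exactly one ASCII-only
# entity reference, otherwise "no".
is_entity_ref() {
    awk -v s="$1" 'BEGIN {
        # The hyphen comes last in the bracket expression, so it is literal.
        if (s ~ /^&[a-zA-Z_][a-zA-Z0-9._:-]*;$/) print "yes"; else print "no"
    }'
}

is_entity_ref '&foo-bar.baz;'   # prints: yes
is_entity_ref '&a=b;'           # prints: no
```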

Regards,

Vic

Victor Kirk

Jan 20, 2002, 5:59:19 PM

Perique des Palottes wrote:

> I'd say it is not possible in awk using a single pass.

That's the conclusion I came to, but I am far from an expert
on awk and I'd only be too happy to be corrected.

> But a possible kludgy solution might be:

I had one, but this is a nicer solution, although it doesn't feel
like the 'right way' to do it. That said, I will use it.

Many thanks,

Vic

roland rodrigus

Jan 21, 2002, 11:55:50 AM

Victor Kirk wrote:

> Hi,
>
Me again. The solution I proposed was based on the example you gave, which
was inconsistent with the regexp.

>
> Example.
>
> This is some &text; that contains an &, &and; &another &.

------------------------------------------------^-------^

> To endup like
>
>
> This is some &text; that contains an &amp;, &and; &another &amp;.

----------------------------------------------------^
>

I can't find a simple solution with gsub or gensub.
One way is to tag the bad or the good ampersand, which requires three
passes.

Another kludge is this one

awk 'BEGIN{RS="&";ORS=""}
NR==1{print;next}
{printf(($0~/^[[:alnum:]]*;/)?"&":"&amp;");print}'

(replace the regexp class with what suits your case.)

I can't really hope it helps

--
roland

roland rodrigus

Jan 21, 2002, 12:25:13 PM

Here is something shorter.

awk 'BEGIN{RS="&[[:alnum:]]+;"}{ORS=RT;gsub(/&/,"\\&amp;");print}'

But if someone knows how to do it with a normal use of regular expressions,
I'd be interested to learn.

--
roland

Martin Cohen

Jan 21, 2002, 1:39:03 PM

Victor Kirk wrote:

This brings to mind a modification to replacement operators in general:

Have a parameter that specifies where the search starts!!!

For example, add a parameter to match that indicates the last position
looked at. Then, a match loop that can easily handle multiple matches in
a string would look like this:

i = 0;
while ( match(str, regex, i) )
{
    # do your stuff
    i = RSTART;
}

I did this in a little language for character manipulation that I wrote
years ago, and it proved very useful.

The alternative, in current gawk, is to remove from the string
whatever the match was, which is (sorry about this) awkward.

A similar thing could be done for sub.
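
For the ampersand problem in this thread, the "remove what was matched" alternative can be sketched roughly as below. This is only an illustration under stated assumptions: the shell function name escape_amp_loop is hypothetical, and the reference pattern is the simplified one used earlier in the thread (letters, or # followed by digits). It walks each line left to right with index(), match() and substr(), so it should run in any POSIX awk.

```shell
# Hypothetical helper: escape bare ampersands on stdin in one left-to-right scan.
escape_amp_loop() {
    awk '
    {
        rest = $0; out = ""
        # Consume the line one ampersand at a time.
        while ((i = index(rest, "&")) > 0) {
            out = out substr(rest, 1, i - 1)   # copy the text before the &
            rest = substr(rest, i)             # rest now starts at the &
            if (match(rest, /^&([a-zA-Z]+|#[0-9]+);/)) {
                # A well-formed reference: copy it through unchanged.
                out = out substr(rest, 1, RLENGTH)
                rest = substr(rest, RLENGTH + 1)
            } else {
                # A bare ampersand: escape it and step past it.
                out = out "&amp;"
                rest = substr(rest, 2)
            }
        }
        print out rest
    }'
}

printf '%s\n' 'asd dfgh &qwerty &iop;' | escape_amp_loop
# prints: asd dfgh &amp;qwerty &iop;
```

Compared with the tagging approach, this needs no marker string, at the cost of a longer script.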

Martin Cohen