I am creating some XML files from a bunch of files that contain
occurrences of & which I want to replace with &amp;. No problem
with that; the problem is that the files also contain a
number of XML entities (e.g. &text; in the example below) for which
I don't want to replace the &.
Example.
This is some &text; that contains an &, &and; &another &.
To end up like
This is some &text; that contains an &amp;, &and; &another &amp;.
Is it possible to do this with a single gsub? I am currently
doing the replacement in a loop and checking each one but this
doesn't seem the best way.
The regex for the entities is "&[a-zA-Z]+;".
Thanks,
Vic
gsub("&[^a-zA-Z]", "\\&amp; ")
HTH
--
roland
And '&#number;' ??
--
All true believers shall break their eggs at the convenient end.
news:soc.culture.catalan FAQ at http://www.gea.cesca.es/~ipa/SCC/
>>The regex for the entities is "&[a-zA-Z]+;".
>>
>
> gsub("&[^a-zA-Z]", "\\&amp; ")
>
My example didn't cover all possibilities; my input files may
contain text like
asd dfgh &qwerty &iop;
where the &qwerty needs to be changed to &amp;qwerty because
it doesn't end with a ';'
i.e. asd dfgh &amp;qwerty &iop;
Any suggestions?
Vic
I'd say it is not possible in awk using a single pass.
But a possible kludgy solution might be:
{
# tag all 'amp' to be kept in real entities
# notice the & back reference in replacement
gsub(/&([a-zA-Z]+|#[0-9]+);/,"_FOO_AND_&")
# replace ALL 'amp' anywhere by 'amp entity'
# notice the \& is escaped in replacement
gsub(/&/,"\\&amp;")
# now undo 'amp entity' for previously tagged 'amp'
# notice the \& is escaped in replacement
gsub(/_FOO_AND_&amp;/,"\\&")
print
}
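For anyone who wants to try the tagging idea from the command line, here is a minimal sketch of the same three passes wrapped in a shell function. The function name escape_amp_tagged is made up for illustration, and the marker is a control character ("\001") instead of _FOO_AND_, on the assumption that this byte never occurs in the input; if it can, pick another marker.

```shell
#!/bin/sh
# Hypothetical helper: escape bare ampersands, keep real entities intact.
escape_amp_tagged() {
    awk '
    BEGIN { tag = "\001" }   # assumption: this byte never occurs in the input
    {
        # 1. protect the ampersand of each real entity with the tag
        #    (a bare & in the replacement stands for the matched text)
        gsub(/&([a-zA-Z]+|#[0-9]+);/, tag "&")
        # 2. escape every ampersand, protected or not
        gsub(/&/, "\\&amp;")
        # 3. turn each protected ampersand back into a plain one
        gsub(tag "&amp;", "\\&")
        print
    }'
}
```

For example, `printf '%s\n' 'asd dfgh &qwerty &iop;' | escape_amp_tagged` should print `asd dfgh &amp;qwerty &iop;`.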
> The regex for the entities is "&[a-zA-Z]+;".
So what about the good &#number; ones?
> roland rodrigus wrote:
>
>>Victor Kirk wrote:
>>
>>
>>>The regex for the entities is "&[a-zA-Z]+;".
>>>
>>gsub("&[^a-zA-Z]", "\\&amp; ")
>>
>
> And '&#number;' ??
>
I believe that the regex for entity references in XML is (assuming you
ignore Unicode):
&[a-zA-Z_:][a-zA-Z0-9._:-]*;
The regex I mentioned earlier was simplified (not that it's
complicated) to my requirements, i.e. I only have chars to deal
with.
Regards,
Vic
> I'd say it is not possible in awk using a single pass.
That's the conclusion I came to, but I am far from an expert
on awk and I'd be only too happy to be corrected.
> But a possible kludgy solution might be:
I had one, but this is a nicer solution, although it doesn't feel
like the 'right way' to do it. That said, I will use it.
Many thanks,
Vic
> Hi,
>
Me again. The solution I proposed was based on the example you gave, which
was inconsistent with the regexp.
>
> Example.
>
> This is some &text; that contains an &, &and; &another &.
------------------------------------------------^-------^
> To end up like
>
>
> This is some &text; that contains an &amp;, &and; &another &amp;.
----------------------------------------------------^
>
I can't find a simple solution with gsub or gensub.
One way is to tag the bad or the good ampersand, which requires three
passes.
Another kludge is this one
awk 'BEGIN{RS="&";ORS=""}
NR==1{print;next}
{printf(($0~/^[[:alnum:]]*;/)?"&":"&amp;");print}'
(replace the regexp class with what suits your case.)
I can't really hope it helps
--
roland
awk 'BEGIN{RS="&[[:alnum:]]+;"}{ORS=RT;gsub(/&/,"\\&amp;");print}'
But if someone knows how to do it with a normal use of regular expressions,
I'd be interested to learn.
--
roland
This brings to mind a modification to replacement operators in general:
Have a parameter that specifies where the search starts!!!
For example, add a parameter to match that indicates the last position
looked at. Then, a match loop that can easily handle multiple matches in
a string would look like this:
i = 0;
while ( match(str, regex, i) )
{
# do your stuff
i = RSTART;
}
I did this in a little language for character manipulation that I wrote
years ago, and it proved very useful.
The alternative, in current gawk, is to remove from the string
whatever the match was, which is (sorry about this) awkward.
A similar thing could be done for sub.
Martin Cohen
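For what it's worth, the "remove what was matched" loop described above can be applied to the original ampersand question in one pass over each line. The following is only a sketch: the function names escape_amp and esc are invented for illustration, and the entity pattern is the simplified one from earlier in the thread (letters, or # followed by digits, then a semicolon), so adjust it to taste.

```shell
#!/bin/sh
# Hypothetical helper: escape bare ampersands using a match() loop.
escape_amp() {
    awk '
    function esc(s,    out, rest) {
        out = ""
        # Find the next ampersand, decide what to do with it, then drop
        # everything up to and including it from the remaining string.
        while (match(s, /&/)) {
            out = out substr(s, 1, RSTART - 1)
            rest = substr(s, RSTART + 1)
            if (rest ~ /^([a-zA-Z]+|#[0-9]+);/)
                out = out "&"        # start of an entity: keep it
            else
                out = out "&amp;"    # bare ampersand: escape it
            s = rest
        }
        return out s
    }
    { print esc($0) }
    '
}
```

Because each iteration shortens the string, the loop always terminates, and there is no marker text that could collide with the input. It uses only match(), substr() and RSTART, so it should not need gawk.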