Remove XML comments using Regex.

Hongyi Zhao

Sep 25, 2017, 2:51:18 AM
Hi all,

Is there simple code to remove XML comments from a xml file?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Allodoxaphobia

Sep 25, 2017, 9:10:28 AM
On Mon, 25 Sep 2017 06:51:13 +0000 (UTC), Hongyi Zhao wrote:
>
> Is there simple code to remove XML comments from a xml file?

http://bfy.tw/E7Ev

Helmut Waitzmann

Sep 25, 2017, 2:09:10 PM
Hongyi Zhao <hongy...@gmail.com>:
> Hi all,
>
> Is there simple code to remove XML comments from a xml file?

No. There isn't any code using only regular expressions to parse
XML documents.

Chris Elvidge

Sep 25, 2017, 2:32:36 PM
On 25/09/2017 07:51 am, Hongyi Zhao wrote:
> Hi all,
>
> Is there simple code to remove XML comments from a xml file?
>
> Regards
>
What do you define as an "XML comment"?
Can you post part of the file, part containing one or more "XML comments"?


--

Chris Elvidge, England

Kaz Kylheku

Sep 25, 2017, 2:57:00 PM
On 2017-09-25, Chris Elvidge <ch...@mshome.net> wrote:
> On 25/09/2017 07:51 am, Hongyi Zhao wrote:
>> Hi all,
>>
>> Is there simple code to remove XML comments from a xml file?
>>
>> Regards
>>
> What do you define as an "XML comment"?

https://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment

XML comments are regular; the grammar is:

Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

which is basically a regex on the right hand side. I think (Char - '-')
is basically the same thing as [^-]. Thus:

Comment := '<!--([^-]|-[^-])*-->'

In other words, an XML comment is the sequence <!--, closed by --> in
which -- does not occur.
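
A minimal sketch of that pattern in Python, assuming the whole document
fits in one string. The names strip_comments_naive and sample are
illustrative only, and the function removes anything that matches the
comment grammar without looking at the surrounding markup:

```python
import re

# "<!--", then any run of non-hyphens or of a hyphen followed by a
# non-hyphen, then "-->". In Python's re, [^-] also matches a newline.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def strip_comments_naive(xml_text: str) -> str:
    """Remove every substring that matches the comment grammar."""
    return COMMENT.sub("", xml_text)


sample = "<a><!-- one -->text<!-- two\nlines --><b/></a>"
print(strip_comments_naive(sample))
```

A string such as "<!-- a -- b -->" is left untouched, because -- may not
occur inside a comment.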

The only minor difficulty is that a comment may span multiple lines,
whereas most Unix text processing utils deal with lines. You need some
GNU extensions to match a regex across multiple lines.

With POSIX awk, we could use < as the record separator (RS). A comment
then is recognized as a $0 record which starts with !--.

We can apply the regular expression ([^-]|-[^-])*--> to match a prefix
of this record and strip it away.

Then, if we are careful about how we reconstitute the filtered records
(we can't just set ORS to < and rely on print!), we should be able to
achieve workable XML comment stripping.

Kaz Kylheku

Sep 25, 2017, 3:01:36 PM
On 2017-09-25, Kaz Kylheku <398-81...@kylheku.com> wrote:
> On 2017-09-25, Chris Elvidge <ch...@mshome.net> wrote:
>> On 25/09/2017 07:51 am, Hongyi Zhao wrote:
>>> Hi all,
>>>
>>> Is there simple code to remove XML comments from a xml file?
>>>
>>> Regards
>>>
>> What do you define as an "XML comment"?
>
> https://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment
>
> XML comments are regular; the grammar is:
>
> Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
>
> which is basically a regex on the right hand side. I think (Char - '-')
> basically is the same thing as [^-]. Thus:
>
> Comment := '<!--([^-]|-[^-])*-->'
>
> In other words, an XML comment is the sequence <!--, closed by --> in
> which -- does not occur.

Unfortunately, this is a little short-sighted. For example, XML
documents can contain CDATA sections, inside of which there can be
things that look like XML comments.

A CDATA section begins with the sequence <![CDATA[ and ends with the
sequence ]]>; in between there can be anything (other than, of course,
the sequence ]]>).

So the following is valid CDATA:

<![CDATA[<!-- looks like a comment, but isn't! -->]]>

No comment-stripping program for XML can be considered correct if it
strips characters from the above CDATA.

Ben Bacarisse

Sep 25, 2017, 4:13:03 PM
Kaz Kylheku <398-81...@kylheku.com> writes:

> On 2017-09-25, Chris Elvidge <ch...@mshome.net> wrote:
>> On 25/09/2017 07:51 am, Hongyi Zhao wrote:
>>> Hi all,
>>>
>>> Is there simple code to remove XML comments from a xml file?
>>>
>>> Regards
>>>
>> What do you define as an "XML comment"?
>
> https://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment
>
> XML comments are regular; the grammar is:
>
> Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
>
> which is basically a regex on the right hand side. I think (Char - '-')
> basically is the same thing as [^-]. Thus:
>
> Comment := '<!--([^-]|-[^-])*-->'
>
> In other words, an XML comment is the sequence <!--, closed by --> in
> which -- does not occur.

There's a small wrinkle in that. This XML document contains no comment:

<?xml version="1.0" encoding="UTF-8"?>
<x><![CDATA[<!-- I am not a comment -->]]></x>

<snip>
--
Ben.

Thomas 'PointedEars' Lahn

Sep 25, 2017, 6:36:12 PM
Because regular expressions are greedy by default, as long as

- alternation is available;
- the first, not the longest, match in an alternation wins; and
- the delimiter length is fixed,

the same approach can be used to exclude strings that contain a targeted
substring, without negative lookahead:

/<!\[CDATA\[([^\]]|\][^\]]|\]\][^>])*\]\]>|<!--([^-]|-[^-])*-->/

Capturing parentheses can then be used to tell the matches apart. Because
in this case the difference between no match and a match of zero length may
be different to tell, the entire expression for matching the targeted
substring should be contained in capturing parentheses (marked below):

/<!\[CDATA\[([^\]]|\][^\]]|\]\][^>])*\]\]>|(<!--([^-]|-[^-])*-->)/
                                           ^                    ^
As neither POSIX awk nor GNU awk apparently supports passing a function
that is called with each match and whose return value is used as the
replacement, the only possibility that I can see is to call the match()
function in a loop, applying it to consecutive slices of the original
string while keeping track of the position of each match, in order to
tell the matches apart.

A user-defined function may be written so that this oft-needed feature does
not have to be implemented from scratch every time, or a more powerful
programming language may be used as this approach is rather inefficient.
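
A minimal sketch of that idea in a language whose substitution function
does accept a callback, here Python's re.sub(). The names TOKEN and
strip_comments are illustrative; the inner groups are made non-capturing
so that group 1 is the comment, and the CDATA branch is the pattern as
given above:

```python
import re

# First branch: a CDATA section, which must be kept as it is.
# Second branch: a comment, captured as group 1 so it can be told apart.
TOKEN = re.compile(
    r"<!\[CDATA\[(?:[^\]]|\][^\]]|\]\][^>])*\]\]>"
    r"|(<!--(?:[^-]|-[^-])*-->)"
)


def strip_comments(xml_text: str) -> str:
    """Drop comments but copy CDATA sections through unchanged."""

    def replace(match):
        # Group 1 is set only when the comment branch matched.
        if match.group(1) is not None:
            return ""
        return match.group(0)

    return TOKEN.sub(replace, xml_text)


doc = "<x><![CDATA[<!-- I am not a comment -->]]><!-- real --></x>"
print(strip_comments(doc))
```

Because the search always takes the leftmost match, whichever construct
opens first wins. As far as I can tell, the CDATA branch does not match
a section whose content ends in ], for example <![CDATA[a]]]>, so this
is a sketch rather than a complete solution.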

But, in general, one should avoid modifying SGML- and XML-like markup using
regular expressions because those formal languages are just _not_ regular.
We have (or write) markup parsers (based on Nondeterministic Push-Down
Automata) and XSLT for that. For example, BeautifulSoup[1] is a lightweight
markup parser and lxml.etree.XSLT is an XSLT processor, both for Python;
and xsltproc(1), a command-line XSLT processor, is contained in
libxslt(3), the XSLT C library for GNOME.
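
For comparison, a sketch of the parser-based route using only Python's
standard library (xml.dom.minidom rather than the libraries named above,
purely to avoid third-party dependencies). The function name is
illustrative, and the serialised output will differ from the input in
details such as the XML declaration:

```python
from xml.dom import minidom


def strip_comments_dom(xml_text: str) -> str:
    """Parse the document, drop all comment nodes, and serialise it again."""
    doc = minidom.parseString(xml_text)

    def prune(node):
        # Copy the child list first, because it is modified while walking.
        for child in list(node.childNodes):
            if child.nodeType == child.COMMENT_NODE:
                node.removeChild(child)
            else:
                prune(child)

    prune(doc)
    return doc.toxml()


doc = "<x><![CDATA[<!-- I am not a comment -->]]><!-- real --></x>"
print(strip_comments_dom(doc))
```

Since the parser already knows where a CDATA section starts and ends,
the comment lookalike inside it is never a candidate for removal.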

See also <https://stackoverflow.com/a/1732454/855543> ;-)

_______
[1] <https://www.crummy.com/software/BeautifulSoup/>
--
PointedEars
FAQ: <http://PointedEars.de/faq> | <http://PointedEars.de/es-matrix>
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2 | Please do not cc me./Bitte keine Kopien per E-Mail.

Thomas 'PointedEars' Lahn

Sep 25, 2017, 6:37:42 PM
Thomas 'PointedEars' Lahn wrote:

> […]
> /<!\[CDATA\[([^\]]|\][^\]]|\]\][^>])*\]\]>|<!--([^-]|-[^-])*-->/
>
> Capturing parentheses can then be used to tell the matches apart. Because
> in this case the difference between no match and a match of zero length
> may be different to tell, the entire expression for matching the targeted

s/different/difficult/

> substring should be contained in capturing parentheses (marked below):
>
> /<!\[CDATA\[([^\]]|\][^\]]|\]\][^>])*\]\]>|(<!--([^-]|-[^-])*-->)/
>                                            ^                    ^
> […]

Thomas 'PointedEars' Lahn

Sep 25, 2017, 6:50:21 PM
Kaz Kylheku wrote:

> On 2017-09-25, Chris Elvidge <ch...@mshome.net> wrote:
>> On 25/09/2017 07:51 am, Hongyi Zhao wrote:
>>> Is there simple code to remove XML comments from a xml file?
>> What do you define as an "XML comment"?
>
> https://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment
>
> XML comments are regular; […]

Apply the pumping lemma for regular languages, then say that again ;-)

(Ab ynathntr gung erdhverf nygreangvba gb qrfpevor vg vf erthyne va gur
gurbergvpny pbzchgre fpvrapr zrnavat bs gur jbeq. Gur Onpxhf-Anhe sbez jnf
vairagrq gb qrfpevor cebtenzzvat ynathntrf, juvpu ner hfhnyyl pbagrkg-serr,
ohg _abg_ erthyne.)

Ben Bacarisse

Sep 25, 2017, 8:03:01 PM
Thomas 'PointedEars' Lahn <Point...@web.de> writes:

> Kaz Kylheku wrote:
>
>> On 2017-09-25, Chris Elvidge <ch...@mshome.net> wrote:
>>> On 25/09/2017 07:51 am, Hongyi Zhao wrote:
>>>> Is there simple code to remove XML comments from a xml file?
>>> What do you define as an "XML comment"?
>>
>> https://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment
>>
>> XML comments are regular; […]
>
> Apply the pumping lemma for regular languages, then say that again ;-)

I don't see what you are referring to here. Valid XML documents do not
form a regular language, but XML comments do.

The pumping lemma /does/ apply to comments in some programming languages
because, in some, comments may be nested. But that's not the case here.

<un-rot13:>
> (No language that requires alternation to describe it is regular in the
> theoretical computer science meaning of the word. The Backus-Naur form was
> invented to describe programming languages, which are usually context-free,
> but _not_ regular.)

Alternation is central to the description of regular languages. Here
are three descriptions of a simple regular language:

(a|bb)*

<str> ::= | a<str> | bb<str>

S ->
S -> aS
S -> bbS

The first is a regular expression and the second two are forms of
Backus-Naur notation. All use alternation.

The BNF grammar for an XML comment given in the w3.org reference
above describes a regular language using alternation.
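
The equivalence of the first two descriptions can be spot-checked
mechanically. In this sketch (Python; all names are illustrative) a
recogniser written directly from the BNF rule is compared with the
regular expression on every string over {a, b} up to a given length:

```python
import re
from itertools import product

REGEX = re.compile(r"(?:a|bb)*")


def matches_grammar(s: str) -> bool:
    """Recogniser for  <str> ::= | a<str> | bb<str>  (one branch per alternative)."""
    if s == "":
        return True
    if s.startswith("a"):
        return matches_grammar(s[1:])
    if s.startswith("bb"):
        return matches_grammar(s[2:])
    return False


def agree_up_to(max_len: int) -> bool:
    """True if regex and grammar accept the same strings of length <= max_len."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if (REGEX.fullmatch(s) is not None) != matches_grammar(s):
                return False
    return True


print(agree_up_to(10))
```

This is only a finite check, not a proof, but it shows the two notations
being used for one and the same language.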

--
Ben.

Popping mad

Sep 26, 2017, 1:35:36 AM
On Mon, 25 Sep 2017 18:56:52 +0000, Kaz Kylheku wrote:

> In other words, an XML comment is the sequence <!--, closed by --> in
> which -- does not occur.

line breaks

Popping mad

Sep 26, 2017, 1:37:16 AM
looks like an antler problem..

Chris Elvidge

Sep 26, 2017, 6:16:42 AM
Still haven't heard from Hongyi Zhao about the makeup of his XML file.
Is this yet another of his "tell me how to do this" followed by "see my
testings" when the answer does or doesn't work?


--

Chris Elvidge, England

Ben Bacarisse

Sep 26, 2017, 7:14:08 AM
What about them? Just saying "line breaks" is not very helpful! Some
utilities won't match a newline with [^-] and some will; some will need
extra flags and so on to get it right, but line breaks in themselves are
not a serious obstacle.
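
For instance, in Python's re module (chosen here only as one example of
such a utility) a negated class such as [^-] matches a newline without
any flag, whereas . needs re.DOTALL. The helper name full and the sample
text are illustrative:

```python
import re

COMMENT = r"<!--(?:[^-]|-[^-])*-->"
two_lines = "<!-- first line\nsecond line -->"


def full(pattern, text, flags=0):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text, flags) is not None


# The negated class crosses the line break without any flag.
print(full(COMMENT, two_lines))
# '.' stops at the newline by default ...
print(full(r"<!--.*?-->", two_lines))
# ... and crosses it once re.DOTALL is given.
print(full(r"<!--.*?-->", two_lines, re.DOTALL))
```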

--
Ben.

Thomas 'PointedEars' Lahn

Sep 29, 2017, 4:19:02 AM
Popping mad <rai...@colition.gov> wrote:
^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^
[1]         [2]

[1] It is customary and therefore considered polite to post using one’s
real name here. If you also want to use a nickname, you can insert
it in single quotes as do I.

[2] The second-level domain colition.gov is not yet registered, so
“rai...@colition.gov” is not an e-mail address. It is a violation of
Internet standards and most certainly considered network abuse according
to the Acceptable Use Policy of your service provider to use a domain
that one does not own or is not affiliated with. This may cause your
account with them (which I presume is the Massachusetts Institute of
Technology) to be suspended or canceled by them without further notice.
In case it is MIT, see
<https://policies-procedures.mit.edu/node/89/#sub3>.

It is inappropriate and impolite at least to post using a merely
address-*like* string in an address header field; see
<http://tools.ietf.org/html/rfc5536> (Netnews Article Format) and
<http://tools.ietf.org/html/rfc5322> (Internet Message Format).

You have been warned.

> On Tue, 26 Sep 2017 00:36:05 +0200, Thomas 'PointedEars' Lahn wrote:

Attribution *line*, _not_ attribution novel.

> [full quote]

<http://www.netmeister.org/news/learn2quote.html>

> looks like an antler problem..
                ^^^^^^^^^^^^^^
I could not find a definition for that term on the Web. What does it mean?

(Keep in mind that this is an *internationally* distributed newsgroup.
Local idioms may not be readily understood by all subscribers.)

--
PointedEars

Thomas 'PointedEars' Lahn

Sep 29, 2017, 6:31:21 AM
Ben Bacarisse wrote:

> Thomas 'PointedEars' Lahn <Point...@web.de> writes:
>> Kaz Kylheku wrote:
>>> On 2017-09-25, Chris Elvidge <ch...@mshome.net> wrote:
>>>> On 25/09/2017 07:51 am, Hongyi Zhao wrote:
>>>>> Is there simple code to remove XML comments from a xml file?
>>>> What do you define as an "XML comment"?
>>>
>>> https://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment
>>>
>>> XML comments are regular; […]
>>
>> Apply the pumping lemma for regular languages, then say that again ;-)
>
> I don't see what you are referring to here.

Why do you not look it up *before* you reply?

> […]
> The pumping lemma /does/ apply to comments in some programming languages
> because in some comments may be nested. […]

Utter nonsense.

> <un-rot13:>
>> (No language that requires alternation to describe it is regular in the
>> theoretical computer science meaning of the word. The Backus-Naur form
>> was invented to describe programming languages, which are usually
>> context-free, but _not_ regular.)
>
> Alternation is central to the description of regular languages.

Utter nonsense.

> The BNF grammar for an XML comment given in the w3.org reference
> above describes a regular language using alternation.

No. Apparently you do not know the Chomsky hierarchy: All regular languages
are context-free, but not all context-free languages are regular (L₃ ⊆ L₂).

<https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy>

This means that the BNF *can* be used to describe regular languages (even
though, AISB, that is not its primary purpose), but it also means that _not_
*all* languages that can be described using the BNF are regular; or, IOW,
*that* a language can be described using the BNF does _not_ mean that it is
regular.

And then, /ex falso quodlibet/.

--
PointedEars

Ben Bacarisse

Sep 29, 2017, 4:20:49 PM
Thomas 'PointedEars' Lahn <Point...@web.de> writes:

> Ben Bacarisse wrote:
>
>> Thomas 'PointedEars' Lahn <Point...@web.de> writes:
>>> Kaz Kylheku wrote:
<snip>
>>>> https://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment
>>>>
>>>> XML comments are regular; […]

<snip>
>>> (No language that requires alternation to describe it is regular in the
>>> theoretical computer science meaning of the word. The Backus-Naur form
>>> was invented to describe programming languages, which are usually
>>> context-free, but _not_ regular.)
>>
>> Alternation is central to the description of regular languages.
>
> Utter nonsense.

All that is needed to refute your claim is a counterexample and I gave
one, described in three different ways in case you were quibbling about
what constitutes alternation. Here it is again:

(a|bb)*

<str> ::= | a<str> | bb<str>

S ->
S -> aS
S -> bbS

In what way do you think this is not a counterexample to the statement
you made?

>> The BNF grammar for an XML comment given in the w3.org reference
>> above describes a regular language using alternation.
>
> No.

Yes it does. Quite plainly. The rule describes a language defined
using only concatenation, alternation and the Kleene star -- the very
definition of regular.
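
To make that concrete, here is a sketch of a finite-state recogniser for
the comment rule, written in Python with illustrative state names. It
uses a fixed number of states and no stack or counter; the startswith()
test stands in for the four states that would read <!--, and any
character is accepted where the grammar says Char:

```python
def is_xml_comment(s: str) -> bool:
    """Accept exactly the strings  '<!--' ([^-] | '-' [^-])* '-->'."""
    prefix = "<!--"
    if not s.startswith(prefix):
        return False
    state = "BODY"
    for ch in s[len(prefix):]:
        if state == "BODY":
            # A hyphen may start "-x" or the closing "-->".
            state = "DASH1" if ch == "-" else "BODY"
        elif state == "DASH1":
            # A second hyphen commits to the closing delimiter.
            state = "DASH2" if ch == "-" else "BODY"
        elif state == "DASH2":
            # After "--" only ">" is allowed.
            state = "DONE" if ch == ">" else "DEAD"
        else:
            # DONE or DEAD: any further input is rejected.
            state = "DEAD"
    return state == "DONE"


print(is_xml_comment("<!-- a - b -->"))
print(is_xml_comment("<!-- a -- b -->"))
```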

<snip bluster>
--
Ben.

Ian Zimmerman

Sep 30, 2017, 6:57:47 PM
On 2017-09-29 10:18, Thomas 'PointedEars' Lahn wrote:

> > On Tue, 26 Sep 2017 00:36:05 +0200, Thomas 'PointedEars' Lahn wrote:
>
> Attribution *line*, _not_ attribution novel.

The time information is useful if the replying message breaks threading,
as happens all the time, particularly with people posting with Shmoogle
Groups (and worse).

Oh yes, my From is not a real domain, either. So sue me.

--
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
Do obvious transformation on domain to reply privately _only_ on Usenet.

Thomas 'PointedEars' Lahn

Sep 30, 2017, 9:24:19 PM
Ian Zimmerman wrote:

> On 2017-09-29 10:18, Thomas 'PointedEars' Lahn wrote:
>> > On Tue, 26 Sep 2017 00:36:05 +0200, Thomas 'PointedEars' Lahn wrote:
>> Attribution *line*, _not_ attribution novel.
>
> The time information is useful if the replying message breaks threading,
> as happens all the time, particularly with people posting with Shmoogle
> Groups (and worse).

Those pointless attribution novels break threads, yes. They make them a lot
harder to read.

> Oh yes, my From is not a real domain, either. So sue me.

Too much effort. *PLONK*

F’up2 poster

Thomas 'PointedEars' Lahn

Sep 30, 2017, 9:39:42 PM
Ben Bacarisse wrote:

> In what way do you think this is not a counterexample to the statement
> you made?

Learn to read.

Ben Bacarisse

Oct 1, 2017, 9:31:58 AM
Thomas 'PointedEars' Lahn <Point...@web.de> writes:

> Ben Bacarisse wrote:
>
>> In what way do you think this is not a counterexample to the statement
>> you made?
>
> Learn to read.

You said:

| (No language that requires alternation to describe it is regular in the
| theoretical computer science meaning of the word.

Here is my counterexample, again -- the same language written out in
three different ways:

(a|bb)*

<str> ::= | a<str> | bb<str>

S ->
S -> aS
S -> bbS

I invite you to explain why you think it is not a counterexample.

--
Ben.