XML files with gawk

kahrs

Aug 13, 2004, 1:25:03 PM

Hello,

Recently some more readers asked about reading
XML files with AWK. I just wanted to draw your
attention to the fact that there is an experimental
extension of gawk for doing this. Anyone
interested in trying the most recent version
can get a source code patch relative to gawk 3.1.4 here:

http://home.vr-web.de/~Juergen.Kahrs/patch_3.1.4__xml_20040813

Example applications can be found in this tar file:

http://home.vr-web.de/Juergen.Kahrs/xml_puller.tar.gz

The best online documentation we currently have is the one
written by Stefan Tramm:

http://homepage.mac.com/stefan.tramm/iWiki/DownloadSection.html
http://homepage.mac.com/stefan.tramm/iWiki/XmlGawkMacOSX.html

The original gawk 3.1.4 should eventually be available
for download, when the GNU admins have put it here:

ftp://ftp.gnu.org/gnu/gawk/

William Park

Aug 13, 2004, 3:51:15 PM

This would be interesting. The Expat XML parser could probably be added to
the shell as well as awk.

--
William Park <openge...@yahoo.ca>
Open Geometry Consulting, Toronto, Canada

kahrs

Aug 14, 2004, 11:48:55 AM

William Park wrote:

> This would be interesting. Expat XML parser could probably be added to
> shell as well as awk.

Expat can be added to everything. The question is whether it makes sense.
Does Expat's push-parser style work well in conjunction with the
Bourne shell? No. The Bourne shell's operational model is all about
starting and controlling processes.

AWK's operational model is all about reading and writing data with
a pattern-action model in mind. Therefore, Expat's push-parsing style
integrates quite naturally into AWK programming style.

For example, Arnold Robbins needed a simple way to extract the
URL from the DOCTYPE declaration of a DocBook file. The solution
is a typical AWK-style one-liner:

gawk -vXMLMODE=1 'XMLDOCTSYSID{print XMLDOCTSYSID}' docbook_chapter.xml

William Park

Aug 14, 2004, 3:04:27 PM

kahrs <Juergen.Kah...@vr-web.de> wrote:
> William Park wrote:
>
> > This would be interesting. Expat XML parser could probably be added to
> > shell as well as awk.
>
> Expat can be added to everything. The question is, if it makes sense.
> Does Expat's push-parser style work well in conjunction with the
> Bourne shell ? No. The Bourne shell's operational model is all about
> starting and controlling processes.

That's way too simplistic. As a counter-example, I offer
http://freshmeat.net/projects/bashdiff/
http://freshmeat.net/projects/basp/
which put Bash shell in line with other high-level scripting languages.
In most areas, much better and more natural; in other areas, not as
good.

I would agree, however, that Expat would be more natural with Awk than
Bash. Unless, of course, I can figure out a query format that makes
sense to both shell and XML. :-)

>
> AWK's operational model is all about reading and writing data with
> a pattern-action model in mind. Therefore, Expat's push-parsing style
> integrates quite naturally into AWK programming style.
>
> For example, Arnold Robbins needed a simple way to extract the
> URL from the DOCTYPE declaration of a DocBook file. The solution
> is a typical AWK-style one-liner:
>
> gawk -vXMLMODE=1 'XMLDOCTSYSID{print XMLDOCTSYSID}' docbook_chapter.xml

Expat uses callback functions. So, when the parser encounters DOCTYPE, it
calls that function, which in turn triggers this Awk line. I can do
that in Bash shell, but there has to be a better way than that to make it
workable in shell scripting.

kahrs

Aug 14, 2004, 6:03:18 PM

William Park wrote:

> That's way too simplistic. As a counter-example, I offer
> http://freshmeat.net/projects/bashdiff/
> http://freshmeat.net/projects/basp/
> which put Bash shell in line with other high-level scripting languages.
> In most areas, much better and more natural; in other areas, not as
> good.

Interesting. If you think that it would make sense
to integrate an XML reader into bash, it would most
likely have to be a pull-parser (requesting one token
after the other with the "read" command).

> I would agree, however, that Expat would be more natural with Awk than
> Bash. Unless, of course, I can figure out format of query that makes
> sense to both shell and XML. :-)

As I said, extend the "read" command so that it produces
one token after the other, just like in the demo_puller.awk
example application. In the age of Object Orientation this
is not quite up-to-date, but it works.
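
To make the "read" idea concrete, here is a minimal sketch in plain
POSIX shell. The emit_tokens function is a hypothetical stand-in for an
XML-aware tokenizer, and the one-token-per-line "TYPE value" format is an
assumption made for illustration; it is not the output format of
demo_puller.awk. Only the token names are borrowed from xmlgawk.

```shell
#!/bin/sh
# Hypothetical stand-in for a real tokenizer: emits one "TYPE value"
# token per line. Both the stream and its format are assumptions.
emit_tokens() {
    printf '%s\n' \
        'STARTELEM book' \
        'STARTELEM title' \
        'CHARDATA Effective AWK' \
        'ENDELEM title' \
        'ENDELEM book'
}

# Pull-style consumer: the loop requests the next token with "read"
# and keeps track of the nesting depth itself.
print_outline() {
    indent=""
    while read -r type value; do
        case $type in
            STARTELEM)
                printf '%s%s\n' "$indent" "$value"
                indent="$indent  "          # one level deeper
                ;;
            ENDELEM)
                indent=${indent#"  "}       # one level back
                ;;
        esac
    done
}

emit_tokens | print_outline
```

The consumer decides when to ask for the next token, which is what makes
this a pull-style reader rather than a callback-driven one.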

> Expat uses callback functions. So, when the parser encounters DOCTYPE, it
> calls that function, which in turn triggers this Awk line. I can do
> that in Bash shell, but there has to be a better way than that to make it
> workable in shell scripting.

Have a look at my example applications. Maybe you will find a way
along these lines. Then you could build upon my xml_puller
library:

http://home.vr-web.de/Juergen.Kahrs/xml_puller.tar.gz

It was designed to be usable without gawk as well.
Look at the example applications: each time I add
a feature to xmlgawk, I first add this feature to
the xml_puller lib, test it (without the gawk runtime
environment) and then integrate the feature into gawk.

john

Sep 1, 2004, 11:24:04 AM

kahrs <Juergen.Kah...@vr-web.de> wrote in message news:<2o4bv6F...@uni-berlin.de>...

A few suggestions/feature requests. I haven't tried out the patch, so
apologies in advance if any of the following doesn't make sense.

[1] Seems gawk will receive CHARDATA as a single piece instead of
in chunks. It would be nice if CDATA and UNPARSED data are also
made available that way. For CDATA, could use a single CDATA token
instead of the two STARTCDATA and ENDCDATA.

[2] Note that registering a default handler for the UNPARSED token has the
side effect of expat not expanding internal entities. As a user, I
would probably sometimes want the entities to be expanded.

A solution would be to recognize the entity references in the
unparsed handler (after buffering the entire content, look for
'&' in the first character and ';' in the last character). Then
report that as an ENTITYREF token (the value should be the data
excluding the '&' and ';' characters). In addition, you would
need to register an entity declaration (and possibly notation
declaration) handler, and deliver an ENTITYDECL token. This would
enable the user to substitute the entity value for the reference.

[3] Would it make sense to use an array for the tokens (e.g XML["STARTELEM"],
XML["ENDELEM"], etc.) instead of all these scalars?
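
The check proposed in [2] could be sketched roughly like this, in plain
shell and awk rather than inside the expat handler. The classify_chunk
function and its "TYPE value" output format are hypothetical; only the
token names ENTITYREF and UNPARSED come from the suggestion itself. The
sketch assumes the buffered chunk fits on a single line.

```shell
#!/bin/sh
# classify_chunk prints "ENTITYREF name" when the buffered chunk starts
# with "&" and ends with ";", and "UNPARSED text" for anything else.
classify_chunk() {
    printf '%s\n' "$1" | awk '
        /^&.+;$/ {
            # Report the data without the surrounding "&" and ";"
            print "ENTITYREF " substr($0, 2, length($0) - 2)
            next
        }
        { print "UNPARSED " $0 }'
}

classify_chunk '&copy;'
classify_chunk 'plain text'
```

A script receiving ENTITYREF could then look up the name in a table filled
from ENTITYDECL tokens and substitute the declared value.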

Finally, I just looked at the changelog for the latest version of expat
(version 1.95.8) on the expat home page. It looks like one can now suspend the
parser for later resumption. This could possibly allow an alternative,
but not necessarily simpler, implementation of a pull-style parser.

Thanks,

John

Jürgen Kahrs

Sep 1, 2004, 3:32:37 PM

john wrote:

> A few suggestions/feature requests. I haven't tried out the patch, so
> apologies in advance if any of the following doesn't make sense.
>
> [1] Seems gawk will receive CHARDATA as a single piece instead of
> in chunks. It would be nice if CDATA and UNPARSED data are also
> made available that way.

Yes, a good idea.

> For CDATA, could use a single CDATA token
> instead of the two STARTCDATA and ENDCDATA.

We have already discussed this, but there was a good
reason against it (sorry, I can't remember which).

> [2] Note that registering a default handler for the UNPARSED token has the
> side effect of expat not expanding internal entities. As a user, I
> would probably sometimes want the entities to be expanded.

Yes, I knew someone would ask about it.
It looks like a logical next step to use
expat's callback functions for reading
internal entities.

> [3] Would it make sense to use an array for the tokens (e.g XML["STARTELEM"],
> XML["ENDELEM"], etc.) instead of all these scalars?

Arnold Robbins also asked this. Currently, we disagree
about this, for aesthetic reasons.

> Finally, I just looked at the changelog for the latest version of expat
> (version 1.95.8) in expat home page. Looks like one can now suspend the
> parser for later resumption. This could possibly allow an alternative,
> but not necessarily simpler, implementation of a pull-style parser.

Correct. At the moment, I prefer to go on with what we have
in the xml_puller:
1. We need a puller interface anyway because we have
to support gawk's "getline" function which acts as the "puller".
2. expat's internals are not visible above the puller interface.
We could decide to use the suspend option at any time without
changing the puller interface.
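
As a rough illustration of point 1, here is how a "getline" loop acts as
the puller in ordinary awk. This is only a sketch: it reads a hypothetical
one-token-per-line "TYPE value" stream from standard input instead of
real XML, and count_elements is a made-up name.

```shell
#!/bin/sh
# count_elements pulls tokens one at a time with getline inside BEGIN,
# instead of letting awk's main loop push records into pattern rules.
count_elements() {
    awk 'BEGIN {
        while ((getline token) > 0) {
            split(token, part, " ")
            if (part[1] == "STARTELEM") count[part[2]]++
        }
        for (name in count) print name, count[name]
    }' | sort
}

printf '%s\n' 'STARTELEM list' 'STARTELEM item' 'ENDELEM item' \
              'STARTELEM item' 'ENDELEM item' 'ENDELEM list' | count_elements
```

Nothing in the loop depends on how the tokens are produced, which is the
property that keeps expat's internals hidden behind the puller interface.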

Thanks for posting your ideas.
If someone writes GNU Awk scripts for processing XML
files, please post some examples. Examples show us
which kinds of problems can be solved; which kinds are
not so easy to solve with xmlgawk and which problem-solving
approach users like.

Mark R.Bannister

Sep 2, 2004, 11:36:50 AM

Jürgen Kahrs <Juergen.Kah...@vr-web.de> wrote in message news:<2pmmf9F...@uni-berlin.de>...

One reason I have a distinct loathing for XML, esp. in configuration
files, is it's very difficult to parse (with line-based editors) and
it's not very readable either. In my book, this breaks both of the
fundamental tests for a usable configuration standard .... whoever
first thought XML was a good idea for anything except document mark-up
should be shot (steps off soap box before he gets lynched for posting
off-topic).

Anyway, personal grievances aside, here's a script I was forced to
write, unhappy and at gun-point, to try and make some XML files I was
dealing with more readable. This demonstrates how much work it takes
in AWK just to parse the structure alone. This doesn't even take into
consideration reading attribute values or parsing DTDs.

The next person who thinks it's a good idea to write a configuration
file in XML will have to personally answer to my wrath ........
perhaps I should set up a new website banxml.org or xmlboycott.com
with the sole intent to make the world see reason. Anyone with me?
:-)

#!/bin/ksh
##############################################################################
#
# xmldump
# Author: Mark R. Bannister <markb at freedomware.co.uk>
# Displays components within a set of named XML files
#
##############################################################################
CALL=$(basename "$0")
USAGE="Syntax: $CALL [-cdit] xmlfile ..."

##############################################################################
###
### DisplayXML() displays selected components of a named XML file
###
### Arguments:
### arg 1 = 0 no doc content, 1 display doc content
### arg 2 = 0 no tags, 1 display tags
### arg 3 = 0 no comments, 1 display comments
### arg 4 = 0 do not change indentation, 1 recalculate indents
### arg 5 = filename
###
##############################################################################
DisplayXML()
{
    nawk -v shdoc=$1 -v shtags=$2 -v shcomm=$3 -v indent=$4 '
    {
        pushline=levhigh=0

        ### If indenting strip any leading blanks from input
        CloseFlags()
        if (indent && !comment) sub("^[ ][ ]*","")

        ### Strip carriage returns
        gsub("\\r","")

        ### Scan line one character at a time
        for (c=1;c<=length($0);c++)
        {
            CloseFlags()
            ReadChars()
            DisplayChars()
        }

        if (newline)
        {
            print ""
            newline=0
        }
    }

    function CloseFlags()
    {
        if (comment==2) comment=0       # close comment
        if (tag==2) tag=0               # close tag
        if (quotes==2) quotes=0         # close quote
    }

    function ReadChars()
    {
        ch=substr($0,c,1)

        if (!comment)
        {
            if (ch=="<" && substr($0,c,4)=="<!--")
            {
                comment=1               # opening comment
                ch=substr($0,c,4)       # stretch chars
                c+=3
            }
            else if (!tag && ch=="<")
            {
                tag=1                   # opening tag

                ### Increase or decrease indent depending
                ### on tag style <tag> or </tag> but not <?tag?> or <!tag>
                ch2=substr($0,c,2)
                if (ch2=="</") level--
                else if (ch2!="<?" && ch2!="<!")
                {
                    level++
                    levhigh=1
                }
            }
            else if (tag)
            {
                if (!quotes && ch=="\"") quotes=1       # opening quote
                else if (quotes && ch=="\"") quotes=2   # closing quote
                else if (!quotes && ch==">")
                {
                    tag=2               # closing tag

                    ### Catch <tag/> style where
                    ### indent level should not change
                    if (c>1 && substr($0,c-1,2)=="/>") level--
                }
            }
        }
        else
        {
            if (ch=="-" && substr($0,c,3)=="-->")
            {
                comment=2               # closing comment
                ch=substr($0,c,3)       # stretch chars
                c+=2
            }
        }
    }

    function DisplayChars()
    {
        ### Work out whether to display this character or not
        dispch=0
        if (comment && shcomm) dispch=1
        if (tag && shtags) dispch=1
        if (!comment && !tag && shdoc) dispch=1
        if (dispch)
        {
            if (indent) IndentLine()
            printf("%s",ch)
            if (!newline) newline=1
        }
    }

    function IndentLine()
    {
        if (pushline || comment) return
        pushline=1

        ### Have begun processing first tag so indent level
        ### may already be one level too high
        if ((thislevel=(levhigh?level-1:level))<0) thislevel=0
        for (lev=0;lev<thislevel;lev++) printf(" ")
    }' "$5"
}

#######################################################################################
###
### START HERE
###
#######################################################################################
comments=0
doc=0
indent=0
tags=0
help=0

while getopts cdit c
do
    case $c in
        c) comments=1;;
        d) doc=1;;
        i) indent=1;;
        t) tags=1;;
        ?) help=1;;
    esac
done
shift $(($OPTIND - 1))

###
### Display help message
###
if [ $help -eq 1 -o $# -eq 0 ]; then
    cat << EOF

Displays components within a set of named XML files.
With no options, displays the XML files much like the cat command.
When options are supplied, displays only the selected components.

$USAGE

where   -c      displays comments
        -d      displays document contents
        -i      indents properly
        -t      displays tags

EOF
    exit 2
fi

###
### If no options supplied, then display entire XML files
###
if [ $comments -eq 0 -a $doc -eq 0 -a $tags -eq 0 ]; then
    comments=1
    doc=1
    tags=1
fi

first=1
while [ $# -gt 0 ]
do
    if [ $first -eq 1 ]; then
        first=0
    else
        echo ""         ### this should be Ctrl+L for a form-feed
    fi

    echo ""
    DisplayXML $doc $tags $comments $indent "$1"
    shift
done

Jürgen Kahrs

Sep 2, 2004, 1:11:58 PM

Mark R.Bannister wrote:

> One reason I have a distinct loathing for XML, esp. in configuration
> files, is it's very difficult to parse (with line-based editors) and
> it's not very readable either.

Configuration files in XML, that's what I had to
face last year. I simply had to find a way to
handle them; Unix tools like sed, grep, cut, awk
did not help me anymore. That's why I extended
GNU Awk with the expat XML parser.

> Anyway, personal grievances aside, here's a script I was forced to
> write, unhappy and at gun-point, to try and make some XML files I was
> dealing with more readable. This demonstrates how much work it takes
> in AWK just to parse the structure alone. This doesn't even take into
> consideration reading attribute values or parsing DTDs.

Whether we like it or not, XML configuration files are there
and we have to work with them.

Have you tried rewriting your script with xmlgawk?
It should be much shorter, more readable and more complete.

> The next person who thinks it's a good idea to write a configuration
> file in XML will have to personally answer to my wrath ........

Please don't misunderstand my posting. I am not recommending
the use of XML for configuration files; I have seen the
consequences of doing this. As a software developer, my
employer expects me to adapt to a changing world. And XML
files are a fact now.

William Park

Sep 2, 2004, 4:34:22 PM

Mark R.Bannister <Chap...@aol.com> wrote:
> Displays components within a set of named XML files.
> With no options, displays the XML files much like that cat command.
> When options are supplied, displays only the selected components.
>
> $USAGE
>
> where -c displays comments
> -d displays document contents
> -i indent properly
> -t displays tags
>
> EOF

I think
- Gawk with XML patch (by Juergen Kahrs), or
- Bash with XML patch or regex cut/split (by me)
would've made it much much easier.

Mark R.Bannister

Sep 3, 2004, 5:05:54 AM

William Park <openge...@yahoo.ca> wrote in message news:<2ppeedF...@uni-berlin.de>...

> Mark R.Bannister <Chap...@aol.com> wrote:
> > Displays components within a set of named XML files.
> > With no options, displays the XML files much like that cat command.
> > When options are supplied, displays only the selected components.
> >
> > $USAGE
> >
> > where -c displays comments
> > -d displays document contents
> > -i indent properly
> > -t displays tags
> >
> > EOF
>
> I think
> - Gawk with XML patch (by Juergen Kahrs), or
> - Bash with XML patch or regex cut/split (by me)
> would've made it much much easier.

I agree, it would have made it much easier. However, often when
working for clients they have the tools they have on UNIX systems
already built ... if you see what I mean ... they rarely allow new
tools to be installed. So sometimes we need to figure out ways of
handling these things with the old unpatched tools.

Jürgen Kahrs

Sep 3, 2004, 12:27:20 PM

Mark R.Bannister wrote:

> I agree, it would have made it much easier. However, often when
> working for clients they have the tools they have on UNIX systems
> already built ... if you see what I mean ... they rarely allow new

Yes, I see what you mean. If your clients use Linux,
you may be lucky and xmllint is installed. You can
use it to re-format the XML file into a kind of
"canonical" indented XML:

xmllint --format mssecure.xml

After this, the Unix tools can be used on this output.
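
A small sketch of that second step, using only sed. The sample document
and its element names (config, server, host, port) are made up for
illustration, and formatted_sample merely imitates the one-element-per-line
layout that "xmllint --format" produces, so the example runs without
xmllint being installed.

```shell
#!/bin/sh
# Hypothetical sample, laid out one element per line, roughly the way
# "xmllint --format" would print it.
formatted_sample() {
cat <<'EOF'
<?xml version="1.0"?>
<config>
  <server>
    <host>example.org</host>
    <port>8080</port>
  </server>
</config>
EOF
}

# Print the text content of every <port> element that sits on its own line.
extract_ports() {
    sed -n 's:^ *<port>\(.*\)</port> *$:\1:p'
}

formatted_sample | extract_ports
```

With a real file, the first stage of the pipeline would be
"xmllint --format mssecure.xml" instead of formatted_sample. This only
works for simple elements that end up on a single line after formatting.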

> tools to be installed. So sometimes we need to figure out ways of
> handling these things with the old unpatched tools.

When using these "unpatched tools", you might be interested
in this rudimentary XML parser which was posted some years
ago:

http://groups.google.de/groups?q=+Steve+Coile+group:comp.lang.awk&hl=de&lr=&ie=UTF-8&group=comp.lang.awk&selm=3BCDC614.AC548AB8%40csc.com&rnum=2

I have never used it, but it looks interesting.

Jürgen Kahrs

Oct 11, 2004, 12:43:45 PM

Hello,

here is a short update on my previous posting.

> recently some more readers asked about reading
> XML files with AWK. I just wanted to draw your
> attention to the fact that there is an experimental
> extension of gawk for doing this. If someone is
> interested in trying the most recent version, he
> can get a source code patch relative to gawk 3.1.4 here:
>
> http://home.vr-web.de/~Juergen.Kahrs/patch_3.1.4__xml_20040813

The most recent patch is this one:

http://home.vr-web.de/~Juergen.Kahrs/patch_3.1.4__xml_20040920

> The best online documentation we currently have is the one
> written by Stefan Tramm:
>
> http://homepage.mac.com/stefan.tramm/iWiki/DownloadSection.html
> http://homepage.mac.com/stefan.tramm/iWiki/XmlGawkMacOSX.html

I have started writing a proper GNU manual.
You can find the very first preliminary draft here:

http://home.vr-web.de/~Juergen.Kahrs/gawk/XML/xmlgawk.ps

Updates will replace the current version without further notice.
The potential timing of a release is mentioned in the Preface,
but the date given is only an informal guess. The handling of
the installation and the patch are still described best on
Stefan's web page (link given above).

> The original gawk 3.1.4 should eventually be available
> for download, when the GNU admins have put it here:
>
> ftp://ftp.gnu.org/gnu/gawk/

Release 3.1.4 (without the patch) is now available here:

ftp://ftp.gnu.org/gnu/gawk/gawk-3.1.4.tar.gz
ftp://ftp.gnu.org/gnu/gawk/gawk-3.1.4-doc.tar.gz
ftp://ftp.gnu.org/gnu/gawk/gawk-3.1.4-ps.tar.gz