Python to XML to Python conversion

Mark

Jul 11, 2002, 8:08:47 PM
Hello,

Recently my employer has asked me to do some computer work though I'm
just savvy enough to write minor programs of little significance. My
job is to write a program that will be generic enough to take any form
of Python dictionary and be able to convert it to XML and back.
Ladies and gentlemen, I am completely at a loss; I'm not a good
programmer. If anyone has any good examples, simple or complicated,
or tips, I would be eternally grateful if you'd share them
with me. I have searched desperately through every posting I could
find on the subject, but to no avail. Please Help!

Regards,
Markus

the...@binary.net

Jul 11, 2002, 9:22:17 PM
Mark <mark...@hotmail.com> wrote:

> Recently my employer has asked me to do some computer work though I'm
> just savvy enough to write minor programs of little significance. My
> job is to write a program that will be generic enough to take any form
> of Python dictionary and be able to convert it to XML and back.

Like -- why? If he wants to save the data from one session to the next,
the pickle or cPickle module will be 100x easier.

> Ladies and gentlemen, I am completely at a loss; I'm not a good
> programmer. If anyone has any good examples, simple or complicated,
> or tips, I would be eternally grateful if you'd share them
> with me. I have searched desperately through every posting I could
> find on the subject, but to no avail. Please Help!

The problem isn't very hard . . . no deep Python magic.

I'd do the Python -> XML like this:

outfile = open("out.xml", "w")    # "w": open the file for writing

outfile.write("<pydict>")
for key in dict.keys():           # dict is the dictionary being saved
    outfile.write("<%s>%s</%s>\n" % (key, dict[key], key))

outfile.write("</pydict>")
outfile.close()

How's that?? Well-formed XML, without any DOM-overhead.

If you didn't understand much of that last sentence, then you'd better
look at the Python docs (www.python.org/doc). Look for "minidom".
Heck, look up XML.

I'll leave the second problem to you . . . .

-- mikeh

Harry George

Jul 11, 2002, 8:01:45 PM
mark...@hotmail.com (Mark) writes:

Get Gnosis Utils:
http://gnosis.cx/download/

See gnosis/xml/pickle. That may do the job for you.

--
Harry George
hgg...@seanet.com

Terry Reedy

Jul 11, 2002, 10:03:35 PM

"Mark" <mark...@hotmail.com> wrote in message
news:d2f5f2d.02071...@posting.google.com...

To make a start, I believe you need PyXML plus marshal or pickle
modules. However, roundtripping *any* dict to text and back means
roundtripping *any* Python object to text and back. I am sure that
this is not always possible. You need sensible boundaries short of
*all* possible objects. With that, I am sure there is existing and
available code that does such interconversions.
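
One way to draw such "sensible boundaries" is to support only a fixed set of types and tag each value with its type. What follows is only a minimal sketch under that assumption, written for a current Python 3 and the standard library's xml.etree.ElementTree (which postdates this thread). The element names (dict, entry, str, and so on) and the helper names to_element and from_element are made up for illustration; they are not taken from PyXML or any existing library.

```python
import xml.etree.ElementTree as ET


def to_element(value):
    """Convert a supported value into a type-tagged XML element."""
    if value is None:
        return ET.Element("none")
    if isinstance(value, bool):  # test bool before int: bool subclasses int
        elem = ET.Element("bool")
        elem.text = "true" if value else "false"
    elif isinstance(value, int):
        elem = ET.Element("int")
        elem.text = str(value)
    elif isinstance(value, float):
        elem = ET.Element("float")
        elem.text = repr(value)
    elif isinstance(value, str):
        elem = ET.Element("str")
        elem.text = value  # ElementTree escapes &, < and > when serializing
    elif isinstance(value, list):
        elem = ET.Element("list")
        for item in value:
            elem.append(to_element(item))
    elif isinstance(value, dict):
        elem = ET.Element("dict")
        for key, item in value.items():
            entry = ET.SubElement(elem, "entry")
            entry.append(to_element(key))   # first child: the key
            entry.append(to_element(item))  # second child: the value
    else:
        # The boundary: anything else is refused rather than guessed at.
        raise TypeError("unsupported type: %s" % type(value).__name__)
    return elem


def from_element(elem):
    """Rebuild the Python value described by a type-tagged element."""
    tag = elem.tag
    if tag == "none":
        return None
    if tag == "bool":
        return elem.text == "true"
    if tag == "int":
        return int(elem.text)
    if tag == "float":
        return float(elem.text)
    if tag == "str":
        return elem.text or ""  # an empty element parses back with text None
    if tag == "list":
        return [from_element(child) for child in elem]
    if tag == "dict":
        return {from_element(e[0]): from_element(e[1]) for e in elem}
    raise ValueError("unknown element: %s" % tag)


data = {"name": "a & b <c>", 1: [1, 2.5, True, None], "nested": {"x": ""}}
xml_text = ET.tostring(to_element(data), encoding="unicode")
restored = from_element(ET.fromstring(xml_text))
```

Even this restricted version has limits: strings containing characters that XML cannot represent (most control characters) will not survive the trip, and tuples, sets and class instances are rejected outright.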

Terry J. Reedy

Jeremy Bowers

Jul 11, 2002, 11:01:51 PM
the...@binary.net wrote:
> I'd do the Python -> XML like this:
>
> outfile = open("out.xml", "w")
>
> outfile.write("<pydict>")
> for key in dict.keys():
>     outfile.write("<%s>%s</%s>\n" % (key, dict[key], key))
>
> outfile.write("</pydict>")
> outfile.close()
>
> How's that?? Well-formed XML, without any DOM-overhead.

This is common and incorrect; the XML is not going to be well formed for
any number of reasons. The keys of the dict are not required to be valid
XML tag names (consider a key "1 2", wrong for starting with a number
AND having a space in it). The keys of the dict may not be strings. The
values of the dict may not be strings either. The values of the dict may
contain any of several XML chars which must be escaped, such as &.
Goodness help your XML parser if the text happens to include XML or XML
fragments.

For each key in the dict, the odds become increasingly stacked against you.
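
To make those failure cases concrete, here is a small sketch that feeds the output of the string-formatting approach to a real parser. It assumes a current Python 3 with xml.etree.ElementTree; naive_dict_to_xml and is_well_formed are illustrative names, not functions from any library.

```python
import xml.etree.ElementTree as ET


def naive_dict_to_xml(d):
    """Build XML by plain string formatting, with no escaping at all."""
    parts = ["<pydict>"]
    for key in d:
        parts.append("<%s>%s</%s>" % (key, d[key], key))
    parts.append("</pydict>")
    return "\n".join(parts)


def is_well_formed(text):
    """Return True if an XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = [
    {"name": "Mark"},    # harmless: simple key, plain text value
    {"1 2": "x"},        # key is not a valid tag name
    {1: 2},              # non-string key, still not a valid tag name
    {"note": "a & b"},   # unescaped ampersand in the value
    {"note": "<b>"},     # value that looks like markup
]
results = [is_well_formed(naive_dict_to_xml(d)) for d in samples]
```

Only the first sample should get past the parser; each of the others trips over one of the problems listed above.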

If you __know__ you have string keys and string vals, you can do
something like

from xml.sax.saxutils import quoteattr

...
outfile.write('<item name=%s value=%s/>\n' % (quoteattr(key),
                                              quoteattr(dict[key])))
...

(untested)

but it is still better to go with the XML marshaler or standard Pickle
module if at all possible.
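
For the strings-only case, a complete round trip might look like the sketch below. It assumes a current Python 3, uses quoteattr for writing as suggested above, and reads the file back with xml.etree.ElementTree (a standard-library module that did not exist when this was posted). The pydict and item element names and the two function names are hypothetical.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET


def dict_to_xml(d):
    """Serialize a dict of string keys and string values as attributes."""
    lines = ["<pydict>"]
    for key, value in d.items():
        # quoteattr adds the surrounding quotes and escapes &, <, > and quotes
        lines.append("  <item name=%s value=%s/>"
                     % (quoteattr(key), quoteattr(value)))
    lines.append("</pydict>")
    return "\n".join(lines)


def xml_to_dict(text):
    """Parse the format written by dict_to_xml back into a dict."""
    root = ET.fromstring(text)
    return {item.get("name"): item.get("value") for item in root.iter("item")}


original = {"1 2": "a & b", "quote": 'say "hi" <now>', "both": "it's \"x\""}
round_tripped = xml_to_dict(dict_to_xml(original))
```

Because keys are stored as attribute values rather than tag names, a key such as "1 2" is no longer a problem. Non-string keys or values still need a separate convention.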

Also, part of being a good programmer is learning how to elicit good
requirements. Do you understand why you need XML? XML is a good transfer
language between programs and language boundaries. If you just need to
save some data for the same program to retrieve later, you actively
*don't* want XML. Use pickle. (Or 'shelve', which I like for quick
projects.) If you *are* going to transfer this data to another program,
then what do those other programs take naturally? If they have a native
format and you can match it, you can save yourself that much trouble.
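
For comparison, the pickle route really is only a few lines. This is a minimal sketch for a current Python 3 (where cPickle has been folded into pickle); the file name session.pickle and the helpers save and load are placeholders chosen for the example.

```python
import os
import pickle
import tempfile


def save(obj, path):
    """Write any picklable object to a file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load(path):
    """Read the object back. Only unpickle data you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Non-string keys and nested containers need no special handling.
data = {"name": "Mark", 42: [1, 2, 3], ("a", "b"): {"nested": None}}
path = os.path.join(tempfile.mkdtemp(), "session.pickle")
save(data, path)
restored = load(path)
```

The trade-off is the one discussed in this thread: the file is only readable from Python, and loading a pickle from an untrusted source can execute arbitrary code.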

Understand the motivation. If XML is being used as a bullet point, you
may consider politely suggesting better, cheaper, faster,
faster-to-*develop* alternatives (cPickle). Failing that and if you
never intend to transfer the data anywhere, then use the XML marshaler
for the buzzword compliance and ease-of-use pickling.

(Thought: XML should never be your *first* choice of file format. It is
the choice of *last* resort, when you absolutely *need* easy parsing in
multiple languages or environments and can't get it any other way. It is
then a much better choice than other formats, but only under those
limited, albeit extremely popular, conditions.)

Peter Hansen

Jul 11, 2002, 11:41:10 PM
Jeremy Bowers wrote:
>
> (Thought: XML should never be your *first* choice of file format. It is
> the choice of *last* resort, when you absolutely *need* easy parsing in
> multiple languages or environments and can't get it any other way. It is
> then a much better choice than other formats, but only under those
> limited, albeit extremely popular, conditions.)

I have to disagree with those who say, in effect, that XML is such
an unsuitable technology that it is truly a last resort.

XML was adapted from SGML to meet several key goals. Among those
was the desire to make a format which was easily human-readable
and editable using simple tools like text editors. For some people,
this is a *huge* benefit of using XML even for storage
of simple data structures, which binary and/or specialized formats
such as pickles do not have.

In addition, it leaves open the possibility of very easily
processing the data using one of the steadily growing number
of XML utilities. One extremely simple example of this sort
of thing is using IE to load an arbitrary XML file to observe
that the file is in fact well-formed, and to get a quick idea
of the structure of it. Not quite so simple with some other
formats.

It's obviously a religious issue, but I get the feeling that while
some buy into the XML hype wholesale, and overuse it to their
detriment, others are now jumping on some anti-XML bandwagon
(possibly) without having really put it to the test. I often find
that a choice to use XML opens up interesting avenues which would not
even have occurred to me had I started off with another format,
and so far I'm not sure I regret any particular case where I've
used it.

Now I'm *not* saying XML is always a first choice, and certainly
a pickle is quite likely the Simplest Thing That Could Possibly
Work and therefore a good first choice, but I do not think it
deserves to be relegated to the abyss of "last resort".

-Peter

Oren Tirosh

Jul 12, 2002, 1:54:40 AM
On Thu, Jul 11, 2002 at 11:41:10PM -0400, Peter Hansen wrote:
> Jeremy Bowers wrote:
> >
> > (Thought: XML should never be your *first* choice of file format. It is
> > the choice of *last* resort, when you absolutely *need* easy parsing in
> > multiple languages or environments and can't get it any other way. It is
> > then a much better choice than other formats, but only under those
> > limited, albeit extremely popular, conditions.)
>
> I have to disagree with those who say, in effect, that XML is such
> an unsuitable technology that it is truly a last resort.
>
> XML was adapted from SGML to meet several key goals. Among those
> was the desire to make a format which was easily human-readable
> and editable using simple tools like text editors.

XML combines all the inefficiency of text-based formats with most of the
unreadability of binary formats :-)

> In addition, it leaves open the possibility of very easily
> processing the data using one of the steadily growing number
> of XML utilities.

A growing number of utilities is a sign that a format is popular, not that
it is good. In fact, it may be argued that a simpler format would actually
need fewer tools.

adding-more-oil-to-the-religious-war-fire-ly yours,

Oren

Matt Gerrans

Jul 12, 2002, 3:03:28 AM
> XML combines all the inefficiency of text-based formats with most of the
> unreadability of binary formats :-)

Maybe it was surreptitiously introduced by a consortium of router and disk
drive manufacturers, to ensure increasing sales of hardware, by its
voracious appetite for bandwidth and disk space. ;-)


Alex Martelli

Jul 12, 2002, 4:01:36 AM
Jeremy Bowers wrote:
...

> the choice of *last* resort, when you absolutely *need* easy parsing in
> multiple languages or environments and can't get it any other way. It is

I think this assertion, as it stands, is untenable. There just about
IS *some* other way -- e.g., inventing your own little language for
data description and writing from scratch the needed parsers in all
languages and environments of interest.

Are you SERIOUSLY claiming that such reiterated reinventions of the
wheel -- which were a good part of the data interchange "state of the
art" before XML appeared -- should be used in preference to XML?!

Similar comments apply to most traditional ways of data interchange
in heterogeneous environments -- overextensions of simplistic formats
such as CSV, empirically-determined parsing and heuristics for
unspecified and underspecified proprietary and human-oriented formats,
rigid and unportable binaries. XML is generally preferable to any
of these traditional kludges, even though one or more of them most
often IS available and thus breaks your dubious criterion of "can't
get it in any other way".

Apparently, the ridiculous over-hype that has greeted XML in parts
of the media (including much non-technical media) is triggering an
allergic reaction of similarly-ridiculous hostility. On one side
we see abuses such as XML files used where (e.g.) relational DB's
would be the obvious solution, on the other, broadsides such as
this one definitely appears to be.

Fortunately, there's a lot of us engineers in the middle, applying
skeptical, field-tested "filters" to media hype AND other overbroad
tirades. XML is often a good choice for heterogeneous-environment data
interchange -- far from being "the choice of *last* resort" for such
tasks, it's generally a good default choice unless some obviously
better alternative is evident. For tasks that are borderline cases
of "heterogeneous data interchange", such as storing and retrieving
(or communicating) data among a set of programs that aren't really
all that heterogeneous, XML should still be considered when ability
to examine and potentially tweak the stored data with other programs
is of interest, and the costs wrt proprietary or language-specific
formats (in terms of time and/or space) aren't out of line with the
potential benefits.


Alex

Jonathan Hogg

Jul 12, 2002, 4:07:53 AM
On 12/7/2002 6:54, in article
mailman.102645331...@python.org, "Oren Tirosh"
<oren...@hishome.net> wrote:

> XML combines all the inefficiency of text-based formats with most of the
> unreadability of binary formats :-)

Snicker away, but XML is the closest we've got to a universally accepted
structured data format.

> A growing number of utilities is a sign that a format is popular, not that
> it is good. In fact, it may be argued that a simpler format would actually
> need fewer tools.

It doesn't need to be "good". I'm not even sure what you mean by "good". If
you feel you can do a better job of designing an extensible structured data
format that you can convince the rest of the world to write parsers and
generators for that plug into pretty much every available language, editor
and database, then be my guest.

If I need to exchange some structured data with someone else, I can spend an
unbounded amount of time agreeing a format in advance with them or I can
just pick a sensible looking schema and dump it to XML. As long as the other
party can see what the different tags mean they can trivially import it.

I'm not sure it is possible to "overuse" XML. If you need to read and write
structured data, why bother coming up with your own format? (see: the entire
contents of /etc) Or why use something that is proprietary to a particular
language or system? (see: Pickle)

Jonathan

Erik Max Francis

Jul 12, 2002, 4:16:18 AM
Peter Hansen wrote:

> It's obviously a religious issue, but I get the feeling that while
> some buy into the XML hype wholesale, and overuse it to their
> detriment, others are now jumping on some anti-XML bandwagon
> (possibly) without having really put it to the test. I often find
> that a choice to use XML opens up interesting avenues which would not
> even have occurred to me had I started off with another format,
> and so far I'm not sure I regret any particular case where I've
> used it.

It's one of those things that has been elevated to the status of
buzzword, regardless of its actual benefits and disadvantages. That
means that some people will think it is the One True Way and will use it
for absolutely every hole they can possibly cram the peg into (whether
it's appropriate for it or not), and on the same side some antizealots
who see some problems with it will exaggerate and insist that it's an
awful tool (often indicated by "X sucks" and "X is evil" without further
explication) and berate anyone who uses it.

XML has its uses, and for those uses it is very well suited, but like
any new idea there are those who will use it just for the sake of using
it, totally disregarding whether it's well-suited to the task. It's an
unfortunate fact of life, regrettably.

--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/ \ See the son in your bad day / Smell the flowers in the valley
\__/ Chante Moore
Bosskey.net: Aliens vs. Predator 2 / http://www.bosskey.net/avp2/
A personal guide to Aliens vs. Predator 2.

Alex Martelli

Jul 12, 2002, 4:37:59 AM
Jonathan Hogg wrote:
...

> I'm not sure it is possible to "overuse" XML.

It is -- easily. My pet peeve is the idea of using XML files for
tasks that obviously need a real database, preferably a relational one.

I think some people never really GOT relational databases, no matter
that they've been around for decades and are so widespread, and they're
now turning to *overusing* XML to cover up for that:-).

> If you need to read and
> write structured data, why bother coming up with your own format? (see:
> the entire contents of /etc) Or why use something that is proprietary to a
> particular language or system? (see: Pickle)

Speed, size, and convenience are possible reasons. If the structure
is highly repetitious, and the amount of data is very large, then
repeating the tags identically a zillion times can impose substantial
overhead of space and time. Pickling and unpickling can be (say)
twice as fast and consume half as much space as going to an XML
format. That doesn't really matter unless the amounts of data are
huge, of course. But the point is, sometimes they ARE. I/O-bound
programs are hardly a thing of the past: CPU's get faster much
faster than networks and disks get faster.

The need to search "random-access" wise, or "keyed"-wise, is often
an excellent reason to avoid a format that requires reading through
all of a file to get at a particular piece of data. dbm variants,
shelve, and relational databases, can be huge wins here (compared to
XML, pickle, or any other choice requiring whole-file reloading).
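
As a rough illustration of keyed access, the sketch below stores a thousand records with shelve and then fetches one by key without loading the rest. It assumes a current Python 3, where shelve.open can be used as a context manager; the file name and the record layout are invented for the example.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hosts")

# Write phase: each assignment pickles one value under its string key.
with shelve.open(path) as db:
    for i in range(1000):
        db["host%d" % i] = {"address": "10.0.%d.%d" % (i // 256, i % 256)}

# Read phase: only the requested record is fetched and unpickled.
with shelve.open(path) as db:
    record = db["host500"]
    count = len(db)
```

With a single XML or pickle file, the equivalent lookup would normally mean parsing or unpickling the whole file first.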

While XML is reasonably human-editable, there may well be formats
that are more convenient than it for this purpose, avoiding the
need of special-purpose XML-oriented editors and allowing the use
of any good old text editor with maximal ease. This is a good
reason to keep a human-editable configuration file in non-XML
form, in my opinion.


Let's try to avoid pro-XML hype in an attempt to counter the
anti-XML hype that's suddenly burst on this group...:-)


Alex

Doru-Catalin Togea

Jul 12, 2002, 5:43:54 AM
Hei!

This thread has produced a lot of insight so far, both pros and cons.
That's good for those of us who have not yet thought things through.

> It is -- easily. My pet peeve is the idea of using XML files for
> tasks that obviously need a real database, preferably a relational one.
>
> I think some people never really GOT relational databases, no matter
> that they've been around for decades and are so widespread, and they're
> now turning to *overusing* XML to cover up for that:-).

Personal experience:

I once tried to install MySQL on my Windows ME laptop, to store
the contents of a website I am working on. I found the process
cumbersome.

Even though I am working on my master's degree in computer science, I
prefer to USE computers for specific tasks rather than configure
utilities. I very much appreciate tools that let me concentrate on
PRODUCING something with them, rather than on learning how to make them
work the way I want.

Unfortunately, in my experience, freeware needs a lot of configuration in
order to work properly, and the documentation is often inaccurate.

I would like to use Linux (and I have tried several times), but I am not
willing to put as much effort into configuring it as it demanded.

It seems that there is no driver for my video card (Savage S3, 3D, TV-out)
for Windows 2000 or XP, so I am stuck with Win ME. (I can run Win 2000 with
a generic screen driver, which gives me at most 800x600.)

I could mention other difficulties I encountered as I tried to configure
my system to cover all my needs by trying several OSs and corresponding
tools, but to make a long story short, I am still working in Win ME and I
store the contents of my website in XML files.

Not by choice but by need. But I'm hoping for better days.

Catalin


<<<< ================================== >>>>
<< We are what we repeatedly do. >>
<< Excellence, therefore, is not an act >>
<< but a habit. >>
<<<< ================================== >>>>

Cameron Laird

Jul 12, 2002, 9:27:21 AM
In article <aglv0h$u57$1...@slb7.atl.mindspring.net>,

CORBA people *know* this is true.
--

Cameron Laird <Cam...@Lairds.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html

Cameron Laird

Jul 12, 2002, 9:35:11 AM
In article <HzwX8.71016$vm5.2...@news2.tin.it>,
Alex Martelli <al...@aleax.it> wrote:
.
.

.
>It is -- easily. My pet peeve is the idea of using XML files for
>tasks that obviously need a real database, preferably a relational one.
>
>I think some people never really GOT relational databases, no matter
>that they've been around for decades and are so widespread, and they're
>now turning to *overusing* XML to cover up for that:-).
.
.
.
I know just what'll help you feel better, Alex--thoughts
of the RDBMS vendors advertising that their datastores
have magically become the best places in the world to
keep all your tree-structured data.

The part that might shock you is that I agree with them
occasionally.

Jonathan Hogg

Jul 12, 2002, 9:44:10 AM
On 12/7/2002 9:37, in article HzwX8.71016$vm5.2...@news2.tin.it, "Alex
Martelli" <al...@aleax.it> wrote:

> Jonathan Hogg wrote:
> ...
>> I'm not sure it is possible to "overuse" XML.
>
> It is -- easily. My pet peeve is the idea of using XML files for
> tasks that obviously need a real database, preferably a relational one.
>
> I think some people never really GOT relational databases, no matter
> that they've been around for decades and are so widespread, and they're
> now turning to *overusing* XML to cover up for that:-).

OK, I'll narrow my earlier statement to:

I'm not sure, when discussing file formats for off-line structured
data, that it is possible to "overuse" XML.

Clearly when the data has to be accessed on-line (by this I mean that the
file is left open and used throughout the lifetime of the application) a
database makes much more sense. I was assuming my statement would be read
within the context of the thread (Pickling data).

>> If you need to read and
>> write structured data, why bother coming up with your own format? (see:
>> the entire contents of /etc) Or why use something that is proprietary to a
>> particular language or system? (see: Pickle)
>
> Speed, size, and convenience are possible reasons. If the structure
> is highly repetitious, and the amount of data is very large, then
> repeating the tags identically a zillion times can impose substantial
> overhead of space and time. [...]

I like to consider these as special cases where I have no choice but to use
a compact file format. Even there, if space is the issue (rather than time),
running the file through a good compressor/decompressor as it is
written/read is likely to result in much better savings than trying to think
of a super-compact binary layout.
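
A quick way to check this on one's own data is to measure it. The sketch below builds a deliberately repetitive data set and compares the size of its XML form, the same XML run through gzip, and a pickle. It assumes a current Python 3 standard library; the record layout is invented, and the numbers will vary with the data, so none are quoted here.

```python
import gzip
import pickle
import xml.etree.ElementTree as ET

# A thousand records with identical structure, so the tags repeat heavily.
records = [
    {"name": "host%d" % i, "address": "10.0.%d.%d" % (i // 256, i % 256)}
    for i in range(1000)
]

root = ET.Element("records")
for rec in records:
    node = ET.SubElement(root, "record")
    for key, value in rec.items():
        ET.SubElement(node, key).text = value

xml_bytes = ET.tostring(root)
compressed = gzip.compress(xml_bytes)

sizes = {
    "xml": len(xml_bytes),
    "xml+gzip": len(compressed),
    "pickle": len(pickle.dumps(records)),
}
for label, size in sizes.items():
    print("%-9s %d bytes" % (label, size))
```

Compression addresses space only; the parser still has to process every tag after decompression, so the time cost mentioned in the quoted text remains.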

> The need to search "random-access" wise, or "keyed"-wise, is often
> an excellent reason to avoid a format that requires reading through
> all of a file to get at a particular piece of data. dbm variants,
> shelve, and relational databases, can be huge wins here (compared to
> XML, pickle, or any other choice requiring whole-file reloading).

As above, a specialised database will always make more sense here. Though
arguably, for small quantities of data that can be slurped completely into
memory, it may not make any difference. Also, XML databases can be used and
make a lot of sense if the data can be arbitrarily structured. People are
often very poor at designing relational schemas.

> While XML is reasonably human-editable, there may well be formats
> that are more convenient than it for this purpose, avoiding the
> need of special-purpose XML-oriented editors and allowing the use
> of any good old text editor with maximal ease. This is a good
> reason to keep a human-editable configuration file in non-XML
> form, in my opinion.

I just don't think that this argument is strong. XML is readable enough and
most decent editors will provide at least rudimentary syntax highlighting
for XML. With the addition of DTDs or Schemas it is easy for a generic XML
editor to validate the format of the file - very useful for complex
configuration files.

I think the most value in using XML comes in the associated standards such
as XSLT, XPointer, XPath, etc. Consider a recent question on obtaining a
list of nameservers on a machine. The answer on UNIX was variously given as:
"read /etc/resolv.conf pulling out the lines beginning with 'nameserver' and
grabbing the part after that".

Consider if resolv.conf used XML - perhaps something looking like:

<resolve-config>
  <searchpath>
    <domain> python.org </domain>
  </searchpath>
  <nameservers>
    <nameserver>123.45.67.89</nameserver>
    <nameserver>87.65.43.21</nameserver>
  </nameservers>
</resolve-config>

You might say that this is verbose and monstrous, but it's readable and
fairly obvious in meaning. However, going back to the question, we can now
reference the information required using a simple XPath query:

>>> from xml.dom.minidom import parse
>>> from xml.xpath import Evaluate
>>>
>>> config = parse( 'resolve-config.xml' )
>>> query = '/resolve-config/nameservers/nameserver/text()'
>>> for node in Evaluate( query, config ):
...     print node.nodeValue
...
123.45.67.89
87.65.43.21
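
The session above relies on PyXML's xml.xpath module, which is not part of current Python distributions. As a hedged alternative, roughly the same lookup can be sketched with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The function name nameservers_from_xml is made up for the example, and the XML is the hypothetical format from this post.

```python
import xml.etree.ElementTree as ET

CONFIG = """\
<resolve-config>
  <searchpath>
    <domain> python.org </domain>
  </searchpath>
  <nameservers>
    <nameserver>123.45.67.89</nameserver>
    <nameserver>87.65.43.21</nameserver>
  </nameservers>
</resolve-config>
"""


def nameservers_from_xml(xml_text):
    """Return the text of every nameserver element, in document order."""
    root = ET.fromstring(xml_text)  # root is the resolve-config element
    return [node.text.strip()
            for node in root.findall("./nameservers/nameserver")]


servers = nameservers_from_xml(CONFIG)
```

The path is written relative to the root element, since ElementTree's findall starts from the element it is called on rather than from the document.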

Now someone can follow up to this post with a similarly small program that
would have pulled the information out of the normal resolv.conf file
faster, but this is probably the simplest format of any configuration file
used in a UNIX system. Try getting anything useful out of an Apache config
file.
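
For reference, such a follow-up program for the ordinary resolv.conf format might look like the sketch below (current Python 3; the sample text and the function name nameservers_from_text are invented for the example). It covers only the simple "nameserver ADDRESS" lines described earlier in this post.

```python
SAMPLE = """\
# sample resolv.conf
search python.org
nameserver 123.45.67.89
nameserver 87.65.43.21
"""


def nameservers_from_text(text):
    """Collect the address from each 'nameserver ADDRESS' line."""
    servers = []
    for line in text.splitlines():
        fields = line.split()
        # Comment lines and other keywords have a different first field.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


servers = nameservers_from_text(SAMPLE)
```

For a real system the text would come from reading /etc/resolv.conf instead of the SAMPLE string.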

If Apache used XML, then I could write an XPath query that would pull out
the document root of a particular virtual server with a single query
something like:

//virtual[normalize-space(server-name)='www.python.org']/document-root

A good XML editor would be able to point out errors in my configuration file
as I make them, and I could write an XSLT transformer to convert the
configuration file into a nice XHTML document, or more importantly use an
XSLT transformer to *create* the configuration file from XML website
specifications conveniently stored in a central XML configuration database.
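
To show what such a query could look like in practice, here is a sketch against an entirely hypothetical XML rendering of a virtual-host configuration; Apache does not use this format, and the element names simply follow the query above. It uses xml.etree.ElementTree from a current Python 3, whose XPath subset has no normalize-space(), so the sketch matches the server name exactly.

```python
import xml.etree.ElementTree as ET

# Hypothetical format, invented to match the query in the post.
HTTPD_XML = """\
<httpd-config>
  <virtual>
    <server-name>www.python.org</server-name>
    <document-root>/var/www/python</document-root>
  </virtual>
  <virtual>
    <server-name>www.example.org</server-name>
    <document-root>/var/www/example</document-root>
  </virtual>
</httpd-config>
"""


def document_root(xml_text, server_name):
    """Return the document root of the named virtual server, or None."""
    root = ET.fromstring(xml_text)
    # [server-name='...'] selects virtual elements whose server-name child
    # has exactly this text; server_name must not contain a single quote.
    query = ".//virtual[server-name='%s']/document-root" % server_name
    node = root.find(query)
    return None if node is None else node.text


python_root = document_root(HTTPD_XML, "www.python.org")
missing_root = document_root(HTTPD_XML, "www.nowhere.invalid")
```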

> Let's try to avoid pro-XML hype in an attempt to counter the
> anti-XML hype that's suddenly burst on this group...:-)

Agreed :-)

But on the other hand, I don't think this is hype. This is stuff I have to
deal with all the time. My life would be about an order of magnitude easier
if everything used XML for configuration and interchange.

And the great thing? It's already happening, and I already see things being
made easier for me.

Jonathan

Alex Martelli

Jul 12, 2002, 10:03:28 AM
Cameron Laird wrote:

> I know just what'll help you feel better, Alex--thoughts
> of the RDBMS vendors advertising that their datastores
> have magically become the best places in the world to
> keep all your tree-structured data.
>
> The part that might shock you is that I agree with them
> occasionally.

A well-implemented RDBMS is an excellent place in which to
keep data with whatever structure -- you just have to ensure
you represent that structure as a normalized relational form,
to get all the benefits. E.g., each parent <-> children relation
in a tree is a 2-column table of (parentid, childid).
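
A small sketch of that representation, using the sqlite3 module from a current Python 3 standard library (not something available when this was written). The table and column names other than parentid and childid are assumptions, and the recursive query relies on SQLite's WITH RECURSIVE support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE edge ("
    " parentid INTEGER NOT NULL REFERENCES node(id),"
    " childid INTEGER NOT NULL UNIQUE REFERENCES node(id))"  # one parent each
)
conn.executemany("INSERT INTO node VALUES (?, ?)", [
    (1, "resolve-config"), (2, "searchpath"), (3, "nameservers"),
    (4, "domain"), (5, "nameserver"), (6, "nameserver"),
])
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 4), (3, 5), (3, 6)])


def children(conn, parent_id):
    """Names of the direct children of one node."""
    rows = conn.execute(
        "SELECT n.name FROM edge e JOIN node n ON n.id = e.childid"
        " WHERE e.parentid = ? ORDER BY n.id", (parent_id,))
    return [name for (name,) in rows]


def descendants(conn, root_id):
    """Names of every node below root_id, found by walking the edge table."""
    rows = conn.execute(
        "WITH RECURSIVE subtree(id) AS ("
        "  SELECT childid FROM edge WHERE parentid = ?"
        "  UNION ALL"
        "  SELECT e.childid FROM edge e JOIN subtree s ON e.parentid = s.id"
        ") SELECT n.name FROM subtree JOIN node n ON n.id = subtree.id"
        " ORDER BY n.id", (root_id,))
    return [name for (name,) in rows]
```

Direct children are a plain join; only the "everything below this node" question needs recursion.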

I have nothing against building "richer" structure automatically
on top of relational power -- indeed, that's a good part of
what AB Strakt's application framework is all about. But I've
always been wary of using specific RDBMSs' datamodel extensions
(such as, e.g., PostgreSQL's "inheritance", tempting though it
may be) -- and so far I've managed to prosper without ever tying
any production system to any of them. Maybe if I'd ever had to
do serious, 'production' OLAP, e.g., I'd feel differently. But
I doubt XML's existence is going to change things all that much.

The ability to export some query's results as XML -- and import
such XML back to do inserts/updates -- sounds like a perfectly
reasonable utility to have for a RDBMS, just like long-standing
similar abilities for CSV and other textual file formats. Again,
such extras need not alter the RDBMS's relational abilities -- I
most definitely hope they don't!

Somebody else commented that "XML databases" are a good idea
because some programmers are bad at designing relational schemas.
<shudder>. Now THAT is an idea that sends shivers down my spine.
Maybe I'm just too pessimistic, but I'd really like to look at
the relational schemas autogenerated from DTD's or whatever --
and if the underlying relational stuff isn't there, or isn't at
all accessible, then please include me out of such plans.

Maybe I _am_ getting better understanding of where the anti-XML
rage comes from. A few years of such prospects, and I might
start on an anti-XML crusade too, if I don't watch myself...:-).


Alex

Jeremy Bowers

Jul 12, 2002, 10:52:32 AM
Peter Hansen wrote:

> It's obviously a religious issue, but I get the feeling that while
> some buy into the XML hype wholesale, and overuse it to their
> detriment, others are now jumping on some anti-XML bandwagon
> (possibly) without having really put it to the test.

I'm not anti-XML. But I will say XML is only appropriate in rather
narrow circumstances as I outlined. As it so happens, these narrow
circumstances are rather popular, so XML is frequently appropriate. But
you should still use XML only when there are no better choices,
*because* you need transferability, extensibility, or human readability,
and nothing else really meets those needs, *not* because it's the first
thing that leaps to mind. For instance, matching what Pickle does is quite
challenging in XML for a novice who doesn't have the XML marshaler at hand.

Please don't read anti-XML sentiments into my message. ;-) I said what I
meant and I means what I say. It's not a sweeping condemnation, it's a
rather limited observation.

Jeremy Bowers

Jul 12, 2002, 10:55:12 AM
Alex Martelli wrote:
> Jeremy Bowers wrote:
> ...
>
>>the choice of *last* resort, when you absolutely *need* easy parsing in
>>multiple languages or environments and can't get it any other way. It is
>
>
> I think this assertion, as it stands, is untenable. There just about
> IS *some* other way -- e.g., inventing your own little language for
> data description and writing from scratch the needed parsers in all
> languages and environments of interest.

That's order N effort rather than constant effort, thus that's not easy
parsing in other environments, that's the virtually impossible job we
were faced with 10 years ago. (SGML wasn't all that easy either, from
what I gather.)

> Are you SERIOUSLY claiming that such reiterated reinventions of the
> wheel -- which were a good part of the data interchange "state of the
> art" before XML appeared -- should be used in preference to XML?!

No. This rather negates the rest of your message, since it seems to
be based on my making that non-claim.

Jeremy Bowers

Jul 12, 2002, 11:00:21 AM
Alex Martelli wrote:
> It is -- easily. My pet peeve is the idea of using XML files for
> tasks that obviously need a real database, preferably a relational one.
>
> I think some people never really GOT relational databases, no matter
> that they've been around for decades and are so widespread, and they're
> now turning to *overusing* XML to cover up for that:-).

This was the idea in the core of my point. Databases are far more
powerful than any XML library I've ever seen... not counting the XML
databases, I suppose, which don't seem to be in the common use that
relational databases are (and which I don't know much about, except $$$).
Don't decide on XML until you've eliminated relational databases, for
instance. (In an experienced engineer's head, that elimination can be
instantaneous, based on the knowledge and experience of the engineer. In
a new programmer, it may take actual thinking.) Indeed, relational
databases are almost as cross-platform and cross-language as XML is!

But if you *need* transferability and extensibility, you'll probably end
up with XML. Databases don't just sit on a file system or get forwarded
in email very well.

Jonathan Hogg

Jul 12, 2002, 11:11:34 AM
On 12/7/2002 15:03, in article QkBX8.72747$vm5.2...@news2.tin.it, "Alex
Martelli" <al...@aleax.it> wrote:

> Somebody else commented that "XML databases" are a good idea
> because some programmers are bad at designing relational schemas.
> <shudder>.
>
> Now THAT is an idea that sends shivers down my spine.
> Maybe I'm just too pessimistic, but I'd really like to look at
> the relational schemas autogenerated from DTD's or whatever --
> and if the underlying relational stuff isn't there, or isn't at
> all accessible, then please include me out of such plans.

Heh. Now that would have been me ;-)

Except what I actually said was:

> Also, XML databases can be used and make a lot of sense if the data can be
> arbitrarily structured. People are often very poor at designing relational
> schemas.

If the data is complex and hierarchical then a good relational schema is
going to be very hard to produce, and thus more likely than not going to be
done badly.

XML databases can be built (as can most data repositories) on top of an
RDBMS, but often they are pure databases similar to an OODB - this is
because relational databases aren't fundamentally very good at managing
arbitrary hierarchical information, which was my point.

Hierarchical data is also not well-suited to relational querying (Oracle's
"START WITH ... CONNECT BY" being the best attempt I've seen at it), which
is why XML querying languages like XPath and (soon) XQuery exist. An XML
database is best thought of as a hierarchical filesystem of XML files.
Except that, unlike a filesystem of XML files, the database can maintain
indices and a layout that enables optimised querying of fragments of
multiple documents.
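
To make the relational side of this comparison concrete, here is a minimal sketch of a hierarchical query written in SQL. It does not use Oracle's "START WITH ... CONNECT BY"; it uses a standard recursive common table expression instead, run through Python's sqlite3 module. The table name `node`, its columns, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Hypothetical adjacency-list table: each row points at its parent row.
connection.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, name TEXT)"
)
connection.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, None, "resolve-config"),
        (2, 1, "searchpath"),
        (3, 2, "domain"),
        (4, 1, "nameservers"),
        (5, 4, "nameserver"),
    ],
)

# Walk the tree from the root, carrying the depth along.
rows = connection.execute(
    """
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM node WHERE parent IS NULL
        UNION ALL
        SELECT node.id, node.name, subtree.depth + 1
        FROM node JOIN subtree ON node.parent = subtree.id
    )
    SELECT name, depth FROM subtree ORDER BY id
    """
).fetchall()

for name, depth in rows:
    print("  " * depth + name)
```

The query works, but the tree shape has to be rebuilt by the query on every read, which is the awkwardness being described.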

For a reasonable example, take a look at Apache Xindice:

<http://xml.apache.org/xindice/>

One should be wary of getting stuck in one mindset - even if it is a very
good one ;-)

Jonathan

François Pinard

unread,
Jul 12, 2002, 10:37:30 AM7/12/02
to
[Jonathan Hogg]

> Consider if resolv.conf used XML - perhaps something looking like:

> <resolve-config>
> <searchpath>
> <domain> python.org </domain>
> </searchpath>
> <nameservers>
> <nameserver>123.45.67.89</nameserver>
> <nameserver>87.65.43.21</nameserver>
> </nameservers>
> </resolve-config>

> You might say that this is verbose and monstrous,

Your wish is my command. This is verbose and monstrous! :-)

> but it's readable and fairly obvious in meaning.

The original non-XML format is also pretty readable and obvious in meaning.
Surely, there are advantages to XML, but at first glance here, it seems we
gain nothing but verbosity and monstrosity. In my opinion, the advantages
have to be pretty real to justify such a change. We should not go XML
for the only sake of going XML.

--
François Pinard http://www.iro.umontreal.ca/~pinard

François Pinard

unread,
Jul 12, 2002, 10:49:11 AM7/12/02
to
[Peter Hansen]

> XML was adapted from SGML to meet several key goals. Among those was
> the desire to make a format which was easily human-readable and editable
> using simple tools like text editors.

I do not think so. SGML already has these virtues, and much more than XML.
The main goal of XML was to please machines, because full SGML is so
difficult to parse. The price to pay to please machines was a loss of
readability for humans. XML is so verbose that real information gets
overwhelmed under tags. The fish often gets drowned in practice.

This being said, XML has been immensely successful, at least in the
democratic fields, because it is so much easier to implement. Moreover, many
people consider it is an acceptable compromise between humans and machines.
But at least compared to SGML, we have to remember that it is a compromise
away from humans and towards machines, and never turn the argument around.

Harvey Thomas

unread,
Jul 12, 2002, 11:03:36 AM7/12/02
to
François Pinard wrote:
[...snip...]


> I do not think so. SGML already has these virtues, and much
> more than XML.
> The main goal of XML was to please machines, because full SGML is so
> difficult to parse. The price to pay to please machines was a loss of
> readability for humans. XML is so verbose that real information gets
> overwhelmed under tags. The fish often gets drawn in practice.

[..snip...]

I think you should substitute "software writers" for "machines". A full SGML parser, which must always be validating, is indeed very hard to write (I only know of two that really meet the spec), but an XML parser, particularly a well-formedness-only parser, is much easier to write.



François Pinard

unread,
Jul 12, 2002, 11:09:39 AM7/12/02
to
[Harvey Thomas]

> > The main goal of XML was to please machines, because full SGML is so
> > difficult to parse.

> I think you should substitute "software writers" for "machines".

Of course.

Jeremy Bowers

unread,
Jul 12, 2002, 11:53:45 AM7/12/02
to
Jeremy Bowers wrote:

> Alex Martelli wrote:
>> I think this assertion, as it stands, is untenable. There just about
>> IS *some* other way -- e.g., inventing your own little language for
>> data description and writing from scratch the needed parsers in all
>> languages and environments of interest.
>
>
> That's order N effort rather then constant effort, thus that's not easy
> parsing in other environments, that's the virtually impossible job we
> were faced with 10 years ago. (SGML wasn't all that easy either, from
> what I gather.)

Though it occurs to me to point out a bit later that parsing somebody
else's XML format can feel about as hard as parsing a new little
language; XML libraries take care of the tokenizing and parsing step but
you still get to do the semantic analysis yourself. If you're good with
writing parsers (which usually implies tokenizing is a non-issue to that
person), this can be a bad tradeoff if you could write a much simpler
special-purpose language that would tokenize and parse into something
more closely appropriate to the task at hand. Sometimes straight into
the desired format; I, and others, have used Python source code files
directly as configuration files before. Parsing an XML-version of those
files is a pain, relative to the ease of 'parsing' the Python
configuration file:

from configuration import *
or
import configuration

Complete with robust error detection.
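
A minimal sketch of the idea, written for present-day Python. The module name `configuration` comes from the import lines above; the file contents and the setting names `search_domain` and `nameservers` are made up for illustration. The file is created in a temporary directory only so that the sketch runs on its own.

```python
import importlib.util
import os
import tempfile

# Hypothetical configuration file: nothing but plain Python assignments.
CONFIG_SOURCE = '''
search_domain = "python.org"
nameservers = ["123.45.67.89", "87.65.43.21"]
'''


def load_config(path):
    """Import the Python file at `path` and return it as a module object."""
    spec = importlib.util.spec_from_file_location("configuration", path)
    module = importlib.util.module_from_spec(spec)
    # A malformed file raises SyntaxError here, with file name and line number.
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "configuration.py")
    with open(config_path, "w") as handle:
        handle.write(CONFIG_SOURCE)
    configuration = load_config(config_path)

print(configuration.search_domain)
print(configuration.nameservers)
```

The price of this convenience is that the configuration file is executed as code, so it only suits files written by people you trust.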

Part of the reason XML is useful is that programmers who have not
dedicated their lives to writing compilers find it easier to muddle
through the semantic analysis than to write a good tokenizer and parser,
as the latter largely *can't* be muddled through for any non-trivial
language. (Parsers are powerful but do like to bite hard.)

Of course, other reasons include the fact that this tokenizer/parser is
available in multiple languages, reasonably human readable, etc. This
message is an observation and food for thought, not the whole of my XML
views. I may not even agree with this observation tomorrow. ;-)

Peter Hansen

unread,
Jul 12, 2002, 12:08:08 PM7/12/02
to
François Pinard wrote:
>
> [Peter Hansen]

>
> > XML was adapted from SGML to meet several key goals. Among those was
> > the desire to make a format which was easily human-readable and editable
> > using simple tools like text editors.
>
> I do not think so.

From http://www.idealliance.org/standards_xml.asp, picked arbitrarily
from a Google search for "xml design goals":

The design goals for XML are:

XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute
minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.

I believe "human-legible and reasonably clear", and "easy to create"
pretty much cover what I said above.

-Peter


Jonathan Hogg

unread,
Jul 12, 2002, 12:20:46 PM7/12/02
to
On 12/7/2002 15:37, François Pinard wrote:

>> You might say that this is verbose and monstrous,
>
> Your wish is my command. This is verbose and monstrous! :-)
>
>> but it's readable and fairly obvious in meaning.
>
> The original non-XML format is also pretty readable and obvious in meaning.
> Surely, there are advantages to XML, but at first glance here, it seems we
> gain nothing but verbosity and monstrosity. In my opinion, the advantages
> have to be pretty real to justify such a change. We should not go XML
> for the only sake of going XML.

I had thought I'd given arguments and examples as to why the XML was more
useful in the rest of the post, but perhaps I wasn't as clear as I thought I
had been.

To summarise the advantages of using XML as I see them:

* Standardised parsing (PyXML etc.)
* Standardised validation (DTDs, XSchema)
* Standardised editing (XML-aware editors)
* Standardised querying (XPath, XQuery)
* Standardised transformation (XSLT)
* Standardised storage (XML:DB)

I really am willing to eat humble pie here and admit that I'm mistaken if
someone can give me a similar list of good reasons to *not* use XML for
off-line hierarchically structured data.

Like I said before, I can spend time trying to think of a structured file
format for each application I work on, or I can just use an XML schema.
Thanks to batteries-included parsers it's easy to read and write, the data
is readily re-purposed for other uses, and I get syntax highlighting and
basic syntax checking in my editor. At a later date I can easily shift to
using an XML database instead of XML files.

Perhaps I'm missing something blindingly obvious here, but what benefits
would I gain from coming up with my own format?

[Other than people who have some kind of allergic reaction to XML would like
it more.]

Jonathan

Tim Rowe

unread,
Jul 12, 2002, 12:37:08 PM7/12/02
to
>awful tool (often indicated by "X sucks" and "X is evil" without further
>explication) and berate anyone who uses it.

And to think I thought they were complaining about X-Windows :-)

Tim Rowe

unread,
Jul 12, 2002, 12:37:18 PM7/12/02
to
On Fri, 12 Jul 2002 08:37:59 GMT, Alex Martelli <al...@aleax.it>
wrote:

>Jonathan Hogg wrote:
> ...
>> I'm not sure it is possible to "overuse" XML.
>
>It is -- easily. My pet peeve is the idea of using XML files for
>tasks that obviously need a real database, preferably a relational one.

I'd agree that XML + tools in general makes a poor substitute for a
proper DBMS (and so have most of the books I've read on it). But if
you want to move data between different databases, especially if they
use incompatible DBMSs, then XML is there as a very strong contender
(/way/ ahead of comma-separated values). And if you're worried
about size, don't forget that it can be compressed for transfer (as
per Star Base, IIRC), and any decent compression algorithm will make
easy work of all those repetitive tags.
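
The compression point is easy to check. The sketch below builds a made-up XML document with repetitive tags (the element names `table`, `row`, `id` and `name` are assumptions for illustration) and compresses it with the standard zlib module. The exact sizes depend on the data, so none are quoted here.

```python
import zlib

# Made-up records rendered with the same tags over and over.
rows = "".join(
    "<row><id>%d</id><name>item%d</name></row>\n" % (n, n) for n in range(1000)
)
document = ("<table>\n" + rows + "</table>\n").encode("utf-8")

compressed = zlib.compress(document)
restored = zlib.decompress(compressed)

print(len(document), "bytes as XML")
print(len(compressed), "bytes after zlib compression")
print("round trip intact:", restored == document)
```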

>Let's try to avoid pro-XML hype in an attempt to counter the
>anti-XML hype that's suddenly burst on this group...:-)

Should we try to avoid being anti-XML too? It's not the only data
format you'll ever need, but it's pretty good at what it sets out to
do.

Sam Penrose

unread,
Jul 12, 2002, 12:50:42 PM7/12/02
to
Elliotte Rusty Harold's XML News site Cafe Con Leche at
<http://www.cafeconleche.org/> features pithy commentary on this topic.
See especially <http://www.cafeconleche.org/quotes2002.html>.

Harold is also a big Java guy; his Java news site Cafe au Lait
<http://www.cafeaulait.org> occasionally contains positive mentions of
Python.

Mike C. Fletcher

unread,
Jul 12, 2002, 12:56:20 PM7/12/02
to
Oh goody, a religious war :)

I've been (peripherally) involved in one of those --go to XML for XML's sake--
projects. We managed to take a readable, readily-parsed, UTF-8 format
which dozens of pieces of software could read and write, which could be
readily and reliably edited with a plain-text editor, and was compact
enough for use as a web-publishing data format, and turned that ISO
standard into a format that no one would consider writing or editing by
hand (ridiculously voluminous), no-one would download over the internet
(same), no software could read or write readily (save a few sample
implementations produced by the conversion teams), and which made data
validation a hairier step for the programmes dealing with the new format.

Months (years) of effort that should have gone elsewhere, across dozens
of companies was spent dismantling and re-branding a format which had
been designed and evolved over years to do _exactly_ what it needed to
do as an interchange format for 3D data. There were conferences where
the big question wasn't "how do we make this stuff better for our
users", but "do we encode attributes as tags or tag-attributes?", and
"how do we encode the data types reliably so that editors know what they
are, but the resulting file doesn't look like trash?"

Why? Because a group of corporations had decided:

1) this format was to be a "web" format (instead of a 3D interchange
and VR format)
2) all things "web" must be blessed by being XML
3) current tools, programmes and users don't matter, because once we
are XML-based, everything will just magically work.

That, to me, is XML done totally wrong, and projects like it are why
people get allergic reactions.

The problem with cultivating an allergic reaction is that I _love_ SGML
(which I started with way back when Paul and I worked with Professor
Beam at Waterloo). I _love_ SGML (and XML by extension) as a _textual_
markup language. It's good for taking a stream of characters and
stating what type of text each bit of that stream is. It's even better
for describing truly hierarchically structured texts, such as seen in
textbooks, manuals and the like (it's not great for poetry or artistic
works of many types).

I also prefer XML for use in comp-sci problems where there's no
readily-available and superior format available (configuration files
are fine in XML (if people prefer), as long as I can sit down and edit
one from scratch without _requiring_ an XML editor (which is hard in
many examples of XML I've seen, because of the hoops being jumped
through to shoe-horn data-typing into a _textual_ markup language)).

[As a note, I've not found a decent XML editor along the lines of
InContext's SGML editor, with useful support for editing text with the
hierarchy visible, but not interfering with navigation, along with
intelligent split/join/surround/un-surround/merge/
paste-hierarchical/paste-flat/create entity/use entity/etceteras
facilities.]

I don't even mind XML being used for SOAP-like systems, "let them eat
cake if they're only going to do it a few times a minute, they don't
mind wasting some bandwidth, why should I care". In the absence of a
better format, yes, go with XML. If you're worried about long-term
storage, I'll consider an argument for XML. For real-time work, with an
only slightly better format, sure, if we don't have a lot of code
depending on that format, go ahead. If you're downgrading service to
your customers just to "be XML", then you've screwed up.


Mastery of force is not the ability to marshal great force in all
situations, but to know where, how and when to apply minimal force to
achieve maximal benefit.
Mike


François Pinard wrote:
...


>>but it's readable and fairly obvious in meaning.
>
>

> The original non-XML format is also pretty readable and obvious in meaning.
> Surely, there are advantages to XML, but at first glance here, it seems we
> gain nothing but verbosity and monstrosity. In my opinion, the advantages
> have to be pretty real to justify such a change. We should not go XML
> for the only sake of going XML.

...

Huaiyu Zhu

unread,
Jul 12, 2002, 1:45:15 PM7/12/02
to
François Pinard <pin...@iro.umontreal.ca> wrote:
>[Peter Hansen]
>
>> XML was adapted from SGML to meet several key goals. Among those was
>> the desire to make a format which was easily human-readable and editable
>> using simple tools like text editors.
>
>I do not think so. SGML already has these virtues, and much more than XML.
>The main goal of XML was to please machines, because full SGML is so
>difficult to parse. The price to pay to please machines was a loss of
>readability for humans. XML is so verbose that real information gets
>overwhelmed under tags. The fish often gets drawn in practice.

Readability for machines does not have to come at the expense of readability
for humans. A few years back I experimented with an indentation based data
format that is:

- as readable as emacs's outline mode
- reduce to common conventions like this paragraph for simple cases
- allow mixed nested structures of set, sequence, dictionary, and seqdict
- can include binary data
- can handle different encodings/encryptions in different elements
- with average less than 5% bloat, in contrast to XML's over 100% bloat

Then I learned that there was something called XML which was already (going
to be) wide-spread, and somehow put off my own data format for some later
date. However, the more I learn about XML, the more I find it badly
designed as a universal data exchange format.

> This being said, XML has been immensely successful, at least in the
>democratic fields, because it is so easier to implement. Moreover, many
>people consider it is an acceptable compromise between humans and machines.
>But at least compared to SGML, we have to remember that it is a compromise
>away from humans and towards machines, and never turn the argument around.

As with many other inventions in history, the success of XML is more
the result of some opportune associations than of its internal
strength.

Huaiyu

Clark C . Evans

unread,
Jul 12, 2002, 8:29:47 PM7/12/02
to
On Fri, Jul 12, 2002 at 09:07:53AM +0100, Jonathan Hogg wrote:
| I'm not sure it is possible to "overuse" XML. If you need to read and write
| structured data, why bother coming up with your own format? (see: the entire
| contents of /etc) Or why use something that is proprietary to a particular
| language or system? (see: Pickle)

For my purposes, YAML (http://yaml.org) is doing just
perfectly. It has lots of advantages over XML, first
it is readable and second it uses native Python objects
(instead of a document object model)

Best,

Clark
Yo! Check out YAML!
http://yaml.org

--
Clark C. Evans Axista, Inc.
http://www.axista.com 800.926.5525
XCOLLA Collaborative Project Management Software


Fredrik Lundh

unread,
Jul 13, 2002, 9:10:01 AM7/13/02
to
Jonathan Hogg wrote:

> Perhaps I'm missing something blindingly obvious here, but what benefits
> would I gain from coming up with my own format?
>
> [Other than people who have some kind of allergic reaction to XML would like
> it more.]

- readability for humans: compare Python's current syntax with
an XML-based representation of Python's AST, or some variant
thereof...

- readability for computers: no matter what compression you use,
it's a lot easier for a computer to read pixels if you store them as
bytes than if you store them as XML elements...

but between the extremes, XML wins most of the time (especially if
you stay away from SAX and DOM, and either use higher-level APIs,
or more Pythonic low-level ways to access the infoset).

but you knew that already, of course ;-)

</F>


François Pinard

unread,
Jul 13, 2002, 9:50:06 AM7/13/02
to
[Jonathan Hogg]

> To summarise the advantages of using XML as I see them:

> * Standardised parsing (PyXML etc.)
> * Standardised validation (DTDs, XSchema)
> * Standardised editing (XML-aware editors)
> * Standardised querying (XPath, XQuery)
> * Standardised transformation (XSLT)
> * Standardised storage (XML:DB)

There is no more advantage in being `XML-standardised' for the only sake of
being `XML-standardised' than in going `XML' for the only sake of going `XML'.
In the table above, `standardised' reads like a buzzword. As a Python lover,
I'm tempted to replace `standardised' by `easy' and `legible' wherever I can.

For simple system tables, like the one that was given as example previously
in this thread, I quite doubt the Linux kernel will soon go to the lengths
of XML parsing, querying, and database storage.

> I really am willing to eat humble pie here and admit that I'm mistaken if
> someone can give me a similar list of good reasons to *not* use XML for
> off-line hierarchically structured data.

Any file is a hierarchy of some sort. We often see a file being a sequence
of lines, a line being a sequence of fields or tokens, and tokens being
a sequence of characters. In many, many, really many applications, this
organisation in lines and fields is wholly satisfactory. Reusing the
enumeration above, it is easy to parse, easy to validate, easy to edit, easy
to query, easy to transform and easy to store. Let's be honest. People are
comfortable with lines and fields, examples and tools merely _abound_.
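
As an illustration of how little code the lines-and-fields organisation needs, here is a minimal sketch that reads text in the traditional resolv.conf layout. The sample values are taken from the XML example quoted earlier in the thread; the function name `parse_resolv_conf` is made up, and real resolv.conf files have more keywords and rules than this handles.

```python
# Sample text in the traditional one-keyword-per-line layout.
SAMPLE = """\
# comment lines start with '#'
search python.org
nameserver 123.45.67.89
nameserver 87.65.43.21
"""


def parse_resolv_conf(text):
    """Return a dict mapping each keyword to the list of values seen for it."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, *values = line.split()
        settings.setdefault(keyword, []).extend(values)
    return settings


config = parse_resolv_conf(SAMPLE)
print(config)
```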

XML becomes more sensible when you have a _lot_ of structure, something which
is complex, difficult, and which you have to exchange with away parties.
For simple things, it is just annoying and heavy overkill, really...

Speaking for my own situation only, as a Python lover, XML is gross overkill
even for quite complex things. It is extremely simple to pickle rather
complex structures, transmit them over wires to applications on other
machines, and unpickle them there. Using Python as an API for such usages
is natural and very comfortable, and not to say, immensely faster than XML.
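
A minimal sketch of that round trip, using only the standard pickle module. The nested `record` is made up for illustration, and the network transfer itself is left out: `payload` is simply the bytes that would be sent.

```python
import pickle

# A made-up nested structure standing in for "rather complex structures".
record = {
    "host": "foo.bar.com",
    "addresses": ["1.2.3.4", "1.2.3.5"],
    "limits": {"retries": 3, "timeout": 2.5},
    "tags": ("mail", "cache"),
}

payload = pickle.dumps(record)    # bytes, ready to be written to a socket
received = pickle.loads(payload)  # what the receiving Python process would do

print(type(payload).__name__, len(payload))
print(received == record)
```

Unpickling can run code chosen by whoever produced the bytes, so this only suits exchanges between parties that trust each other.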

Of course, I would prefer XML if I had to speak outside a Python environment,
with people offering nothing simpler than an XML interface. I've looked
into some of these fashionable avenues. So far, they invariably seem extremely
complex and hairy to me, at least for what they provide. XML is there to
give users the reassurance that they have last-resort control, after
all, to inspect what is going on, or to intervene if they ever need to. So,
for them, I really understand how valuable XML may be. It's a good thing.

In my simple situations, Python is much, much better than XML as a solution.
Moreover, Python offers me a good set of XML tools and interfaces if I have
no choice but to communicate with an outside world grokking XML! For one,
when I really need a marking language for my users, without having XML
imposed from the outside, SGML is often a better solution, as it is closer
to humans than XML. There might also be better solutions than SGML, too.
XML is mainly there to help implementors. Surely, I like humans far more
than I like machines, and this feeling mainly drives my efforts. :-)

> Perhaps I'm missing something blindingly obvious here, but what benefits
> would I gain from coming up with my own format?

For simple things? Ease, speed, simplicity, readability. Don't fear it.
The world will survive, you know, even if you sometimes don't use XML. :-)

holger krekel

unread,
Jul 13, 2002, 10:55:40 AM7/13/02
to
Huaiyu Zhu wrote:
> Readability for machines does not have to come at the expense of readability
> for humans. A few years back I experimented with an indentation based data
> format that is:
>
> - as readable as emacs's outline mode
> - reduce to common conventions like this paragraph for simple cases
> - allow mixed nested structures of set, sequence, dictionary, and seqdict
> - can include binary data
> - can handle different encodings/encryptions in different elements
> - with average less than 5% bloat, in contrast to XML's over 100% bloat

do you have any code or design documents for this?

Sounds quite interesting.

holger


Christopher Browne

unread,
Jul 13, 2002, 12:25:57 PM7/13/02
to
In the last exciting episode, pin...@iro.umontreal.ca (François Pinard) wrote:
> [Jonathan Hogg]

>> I really am willing to eat humble pie here and admit that I'm
>> mistaken if someone can give me a similar list of good reasons to
>> *not* use XML for off-line hierarchically structured data.
>
> Any file is a hierarchy of some sort. We often see a file being a
> sequence of lines, a line being a sequence of fields or tokens, and
> tokens being a sequence of characters. In many, many, really many
> applications, this organisation in lines and fields is wholly
> satisfactory. Reusing the enumeration above, it is easy to parse,
> easy to validate, easy to edit, easy to query, easy to transform and
> easy to store. Let's be honest. People are comfortable with lines
> and fields, examples and tools merely _abound_.
>
> XML becomes more sensible when you have a _lot_ of structure,
> something which is complex, difficult, and which you have to
> exchange with away parties. For simple things, it is just annoying
> and heavy overkill, really...

Heavens, that is an _excellent_ description of what's going on.

I think that also nicely describes the way that trees tend to get
nasty in SQL, too.

In fact, it more than likely characterizes why "object oriented
databases" are a controversial matter.

> Speaking for my own situation only, as a Python lover, XML is gross
> overkill even for quite complex things. It is extremely simple to
> pickle rather complex structures, transmit them over wires to
> applications on other machines, and unpickle them there. Using
> Python as an API for such usages is natural and very comfortable,
> and not to say, immensely faster than XML.

SOAP is a nice example of something that _sounds_ good, but whose
implementation turns out to be a lot uglier than you'd ideally want.

There's little point to it if you're trying to pass around
parameters/results looking like:
P = [1, 4, 7, 27, 12341, "foo"]

Many simpler and greatly more efficient marshalling schemes are
available there.

The place where it's more interesting is when you've got an XML
message looking like:

<hostlist>
<host>
<ip> 1.2.3.4 </ip>
<mainname> foo.bar.com </mainname>
<anothername> foo </anothername>
<anothername> bar.com </anothername>
<anothername> cache.bar.com </anothername>
</host>
<host>
<ip> 1.2.3.5 </ip>
<anothername> frobozz </anothername>
<anothername> mail </anothername>
<mainname> frobozz.bar.com </mainname>
</host>
</hostlist>

or
<contactlist>
<entry> <surname> Pinard </surname> <firstname> Francois </firstname>
</entry>
<entry> <company> IBM Inc </company> <url> http://www.ibm.com/ </url>
</entry>
<entry> <company> Transmeta Inc </company> <surname> Torvalds
</surname> <firstname> Linus </firstname> </entry>
</contactlist>

These are simple enough examples; the _problem_ is that the most
typical sort of SOAP handling is for these to respectively translate
into something like:

H = [ ["1.2.3.4", "foo.bar.com", "foo", "bar.com", "cache.bar.com"],
["1.2.3.5", "frobozz", "mail", "frobozz.bar.com"] ]

and

C = [[ "Pinard", "Francois"],
["IBM Inc", "http://www.ibm.com/"],
["Transmeta Inc", "Torvalds", "Linus"]]

Which are in a sense convenient enough ways to express the
information, however you're left puzzling over what the actual
intended structure is.
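
For contrast, here is a minimal sketch of a conversion that keeps the element names, so the role of each value stays visible. It is not a SOAP binding; it only uses the standard library's xml.etree.ElementTree on a shortened copy of the host list, and the function name `hosts_as_dicts` is made up for illustration.

```python
import xml.etree.ElementTree as ET

HOSTLIST = """
<hostlist>
  <host>
    <ip> 1.2.3.4 </ip>
    <mainname> foo.bar.com </mainname>
    <anothername> foo </anothername>
    <anothername> bar.com </anothername>
  </host>
  <host>
    <ip> 1.2.3.5 </ip>
    <anothername> frobozz </anothername>
    <mainname> frobozz.bar.com </mainname>
  </host>
</hostlist>
"""


def hosts_as_dicts(xml_text):
    """Turn each <host> element into a dict keyed by child element name."""
    hosts = []
    for host in ET.fromstring(xml_text).findall("host"):
        hosts.append({
            "ip": host.findtext("ip").strip(),
            "mainname": host.findtext("mainname").strip(),
            "anothername": [e.text.strip() for e in host.findall("anothername")],
        })
    return hosts


hosts = hosts_as_dicts(HOSTLIST)
for entry in hosts:
    print(entry)
```

With the names kept, the main name can no longer be confused with an alias just because the elements arrived in a different order.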

Perl's SOAP::Lite actually expresses the results of SOAP queries as
the full trees of elements and attributes, allowing you to walk the
tree.

Unfortunately, when it proves necessary to write a program to walk the
tree, that shows that the "S" for "Simple" part just got more than a
tad less "simple."

The Python SOAP bindings don't handle this terribly wonderfully.

>> Perhaps I'm missing something blindingly obvious here, but what
>> benefits would I gain from coming up with my own format?
>
> For simple things? Ease, speed, simplicity, readability. Don't
> fear it. The world will survive, you know, even if you sometimes
> don't use XML. :-)

I think the widespread lemming-rush to try to get _everything_ mapped
onto XML-based formats is a demonstration that a whole lot of people
have never even _looked at_ Lex and Yacc.

Not everything in this world should require writing your own recursive
descent parser. But I rather suspect that there are a lot of
situations where the programming of parsing tasks might be handled
with less code, less debugging, and less overall effort (mental and
chronological) by building a Lex-based parser than is required to
integrate an XML library into an application and then add the hooks to
provide semantics for what it parses.
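
To give a feel for the size of such a parser, here is a minimal sketch. It is not Lex or Yacc: it is a hand-written stand-in using Python's re module, with a table of token patterns in the spirit of a Lex specification. The tiny `name = value` language, the token names, and the functions `tokenize` and `parse` are all assumptions made up for illustration.

```python
import re

# Token table: (token name, regular expression), tried in this order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STRING", r'"[^"]*"'),
    ("EQUALS", r"="),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, value) pairs, dropping whitespace."""
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError("unexpected character %r" % value)
        yield kind, value


def parse(text):
    """Parse statements of the form NAME = NUMBER-or-STRING into a dict."""
    tokens = list(tokenize(text))
    result = {}
    position = 0
    while position < len(tokens):
        if tokens[position][0] == "NEWLINE":
            position += 1
            continue
        kinds = [kind for kind, _ in tokens[position:position + 3]]
        if (len(kinds) < 3 or kinds[:2] != ["NAME", "EQUALS"]
                or kinds[2] not in ("NUMBER", "STRING")):
            raise SyntaxError("expected NAME = value near token %d" % position)
        name, raw = tokens[position][1], tokens[position + 2][1]
        # Numbers become ints; strings lose their surrounding quotes.
        result[name] = int(raw) if kinds[2] == "NUMBER" else raw[1:-1]
        position += 3
    return result


settings = parse('retries = 3\nhost = "foo.bar.com"\n')
print(settings)
```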
--
(reverse (concatenate 'string "moc.enworbbc@" "enworbbc"))
http://cbbrowne.com/info/linux.html
MICROS~1 has brought the microcomputer OS to the point where it is
more bloated than even OSes from what was previously larger classes of
machines altogether. This is perhaps Bill's single greatest
accomplishment.

Paul Rubin

unread,
Jul 13, 2002, 12:42:57 PM7/13/02
to
Christopher Browne <cbbr...@acm.org> writes:
> I think the widespread lemming-rush to try to get _everything_ mapped
> onto XML-based formats is a demonstration that a whole lot of people
> have never even _looked at_ Lex and Yacc.
> ...

> (reverse (concatenate 'string "moc.enworbbc@" "enworbbc"))


Or Lisp... ;-)

Christopher Browne

unread,
Jul 13, 2002, 3:44:54 PM7/13/02
to

Sure. But in that it's atypical for people to be terribly accepting
of language options that don't look a whopping lot like C, I'll go
with "something related to C."

Perhaps they should look at Baker's META or ASN.1, but that _does_ get
pretty abstruse pretty quickly, and who has heard of them?

Yacc and Lex provide nice formal ways to describe grammars, can work
with the "Langues-du-jour," and it would be very instructive to see
that they _are_ usable for building parsers for small languages.
--

(reverse (concatenate 'string "moc.enworbbc@" "enworbbc"))

http://cbbrowne.com/info/spreadsheets.html
This is Linux country. On a quiet night, you can hear NT re-boot.

François Pinard

unread,
Jul 13, 2002, 7:41:31 PM7/13/02
to
[Christopher Browne]

> Yacc and Lex provide nice formal ways to describe grammars, can work with
> the "Langues-du-jour," and it would be very instructive to see that they
> _are_ usable for building parsers for small languages.

Python users are blessed with many lexers/parsers systems. Even if
I glanced around, I surely do not know them all. My favorite so far is
SPARK, which is not only very elegant, but also quite simple to use and
powerful at what it can recognise. I also learned to like PLY.

Christopher Browne

unread,
Jul 13, 2002, 9:46:05 PM7/13/02
to
pin...@iro.umontreal.ca (François Pinard) wrote:
> [Christopher Browne]
>> Yacc and Lex provide nice formal ways to describe grammars, can work with
>> the "Langues-du-jour," and it would be very instructive to see that they
>> _are_ usable for building parsers for small languages.
>
> Python users are blessed with many lexers/parsers systems. Even if
> I glanced around, I surely do not know them all. My favorite so far is
> SPARK, which is not only very elegant, but also quite simple to use and
> powerful at what it can recognise. I also learned to like PLY.

Fair enough...

I'm pointing at Lex/Yacc since they provide _declarative_ ways of
describing grammars, which is commonly not what scripting language
schemes use. (I'm afraid I'm not familiar with SPARK/PLY; perhaps
they are different?)

In addition, Lex/Yacc are commonly what are used to _build_ the
scripting language grammar, so they probably are hiding around
somewhere anyways :-).

Furthermore, what I'm trying to have as an underlying "theme" in the
"music" is that it might very well be easier to build a Lex/Yacc
grammar using C and link it in than to fight your way through
designing the XML-based system. There may be more "elegant" options
than Lex/Yacc; the underlying theme is that that doesn't prevent the
"design-the-grammar-from-scratch" approach from being more manageable
than XML.
--
(reverse (concatenate 'string "gro.gultn@" "enworbbc"))
http://cbbrowne.com/info/spreadsheets.html
Why are there flotation devices under plane seats instead of
parachutes?

Jonathan Hogg

unread,
Jul 14, 2002, 4:18:52 AM7/14/02
to
On 14/7/2002 2:46, in article agql4s$np5pt$1...@ID-125932.news.dfncis.de,
"Christopher Browne" <cbbr...@acm.org> wrote:

> Furthermore, what I'm trying to have as an underlying "theme" in the
> "music" is that it might very well be easier to build a Lex/Yacc
> grammar using C and link it in than to fight your way through
> designing the XML-based system. There may be more "elegant" options
> than Lex/Yacc; the underlying theme is that that doesn't prevent the
> "design-the-grammar-from-scratch" approach from being more manageable
> than XML.

Do you find using XML-parsing libraries a fight? Certainly in Python I have
found using the XML libraries to be astonishingly simple.

The thing is, using lex and yacc (or whatever your favourite parsing
framework may be) may possibly be easier if you're already comfortable with
it. But in order to use them you need to design a syntax. The resulting
syntax is useful only to programs that have a parser for that syntax. With
XML, the data can be imported, queried, and translated by anyone or any
program with a basic XML toolkit.

As I tried to show before, with tools like XPath, one can extract useful
information out of any XML file.
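
As a rough illustration of the XPath point (this is not code from the post), the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited XPath subset. The sample document and its element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element and attribute names are made up.
SAMPLE = """
<library>
  <book year="1999"><title>First Title</title></book>
  <book year="2002"><title>Second Title</title></book>
</library>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree understands a small XPath subset: child paths and
# attribute predicates such as [@year='2002'].
titles = [t.text for t in root.findall("./book/title")]
recent = [t.text for t in root.findall("./book[@year='2002']/title")]

print(titles)
print(recent)
```

The same two queries would work on any document with that shape, without writing a parser for it, which seems to be the point being argued here.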

And to answer François' earlier point about not being able to use
"standardised" meaningfully with regard to XML: I consider XML to be
"standardised" not because the W3C said so, but because parsing, validating,
querying, and transforming frameworks are available for nearly any language
off-the-shelf, editors support it, and database and data manipulation tools
support it.

I'm afraid Pickle doesn't come close in this regard (and isn't human
readable anyway). CSV is probably closer but it doesn't support complex
enough structure for me. ASN.1 might be a contender but also isn't human
readable and doesn't have the same availability of tools. And certainly, any
random syntax I might come up with will have support no further than I
write.

People like to scoff at so-called "Enterprise" computing, but in large
organisations every new file format that someone comes up with represents a
new maintenance headache. I don't want to have to reverse engineer file
formats and come up with custom parsers every time I need to make two
different systems interoperate.

Jonathan

François Pinard

Jul 14, 2002, 1:01:34 PM
[Jonathan Hogg]

> [...] I consider XML to be "standardised" not because the W3C said so,
> but because parsing, validating, querying, and transforming frameworks
> are available for nearly any language off-the-shelf, editors support it,
> and database and data manipulation tools support it. I'm afraid Pickle
> doesn't come close in this regard (and isn't human readable anyway).

I can read a pickle with Python and dump it as a readable, pretty-printed
Python structure. Conversely, I can have Python read in a text containing
the source of a Python structure, and produce a pickle from the result.
The programs to do so are very small. In practice, I do not maintain my
original structures as pickles, but rather as very straight Python source
files containing about nothing but data structures. These are easy to edit.
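
A minimal sketch of the round trip described above, assuming the structure contains only plain literals (dicts, lists, strings, numbers, booleans, None). It uses the standard pickle, pprint, and ast modules; the sample data is invented, and this is not the author's own program.

```python
import ast
import pickle
import pprint

# Hypothetical sample structure made of plain literals only.
data = {"name": "example", "sizes": [1, 2, 3], "nested": {"ok": True, "ratio": 0.5}}

# Pickle -> readable text: load the pickle, then pretty-print the structure.
blob = pickle.dumps(data)
text = pprint.pformat(pickle.loads(blob))
print(text)

# Readable text -> pickle: parse the literal, then pickle the result.
# ast.literal_eval accepts only literals, so no arbitrary code is run.
restored = ast.literal_eval(text)
blob_again = pickle.dumps(restored)
```

Note that unpickling data from an untrusted source is not safe, whereas ast.literal_eval on the text form is restricted to literals.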

For most people, even those not familiar with Python, a Python
structured constant is probably easier to read than any XML rendering of it.
Python will parse and validate it for me. I can save the original text,
or the pickle if I prefer so, in files and databases, and transmit either
over networks. Python, the language, is also a wonderful and probably
unequalled generic framework for transforming data structures.

About being compatible with "nearly any language off-the-shelf", of
course, is where some difficulty may arise. Until I have this problem,
I'd rather stay comfortable with Python than push myself into miseries
I do not need. Then, I could decide to transmit either lines and fields,
or XML if available at the other end, or even analyse and produce source
or data files for the other language, probably all from within Python.

David Mertz, Ph.D.

Jul 15, 2002, 10:49:40 AM
|of Python dictionary and be able to convert it to XML and back.

My xml_pickle library is clearly the most direct answer to this
question. Mind you, I agree with some of the caveats in the thread you
generated about "buzzword compliance" and overuse of XML, etc. But
assuming you do have good reasons to do what you describe, get:

http://gnosis.cx/download/Gnosis_Utils-current.tar.gz

It will do what you want, very simply (and will do lots more if you want
to be non-simple). There's lots of documentation in there also.

Yours, David...

--
_/_/_/ THIS MESSAGE WAS BROUGHT TO YOU BY: Postmodern Enterprises _/_/_/
_/_/ ~~~~~~~~~~~~~~~~~~~~[me...@gnosis.cx]~~~~~~~~~~~~~~~~~~~~~ _/_/
_/_/ The opinions expressed here must be those of my employer... _/_/
_/_/_/_/_/_/_/_/_/_/ Surely you don't think that *I* believe them! _/_/

Huaiyu Zhu

Jul 15, 2002, 2:04:05 PM

The basic idea is quite simple: consider a data structure as a tree; denote
the type of branching at each node; indent the subtrees. It appears to me
that indentation is easier to handle than quotes and escapes. Here's a
simple example:

[]
# This is a sequence
- first item
- second item
  with multiple lines
-{}
  # The third item in the sequence is itself a set
  - element 1
  -## encryption=somescheme
    # element 2 is binary data
    the binary data goes here
    which can be multiple lines as well
  -{:}
    # element 3 is a dictionary
    - key1: value1
    - key2: value2
-[:]
  # The third item in the sequence is itself a seqdict
  - key1: value1
  - key2:-
    This value is multiline
    Which keeps the same indentation
    So that it is human readable

There is a complication that I cannot recall at this moment that requires
the indentation to be at least two characters.

The outermost level could be handled by blank lines to make it more
readable. So a bibtex type of file would be like

[]{:}

- bibkey: ...
- author: ...
- title: ...

- bibkey: ...
- author: ...
- title: ...

For deeply nested structures, it is more efficient but less readable to use

0)- a
1)- b
2)- c
3)- d
2)- e

in place of

- a
  - b
    - c
      - d
    - e
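
The two notations carry the same information, so converting between them is mechanical. The sketch below assumes a fixed indentation of two spaces per level and the `N)` level prefix shown in the post; the function names are hypothetical and this is not the author's code.

```python
import re

# Matches a level-prefixed line such as "2)- c".
LEVEL_LINE = re.compile(r"^(\d+)\)(.*)$")

def to_indented(lines, width=2):
    """Turn 'N)rest' lines into lines indented by N * width spaces."""
    out = []
    for line in lines:
        match = LEVEL_LINE.match(line)
        if match is None:
            raise ValueError("expected 'N)...' but got %r" % line)
        level, rest = int(match.group(1)), match.group(2)
        out.append(" " * (width * level) + rest)
    return out

def to_numbered(lines, width=2):
    """Turn indented lines back into 'N)rest' lines."""
    out = []
    for line in lines:
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if indent % width:
            raise ValueError("indentation is not a multiple of %d: %r" % (width, line))
        out.append("%d)%s" % (indent // width, stripped))
    return out

numbered = ["0)- a", "1)- b", "2)- c", "3)- d", "2)- e"]
indented = to_indented(numbered)
print("\n".join(indented))
```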

Assuming that the newline character occurs in binary data with 1/256
frequency, and that the structural markers at the beginning of each line
occupy fewer than 10 characters, the bloat factor for binary data would be
less than 5% (10/256 is about 3.9%).
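
The arithmetic behind that estimate, written out as a small check. The variable names are illustrative; the two inputs are the assumptions stated in the paragraph above.

```python
# Assumption from the post: a newline byte occurs once per 256 bytes on average,
# so the mean line length in binary data is 256 bytes.
newline_frequency = 1 / 256
mean_line_length = 1 / newline_frequency

# Assumption from the post: at most 10 characters of structure per line.
marker_chars_per_line = 10

bloat = marker_chars_per_line / mean_line_length
print(f"{bloat:.1%}")  # → 3.9%
```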

OK, hope this makes sense. If this is still interesting I'll dig the thing
out. I have documents and code (perl and python) at home, but I'll have to
dig through several tar files to find them, maybe on a hard disk that's not
mounted. This all started back when I tried to use perl to manage my bibtex
files while I did not know Python, so some of them used % and @ to represent
hash and array following perl. The format itself also changed somewhat over
the years. So don't expect those to be more organized than this post. :-)
They certainly have more details, though.


Huaiyu

holger krekel

Jul 16, 2002, 9:18:30 AM
Huaiyu Zhu wrote:
> holger krekel <py...@devel.trillke.net> wrote:
> >Huaiyu Zhu wrote:
> >> Readability for machines does not have to come at the expense of readability
> >> for humans. A few years back I experimented with an indentation based data
> >> format that is:
> >>
> >> - as readable as emacs's outline mode
> >> - reduce to common conventions like this paragraph for simple cases
> >> - allow mixed nested structures of set, sequence, dictionary, and seqdict
> >> - can include binary data
> >> - can handle different encodings/encryptions in different elements
> >> - with average less than 5% bloat, in contrast to XML's over 100% bloat
> >
> >do you have any code or design documents for this?
> >
> >Sounds quite interesting.
>
> The basic idea is quite simple: consider a data structure as a tree; denote
> the type of branching at each node; indent the subtrees. It appears to me
> that indentation is easier to handle than quotes and escapes. Here's a
> simple example:
>
> ...snipped...
>
> OK, hope this makes sense.

It does and it's very interesting. It does sound a lot like
http://yaml.org to me, though (They even have an RFC).
Don't you think YAML might be a superset of your ideas?

Let me add some random thoughts/questions about your/yaml's scheme
(i hope i am not missing something obvious):

- how is a binary data-stream's size determined? What about
open-ended streams? Embedding of arbitrary data-streams
is very useful (IMO).

- somehow your and yaml's scheme remind me of today's wiki techniques.
E.g. Wikis have methods of sequence-detection (bullets ...) and they
have a commitment to readability. Of course, they are generally more
concerned with graphical views than with being a concise persistence scheme.

- Is there a canonical conversion between XML and your scheme/YAML?
Shouldn't be too hard, anyway...

- how do you express external addresses akin to XPATH?
Ideas:
- Mappings are easy, just take the 'key'.
- Sequences are easy (take the sequence number) but not very robust
to deletions and insertions of items.
- tag-names (IDs) which can be associated with any item might be interesting.
readability is likely to suffer, probably.

btw, I wonder whether some form of your and/or YAML's ideas should play a
role in the new persistence-SIG. While the actual persistence mappings
are not in the focus there are certainly some interesting connections
between the two areas.

> If this is still interesting I'll dig the thing
> out. I have documents and code (perl and python) at home, but I'll have to

> ...

this sure is useful. Especially for me since i work with a (perl-)
friend on a project which needs to address the persistence-question. And
we want to have it interoperable, simple and fast. I guess looking
at YAML might avoid that you have to dig too much into old harddisks :-)

holger


Clark C . Evans

Jul 16, 2002, 2:27:12 PM
On Tue, Jul 16, 2002 at 03:18:30PM +0200, holger krekel wrote:
| Huaiyu Zhu wrote:
| > The basic idea is quite simple: consider a data structure as a tree; denote
| > the type of branching at each node; indent the subtrees. It appears to me
| > that indentation is easier to handle than quotes and escapes. Here's a
| > simple example:
|
| It does and it's very interesting. It does sound a lot like
| http://yaml.org to me, though (They even have an RFC).
| Don't you think YAML might be a superset of your ideas?

Yes, Steve Howell even has a python implementation of it...
http://yaml.org/python/PyYaml_14jul2002.tgz

I'm sure he'd love your comments/contributions.

| Let me add some random thoughts/questions about your/yaml's scheme
| (i hope i am not missing something obvious):
|
| - how is a binary data-stream's size determined? What about
| open-ended streams? Embedding of arbitrary data-streams
| is very useful (IMO).

In YAML, we have a built-in BASE64 type which hopefully most
implementations will support. As for handling embedded streams,
this works as long as they are indented properly and only use
printable characters (see specification).
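
To illustrate why base64 suits a format restricted to indented, printable text, here is a sketch using only Python's standard base64 and textwrap modules. It does not use any YAML library or YAML syntax; the four-space indent and the 64-column wrap are arbitrary choices for the example.

```python
import base64
import textwrap

# Arbitrary binary data, including newline and NUL bytes.
raw = bytes(range(256))

# Encode to printable ASCII, wrap to short lines, and indent the block
# so it could sit under a key in an indentation-based document.
encoded = base64.b64encode(raw).decode("ascii")
block = textwrap.indent("\n".join(textwrap.wrap(encoded, 64)), "    ")
print(block)

# Decoding: drop the indentation and line breaks, then decode.
decoded = base64.b64decode("".join(block.split()))
```

The cost is size: base64 output is about a third larger than the raw bytes, and the length of the data is known only once the whole block has been read.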

| - somehow your and yaml's scheme remind me of today's wiki techniques.
| E.g. Wikis have methods of sequence-detection (bullets ...) and they
| have a commitment to readability. Of course, they are generally more
| concerned with graphical views than with being a concise persistence
| scheme.

Yes; Ward Cunningham's WikiWiki is a very cool concept and I'm sure
we borrowed from it subconsciously.

| - Is there a canonical conversion between XML and your scheme/YAML?
| Shouldn't be too hard, anyway...

http://yaml.org/xml.html

| - how do you express external addresses akin to XPATH?
| Ideas:
| - Mappings are easy, just take the 'key'.
| - Sequences are easy (take the sequence number) but not very robust
| to deletions and insertions of items.
| - tag-names (IDs) which can be associated with any item might be interesting.
| readability is likely to suffer, probably.

I'm toying with making IDs a "special" key so that a YPATH
like mechanism would work well... similar to XPATH/XSLT's "keys()"

| btw, I wonder whether some form of your and/or YAML's ideas should play a
| role in the new persistence-SIG. While the actual persistence mappings
| are not in the focus there are certainly some interesting connections
| between the two areas.

One of our other members, Brian Ingerson, is working on getting
YAML into the core of Parrot (http://parrot-code.org)

| this sure is useful. Especially for me since i work with a (perl-)
| friend on a project which needs to address the persistence-question. And
| we want to have it interoperable, simple and fast. I guess looking
| at YAML might avoid that you have to dig too much into old harddisks :-)

Also, there is a Perl implementation of YAML sited in the
downloads section of the website written by Brian Ingerson.
Having more members would be great! We could use more implementations,
and generic YAML tools (like YAML-DIFF)

Best,

;) Clark

Steve Howell

Jul 16, 2002, 2:31:33 PM
> |
> | It does and it's very interesting. It does sound a lot like
> | http://yaml.org to me, though (They even have an RFC).
> | Don't you think YAML might be a superset of your ideas?
>
> Yes, Steve Howell even has a python implementation of it...
> http://yaml.org/python/PyYaml_14jul2002.tgz
>
> I'm sure he'd love your comments/contributions.
>

Yes indeed. It's still alpha software, but it's quite usable for basic YAML
applications, such as making config files and creating test drivers. It's
not robust enough to use as your primary serializer, but it's getting there.


Huaiyu Zhu

Jul 16, 2002, 6:14:51 PM
holger krekel <py...@devel.trillke.net> wrote:
>Huaiyu Zhu wrote:
>> holger krekel <py...@devel.trillke.net> wrote:
>> >Huaiyu Zhu wrote:
>> >> Readability for machines does not have to come at the expense of readability
>> >> for humans. A few years back I experimented with an indentation based data
>> >> format that is:
>> >>
>> >> - as readable as emacs's outline mode
>> >> - reduce to common conventions like this paragraph for simple cases
>> >> - allow mixed nested structures of set, sequence, dictionary, and seqdict
>> >> - can include binary data
>> >> - can handle different encodings/encryptions in different elements
>> >> - with average less than 5% bloat, in contrast to XML's over 100% bloat
>> >
>> >do you have any code or design documents for this?
>> >
>> >Sounds quite interesting.
>>
>> The basic idea is quite simple: consider a data structure as a tree; denote
>> the type of branching at each node; indent the subtrees. It appears to me
>> that indentation is easier to handle than quotes and escapes. Here's a
>> simple example:
>>
>> ...snipped...
>>
>> OK, hope this makes sense.
>
>It does and it's very interesting. It does sound a lot like
>http://yaml.org to me, though (They even have an RFC).
>Don't you think YAML might be a superset of your ideas?

Thanks a lot for this link. The basic idea is very similar, but apparently
they have done a lot more of formal specification than I have ever
attempted. There are several differences in the details, so neither is
superset of the other. I'll comment on the differences once I have time to
read through their docs.

>Let me add some random thoughts/questions about your/yaml's scheme
>(i hope i am not missing something obvious):

Following comments only concern what my scheme does:

>- how is a binary data-stream's size determined? What about
> open-ended streams? Embedding of arbitrary data-streams
> is very useful (IMO).

It's determined by the block structure denoted by indentation.

>- somehow your and yaml's scheme remind me of today's wiki techniques.
> E.g. Wikis have methods of sequence-detection (bullets ...) and they
> have a commitment to readability. Of course, they are generally more
> concerned with graphical views than with being a concise persistence scheme.

The emphasis is on using indentation and leading markers to denote
structure, in contrast to markup, punctuation, quotes and escapes in the
markup languages.

>- Is there a canonical conversion between XML and your scheme/YAML?
> Shouldn't be too hard, anyway...

In principle they both can express anything. In practice I've never tried
conversion between my scheme and XML. XML is too complicated in some sense:
restriction to texts, white spaces, quotes, tags with names, specs of
repeats, etc. I do not know if there is a syntax-free specification of XML
data structures. In my scheme, the syntax comes after the abstract
structures are specified. Structure marks are never buried in data.

>- how do you express external addresses akin to XPATH?
> Ideas:
> - Mappings are easy, just take the 'key'.
> - Sequences are easy (take the sequence number) but not very robust
> to deletions and insertions of items.
> - tag-names (IDs) which can be associated with any item might be interesting.
> readability is likely to suffer, probably.

An address is not a data structure, but a particular data item. It has no
meta-meaning in the scheme. I've experimented with alias nodes, sort of
like symbolic links in file systems. I found life is easier without them.
I also believe that one's document's meta data is another's plain data.

>btw, I wonder whether some form of your and/or YAML's ideas should play a
>role in the new persistence-SIG. While the actual persistence mappings
>are not in the focus there are certainly some interesting connections
>between the two areas.

There are facilities for conversion among the data structures: set, seq,
dict, seqdict, with various specifications. I do not see how yaml indicates
the types of structures.

>> If this is still interesting I'll dig the thing
>> out. I have documents and code (perl and python) at home, but I'll have to
>> ...
>
>this sure is useful. Especially for me since i work with a (perl-)
>friend on a project which needs to address the persistence-question. And
>we want to have it interoperable, simple and fast. I guess looking
>at YAML might avoid that you have to dig too much into old harddisks :-)

Yaml is very interesting. I'd say about 60-80% similar to what I did. I'm
sure I have stuff that they don't have. I'll dig up my stuff anyway. If
anyone is interested in seeing a mess of code and doc ... :-)

I started with bibtex, todo lists, etc, as I had a problem keeping track of
my things. Then it drifted and I got distracted and eventually even lost
track of this project itself. The hype on XML got me really depressed, as I
thought no one would be interested in a direction I regard as fundamentally
better than XML. Any new development that can make my life easier is very
welcome.


Huaiyu

Clark C . Evans

Jul 16, 2002, 9:30:56 PM
On Tue, Jul 16, 2002 at 10:14:51PM +0000, Huaiyu Zhu wrote:
| Thanks a lot for this link. The basic idea is very similar, but apparently
| they have done a lot more of formal specification than I have ever
| attempted. There are several differences in the details, so neither is
| superset of the other. I'll comment on the differences once I have time to
| read through their docs.

I look forward to the commentary; could you post it to, or cc, the
YAML discussion list?



| The emphasis is on using indentation and leading markers to denote
| structure, in contrast to markup, punctuation, quotes and escapes in the
| markup languages.

Exactly. We started with leading markers (% and @ initially) and
eventually found ways that allowed us to skip these...

| There are facilities for conversion among the data structures: set, seq,
| dict, seqdict, with various specifications. I do not see how yaml indicates
| the types of structures.

YAML does this with the bang (!) you can see this in the preview
for type family (http://yaml.org/spec/#preview-family) and also
in the transfer method section (http://yaml.org/spec/#syntax-trans)

| Yaml is very interesting. I'd say about 60-80% similar to what I did. I'm
| sure I have stuff that they don't have. I'll dig up my stuff anyway. If
| anyone is interested in seeing a mess of code and doc ... :-)

I'd love to hear about the overlap; I'm sure we don't do everything.
But if you found something important that we don't have, I'd love to
know since we'd like to start finalizing the spec at this time so that
implementations can start emerging.

| The hype on XML got me really depressed, as I thought no one would
| be interested in a direction I regard as fundamentally better than
| XML. Any new development that can make my life easier is very
| welcome.

I think the hype of XML will start to backfire as people
realize that its function isn't magical, and that their
particular solution isn't all that useable. The information
model fits documents well, but is a poor match for object
serialization, which is 90% of the use cases programmers
face. XML came from the document processing domain, as such
it had a good head-start, but I think something like YAML that
caters more to the programmer instead of the document author
will eventually win out.

All in all, it's great to hear that you were thinking along similar
lines a few years ago. This is very comforting to me.
I'd love to hear more about your thoughts on YAML, and if possible,
we'd really welcome your participation!

Best,

James Kew

Jul 17, 2002, 6:52:50 PM
"Clark C . Evans" <c...@clarkevans.com> wrote in message
news:mailman.102686906...@python.org...

> I think the hype of XML will start to backfire as people
> realize that its function isn't magical, and that their
> particular solution isn't all that useable. The information
> model fits documents well, but is a poor match for object
> serialization, which is 90% of the use cases programmers
> face.

Um: 90%? What sort of use cases do you see programmers forcing XML into?

Just curious: I fall into the "XML as poor-man's database/parser" camp at
the moment but I'm finding that for a poor man's solution it does quite a
good job with not much programmer effort to glue it together.

Should I feel guilty for not learning lex/yacc? Or should I rejoice in my
batteries-included (or at least, batteries-downloaded-from-SourceForge)
pragmatism? I'm feeling a bit of both at the moment...

James

Clark C . Evans

Jul 18, 2002, 9:26:48 AM
On Wed, Jul 17, 2002 at 11:52:50PM +0100, James Kew wrote:
| > The information model fits documents well, but is a poor match
| > for object serialization, which is 90% of the use cases
| > programmers face.
|
| Um: 90%? What sort of use cases do you see programmers forcing XML into?
| Just curious: I fall into the "XML as poor-man's database/parser" camp at
| the moment but I'm finding that for a poor man's solution it does quite a
| good job with not much programmer effort to glue it together.

XML has its roots in structured document processing, and is a
descendent of SGML. For example, the research reports by Gartner Group
are primarily text, but there are specific tags to mark-up features of
the report: chapters to generate a table of contents, keywords to make
an index, vendor names to enable better searching, etc. The reports
are highly structured with tags, each tag having a beginning and an
ending. Furthermore, there is also information which must be attached to
a particular series of characters, such as an editorial comment, but must
not appear in print. All in all, structured document processing is
a rather complicated beast and SGML set out to tackle this problem.

SGML thus had many features which supported these requirements. It had
attributes for out-of-band information which must be attached to a sequence
of characters but not be printed. It allows for mixed content, so that
a paragraph for instance can contain a series of untagged characters
followed by a series of characters tagged bold. Also, SGML allows for
named lists, so that a chapter could be defined, for example, as a series
of tables, paragraphs and figures. SGML is also character based, since
documents are in essence a large blob of characters "marked-up".

SGML also had lots of features which enabled human-editing: it allowed
you to skip end tags, it even allowed you to skip intermediate tags
so that if a chapter couldn't contain characters directly (characters
must be wrapped with a paragraph) the parser would implicitly include
a paragraph anyway. These extra syntax features did wonders for SGML's
flexibility and in no small way were responsible for HTML's success.
However, the implicit and missing end tags required that a parser know
the document type definition before it could parse an SGML text. Further,
these implicit thingys made it hard to write parsers.

Therefore, there was a simplification movement in HTML land where
the structural features (attributes, mixed content, named lists)
of SGML were kept but the features which required a DTD and made
parsing complex (implicit tags, optional end tags, etc) were dropped.
This simplified SGML was dubbed XML and was then marketed as
HTML-next-generation. The marketing for XML has been enormous, but
at its core, it is still primarily a structured document markup language.

Due to XML's popularity, lots of people have tried to get it to work
for other things. A few people have made XML databases and others
have used XML for object serialization and invocation (SOAP/XML-RPC)
and it has had many other uses. However, most of these uses tend to
use a vastly simplified subset of XML and indeed impose additional
constraints on XML as far as particular attributes, etc. These
additional attributes/constraints are often needed to model native
datastructures of modern languages and they include: (a) a way to
specify node type, (b) a way to express that a node occurs more than
once in the graph-serialized-as-a-tree, (c) a manner to restrict
mixed content which does not usually occur in modern languages, (d)
restrictions on named list model are also common.
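
As one possible illustration of constraint (a), here is a sketch that serializes a Python dictionary to XML and back by recording each node's type in a `type` attribute. The element and attribute names (`value`, `item`, `key`, `type`) are invented for this example and are not taken from SOAP, XML-RPC, or any library. It assumes string keys and only dict, list, str, int, float, bool, and None values; strings containing control characters that XML forbids are not handled.

```python
import xml.etree.ElementTree as ET

def to_xml(value, tag="value"):
    """Build an Element for value, recording its Python type in 'type'."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        elem.set("type", "dict")
        for key, item in value.items():
            child = to_xml(item, "item")
            child.set("key", key)  # assumption: keys are strings
            elem.append(child)
    elif isinstance(value, list):
        elem.set("type", "list")
        for item in value:
            elem.append(to_xml(item, "item"))
    elif value is None:
        elem.set("type", "none")
    elif isinstance(value, bool):  # test bool before int: bool subclasses int
        elem.set("type", "bool")
        elem.text = str(value)
    elif isinstance(value, (int, float, str)):
        elem.set("type", type(value).__name__)
        elem.text = value if isinstance(value, str) else repr(value)
    else:
        raise TypeError("unsupported type: %s" % type(value).__name__)
    return elem

SCALARS = {"int": int, "float": float, "str": str}

def from_xml(elem):
    """Rebuild the Python value described by an Element made by to_xml."""
    kind = elem.get("type")
    if kind == "dict":
        return {child.get("key"): from_xml(child) for child in elem}
    if kind == "list":
        return [from_xml(child) for child in elem]
    if kind == "none":
        return None
    if kind == "bool":
        return elem.text == "True"
    return SCALARS[kind](elem.text or "")

data = {"name": "Markus", "count": 3, "ratio": 0.5,
        "tags": ["a", "b"], "nested": {"flag": True, "nothing": None}}

xml_text = ET.tostring(to_xml(data), encoding="unicode")
restored = from_xml(ET.fromstring(xml_text))
print(xml_text)
```

Keys are stored in an attribute rather than used as tag names because an arbitrary string is not necessarily a legal XML element name. Shared or cyclic references, constraint (b) above, are not addressed by this sketch.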

However, even with these constraints and fix-ups, at its heart,
XML is a much more complicated beast and this complexity is reflected
in the DOM and SAX interfaces. Since these are the primary interfaces
used by programmers, programmers must grapple with documentisms
even if they don't need structured document features.

In summary, I'm not saying that XML is bad. It is fantastic for
structured document processing (I have direct experience here). However,
just because it has had great success in this domain doesn't mean that
this success will be long-lasting in other domains. I see people
using XML for lots of purposes it was never designed for; certainly
it is flexible enough to do it, but the question is: At what price?
With XML the price is pretty steep, especially for "object serialization"
requirements where attributes, mixed-content, and named-lists aren't
needed and where other things such as typing, graph links, map/lists,
and treating characters as a whole scalar (rather than as chunks
of characters) are what you want.

So, that said, YAML (YAML Ain't Markup Language, http://yaml.org) was
designed to meet the needs of object serialization directly. In this
domain, I must say, it is much much better than XML. Conversely, in the
document markup domain for which XML was designed, YAML would
not work very well at all... YAML isn't markup. In YAML you have
dictionaries, lists, and scalars; you don't have characters that
are tagged. The difference may seem subtle, but the actual impact
is huge. It's a completely different mindset. For a programmer with
serialization needs, YAML fits the bill perfectly while XML requires
quite a bit of effort to make work.

The only down side of YAML is that it isn't buzz-word compliant and
the implementations aren't quite mature yet. The implementations will
come along (the native python one isn't bad at all). And hopefully
buzz-word compliance will come along eventually, till then there is
a subsetted XML mapping of YAML (http://yaml.org/xml.html) which
you can use. I'll patch up the python parser to read/write from
this XML format within a few more weeks. This way those team
members who have to be buzz-word compliant can do so.
For my day job, I'm more interested in getting the job done...

Best,

Clark


François Pinard

Jul 18, 2002, 7:11:40 AM
[James Kew]

> Should I feel guilty for not learning lex/yacc?

No. These are for C programmers, and you should not have to program C to
be happy! However, if you live in a Python world, it is not bad that you
put a similar Pythonish tool in your pocket, one of these days!

> Or should I rejoice in my batteries-included (or at least,
> batteries-downloaded-from-SourceForge) pragmatism?

If you feel comfortable and happy with XML, then XML is quite OK.
As long as you resist turning into an XML frantic fanatic, you know,
those recognisable by the foam at the mouth, you're safe! :-)

Huaiyu Zhu

Jul 18, 2002, 2:10:40 PM
Clark C . Evans <c...@clarkevans.com> wrote:
>On Tue, Jul 16, 2002 at 10:14:51PM +0000, Huaiyu Zhu wrote:
>| Thanks a lot for this link. The basic idea is very similar, but apparently
>| they have done a lot more of formal specification than I have ever
>| attempted. There are several differences in the details, so neither is
>| superset of the other. I'll comment on the differences once I have time to
>| read through their docs.
>
>I look forward to the commentary; could you post it to, or cc, the
>YAML discussion list?

That'll be after I get time to read through YAML docs and review my old
code and docs.



>| The emphasis is on using indentation and leading markers to denote
>| structure, in contrast to markup, punctuation, quotes and escapes in the
>| markup languages.
>
>Exactly. We started with leading markers (% and @ initially) and
>eventually found ways that allowed us to skip these...

How like minds think alike. :-) Perl opened my mind to the possibility of
heterogeneous hierarchical data structures.

>I'd love to hear about the overlap; I'm sure we don't do everything.
>But if you found something important that we don't have, I'd love to
>know since we'd like to start finalizing the spec at this time so that
>implementations can start emerging.
>

>I'd love to hear more about your thoughts on YAML, and if possible,
>we'd really welcome your participation!

I'll try to find time to participate, but time is always in short supply.

Here are some comments at first glance. I don't see a description of the
semantics of the structures independent of any syntax. It is possible to
define all the canonical transforms among the structures [1] without
reference to any particular representation. I'd also like to emphasize that
all the indentations, markers etc should be configurable in a document[2][3].

[1] Canonical transforms, such as {a, b, c} -> [a, b, c] -> {(1:a), (3:c),
(2:b)}. There are a few dozen of them among set, seq, dict, seqdict.
Some have partial inverses. None of them is a one-to-one correspondence.
That's why I keep all four as basic structures. These four are the
combinations of keyed/nonkeyed and ordered/unordered. Additional kinds of
structures, such as bags (whether keyed and whether ordered), may be
added later on. [4]

[2] I tried the following kinds of indentations (where n is level)
'(%s)' % n
' ' * n
' ' * n + '|'
Obviously there can be a lot of other variations. Such flexibility
would allow many common document formats to be transformed into
conforming format with minimum effort, sometimes by just adding a
metacomment at the beginning of the document. For example, the formats
of the current paragraphs should be accommodated.

[3] I would allow encoding and encryption on a per-node
basis, not just at the file level. In reality how to break up a tree
into subtrees to fit in files is largely arbitrary. This calls for meta
comments on each node with a simple syntax for describing them.

[4] One thing I have not solved is whether the keys can only be strings. If
keys can be substructures themselves, there are further correspondences
between sets, dicts and bags, such as {a, b} -> {a:1, b:1}. This leads
to the issue of the identity of structures. Example: {a, b}=={a} if
a==b. This complicates things and that's perhaps where I stopped.
(Over-generalization perhaps?)
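
A few of the transforms mentioned in notes [1] and [4] can be sketched with Python's built-in types. The function names are hypothetical, and sorting is just one arbitrary way to pick an order when going from a set to a sequence; this is an illustration, not the author's code.

```python
def set_to_seq(s):
    """Set -> sequence. An order must be chosen; sorting makes it reproducible."""
    return sorted(s)

def seq_to_dict(seq):
    """Sequence -> dict, keying each item by its 1-based position."""
    return {index: item for index, item in enumerate(seq, start=1)}

def dict_to_seq(d):
    """Partial inverse of seq_to_dict: meaningful only when keys are positions."""
    return [d[key] for key in sorted(d)]

def seq_to_set(seq):
    """Sequence -> set. Order and duplicates are lost, so this is not one-to-one."""
    return set(seq)

def set_to_dict(s):
    """Set -> dict of membership counts, as in {a, b} -> {a: 1, b: 1}."""
    return {item: 1 for item in s}

as_seq = set_to_seq({"c", "a", "b"})
as_dict = seq_to_dict(as_seq)
print(as_seq, as_dict, dict_to_seq(as_dict))

# The identity issue from note [4]: equal elements collapse inside a set.
print({1, 1.0} == {1})
```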

So my overall comment is that this approach can be made more 'meta' than any
particular syntax or structure would allow. The worst thing about xml is
that one has to conform to its (mostly arbitrary) syntax conventions instead
of thinking about the underlying data structure that's pertinent for the
task at hand. I do believe that the good thing about standards is there are
so many to choose from. A meta syntax would open up the possibility of
interoperability on a much larger scale than xml could handle comfortably.
It is often easier to define a particular syntax by fixing some parameters
in a meta syntax. Perhaps these are already in YAML; I have only spent
half an hour reading its docs.

Huaiyu

Huaiyu Zhu

Jul 18, 2002, 2:36:52 PM
Clark C . Evans <c...@clarkevans.com> wrote:
>On Tue, Jul 16, 2002 at 10:14:51PM +0000, Huaiyu Zhu wrote:
>
>| There are facilities for conversion among the data structures: set, seq,
>| dict, seqdict, with various specifications. I do not see how yaml indicates
>| the types of structures.
>
>YAML does this with the bang (!) you can see this in the preview
>for type family (http://yaml.org/spec/#preview-family) and also
>in the transfer method section (http://yaml.org/spec/#syntax-trans)

It appears that the bang is used to indicate any type. I found it better
to separate structural types (seq, dict, ...) from terminal types (str, int,
date and time, ...). The former are finite in number and predefined (at any
moment in time, at least), and necessary for all the navigation tools that
traverse the data structure. The latter are numerous, application dependent,
and should be treated as black boxes by the generic tools.
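
A minimal sketch of that separation, assuming Python's dict, list, tuple, set and frozenset are the structural types and everything else is terminal. The `walk` function and the sample data are hypothetical.

```python
import datetime

def walk(node, path=()):
    """Yield (path, terminal) pairs; terminal values are never inspected."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, (list, tuple)):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    elif isinstance(node, (set, frozenset)):
        # Unkeyed and unordered: use a placeholder path component.
        for child in node:
            yield from walk(child, path + ("*",))
    else:
        # Terminal type (str, int, date, ...): a black box to this tool.
        yield path, node

data = {"name": "Mark", "dates": [datetime.date(2002, 7, 18)], "tags": {"xml"}}
leaves = list(walk(data))
```

The traversal needs to know only the handful of structural types; a new terminal type such as a date passes through without any change to `walk`.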

I'd suggest grouping all the structural types together in one section and
describing their semantic relations. Also group all the terminal types
together in one section and indicate that they are only first examples of
endless possibilities.

All in all I find YAML quite impressive. Congrats on an excellent job in
documenting and promoting it, unlike me, who let things fizzle and rot.

Huaiyu

Christopher Browne

Jul 18, 2002, 4:05:24 PM
In the last exciting episode, pin...@iro.umontreal.ca (François Pinard) wrote:

> [James Kew]
>
>> Should I feel guilty for not learning lex/yacc?
>
> No. These are for C programmers, and you should not have to program C to
> be happy! However, if you live in a Python world, it is not bad that you
> put a similar Pythonish tool in your pocket, one of these days!

C programmers should. (Feel guilty :-).)

Python programmers should only feel guilty if they don't know
something about analogous things like TPG or YAPP or such.
--
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://cbbrowne.com/info/sap.html
Philosophy: unintelligible answers to insoluble problems.

Fredrik Lundh

Jul 18, 2002, 5:14:09 PM