http://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog
-r
--
email: r...@cfcl.com; phone: +1 650-873-7841
http://www.cfcl.com/rdm - my home page, resume, etc.
http://www.cfcl.com/Meta - The FreeBSD Browser, Meta Project, etc.
http://www.ptf.com/dossier - Prime Time Freeware's DOSSIER series
http://www.ptf.com/tdc - Prime Time Freeware's Darwin Collection
You may want to look at the perl-xml thread called "Tim Bray says XML is too
hard for programmers" on this topic, as well as at the xml-dev thread on the
same topic, especially http://lists.xml.org/archives/xml-dev/200303/msg00536.html.
Perl can have idioms for XML; they just need to be developed. However, I don't
believe at all that this needs to happen at the p6l level. We can already make
very cool stuff using p5, and the grammar stuff in p6 ought to make the sort of
while loop Tim Bray describes quite doable as well.
--
Robin Berjon <robin....@expway.fr>
Research Engineer, Expway http://expway.fr/
7FC0 6F5F D864 EFB8 08CE 8E74 58E6 D5DB 4889 2488
Tim Bray also says he gives up and uses regexes as a quick and dirty
workaround. So maybe these power tools you keep touting aren't necessary after
all.
--
A witty saying means nothing. -Voltaire
To be fair, he only does that on data he has 100% full control over and that is
pre-munged to match his regexen. Otherwise, you really can't do that. But that
doesn't change much about the (potential absence of an) issue; see my other post.
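
To make the trade-off concrete, here is a minimal Python sketch of the
line-at-a-time regex style under discussion. It assumes the input has already
been munged so that each element of interest sits entirely on one line. The
HEADING pattern, the headings() function and the sample data are made up for
illustration; they are not taken from Tim Bray's article.

```python
import re

# Hypothetical pattern: a heading element that starts and ends on one line.
HEADING = re.compile(r"<(h[1-4])>(.*?)</\1>")


def headings(lines):
    """Yield (tag, text) for each single-line heading, skipping <meta> lines."""
    for line in lines:
        if "<meta" in line:
            continue
        m = HEADING.search(line)
        if m:
            yield m.group(1), m.group(2)


# Made-up, pre-munged sample input: one element per line.
sample = [
    '<meta name="x" content="y"/>',
    "<h1>Title</h1>",
    "<p>body</p>",
    "<h2>Section</h2>",
]

for tag, text in headings(sample):
    print(tag, text)
```

A heading that is split across two lines is silently missed, which is why this
style only holds up on data you control.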
FWIW, I've had to try to rewrite Microsoft's VCPROJ and SLN format
files(*), which look a whole lot like XML. Sadly, if you change the
order of independent entities in the file, Microsoft's internal parser
rejects the file. This despite the fact that MS already has an XML
parser DLL available for public consumption (more than one version, in
fact).
To me, this says that there's no real commitment to "doing XML". What
there is seems to be a recognition that XML format is regular and
comprehensible to others, so writing "XML-like" files becomes popular.
=Austin
(*) VCPROJ and SLN files are control files for the VS.net IDE product.
Just because MS has one broken tool (surprise!) doesn't mean there's no
'commitment to "doing XML"'. There is much commitment, including from MS, and
people very rarely use XML-like formats.
We are going OT *very* fast.
That's the nature of the beast: XML requires a lexer which knows
about more than just two or so character classes, so a trivial split
isn't enough to lex it, and it requires a structured language
parsing algorithm (recursive descent, or one of the table-driven
parsers; I imagine LALR(1) would be about right).
These do not implement efficiently in high-level scripting
languages. A tight open-coded finite-state-machine lexer with a
well-designed hand-coded recursive-descent parser should execute on
the rough order of a half-dozen or a dozen machine instructions per
input byte.
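
The shape of such a lexer/parser pair can be sketched as follows. This is only
an illustration of the structure, written in Python for brevity; per the point
above, a production version would be open-coded in C. It handles a made-up
subset (bare tags and text, with no attributes, entities, comments or
self-closing tags), and the names lex() and parse() are hypothetical.

```python
def lex(src):
    """Character-at-a-time state machine.

    Yields ('open', name), ('close', name) or ('text', value) tokens.
    """
    state, buf = "text", []
    for ch in src:
        if state == "text":
            if ch == "<":
                if buf:
                    yield ("text", "".join(buf))
                buf, state = [], "tag"
            else:
                buf.append(ch)
        else:  # state == "tag"
            if ch == ">":
                name = "".join(buf)
                if name.startswith("/"):
                    yield ("close", name[1:])
                else:
                    yield ("open", name)
                buf, state = [], "text"
            else:
                buf.append(ch)
    if state == "tag":
        raise ValueError("unterminated tag")
    if buf:
        yield ("text", "".join(buf))


def parse(tokens):
    """Recursive descent over the rule: element := open (element | text)* close."""
    tokens = list(tokens)
    pos = 0

    def element():
        nonlocal pos
        kind, name = tokens[pos]
        if kind != "open":
            raise ValueError("expected an opening tag")
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos][0] != "close":
            if tokens[pos][0] == "open":
                children.append(element())
            else:
                children.append(tokens[pos][1])
                pos += 1
        if pos >= len(tokens) or tokens[pos][1] != name:
            raise ValueError("missing </%s>" % name)
        pos += 1  # consume the matching close tag
        return (name, children)

    if not tokens:
        raise ValueError("empty input")
    tree = element()
    if pos != len(tokens):
        raise ValueError("trailing content after root element")
    return tree


print(parse(lex("<a>hi<b>x</b></a>")))
```

The lexer needs a state for "inside a tag" versus "inside text", which is what
a plain split cannot give you, and the parser recurses once per nesting level.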
Heck, even the vastly more trivial CSV parsing deserves enough
care that it runs breathtakingly faster with Text::CSV_XS than with
Text::CSV.
> The speed issue when importing XML-like data (which we do *very
> frequently*) is a constant sticking point for us and our clients.
Then we need a good tight lexer/parser written in C, as a library.
If the existing libraries are too fragile or inflexible, this may
mean we need to design and write a new one.
> It is therefore critically important that P6 allows easy, fast
> parsing for XML-like things, not necessarily just XML proper,
> because that's the way the business winds have been blowing. And
> it needs to support it out-of-the-box.
Then this new library with glue module will have to be shipped with
perl, is all. That's no biggie.
-Bennett
Yep. Which makes things even worse. And this is pretty important
stuff.
We do a *lot* of XML parsing here (Cognitivity, that is) and even more
"XML-like" parsing. And even with Perl, it's a royal pain. There are
P5 XML modules out there which tie into C-based XML libraries... those
are quite fast, but fail badly if the XML isn't 100% well-formed, and
are largely not extensible for "XML-like" situations. You'd have to
rip one up and rewrite it, in C, for every iteration of "-like", which
we cannot credibly do.
A perl5-native parser can be rigged up fairly easily, but it's
*numbingly* slow compared to the C version. I mean, 20-50 times
slower, by my guess. The speed issue when importing XML-like data
(which we do *very frequently*) is a constant sticking point for us and
our clients. Damian's Parse::RecDescent has been a godsend,
implementation-wise -- but it of course suffers the same nasty speed
issues.
This is a big, big issue, and one that P6 needs to address well,
because this is how many businesses will judge it. What I'm hoping,
obviously, is that the new P6 regexes -- which will be *perfect* for
writing and maintaining our umpteen quite-similar parsing rulesets --
will be fast enough to at least be in the same order of magnitude as a
middling C solution. They don't have to be as fast as C, obviously,
but they can't be 20x worse.
Why does this matter so much? Because it's a barn door. Even though
it's so much easier to write XML-like parsers in Perl than, well,
anything else, the speed issue will at some point dictate moving to a
non-Perl parsing solution. At which point, the issue becomes how much
of the rest of the related system to move into that other solution as
well, since it is much cheaper to maintain expertise in one toolset
than two. So within a company, it can lead to greater use of Perl --
or abandonment of Perl -- depending on success in this one key area.
(I have seen this in action at a number of companies.)
It is therefore critically important that P6 allows easy, fast parsing
for XML-like things, not necessarily just XML proper, because that's
the way the business winds have been blowing. And it needs to support
it out-of-the-box. Seriously, it's that important.
MikeL
You wanna take command of P6ML? :-)
I'm pretty happy with the new rexen, so far. I'll probably be even
happier once the interaction between A5 and A6 solidifies (Write,
Damian, write!).
And since so much other 6PAN stuff will depend on P6ML, I'm pretty sure
we'll get the XML bits right.
But the "recode" that needs to get done to get from P6ML to FooCorp's
XMLike Format (FXF) does have the opportunity to be a sales tool:
1- It's not doable. The P6 grammar for XML parsing is so buttpuckerish
that only the original author can understand it, and that only for 10
minutes or so a day.
This will scare people off. It's probably better to do a half-assed job
than to show someone a hideous grammar as an advert for "cool new
power".
2- It's a big pain and not worth doing. Better to rewrite.
If the grammar is comprehensible but not extensible/adaptable, then it
may make for a good demo of "the power of P6" but the difficulty of
implementing may burn P6.
3- It's simple and easy to do and understand.
Woo-hoo! How much more do I need to say?
For some Epsilon, P6 should be able to implement XML +/- Epsilon
trivially.
Cases in point:
-- Configuring the rules of XML.
-- Configuring the character set. (Even weird stuff, like using [tag]
instead of <tag>).
-- Error handling/recovery.
-- Commingling XML with other data.
-- Embedding other languages into XML, and vice versa.
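
As a rough illustration of the "configuring the character set" case, here is a
small Python sketch of a tokenizer whose tag delimiters are parameters. It is
not a P6 grammar, just a stand-in for the idea; make_tokenizer() is a
hypothetical name, and it assumes single-character delimiters and tags without
attributes.

```python
import re


def make_tokenizer(open_ch="<", close_ch=">"):
    """Build a tokenizer for tags written with the given delimiters.

    Assumes each delimiter is a single character.
    """
    o, c = re.escape(open_ch), re.escape(close_ch)
    # Either a tag (with an optional leading slash) or a run of plain text.
    pattern = re.compile(rf"{o}(/?)([^{c}]+){c}|([^{o}]+)")

    def tokenize(src):
        out = []
        for m in pattern.finditer(src):
            closing, name, text = m.groups()
            if name is not None:
                out.append(("close" if closing else "open", name))
            else:
                out.append(("text", text))
        return out

    return tokenize


angle = make_tokenizer()            # ordinary <tag> syntax
square = make_tokenizer("[", "]")   # the [tag] variant

print(angle("<a>hi</a>"))
print(square("[a]hi[/a]"))
```

Both calls produce the same token stream, so everything downstream of the
tokenizer is shared between the two dialects. Note that this sketch silently
skips a stray opening delimiter that is never closed; real error handling and
recovery would need more than this.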
=Austin
I don't know that it makes a difference, as this is *really* a
library issue rather than a language one, but there's a basic parrot
XML parser in the parrot examples directory. It's faster (by a factor
of four or so, and it should speed up further with our IO speedups)
than the equivalent perl 5 version, of which it is a line-for-line
translation. The performance numbers are old; it might be faster now.
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk