XML Not good for Big Files (vs Flat Files)

Homer

Apr 4, 2006, 11:27:51 AM
I am a little bit tired of the obsession people have with XML and XML
technology. Please share your thoughts and let me know if I am thinking
about this the wrong way. I believe some people are overusing XML all
over the place. The Canadian Government is now pushing XML on its
organizations as the standard for data/file transfer. Huge files moving
between companies now include tons of XML tags repeated all over the
file, slowing down networks and crashing applications because of their
size.
I am not objecting to the whole technology. I know the advantages of XML
and use it all the time for config files and our web-oriented
applications, but using it as the standard for moving big files is going
too far. Here is an example:

John,Smith,5555555,37 Finch Ave.

Is now:

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

And the tags repeat for every record:

<FirstName>....</FirstName>
<LastName>....</LastName>
<PhoneNum>....</PhoneNum>
<Address>....</Address>

<FirstName>....</FirstName>
<LastName>....</LastName>
<PhoneNum>....</PhoneNum>
<Address>....</Address>


Please let me know what you think.


Regards,

Homer

James McGill

Apr 4, 2006, 11:50:22 AM
On Tue, 2006-04-04 at 08:27 -0700, Homer wrote:
>
> And Tags are repeating and repeating:

XML markup does tend to bloat the data.

I personally believe you should use serializable objects that can be
represented according to an XML schema when that's appropriate, but that
also can be serialized into a tightly packed format when that is
appropriate as well. So I should be able to marshal/unmarshal the
serialized object to and from XML, but I should also be able to stream
that object without marshalling it -- and the other end should be able
to unmarshal to xml, validate according to the schema, etc.

Likewise, database bindings should be informed by the xml schema, but
the XML markup shouldn't be what you store in the db.
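
To make the "one object, two wire formats" idea concrete, here is a minimal
sketch in Python. The Person class, the element names, and the
length-prefixed packed layout are all made up for illustration; they do not
come from any real schema or marshalling library.

```python
import struct
import xml.etree.ElementTree as ET
from dataclasses import dataclass, astuple

# Hypothetical element names, in the same order as the dataclass fields.
TAGS = ("FirstName", "LastName", "PhoneNum", "Address")


@dataclass
class Person:
    first_name: str
    last_name: str
    phone: str
    address: str

    def to_xml(self) -> bytes:
        """Verbose, self-describing form."""
        root = ET.Element("Person")
        for tag, value in zip(TAGS, astuple(self)):
            ET.SubElement(root, tag).text = value
        return ET.tostring(root, encoding="unicode").encode("utf-8")

    @classmethod
    def from_xml(cls, doc: bytes) -> "Person":
        root = ET.fromstring(doc)
        return cls(*(root.findtext(tag, default="") for tag in TAGS))

    def to_packed(self) -> bytes:
        """Compact form: 2-byte big-endian length, then UTF-8 bytes, per field."""
        out = bytearray()
        for value in astuple(self):
            data = value.encode("utf-8")
            out += struct.pack(">H", len(data)) + data
        return bytes(out)

    @classmethod
    def from_packed(cls, blob: bytes) -> "Person":
        fields, pos = [], 0
        while pos < len(blob):
            (size,) = struct.unpack_from(">H", blob, pos)
            pos += 2
            fields.append(blob[pos:pos + size].decode("utf-8"))
            pos += size
        return cls(*fields)
```

The receiving end can call from_packed() and then to_xml() if it wants
something to validate against a schema; the tags never have to travel over
the wire.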


mtp

Apr 4, 2006, 12:01:34 PM
Homer wrote:
> I am a little bit tired of this obsession people have with XML and XML
> technology. Please share your thoughts and let me know if I am thinking
> in a wrong way. I believe some people are over using XML all over the
> place. Nowadays Canadian Government is pushing XML to its organization
> as standard for data/file transfer. Huge files moving between companies
> now include tones of XML Tags repeating all over the file and slowing
> down networks and crashing applications because of size.

you can use indexing, binary XML, or compression

> I am not objecting to the whole technology. I know advantages of XML
> and using it all the times for Config files or our web oriented
> applications but using it as standard for moving big files is going too
> far. Here is the example:
>
> John,Smith,5555555,37 Finch Ave.
>
> Is now:
>
> <FirstName>John</FirstName>
> <LastName>Smith</LastName>
> <PhoneNum>5555555</PhoneNum>
> <Address>37 Finch Ave.</Address>
>
> And Tags are repeating and repeating:
>
> <FirstName>....</FirstName>
> <LastName>....</LastName>
> <PhoneNum>....</PhoneNum>
> <Address>....</Address>
>
> <FirstName>....</FirstName>
> <LastName>....</LastName>
> <PhoneNum>....</PhoneNum>
> <Address>....</Address>
>
>
> Please let me know what you think.

Maybe one of the computing service providers wanted more money for its
services with this big project?

Maybe everybody thinks "newer is better"?

cher...@gmail.com

Apr 4, 2006, 12:06:19 PM

Yes, that does seem like a network killer. It depends on the intended
use of the file on the other end and on the client receiving it. If
they *have to* use XML, certain optimizations can be done for just the
transfer part...

<header>
<firstName>A15</firstName>
<lastName>A15</lastName>
<phone>A10</phone>
<address>A10</address>
</header>
<data>
<![CDATA[
fixed width data goes here
]]>
</data>

OR

<header>
<fieldSeparator>;</fieldSeparator>
<field>firstName</field>
<field>lastName</field>
<field>phone</field>
<field>address</field>
</header>
<data>
<![CDATA[
delimited data goes here
]]>
</data>

OR a combination of the above.

In short, XML should be preferred only if documentation and
discoverability are more important than performance.
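
A rough sketch of how a receiver might read the delimited variant, in Python.
The wrapping <file> root element and the sample rows are my own assumptions
(the fragments above have no single root), so treat this as an illustration
rather than a defined format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical envelope: XML header describing the fields, raw rows in CDATA.
ENVELOPE = """<file>
<header>
<fieldSeparator>;</fieldSeparator>
<field>firstName</field>
<field>lastName</field>
<field>phone</field>
<field>address</field>
</header>
<data><![CDATA[John;Smith;5555555;37 Finch Ave.
Jane;Doe;5551234;12 Elm St.
]]></data>
</file>"""


def read_envelope(text: str) -> list:
    """Return one dict per row, keyed by the field names from the header."""
    root = ET.fromstring(text)
    header = root.find("header")
    separator = header.findtext("fieldSeparator")
    names = [field.text for field in header.findall("field")]
    payload = root.findtext("data").strip()
    rows = csv.reader(io.StringIO(payload), delimiter=separator)
    return [dict(zip(names, row)) for row in rows]
```

The tags are paid for once per file instead of once per record, but the
payload is now opaque to generic XML tools.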

RC

Apr 4, 2006, 12:11:17 PM
to Homer
Homer wrote:


> Please let me know what you think.

XML was never designed to replace a database server.

You can use an XML file to transfer a portion of the data
from a database,
e.g.

SELECT lastname,firstname,phonenumber,address
FROM phonebook
WHERE state = 'NY' AND city = 'somewhere';

A flat file like this

William|John|12345678|84 5th Ave

I don't know which column is the last name and which is the first name.
Is the 3rd column a person ID or a phone number?

You need to let the programmers know which column is which.

Next time, if someone changes the flat file format to

85 5th Ave|John|William|12345678

then your database will be incorrect after the update.


True, XML creates larger files.
But it makes our lives easier.

You can make up your own tags
<lastName> or <Last_Name>, etc.
the tags can be in English, Spanish, French, Russian, Japanese, etc.

James McGill

Apr 4, 2006, 12:19:42 PM
On Tue, 2006-04-04 at 09:06 -0700, cher...@gmail.com wrote:
>
> OR a combination of the above.

You're almost touching on the big problem: Misconception of what it
means to be "standard".

XML has (several) standardized markup frameworks, but it is silent as to
content or utilization. It is ridiculous for a government entity to
demand that "XML" be "the standard" for data interchange. They need to
bless certain schemas if that's their goal, but it also needs to be
abstract enough that systems can be designed efficiently.

In your examples, the designers can claim that they are using "XML", and
therefore "are standardized" on it, but the three examples we've seen so
far are not at all interchangeable...

Oliver Wong

Apr 4, 2006, 12:44:32 PM

"Homer" <hom...@hotmail.com> wrote in message
news:1144164471.6...@i39g2000cwa.googlegroups.com...

If your complaint is file size during network transfer, compress the
file before sending it.

If your complaint is file size during parsing, use SAX instead of DOM,
and don't keep the whole file in memory at once.

Use the right tool for the job. If for whatever problem you're trying to
solve, you've got a better tool than XML, then use it. But if the problem is
"The government requires me to use XML", then I can't think of a better tool
than XML to solve that particular problem (except maybe emigration ;)).
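
As a rough illustration of the SAX point, here is a sketch using Python's
xml.sax module rather than Java. The PhoneBook/Person element names are
assumptions; the handler hands each record to a callback and then forgets it,
so memory use does not grow with the size of the file.

```python
import io
import xml.sax


class PhoneBookHandler(xml.sax.ContentHandler):
    """Collects one <Person> at a time and passes it to a callback."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record
        self.record = None   # dict for the <Person> currently being read
        self.field = None    # name of the child element currently open
        self.text = []

    def startElement(self, name, attrs):
        if name == "Person":
            self.record = {}
        elif self.record is not None:
            self.field = name
            self.text = []

    def characters(self, content):
        # May be called several times per element, so accumulate the pieces.
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "Person":
            self.on_record(self.record)
            self.record = None
        elif name == self.field:
            self.record[name] = "".join(self.text)
            self.field = None


def stream_records(source, on_record):
    """Parse a file name or binary stream without building a tree."""
    xml.sax.parse(source, PhoneBookHandler(on_record))


SAMPLE = (
    b"<PhoneBook>"
    b"<Person><FirstName>John</FirstName><LastName>Smith</LastName></Person>"
    b"<Person><FirstName>Jane</FirstName><LastName>Doe</LastName></Person>"
    b"</PhoneBook>"
)
```

Calling stream_records(io.BytesIO(SAMPLE), print) prints each record as it is
completed.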

- Oliver

Lasse Reichstein Nielsen

Apr 4, 2006, 12:58:09 PM
"Homer" <hom...@hotmail.com> writes:

> I am a little bit tired of this obsession people have with XML and XML
> technology.

Hear hear!
Seems some people think XML is the solution to all problems.
I'd rather classify it as the lowest common denominator for exchanging
tree-structured data - and definitely not something fit for humans to
read or write directly.

> John,Smith,5555555,37 Finch Ave.
>
> Is now:
>
> <FirstName>John</FirstName>
> <LastName>Smith</LastName>
> <PhoneNum>5555555</PhoneNum>
> <Address>37 Finch Ave.</Address>
>
> And Tags are repeating and repeating:

> Please let me know what you think.

Apart from what everybody else has said, zipping such a file
should yield a *very* high compression factor.
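
If anyone wants to measure it, here is a small Python sketch that builds the
same synthetic records as tagged XML and as comma-separated lines and gzips
both. The record layout and values are invented, and such regular data
compresses better than real names and addresses would, so take the numbers
as an upper bound on the effect.

```python
import gzip


def build_xml(count: int) -> bytes:
    """Synthetic records in a tag-per-field layout."""
    lines = ["<PhoneBook>"]
    for i in range(count):
        lines.append(
            f"<Person><FirstName>First{i}</FirstName>"
            f"<LastName>Last{i}</LastName>"
            f"<PhoneNum>{5550000 + i}</PhoneNum>"
            f"<Address>{i} Finch Ave.</Address></Person>"
        )
    lines.append("</PhoneBook>")
    return "\n".join(lines).encode("utf-8")


def build_csv(count: int) -> bytes:
    """The same records as one comma-separated line each."""
    rows = [f"First{i},Last{i},{5550000 + i},{i} Finch Ave." for i in range(count)]
    return "\n".join(rows).encode("utf-8")


def report(count: int = 10_000) -> dict:
    """Map format name to (raw size, gzipped size) in bytes."""
    sizes = {}
    for name, raw in (("xml", build_xml(count)), ("csv", build_csv(count))):
        sizes[name] = (len(raw), len(gzip.compress(raw)))
    return sizes


if __name__ == "__main__":
    for name, (raw, packed) in report().items():
        print(f"{name}: {raw} bytes raw, {packed} bytes gzipped")
```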

/L
--
Lasse Reichstein Nielsen - l...@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

James McGill

Apr 4, 2006, 12:56:55 PM
On Tue, 2006-04-04 at 16:44 +0000, Oliver Wong wrote:

> except maybe emmigration

You say that as though anyone would ever leave the utopian paradise that
is Canada...

Timbo

Apr 4, 2006, 12:39:54 PM
Homer wrote:
> John,Smith,5555555,37 Finch Ave.
>
> Is now:
>
> <FirstName>John</FirstName>
> <LastName>Smith</LastName>
> <PhoneNum>5555555</PhoneNum>
> <Address>37 Finch Ave.</Address>
>
It's true that the XML data in your example is bulky, but what it
has that the unstructured data doesn't have is meta-level
information, such as that "John" is someone's first name. If the
parties involved (i.e. the sender and receiver of this information)
have an agreement as to the meaning of "FirstName", then they are
sharing more than just text... it has some implicit meaning. If you
send it unstructured, then the receiver has to know how to parse
the data into this agreed meaning, which means it needs to know the
format of the data.

Then, on the other hand, if the data is just stored in a database
or something with no definition of what the tags mean, then I
agree with you... using XML is of little use.

Joe Attardi

Apr 4, 2006, 1:29:07 PM
> John,Smith,5555555,37 Finch Ave.
>
> Is now:
>
> <FirstName>John</FirstName>
> <LastName>Smith</LastName>
> <PhoneNum>5555555</PhoneNum>
> <Address>37 Finch Ave.</Address>

Yes, but now we know what all the data means. Your example is quite
clear, but what about this one:

Lawrence,David,Maynard,MA

Could mean several things:
(1) Lawrence David lives in Maynard, MA.
(2) David Lawrence lives in Maynard, MA
(3) David Maynard lives in Lawrence, MA
(4) Maynard David lives in Lawrence, MA
etc. You see where I'm going with this.

Where
<FirstName>Lawrence</FirstName>
<LastName>David</LastName>
<City>Maynard</City>
<State>MA</State>

leaves no question.

Yes, we as humans know intuitively that city and state go together. But
for an application using this data, there has to be some specification
defined and all systems that use it must be aware of it.

Homer

Apr 4, 2006, 2:08:13 PM
I guess these responses are proving my point. You all know that the
best solution for transferring huge files between two parties is a
simple flat file, in a format that both sender and receiver have agreed
upon, sent over a secure line. But you still defend adding tons of tags
to a file whose format both sender and receiver are familiar with. I
believe lots of people are using XML because it's cool and new. And
these people give advice to companies and organizations.

Some points about your suggestions:

1- Marshalling/Object Stream: Too advanced for places like government.
2- Mixed XML/Raw Data: Then what is the point of having XML at the
top? Unless you are sending the file to an unknown place that doesn't
know what it is getting.
3- Compression: There is no good standard for compression (Unix is not
really ZIP friendly unless you add some open source tool or buy a Zip
product) and the mainframe is another story. Even for Windows you need
to buy the product (or use open source, which most companies don't
like). Also, why triple the file size and then compress it?


Let me give you another example of coolness (sorry, it's a bit off
topic, but it's about coolness):

I got a job at a telecommunications company (cell phones) to convert
their code from C to C++, because OO was so cool in those days, even
though the application was working with no problems.
I did my job, spent a year converting the code and building a class
library, and left the company.

One year later they hired a bunch of other people to come and convert
the whole thing to Java, because Java was the Best.

3 years later they hired me again to convert everything to J2EE,
because J2EE is (guess what) the Best.


Regards,

Homer

James McGill

Apr 4, 2006, 2:32:17 PM
On Tue, 2006-04-04 at 11:08 -0700, Homer wrote:
> I believe lots
> of people are using XML because it's cool and new.

It's anything but "cool". And as for it being "new", XML isn't old
enough to vote, but SGML is. If you aren't seeing the benefits of
logical structure and validation, standardized processing, etc.,
that may be because you aren't exploiting those things in your
application.

One of your complaints is directly counter to an explicit design goal,
from the beginning of the XML spec: "Terseness in XML markup is of
minimal importance."

XML markup is deliberately intended to favor clarity over conciseness.

But most of your complaint seems to derive from the fact that you work
in a bureaucratic government situation, where you have no authority to
make decisions, and where there is a limited backchannel for your
recommendations. That is unfortunate, but isn't it a choice you made
when you went to work for a government?

I've always been led to believe that the Canadian government is a
prototype of efficiency and reason, one that should make Americans feel
ashamed. Are you suggesting that it too may be clogged with
bureaucratic nonsense? I would be shocked to hear that!


Homer

Apr 4, 2006, 3:06:21 PM
Very good guess, but no, I don't work for the government. All I am
saying is that in these cases the sender and receiver both know the file
format by heart. They know it and their applications know it. That's how
they were moving files in the past, and if they want to establish a new
file transfer they will let each other know about the upcoming file
format for sure. There is no reason to send the file format along with
each file every time they have a file transfer (unless you wear a name
tag at home so your family knows your name).

James McGill

Apr 4, 2006, 3:25:57 PM
On Tue, 2006-04-04 at 12:06 -0700, Homer wrote:
> All I am saying
> is in these cases sender and receiver both knows the file format by
> heart. They know and their application knows.

The interesting thing with XML is that in its case, the *document*
knows. In a well designed system, the DTD can change and applications
can cope.

>There is no reason to send the file format along with each file every
>time they have a file transfer

But you aren't sending the file format. You're sending a notice with a
URI that locates the format (schema, dtd, etc.), and then sending data
that's marked up according to that format.

>(unless you are wearing name tag in your
>home so your family know your name).

Or like wearing a badge at a workplace, perhaps?


Joe Attardi

Apr 4, 2006, 3:56:37 PM

Homer wrote:
> I believe lots of people are using XML because it's cool and new. And these people
> give advise to companies and organizations.
XML isn't new. It's been around almost ten years. The first working
draft for the XML spec was put together in November of 1996.

> 3- Compression: There is no good standard for compression (Unix is not
> really ZIP friendly unless you add some opensource or buy Zip product)

Gzip? In fact, IIRC, the gzip algorithm takes advantage of strings that
are repeated over and over (like the tag names), which helps its
compression.

> (or use open source that most companies don't like).

That most companies don't like? I don't think you researched this much
before making this statement. Look how many of the huge players (Sun,
IBM, etc.) have strong support for open source. In addition, open
source is being adopted all over the place.

> Let me give you another example of coolness (sorry, it's a bit off
> the topic but it's about coolness):

It's not just because XML is "the cool thing". It's perfectly suited
for the exchange of data like this. The data describes itself!

Martin Gregorie

Apr 4, 2006, 3:45:14 PM
Homer wrote:
> I guess these responses are proving of my point. You know all that the
> best solution for transferring huge files between two parties is simple
> flat file that both sender/receiver have agreed upon file format and
> using secure line. But you still defend adding tons of tags to a file
> that both sender/receiver are familiar with the format. I believe lots
> of people are using XML because it's cool and new. And these people
> give advise to companies and organizations.
>
Here's another thought: use ASN.1 encoding. Have a look here
<http://asn1.elibel.tm.fr/> if you haven't heard of it.

It does virtually everything XML does in terms of tagged fields and the
ability to completely omit optional fields and structures, but it uses
binary tags and can encapsulate binary data. Like XML you can take a
data description (written in BNF notation) and use it to generate file
encoders and decoders, or you can write fast interpretive decoders (as I
have). It's a standard in the telecoms industry, where it's routinely used
to transfer multi-megabyte files as well as individual short messages.

Java ASN.1 schema compilers are available.

Translating a file between ASN.1 and XML should be a doddle: the site I
mentioned has a tool for doing just that.
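
To give a flavour of what "binary tags" means, here is a hand-rolled Python
sketch that encodes the four example fields as a DER SEQUENCE of UTF8String
values (tag bytes 0x30 and 0x0C). It only handles single-byte tags and is not
how you would do it in production, where an ASN.1 compiler generates the
codec from the data description; the function names are my own.

```python
SEQUENCE = 0x30      # universal, constructed SEQUENCE
UTF8_STRING = 0x0C   # universal, primitive UTF8String


def der_length(size: int) -> bytes:
    """Short form below 128, otherwise long form with a byte-count prefix."""
    if size < 0x80:
        return bytes([size])
    body = size.to_bytes((size.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_tlv(tag: int, content: bytes) -> bytes:
    """Tag, length, value."""
    return bytes([tag]) + der_length(len(content)) + content


def read_tlv(blob: bytes):
    """Return (tag, content, remaining bytes) for the first element in blob."""
    tag, first = blob[0], blob[1]
    if first < 0x80:
        size, start = first, 2
    else:
        count = first & 0x7F
        size = int.from_bytes(blob[2:2 + count], "big")
        start = 2 + count
    return tag, blob[start:start + size], blob[start + size:]


def encode_person(fields) -> bytes:
    items = b"".join(der_tlv(UTF8_STRING, f.encode("utf-8")) for f in fields)
    return der_tlv(SEQUENCE, items)


def decode_person(blob: bytes) -> list:
    tag, content, _rest = read_tlv(blob)
    if tag != SEQUENCE:
        raise ValueError("expected a SEQUENCE")
    fields = []
    while content:
        tag, value, content = read_tlv(content)
        if tag != UTF8_STRING:
            raise ValueError("expected a UTF8String")
        fields.append(value.decode("utf-8"))
    return fields
```

Each field costs two bytes of overhead here (one tag byte, one length byte)
instead of an opening and closing tag name.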


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |

Monique Y. Mudama

Apr 4, 2006, 4:11:00 PM
On 2006-04-04, Homer penned:

> I guess these responses are proving of my point. You know all that
> the best solution for transferring huge files between two parties is
> simple flat file that both sender/receiver have agreed upon file
> format and using secure line. But you still defend adding tons of
> tags to a file that both sender/receiver are familiar with the
> format.

I guess that you are wrong. I guess that the word "best" is meaningless
unless it is qualified by something. If you want a format that is best
at clarity, then flat files lose. I guess that you don't really
understand when to use XML, and that it doesn't really matter because
you don't have the authority to change things in the environment in
which it's causing you trouble, so you've developed a grudge against
XML rather than against whoever decided to use it inappropriately or
whoever decided to create an excessively verbose schema.

> I believe lots of people are using XML because it's cool and
> new. And these people give advise to companies and organizations.

XML isn't new enough to offer the glamour factor you think it has.

--
monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Roedy Green

Apr 4, 2006, 4:15:58 PM
On 4 Apr 2006 08:27:51 -0700, "Homer" <hom...@hotmail.com> wrote,
quoted or indirectly quoted someone who said :

><FirstName>....</FirstName>
><LastName>....</LastName>
><PhoneNum>....</PhoneNum>
><Address>....</Address>
>
>
>Please let me know what you think.

see http://mindprod.com/jgloss/xml.html

Pay particular attention to the images and the XML "logo".

XML needs a binary format both for compactness and automatic format
compliance.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Daniel Dyer

Apr 4, 2006, 3:26:25 PM
On Tue, 04 Apr 2006 22:15:58 +0200, Roedy Green
<my_email_is_post...@munged.invalid> wrote:

> On 4 Apr 2006 08:27:51 -0700, "Homer" <hom...@hotmail.com> wrote,
> quoted or indirectly quoted someone who said :
>
>> <FirstName>....</FirstName>
>> <LastName>....</LastName>
>> <PhoneNum>....</PhoneNum>
>> <Address>....</Address>
>>
>>
>> Please let me know what you think.
>
> see http://mindprod.com/jgloss/xml.html
>
> Pay particular attention to the images and the XML "logo".
>
> XML needs a binary format both for compactness and automatic format
> compliance.

http://www.w3.org/XML/Binary/
http://asn1.elibel.tm.fr/xml/

Dan.


--
Daniel Dyer
http://www.dandyer.co.uk

Oliver Wong

Apr 4, 2006, 5:24:48 PM

"Joe Attardi" <jatt...@gmail.com> wrote in message
news:1144171747....@e56g2000cwe.googlegroups.com...

>> John,Smith,5555555,37 Finch Ave.
>>
>> Is now:
>>
>> <FirstName>John</FirstName>
>> <LastName>Smith</LastName>
>> <PhoneNum>5555555</PhoneNum>
>> <Address>37 Finch Ave.</Address>
>
> Yes but, now we know what all the data means. Your example is quite
> clear, but what about this one:
>
> Lawrence,David,Maynard,MA

Ah, obviously a list of 4 arbitrary strings, i.e. (in SQL terms):

CREATE TABLE foo (
bar VARCHAR(255)
);

INSERT INTO foo VALUES ('Lawrence'),('David'),('Maynard'),('MA');


>
> Could mean several things:
> (1) Lawrence David lives in Maynard, MA.

Oops, okay, it's one record. Well, maybe it means:

Lawrence D. Maynard, who has a Master of Arts. (Or perhaps it uses last
name first, i.e. David M. Lawrence, Master of Arts).

Or maybe (s)he's a Medical Assistant? Or (s)he lives in Madagascar?

> (2) David Lawrence lives in Maynard, MA
> (3) David Maynard lives in Lawrence, MA
> (4) Maynard David lives in Lawrence, MA
> etc. You see where I'm going with this.

Hmm, looks like I was way off... Not being an American, I am not
familiar with American city names, nor American State abbreviations. If only
you had used XML!

- Oliver

Roedy Green

Apr 4, 2006, 5:45:37 PM
On 4 Apr 2006 08:27:51 -0700, "Homer" <hom...@hotmail.com> wrote,
quoted or indirectly quoted someone who said :

><FirstName>John</FirstName>


><LastName>Smith</LastName>
><PhoneNum>5555555</PhoneNum>
><Address>37 Finch Ave.</Address>

Canada has a population of some 30 million. We are talking some
fairly fat files. Not ones you feed to Winzip.

Roedy Green

Apr 4, 2006, 5:47:40 PM
On Tue, 04 Apr 2006 20:45:14 +0100, Martin Gregorie
<mar...@see.sig.for.address> wrote, quoted or indirectly quoted
someone who said :

>Translating a file between ASN.1 and XML should be a doddle

What part of the world does "doddle" derive from? It just means
"easy"?

Roedy Green

Apr 4, 2006, 5:50:07 PM
On 4 Apr 2006 10:29:07 -0700, "Joe Attardi" <jatt...@gmail.com>

wrote, quoted or indirectly quoted someone who said :

><FirstName>Lawrence</FirstName>


><LastName>David</LastName>
><City>Maynard</City>
><State>MA</State>

When you are transferring 30 million records, the level of detail you
need to specify is much deeper than that. The tags alone are not
really telling you anything important.

For an example have a look at the spec of the tape of postal codes the
government puts out. There is a HUGE amount of information other
than just the field names you need to interpret the tape.

Steve Wampler

Apr 4, 2006, 5:44:14 PM
Oliver Wong wrote:
> Hmm, looks like I was way off... Not being an American, I am not
> familiar with American city names, nor American State abbreviations. If
> only you had used XML!

No problem:

<f1>John</f1>
<f2>Smith</f2>
<f3>5555555</f3>
<f4>37 Finch Ave.</f4>

There, that should make people happy :)
(Of course, given this group, maybe the tags should be in Klingon...)

Roedy Green

Apr 4, 2006, 6:28:09 PM
On Tue, 04 Apr 2006 21:26:25 +0200, "Daniel Dyer"
<d...@dannospamformepleasedyer.co.uk> wrote, quoted or indirectly
quoted someone who said :

>http://www.w3.org/XML/Binary/
>http://asn1.elibel.tm.fr/xml/

XML is still is the "it might be a good idea" stage. asn.1 has been
working for years in production. It feels like the Java newbies
decided to reinvent the wheel with XML not realising its simplicity is
actually simple-minded and actually too ambiguous for a proper
interchange format and clumsy as all get out at quoting among other
sins. See http://mindprod.com/jgloss/xml.html

Roedy Green

Apr 4, 2006, 6:45:34 PM
On 4 Apr 2006 08:27:51 -0700, "Homer" <hom...@hotmail.com> wrote,

quoted or indirectly quoted someone who said :

>John,Smith,5555555,37 Finch Ave.

There are now ways, given an XML schema, to create the equivalent binary
ASN.1 that can be decoded up to 100 times faster than the original
XML. Given the incompetence of the W3C in designing XML, I would not
entrust them to produce a binary equivalent. Let's just stick with
ASN.1. Unless it had built-in dictionary compression, it is not going
to be sufficiently better than ASN.1 to warrant a competing format.


http://asn1.elibel.tm.fr/xml/#schema-mapping

This standardized mapping takes as input any schema written in XML
Schema and produces an ASN.1 module containing a set of type
definitions in such a way that there is a one-to-one correspondence
between ASN.1 abstract values and valid XML instances.
ASN.1 standardized encoding rules such as DER (a canonical encoding
that allows digital signatures and encryption) or PER (to very
efficiently transmit data over a radio channel), or even specific
encoding rules that are described in ECN, can then be used.
One big benefit of using a binary encoding is speed. Decoding a binary
stream improves performance by a factor 100 or more. Another benefit
is size: a binary encoding may save up to 80% or even more relative to
corresponding XML text.

Alex Hunsley

Apr 4, 2006, 7:04:24 PM
RC wrote:
> Homer wrote:
>
>
>> Please let me know what you think.
>
> XML is never designed to replace database server.
>
> You can use XML file transfer portion of data
> from a database.
> i.e.
>
> SELECT lastname,fistname,phonenumber,address
> FROM phonebook
> WHERE state = 'NY' AND city = 'somewhere';
>
> A flat file like this
>
> William|John|12345678|84 5th Ave
>
> I don't know which column is last name, first name.
> 3rd column is person ID or phone number?

That's what a header field would be for.

> You need let the programmers know what column is what.
>
> Next time if some one change flat file format to
>
> 85 5th Ave|John|William|12345678
>
> Then your database will incorrect after updated.

Presumably the header field will reflect the change.
Yeah, it's an extra thing to go wrong, admittedly...
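
For what it's worth, a minimal Python sketch of the header approach: with a
header line, a reader that looks fields up by name keeps working when the
columns are reordered. The column names are my own guesses at what such a
header would contain.

```python
import csv
import io

# Same record, two column orders; the header line says which is which.
OLD_LAYOUT = "lastname|firstname|phone|address\nWilliam|John|12345678|84 5th Ave\n"
NEW_LAYOUT = "address|firstname|lastname|phone\n84 5th Ave|John|William|12345678\n"


def read_phonebook(text: str) -> list:
    """Return one dict per row, keyed by the names in the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [dict(row) for row in reader]
```

Both layouts yield the same dictionary, so code that uses row["lastname"]
is unaffected by the reordering. It does nothing for a file that arrives
with no header at all, of course.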

Roedy Green

Apr 4, 2006, 9:10:18 PM
On Tue, 04 Apr 2006 22:28:09 GMT, Roedy Green
<my_email_is_post...@munged.invalid> wrote, quoted or

indirectly quoted someone who said :

>


>XML is still is the "it might be a good idea" stage.

I meant to say BINARY XML is still in the "it might be a good idea"
stage.

James McGill

Apr 4, 2006, 9:30:31 PM
On Wed, 2006-04-05 at 01:10 +0000, Roedy Green wrote:
> On Tue, 04 Apr 2006 22:28:09 GMT, Roedy Green
> <my_email_is_post...@munged.invalid> wrote, quoted or
> indirectly quoted someone who said :
>
> >
> >XML is still is the "it might be a good idea" stage.
>
> I meant to say BINARY XML is still in the "it might be a good idea"
> stage.

In your world, the scenario of routinely "moving 30 million records"
might be more common than it is for others.

XML turns out to be quite a good fit for many situations. It's probably
totally inappropriate for the one the OP was complaining about, of
course.

Monique Y. Mudama

Apr 5, 2006, 12:24:34 AM
On 2006-04-04, Alex Hunsley penned:

> Presumably the header field will reflect the change. Yeah, it's an
> extra thing to go wrong, admittedly...
>

Yeah ... the markup format is nice if partial data is considered
better than no data at all ...

Monique Y. Mudama

Apr 5, 2006, 12:27:28 AM
On 2006-04-04, Roedy Green penned:

>
> There are ways now given an XML schema to create the equivalent
> binary ASN.1 that can be decoded up to 100 times faster than the
> orgininal XML. Given the incompetence of the W3C in designing XML,
> I would not entrust them to produce a binary equivalent. Let's just
> stick with ASN.1. Unless it had built-in dictionary compression, it
> is not going to be sufficiently better than ASN.1 to warrant a
> competing format.
>

Except that, apparently, it's not terribly well known or supported.
That does make a difference. One of the selling points of XML is that
it can allow diverse groups to share data.

Monique Y. Mudama

Apr 5, 2006, 12:28:20 AM
On 2006-04-04, Roedy Green penned:
> On Tue, 04 Apr 2006 20:45:14 +0100, Martin Gregorie
><mar...@see.sig.for.address> wrote, quoted or indirectly quoted
>someone who said :
>
>>Translating a file between ASN.1 and XML should be a doddle
>
> what part of the world does "doddle" derive from? It just means
> "easy"?

I had a mental image of a toddler, er, toddling along. No idea if
that's actually what was meant. In the context of my brain, it meant
"so easy a toddler could do it."

Jon Martin Solaas

Apr 5, 2006, 2:25:40 AM

Of course, but in other cases, when the file format has to be
communicated, nobody knows it by heart, the data needs to be
hierarchical, the receiver needs to validate and perhaps transform to
another format, not to mention implement the apps to do so, XML
is useful. When a new file format is to be used, XSD comes in handy, and
also allows for automatic validation. In many organisations
misunderstandings occur, bugs are made and so on, so validation is nice.

XML was cool when I was a student 10 years ago. Now it's just convenient.

Maybe you should get out more. It's the people outside who don't know
your name :-)

Dag Sunde

Apr 5, 2006, 3:46:35 AM
"Monique Y. Mudama" <sp...@bounceswoosh.org> skrev i melding
news:slrne36hp...@home.bounceswoosh.org...

> On 2006-04-04, Roedy Green penned:
>>
>> There are ways now given an XML schema to create the equivalent
>> binary ASN.1 that can be decoded up to 100 times faster than the
>> orgininal XML. Given the incompetence of the W3C in designing XML,
>> I would not entrust them to produce a binary equivalent. Let's just
>> stick with ASN.1. Unless it had built-in dictionary compression, it
>> is not going to be sufficiently better than ASN.1 to warrant a
>> competing format.
>>
>
> Except that, apparently, it's not terribly well known or supported.
> That does make a difference. One of the selling points of XML is that
> it can allow diverse groups to share data.
>

Not terribly well known at all...

Are there parsers or en-/decoders for VB, Python, JavaScript and all the
other languages I frequently have to use to interpret data from other
systems?

When organizations like governments choose XML for data exchange they don't
do it for the "coolness factor", but because they need to publish
data to 3rd parties not involved in the spec at all.

I am frequently given the task of importing some government/large company
data into one app or another, and am very grateful each time I'm given
an XML format with (important!) a proper schema file or DTD. With the
schema/DTD, I can make sure the data is valid and well formed, and
I can even automatically adapt to changes.

My point is (I think :-) that a government is seldom in a situation where
it has a single counterpart with which it can agree upon a fixed, flat
format...

--
Dag.


Jon Martin Solaas

Apr 5, 2006, 2:53:35 AM
Roedy Green wrote:
> On 4 Apr 2006 08:27:51 -0700, "Homer" <hom...@hotmail.com> wrote,
> quoted or indirectly quoted someone who said :
>
>> <FirstName>John</FirstName>
>> <LastName>Smith</LastName>
>> <PhoneNum>5555555</PhoneNum>
>> <Address>37 Finch Ave.</Address>
>
> canada has a population of some 30 million. We are talking some
> fairly fat files. Not ones you feed to Winzip.

Why would anyone want to apply compression manually? Automate the rest
of the process and then use WinZip? It's hardly likely that the database
with all those records runs on a platform that can run WinZip :-)

Also, isn't it likely that the file would be split up?

Peter....@aqute.se

Apr 5, 2006, 4:07:49 AM
Interesting, I agree with your conclusion but for opposite reasons :-)

For computer-computer communications XML is quite good, though verbose.
If you have come to the scene in the last 5 years, you have no idea
how many issues there were sending files between computers. Character
encodings, format changes, field length differences, data types that
were impossible to transfer, nested data: I can assure you it was
usually hell. XML is not a good solution, but it is sufficient for
this purpose and has become the best because it has become a standard.
This has created a large market for tools that can easily interwork.
Today, when an XML file must be transformed because of a version
mismatch, it is a trivial task.

The size problem is relatively easy to solve: zip it. In Java it is
trivial to zip the XML in a JAR or ZIP stream. This usually reduces
the size to about 10% of the original. Obviously this trades off CPU
cycles against bandwidth/storage, so it should be used with care.
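
As a rough illustration of the zip-it approach, here is a minimal sketch using the standard java.util.zip GZIP streams. The record layout and the class name GzipXmlSketch are made up for the example, and the ratio you actually get depends entirely on the data, so treat the printed numbers as something to measure rather than a promise.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: compresses a deliberately repetitive XML document.
final class GzipXmlSketch {

    // Build sample XML with the same four tags repeated for every record.
    static String buildXml(int records) {
        StringBuilder sb = new StringBuilder("<PersonList>\n");
        for (int i = 0; i < records; i++) {
            sb.append("<Person><FirstName>John").append(i).append("</FirstName>")
              .append("<LastName>Smith</LastName>")
              .append("<PhoneNum>").append(5550000 + i).append("</PhoneNum>")
              .append("<Address>37 Finch Ave.</Address></Person>\n");
        }
        return sb.append("</PersonList>\n").toString();
    }

    // Compress a byte array with GZIP.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Decompress, to show the round trip is lossless.
    static byte[] gunzip(byte[] packed) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = buildXml(10000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:     " + raw.length);
        System.out.println("gzipped bytes: " + packed.length);
    }
}
```

For a real transfer you would wrap the file or socket stream directly instead of buffering everything in memory; the byte arrays here only keep the sketch short.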

The reason I think XML is bad is that lazy programmers have
standardized on it for human-computer communication. Ant, Maven, WAR,
J2EE, XSLT, and too many others force humans to write XML, and we are
lousy at it. The verbosity hides the important elements, making it very
difficult to understand without inspecting the code in detail. The sole
reason for this is that the programmer is too lazy (or, god forbid,
incompetent) to write a real grammar and parser for the task at hand.
The argument that we then all use the same language is wrong. XML is
used as a meta-language; the real language is still effectively hidden
in its tags and attributes. Worse, attributes often introduce an
additional language (XPath, for example). This means the burden is put on
the user and not the computer, and imho that is fundamentally wrong.
I'd like my time optimized, not the computer's.

Homer wrote:
> I am a little bit tired of this obsession people have with XML and XML
> technology. Please share your thoughts and let me know if I am thinking
> in a wrong way. I believe some people are over using XML all over the
> place. Nowadays Canadian Government is pushing XML to its organization
> as standard for data/file transfer. Huge files moving between companies
> now include tones of XML Tags repeating all over the file and slowing
> down networks and crashing applications because of size.
> I am not objecting to the whole technology. I know advantages of XML
> and using it all the times for Config files or our web oriented
> applications but using it as standard for moving big files is going too
> far. Here is the example:
>
> John,Smith,5555555,37 Finch Ave.
>
> Is now:
>
> <FirstName>John</FirstName>
> <LastName>Smith</LastName>
> <PhoneNum>5555555</PhoneNum>
> <Address>37 Finch Ave.</Address>
>
> And Tags are repeating and repeating:
>
> <FirstName>....</FirstName>
> <LastName>....</LastName>
> <PhoneNum>....</PhoneNum>
> <Address>....</Address>
>
> <FirstName>....</FirstName>
> <LastName>....</LastName>
> <PhoneNum>....</PhoneNum>
> <Address>....</Address>
>
> Please let me know what you think.
>
> Regards,
>
> Homer

Chris Uppal

Apr 5, 2006, 6:43:05 AM
Steve Wampler wrote:

> No problem:
>
> <f1>John</f1>
> <f2>Smith</f2>
> <f3>5555555</f3>
> <f4>37 Finch Ave.</f4>
>
> There, that should make people happy :)

Slightly OT, but I believe that the best practice for handling addresses is
just to have line1, line2, line3 and so on, rather than trying to identify the
"meaning" of each line. There is much less consistency across address formats
than most programmers (or schema designers) realise. So an XML format like
yours might be the best you can (or should) do.

-- chris

Chris Uppal

Apr 5, 2006, 6:45:12 AM
Martin Gregorie wrote:

> Here's another thought: use ASN.1 encoding. Have a look here
> <http://asn1.elibel.tm.fr/> if you haven't heard of it.

I can't understand why something as simple as data exchange (not /information/
exchange which is vastly more difficult) should require nine standards
documents which between them add up to book length. Nor why it should require
a book written about it. Why do people have to make things so /complicated/ ?

XML is, if anything, even worse.

Even YAML is way too complicated, albeit not in the same league as ASN.1 or
XML.

-- chris

Chris Uppal

Apr 5, 2006, 6:45:23 AM
Monique Y. Mudama wrote:

> XML isn't new enough to offer the glamour factor you think it has.

Remember that we are talking about a government here. Being only a decade
behind the times is damned impressive !

-- chris

Chris Uppal

Apr 5, 2006, 6:45:48 AM
Monique Y. Mudama wrote:

[about ASN.1]

> Except that, apparently, it's not terribly well known or supported.
> That does make a difference. One of the selling points of XML is that
> it can allow diverse groups to share data.

Um, it is rather /widely/ used. Do you remember when someone discovered a
bunch of exploitable vulnerabilities in a commonly used ASN.1-related library
(I think it might have been a code generator that produced vulnerable code)?
The list of affected products included just about every vendor of
network-related kit.

-- chris

Chris Uppal

Apr 5, 2006, 6:49:53 AM
Monique Y. Mudama wrote:

> > what part of the world does "doddle" derive from? It just means
> > "easy"?
>
> I had a mental image of a toddler, er, toddling along. No idea if
> that's actually what was meant. In the context of my brain, it meant
> "so easy a toddler could do it."

The word's common in British English. I don't know about other
dialects/flavours.

The word "doddle" does derive from "toddle", according to the OED, where
"toddle" means the halting walk of an infant or elderly/infirm person. A
doddle, however, is just something that is easy -- as the OED puts it: "a
'walk-over'".

-- chris


Joe Attardi

Apr 5, 2006, 9:02:23 AM
> Also, isn't it likely that the file would be split up?
Exactly. Any data set containing 30 million records would be grossly
inefficient in one single file, whether it be XML or otherwise.

bugbear

Apr 5, 2006, 9:34:33 AM
Homer wrote:
> John,Smith,5555555,37 Finch Ave.
>
> Is now:
>
> <FirstName>John</FirstName>
> <LastName>Smith</LastName>
> <PhoneNum>5555555</PhoneNum>
> <Address>37 Finch Ave.</Address>

In the first example is 5555555 a phone number, or
part of the address?

And w.r.t. repeating tags: one word. gzip.
Several applications simply use gzip'd XML
to get a good compromise.

gzip (and other compressors) are rather good
at crunching off the kind of trivial
repetition you object to.

BugBear

Monique Y. Mudama

Apr 5, 2006, 9:50:42 AM
On 2006-04-05, Chris Uppal penned:

*ducks*

Okay, I guess it is widely supported. I just haven't happened to have
come across anything in my development work that ever made use of it
(that I know of). I shouldn't have generalized that to the rest of
the world.

Monique Y. Mudama

Apr 5, 2006, 9:57:25 AM
On 2006-04-05, Chris Uppal penned:

Now, now. In 1999 I worked on a US govt project (I think it was DoD, or
maybe DISA) to create an XML repository to share across govt branches.

I also spent 1998 through, erm, a couple of years ago working on Java
systems for some defense-related stuff. I think when we started we
were using 1.1.7, and it did take a looooong time to convince the
customer to upgrade, but after that it wasn't too hard to keep moving.
I remember getting bitten by glob imports + that new List class,
engendering a hatred of glob imports that continues to this day.

Some govt customers are very into new technology (almost to the point
of silliness -- they want to reimplement in the new stuff even if
there's no direct benefit and resources would be better spent
improving the rest of the app).

Homer

Apr 5, 2006, 9:59:46 AM
That's great. Put tons of repeating tags inside the file and make it
huge, and now everybody is saying how to make it small with
Gzip/Binary,...

The third field (between delimiters, whatever they are) is the phone number.
Any file has a File Spec Document (unless you XML lovers have replaced it
with some XML equivalent).

When the sender and receiver have agreed on a format there is no need to
repeat labels. It's like what you write on a postal envelope. Or you told your
wife your name is John 20 years ago. No need to wear a name tag just in
case you change your name (if you change your name, tell her one more
time; that is, send the File Spec Doc to the receiver).

I am still saying I am 100% with you all that IF you are sending data
in small volume and/or the receiver doesn't know about the file format,
XML is the best solution. But using it as a tool to fix every problem is
going too far.

Homer

Gordon Beaton

Apr 5, 2006, 11:19:58 AM
On 4 Apr 2006 08:27:51 -0700, Homer wrote:
> I am a little bit tired of this obsession people have with XML and
> XML technology. Please share your thoughts and let me know if I am
> thinking in a wrong way.

I don't use XML myself, but someone sent me this recently and it might
give you something to think about:

http://www.developer.com/xml/article.php/10929_3583081_1

/gordon

--
[ do not email me copies of your followups ]
g o r d o n + n e w s @ b a l d e r 1 3 . s e

Timbo

Apr 5, 2006, 10:03:01 AM
Homer wrote:
> I guess these responses are proving of my point. You know all that the
> best solution for transferring huge files between two parties is simple
> flat file that both sender/receiver have agreed upon file format and
> using secure line. But you still defend adding tons of tags to a file
> that both sender/receiver are familiar with the format.
>
My guess is that you don't really understand either my post, or
XML. It's not the FORMAT of XML, it's the fact that it contains
MEANING. So, if the sender and receiver have a shared ontology
that says that FirstName is someone's first name, then the data
<FirstName>John</FirstName> is more than just some text with the
value "John"... it is saying that "John" is his first name. So
rather than just having raw data, you have information that is
useful to the receiver. Moreover, for a third party to use this
information, you need only give them the shared definitions,
rather than give them both the format and the meaning.

Oliver Wong

Apr 5, 2006, 10:32:02 AM

"Steve Wampler" <swam...@noao.edu> wrote in message
news:4432E8AE...@noao.edu...

Well, at least with this notation, I wouldn't have made my initial
mistake of thinking I was dealing with 4 records which seemed to be
arbitrary strings.

Given the tag names, I can see I am dealing with a single record with 4
fields.

So we're making progress here, but perhaps the tag names could have been
better chosen.

And if there were an XSD along with this, I could check whether f3 was
purely numeric, or if it could contain arbitrary string data as well.
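
To make the XSD point concrete, here is a minimal sketch using the standard javax.xml.validation API. The schema is invented for the example (nobody in this thread has published one): it wraps the four fields in a hypothetical <record> element and restricts f3 to digits only. The class name XsdCheckSketch is likewise just for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class: validates a record against an assumed schema.
final class XsdCheckSketch {

    // Assumed schema: one <record> holding f1..f4 in order; f3 must be all digits.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='record'><xs:complexType><xs:sequence>"
      + "<xs:element name='f1' type='xs:string'/>"
      + "<xs:element name='f2' type='xs:string'/>"
      + "<xs:element name='f3'><xs:simpleType>"
      + "<xs:restriction base='xs:string'><xs:pattern value='[0-9]+'/></xs:restriction>"
      + "</xs:simpleType></xs:element>"
      + "<xs:element name='f4' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the document conforms to the schema above.
    static boolean isValid(String xml) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String good = "<record><f1>John</f1><f2>Smith</f2><f3>5555555</f3><f4>37 Finch Ave.</f4></record>";
        String bad  = "<record><f1>John</f1><f2>Smith</f2><f3>555-5555</f3><f4>37 Finch Ave.</f4></record>";
        System.out.println("digits only: " + isValid(good));
        System.out.println("with hyphen: " + isValid(bad));
    }
}
```

Note that this only tells you f3 is numeric; whether it is a phone number or an ID is still something the two ends have to agree on.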

- Oliver

Bent C Dalager

Apr 5, 2006, 11:02:55 AM
In article <1144245586.6...@v46g2000cwv.googlegroups.com>,

Homer <hom...@hotmail.com> wrote:
>That's great. Put tones of repeating tags inside the file and make it
>huge and now everybody is saying how to make it small with
>Gzip/Binary,...
>
>Third field (between delimiters; whatever it is) is phone number. Any
>file has File Spec Document (unless you XML lovers has replaced it with
>some XML equivalent).
>
>When the sender and receiver are agreed on format there is no need to
>repeat labels.

XML isn't particularly useful for the original sender and receiver.
They would probably be better off using a binary format. It is useful
for the third party who wants his product to interact or compete with
the software used by sender and receiver and therefore needs to
reverse engineer the protocol being used between them. In this
context, a high level of protocol redundancy is extremely useful since
it makes it reasonably easy for a human to work out what is going on
so that he can replicate it.

This is part of the same philosophy that made the Internet so big in
the first place: simple protocols that anyone could understand and
hook into. SMTP isn't a very good protocol by any stretch of the
imagination, but it is _simple_ and you can very easily hook into it
to make it do the things _you_ want it to do. If SMTP had been
ASN.1-based, chances are X.400 or something would have won the email
protocol wars because only professionals would have bothered extending
SMTP or creating cheap (free) MTAs, mail clients, etc.

XML may be a resource hog, it may be absolutely preposterous from an
information theory standpoint and it may have accumulated a shedload
of idiosyncrasies over time, but it does help keep technology and
protocols accessible to hobbyists and starting programmers. This is
highly useful in itself and might very well be enough to justify its
widespread adoption.

>I am still saying I am %100 with you all that IF you are sending data
>in small volume and/or receiver doesn't know about the file format
>XML is the best solution. But use it as a tool to fix any problems is
>going too far.

I tend to bring up my "XML-based streaming video" horror scenario in
these debates just to point out that XML should be used with some
caution:

<video-frame number="1654392">
  <line number="1">
    <pixel number="1">
      <colour red="0" green="14" blue="200"/>
    </pixel>
    <pixel number="2">
      <colour red="0" green="13" blue="198"/>
    </pixel>
    <pixel number="3">
      <colour red="3" green="12" blue="197"/>
    </pixel>
    <!-- more pixels . . -->
  </line>
  <!-- more lines ... -->
</video-frame>

_Now_ we're talking broadband :-)

Cheers
Bent D
--
Bent Dalager - b...@pvv.org - http://www.pvv.org/~bcd
powered by emacs

Joe Attardi

Apr 5, 2006, 11:15:02 AM
> I am still saying I am %100 with you all that IF you are sending data
> in small volume and/or receiver doesn't know about the file format
> XML is the best solution. But use it as a tool to fix any problems is
> going too far.

I do agree with you on this one! XML is definitely not a catch-all
solution for every problem. Using it to send 30 million records is
probably not a good use for it.

But, you are being too harsh on XML, accusing people of using it
because it's "the cool thing" or because it's new and a novelty (both
of which are false, by the way).

_For applicable problems_, XML is extremely useful because of its
features, not because it is a "cool" technology.

What if the XML data needs to be converted to some other format? XSLT
to the rescue! I can use an XSL stylesheet to quickly convert my XML
data file to your flat file comma-delimited format.
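
A minimal sketch of that conversion, using the standard javax.xml.transform API with an XSLT 1.0 stylesheet held in a string. The <PersonList>/<Person> wrapper elements and the class name XmlToCsvSketch are assumptions for the example, and the stylesheet does no CSV quoting, so a comma inside an address would break the output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: turns <Person> elements into comma-delimited lines.
final class XmlToCsvSketch {

    // Stylesheet: one output line per Person, fields in a fixed order.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/PersonList'>"
      + "<xsl:for-each select='Person'>"
      + "<xsl:value-of select='FirstName'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='LastName'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='PhoneNum'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='Address'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the stylesheet to an XML string and return the flat-file text.
    static String toCsv(String xml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<PersonList><Person>"
            + "<FirstName>John</FirstName><LastName>Smith</LastName>"
            + "<PhoneNum>5555555</PhoneNum><Address>37 Finch Ave.</Address>"
            + "</Person></PersonList>";
        System.out.print(toCsv(xml));
    }
}
```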

Then there are XML data-binding tools like JAXB, Castor, etc. Give one an
XSD, and *poof* it generates a set of Java classes around it. Now I can
load my XML data in a program and not worry about parsing it with
DOM/SAX or searching for data using XPath.

Joe Attardi

Apr 5, 2006, 11:20:07 AM
Homer wrote:
> That's great. Put tones of repeating tags inside the file and make it
> huge and now everybody is saying how to make it small with
> Gzip/Binary,...
>
> Third field (between delimiters; whatever it is) is phone number. Any
> file has File Spec Document (unless you XML lovers has replaced it with
> some XML equivalent).

One thing nobody's really mentioned much yet is attributes. They are
just as descriptive as elements, and your example does tend to overuse
the sub-elements. Consider this:

Instead of:


<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

what about,

<PersonList>
  <Person firstName="John" lastName="Smith" phoneNum="5555555"
          address="37 Finch Ave." />
</PersonList>

You've cut down on duplicated text by half (since things like
firstname, lastname, etc. are now attributes and therefore don't need
closing tags).
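
If you want to measure the difference rather than eyeball it, here is a small sketch that renders the same record three ways and prints the character counts. The class and method names are hypothetical, and no XML or CSV escaping is attempted; it is only meant for comparing markup overhead on sample values.

```java
// Hypothetical demo class: compares markup overhead for one record.
final class MarkupSizeSketch {

    // The flat, comma-delimited form from the original post.
    static String asCsv(String first, String last, String phone, String address) {
        return first + "," + last + "," + phone + "," + address;
    }

    // One sub-element per field, each with an opening and a closing tag.
    static String asElements(String first, String last, String phone, String address) {
        return "<Person>"
            + "<FirstName>" + first + "</FirstName>"
            + "<LastName>" + last + "</LastName>"
            + "<PhoneNum>" + phone + "</PhoneNum>"
            + "<Address>" + address + "</Address>"
            + "</Person>";
    }

    // One attribute per field, so each name appears only once.
    static String asAttributes(String first, String last, String phone, String address) {
        return "<Person firstName=\"" + first + "\" lastName=\"" + last
            + "\" phoneNum=\"" + phone + "\" address=\"" + address + "\"/>";
    }

    public static void main(String[] args) {
        String f = "John", l = "Smith", p = "5555555", a = "37 Finch Ave.";
        System.out.println("csv:        " + asCsv(f, l, p, a).length());
        System.out.println("attributes: " + asAttributes(f, l, p, a).length());
        System.out.println("elements:   " + asElements(f, l, p, a).length());
    }
}
```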

Steve Wampler

Apr 5, 2006, 11:13:32 AM
Oliver Wong wrote:
>
> "Steve Wampler" <swam...@noao.edu> wrote in message
> news:4432E8AE...@noao.edu...
>> Oliver Wong wrote:
>>> Hmm, looks like I was way off... Not being an American, I am not
>>> familiar with American city names, nor American State abbreviations. If
>>> only you had used XML!
>>
>> No problem:
>>
>> <f1>John</f1>
>> <f2>Smith</f2>
>> <f3>5555555</f3>
>> <f4>37 Finch Ave.</f4>
>>
>> There, that should make people happy :)
>> (Of course, given this group, maybe the tags should be in Klingon...)
>
> Well, at least with this notation, I wouldn't have made my initial
> mistake of thinking I was dealing with 4 records which seemed to be
> arbitrary strings.
>
> Give the tag names, I can see I am dealing with a single record with
> 4 fields.

Really? I wouldn't have thought so. What makes you think 'f' stands
for 'field'? Maybe these are four new flavours of Ben&Jerry's ice cream.
(Not that I'd buy any of them...)

The point is that the tag names are, ultimately, just strings. We might
think we understand what they mean (and can be right a high percentage of
the time if the strings are well chosen), but in the end, they mean
whatever the code at each end defines the semantics (not the syntax)
to be. That code *still* has to agree at both ends, just as it does
with "John,Smith,5555555,37 Finch Ave.". I haven't seen anything in XML
that does more than provide a guarantee that the syntax is right.

Oliver Wong

Apr 5, 2006, 11:22:51 AM
"Roedy Green" <my_email_is_post...@munged.invalid> wrote in
message news:ait532h1pcnm2o4un...@4ax.com...

> On 4 Apr 2006 08:27:51 -0700, "Homer" <hom...@hotmail.com> wrote,
> quoted or indirectly quoted someone who said :
>
>>John,Smith,5555555,37 Finch Ave.

>
> There are ways now given an XML schema to create the equivalent binary
> ASN.1 that can be decoded up to 100 times faster than the orgininal
> XML. Given the incompetence of the W3C in designing XML, I would not
> entrust them to produce a binary equivalent. Let's just stick with
> ASN.1. Unless it had built-in dictionary compression, it is not going
> to be sufficiently better than ASN.1 to warrant a competing format.

Alternatively, you could take a "stack of services" view, as one
typically does with networking protocols, and just see XML as one of the
higher-level services. It's a good way to serialize simple state-only (i.e.
no behaviour) objects to a string. Parsing XML would be equivalent to:

ArrayList<Person> persons = new ArrayList<Person>();
Person p = new Person();
p.setFirstName("John");
p.setLastName("Smith");
p.setTelephone("5555555"); // or setId(...), or whatever this field actually is
p.setAddress("37 Finch Ave.");
persons.add(p);
//And so on for all the other records.

If the "raw" XML takes too long to transfer over the network, use a
separate compression service as a layer beneath the XML. That way, as
compression technology improves, we can swap out the underlying service
without changing any of the code that deals with the XML layer.
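
Here is a minimal sketch of the XML layer in that stack, assuming a pull parser (StAX, javax.xml.stream) is available. It reads from a plain Reader, so whatever sits underneath (a file, a socket, a decompressing stream wrapped in an InputStreamReader) can be swapped without touching this code. The Person class and the <PersonList>/<Person> element names are assumptions made for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: streams <Person> elements into plain objects.
final class StaxPersonSketch {

    // Simple state-only holder, as described in the post.
    static final class Person {
        String firstName, lastName, phoneNum, address;
    }

    // Reads one record at a time, so the whole document never has to fit in a tree.
    static List<Person> parse(Reader in) throws XMLStreamException {
        List<Person> persons = new ArrayList<Person>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Person current = null;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("Person")) {
                    current = new Person();
                } else if (current != null) {
                    // getElementText() consumes everything up to the field's end tag.
                    String text = r.getElementText();
                    if (name.equals("FirstName")) current.firstName = text;
                    else if (name.equals("LastName")) current.lastName = text;
                    else if (name.equals("PhoneNum")) current.phoneNum = text;
                    else if (name.equals("Address")) current.address = text;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && r.getLocalName().equals("Person")) {
                persons.add(current);
                current = null;
            }
        }
        r.close();
        return persons;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<PersonList><Person>"
            + "<FirstName>John</FirstName><LastName>Smith</LastName>"
            + "<PhoneNum>5555555</PhoneNum><Address>37 Finch Ave.</Address>"
            + "</Person></PersonList>";
        for (Person p : parse(new StringReader(xml))) {
            System.out.println(p.firstName + " " + p.lastName + ", " + p.phoneNum);
        }
    }
}
```

A real importer would hand each Person to a consumer as it is completed instead of collecting them in a list, which matters once the file holds millions of records.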

- Oliver

Joe Attardi

Apr 5, 2006, 11:27:12 AM
Steve Wampler wrote:
> I haven't seen anything in XML
> that does more than provide a guarantee that the syntax is right.

Hierarchical data, dude. What if someone has more than one phone
number? With the comma-delimited flat file approach, it's not readily
apparent how you could implement that.

<Person>
  <PhoneNumber>...</PhoneNumber>
  <PhoneNumber>...</PhoneNumber>
  ...
</Person>

we can have as many PhoneNumbers as we want that are associated with a
person, and because it's all hierarchical we can just walk up the
hierarchy to see who these PhoneNumbers belong to.
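
A minimal sketch of "walking up the hierarchy" with the standard DOM API. The name attribute on <Person> and the class name PhoneOwnerSketch are made up for the example; the point is only that each PhoneNumber node knows its parent.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: finds the owner of every phone number.
final class PhoneOwnerSketch {

    // Returns "owner -> number" for each PhoneNumber, in document order.
    static List<String> describePhones(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList phones = doc.getElementsByTagName("PhoneNumber");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < phones.getLength(); i++) {
            Element phone = (Element) phones.item(i);
            // Walk one level up: the enclosing Person element owns this number.
            Element owner = (Element) phone.getParentNode();
            out.add(owner.getAttribute("name") + " -> " + phone.getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PersonList>"
            + "<Person name='John Smith'>"
            + "<PhoneNumber>5555555</PhoneNumber><PhoneNumber>5551234</PhoneNumber>"
            + "</Person>"
            + "<Person name='Jane Doe'><PhoneNumber>5559876</PhoneNumber></Person>"
            + "</PersonList>";
        for (String line : describePhones(xml)) {
            System.out.println(line);
        }
    }
}
```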

Timbo

Apr 5, 2006, 10:36:58 AM
Homer wrote:
> Or you told your
> wife your name is John 20 years ago. No need to wear a name tag just in
> case you change your name (if you change your name tell her one more
> time; sending File Spec Doc to receiver)
>
This is a good example. People don't just usually walk up to a
group of people and say "John"... they say something like "My name
is John". They identify what the piece of data "John" actually
means using a shared definition for "My name is".

> I am still saying I am %100 with you all that IF you are sending data
> in small volume and/or receiver doesn't know about the file format
> XML is the best solution. But use it as a tool to fix any problems is
> going too far.
>

Yeah, I don't really know what you mean by that last sentence. I
suspect you are implying that people who like XML tend to see it
as a solution to everything? That's certainly true of every
technology.

As someone who encounters a lot of AI stuff in my job, I can
really see the value of associating values of things with what
they are meant to represent. XML is one way to help with this. I'm
not sure what the Canadian government is doing with the
information that it is transferring, but I can see the advantage
of tagging the kind of information that they would be using. If it
was parsing raw data into XML simply for the purpose of backing it
up over a network or something, that would be odd, but I'm
guessing they are doing other things with it too.

Chris Uppal

Apr 5, 2006, 11:06:01 AM
Timbo wrote:

> My guess is that you don't really understand either my post, or
> XML. It's not the FORMAT of XML, it's the fact that it contains
> MEANING.

But it doesn't. The meaning comes from the /interpretation/ of the data, not
from its transmission form. The parties sharing data must come to an agreement
about the meaning before they can share information. Once they have done that,
deciding on a shared format is pretty trivial whether they use XML, ASN.1,
YAML, CSV, or a custom format.

-- chris


Chris Uppal

Apr 5, 2006, 11:32:49 AM
Bent C Dalager wrote:

> XML may be a resource hog, it may be absolutely preposterous from an
> information theory standpoint and it may have accumulated a shedload
> of idiosyncrasies over time, but it does help keep technology and
> protocols accessible to hobbyists and starting programmers.

Thing is, I doubt whether that is true. There seems to be an XML mindset that
can be summed up as "don't reinvent the wheel" (to be charitable) or "use
pre-existing work, no matter how complex it is" (to be uncharitable). XML
itself inherits all sorts of unwanted complexity from SGML. The applications
of XML tend to want to use XML as metadata. Then they start to define stuff in
terms of other XML languages, or using other XML languages. The end result is
/seriously/ complicated.

In my opinion there's a badly inverted pyramid at work. In normal situations
you build more complex systems on less complex ones. That doesn't seem to
apply to the XML world. It builds complex systems on top of even more complex
systems.

A year or two back, I got the idea that RDF would be suitable for a very small
project of mine. So I started looking into RDF. It was a private project and
I didn't want to spend more than a few days coding. By the time I realised
that I wasn't going to find the bottom of the RDF tar-pit, I'd already spent
that "few days"...

"Proper" XML (i.e. used semantically, not just as an unbelievably clunky and
ineffective file format) is not -- in my very limited experience -- accessible
to someone who can't spend /lots/ of time on it upfront, and continue to spend
lots of time on it thereafter.

-- chris

Steve Wampler

Apr 5, 2006, 11:32:53 AM
Joe Attardi wrote:
> I do agree with you on this one! XML is definitely not a catch-all
> solution for every problem. Using it to send 30 million records is
> probably not a good use for it.
>
> But, you are being too harsh on XML, accusing people of using it
> because it's "the cool thing" or because it's new and a novelty (both
> of which are false, by the way).

Agreed, but if I have to sit through one more PowerPoint presentation
where the presenter throws up a bunch of slides full of XML as if that's
conveying useful information to the audience, I'm going to scream. XML
is *NOT* the appropriate tool for this! There are *much better* ways to
present data to humans and these presenters are clearly showing XML
because it's "the cool thing" - or immensely lazy - which makes me
wonder about the quality of the code they write...

Steve Wampler

Apr 5, 2006, 11:38:26 AM

Eh? That's still syntax. Are you saying all syntax is non-hierarchical?

People have represented hierarchical data in many ways *well before XML*,
including, yes, flat files - and it's not that hard. It's still a syntax issue.
Heck, even arbitrary graph data (hardly "hierarchical") has many syntactic
representations, including flat files.
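
One way this can look in a flat file, as a minimal sketch: the two ends agree that the fourth field holds zero or more phone numbers separated by semicolons. That convention, the field order, and the class name FlatPhonesSketch are all assumptions for the example; a real spec would also have to say how embedded commas and semicolons are escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: a one-level hierarchy inside a comma-delimited line.
final class FlatPhonesSketch {

    static final class Person {
        final String firstName, lastName, address;
        final List<String> phones;

        Person(String firstName, String lastName, String address, List<String> phones) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.address = address;
            this.phones = phones;
        }
    }

    // Assumed layout: first,last,address,phone1;phone2;...
    static Person parseLine(String line) {
        String[] f = line.split(",", -1); // -1 keeps trailing empty fields
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        List<String> phones = new ArrayList<String>();
        for (String p : f[3].split(";")) {
            if (!p.isEmpty()) {
                phones.add(p);
            }
        }
        return new Person(f[0], f[1], f[2], phones);
    }

    public static void main(String[] args) {
        Person p = parseLine("John,Smith,37 Finch Ave.,5555555;5551234");
        System.out.println(p.firstName + " " + p.lastName + " has " + p.phones.size() + " numbers");
    }
}
```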

Look, I *like* XML *for some things*, but wish people would take the time
to recognize what it is and what it isn't, please.

Oliver Wong

Apr 5, 2006, 12:00:59 PM

"Chris Uppal" <chris...@metagnostic.REMOVE-THIS.org> wrote in message
news:4433e324$0$1168$bed6...@news.gradwell.net...

[In this post, I will group "XML", "ASN.1", "YAML", and "CSV with
headers" all under a single group which I will call "XML"; basically, this
"XML" group means data with metadata tags. As for "CSV without headers" and
"custom format", I'm going to group them together as "typical binary file".]

I'd say it's somewhere in between Timbo's and Chris's claims [with the
distortion of Chris's claim as described above]. If you plonked a typical
"binary" file onto my desktop (e.g. perhaps ripping a random file from a
Playstation DVD), and told me to try to interpret it, I could get out my hex
editor, and look around for human-readable strings, and from there maybe
look for end-of-string markers, or some sort of length-of-string headers,
and then from there try to figure out markers for other datatypes, but I
probably wouldn't get very far.

Give me a typical XML file though, and I could probably come up with an
interpretation that is near the original, depending on how the elements and
attributes are named. If the file contains a reference to a DTD or XSD,
then I could navigate over to that URL and gain even more information.

- Oliver

Oliver Wong

Apr 5, 2006, 12:07:11 PM

"Steve Wampler" <swam...@noao.edu> wrote in message
news:4433DE9C...@noao.edu...

> Oliver Wong wrote:
>>
>> "Steve Wampler" <swam...@noao.edu> wrote in message
>> news:4432E8AE...@noao.edu...
>>> Oliver Wong wrote:
>>>> Hmm, looks like I was way off... Not being an American, I am not
>>>> familiar with American city names, nor American State abbreviations. If
>>>> only you had used XML!
>>>
>>> No problem:
>>>
>>> <f1>John</f1>
>>> <f2>Smith</f2>
>>> <f3>5555555</f3>
>>> <f4>37 Finch Ave.</f4>
>>>
>>> There, that should make people happy :)
>>> (Of course, given this group, maybe the tags should be in Klingon...)
>>
>> Well, at least with this notation, I wouldn't have made my initial
>> mistake of thinking I was dealing with 4 records which seemed to be
>> arbitrary strings.
>>
>> Give the tag names, I can see I am dealing with a single record with
>> 4 fields.
>
> Really? I wouldn't have thought so. What makes you think 'f' stands
> for 'field'? Maybe these are four new flavours of Ben&Jerry's ice cream.
> (Not that I'd buy any of them...)

If the 4 elements were 4 of the same things, they'd have the same name.
So if they were all flavors, the document should have looked something like:

<f>John</f>
<f>Smith</f>
<f>5555555</f>
<f>37 Finch Ave.</f>

Then I'd say we have 4 records, each record containing 1 field, which
can be an arbitrary string (semantically, the field might represent
Ben&Jerry ice creams).

Since each element had a different name, I can conclude that this is 1
record with 4 fields. Perhaps each field represents a flavor of ice cream
(e.g. this is somebody's top 4 favorite ice creams, or these are the 4 most
profitable flavors, or these are, in order of submission, 4 flavors being
requested by customers etc.)

BTW, the fact that the tag names contained an 'f' is irrelevant to my
calling them fields. The document could have been

<boo>John</boo>
<bar>Smith</bar>
<buntz>5555555</buntz>
<batz>37 Finch Ave.</batz>

And I would have still come to the conclusion that we're dealing with a
single record with 4 fields.

> The point is that the tag names are, ultimately, just strings. We might
> think we understand what they mean (and can be right a high percentage of
> the time if the strings are well chosen), but in the end, they mean
> whatever the code at each end that defines the semantics (not the syntax)
> to be. That codes *still* has to agree at both ends, just as it does
> with "John,Smith,5555555,37 Finch Ave.". I haven't seen anything in XML
> that does more than provide a guarantee that the syntax is right.

Later on in the thread, someone mentions hierarchy, and you respond that
we've had hierarchy before XML. Well, we had syntax checking before XML too.
XML doesn't give us anything new in that sense. It just gives us a "better"
way of doing what we've been previously doing, where "better" depends on the
problem you're trying to solve.

- Oliver

Bent C Dalager

Apr 5, 2006, 12:11:35 PM
In article <4433e325$0$1168$bed6...@news.gradwell.net>,

Chris Uppal <chris...@metagnostic.REMOVE-THIS.org> wrote:
>
>Thing is, I doubt whether that is true. There seems to be an XML mindset that
>can be summed up as "don't reinvent the wheel" (to be charitable) or "use
>pre-existing work, no matter how complex it is" (to be uncharitable). XML
>itself inherits all sorts of unwanted complexity from SGML. The applications
>of XML tend to want to use XML as metadata. Then they start to define stuff in
>terms of other XML languages, or using other XML languages. The end result is
>/seriously/ complicated.

You are referring to the use of namespaces, and importing a namespace
someone else made instead of making your own tags for the same stuff?
If so, I would agree that this leads to added complexity. It tends to
force you to have to relate to a number of tags that are unnecessary
for the application at hand but which happened to be inherited from
the external namespace, and the organization of that namespace may not
be optimal for the use it is getting put to in the derived
application.

>In my opinions there's a badly inverted pyramid at work. In normal situations
>you build more complex systems on less complex ones. That doesn't seem to

I don't know. I have yet to write a Swing application that is more
complex than Swing <g>

>apply to the XML world. It builds complex systems on top of even more complex
>systems.
>A year or two back, I got the idea that RDF would be suitable for a very small
>project of mine. So I started looking into RDF. It was a private project and
>I didn't want to spend more than a few days coding. By the time I realised
>that I wasn't going to find the bottom of the RDF tar-pit, I'd already spent
>that "few days"...

I haven't used RDF, but I would agree that it is quite possible to
completely bollox up an XML application. When defining XML
applications for my own use, I always find that it is a serious
mistake to try and be clever about it :-)

There are an enormous number of advanced features in XML (mostly
inherited from SGML, as you point out) that you really don't want to
be using. It will just end up confusing both yourself and any other
developers who have to relate to your XML application.

This, incidentally, is why I tend to shake my head at any Microsoft PR
person that says such things as "it is defined in XML and therefore it
is an open format". Along this particular axis, XML is more like Perl
than it is like Java. It may be incredibly useful and powerful, but it
also supports arbitrarily horrid levels of obfuscation. Whether or not
you can get any sense out of an XML application is entirely up to
whoever wrote it.

Timbo

Apr 5, 2006, 11:55:29 AM
Steve Wampler wrote:
> I haven't seen anything in XML
> that does more than provide a guarantee that the syntax is right.

Ok, so say you are writing an application that deploys an agent to
find you the best prices for CDs on the web. If you share the same
ontological definition of CD attributes, you could have the
following album embedded in a webpage:

<Album>
  <Artist> Stevie Wonder </Artist>
  <Title> Innervisions </Title>
  <Producer> .. </Producer>
  <Track number="1" name=".."/>
  <Track number="2" name=".."/>
  ... etc..
  <Price> £5</Price>
</Album>

Compare that to the text:

Stevie Wonder, Innervisions, 1: ..., 2: ..., £5

You can see that clearly, any online CD store that follows the XML
definition in the first one (which could be defined in a schema)
would be easier to browse than one that has free text, especially
if some CDs have data that others don't, such as accompanying
musicians. You could find the grammar for the free text, write a
parser for it (or download one), and interpret the parsed data,
but simply sharing the set of definitions is more straightforward.
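
A minimal sketch of what the agent side of this could look like, assuming Python's standard xml.etree.ElementTree module. The track names are placeholders and the price handling (stripping a leading pound sign) is an assumption made for illustration, not part of any agreed schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical album record using the shared element names from the example.
ALBUM_XML = """
<Album>
  <Artist> Stevie Wonder </Artist>
  <Title> Innervisions </Title>
  <Track number="1" name="Track A"/>
  <Track number="2" name="Track B"/>
  <Price> £5</Price>
</Album>
"""

def read_album(xml_text):
    """Pull the fields an agent would compare out of one <Album> element."""
    album = ET.fromstring(xml_text.strip())
    return {
        "artist": album.findtext("Artist").strip(),
        "title": album.findtext("Title").strip(),
        "tracks": [(int(t.get("number")), t.get("name"))
                   for t in album.findall("Track")],
        "price": float(album.findtext("Price").strip().lstrip("£")),
        # Optional element: it is absent in this record, so this is None.
        "producer": album.findtext("Producer"),
    }

album = read_album(ALBUM_XML)
print(album["artist"], "/", album["title"], "/", album["price"])
```

Fields that a given store leaves out (here, the producer) simply come back as None, without the reader needing a different grammar per store.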

Timbo

unread,
Apr 5, 2006, 12:01:29 PM4/5/06
to
Chris Uppal wrote:
> Timbo wrote:
>
>
>>My guess is that you don't really understand either my post, or
>>XML. It's not the FORMAT of XML, it's the fact that it contains
>>MEANING.
>
>
> But it doesn't. The meaning comes from the /interpretation/ of the data, not
> from its transmission form. The parties sharing data must come to an agreement
> about the meaning before they can share information.

??? Which was exactly what I said in the sentence after the one
you quoted! :-) In hindsight, MEANING wasn't the correct word...
and I'm not sure of what IS the correct word...

> Once they have done that,
> deciding on a shared format is pretty trivial whether they use XML, ASN.1,
> YAML, CSV, or a custom format.
>

Sure, you can send it in a CSV format, but to keep the meta-data,
then it would be:
FirstName=John, LastName=Smith, Phone=55555, etc,

where you basically have the tags in the CSV, and you are then
facing the same problems as the original poster was complaining
about. It's not the syntax of XML that is useful (frankly, I find
it tediously difficult to follow when I am forced to), it's the
fact that it provides an easy way to store meta-data, and there
are lots of nice tools to support this. It's this meta-information
that the original poster does not like.

Timbo

unread,
Apr 5, 2006, 12:05:06 PM4/5/06
to
Steve Wampler wrote:

> Agreed, but if I have to sit through one more PowerPoint presentation
> where the presenter throws up a bunch of slides full of XML as if that's
> conveying useful information to the audience, I'm going to scream. XML
> is *NOT* the appropriate tool for this!

AAAGGGHHH!!! Yes!! I hate it when people put XML into the
research/technical papers and presentations! Unless of course the
paper/presentation is actually about XML, then I guess it could be
quite necessary :-P XML is not a human-readable format.

Steve Wampler

unread,
Apr 5, 2006, 12:44:04 PM4/5/06
to
Oliver Wong wrote:
> If the 4 elements were 4 of the same things, they'd have the same
> name. So if they were all flavors, the document should have looked

You're attaching semantics (note the 'should'). There is nothing in
XML that prevents someone from using a unique tag for every entry.
Granted that's not accepted convention (and a bad idea, to boot), but you
have to make some semantic assumptions to get any interpretation out of XML.

> Later on in the thread, someone mentions hierarchy, and you respond
> that we've had hierarchy before XML. Well, we had syntax checking before
> XML too. XML doesn't give us anything new in that sense. It just gives
> us a "better" way of doing what we've been previously doing, where
> "better" depends on the problem you're trying to solve.

Oh, I agree with that. The point I was responding to was the statement
that seemed to imply hierarchy was not syntax, and that it was difficult
to represent hierarchy without XML. The difficulty is that there are
too many ways to do so (no standardization) - that's XML's real
contribution, to me - even though it's flawed, it is nearly ubiquitous.
(Although there are still too many ways to represent the same data in
XML as well - which is probably true of any "sufficiently powerful" syntax -
at least it's possible to always parse the XML. Making sense of the
result is another matter).

The fact that it is also (somewhat) self-defining in syntax is also useful
in some contexts, but not something I find *overwhelmingly* valuable.

James McGill

unread,
Apr 5, 2006, 12:58:17 PM4/5/06
to
On Wed, 2006-04-05 at 16:11 +0000, Bent C Dalager wrote:
>
> You are referring to the use of namespaces, and importing a namespace
> someone else made instead of making your own tags for the same stuff?
> If so, I would agree that this leads to added complexity.

Here's a case in point, from my corner of the real world:

The "Any" element that can go in a RFC-2518 "DAV" Multistatus.
It turns out to be quite difficult to bind this kind of XSD to
Objects, although, with a little work, Castor handles it just fine.
But you basically have a schema that says "put anything here", and in
order to implement that, you have to compromise and specify what
"anything" may consist of.

Another problem that came to light from that experience was, once
Microsoft interprets a spec wrong, that wrong interpretation becomes the
spec, no matter what anyone else says. Even a very well designed, fully
specified XML schema is no guarantee of a successful interface
constraint!

Steve Wampler

unread,
Apr 5, 2006, 1:01:51 PM4/5/06
to

Hmmm, I, as a human, find the second form *much* easier to browse. I can pick
out the actual content *much* faster. Granted, I might prefer something like:

Stevie Wonder: Innervisions ($9.25)
1: ....
2: ....
3: ....

but that would depend on whether I'm more interested in the artist and album or
the details of the album content. (Great price, by the way!)

Of course, you're talking about computer handling of the data, where your points
are more valid. That's *still* syntax though.

Oliver Wong

unread,
Apr 5, 2006, 2:19:39 PM4/5/06
to

"Steve Wampler" <swam...@noao.edu> wrote in message
news:4433F7FF...@noao.edu...

I find Timbo's XML version as easy to read as Timbo's CSV version.
However, I do find Steve's "custom" version easier to read over the other
two, as a human.

However, another nice thing about XML over the other two formats is that
there is a standardized escaping mechanism. Artists are... well...
artistic... and they sometimes do crazy things. In CSV, or the custom
format, how do you distinguish between an album whose name is the empty
string, and an album whose name is the single space character? What if the
album contains a colon in it? What if the artist name contains a colon in
it? What if the album name contains an open-parenthesis and dollar sign in
it, but no close-parenthesis? Etc.

As purely digital music becomes more popular (e.g. songs existing only
as OGG or MP3 files, and no physical albums, so no cover art necessary),
you could have tech-savvy artists define the names of their tracks to be the
newline character for some specific platform, for example. Maybe I'll go
write a song right now whose name is the value of the Java literal String
expression "\u0000\r\n\u0008\r\n\n". For clarity, the name of my song is 7
characters long, and is not intended to be pronounced (there will be no
lyrics in the song).

With XML, it's possible to express unambiguously any possible string of
characters (using, e.g., entity-references). With CSV or the custom format,
you'd have to invent an escaping-system, and then I, as a human, would have
to learn about your escaping system to either be able to read the data
myself, or to implement a program which can parse the data.
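
A small sketch of the escaping argument, assuming Python's standard library. The titles are made up. Two caveats: the csv module used here implements one common quoting dialect, which is not something the bare term "CSV" guarantees, and XML 1.0 itself does not permit most control characters (U+0000 and U+0008 among them), so a title like the one above would still need an extra encoding layer such as base64:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical awkward titles: delimiters, quotes, markup, a lone space, empty.
titles = ['Songs: Vol. 1 ($9.25', 'a,b "c"', '<Title> & more', ' ', '']

# XML: the serializer escapes markup characters and the parser restores them.
root = ET.Element("Albums")
for t in titles:
    ET.SubElement(root, "Title").text = t
xml_text = ET.tostring(root, encoding="unicode")
xml_back = [e.text or "" for e in ET.fromstring(xml_text).iter("Title")]

# CSV with an agreed quoting dialect also round-trips.
buf = io.StringIO()
csv.writer(buf).writerow(titles)
csv_back = next(csv.reader(io.StringIO(buf.getvalue())))

# A hand-rolled join/split with no escaping rules does not.
naive_back = ",".join(titles).split(",")

print(xml_back == titles, csv_back == titles, naive_back == titles)
```

The difference in practice is who defines the escaping rules: XML fixes them in the spec, while for a flat file both ends have to agree on a dialect first.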

- Oliver

Oliver Wong

unread,
Apr 5, 2006, 2:33:05 PM4/5/06
to

"Steve Wampler" <swam...@noao.edu> wrote in message
news:4433F3D4...@noao.edu...

> Oliver Wong wrote:
>> If the 4 elements were 4 of the same things, they'd have the same
>> name. So if they were all flavors, the document should have looked
>
> You're attaching semantics (note the 'should'). There is nothing in
> XML that prevents someone from using a unique tag for every entry.
> Granted that's not accepted convention (and a bad idea, to boot), but you
> have to make some semantic assumptions to get any interpretation out of
> XML.
>

Yes, but that'd be abusing XML in the same sense that there's no reason
why you couldn't "use" CSV in the following manner:

<ExampleCSVDocument>
,,,,,,
,,
,,,,,,,,
,,,,
,,
,

,,,,,,
</ExampleCSVDocument>

Where I'm representing a sequence of integers such that each row (which
is equivalent to a line in CSV) is an integer, and the number of fields
(which is equal to the number of commas plus one) is the value of that
integer. I could represent ASCII (or even Unicode) text that way, but it'd
be breaking the unenforced-but-understood conventions of CSV.

In fact, I could further obfuscate the above document by putting in
random content in between the commas, and defining that you should just
ignore such content, and only count the number of commas.
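
For what it's worth, a decoder for that abuse is only a few lines. This is a sketch in Python; the function names are made up, and the rows are copied from the example document above:

```python
# Rows of the example "CSV" document, one string per line (one line is empty).
ROWS = [",,,,,,", ",,", ",,,,,,,,", ",,,,", ",,", ",", "", ",,,,,,"]

def decode_row(row):
    """Value of a row = number of fields = number of commas plus one.

    Anything between the commas is ignored, which also covers the
    'random content' obfuscation.
    """
    return row.count(",") + 1

def encode_value(n):
    """Inverse mapping: a positive integer n becomes n - 1 commas."""
    if n < 1:
        raise ValueError("this encoding cannot represent values below 1")
    return "," * (n - 1)

values = [decode_row(r) for r in ROWS]
print(values)
```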

> The difficulty is that there are
> too many ways to do so (no standardization) - that's XML's real
> contribution, to me - even though it's flawed, it is nearly ubiquitous.

I agree. The great thing about XML is that you can pretty much "accept"
them no matter what platform you're running on (as opposed to, say, a
Microsoft Word document). That's why I sometimes write documents in XHTML,
rather than Microsoft Word, as I want to make it readable by the widest
audience possible.

As to the "too many ways" issue, I've mentioned elsewhere in this thread
the "stack/layer of services" view. TCP/IP can be used for so many
applications: e-mail, file transfer, instant messaging, etc. How can we
make sense of all these uses? Well, the applications don't directly use
TCP/IP, but rather they use services (e.g. FTP) that use TCP/IP. And then
you could build a service on top of FTP to simulate a shared file system,
and so on.

So to me XML is one layer. You can then build layers on top of that
(e.g. XHTML, RSS, SOAP), and then build layers on top of those, and so on.
You can also put XML over other layers (e.g. XML->gzip->FTP->TCP/IP to send
XMLs between computers while still using a reasonable amount of bandwidth).
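
A rough way to check the "XML over gzip" layer on your own data, sketched in Python with the standard gzip module. The records are synthetic and hypothetical, so the printed ratios say nothing about real files; only the method carries over:

```python
import gzip

def make_records(n):
    # Hypothetical synthetic records; real data will compress differently.
    return [(f"John{i}", f"Smith{i}", str(5550000 + i), f"{i} Finch Ave.")
            for i in range(n)]

def to_csv(records):
    return "".join(",".join(r) + "\n" for r in records).encode("utf-8")

def to_xml(records):
    tags = ("FirstName", "LastName", "PhoneNum", "Address")
    parts = ["<People>\n"]
    for r in records:
        parts.append("<Person>")
        parts.extend(f"<{t}>{v}</{t}>" for t, v in zip(tags, r))
        parts.append("</Person>\n")
    parts.append("</People>\n")
    return "".join(parts).encode("utf-8")

records = make_records(5000)
csv_raw, xml_raw = to_csv(records), to_xml(records)
csv_gz, xml_gz = gzip.compress(csv_raw), gzip.compress(xml_raw)

print("raw  xml/csv size ratio:", round(len(xml_raw) / len(csv_raw), 2))
print("gzip xml/csv size ratio:", round(len(xml_gz) / len(csv_gz), 2))
```

The repeated tags are exactly the kind of redundancy a general-purpose compressor removes, so the gap between the two formats should narrow after compression, although the XML still costs more to generate and parse.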

> The fact that it is also (somewhat) self-defining in syntax is also useful
> in some contexts, but not something I find *overwhelmingly* valuable.

I like XML's linking to DTDs or XSDs. When you have this strange XML
file, and you think to yourself "Where can I find out more information on
the format?", you have an URL telling you exactly where to go. This is
better, IMHO, than the previous practice of relying on file extensions to
define what kind of data is in the file, and then using a site like
http://filext.com/ to find out more about those kinds of files.

- Oliver

Oliver Wong

unread,
Apr 5, 2006, 2:36:47 PM4/5/06
to
"Timbo" <ti...@noreply.invalid> wrote in message
news:e116e6$39h$1...@kinder.server.csc.liv.ac.uk...

>
> AAAGGGHHH!!! Yes!! I hate it when people put XML into the
> research/technical papers and presentations! Unless of course the
> paper/presentation is actually about XML, then I guess it could be quite
> necessary :-P

I agree. Another appropriate use might be if the presentation is about
some piece of software which USES XML, and the presenter is just showing an
example document to give an idea of what kind of data will be present. E.g.
someone explaining why all blogging software should implement RSS feeds (I'm
looking at you Xanga, <shakes fist>).

> XML is not a human-readable format.

I disagree.

- Oliver

James McGill

unread,
Apr 5, 2006, 2:41:09 PM4/5/06
to
On Wed, 2006-04-05 at 18:36 +0000, Oliver Wong wrote:
> "Timbo" <ti...@noreply.invalid> wrote in message
> news:e116e6$39h$1...@kinder.server.csc.liv.ac.uk...
> >
> > AAAGGGHHH!!! Yes!! I hate it when people put XML into the
> > research/technical papers and presentations! Unless of course the
> > paper/presentation is actually about XML, then I guess it could be quite
> > necessary :-P
>
> I agree. Another appropriate use might be if the presentation is about
> some piece of software which USES XML, and the presenter is just showing an
> example document to give an idea of what kind of data will be present.

Even then, I usually say something like "the data formats are defined in
an XML Schema document which is in CVS." (My product has database
bindings, configs, and messaging, generated by and bound to XSD).

If I'm presenting something from this, I normally show the data as the
result of a transform into something presentable. But then, my audience
is always closed, consisting only of people who are already on the same
page and are as close to the mechanics of the product as I am.

James McGill

unread,
Apr 5, 2006, 2:44:13 PM4/5/06
to
On Wed, 2006-04-05 at 11:45 +0100, Chris Uppal wrote:
>
> Remember that we are talking about a government here.

The Canadian government, which I've been led to understand is the most
progressive on Earth, etc.

Oliver Wong

unread,
Apr 5, 2006, 3:02:53 PM4/5/06
to

"James McGill" <jmc...@cs.arizona.edu> wrote in message
news:1144262469.3...@localhost.localdomain...

Yeah, this works if the people you're speaking to have access to (and
know how to use) CVS. I was envisioning a situation where you're trying to
convince a bunch of people to adopt a new technology, and so the burden is
on you to provide all the relevant information. E.g. you're in a room with a
bunch of business people, and you show them an example RSS document, and
essentially say "See? It's so easy, even you, who had no training in
computer science, can figure out how to write an RSS document if you really
wanted to."

- Oliver

Timo Stamm

unread,
Apr 5, 2006, 3:10:59 PM4/5/06
to
Steve Wampler schrieb:

> I haven't seen anything in XML
> that does more than provide a guarantee that the syntax is right.

Have a look at XSDs.


Timo

Timo Stamm

unread,
Apr 5, 2006, 3:20:07 PM4/5/06
to
Steve Wampler schrieb:

> There is nothing in XML that prevents someone from using a unique
> tag for every entry.


Of course there is. There are various ways to define schemas for XML
documents:

http://en.wikipedia.org/wiki/XML_schema#XML_schema_languages


Timo

Andrew McDonagh

unread,
Apr 5, 2006, 3:24:54 PM4/5/06
to
Joe Attardi wrote:
>> Also, isn't it likely that the file would be split up?
> Exactly. Any data set containing 30 million records would be grossly
> inefficient in one single file, whether it be XML or otherwise.
>

besides ...the subject is about large XML files vs flat file.

Not xml vs RDBMs.

If we were talking about using the XML file as a database, then they'd have a
point (small or large file). Relational databases won over Hierarchical
databases years ago for many good reasons.

grasp...@yahoo.com

unread,
Apr 5, 2006, 4:29:36 PM4/5/06
to
What is your take on JSON?

Roedy Green

unread,
Apr 5, 2006, 5:17:47 PM4/5/06
to
On Wed, 05 Apr 2006 11:44:13 -0700, James McGill
<jmc...@cs.arizona.edu> wrote, quoted or indirectly quoted someone
who said :

>The Canadian government, which I've been led to understand is the most
>progressive on Earth, etc.

A government with a smaller population to serve has a huge
advantage when it comes to being light on its feet. I worked for a
Canadian crown corporation writing an RFP for about a million dollars
worth of computer equipment. I was in Seattle for a New Year's eve
party and met a guy doing something similar there. We both bitched
about all the silly regulations and petty legalities. We decided to
swap RFPs to see who had it worse. His was ten times thicker.

The thing that blows my mind about the US bureaucracy is that crooks
have managed to embezzle trillions of dollars over the last decade and
hardly anyone even knows about it. See
http://mindprod.com/politics/iraqeconomics.html near the bottom.
Mastermind crooks pulled off the heist of the century and it did not
even make the front page.

The amount of activity and the amounts of money are so huge that nobody
stays on top of what is going on. Further the amounts of money are so
huge that corruption and coverup are guaranteed.

--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Steve Wampler

unread,
Apr 5, 2006, 5:11:06 PM4/5/06
to

No, there isn't. There *is* if *someone else* defines the schema, but
if I'm defining it, exactly what is going to stop me? [Please note, if
you've come into this discussion late, that I (personally) am *not*
advocating doing so.]

I've actually seen XML (*not* mine!) where the person had defined an
array's contents via (paraphrasing, this is a while ago, fortunately):

<array size="15">
<a1>5</a1>
<a2>13</a2>
...
<a15>37</a15>
</array>

(I suppose this allowed position-independent arrangement of the elements, but
there are certainly better ways, even in XML...)
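
To make the "better ways" concrete, here is a sketch in Python (standard xml.etree.ElementTree) that reads both shapes. The three-element documents and the `item` tag name are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: index baked into the tag name vs. one repeated tag.
UNIQUE_TAGS = '<array size="3"><a1>5</a1><a2>13</a2><a3>37</a3></array>'
UNIFORM = '<array><item>5</item><item>13</item><item>37</item></array>'

def read_unique(xml_text):
    # The index lives in the tag name, so it has to be parsed back out of it.
    root = ET.fromstring(xml_text)
    pairs = sorted((int(child.tag[1:]), int(child.text)) for child in root)
    return [value for _, value in pairs]

def read_uniform(xml_text):
    # Document order is the array order; no name munging is needed.
    return [int(item.text) for item in ET.fromstring(xml_text).findall("item")]

print(read_unique(UNIQUE_TAGS), read_uniform(UNIFORM))
```

Both parse fine, which is rather the point: well-formedness alone does not stop the first form, and a schema for it would need one declaration per index.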

Steve Wampler

unread,
Apr 5, 2006, 5:16:58 PM4/5/06
to

I have. I stand by my statement. What about XSD *isn't* about syntax?
Granted, XSDs provide very fine-grained control over syntactic issues.

Roedy Green

unread,
Apr 5, 2006, 5:24:08 PM4/5/06
to
On Wed, 05 Apr 2006 15:03:01 +0100, Timbo <ti...@noreply.invalid>
wrote, quoted or indirectly quoted someone who said :

>My guess is that you don't really understand either my post, or
>XML. It's not the FORMAT of XML, it's the fact that it contains
>MEANING. So, if the sender and receiver have a shared ontology
>that says that FirstName is someone's first name, then the data
><FirstName>John<FirstName> i

Even a CSV file with a first line of field names contains the same
amount of information, for a file like the one shown, as the obese XML.

What the raw XML provides is not particularly useful information. You
can glean that by inspecting the file. The information you want, and which
is missing, is how each of the fields is validated: what guarantees
exist on values, what the complete set of possibilities is for each
enumeration, and what they mean.

Since the early DOS days I have been exporting data to people in
several formats, SQL, CSV, and fixed length ascii fields. I generate
a separate human-readable "schema" file that describes the field,
including limits and its length and offset.
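
A sketch of that approach in Python, with a made-up field layout (names, offsets and lengths are all hypothetical). The same table that documents the file for humans drives the parser:

```python
# Hypothetical schema for a fixed-length record: (field name, offset, length).
SCHEMA = [
    ("first_name", 0, 10),
    ("last_name", 10, 10),
    ("phone", 20, 7),
    ("address", 27, 20),
]
RECORD_LENGTH = 47  # last offset plus last length

def format_record(values):
    """Pad (or truncate) every field to its declared length."""
    return "".join(str(values[name]).ljust(length)[:length]
                   for name, _, length in SCHEMA)

def parse_record(line):
    """Slice a record by offset and length, dropping the padding."""
    if len(line) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(line)}")
    return {name: line[offset:offset + length].rstrip()
            for name, offset, length in SCHEMA}

record = {"first_name": "John", "last_name": "Smith",
          "phone": "5555555", "address": "37 Finch Ave."}
line = format_record(record)
print(repr(line))
print(parse_record(line))
```

Values with meaningful trailing spaces would not survive the rstrip, which is the sort of limit the human-readable schema file has to spell out.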

Nobody has ever had trouble interpreting one of the files.

For a FLAT file there is no need to use tags. They are needed only when you
have a structured file.

Roedy Green

unread,
Apr 5, 2006, 5:31:14 PM4/5/06
to
On 5 Apr 2006 08:27:12 -0700, "Joe Attardi" <jatt...@gmail.com>
wrote, quoted or indirectly quoted someone who said :

>Hierarchical data, dude. What if someone has more than one phone
>number? With the comma-delimited flat file approach, it's not readily
>apparent how you could implement that.
>
><Person>
> <PhoneNumber>...</PhoneNumber>
> <PhoneNumber>...</PhoneNumber>

You use a comma to represent any field which is not present. You
don't just have a list of phone numbers, you assign them specific
functions. You have something like this:

cell
home
work
800
fax
messages
emergency

The other way you do it is to have a separate phone numbers file (this
is SQL-think). Then you can have an arbitrary number of phone numbers.

The phone number file has the form

account#, phone

If you are exporting data only to import SQL again, this is a much
more convenient format than XML hierarchy. SQL does not handle
variable numbers of things well directly, so you end up having to
write a complicated mess of XML export and import handling code, as
well as the process taking 100 times longer than it need do.
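
A minimal sketch of the two-file layout, in Python with the standard csv module. The file contents, the extra person and the column names are invented for illustration:

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat files: one row per person, one row per phone number.
PEOPLE_CSV = "account,first,last\n1,John,Smith\n2,Jane,Doe\n"
PHONES_CSV = "account,phone\n1,5555555\n1,5550000\n2,5551234\n"

def load_people_with_phones(people_text, phones_text):
    """Join the two files on the account number, like a SQL foreign key."""
    phones = defaultdict(list)
    for row in csv.DictReader(io.StringIO(phones_text)):
        phones[row["account"]].append(row["phone"])
    people = []
    for row in csv.DictReader(io.StringIO(people_text)):
        people.append({**row, "phones": phones.get(row["account"], [])})
    return people

people = load_people_with_phones(PEOPLE_CSV, PHONES_CSV)
for person in people:
    print(person["first"], person["last"], person["phones"])
```

Each file maps onto one table, so a bulk loader can take them as they are.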

Roedy Green

unread,
Apr 5, 2006, 5:34:44 PM4/5/06
to
On Wed, 05 Apr 2006 16:55:29 +0100, Timbo <ti...@noreply.invalid>
wrote, quoted or indirectly quoted someone who said :

><Album>
> <Artist> Stevie Wonder </Artist>
> <Title> Innervisions </Title>
> <Producer> .. </Producer>
> <Track number=1 name=".."/>
> <Track number=2 name=".."/>
> ... etc..
> <Price> £5</Price>
></Album>

Hand coded XML is almost guaranteed to contain errors. Unless you do
something to insist XML is validated before use, all you have done is
invented yet another avenue for data corruption. You can't even tell
if it has been validated against some schema.

It is the same bloody mess that HTML has foisted on us.

Roedy Green

unread,
Apr 5, 2006, 5:36:29 PM4/5/06
to
On Wed, 05 Apr 2006 18:19:39 GMT, "Oliver Wong" <ow...@castortech.com>
wrote, quoted or indirectly quoted someone who said :

> With XML, it's possible to express unambiguously any possible string of
>characters (using, e.g., entity-references).

You have made a much better case for binary strings that don't need
fancy XML escaping than you have for XML.

Roedy Green

unread,
Apr 5, 2006, 5:38:35 PM4/5/06
to
On Wed, 5 Apr 2006 07:50:42 -0600, "Monique Y. Mudama"
<sp...@bounceswoosh.org> wrote, quoted or indirectly quoted someone who
said :

>Okay, I guess it is widely supported. I just haven't happened to have
>come across anything in my development work that ever made use of it
>(that I know of). I shouldn't have generalized that to the rest of
>the world.

Actually you probably have, but did not recognize it. You have a
digital cert, perhaps self signed do you not?

ASN.1 is used to define all manner of things, from the format of
digital certificates to credit card transactions and cell phone messages.

Roedy Green

unread,
Apr 5, 2006, 5:44:11 PM4/5/06
to
On Wed, 05 Apr 2006 14:34:33 +0100, bugbear
<bugbear@trim_papermule.co.uk_trim> wrote, quoted or indirectly quoted
someone who said :

>
>In the first example is 5555555 a phone number, or
>part of the address?

The traditional way to handle that is with a first line
consisting of field names, and also a separate document describing
each field in proper detail with what it means.

Have you ever written a computer program to submit something to a bank
or the government of any country? The specifications for a single
file come as a book. There are paragraphs on every field.

The XML description is just a fraction of the information. And, for a
flat file, there is no need to spell the tags out over and over and
over. Any programmer understands the first time. The repetition just
introduces the complication that the tags might NOT be perfectly
repetitive.

XML is for tree structured data. It is hopeless at anything else.

Andrew McDonagh

unread,
Apr 5, 2006, 5:48:29 PM4/5/06
to
Roedy Green wrote:
> On 5 Apr 2006 08:27:12 -0700, "Joe Attardi" <jatt...@gmail.com>
> wrote, quoted or indirectly quoted someone who said :
>
>> Hierarchical data, dude. What if someone has more than one phone
>> number? With the comma-delimited flat file approach, it's not readily
>> apparent how you could implement that.
>>
>> <Person>
>> <PhoneNumber>...</PhoneNumber>
>> <PhoneNumber>...</PhoneNumber>
<Pet>
<Type>Dog</Type>
<CuteName>Spot</CuteName>

>
> You use a comma to represent any field which is not present. You
> don't just have a list of phone numbers, you assign them specific
> functions.. You have something like this:

One of XML's greatest advantages over CSV, flat files, etc., is that it
supports schema evolution without requiring code changes.

Because applications look only for the XML nodes they know about,
they ignore all other nodes. So in the Person node example,
should we need to add a child node <Pets>, we can without harming the
existing app.
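
A sketch of that behaviour in Python (standard xml.etree.ElementTree); the documents and the function name are made up. It assumes the reader looks elements up by name and that no validating schema in the pipeline rejects unknown elements:

```python
import xml.etree.ElementTree as ET

def phone_numbers(person_xml):
    """An 'old' reader that only knows about <PhoneNumber> children."""
    person = ET.fromstring(person_xml)
    return [p.text for p in person.findall("PhoneNumber")]

# Hypothetical documents before and after the format gained a <Pet> element.
OLD_DOC = ("<Person><PhoneNumber>5555555</PhoneNumber>"
           "<PhoneNumber>5550000</PhoneNumber></Person>")
NEW_DOC = ("<Person><PhoneNumber>5555555</PhoneNumber>"
           "<PhoneNumber>5550000</PhoneNumber>"
           "<Pet><Type>Dog</Type><CuteName>Spot</CuteName></Pet></Person>")

print(phone_numbers(OLD_DOC))
print(phone_numbers(NEW_DOC))
```

A reader that walks children by position instead of by name would not get this for free, in XML or anywhere else.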

Kent Paul Dolan

unread,
Apr 5, 2006, 5:55:21 PM4/5/06
to
"Homer" <hom...@hotmail.com> wrote:

> I am a little bit tired of this obsession people
> have with XML and XML technology.

Bad call.

> Please share your thoughts and let me know if I am
> thinking in a wrong way.

Yes, you are.

> I believe some people are over using XML all over
> the place.

Nope, it's pretty much become the data encoding
method of choice purely on its merits.

> Nowadays Canadian Government is pushing XML to its
> organization as standard for data/file transfer.

Excellent! They are taking the appropriate steps to
avoid the universal experience of first world
governments in the sixth decade of computer handling
of government data, that files which are not
self-describing "go stale" and become
uninterpretable over long periods of time as
technologies supersede one another.

Anecdote: I once worked for/alongside the US
National Ocean Survey. The original survey
documents, from 1803, in paper RECORD logbooks
visually identical to ones you can purchase in a
stationery shop today, were still in use as active
data. At the same time, I was tasked with finding
some digital technology that would endure even half
a century. The sad conclusion was that at the time
(1975), no such technology existed. The point isn't
that DVDs have solved that problem (they haven't),
but that government records are still of interest
decades-to-centuries after they are first encoded.
Only self-describing documents have a prayer of
meeting that requirement.

> Huge files moving between companies now include
> tones of XML Tags repeating all over the file and
> slowing down networks and crashing applications
> because of size.

1) XML tags are highly redundant, so XML files,
compressed, are little larger than alternative
encoding techniques.

2) XML isn't guaranteed to be "legal" until the
whole document has been parsed, but that doesn't
prevent the document from being parsed as it is
received, and stored internally in some much more
compact format than the transmittal format. So,
if a program crashes trying to cope with an XML
document, that same document will overwhelm the
program in _any_ encoding.

3) Thus, your complaint is properly about large
document transmittal, not the XML encoding of
those documents.
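
Point 2 can be sketched with an incremental parser. This uses Python's standard xml.etree.ElementTree.iterparse on a made-up document; the element names are assumptions:

```python
import io
import xml.etree.ElementTree as ET

def count_people(stream):
    """Handle <Person> records one at a time as the document is read."""
    count = 0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "Person":
            count += 1
            # Drop the record's children and text once it has been handled;
            # the emptied element itself stays attached to the root.
            elem.clear()
    return count

# Hypothetical large document, built in memory for the demonstration.
doc = (b"<People>"
       + b"<Person><FirstName>John</FirstName></Person>" * 1000
       + b"</People>")
print(count_people(io.BytesIO(doc)))
```

If the input is cut off before the closing tag, the records seen so far have already been processed by the time the parser raises its error, so the caller needs a way to roll back if partial input is unacceptable.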

> I am not objecting to the whole technology. I know
> advantages of XML and using it all the times for
> Config files or our web oriented applications but
> using it as standard for moving big files is going
> too far.

Would it be a good guess that French rather than
English is your native language? Yeesh.

Anyway, despite that I myself put off learning XML
far too long, and still can't claim competence with
it, XML isn't just a fad, it is the wave of the
future.

HTH

xanthian.


--
Posted via Mailgate.ORG Server - http://www.Mailgate.ORG

Roedy Green

unread,
Apr 5, 2006, 5:55:30 PM4/5/06
to
On Wed, 5 Apr 2006 15:02:55 +0000 (UTC), b...@pvv.ntnu.no (Bent C
Dalager) wrote, quoted or indirectly quoted someone who said :

>XML isn't particularly useful for the original sender and receiver.
>They would probably be better off using a binary format. It is useful
>for the third party who wants his product to interact or compete with
>the software used by sender and receiver and therefore needs to
>reverse engineer the protocol being used between them. In this
>context, a high level of protocol redundancy is extremely useful since
>it makes it reasonably easy for

So what if instead you wrote your schema, then using automated tools
created an ASN.1 binary file much more compact that you can parse 100
times faster and can turn back into fluffy XML any time you want using
the ASN.1 schema. It really amounts to a more clever than usual
compression scheme for XML in that you can read it directly rather
than having to decompress it first.

Then look on fluffy XML as a debugging dump format. For computer to
computer you exchange ASN.1 and create and parse ASN.1. The fluffy
form never exists except conceptually.

Your problem now is making sure XSD and ASN.1 schemas for files are
easily available. You stop exchanging schema-less unvalidated files.
You stop exchanging fluffy XML. You only exchange compact ASN.1 files
and store your large XML files as ASN.1. You might still leave
configuration files as XML, though a smart app would parse them any
time they change to make sure they pass muster and then thereafter use
the compact ASN.1 files. The advantage is the app does not need to
load a whacking great XML parser and schema every time it loads. All
it needs is a tiny binary "parser" which is not even parsing in the
classic sense.
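
To illustrate the "binary parser that is not really parsing" idea only, here is a sketch using Python's standard struct module. This is NOT ASN.1 and not any real encoding rule set; the fixed layout is a hypothetical stand-in for whatever a schema compiler would generate:

```python
import struct

# Hypothetical fixed layout standing in for a schema-generated binary encoding.
# Little-endian: 4-byte account number, 16-byte first name, 16-byte last name.
LAYOUT = struct.Struct("<I16s16s")

def pack_person(account, first, last):
    # Names shorter than 16 bytes are padded with NUL bytes by struct;
    # longer ones would be silently truncated.
    return LAYOUT.pack(account, first.encode("utf-8"), last.encode("utf-8"))

def unpack_person(blob):
    account, first, last = LAYOUT.unpack(blob)
    return (account,
            first.rstrip(b"\0").decode("utf-8"),
            last.rstrip(b"\0").decode("utf-8"))

blob = pack_person(42, "John", "Smith")
print(len(blob), unpack_person(blob))
```

Reading a field is an offset calculation rather than a scan for delimiters, which is where the speed claim comes from; the price is that the bytes mean nothing without the layout definition.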

Roedy Green

unread,
Apr 5, 2006, 5:58:27 PM4/5/06
to
On Wed, 5 Apr 2006 15:02:55 +0000 (UTC), b...@pvv.ntnu.no (Bent C
Dalager) wrote, quoted or indirectly quoted someone who said :

>SMTP isn't a very good protocol by any stretch of the
>imagination, but it is _simple_ and you can very easily hook into it

And because it was so simple look what a fucking mess email is in.
People who write email clients are not simpletons. They need a
protocol that works, not one you can understand in five minutes.

SMTP was a hack to do an email demo. It was not rethought once the
problems of scale and spam became apparent.

James McGill

unread,
Apr 5, 2006, 5:48:20 PM4/5/06
to
On Wed, 2006-04-05 at 21:17 +0000, Roedy Green wrote:
>
> The thing that blows my mind about the US bureaucracy is that crooks
> have managed to embezzle trillions of dollars over the last decade and
> hardly anyone even knows about it.

Controversial opinion, informed by partisan bias, and not one that I
necessarily disagree with. Take it to alt.politics (where I read your
posts and often correspond).

So, what's the ASN.1 equivalent of JAXB?

Roedy Green

unread,
Apr 5, 2006, 6:05:29 PM4/5/06
to
On 5 Apr 2006 08:20:07 -0700, "Joe Attardi" <jatt...@gmail.com>
wrote, quoted or indirectly quoted someone who said :

><LastName>Smith</LastName>
><PhoneNum>5555555</PhoneNum>
><Address>37 Finch Ave.</Address>
>
>what about,
>
><PersonList>
><Person firstName="John" lastName="Smith" phoneNum="5555555"
>address="37 Finch Ave." />
></PersonList>

In that particular case, you might still want phoneNum as a tag so you
could have multiples. But even so, you still bulk up your 30 million
record file with the same information specified over and over and
over. Computers and even humans hear you the first time.

For computer to computer communication you need to put the format
information up front in a computer-understandable way. Then the data
can be densely packed with minimal tags. To view the data you need
something that understands the header and can either display it in
conventional XML format or like a tree, or like a spreadsheet or in
some custom template that is maximally convenient for viewing the
particular data of interest. The whole point of all that tagging
originally was so you could extract just what was currently of
interest. You should not be looking at raw XML normally.
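
A sketch of "format up front, dense data, expand for viewing" in Python's standard library. It assumes the header names are valid XML element names; the root and record tag names are made up:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Field names appear once, in the header; every data row is just values.
FLAT = ("FirstName,LastName,PhoneNum,Address\n"
        "John,Smith,5555555,37 Finch Ave.\n")

def flat_to_xml(flat_text, record_tag="Person", root_tag="PersonList"):
    """Expand header-plus-rows data into tagged XML for viewing or debugging."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(flat_text)):
        record = ET.SubElement(root, record_tag)
        for name, value in row.items():
            ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="unicode")

xml_view = flat_to_xml(FLAT)
print(xml_view)
```

The tagged form is generated on demand from the compact one, so nothing is lost by not shipping it.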

Monique Y. Mudama

unread,
Apr 5, 2006, 6:14:10 PM4/5/06
to
On 2006-04-05, Roedy Green penned:

> On Wed, 5 Apr 2006 07:50:42 -0600, "Monique Y. Mudama"
><sp...@bounceswoosh.org> wrote, quoted or indirectly quoted someone
>who said :
>
>>Okay, I guess it is widely supported. I just haven't happened to
>>have come across anything in my development work that ever made use
>>of it (that I know of). I shouldn't have generalized that to the
>>rest of the world.
>
> Actually you probably have, but did not recognize it. You have a
> digital cert, perhaps self signed do you not?
>
> ASN.1 is used to define all manner of thing from the format of
> digital certificates, credit card transactions, cell phone messages

Hence my "(that I know of)" fudge =)

--
monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Oliver Wong

unread,
Apr 5, 2006, 6:26:32 PM4/5/06
to

"Roedy Green" <my_email_is_post...@munged.invalid> wrote in
message news:62e832hmbhmnf6c95...@4ax.com...

> On Wed, 05 Apr 2006 18:19:39 GMT, "Oliver Wong" <ow...@castortech.com>
> wrote, quoted or indirectly quoted someone who said :
>
>> With XML, it's possible to express unambiguously any possible string
>> of
>>characters (using, e.g., entity-references).
>
> You have made a much better case for binary strings that don't need
> fancy XML escaping than you have for XML.

The problem with a "straight-to-binary" approach is that you'd have to
use custom tools to process the data. With XML, you can use a generic XML
editor, or worst case, a simple text-editor.

I don't "mind" ASN.1 so much if only the editors were more readily
available. From my perspective, it's almost the same as using gzip to unzip
a file yielding an XML document, and then using an XML Editor on the
resulting XML document.

- Oliver

Oliver Wong

unread,
Apr 5, 2006, 6:34:19 PM4/5/06
to

"Roedy Green" <my_email_is_post...@munged.invalid> wrote in
message news:aqd832hg8j6m3iqho...@4ax.com...

> On Wed, 05 Apr 2006 16:55:29 +0100, Timbo <ti...@noreply.invalid>
> wrote, quoted or indirectly quoted someone who said :
>
>><Album>
>> <Artist> Stevie Wonder </Artist>
>> <Title> Innervisions </Title>
>> <Producer> .. </Producer>
>> <Track number=1 name=".."/>
>> <Track number=2 name=".."/>
>> ... etc..
>> <Price> £5</Price>
>></Album>
>
> Hand coded XML is almost guaranteed to contain errors.

Guaranteed is a bit strong here. I've written XML documents by hand
before and got them right on the first try.

> Unless you do
> something to insist XML is validated before use, all you have done is
> invented yet another avenue for data corruption.

The only other places corruption could occur are the names of the elements,
the names of the attributes, or some of the punctuation (e.g. '<', '>', '/').
Should such corruption occur, it's trivial for a human to fix it, and some
software tools are pretty good at guessing at the fixes as well.

Contrast this with the majority of so-called "binary" formats.

> You can't even tells
> if it has been validated against some schema.

I think for most file formats, you cannot tell, just by looking at the
file, if it was "checked" for correctness before it arrived on your
hard disk. You could check it for correctness, just like you can check an XML
document for correctness, but you can't check that whoever wrote it first
validated it before sending it to you.
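
As a rough sketch of what "check it yourself on receipt" could look like,
here is a small Python example. The helper name is_well_formed is my own, and
it only tests well-formedness, not validity against a schema (the standard
library parser used here does not do schema validation):

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Attribute values must be quoted in XML, so the first fragment is rejected.
unquoted = '<Album><Track number=1 name=".."/></Album>'
quoted = '<Album><Track number="1" name=".."/></Album>'

print(is_well_formed(unquoted))  # False
print(is_well_formed(quoted))    # True
```

The receiver learns whether the document is usable, but still learns nothing
about whether the sender ever ran the same check.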

> It is the same bloody mess that HTML has foisted on us.

I think HTML is pretty good for the problems it tries to solve
(human-writable-and-readable representation of documents in a
platform-independent fashion, with some hyperlinking functionality), and every
version is better than the last. The only serious competition I can think of
is LaTeX, and I found it far more difficult to use than HTML, though it is
more powerful.

- Oliver

Oliver Wong

Apr 5, 2006, 6:36:39 PM
"Steve Wampler" <swam...@noao.edu> wrote in message
news:4434326A...@noao.edu...

> I've actually seen XML (*not* mine!) where the person had defined an
> array's contents via (paraphrasing, this is a while ago, fortunately):
>
> <array size=15>
> <a1>5</a1>
> <a2>13</a2>
> ...
> <a15>37</a15>
> </array>
>
> (I suppose this allowed position-independent arrangement of the elements,
> but there are certainly better ways, even in XML...)

Of course, that should read:

<array size="15">
<element index="1">5</element>
<element index="2">13</element>
...
<element index="15">37</element>
</array>

Having elements with different names (e.g. "a1", "a2", etc.)
representing the same "kind" of thing is a no-no.

As others have said, XML's strengths are more apparent when the data to
store is hierarchical, rather than flat (as an array is).
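
For what it's worth, the uniform <element index="..."> form is also easy to
read back with a generic parser. A minimal Python sketch, assuming a shortened
three-element array, 1-based indices, and a default of 0 for any missing
element (the function name read_array is mine):

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the array, deliberately out of order.
doc = """
<array size="3">
  <element index="2">13</element>
  <element index="1">5</element>
  <element index="3">37</element>
</array>
"""


def read_array(text: str) -> list[int]:
    """Rebuild the array from index attributes, ignoring document order."""
    root = ET.fromstring(text)
    size = int(root.get("size"))
    values = [0] * size  # assumption: missing elements default to 0
    for el in root.findall("element"):
        # Indices in the XML are 1-based; Python lists are 0-based.
        values[int(el.get("index")) - 1] = int(el.text)
    return values


print(read_array(doc))  # [5, 13, 37]
```

Because every item shares one element name, a single findall call covers the
whole array, which is exactly what the a1, a2, ... naming makes awkward.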

- Oliver

Oliver Wong

Apr 5, 2006, 6:39:53 PM

"Roedy Green" <my_email_is_post...@munged.invalid> wrote in
message news:l5e8321li61htjoe2...@4ax.com...

> On Wed, 5 Apr 2006 07:50:42 -0600, "Monique Y. Mudama"
> <sp...@bounceswoosh.org> wrote, quoted or indirectly quoted someone who
> said :
>
>>Okay, I guess it is widely supported. I just haven't happened to have
>>come across anything in my development work that ever made use of it
>>(that I know of). I shouldn't have generalized that to the rest of
>>the world.
>
> Actually you probably have, but did not recognize it. You have a
> digital cert, perhaps self signed do you not?
>
> ASN.1 is used to define all manner of thing from the format of
> digital certificates, credit card transactions, cell phone messages

I think there are two different intended meanings of "use" here:

A: I've never used C++. All my development work is in Java.
B: What about that OS you're running? That's written in C++!

I've never used ASN.1 in the sense that person A is thinking, though if
ASN.1 is used for credit cards, probably a heck of a lot of people have used
ASN.1 in the sense that person B is thinking (including myself).

- Oliver
