Section 1.1 of the Extensible Markup Language (XML) specification gives as a design goal that "Terseness in XML markup is of minimal importance." The Standard Generalized Markup Language (SGML), of which XML is a profile, has a number of features intended to reduce typing when humans are entering markup directly, or to reduce file sizes, but these features were not included in XML.
The resulting XML specification gave us a highly regular language, but one that can consume a considerable amount of bandwidth when transmitted in any quantity. Furthermore, although parsing has been greatly simplified in terms of code complexity and run-time requirements, larger data streams necessarily entail greater I/O activity, and this can be significant in some applications.
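The bandwidth cost of textual XML is easy to see in miniature: in a small, data-oriented message, most of the bytes are tag names, brackets and attributes rather than payload. A minimal sketch (the document and figures here are invented for illustration, not drawn from the workshop):

```python
# Measure how much of a small XML message is markup rather than character
# data. The document below is a made-up example, not a workshop artifact.
import xml.etree.ElementTree as ET

doc = (
    '<order><item sku="A-1"><qty>2</qty><price>9.99</price></item>'
    '<item sku="B-7"><qty>1</qty><price>4.50</price></item></order>'
)

root = ET.fromstring(doc)
payload = sum(len(text) for text in root.itertext())  # element text only
total = len(doc)
print(f"total: {total} bytes, payload: {payload} bytes, "
      f"markup overhead: {100 * (total - payload) / total:.0f}%")
```

For this toy order message the markup accounts for over 90% of the bytes; real documents vary widely, but this kind of overhead is what motivates interest in denser representations.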
There has been a steadily increasing demand to find ways to transmit pre-parsed XML documents and Schema-defined objects, in such a way that embedded, low-memory and/or low-bandwidth devices can make use of an interoperable, accessible, internationalised, standard representation for structured information, yet without the overhead of parsing an XML text stream.
Multiple separate experimenters have reported significant savings in bandwidth, memory usage and CPU consumption using (for example) an ASN.1-based representation of XML documents. Others have claimed that gzip is adequate.
In September 2003, the W3C ran a Workshop, hosted by Sun Microsystems in Santa Clara, California, USA, to study methods of compressing XML documents, comparing Infoset-level representations with other methods, in order to determine whether a W3C Working Group might be chartered to produce an interoperable specification for such a transmission format.
The Workshop concluded that the W3C should do further work in this area, but that the work should be of an investigative nature, gathering requirements and use cases, and preparing a cost/benefit analysis; only after such work could there be any consideration of whether it would be productive for W3C to attempt to define a format or method for non-textual interchange of XML.
David Orchard, BEA: Not inventing something new - lots of solutions out there. Carefully analyze the problem to be solved, and pick a good one. If no existing solution works, probably no new one would either.
Noah Mendelsohn, IBM: The Infoset is not simply what results from parsing an XML document; that is only sometimes true. There are also synthetic infosets, e.g. created via the DOM, which may be serialised later but need not be. Essential to be clear on definitions.
Louis Reich, NASA: Are the last three lines of your last slide out of scope for this workshop? They seem highly relevant to me. Problems currently outside XML scope might be brought into the fold by using a binxml solution (e.g. large binary blobs). Which part of 80:20 are we looking at?
Louis Reich, NASA: The Infoset can hold binary data. Our community would prefer a sub-optimal but standard way rather than our own, discipline-specific standard. A single W3C standard would be very exciting for us. Disagree with your assertion - people would indeed use this if it were a standard.
Michael Rys, Microsoft: would need to extend the infoset to do this. Comparison with image compression, lots of special ways, most with encumbering IPR, most specific to particular type of content - would people abandon these to get a single binary representation? Or, better to use a packaging format and keep the image in its special, efficient format. It's a payload packaging problem.
Larry Masinter, Adobe: Agree with much of the analysis, puzzled by the conclusion about whether W3C should work on this. Clearly there are other areas where the solution was not clear (e.g. the Semantic Web), so what is the threshold of research required? It is not at all clear that standardisation work would increase fragmentation.
Michael Rys, Microsoft: semantic web is research, it's not standards work. It's still too early, research is needed and should not be done at W3C as it stifles innovation once a standard is set. MS has internal binary representations, but we use textual XML for interop. It's all that works in all cases.
John Schneider, AgileDelta: LZ or Huffman (frequency-based) compression only works on large messages with high character redundancy. It does not work for high-frequency streams of small messages, typical of mobile environments and Web services. Zip will often make these bigger instead of smaller.
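This claim is easy to reproduce: gzip wraps its DEFLATE output in a fixed header and trailer (18 bytes between them), which a tiny message cannot amortise. A quick sketch, with made-up messages:

```python
# Compare gzip's effect on a tiny message versus a large, redundant one.
# Both messages are invented for illustration.
import gzip

small = b'<m><id>42</id><ok/></m>'
large = b'<log>' + b'<entry level="info">heartbeat ok</entry>' * 200 + b'</log>'

for name, msg in (("small", small), ("large", large)):
    compressed = gzip.compress(msg)
    print(f"{name}: {len(msg)} -> {len(compressed)} bytes")
```

The small message grows (the container overhead alone exceeds its size), while the highly redundant large message shrinks dramatically - which is why zip-style compression helps bulk transfers but not high-frequency streams of small messages.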
Also, I've heard many people expecting to see big improvements in user-visible performance. [supplied later by John Schneider: To be honest, it's not that clear that mobile users will see a noticeable speed increase given the high latency of mobile networks. The more significant benefits are economic. Carriers spend a great deal of money buying frequencies and putting up cell towers to increase capacity of their pipes. If the size of the data shrinks by 10 times, carriers can now fit 10 times more customers on the same pipe, meaning they can generate 10 times more revenue without huge infrastructure investments. For always-on packet-switched networks where users pay by the kilobyte, these savings are passed along to the customer.]
John Schneider, AgileDelta: Mobile devices don't necessarily use gateways any more. They can now hit any URL and access enterprise infrastructure directly. If I hit a SOAP Web service using my mobile device, the payload comes back as raw XML, with no gateway in between. So mobile devices definitely need efficient access to XML everywhere, not just through gateways.
[John expanded this in email later, for clarification: Like others, I also prefer zero standards to two. Unfortunately, however, zero is not an option. There are already two mainstream standards organizations working on binary encodings for XML, MPEG-7 and ASN.1. Are they compatible? No. Is either one of them general purpose enough to handle all mainstream XML applications? No. For example, neither one can handle mixed content, which is required for XHTML -- a pretty popular use case. We need a general purpose standard that can deal with the broad uses of XML.]
David Orchard, BEA: Our position is not "no"; it's "be sure what you are designing" and "pick one". My question to Microsoft (who said there is no 80:20 point now): Do you think there will be an 80:20 point in the future?
Robin Berjon, Expway: with each message? and depends on the richness of the schema. With each message, only send the schema parts that are actually used that time. Or send a schema with a set of messages. We can also send incremental schema updates.
John Schneider, AgileDelta: Well done! You dealt with many of the problems that we encountered. About representing any general infoset - MPEG-7 currently does not support mixed content models, namespace prefixes, and some other Infoset items.
[by email, John expanded this: Actually, prefixes are part of the Infoset, and while many people don't care about preserving prefixes, there are some communities that require it. Our solution needs to apply to all uses of XML, and hit the mass market rather than high-cost niche solutions.]
Robin Berjon, Expway: need to be sure the low level format is fragmentable, to let higher levels do it. Fragments need context like inscope ns declarations. BiM is for broadcast so it was designed to do that.
sw: There needs to be little overhead between getting the data off the wire and starting to use it. Currently [with text/xml interchange] a whole lot happens: object creation, lots of moves of small amounts of data, pointer creation.
Noah Mendelsohn, IBM: The whole idea of identity between in-memory and on-the-wire representations is at cross purposes with why many of us came to XML - so that parties don't have to agree on their APIs and internal models, many of which are pre-existing and already deployed. We tried to do this with DCOM and CORBA; now we are using XML because it decouples us from needing to care about all that byte-level pointer stuff. Your API still has elements and attributes?
Steve Williams: yes, it's like nested objects, most 3GL object oriented languages do this. Can choose to have overhead by mapping to native objects etc at some efficiency cost. Worst case is no worse than best case now, best case is a lot better.
Mike Conner, IBM: no, because parsing technology is making big leaps forward and production parsers are much faster than stock, free parsers. Code quality and maturity is a major determining factor. [network costs were not significant - 6% - extensive tests. Measuring instructions per character processed]
Noah Mendelsohn, IBM: a 30% increase is an insignificant reason for standardising, but could be significant for terabyte-range data. Not convinced that random access addressing is tightly bound to binfoset.
Mark Nottingham: Could want to do XML signing, encryption etc and for that it needs to know the element and attribute names, etc. Variability in performance with fallback is less desirable than uniform performance.
NM: (Looking at the slide with two multicolored bars - Performance Results, time spent in layers) So, there are four messages in that, roughly a gigahertz processor - we are seeing better numbers than that for textual XML in our applications. You are saying it's taking a million instructions to do the one message? 2000 instructions per character? Your left bar is too big by a factor of 10 or so.
epl: Have to do a binding to do something, but it's optimised to transfer the objects. Removes the databinding level.
sw: We cannot compare these without a standard corpus of test files that are run on different implementations.
Larry Masinter (Adobe): Goal should be to improve the average over likely implementations, not to finely tune one implementation to the max - it would be just one sample point, and not address the wide variety of use cases. As far as creating a working group in W3C, do we need consensus on requirements to form a working group? Or can we find a group who is interested enough in a subset of requirements, and that everyone else agrees to leave them alone to do that work?