
Lint for OOXML?


Ole Streicher

Sep 4, 2010, 3:26:23 AM
Hi,

is there a "lint" style tool available to check self-created .docx
files? Preferably a portable (OS-independent) command-line tool, but
a Windows program or even a Word (2007) plugin would also be OK.

Word itself seems to open .docx documents quite leniently (unfortunately,
in my case), even if they contain some (smaller) errors. This creates the
danger of producing files that may not be readable by other OOXML-aware
programs.

So, does a kind of "lint" exist for docx? The best would be if it would
also warn about possible problems with the different OOXML
implementations.

Or, is there an option in Word that can be used for that?

Ole

Peter Flynn

Sep 4, 2010, 11:27:54 AM

Any good XML parser will check for syntactic conformance to XML.
Any schema-based validating parser should be able to validate against
the OOXML schemas.
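
A minimal sketch of the first of these checks, assuming Python and only its
standard library: a .docx file is a zip archive, so each XML part can be
pulled out and parsed for well-formedness. The function name
check_wellformed is hypothetical. Validating against the OOXML schemas
themselves would additionally need a validating parser and the schema files
from the standard, which this sketch does not attempt.

```python
import zipfile
import xml.etree.ElementTree as ET


def check_wellformed(docx_path):
    """Return (part_name, error_message) pairs for XML parts that fail to parse.

    This checks XML well-formedness only; it does not validate against any schema.
    """
    problems = []
    with zipfile.ZipFile(docx_path) as pkg:
        for name in pkg.namelist():
            # Only the XML parts and relationship parts are parsed;
            # images and other binary parts are skipped.
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(pkg.read(name))
            except ET.ParseError as exc:
                problems.append((name, str(exc)))
    return problems
```

Calling check_wellformed("report.docx") on a file of your own (the file name
is a placeholder) returns an empty list when every XML part parses.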

///Peter

Ole Streicher

Sep 4, 2010, 12:09:43 PM
Peter Flynn <peter...@m.silmaril.ie> writes:
>> So, does a kind of "lint" exist for docx? The best would be if it would
>> also warn about possible problems with the different OOXML
>> implementations.

> Any good XML parser will check for syntactic conformance to XML.
> Any schema-based validating parser should be able to validate against the
> OOXML schemas.

This is not enough. For example, an image file should be in word/media,
it should have an Id set in word/_rels/document.xml.rels, be of a
limited set of types (Encapsulated Postscript does not work here, for
example), the suffix and the type should be linked together in
[Content_Types].xml and so on. This all is AFAIK not defined in the
OOXML schema, but this is the part where the problems start to rise.
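
One of these package-level checks can be sketched as follows, assuming
Python's standard library. It reports every part in the archive whose
extension has no Default entry and whose name has no Override entry in
[Content_Types].xml. The function name parts_without_content_type is
hypothetical, and the sketch does not judge whether a declared type is one
that Word or another implementation actually accepts.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the elements in [Content_Types].xml
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def parts_without_content_type(docx_path):
    """Return names of parts that [Content_Types].xml does not cover."""
    with zipfile.ZipFile(docx_path) as pkg:
        names = [n for n in pkg.namelist() if not n.endswith("/")]
        root = ET.fromstring(pkg.read("[Content_Types].xml"))

    # Extensions are compared case-insensitively; part names are absolute paths.
    defaults = {d.get("Extension", "").lower() for d in root.iter(CT_NS + "Default")}
    overrides = {o.get("PartName") for o in root.iter(CT_NS + "Override")}

    missing = []
    for name in names:
        if name == "[Content_Types].xml":
            continue
        basename = name.rsplit("/", 1)[-1]
        ext = basename.rsplit(".", 1)[-1].lower() if "." in basename else ""
        if "/" + name not in overrides and ext not in defaults:
            missing.append(name)
    return missing
```

An image stored as word/media/image1.eps with no matching entry would show
up in the returned list.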

So, not all files that conform to the OOXML schemas are valid .docx
files. And even files that are valid in that sense can cause problems
because several implementations exist -- for example, OpenOffice does
not read all .docx files correctly, so there is a portability problem.

Cheers

Ole

Peter Flynn

Sep 5, 2010, 11:06:29 AM
On 04/09/10 17:09, Ole Streicher wrote:
> Peter Flynn<peter...@m.silmaril.ie> writes:
>>> So, does a kind of "lint" exist for docx? The best would be if it would
>>> also warn about possible problems with the different OOXML
>>> implementations.
>
>> Any good XML parser will check for syntactic conformance to XML.
>> Any schema-based validating parser should be able to validate against the
>> OOXML schemas.
>
> This is not enough. For example, an image file should be in word/media,
> it should have an Id set in word/_rels/document.xml.rels, be of a
> limited set of types (Encapsulated Postscript does not work here, for
> example), the suffix and the type should be linked together in
> [Content_Types].xml and so on. This all is AFAIK not defined in the
> OOXML schema, but this is the part where the problems start to rise.

That is what validation against the OOXML schemas *should* do for you.
If Microsoft have placed additional undocumented constraints into the
business process *outside* the schema that properly belong *inside*, I
suggest you stop using Word and switch to ODF.

> So, not all files that conform to the OOXML schemas are valid .docx
> files. And even files that are valid in that sense can cause problems
> because several implementations exist -- for example, OpenOffice does
> not read all .docx files correctly, so there is a portability problem.

OOXML has been known to have these problems since before it was
hornswoggled through the standards process. It is not reliable enough to
run a business on, so it is really only useful as an example of how
*not* to design a schema for text documents.

///Peter

Ole Streicher

Sep 5, 2010, 11:40:15 AM
Peter Flynn <peter...@m.silmaril.ie> writes:
> On 04/09/10 17:09, Ole Streicher wrote:
>> Peter Flynn<peter...@m.silmaril.ie> writes:
>>>> So, does a kind of "lint" exist for docx? The best would be if it would
>>>> also warn about possible problems with the different OOXML
>>>> implementations.
>>
>>> Any good XML parser will check for syntactic conformance to XML.
>>> Any schema-based validating parser should be able to validate against the
>>> OOXML schemas.
>>
>> This is not enough. For example, an image file should be in word/media,
>> it should have an Id set in word/_rels/document.xml.rels, be of a
>> limited set of types (Encapsulated Postscript does not work here, for
>> example), the suffix and the type should be linked together in
>> [Content_Types].xml and so on. This all is AFAIK not defined in the
>> OOXML schema, but this is the part where the problems start to rise.
>
> That is what validation against the OOXML schemas *should* do for
> you.

XML Schema cannot do this since it does not provide instruments to link
the internal structure to external entities, like file names.

> If Microsoft have placed additional undocumented constraints into the
> business process *outside* the schema that properly belong *inside*, I
> suggest you stop using Word and switch to ODF.

It is easy to be rhetorical but it is also shitty thumb.

I am not using Word. I am writing OOXML. That is a difference. And I
do not have the option to switch to another format for this work.

BTW, just out of curiosity: Are you sure that OpenOffice does not put any
additional undocumented constraints outside its schemas? And, if it does,
would you recommend switching back to OOXML? How does ODF, for example,
link an external file name requirement to an internal entity?

Ole

Peter Flynn

Sep 5, 2010, 12:44:06 PM
On 05/09/10 16:40, Ole Streicher wrote:
> Peter Flynn<peter...@m.silmaril.ie> writes:
>> On 04/09/10 17:09, Ole Streicher wrote:
>>> Peter Flynn<peter...@m.silmaril.ie> writes:
>>>>> So, does a kind of "lint" exist for docx? The best would be if it would
>>>>> also warn about possible problems with the different OOXML
>>>>> implementations.
>>>
>>>> Any good XML parser will check for syntactic conformance to XML.
>>>> Any schema-based validating parser should be able to validate against the
>>>> OOXML schemas.
>>>
>>> This is not enough. For example, an image file should be in word/media,
>>> it should have an Id set in word/_rels/document.xml.rels, be of a
>>> limited set of types (Encapsulated Postscript does not work here, for
>>> example), the suffix and the type should be linked together in
>>> [Content_Types].xml and so on. This all is AFAIK not defined in the
>>> OOXML schema, but this is the part where the problems start to rise.
>>
>> That is what validation against the OOXML schemas *should* do for
>> you.
>
> XML Schema cannot do this since it does not provide instruments to link
> the internal structure to external entities, like file names.

AFAIK no validator for XML does this: it's not required by the spec, and
I'm not sure that it would be useful, because documents can be moved out
of their editorial context (and usually are). The OOXML/ODF concept of
wrapping the document with ancillary files and zipping it all up is not
addressed by the XML spec (which governs validation) because it is an
application, and thus the responsibility of some other software, not the
parser/validator.

> I am not using Word. I am writing OOXML. That is a difference. And I
> do not have the option to switch to another format for this work.

In that case I think you might have to write some kind of add-on for the
validator, or perhaps wrap a validator in some external code which will
detect external references like filenames and perform an fopen() to see
if they exist.
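
A rough sketch of such a wrapper, assuming Python's standard library: it
reads every relationships part in the package, resolves each internal
Target against the directory of the part it belongs to, and reports the
targets that are absent from the archive. Because the files live inside the
zip, membership in the archive listing stands in for the fopen() test. The
function name dangling_relationships is hypothetical.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the elements in *.rels parts
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def dangling_relationships(docx_path):
    """Return (rels_part, relationship_id, resolved_target) for missing targets."""
    dangling = []
    with zipfile.ZipFile(docx_path) as pkg:
        names = set(pkg.namelist())
        for rels_name in sorted(n for n in names if n.endswith(".rels")):
            # "word/_rels/document.xml.rels" belongs to "word/document.xml",
            # so relative targets are resolved against "word".
            base = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(pkg.read(rels_name))
            for rel in root.iter(REL_NS + "Relationship"):
                # External targets (e.g. hyperlinks) are not stored in the package.
                if rel.get("TargetMode") == "External":
                    continue
                target = rel.get("Target", "")
                if target.startswith("/"):
                    resolved = posixpath.normpath(target.lstrip("/"))
                else:
                    resolved = posixpath.normpath(posixpath.join(base, target))
                if resolved not in names:
                    dangling.append((rels_name, rel.get("Id"), resolved))
    return dangling
```

An empty result means that every internal relationship points at a part
that is actually present; it says nothing about whether the part's content
is usable.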

> BTW, just out of curiosity: Are you sure that OpenOffice does not put any
> additional undocumented constraints outside its schemas? And, if it does,
> would you recommend switching back to OOXML? How does ODF, for example,
> link an external file name requirement to an internal entity?

I'm not sure that it has no external constraints: it's possible that it
does and that I have just not encountered them. The problem with using
schemas is that they don't have external entities like a DTD, so there
is no way (AFAIK) to inform the PSVI that a specific object is to be
sought outside the document. Using schemas for text documents is usually
an unnecessary design choice, unless they require some very specific
data types, which is very rare. OOXML and ODF are unfortunately both
victims of exogenous requirements for links to other formats in their
respective suites, which need robust data typing. That's what you are
buying into when you choose a wordprocessor file format for your
documents instead of a structure-oriented format like XML (or even LaTeX).
But very few users consider this, and we (you) have to deal with the
consequences of their decisions.

///Peter
