
file command: "XML document text" vs "XML  document text"


Adam Funk

Oct 16, 2012, 10:06:38 AM
I've just used the file command on four files of RDF-XML with the
following output:

augtfidf.rdf: XML document text
kyoto.rdf: XML document text
stuff.rdf: XML document text
tfidf.rdf: XML document text

What does it mean that one of them has an extra space between "XML"
and "document"?


--
The kid's a hot prospect. He's got a good head for merchandising, an
agent who can take you downtown and one of the best urine samples I've
seen in a long time. [Dead Kennedys t-shirt]

Janis Papanagnou

Oct 16, 2012, 11:20:14 AM
On 16.10.2012 16:06, Adam Funk wrote:
> I've just used the file command on four files of RDF-XML with the
> following output:
>
> augtfidf.rdf: XML document text
> kyoto.rdf: XML document text
> stuff.rdf: XML document text
> tfidf.rdf: XML document text
>
> What does it mean that one of them has an extra space between "XML"
> and "document"?

Hard to tell without further information.

What does file *.rdf | od -c show you?
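
A rough sketch of why piping through od -c helps: it prints every byte, so two consecutive spaces cannot be overlooked. The two sample lines and the names a.rdf and b.rdf are made up for illustration; they stand in for real file output.

```shell
# Hypothetical stand-in for `file *.rdf` output; the second line has two spaces.
sample() {
  printf '%s\n' 'a.rdf: XML document text' 'b.rdf: XML  document text'
}

# Dump every byte, so each space shows up as its own column.
sample | od -c

# Print only the lines that contain two or more consecutive spaces.
sample | grep -E ' {2,}'
```

With real data, replace the sample function with the actual file *.rdf call.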

Janis

Mirko K.

Oct 16, 2012, 1:23:55 PM
I've found out something strange. I don't know why this happens (perhaps a
bug in file/libmagic?), but maybe it helps with further investigation.

I've used locate+file+grep to look for .rdf files on my system with the two
spaces and found some. It seems that the double space appears when using
single quotes in the <?xml version='1.0'?> declaration. See this:

mirko@WizBox:~$ cp /etc/kompozer/profile/localstore.rdf .
mirko@WizBox:~$ cp localstore.rdf localstore_orig.rdf
mirko@WizBox:~$ vim localstore.rdf
mirko@WizBox:~$ file localstore*
localstore_orig.rdf: XML document text
localstore.rdf: XML  document text
mirko@WizBox:~$ head -n1 localstore*
==> localstore_orig.rdf <==
<?xml version="1.0"?>

==> localstore.rdf <==
<?xml version='1.0'?>
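
The same experiment can be repeated without copying a system file. This is a minimal sketch, assuming two throwaway files (double.xml and single.xml are names invented here) that differ only in the quote style of the declaration. What file prints for them depends on the installed version, so no output is claimed.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
dir=$(mktemp -d)

# Two otherwise identical files: only the quote style of the declaration differs.
printf '%s\n' '<?xml version="1.0"?>' '<root/>' > "$dir/double.xml"
printf '%s\n' "<?xml version='1.0'?>" '<root/>' > "$dir/single.xml"

# If file(1) is installed, show its descriptions byte by byte.
# -b omits the file names, so only the description text is dumped.
if command -v file >/dev/null 2>&1; then
  file -b "$dir/double.xml" "$dir/single.xml" | od -c
fi
```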




Thomas 'PointedEars' Lahn

Oct 16, 2012, 2:37:41 PM
Adam Funk wrote:

> I've just used the file command on four files of RDF-XML with the
> following output:
>
> augtfidf.rdf: XML document text
> kyoto.rdf: XML document text
> stuff.rdf: XML document text
> tfidf.rdf: XML document text
>
> What does it mean that one of them has an extra space between "XML"
> and "document"?

file(1) is free software. UTSL: <ftp://ftp.astron.com/pub/file/>.

--
PointedEars

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.

Adam Funk

Oct 16, 2012, 4:45:51 PM
Not much. I tried od -x too, and that confirmed that the two spaces
are both 0x20.


--
The history of the world is the history of a privileged few.
--- Henry Miller

Janis Papanagnou

Oct 16, 2012, 5:28:51 PM
On 16.10.2012 22:45, Adam Funk wrote:
> On 2012-10-16, Janis Papanagnou wrote:
>
>> On 16.10.2012 16:06, Adam Funk wrote:
>>> I've just used the file command on four files of RDF-XML with the
>>> following output:
>>>
>>> augtfidf.rdf: XML document text
>>> kyoto.rdf: XML document text
>>> stuff.rdf: XML document text
>>> tfidf.rdf: XML document text
>>>
>>> What does it mean that one of them has an extra space between "XML"
>>> and "document"?
>>
>> Hard to tell without further information.
>>
>> What does file *.rdf | od -c show you?
>
> Not much. I tried od -x too, and that confirmed that the two spaces
> are both 0x20.

The file(1) man page points to magic(5). Depending on the actual file
characteristics there seems to be more than one entry possible for a
file, independent of its extension. As a wild guess: maybe depending
on byte order or some such file characteristic.

It may be helpful if you inspect that 'magic' file on your system and
see what entries are present for XML and what's the difference in the
definition of the respective entries with those two text strings that
you observed.

Janis


Mirko K.

Oct 16, 2012, 7:54:01 PM
Janis Papanagnou wrote:

> The file(1) man page points to magic(5). Depending on the actual file
> characteristics there seems to be more than one entry possible for a
> file, independent of its extension. As a wild guess: maybe depending
> on byte order or some such file characteristic.

There are indeed multiple possible entries for a "file type". The point of
'file' and 'magic' is exactly that it tries to determine the type (and
possible sub-types/versions/etc) from the file's content (certain special
byte/string sequences near the beginning of the file, aka "magic numbers"),
instead of the extension, and actually, 'file' completely ignores the
extension.

Since there is an unlimited number of file formats, 'file' depends on a
heuristic which is easy to trick.
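
To make "byte/string sequences near the beginning of the file" concrete for the case in this thread, here is a small sketch that prints the first 15 characters of the two declarations from my other post. The variable names are invented for the example.

```shell
# The two declarations under discussion.
decl_double='<?xml version="1.0"?>'
decl_single="<?xml version='1.0'?>"

# Characters 1-15 are identical except for the final quote character.
printf '%s\n' "$decl_double" | cut -c1-15
printf '%s\n' "$decl_single" | cut -c1-15

# Characters 16-18 hold the version number in both cases.
printf '%s\n' "$decl_double" | cut -c16-18
printf '%s\n' "$decl_single" | cut -c16-18
```

So a pattern anchored at the start of the file can tell the two variants apart by the quote character alone.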

> It may be helpful if you inspect that 'magic' file on your system and
> see what entries are present for XML and what's the difference in the
> definition of the respective entries with those two text strings that
> you observed.

Unfortunately, there is no single "magic" text file on Ubuntu (and Debian)
anymore. It has been replaced with a binary format
(/usr/share/misc/magic.mgc). One has to install the source of the 'file' (or
libmagic1) package to view the different magic files that this magic.mgc is
compiled from. The relevant file (in the source) is "Magdir/sgml". I cannot
read it fluently, but I didn't find an obvious explanation for the symptoms
described by the OP.

Still, the observation in my other post should be a helpful start for
asking the file/libmagic devs about this. :-)

HTH

Janis Papanagnou

Oct 16, 2012, 8:32:32 PM
On 17.10.2012 01:54, Mirko K. wrote:
>
>> It may be helpful if you inspect that 'magic' file on your system and
>> see what entries are present for XML and what's the difference in the
>> definition of the respective entries with those two text strings that
>> you observed.
>
> Unfortunately, there is no single "magic" text file on Ubuntu (and Debian)
> anymore. It has been replaced with some binary format
> (/usr/share/misc/magic.mgc). One has to install the source of the 'file' (or
> libmagic1) package to view the different magic files that this magic.mgc is
> compiled from. [...]

The binary and text files are, on my Xubuntu, both under /usr/share/file,
also referenced through soft-links in /usr/share/misc. There's nothing
you'd have to install separately.

magic: magic text file for file(1) cmd
magic.mgc: magic binary file for file(1) cmd (version 7) (little endian)
magic.mime: magic text file for file(1) cmd

The file called magic was the one that I inspected.

Janis

Mirko K.

Oct 17, 2012, 8:51:08 AM
Janis Papanagnou wrote:

> The binary and text files are, on my Xubuntu, both under /usr/share/file,
> also referenced through soft-links in /usr/share/misc. There's nothing
> you'd have to install separately.
>
> magic: magic text file for file(1) cmd
> magic.mgc: magic binary file for file(1) cmd (version 7) (little endian)
> magic.mime: magic text file for file(1) cmd
>
> The file called magic was the one that I inspected.
>
> Janis

That seems to have changed somewhere between Ubuntu 10.04 and 12.04, and
also in Debian 6. On my old U10.04 installation these files are still there;
on this U12.04 there is certainly no single plain text magic file anymore and
I had to download the source (which does not mean that Xubuntu 12.04 might
not have them).

Adam Funk

Oct 19, 2012, 9:04:59 AM
On 2012-10-16, Mirko K. wrote:

> I've found out something strange. I don't know why this happens (perhaps a
> bug in file/libmagic?), but maybe it helps with further investigation.
>
> I've used locate+file+grep to look for .rdf files on my system with the two
> spaces and found some. It seems that the double space appears when using
> single quotes in the <?xml version='1.0'?> declaration. See this:
>
> mirko@WizBox:~$ cp /etc/kompozer/profile/localstore.rdf .
> mirko@WizBox:~$ cp localstore.rdf localstore_orig.rdf
> mirko@WizBox:~$ vim localstore.rdf
> mirko@WizBox:~$ file localstore*
> localstore_orig.rdf: XML document text
> localstore.rdf: XML  document text
> mirko@WizBox:~$ head -n1 localstore*
>==> localstore_orig.rdf <==
><?xml version="1.0"?>
>
>==> localstore.rdf <==
><?xml version='1.0'?>


I get the same results --- harmless, I guess, but very strange.


--
No sport is less organized than Calvinball!

Ben Bacarisse

Oct 19, 2012, 12:05:43 PM
I chased this down yesterday, but the end result was not very
satisfactory. The magic patterns allow subsequent matches to add text
to previous results and the 'file' code adds a space between them. All
good so far.

The double space comes from two rules. The first adds "XML" and the
second adds " document text" so you get two spaces. The tricky part is
that the rule that adds " document text" is this:

>15 search/1 >\0 %.3s document text

where the %.3s expands to nothing. Simply writing this

>15 search/1 >\0 document text

fixes the problem but the %.3s is very suggestive. Maybe this change
would break other file calls?
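
The effect can be imitated with the shell's own printf. This is only a sketch of the behaviour described above, not the libmagic code: join_parts is a made-up helper that joins an earlier and a later description with one space, and the empty argument stands for whatever makes %.3s expand to nothing.

```shell
# Made-up helper: append a later match to an earlier one with one space between.
join_parts() {
  printf '%s %s\n' "$1" "$2"
}

# Message of the quoted rule when %.3s receives an empty string:
# the result begins with a space.
with_format=$(printf '%.3s document text' '')

# Message of the simplified rule without the format.
without_format='document text'

join_parts 'XML' "$with_format"      # prints "XML  document text" (two spaces)
join_parts 'XML' "$without_format"   # prints "XML document text" (one space)
```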

I suspect that maybe the code should be passing the old string as an
argument to the formatted print call, but then the result would be "XML
XML document text" so that alone is not the problem. Maybe the file
code is supposed to treat strings with formats differently by not
appending the old and the new strings? I could not tell.

I thought the situation complex enough that I didn't think I could do
anything beyond simply reporting the effect. Has anyone done that?

--
Ben.

Mirko K.

Oct 19, 2012, 2:08:52 PM
Ben Bacarisse wrote:

> I chased this down yesterday, but the end result was not very
> satisfactory. The magic patterns allow subsequent matches to add text
> to previous results and the 'file' code adds a space between them. All
> good so far.
>
> The double space comes from two rules. The first adds "XML" and the
> second adds " document text" so you get two spaces. The tricky part is
> that the rule that adds " document text" is this:
>
> >15 search/1 >\0 %.3s document text
>
> where the %.3s expands to nothing. Simply writing this
>
> >15 search/1 >\0 document text
>
> fixes the problem but the %.3s is very suggestive. Maybe this change
> would break other file calls?
>
> I suspect that maybe the code should be passing the old string as an
> argument to the formatted print call, but then the result would be "XML
> XML document text" so that alone is not the problem. Maybe the file
> code is supposed to treat strings with formats differently by not
> appending the old and the new strings? I could not tell.
>
> I thought the situation complex enough that I didn't think I could do
> anything beyond simply reporting the effect. Has anyone done that?
>

Not yet. I found a different (partial) fix. First there is:

0 string/t \<?xml\ version=" XML

And a few lines later:

0 string \<?xml\ version=' XML

Changing that second field to string/t seems to fix it.

However, this seems to break it for UTF-x files (the t makes the test only
for ASCII files I think).
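
For experiments like this there is no need to touch the system database, since file(1) accepts an alternative magic file through -m. A minimal sketch, assuming the single-quote rule above is followed directly by the continuation rule Ben quoted; that pairing, the scratch directory and the names xml.magic and single.xml are assumptions made for the example, not copied from Magdir/sgml. The output depends on the installed file version.

```shell
# Scratch directory for the private magic file and the test document.
dir=$(mktemp -d)

# Private magic file with just the two rules under discussion (assumed pairing).
cat > "$dir/xml.magic" <<'EOF'
0 string \<?xml\ version=' XML
>15 search/1 >\0 %.3s document text
EOF

# A test document whose declaration uses single quotes.
printf '%s\n' "<?xml version='1.0'?>" '<root/>' > "$dir/single.xml"

# Classify the document using only the private rules and dump the bytes.
if command -v file >/dev/null 2>&1; then
  file -b -m "$dir/xml.magic" "$dir/single.xml" | od -c
fi
```

Editing the first field of either rule (for example string to string/t) and rerunning the last command shows the effect of a change straight away.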

Feel free to report it if you think that's enough info. Otherwise I'll
report it in the next few days; I want to play around with this a little more.

Ben Bacarisse

Oct 19, 2012, 5:16:13 PM
I don't think I'll have time, but I'll post here if I am about to -- no
point in duplicating.

--
Ben.

Alan Curry

Oct 22, 2012, 4:55:50 PM
In article <s6m2l9-...@WizBox.localnet>,
Mirko K. <mirkok...@googlemail.com> wrote:
>
>That seems to have changed somewhere between Ubuntu 10.04 and 12.04, and
>also in Debian 6. On my old U10.04 installation these files are still there;
>on this U12.04 there is certainly no single plain text magic file anymore and
>I had to download the source (which does not mean that Xubuntu 12.04 might
>not have them).

I complained about this last year. It's Debian bug#625259. Nothing has been
done about it.

--
Alan Curry