What compression algorithms are used to compress the TOC?


George Rüthschilling

Apr 12, 2008, 8:13:37 AM
to xar-devel
Dear all,

I am pretty new to the xar file format.

I want to mount xar archives in NetBeans, so I am writing a small Java library for xar.
I have already implemented the header parser - that was easy. ;-)
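A minimal sketch of such a parser, assuming the documented header layout (all big-endian: the magic "xar!", a uint16 header size, a uint16 version, uint64 compressed and uncompressed TOC lengths, and a uint32 checksum algorithm):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class XarHeader {
    static final int MAGIC = 0x78617221; // "xar!"

    int size, version, cksumAlg;
    long tocCompressed, tocUncompressed;

    // Parse the fixed header at the start of the archive.
    // All fields are big-endian, which matches DataInputStream.
    static XarHeader read(String path) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream(path))) {
            if (in.readInt() != MAGIC)
                throw new IOException("not a xar archive");
            XarHeader h = new XarHeader();
            h.size = in.readUnsignedShort();
            h.version = in.readUnsignedShort();
            h.tocCompressed = in.readLong();
            h.tocUncompressed = in.readLong();
            h.cksumAlg = in.readInt();
            return h;
        }
    }
}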

But now I am a little bit stuck on how to compress and decompress the
TOC section.
What compression algorithm do you use here?


Best regards,
George

jdd...@gmail.com

Apr 16, 2008, 6:41:09 AM
to xar-devel

Not absolutely sure, but I think the TOC compression algorithm is zlib deflate.
(http://en.wikipedia.org/wiki/DEFLATE_(algorithm))

George Rüthschilling

Apr 16, 2008, 9:51:23 AM
to xar-devel


On Apr 16, 12:41, "jddu...@gmail.com" <jddu...@gmail.com> wrote:
> Not absolutely sure, but I think the TOC compression algorithm is zlib deflate.
> (http://en.wikipedia.org/wiki/DEFLATE_(algorithm))


You are right, the TOC is compressed in the zlib "78 DA" format - deflate at the best compression level (the FDICT bit in 0xDA is actually clear, so no preset dictionary is involved).

I had a small bug in my zlib handling: Java has no unsigned byte data type, so I have the extra burden of converting bytes to an internal unsigned representation and back.

The really nasty thing was that my hex viewer code had done the conversion correctly, so I was a little confused for a while. I then wrote a few lines of C using the original zlib; that worked quite well, and after that it was no longer a big issue.
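For anyone hitting the same signed-byte pitfall, here is a minimal sketch of the TOC decompression with java.util.zip.Inflater; the header size and TOC lengths are assumed to come from the already parsed header, and the method and parameter names are mine, not the library's:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class TocReader {
    // Inflate the zlib-compressed TOC that immediately follows the
    // fixed-size xar header. For writing, new Deflater(Deflater.BEST_COMPRESSION)
    // produces the matching "78 DA" zlib stream.
    public static byte[] inflateToc(RandomAccessFile archive,
                                    int headerSize,
                                    int tocCompressedLen,
                                    int tocUncompressedLen)
            throws IOException, DataFormatException {
        byte[] compressed = new byte[tocCompressedLen];
        archive.seek(headerSize);
        archive.readFully(compressed);

        Inflater inflater = new Inflater(); // expects the zlib wrapper
        inflater.setInput(compressed);
        byte[] toc = new byte[tocUncompressedLen];
        inflater.inflate(toc);
        inflater.end();
        // Java bytes are signed; whenever a byte is needed as 0..255
        // (checksums, hex dumps), mask it: int unsigned = b & 0xFF;
        return toc;
    }
}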


So I now have the "muscles": I can read and write the three xar sections - header, TOC and heap. Next I have to find out more about the already existing bindings for the ext2 and Darwin extended-attribute metadata.

I want to get an idea of how to map Windows NTFS metadata to XAR TOC XML elements.

It seems to me a quite natural approach to simply use the file-system objects provided by the underlying operating system. That way I can use reflection to query each object's metadata properties, and I do not have to write (too) many lines of code by hand.
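A rough sketch of that reflection idea, using the java.nio.file API purely for illustration (Files.readAttributes with DosFileAttributes works on NTFS; other platforms may throw UnsupportedOperationException):

import java.lang.reflect.Method;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.DosFileAttributes;

public class AttributeDump {
    // Enumerate the zero-argument getters of a file-attributes object
    // via reflection, so every property can be turned into a TOC
    // element without writing one line of code per attribute.
    public static void dump(Path path) throws Exception {
        DosFileAttributes attrs =
                Files.readAttributes(path, DosFileAttributes.class);
        for (Method getter : DosFileAttributes.class.getMethods()) {
            if (getter.getParameterCount() == 0) {
                System.out.printf("<%s>%s</%s>%n",
                        getter.getName(),
                        getter.invoke(attrs),
                        getter.getName());
            }
        }
    }
}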

Another feasible approach is to retrieve platform-independent objects via the Common Information Model (CIM). I am not sure, but I think this would allow creating archives that are binary portable, with platform-dependent, conditional TOC sections.

Best regards
George

jdd...@gmail.com

Apr 17, 2008, 6:58:06 AM
to xar-devel
> I want to get an idea of how to map Windows NTFS metadata to XAR TOC XML elements.
>
> It seems to me a quite natural approach to simply use the file-system objects provided by the underlying operating system. That way I can use reflection to query each object's metadata properties, and I do not have to write (too) many lines of code by hand.
>
> Another feasible approach is to retrieve platform-independent objects via the Common Information Model (CIM). I am not sure, but I think this would allow creating archives that are binary portable, with platform-dependent, conditional TOC sections.
>
> Best regards
> George

Most of the archived metadata is described here:

http://code.google.com/p/xar/wiki/ArchivedMetadata

The problem I encountered when I tried to compile xar on Windows is that Windows does not support:
- symlinks, hard links and all the other special files (sockets, devices, etc.)
- POSIX permissions (user, group, everybody) and ownership
- POSIX flags: chflags(2)
It also stores file times in a different way than UNIX does, and it provides unique flags of its own, like "compressed".

I think you should try to find equivalents for most of the metadata that are already defined for other systems, and define new elements for the NTFS specifics. Maybe something like <ntfs_flags>1234</ntfs_flags>, or <ntfs><flags>12345</flags> ...other specific elements... </ntfs>.

I did not find any serious discussion about a Win32 port in this group; maybe it's time to start one.

George Rüthschilling

Apr 19, 2008, 12:56:45 PM
to xar-devel
For a simple port of the xar library to the Windows platform I would consider using the Interix/UNIX subsystem instead of the Win32 subsystem. Interix offers a complete implementation of the POSIX libraries, so it is a much more UNIX-like environment. Here are two links about it:

http://stephesblog.blogs.com/papers/usenix-interix.pdf
http://technet.microsoft.com/de-de/magazine/cc160802(en-us).aspx


My next step is to think about a XAR container without any operating-system-specific metadata, and to derive from this "naked" container a specification for the XAR container itself. I think it is a good idea to separate container specifics from content specifics; a modular system would be more extensible. The advantage I see here is that there would be less interference between potentially incompatible operating-system data.

E.g. an ntfs module element could look like this:

<file id="1">
<name>test.xml</name>
<type>file</type>
...
<atime>2008-04-19T13:41:21Z</atime>
<mtime>2008-04-19T13:41:21Z</mtime>
<ctime>2008-04-19T12:53:36Z</ctime>
<ntfs>
<Attributes>ReadOnly, Hidden, Archive, NotContentIndexed,
Compressed, Encrypted </Attributes>
<FileSecuritySettings>
<ControlFlags>32772</ControlFlags>
<OwnerPermissions>True</OwnerPermissions>
</FileSecuritySettings>
<Owner>
<AccountName>JR</AccountName>
<BinaryRepresentation>[... Base64?]</
BinaryRepresentation>
<ReferencedDomainName>MMW</ReferencedDomainName>
<SID>S-1-5-21-1123456780-627654331-5435298798-1122</
SID>
<SidLength>28</SidLength>
</Owner>
<AlternativeDataStreams />
<ACL />
</ntfs>
</file>


Best regards,
George

jdd...@gmail.com

Apr 20, 2008, 6:29:57 AM
to xar-devel

Looks fine, but you might store the AlternativeDataStreams as extended attributes; that is the way forks (the HFS name for a data stream) are stored on OS X.

George Rüthschilling

Apr 21, 2008, 1:24:33 PM
to xar-devel
Microsoft products are often really feature-rich... ;-)

NTFS also supports extended attributes; they are different from the alternative data streams and were added for the OS/2 subsystem of Windows NT.

The Win32 subsystem has a feature called extended file properties, which can also be regarded as extended attributes. These properties can either be stored inside an OLE container, which makes them independent of the underlying file system, or in three predefined alternative data streams (the ^E prefix below is the control character 0x05):
^ESummaryInformation [
    Title, Subject, Author, Keywords, Comments,
    Template, Last Saved By, Revision Number,
    Total Editing Time, Last Printed, Number of Pages,
    Number of Words, Number of Characters, Thumbnail,
    Name of Creating Application
],
^EDocumentSummaryInformation [
    Category, PresentationTarget, Lines, Paragraphs, Slides,
    Notes, HiddenSlides, MMClips, HeadingPairs, TitlesofParts,
    Manager, Company, {your own Office properties}
],
^ESebiesnrMkudrfcoIaamtykdDa [
    Computer {mostly undocumented, at least 3 other attributes...}
]

Since some of these properties are not plain text, it is unclear whether other operating systems can use them; potentially they may even crash applications that expect slightly different data.

I think an archiver should not interpret and change the archived data. It is a much easier task to write platform-specific code for exporting the archive's content to a specific platform than to write code that normalizes platform-specific metadata into a universal representation meeting the needs of all existing file systems and operating systems.

The small differences that sometimes matter a lot are what make systems programming such an exciting field. Similar concepts do not always lead to compatible implementations.

Apple mentions somewhere that EAs should not be longer than 3802 bytes on Mac OS X 10.4. XFS has a 64 KB limit for each EA. NTFS has no such size restrictions: an alternative data stream behaves in many respects just like a normal file, except that it inherits the ACL and some other metadata from its parent file.

Best regards,
George

jdd...@gmail.com

Apr 22, 2008, 5:38:59 AM
to xar-devel
OK. On OS X, extended attributes are just forks (one fork per attribute).
In fact, the 3802-byte size limit is an Apple implementation choice, not a technical limit of the HFS+ format: it keeps an EA within one block. One block is 4096 bytes wide, and I think 3802 is one block minus some header. The resource fork is a special case that is not limited in size.

Anyway, you are right: it's probably better to isolate the file-system-specific data and write code for each platform that tries to interpret those data where they are relevant.

Best Regards,
Jean-Daniel