[expath][ANN] ZIP Module first draft published

Phil Fearon

Oct 12, 2010, 7:31:38 AM
to EXPath
Hi

I'm pleased to announce the first draft for the ZIP Module
specification is now available at:

http://expath.org/spec/zip

To quote from the spec:

'This specification defines a set of functions to read and write ZIP
files structure and actual content. It has been designed as a general
ZIP tool set for XPath, while it is expected to be particularly useful
with document package formats based on XML and ZIP, as for instance
[EPUB], [Open XML], and [OpenDocument].'

Though this is a collaborative work, the original design and
implementation of this module were by Florent Georges.

Florent's Saxon (Java) implementation is still the only free
implementation of this specification available. However, a separate
implementation (using Saxon.NET) is now incorporated in Qutoric's
commercial XPath and XSLT test/publishing tools.

All comments are welcome.

Regards,

Phil Fearon
http://qutoric.com

mozer

Oct 12, 2010, 9:56:55 AM
to exp...@googlegroups.com
Hi Phil,

Thanks for this spec, which has been wanted for a long time.

Here are a few comments:

1) There is no reference to ZIP ? Which implementation of ZIP are you
referring to ? Can we use UTF-8 in file names ?
2) Can we handle JAR/WAR/EAR ?
3) Can we handle Widget[1] ?
3) EXProc already has an unzip step [2]. How does it relate ?
4) MarkLogic already has a ZIP library [3]. How does it relate ?
5) You refer to XQuery, XPath and XSLT; please add explicit references.
6) You reference HTML Document ? Perhaps you could add the HTML5 parsing
algorithm defined here [4] ?
7) Why isn't there any function to access a zip:entry when you have
already called zip:entries() ?


[1] http://www.w3.org/TR/2010/WD-widgets-20101005/
[2] http://exproc.org/proposed/steps/other.html#unzip
[3] http://developer.marklogic.com/pubs/4.0/apidocs/package.html
[4] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parsing

Regards,

Xmlizer


Philip Fearon

Oct 12, 2010, 1:14:45 PM
to exp...@googlegroups.com
Hi,

Thanks for your comments, my replies below:

On Tue, Oct 12, 2010 at 2:56 PM, mozer <xml...@gmail.com> wrote:
> Hi Phil,
>
> Thanks for this spec which is wanted for a long time
>
> Here a few comments
>
> 1) There is no reference to ZIP ?

A1.1) Agreed that one should be added. Is [1] OK?

> Which implementation of ZIP are you referring to ?

A1.2) This is difficult because I believe there are many ZIP
implementations. I would recommend that we reference a specific ZIP
specification as a minimum and leave support for later versions
to individual EXPath implementations. Any suggestions?

> Can we use UTF-8 in file name ?

A1.3) I'm assuming you're referring to the values used in the 'href'
attribute of the zip:file element and the $href argument in the
entry extraction functions? If so, my view is that it's probably best
for the character set supported for the name (I don't think this is an
encoding issue) to be implementation-defined. Even then, with
platform-independent solutions (like Florent's), it may differ
across operating systems.

> 2) Can we handle JAR/WAR/EAR ?

A2) Required support for various packaging formats/conventions like
the ones you mention should contribute towards the decision on the
minimum ZIP spec supported (see A1.2).

> 3) Can we handle Widget[1] ?

A3.1) My view is the same as for A2.

> 3) EXProc has already an unzip step [2]. How does it relate ?

A3.2) Probably, but I'm not sure how. Should this be mentioned in the
spec as an acknowledgement, anyone?

> 4) Marklogic has already a ZIP library [3] ? How does it related ?

A4) I don't know, but Florent might.

> 5) You refer to XQuery, XPath and XSLT please make explicit reference

A5) Agreed. I'll add references.

> 6) You reference HTML Document ? probably you can add HTML5 algorithm
> defined here [4] ?

A6) You're referencing the HTML5 parsing algorithm section, so I'm
guessing this is related to the zip:html-entry() function (in Section
2.1) which is about parsing HTML so that it can be returned as an XML
document node. I can add the HTML algorithm you reference as an
aspiration, but so far as I can tell (from a quick glance), this
simply creates a DOM. Wouldn't we then still need to have a defined
way of mapping the DOM to an XML document node?

> 7) Why aren't there any function to access to a zip:entry when you
> already called zip:entries() ?

A7) I'm not quite sure what the question is. Is it:

(A7a) that you want the ability to insert a ZIP compressed file as a
ZIP entry within a ZIP file?
- You can, by using a URI value locating the ZIP file for the 'src'
attribute in the zip:entry element (section 4.3)

or, (A7b) that you want to read an entry (or get details on the
structure) of a ZIP file that you've already created with the
zip:entries() call?
- In this case, it could be awkward, as I don't think you can
guarantee the order of execution and you would have to start
relying on side-effects

If I've misunderstood the question, perhaps you could be more specific
on what you want the specification to allow you to do, with a small
use case?

[1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT


Regards

Phil Fearon
http://qutoric.com

Florent Georges

Oct 14, 2010, 5:42:28 PM
to exp...@googlegroups.com
On 12 October 2010 12:31, Phil Fearon wrote:

Phil,

> I'm pleased to announce the first draft for the ZIP Module
> specification is now available at:

> http://expath.org/spec/zip

Thank you so much for taking over this spec! It's been due for
a long time :-) A few random comments about the current draft:

1/ It is maybe a good time to rename zip:zip-file() as
zip:create-file() or something? I really hate renaming things,
but we are still at an early stage, and "zip-file()" is kind of
ambiguous. Does it create one zip file, read one, etc.?

2/ In zip-file() and update-zip() (aka the creation functions),
we should probably add a param $contents (like $bodies in the
HTTP Client), to be able to provide some entries' content as
individual items/nodes, without having to embed them in the
zip:file element.

3/ In the creation functions, should we add a param $href, in
order to be able to change the destination URI without having to
copy the whole zip:file element?

4/ What if an entry does not exist in a ZIP file when calling
one of the functions zip:*-entry()? Probably return the empty
sequence. But their return type is not optional. So we should
probably change their return type (e.g. for zip:text-entry()) to
"xs:string?" instead of "xs:string", and say the empty sequence
is returned when the entry does not exist.

5/ Should we add a function zip:entry-exists()? (returning true
or false)

6/ And the function zip:xml-entry-available(), to be consistent
with fn:doc-available()? (The entry not only has to exist, but
also has to be "available", i.e. well-formed, etc.)

7/ Besides the core functionalities, we should probably also
provide some helper functions (can be done in plain XPath, but
convenient to have them), for instance:

   (: $entry is either a zip:dir or a zip:entry; returns the
      path of $entry in its ancestor zip:file :)
   zip:entry-path($entry as element()) as xs:string

   (: returns either a zip:dir or a zip:entry, a descendant of
      $zip, corresponding to $path :)
   zip:entry-descriptor($zip as element(zip:file),
                        $path as xs:string) as element()?
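
As an illustration of "can be done in plain XPath", here is a minimal sketch of the two helpers working only on a zip:file manifest element. The local: function names, the namespace URI http://expath.org/ns/zip, and the path convention ('/' as separator, no trailing slash for directories) are assumptions made for this sketch, not something the draft defines.

```xquery
(: Assumption: the zip prefix is bound to this namespace URI. :)
declare namespace zip = "http://expath.org/ns/zip";

(: Hypothetical helper: joins the @name of $entry and of its zip:dir
   ancestors, outermost first, using '/' as the separator. :)
declare function local:entry-path($entry as element()) as xs:string {
  string-join(
    for $e in $entry/ancestor-or-self::*[self::zip:dir or self::zip:entry]
    return string($e/@name),
    '/')
};

(: Hypothetical helper: returns the first zip:dir or zip:entry below $zip
   whose computed path equals $path, or the empty sequence if none does. :)
declare function local:entry-descriptor(
  $zip  as element(zip:file),
  $path as xs:string
) as element()? {
  ($zip//*[self::zip:dir or self::zip:entry]
          [local:entry-path(.) eq $path])[1]
};

let $zip :=
  <zip:file href="sample.zip">
    <zip:dir name="docs">
      <zip:entry name="index.html"/>
    </zip:dir>
    <zip:entry name="README"/>
  </zip:file>
return (
  local:entry-path($zip/zip:dir/zip:entry),                        (: "docs/index.html" :)
  string(local:entry-descriptor($zip, 'docs/index.html')/@name),   (: "index.html" :)
  empty(local:entry-descriptor($zip, 'missing.txt'))               (: true :)
)
```

Because both functions only inspect the manifest, they never touch the ZIP file itself; whether a built-in version should behave the same way is left open here.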

Thanks again for the draft, regards,

--
Florent Georges
http://fgeorges.org/

Florent Georges

Oct 14, 2010, 7:06:25 PM
to exp...@googlegroups.com
On 12 October 2010 18:14, Philip Fearon wrote:

> On Tue, Oct 12, 2010 at 2:56 PM, mozer wrote:

Hi,

> Thanks for your comments, my replies below:

Instead of responding separately to Mozer and to you, I'll
respond to both of you in the same email.

>> 1) There is no reference to ZIP ?

> A1.1). Agree that one should be added, is[1] ok?

It makes sense to me. That's also the one referred to by the
"Widget Packaging and Config" spec, as well as by ODF. Who
knows, we're maybe going to have an ISO ZIP spec one day? ;-)

>> Which implementation of ZIP are you referring to ?

What do you mean Mozer, by "referring to an implementation of
ZIP" in the context of this spec?

>> Can we use UTF-8 in file name ?

> A1.3) I'm assuming you're referring to the values used in the
> 'href' attribute of the 'zip:file element' and the $href
> argument

I might be wrong, but I think he refers instead to the name of
the entries in the ZIP file (zip:dir and zip:entry's @name).

>> 2) Can we handle JAR/WAR/EAR ?

>> 3) Can we handle Widget[1] ?

As those are ZIP files, yes, we should be able to read them,
at the ZIP layer level of course (nothing specific to JAR files
or widgets themselves). But these are both interesting use cases.
Validating that we can support all options in the JAR and Widget
specs is an interesting indicator (e.g. the signing stuff).

>> 3) EXProc has already an unzip step [2]. How does it relate ?

> A3.2) Probably, but I'm not sure how. Should this be in the
> spec as an acknowledgement anyone?

It does not relate at all. Thanks for the link Moz, I've never
seen this one before. Is it a new step? It seems to only return
either the manifest (if we can call the zip:file element
representing the structure of a ZIP file that) or a specific
entry as XML or binary (so like zip:entries(), zip:xml-entry()
and zip:binary-entry()).

It also shows more information about each entry (like
timestamps and sizes), which would be interesting to add also in
the ZIP module.

Maybe we should coordinate with EXProc: tell them about the new
ZIP draft and see if they want to share the effort or to keep two
separate specs. I think it'd make sense for EXProc to refer to
the EXPath spec and say that some functions are actually provided
as steps instead, and just define the interface of those steps
and how they map to the corresponding function definition.

>> 4) Marklogic has already a ZIP library [3] ? How does it
>> related ?

> A4) I don't know, but Florent might.

Depends on your definition of "relate", Moz :-) Maybe of more
interest is [A], actually, where xdmp:zip-manifest() is the
equivalent of zip:entries(), xdmp:zip-get() the equivalent of the
various zip:*-entry() functions, and xdmp:zip-create() the
equivalent of zip:zip-file().

A big difference is that they use (or resp. generate) a binary
item (a proprietary item type of MarkLogic) instead of reading
(or resp. creating) files identified by a URI. Which could make
sense in some cases/environments. Or not.

>> 6) You reference HTML Document ? probably you can add HTML5
>> algorithm defined here [4] ?

> A6) You're referencing the HTML5 parsing algorithm section, so
> I'm guessing this is related to the zip:html-entry() function
> (in Section 2.1) which is about parsing HTML so that it can be
> returned as an XML document node. I can add the HTML algorithm
> you reference as an aspiration

Yes, I think you are right (Phil). Personally I would be
reluctant to introduce a normative reference to that algorithm.
The initial references to Tag Soup and HTML Tidy come from the
XProc spec, which does the same thing for the p:http-request
step.

We can probably add a reference to the HTML algorithm as one of
the possible ways to do it (forbidding the evaluation of any
script; <script>document.write('<p>');</script> MUST be returned
as one element with one text node: "document.write('<p>');").

>> 7) Why aren't there any function to access to a zip:entry when
>> you already called zip:entries() ?

> A7) I'm not quite sure what the question is.

I guess the question is: "when I have already read a ZIP file
using zip:entries(), why do I have to read the file again by
providing its URI and an entry path instead of just
providing a zip:entry", or something like that.

Two points here. First, while it is true that you have to
access the ZIP file twice if you call zip:entries() then, say,
zip:text-entry(), reusing the zip:entry element wouldn't change
that; and there is no overhead, as the ZIP file is not read in its
entirety by zip:entries(), only the parts relevant to generating
the manifest.

Second, from a user point of view, it would probably make sense
to provide an overload of the zip:*-entry() functions to accept a
zip:entry element instead of both $href and $path:

   (: $entry must be a descendant of a zip:file returned by
      zip:entries() :)
   zip:text-entry($entry as element(zip:entry)) as xs:string

> [1] http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Thanks for your comments and responses!

Regards,

--
Florent Georges
http://fgeorges.org/


[A] http://developer.marklogic.com/pubs/4.2/apidocs/Document-Conversion.html

John Snelson

Oct 15, 2010, 6:04:43 AM
to exp...@googlegroups.com
On 14/10/10 22:42, Florent Georges wrote:
> 2/ In zip-file() and update-zip() (aka the creation functions),
> we should probably add a param $contents (like $bodies in the
> HTTP Client). To be able to provide some entries' content as
> individual items/nodes, without having to embed them in the
> zip:file element.

This would be my main comment too. It's not nice to have to save
documents to the filesystem/database in order to create a zip file of them.

John

--
John Snelson, Senior Engineer http://twitter.com/jpcs
MarkLogic Corporation http://www.marklogic.com

Florent Georges

Oct 15, 2010, 7:09:02 AM
to exp...@googlegroups.com
On 15 October 2010 11:04, John Snelson wrote:
> On 14/10/10 22:42, Florent Georges wrote:

Hi,

>> 2/ In zip-file() and update-zip() (aka the creation
>> functions), we should probably add a param $contents (like
>> $bodies in the HTTP Client). To be able to provide some
>> entries' content as individual items/nodes, without having to
>> embed them in the zip:file element.

> This would be my main comment too. It's not nice to have to
> save documents to the filesystem/database in order to create a
> zip file of them.

It's nice to see we agree on this one, but I am not sure we
are actually talking about the same thing :-) For now, we have:

   zip:zip-file(
     <zip:file href="new-file.zip">
       <zip:entry name="README">This is a sample.</zip:entry>
     </zip:file>)

I.e. the content of README (that is, the content of what will
be the entry 'README' in the new ZIP file) is part of the
zip:file element. It is never serialized prior to the creation
of the ZIP file. An alternative way, in order to use existing
static files, is to use @src:

   zip:zip-file(
     <zip:file href="new-file.zip">
       <zip:entry src="some/where/logo.png"/>
     </zip:file>)

What I suggest here is to be able to do something like the
following (as is already the case in the HTTP Client, see
http://expath.org/spec/http-client, look for param "$bodies"):

   let $readme := 'This is a sample.'
   return
     zip:zip-file(
       <zip:file href="new-file.zip">
         <zip:entry name="README"/>
       </zip:file>,
       $readme)

I see at least two reasons for that. If we generate large
content, for some implementations it could be difficult to be
efficient and prevent unnecessary copies of the content just to
add it to zip:file. Furthermore, the content can be changed
as a side-effect of adding it to zip:file. For instance, the
following two examples won't produce the same entry in the ZIP file:

   let $xml := <hello>World!</hello>
   return
     zip:zip-file(
       <zip:file href="one-file.zip">
         <zip:entry name="hello.xml"> {
           $xml
         }
         </zip:entry>
       </zip:file>)

   let $xml := <hello>World!</hello>
   return
     zip:zip-file(
       <zip:file href="another-file.zip">
         <zip:entry name="hello.xml"/>
       </zip:file>,
       $xml)

When serializing the element 'hello' into the entry hello.xml,
the former will have to add a binding for the zip namespace.
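
The namespace effect can be observed without calling any ZIP function, using element construction alone. This is a minimal sketch, assuming the zip prefix is bound to http://expath.org/ns/zip (an assumption of this example) and a copy-namespaces mode of "preserve, inherit", which is declared explicitly because an implementation may choose a different default.

```xquery
(: Assumption: the zip prefix is bound to this namespace URI. :)
declare namespace zip = "http://expath.org/ns/zip";

(: Stated explicitly so the result does not depend on the processor's default. :)
declare copy-namespaces preserve, inherit;

let $xml      := <hello>World!</hello>
(: Embedding $xml in a zip:entry constructor copies it as a child of zip:entry. :)
let $wrapper  := <zip:entry name="hello.xml">{ $xml }</zip:entry>
let $embedded := $wrapper/hello
return (
  (: The standalone element has no binding for the zip prefix... :)
  in-scope-prefixes($xml) = 'zip',        (: false :)
  (: ...but the embedded copy inherits it from its new parent. :)
  in-scope-prefixes($embedded) = 'zip'    (: true :)
)
```

A serializer working from the embedded copy would therefore be expected to emit an xmlns:zip declaration on 'hello', whereas content passed as a separate argument would keep only its own bindings.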

Regards,

John Snelson

Oct 15, 2010, 7:12:31 AM
to exp...@googlegroups.com
On 15/10/10 12:09, Florent Georges wrote:
> On 15 October 2010 11:04, John Snelson wrote:
>> On 14/10/10 22:42, Florent Georges wrote:
>
>>> 2/ In zip-file() and update-zip() (aka the creation
>>> functions), we should probably add a param $contents (like
>>> $bodies in the HTTP Client). To be able to provide some
>>> entries' content as individual items/nodes, without having to
>>> embed them in the zip:file element.
>
>> This would be my main comment too. It's not nice to have to
>> save documents to the filesystem/database in order to create a
>> zip file of them.
>
> That's nice to see we agree on this one, but I am not sure we
> are talking about the same thing actually :-) For now, we have:
>
> zip:zip-file(
> <zip:file href="new-file.zip">
> <zip:entry name="README">This is a sample.</zip:entry>
> </zip:file>)
>
> I.e. the content of README (that is, the content of what will
> be the entry 'README' in the new ZIP file) is part of the
> zip:file element. It is never serialized prior to the creation
> of the ZIP file.

I hadn't realised that from my skim of the spec.

I agree.

Phil Fearon

Oct 15, 2010, 9:53:39 AM
to EXPath
Florent, John, Mozer and All,

Right, thanks for all comments, we have quite a bit here already!

I've tried here to list the proposals for the spec that are material,
that is, those that require new or different functions, changes to
existing function signatures, or changes to existing elements. The other
proposals for the spec, such as references to other specs, are equally
important but perhaps best covered separately, to keep things
manageable.

I've no objections to any of the proposals listed below. They
all seem like they could be useful for users and don't add too much
burden for the implementer. All I've added is suggestions for different
names, and a bit of extra detail.

Once I have answers to my naming suggestions and added details, and if
there aren't any further significant comments on these proposals, I can
modify the spec accordingly so as to capture where we're at, and then
we can go through the review process on these changes. More proposals/
issues/questions/clarifications are of course still welcome!


1/ It is maybe a good time to rename zip:zip-file() as
zip:create-file() or something? I really hate renaming things,
but we are still at an early stage, and "zip-file()" is kind of
ambiguous. Does it create one zip file, read one, etc.?

A1. I agree that if we're to rename functions we should do it now,
collectively, so that the names have some coherence.

So, my vote for the 3 zip file-handling functions (omitting the
required prefix) would be:

list-zip()
create-zip()
update-zip()

I know these would look a bit odd when using the 'zip' prefix, but we
shouldn't infer too much meaning from prefixes anyway. What do you
think? (I haven't a strong view on this, because naming is always
awkward)


[Florent's proposal, with agreement from John]
2/ In zip-file() and update-zip() [sic: update-entries()] (aka the
creation functions),
we should probably add a param $contents (like $bodies in the
HTTP Client). To be able to provide some entries' content as
individual items/nodes, without having to embed them in the
zip:file element

A2. I agree, we should add this as a convenience

3/ In the creation functions, should we add a param $href? In
order to be able to change the destination URI without having to
copy the whole zip:file element

A3. I've no objections. Should we state then that the $href parameter
can be used to override the 'href' attribute in the zip:file element?

4/ What if an entry does not exist in a ZIP file when calling
one of the functions zip:*-entry()? Probably return the empty
sequence. But their return type is not optional. So we should
probably change their return type (e.g. for zip:text-entry()) to
"xs:string?" instead of "xs:string", and say the empty sequence
is returned when the entry does not exist

A4. I've no objections to this. I'll add the '?' occurrence indicator.

5/ Should we add a function zip:entry-exists()? (returning true
or false)

6/ And the function zip:xml-entry-available(), to be consistent
with fn:doc-available()? (not only the entry has to exist, but to
be "available", so well-formed, etc.)

A5 + A6. I'm happy if we add these, but, on naming, how about using
'zip:entry-is-xml' instead of 'xml-entry-available'?

7/ Besides the core functionalities, we should probably also
provide some helper functions (can be done in plain XPath, but
convenient to have them), for instance:

   (: $entry is either a zip:dir or a zip:entry; returns the
      path of $entry in its ancestor zip:file :)
   zip:entry-path($entry as element()) as xs:string

   (: returns either a zip:dir or a zip:entry, a descendant of
      $zip, corresponding to $path :)
   zip:entry-descriptor($zip as element(zip:file),
                        $path as xs:string) as element()?

A7. Agreed

8/[Florent - in response to Mozer]
Second, from a user point of view, it would probably make sense
to provide an overload of the zip:*-entry() functions to accept a
zip:entry element instead of both $href and $path:

   (: $entry must be a descendant of a zip:file returned by
      zip:entries() :)
   zip:text-entry($entry as element(zip:entry)) as xs:string

A8. Agreed


9/ [Florent replying to Mozer's reference to EXProc] It also shows
more information about each entry (like timestamps and sizes),
which would be interesting to add also in the ZIP module.

A9. I'll add 'timestamp' (xs:dateTime) and 'size' (size in KB as
xs:integer) to zip:entry for the case where it's returned by
zip:entries()

--------------

Regards

Phil Fearon
http://qutoric.com


mozer

Oct 28, 2010, 6:03:17 AM
to exp...@googlegroups.com
Sounds like a very good step forward !!
I vote yes for all

Xmlizer
