Copying files from one ZIP to another

Martynas

Jan 8, 2010, 9:17:40 AM
to EXPath
Hey,

I'm still working on my ODT to ePub conversion in XSLT.
The markup content goes all the way ODT > XHTML > ePub.
However, other media such as pictures have to go directly from ODT to
ePub.
That is, I need to copy files from file.odt!/Pictures/ to
file.epub!/OEBS/pictures/. How does one achieve that with EXPath?
I create a <zip:dir name="pictures">, but what has to go inside?

Thanks,

Martynas
semantic-web.dk

Philip Fearon

Jan 8, 2010, 4:40:22 PM
to exp...@googlegroups.com
For this, I think the info you need is at:

http://www.balisage.net/Proceedings/vol3/html/Georges01/BalisageVol3-Georges01.html

Specifically:

<<Quote>>

To be complete, the module also provides 4 functions to read one
specific entry from an existing ZIP file, for instance depending on
the result of zip:entries(). They return either a document node, a
string or a xs:base64Binary item, following the same rules as
http:send-request():

zip:xml-entry($href, $entry) as document-node()
zip:html-entry($href, $entry) as document-node()
zip:text-entry($href, $entry) as xs:string
zip:binary-entry($href, $entry) as xs:base64Binary

<<Unquote>>

These functions let you get zip entry data (uncompressed) from your ODT
file. You should (in ODT -> XHTML) then be able to serialize this
xs:base64Binary in an xsl:result-document. In your XHTML -> ePub, you
could then use the doc() function to get the result document, pick out
your base64 text node, and just include it in a <zip:entry
output="binary"> sequence constructor within your <zip:dir> element.
So it might look like:

<zip:dir name="picture">
   <zip:entry name="sample.gif" output="binary">
      <xsl:value-of select="doc('picture-content.xml')/main/picture
                            cast as xs:base64Binary"/>
   </zip:entry>
</zip:dir>

You also have the more efficient option of just using
zip:binary-entry() directly, if ODT -> ePub works for you:

<zip:dir name="picture">
   <zip:entry name="sample.gif" output="binary">
      <xsl:value-of select="zip:binary-entry($href, $entry)
                            cast as xs:base64Binary"/>
   </zip:entry>
</zip:dir>

I'm guessing a bit here, because I'm not using an 'official'
implementation, just my own Saxon.NET implementation. Should this
work, anyone?

Phil Fearon
http://qutoric.com


Florent Georges

Jan 8, 2010, 6:37:50 PM
to exp...@googlegroups.com
2010/1/8 Martynas wrote:

Hi Martynas,

> I'm still working on my ODT to ePub conversion in XSLT.
> The markup content goes all the way ODT > XHTML > ePub.

Before going further, how is this pipeline technically implemented?
Are those two different transformations? Or is the XHTML an
intermediary format used only internally within a single transform? If
there are two distinct transforms, how is the output of the first one
passed as the input of the second one? Does the context of the
transform allow writing to the disk? Etc.

Maybe you already have published a version of your tool? (if it is public)

Regards,

--
Florent Georges
http://www.fgeorges.org/

Florent Georges

Jan 8, 2010, 6:54:04 PM
to exp...@googlegroups.com
2010/1/8 Philip Fearon wrote:

Hi Philip,

> These functions let you get zip entry data (uncompressed) from
> your ODT file. You should (in ODT -> XHTML) then be able to
> serialize this xs:base64Binary in an xsl:result-document. In
> your XHTML -> ePub, you could then use the doc() function to get
> the result document, pick out your base64 text node, and just
> include it in a <zip:entry output="binary"> sequence
> constructor within your <zip:dir> element.

Something along those lines, yes I think. Two points to be
aware of though:

- Martynas does not say in this message (maybe in a previous
  one, but then I do not remember) whether his transform is
  a single XSLT transform with several passes, or if it
  is done in two separate XSLT transforms.

  If this is a single transform, this wouldn't work, as
  the same transform cannot use doc() to access a document it
  created itself with xsl:result-document. But if it is one
  transform, there is no need to serialize anything to the
  disk anyway;

- instead of writing each entry one by one, you can serialize
  the complete zip:file with all the "static" entries (the
  entries that won't change in the second transform/pass).
  And you can even write the corresponding ZIP file, and use
  zip:update-entries() to create the final ePub file as a
  copy, except for a few modified entries.

Anyway, regardless of whether it is a single transform with
two passes or several transforms, a priori I would in the first
one just extract the entries I need to get the intermediary
format (here XHTML) and use the ZIP file again in the second one
to either update it with zip:update-entries() or extract specific
entries to be output to the final ePub file.

But more information about the architecture of this transform
would help here.

Martynas Jusevicius

Jan 9, 2010, 8:41:11 AM
to exp...@googlegroups.com
Hey all,

I'm doing a 2-pass transform.
1st one transforms ODT to XHTML using a 3rd party stylesheet.
2nd splits the XHTML into several files and packs it into ePub using
EXPath. It also has to access the original ODT, e.g. for additional
info from styles.xml or, like in this situation, to copy all the
pictures.

If I'm running it from the command line then the XHTML gets serialized
to disk; from Java, however, it runs through streams.

Is this of any help? I'll try Philip's approach. Basically I need to
copy all the files from the Pictures/ folder. I'm not used to working
with binary data in XSLT.

I'm doing this for a client, so I'll have to clear the open-sourcing
possibilities with them.

Martynas

Martynas Jusevicius

Jan 9, 2010, 11:12:00 AM
to exp...@googlegroups.com
Hey again,

well I did some experiments, and this doesn't work:

<zip:entry name="{$img-name}">
   <xsl:value-of select="zip:binary-entry('file.odt',
      concat('Pictures/', $img-name)) cast as xs:base64Binary"/>
</zip:entry>

Gives the following error:
Handle entry: OEBPS/pictures/10000000000001900000002CD33A5186.png/
net.sf.saxon.trans.XPathException: Non-whitespace text nodes are not allowed
Same with copy-of.

I tried adding output="binary", but it seems it's not supported:
net.sf.saxon.trans.XPathException: zip:entry/@output has incorrect value: binary

How do I go about this?
And isn't there some API documentation? It's hard to figure out what
the parameters are.

Martynas

Florent Georges

Jan 9, 2010, 1:38:20 PM
to exp...@googlegroups.com
2010/1/9 Martynas Jusevicius wrote:

Hi Martynas,

> I'm doing a 2-pass transform.
> 1st one transforms ODT to XHTML using a 3rd party stylesheet.
> 2nd splits the XHTML into several files and packs it into ePub
> using EXPath. It also has to access the original ODT, e.g. for
> additional info from styles.xml or, like in this situation, to
> copy all the pictures.

> If I'm running it from command line then XHTML gets serialized
> on the disk, however from Java it's running through streams.

Ok, so if I am right, there are actually two distinct
transforms.

> Is this of any help? I'll try Philip's approach. Basically I
> need to copy all the files from the Pictures/ folder.

I think the basic approach here, because you have access to the
ODT file within the second transform, is to copy the images
directly within the zip:file in the second transform:

<!-- the structure of the ODT zip file -->
<xsl:variable name="odt" select="zip:entries('input.odt')"/>

<!-- the entries in its Pictures/ dir -->
<xsl:variable name="pictures" select="
    $odt/zip:dir[@name eq 'Pictures']/zip:entry"/>

<!-- to adapt to your actual output structure -->
<zip:file href="output.epub">
   <!-- other stuff, e.g. from XHTML -->
   ...
   <!-- the new pics dir -->
   <zip:dir name="new-pics">
      <xsl:for-each select="$pictures">

         <!-- the entry descriptor for the pic in epub -->
         <zip:entry name="{ @name }" output="base64">
            <!-- the content of the pic in odt -->
            <xsl:sequence select="
                zip:binary-entry(
                  'input.odt',
                  concat('Pictures/', @name)
                )"/>
         </zip:entry>

      </xsl:for-each>
   </zip:dir>
</zip:file>

Unfortunately, this does not work with the 0.1 version (which I
guess is the one you have) because of a bug. If you want to test
it before a new release, you can try the following JAR file:

http://www.fgeorges.org/tmp/expath-zip-saxon.jar
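
For readers who drive the pipeline from Java, the effect of the loop above (copy every entry of one directory of a ZIP file into a differently named directory of a new ZIP file) can also be sketched with the standard java.util.zip classes instead of the EXPath module. This is only an illustration under assumptions: ZipDirCopy and copyDir are hypothetical names, the directory prefixes are supplied by the caller, and nothing here writes the other entries an ePub needs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of the EXPath ZIP module.
class ZipDirCopy {

    /**
     * Copies every file entry whose name starts with fromDir in sourceZip
     * into a new ZIP file targetZip, replacing the fromDir prefix with toDir.
     * Both prefixes are expected to end with '/'. Returns the number of
     * entries copied.
     */
    static int copyDir(Path sourceZip, String fromDir, Path targetZip, String toDir)
            throws IOException {
        int copied = 0;
        try (ZipFile in = new ZipFile(sourceZip.toFile());
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(targetZip))) {
            Enumeration<? extends ZipEntry> entries = in.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Skip directory entries and anything outside the source directory
                if (entry.isDirectory() || !name.startsWith(fromDir)) {
                    continue;
                }
                out.putNextEntry(new ZipEntry(toDir + name.substring(fromDir.length())));
                try (InputStream data = in.getInputStream(entry)) {
                    data.transferTo(out); // requires Java 9 or later
                }
                out.closeEntry();
                copied++;
            }
        }
        return copied;
    }
}
```

A call such as ZipDirCopy.copyDir(Paths.get("input.odt"), "Pictures/", Paths.get("pictures-only.zip"), "new-pics/") would mirror the directory names used in the stylesheet fragment; the output file name is made up for the example.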

Florent Georges

Jan 9, 2010, 2:00:36 PM
to exp...@googlegroups.com
2010/1/9 Martynas Jusevicius wrote:

Martynas,

Thanks for your experiments and for taking time to report them
(and the bugs you discover)!

> <zip:entry name="{$img-name}">
> <xsl:value-of select="zip:binary-entry('file.odt',
> concat('Pictures/', $img-name)) cast as xs:base64Binary"/>
> </zip:entry>

> Gives the following error:
> Handle entry: OEBPS/pictures/10000000000001900000002CD33A5186.png/
> net.sf.saxon.trans.XPathException: Non-whitespace text nodes
> are not allowed
> Same with copy-of.

Yes, this is because the version you use wrongly thinks the
above entry is a directory entry (because it does not have an
@output). And text is not allowed within a zip:dir.

> I tried adding output="binary", but it seems it's not
> supported:
> net.sf.saxon.trans.XPathException: zip:entry/@output has
> incorrect value: binary

Actually, a binary entry can be of two types: base64 binary or
hex binary. So the value of @output is either "base64" or "hex".
As zip:binary-entry() always returns a base64 item, you can use
output="base64".
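
To make the two values concrete: base64 and hex are two textual encodings of the same bytes (the lexical forms of xs:base64Binary and xs:hexBinary). A small Java sketch, using only the JDK and a hypothetical class name, with the first four bytes of a PNG file as sample data:

```java
import java.util.Base64;

// Hypothetical demo class; it only illustrates the two encodings.
class BinaryLexicalForms {

    /** Base64 text, the lexical form of xs:base64Binary. */
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Hex text, the lexical form of xs:hexBinary: two hex digits per byte. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] pngMagic = {(byte) 0x89, 0x50, 0x4E, 0x47};
        System.out.println(toBase64(pngMagic)); // prints iVBORw==
        System.out.println(toHex(pngMagic));    // prints 89504E47
    }
}
```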

But I think the version you use has a bug that prevents
output="base64" from working. If you want to test it, you can use
the following JAR file instead (be sure to read my previous email
about it first):

http://www.fgeorges.org/tmp/expath-zip-saxon.jar

> And isn't there some API documentation? It's hard to figure out
> what the parameters are.

Yes, I understand. This module was first written as a proof of
concept, and the spec has not been written down yet. For now,
until I write down more, the only doc is the paper I presented at
Balisage[1]. And this mailing list.

If you are ready to help even more, I can give you access to
the wiki (i.e. to write some documentation) and/or give you more
info about how we can collaborate to write the spec.

About possible changes in the near future, from what I presented
at Balisage, I think the biggest change to make is to base the
serialization options on the XSLT & XQuery Serialization spec, as
for the HTTP Client. And maybe allowing an implementation-defined
item type "binary" besides "base64" and "hex", to allow an
implementation to use its own binary items support (as in
MarkLogic for instance).

Thanks for the report,

--
Florent Georges
http://www.fgeorges.org/

[1] http://www.balisage.net/Proceedings/vol3/html/Georges01/BalisageVol3-Georges01.html

Martynas Jusevicius

Jan 12, 2010, 10:50:46 AM
to exp...@googlegroups.com
Thanks, this worked perfectly!

If you have ideas on how we could collaborate, drop me a line at
martynas....@gmail.com

Martynas

Florent Georges

Jan 12, 2010, 12:03:05 PM
to exp...@googlegroups.com
Hi Martynas,

I am glad your last problem was solved. About collaboration,
well, any good will is welcome :-) That depends on the time you
are ready to spend and what you would like to do.

I guess you are interested in the ZIP module. Besides the
implementation for Saxon and the presentations for Prague and
Balisage (as well as a few examples) there is still everything to
create... Some random tasks that need to be done:

- writing some documentation and examples. The wiki seems to
  be a good place to get something useful quickly; it can be
  just writing down what you find out when you use the module
  functions or when you ask for help on the mailing list;

- writing a first draft of the module spec. In my humble opinion,
  this is the next step to take. That means describing the
  module at a glance, each of its functions, their parameters
  and their side-effects, etc. There are also a few remarks
  on the mailing list we should take into account;

- writing a test suite. I think this is very important, but
  at the same time I guess that should be done after a first
  spec draft, even if that could be done before;

- writing and/or maintaining an implementation, or trying to
  convince the vendor of your processor to do so ;-)

If you want to pick one of them, just give it a try, and let us
know what you did. If you need write access to the wiki, please
tell me in private. If you want more info about the technical
aspects of writing a draft, tell me and I will send you the
stylesheets and I will help you to set your tools up.

And if you want to help in any other way, you're welcome :-)

If you are using this module in a commercial project, I think
writing a first draft is definitely the next step. That's what
will help to get a more stable interface, as well as to maybe end
up convincing other vendors to support this module, get more
users using it, etc. All that leads to a better quality of the
module...

Regards,

--
Florent Georges
http://www.fgeorges.org/

Martynas Jusevicius

Jan 15, 2010, 2:06:07 PM
to exp...@googlegroups.com
Btw, there is still no way to avoid serialization to file when calling
from Java?

Florent Georges

Jan 20, 2010, 10:57:02 AM
to exp...@googlegroups.com
2010/1/15 Martynas Jusevicius wrote:

Hi Martynas,

> Btw, there is still no way to avoid serialization to file when
> calling from Java?

I am not sure exactly which serialization you are talking about.
If this is the temporary serialization of the images in the initial
ZIP file to pass them from the 1st to the 2nd pass, I think you
should avoid it, as you can instead access them directly within
the 2nd pass.

If this is the temporary serialization of the result of the 1st
transform to feed the input of the 2nd one, you can avoid it by
streaming the output of the 1st one directly to the input of the
2nd one (see your processor or API documentation for details).
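
The second case can be sketched with the standard JAXP API, assuming the processor's TransformerFactory supports SAX pipelines (the JDK's built-in one does; for other processors, check their documentation). TwoPassPipeline and run are hypothetical names, and the stylesheets are passed as strings only to keep the sketch short. The first transform sends its result as SAX events straight into the second one, so no intermediate file is written.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: chains two stylesheets without a temporary file.
class TwoPassPipeline {

    static String run(String firstXslt, String secondXslt, String inputXml)
            throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        if (!factory.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new TransformerException("This factory does not support SAX pipelines");
        }
        SAXTransformerFactory saxFactory = (SAXTransformerFactory) factory;

        // 2nd pass: a SAX content handler that serializes its own result
        TransformerHandler secondPass =
                saxFactory.newTransformerHandler(new StreamSource(new StringReader(secondXslt)));
        StringWriter output = new StringWriter();
        secondPass.setResult(new StreamResult(output));

        // 1st pass: its result is delivered as SAX events to the 2nd pass
        Transformer firstPass =
                saxFactory.newTransformer(new StreamSource(new StringReader(firstXslt)));
        firstPass.transform(new StreamSource(new StringReader(inputXml)),
                new SAXResult(secondPass));

        return output.toString();
    }
}
```

In a real setup the StreamSource and StreamResult objects would wrap files or streams rather than strings; the chaining itself stays the same.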

Hope that helps, regards,

Martynas Jusevicius

Jan 20, 2010, 4:26:17 PM
to exp...@googlegroups.com
Hey Florent,

going back to copying files again...
How do you go about inserting binary files from the filesystem as a
ZIP entry?
In the Balisage presentation, I found the <zip:entry name="index.html"
href="/some/file.html"/> syntax, but when I try it, it says that
zip:entry/@href is unknown. Has it changed?
Philip's suggestion with doc() cast as xs:base64Binary doesn't seem to
work either. I wonder if you can get XPath functions working with
binary files at all?

Thanks for help,

Martynas

Florent Georges

Jan 20, 2010, 7:30:20 PM
to exp...@googlegroups.com
Martynas Jusevicius wrote:

> In the Balisage presentation, I found the <zip:entry
> name="index.html" href="/some/file.html"/> syntax, but when I
> try it says that zip:entry/@href is unknown. Has it changed?

Yes. It has been renamed to @src, so that it does not clash with
zip:file/@href and is consistent with http:body/@src in the HTTP
Client module.

I know that the lack of a specification so far makes it impossible
to look anything up in a reference. Sorry about that. If there
is any volunteer... :-)

Regards,

Martynas Jusevicius

Jan 22, 2010, 8:26:02 AM
to exp...@googlegroups.com
And it seems it's expecting a filesystem path, not a URI?
Because it chokes if I pass something like 'file:/C:/whatver/img.png',
but works fine if I strip off the 'file:'.
Isn't a URI more appropriate?

I know I should get myself together and find some time to help you out...

Florent Georges

Jan 22, 2010, 11:39:29 AM
to exp...@googlegroups.com
2010/1/22 Martynas Jusevicius wrote:

> And it seems it's expecting a filesystem path, not an URI?
> Because it chokes if I pass smth like 'file:/C:/whatver/img.png', but
> works fine if I strip off the 'file:'.
> Isn't URI more appropriate?

Yes. This is a limitation (no, this is a bug) in the current
implementation for Saxon. I will have to fix it. Do not hesitate to
remind me in the coming days if I do not give any news on that.
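
For illustration, the conversion such a fix needs (accept either a file: URI or a plain filesystem path) might look like this in Java. SrcResolver and toPath are hypothetical names, not part of the module, and only the file: scheme is handled; this is a sketch, not the actual patch.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not taken from the Saxon implementation of the module.
class SrcResolver {

    /**
     * Accepts either a file: URI (e.g. "file:/C:/dir/img.png") or a plain
     * filesystem path, and returns the corresponding Path.
     */
    static Path toPath(String src) {
        if (src.startsWith("file:")) {
            // Let the platform's file system provider decode the URI
            return Paths.get(URI.create(src));
        }
        return Paths.get(src);
    }
}
```

Relative references would still have to be resolved against a base URI before a call like this.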

> I know I should get myself together and find some time to help you out...

Reporting bugs and using the extensions is already helping. But any
help is welcome ;-)

Florent Georges

Jan 24, 2010, 7:55:16 PM
to exp...@googlegroups.com
2010/1/22 Florent Georges wrote:
> 2010/1/22 Martynas Jusevicius wrote:

>> And it seems it's expecting a filesystem path, not an URI?
>> Because it chokes if I pass smth like 'file:/C:/whatver/img.png', but
>> works fine if I strip off the 'file:'.
>> Isn't URI more appropriate?

>  Yes.  This is a limitation (no, this is a bug) in the current implem
> for Saxon.  I will have to fix it.

@src and @href are resolved against the base URI. But for the
functions that extract entries, the URI is given as an item (not as
part of an element) and I am not sure it is possible to get the base
URI in Saxon in this case (at least not without integrated extension
functions).

I've updated a temporary JAR at:

http://fgeorges.org/tmp/expath-zip-saxon.jar

Regards,
