Using amara.xslt.transform with a unicode source?

Chime

Mar 8, 2013, 8:50:15 PM
to ak...@googlegroups.com
Hello,

I'm using Amara's transform function to implement a decorator for Akara services that transforms what a service returns using an XSLT stylesheet.  This is part of the Akamu [1] project.  In particular, the web applications I need this for involve non-ASCII characters that originate from the server's RDF store, are streamed into XML using amara.writers.struct.structwriter, and are returned (as unicode) from the Akara service.

When the service is decorated with @xslt_rest, the result is transformed using amara.xslt.transform and the response is sent to the client.  Because of this pipeline, I basically need to pass a unicode object to transform so it can handle the non-ASCII characters.  However, it looks like there is a discrepancy between the current version of Amara and a previous version that prevents the current one from handling a unicode source document.

I keep getting an amara.ReaderError exception when I invoke the service and return a unicode object for the transformation.  It is being raised at line 236 of lib/xslt/processor.py [2].

The docstring for the transform method says that the source argument can be an "XML source document in the form of a string (not Unicode object)".  Looking at the transform implementation, it instantiates an inputsource with the source as its only argument and passes this on to the run method, which calls amara.tree.parse on the input source.  However, the inputsource docstring says the first argument is "a string, Unicode object (only if you really know what you're doing)."

I'm able to reproduce it with 2 simple cases:

$ python -c "import amara; from amara.lib import inputsource;print amara.tree.parse(inputsource('<Root/>'));print amara.__version__"
<entity at 0xee25d0: 1 children>
2.0.0a6
$ python -c "import amara; from amara.lib import inputsource;print amara.tree.parse(inputsource(u'<Root/>'));print amara.__version__"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/chimezie/lib/python2.6/amara/tree.py", line 69, in parse
    return _parse(inputsource(obj, uri), flags, entity_factory=entity_factory,rule_handler=rule_handler)
amara.ReaderError: In urn:uuid:24c7cbe4-bf65-4352-860e-35d77530f030, line 1, column 1: not well-formed (invalid token)

I didn't notice this until I did a fresh install of Amara on a new server.

Apparently, the version I had been using on my local machine doesn't have this limitation and is an earlier version:

$ python -c "import amara; from amara.lib import inputsource;print amara.tree.parse(inputsource(u'<Root/>'));print amara.__version__"
<entity at 0x1017bca50: 1 children>
2.0a5

Is this by design? If so, what is the right way to use the transform function on a unicode object?  Should I install the earlier version of Amara via pip install? 

I didn't see any mention of this on the Wiki.

Thanks

-- Chime

Uche Ogbuji

Mar 8, 2013, 10:11:48 PM
to ak...@googlegroups.com
Hi Chime,

I'm surprised that any version of Amara parses a unicode object at all.  We've always said (and most XML libraries say): Don't Do That.  Even the Amara 1.x manual said clearly:

<para>You can pass <command>amara.parse</command> a string (<emphasis>not Unicode object</emphasis>) with the
XML content, an open-file-like object, a file path or a URI.</para>

XML parsing is defined against a stream of bytes, and it's the parser's job to decode to the abstract Unicode character model.  In our case we delegate that to Expat.

I've never come across a case where the right thing was not to encode before passing to a parser.  If you control the prior output you can just omit the XML declaration and encode to UTF-8.

amara.tree.parse(inputsource(u'<Root/>'.encode('utf-8')))

Remember Python's streaming encoders and decoders, which you might want to integrate into your overall pipeline.
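
For example, something along these lines (just an untested sketch; unicode_chunks and write_bytes are placeholders for whatever your pipeline actually yields and consumes):

import codecs

encoder = codecs.getincrementalencoder('utf-8')()
for chunk in unicode_chunks:            # unicode objects produced upstream
    write_bytes(encoder.encode(chunk))  # each call emits UTF-8 byte strings
write_bytes(encoder.encode(u'', final=True))  # flush (a no-op for UTF-8, but good form)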


--
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
http://wearekin.org
http://www.thenervousbreakdown.com/author/uogbuji/
http://copia.ogbuji.net
http://www.linkedin.com/in/ucheogbuji
http://twitter.com/uogbuji

Chimezie Ogbuji

Mar 9, 2013, 12:48:28 PM
to ak...@googlegroups.com
Uche, thanks.

Please excuse my ignorance regarding encoding. See my responses inline below.

On Fri, Mar 8, 2013 at 10:11 PM, Uche Ogbuji <uc...@ogbuji.net> wrote:
> Hi Chime,
> ..snip..
> XML parsing is defined against a stream of bytes, and it's the parser's
> job to decode to the abstract Unicode character model. In our case we
> delegate that to Expat.

Ok.

> I've never come across a case where the right thing was not to encode
> before passing to a parser. If you control the prior output you can just
> omit the XML declaration and encode to UTF-8.
>
> amara.tree.parse(inputsource(u'<Root/>'.encode('utf-8')))

Ok. So my pipeline is basically like this:

@xslt_rest( .. )
def AkaraServiceFn(..):
    # .. fetch database_results from store as unicode objects ..
    src = StringIO()  # cStringIO, not StringIO
    w = structwriter(indent=u"yes", stream=src)
    w.feed(
        ROOT(
            E(u'Root',
                ( E(u'Element', { ... }) for foo, bar in database_results )
            )
        )
    )
    return src.getvalue()

The function returned by the @xslt_rest decorator has now been changed
to invoke encode('utf-8') on whatever the service returns if it is a
unicode object, and to use it as-is otherwise. This is then passed on
as the source to the transform method.
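
Roughly like this, inside the decorator (a simplified sketch, not the
actual Akamu code; xslt_source stands in for however the stylesheet is
supplied):

def wrapper(*args, **kwargs):
    result = func(*args, **kwargs)
    if isinstance(result, unicode):
        result = result.encode('utf-8')  # byte string, so Expat does the decoding
    return transform(result, xslt_source)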

This seems to be working fine. The non-ASCII characters fetched from
the store, streamed into the cStringIO object, and encoded to UTF-8
prior to transformation appear properly in the output. However, with
regard to your comment below, I suspect this is not necessarily the
*most* efficient way to leverage streaming capabilities here.

> Remember Python's streaming encoders and decoders, which you might want to
> integrate into your overall pipeline.

So, I'm assuming you are not referring to either StringIO or cStringIO
(the latter being more efficient, though its documentation admits it is
not "able to accept Unicode strings that cannot be encoded as plain
ASCII strings"):

>>> from StringIO import StringIO
>>> stream = StringIO()
>>> stream.write(u'<Root/>')
>>> print amara.tree.parse(inputsource(stream))
Traceback (most recent call last):
..snip..
amara.ReaderError: In urn:uuid:4291d0fe-4956-44bf-b249-faec8b08fa87,
line 1, column 0: no element found

Or perhaps I'm not using StringIO/cStringIO in the way Amara expects
with respect to an "open-file-like object".

In the best case scenario (beyond my current workaround) I would like
the structwriter to write into a stream that can handle non-ASCII
characters and then pass that on to be encapsulated as an amara input
source, without incurring the cost of 'serializing' the entire content
prior to encoding to UTF-8.

Unless I've misunderstood the role of the stream, I would think this
is especially preferable if database_results is a generator of results
from the store (as it is in this case).

Perhaps I should be using codecs (the codec registry and base classes
module) instead for this scenario? Something like this, maybe:
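
(An untested sketch; I'm guessing at how the codecs StreamWriter would
fit into the pipeline here.)

import codecs
from cStringIO import StringIO

# the wrapper accepts unicode and stores UTF-8 bytes underneath
src = codecs.getwriter('utf-8')(StringIO())
w = structwriter(indent=u"yes", stream=src)
# ... w.feed(...) as before; each unicode chunk is encoded as it is
# written, so there's no final encode() pass over the whole document
body = src.getvalue()  # getvalue() is proxied to the underlying cStringIO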

Thanks.

-- Chime

Uche Ogbuji

Mar 9, 2013, 12:56:40 PM
to ak...@googlegroups.com
On Sat, Mar 9, 2013 at 10:48 AM, Chimezie Ogbuji <chim...@gmail.com> wrote:
> This seems to be working fine.  The non-ASCII characters fetched from
> the store, streamed into the cStringIO object, and encoded to UTF-8
> prior to transformation appear properly in the output.  However, with
> regard to your comment below, I suspect this is not necessarily the
> *most* efficient way to leverage streaming capabilities here.

>> Remember Python's streaming encoders and decoders, which you might want to
>> integrate into your overall pipeline.

> So, I'm assuming you are not referring to either StringIO or cStringIO
> (the latter being more efficient, though its documentation admits it is
> not "able to accept Unicode strings that cannot be encoded as plain
> ASCII strings"):

Ah no.  I mean the IncrementalEncoder & IncrementalDecoder objects that come with Python Unicode codecs.


Alas, cStringIO is pretty broken in Python 2.x with regard to Unicode.  This is fixed (along with a lot of other Unicode ugliness) in Python 3.x, and especially in 3.3.  You can usually get by with extra, extra care, which means encoding before writing to cStringIO, as you say.  But IncrementalEncoder & IncrementalDecoder might be more efficient, depending on your typical pipeline. Some testing with timeit might not be a bad idea.
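
e.g. something like this (a shell sketch just to show the shape of the comparison; in practice you'd time chunks representative of your actual pipeline):

$ python -m timeit -s "s = u'r\u00e9sum\u00e9' * 1000" "s.encode('utf-8')"
$ python -m timeit -s "import codecs; s = u'r\u00e9sum\u00e9' * 1000; e = codecs.getincrementalencoder('utf-8')()" "e.encode(s)"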



> Perhaps I should be using codecs (the codec registry and base classes
> module) instead for this scenario?

It feels that way, yeah.

