Uche, thanks.
Please excuse my ignorance regarding encoding. See my responses inline below.
On Fri, Mar 8, 2013 at 10:11 PM, Uche Ogbuji <
uc...@ogbuji.net> wrote:
> Hi Chime,
> ..snip..
> XML parsing is defined against a stream of bytes, and it's the parser's
> job to decode to the abstract Unicode character model. In our case we
> delegate that to Expat.
Ok.
> I've never come across a case where the right thing was not to encode
> before passing to a parser. If you control the prior output you can just
> omit the XML declaration and encode to UTF-8.
>
> amara.tree.parse(inputsource(u'<Root/>'.encode('utf-8')))
Ok. So my pipeline is basically like this:
@xslt_rest( .. )
def AkaraServiceFn(..):
    # .. fetch database_results from store as unicode objects ..
    src = StringIO()  # cStringIO, not StringIO
    w = structwriter(indent=u"yes", stream=src)
    w.feed(
        ROOT(
            E(u'Root',
              (E(u'Element', { ... }) for foo, bar in database_results)
            )
        )
    )
    return src.getvalue()
The function returned by the @xslt_rest decorator has now been changed
to call encode('utf-8') on whatever the service returns if it is a
unicode object, or to pass it through unchanged otherwise. The result
is then handed on as the source to the transform method.
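Roughly, the wrapper now does something like the following (a sketch only:
the helper name ensure_utf8 is mine, and I've written it against Python 3's
str type, which plays the role unicode plays in our Python 2 code):

```python
def ensure_utf8(result):
    # Hypothetical helper sketching what the decorated wrapper does.
    # Under Python 2 the isinstance check would be against unicode.
    if isinstance(result, str):
        return result.encode('utf-8')
    return result  # already a byte string: use it as-is
```

So e.g. ensure_utf8(u'<Root/>') comes back as the UTF-8 byte string
b'<Root/>', while byte-string results are untouched.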
This seems to be working fine. The non-ASCII characters fetched from
the store, streamed into the cStringIO object, and encoded to UTF-8
prior to transformation appear correctly in the output. However, with
regard to your comment below, I suspect this is not the *most*
efficient way to leverage the streaming capabilities here.
> Remember Python's streaming encoders and decoders, which you might want to
> integrate into your overall pipeline.
So I'm assuming you are not referring to either StringIO or cStringIO
(the latter being more efficient, though its documentation admits it is
not "able to accept Unicode strings that cannot be encoded as plain
ASCII strings"):
>>> from StringIO import StringIO
>>> stream = StringIO()
>>> stream.write(u'<Root/>')
>>> print amara.tree.parse(inputsource(stream))
Traceback (most recent call last):
..snip..
amara.ReaderError: In urn:uuid:4291d0fe-4956-44bf-b249-faec8b08fa87,
line 1, column 0: no element found
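Incidentally, I wonder whether part of that failure is simply that the
write leaves the StringIO position at EOF, so the parser reads an empty
stream, which would explain "no element found". A quick stdlib-only
check (using Python 3's io module here just for illustration):

```python
import io

stream = io.StringIO()
stream.write(u'<Root/>')

# After the write, the file position sits at EOF, so any consumer
# (such as a parser) that calls read() gets back an empty string.
print(repr(stream.read()))   # -> ''

# Rewinding first lets the full document be read.
stream.seek(0)
print(repr(stream.read()))   # -> '<Root/>'
```

If that's right, a stream.seek(0) before handing the stream to
inputsource might be all the example above was missing.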
Or perhaps I'm not using StringIO/cStringIO in the way Amara expects
WRT an "open-file-like object".
In the best case (beyond my current workaround), I would like the
structwriter to write into a stream that can handle non-ASCII
characters, and then pass that stream on to be wrapped as an Amara
input source, without incurring the cost of serializing the entire
content before encoding it to UTF-8.
Unless I've misunderstood the role of the stream, I would think this is
especially preferable when database_results is a generator of results
from the store (as it is in this case).
Perhaps I should instead be using codecs (the "Codec registry and base
classes" module) for this scenario?
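I'm picturing something like wrapping a byte buffer in a streaming
UTF-8 writer, so each chunk the structwriter emits is encoded as it is
written rather than in one pass at the end. A stdlib-only sketch
(whether structwriter will happily write to such a stream is exactly my
question):

```python
import codecs
import io

# Byte buffer the encoded output accumulates into; in the real pipeline
# this would be whatever stream feeds the Amara input source.
buf = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the named codec;
# instantiating it over the buffer gives a file-like object that
# encodes unicode text to UTF-8 incrementally as it is written.
utf8_stream = codecs.getwriter('utf-8')(buf)

utf8_stream.write(u'<Root>\u00e9l\u00e9ment</Root>')
print(buf.getvalue())  # UTF-8 bytes, encoded as they were written
```

That would keep the generator-driven feed and the encoding in a single
streaming pass, with no intermediate all-unicode serialization.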
Thanks.
-- Chime