UTF-8 with BOM not supported?

ahenket

Jun 24, 2013, 6:17:23 AM
to orb...@googlegroups.com
Orbeon Forms 3.9.0.201105152046 CE

I've used the exact code from the HowTo
<http://wiki.orbeon.com/forms/how-to/view/upload-and-download-instance> to
upload XML:

    saxon:parse(saxon:base64Binary-to-string(xs:base64Binary(instance('upload')), 'UTF-8'))

Whenever I upload UTF-8 encoded XML with a BOM, I get "no content allowed in
prolog".

Bug or feature? How to avoid? Fixing all input at workplaces around the
world is not feasible.
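
For anyone trying to reproduce this outside Orbeon, here is a minimal sketch in Python (not the Saxon code path, just an illustration of the byte-level effect). It assumes a hypothetical upload consisting of a tiny XML document saved with a UTF-8 BOM. Decoding the bytes as plain UTF-8 keeps the BOM as the character U+FEFF in front of the XML declaration, which is presumably the "content" the parser then finds in the prolog.

```python
import codecs

# Hypothetical upload: a tiny XML document saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0"?><root/>'

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
plain = raw.decode("utf-8")

# Python's "utf-8-sig" codec drops a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

print(repr(plain[:6]))     # shows the escaped U+FEFF before "<?xml"
print(repr(stripped[:5]))  # starts directly with "<?xml"
```

The two strings differ only in that first character, which most editors do not display.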

--
View this message in context: http://discuss.orbeon.com/UTF-8-with-BOM-not-supported-tp4656932.html
Sent from the Orbeon Forms community mailing list archive at Nabble.com.

Alessandro Vernet

Jun 25, 2013, 2:55:30 PM
to orb...@googlegroups.com
Hi ahenket,

saxon:parse() expects XML, and if what people upload isn't XML, it just
won't work. Now, I am wondering if something else could be happening. Could
you maybe add the following somewhere in your form to see what that value
looks like?

    <xf:output value="saxon:base64Binary-to-string(xs:base64Binary(instance('upload')), 'UTF-8')"/>

Is it proper XML? If it looks to you like it is, but saxon:parse() fails,
could you share with us a specific example of that XML, so we can reproduce
the issue?

Alex

-----
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet

ahenket

Jun 25, 2013, 4:36:01 PM
to orb...@googlegroups.com
Hi. It's absolutely XML. OxygenXML is my tool of choice for editing
XML/XQuery etc. and it is set to be very picky. I validated the files in
Oxygen before uploading and it gave no errors. I removed the 3 UTF-8 BOM
bytes with a hex editor (I found out later that you can instruct Oxygen to
remove the UTF-8 BOM upon save) and then uploaded without any problem.
There's no question that the BOM was the only thing between me and a
successful upload.


Alessandro Vernet

Jun 26, 2013, 9:12:24 PM
to orb...@googlegroups.com
Hi ahenket,

Indeed, looks like a bug in saxon:base64Binary-to-string() to me. Since that
function is UTF-8 aware, it should know how to interpret the BOM. Even if
this is an issue with Saxon (at least the version we're using), I added an
issue against Orbeon Forms:

https://github.com/orbeon/orbeon-forms/issues/1093

In the meantime, you can manually strip the BOM in XForms if present, as
done in this example: view.xhtml
<http://discuss.orbeon.com/file/n4656951/view.xhtml>. I also copied the
relevant part here:

    <xf:var name="dec" value="saxon:base64Binary-to-octets(xs:base64Binary(.))"/>
    <xf:var name="has-bom" value="$dec[1] = 239 and $dec[2] = 187 and $dec[3] = 191"/>
    <xf:bind ref="." type="xs:base64Binary" calculate="if ($has-bom) then saxon:octets-to-base64Binary($dec[position() > 3]) else ."/>
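
To make the octet-level logic of the snippet above easier to test in isolation, here is a rough Python equivalent. It is only a sketch: the function name strip_bom_base64 and the sample payloads are made up for illustration and are not part of Orbeon or Saxon. The steps are the same: decode the base64 value to bytes, check whether the first three bytes are 239, 187, 191, drop them if so, and re-encode to base64.

```python
import base64

# The UTF-8 BOM as octets: 239, 187, 191 (0xEF 0xBB 0xBF).
UTF8_BOM = bytes([239, 187, 191])


def strip_bom_base64(b64_text: str) -> str:
    """Return the base64 value with a leading UTF-8 BOM removed, if present."""
    octets = base64.b64decode(b64_text)
    if octets[:3] == UTF8_BOM:
        octets = octets[3:]
    return base64.b64encode(octets).decode("ascii")


# Hypothetical uploads: the same document with and without a BOM.
with_bom = base64.b64encode(UTF8_BOM + b"<root/>").decode("ascii")
without_bom = base64.b64encode(b"<root/>").decode("ascii")

print(strip_bom_base64(with_bom) == without_bom)
```

A value that has no BOM passes through unchanged, so the function can be applied to every upload.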

Alex


ahenket

Aug 9, 2013, 10:38:21 AM
to orb...@googlegroups.com

Hi, thanks for the workaround. I'll be on holiday for 3 weeks so I'll get back to it afterwards most likely.

Alexander


Alessandro Vernet

Aug 11, 2013, 4:35:09 PM
to orb...@googlegroups.com
Hi Alexander,

Sure, there is of course no rush at all; let us know when you get a chance
to test this.

Alex


ahenket

Mar 29, 2014, 5:01:21 PM
to orb...@googlegroups.com
This obviously fell off my radar. We decided to go a different but similar
route and solve this in XQuery, as Saxon under eXist-db has the exact same
issue, so we need the workaround deeper down.

    let $file-data := if (request:exists()) then request:get-data() else ()
    let $update :=
        if (not(empty($file-data))) then
            (: Hack alert: upload fails when content has a UTF-8 Byte Order Mark.
               The UTF-8 representation of the BOM is the byte sequence
               0xEF,0xBB,0xBF, which decodes to the single codepoint 65279 (U+FEFF). :)
            let $file-content := util:base64-decode($file-data/content)
            let $content-no-bom :=
                if (string-to-codepoints(substring($file-content, 1, 1)) = 65279)
                then substring($file-content, 2)
                else $file-content
            let $store := xmldb:store($messageStoragePath, encode-for-uri($filename), $content-no-bom)
            return $store
        else ()
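
The same codepoint-level check can be sketched in Python for comparison. This is an illustration only: strip_leading_bom is a hypothetical helper, and the sample content is made up. Unlike the octet-based XForms workaround, this version runs after decoding, so it looks for a single leading character with codepoint 65279 rather than three bytes.

```python
def strip_leading_bom(text: str) -> str:
    """Drop one leading U+FEFF (codepoint 65279) from an already decoded string."""
    if text and ord(text[0]) == 65279:
        return text[1:]
    return text


# Hypothetical upload: the bytes 0xEF 0xBB 0xBF followed by a small document,
# decoded as plain UTF-8 so the BOM ends up as the first character.
content = b"\xef\xbb\xbf<message/>".decode("utf-8")

print(strip_leading_bom(content))
```

The empty-string check keeps the helper safe for empty uploads, which the substring-based XQuery version handles implicitly.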


Alessandro Vernet

Mar 30, 2014, 6:13:32 PM
to orb...@googlegroups.com
Hi Alexander,

I'm glad doing this in eXist works for you. BTW, have you tried asking Mike
Kay, the Saxon author, about this? (If you haven't already, the saxon-help
mailing list would be a good place.)

Alex
