XML load - utf8 - illegal char complaints for certain valid Euro chars


Jack G. Conrad

Jun 4, 2014, 5:36:44 PM
to scala...@googlegroups.com
I'm running Scala 2.11.1 on a MacBook Pro and loading downloaded XML files (in Spanish) that are UTF-8 encoded.

JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8


In a script run via scalac

val xml = XML.loadFile(filename)


results in "error: illegal character", *apparently* for just a few characters: « » “ ” and "\u2003" (em space).

Yet these are all valid characters in UTF-8.

What's going on?   --JGC

Som Snytt

Jun 4, 2014, 7:18:25 PM
to Jack G. Conrad, scala-user
Without asking whether JAVA_TOOL_OPTIONS is honored by the script runner, I can see that file.encoding is not honored by XML.load.

You'd expect load to setEncoding, but it doesn't.

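import scala.xml._
// Show the JVM default encoding for reference.
Console println sys.props("file.encoding")
// loadFile sets no encoding on its InputSource, so file.encoding plays no part here.
Console println XML.loadFile(args(0))
// Force file.encoding onto the InputSource to see what honoring it would do.
Console println XML.loadXML( {
  val is = new org.xml.sax.InputSource(new java.io.FileInputStream(args(0)))
  is setEncoding sys.props("file.encoding")
  is }, XML.parser)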


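I think this shows the first call, loadFile, succeeding (though the non-ASCII characters in the output are munged to ? when printed under US-ASCII), while the second call, loadXML with the encoding forced, fails.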

$ scala -Dfile.encoding=US-ASCII -nc hw.scala hw.txt
US-ASCII
<html>
  <h3>?hello, world?</h3>
</html>
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Byte "194" is not a member of the (7-bit) ASCII character set.
    at com.sun.org.apache.xerces.internal.impl.io.ASCIIReader.read(ASCIIReader.java:158)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1762)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1638)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1676)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:196)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
    at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:41)
    at scala.xml.XML$.loadXML(XML.scala:60)
    at Main$$anon$1.<init>(hw.scala:4)
    at Main$.main(hw.scala:1)
    at Main.main(hw.scala)
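
The byte "194" in the exception is consistent with the file being UTF-8: 194 is 0xC2, the lead byte of « (U+00AB) in UTF-8. A quick check, assuming hw.txt is UTF-8 encoded as the second run below suggests:

```scala
import java.nio.charset.StandardCharsets

// Encode U+00AB (left guillemet) as UTF-8 and show the bytes as unsigned values.
// The escape is used so the result does not depend on the source file's encoding.
val bytes = "\u00ab".getBytes(StandardCharsets.UTF_8).map(_ & 0xff)
println(bytes.mkString(" ")) // 194 171
```

Neither byte is below 128, so a strict 7-bit ASCII reader has to reject the first one it sees.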


Normally, showing the local platform default:

$ scala -nc hw.scala hw.txt
UTF-8
<html>
  <h3>«hello, world»</h3>
</html>
<html>
  <h3>«hello, world»</h3>
</html>
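
One way to sidestep both file.encoding and the parser's own guess is to decode the bytes yourself and hand the parser a Reader. A minimal sketch, assuming the files really are UTF-8 and scala-xml is on the classpath; loadUtf8 is a hypothetical helper name, not part of scala.xml:

```scala
import java.io.{FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.xml.{Elem, XML}

// Hypothetical helper: decode the file as UTF-8 explicitly, then parse
// from the resulting character stream via XML.load(reader).
def loadUtf8(path: String): Elem = {
  val reader = new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8)
  try XML.load(reader) finally reader.close()
}
```

With a character stream, the SAX parser reads the already-decoded characters and disregards any encoding declaration in the document, so the JVM default charset no longer affects the parse.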




