[fcrepo-user] Proia multilingual - java.io.UTFDataFormatException


Dimitris Gavrilis

Nov 23, 2010, 10:16:42 AM
to fedora-com...@lists.sourceforge.net
Hi,

I've set up Fedora with ProAI, and whenever ProAI tries to parse non-English records (Greek) I get a java.io.UTFDataFormatException. I've seen other reports of this problem but haven't managed to find a solution. When I exclude the non-English text, ProAI works fine.

Thanks in advance,
Dimitris.

Dimitris Gavrilis

Nov 24, 2010, 3:31:37 AM
to Support and info exchange list for Fedora users.
Hi Steve,

I'm attaching an xml sample of a record that produces this error.

Thanks,
Dimitris.

On Wed, Nov 24, 2010 at 9:55 AM, Steve Bayliss <stephen...@acuityunlimited.net> wrote:
Hi Dimitris
 
Do you have an example object FOXML file that could be used to reproduce this?
 
Thanks
Steve

------------------------------------------------------------------------------
Increase Visibility of Your 3D Game App & Earn a Chance To Win $500!
Tap into the largest installed PC base & get more eyes on your game by
optimizing for Intel(R) Graphics Technology. Get started today with the
Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs.
http://p.sf.net/sfu/intelisp-dev2dev
_______________________________________________
Fedora-commons-users mailing list
Fedora-com...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users


iid_1_mods.xml

West, Graeme

Nov 25, 2010, 4:44:23 AM
to Support and info exchange list for Fedora users.
Hi Dimitris,
I notice that on the first line, your XML declaration states:

<?xml version="1.0" encoding="UTF8"?>

This should be:
<?xml version="1.0" encoding="UTF-8"?>

ProAI is probably rejecting the documents because of this 'unknown' encoding.

Hope this helps.

Regards,

Graeme West
Digital Repository Developer
Information Services
Glasgow Caledonian University
graem...@gcu.ac.uk



Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems


Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education's Widening Participation Initiative of the Year 2009 and Herald Society's Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Steve Bayliss

Nov 25, 2010, 5:21:33 AM
to Support and info exchange list for Fedora users.
Hi Dimitris

It would certainly be worthwhile trying Graeme's suggestion, although I
suspect that if Fedora didn't determine the correct encoding on ingest then
this would cause problems elsewhere. (In any case you should correct this
incorrect encoding declaration to UTF-8).

I've taken a look at the proai oaiprovider source, and there is some
"unsafe" code in there where the default platform encoding will be used. (eg
FedoraOAIDriver.java line 275)

1) could you provide a full log of the exception (ie the full stack trace)
2) could you try setting the JVM default encoding by using
-Dfile.encoding=utf-8 (eg add this to CATALINA_OPTS)
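
To illustrate why code that relies on the platform default encoding is "unsafe", here is a minimal Java sketch. It is not the oaiprovider code: the class name EncodingMismatchDemo and the sample word are made up for illustration, and it assumes the affected machine defaults to windows-1253 (Greek), which has not been confirmed in this thread. The hard-coded bytes stand in for what String.getBytes() with no charset argument would return on such a platform.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of proai or Fedora.
class EncodingMismatchDemo {

    // Returns true only if the bytes are well-formed UTF-8 (strict decoding,
    // no silent replacement of bad sequences).
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String greek = "\u0394\u03B7\u03BC"; // the Greek letters Delta, eta, mu

        // Safe: the charset is named explicitly.
        byte[] utf8 = greek.getBytes(StandardCharsets.UTF_8);

        // Assumption: what greek.getBytes() would produce if the platform
        // default were windows-1253, i.e. one byte per letter.
        byte[] cp1253 = {(byte) 0xC4, (byte) 0xE7, (byte) 0xEC};

        System.out.println("explicit UTF-8 bytes valid as UTF-8: " + isValidUtf8(utf8));
        System.out.println("windows-1253 bytes valid as UTF-8:   " + isValidUtf8(cp1253));
    }
}
```

In UTF-8 the byte 0xC4 announces a two-byte sequence, and 0xE7 is not a valid continuation byte, so a reader that expects UTF-8 has to reject the second byte array.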


Dimitris Gavrilis

Nov 25, 2010, 6:00:54 AM
to Support and info exchange list for Fedora users.
Dear Steve,

Thanks for your help. I changed the encoding in the XML declaration at the top of the file to UTF-8 as you suggested, but I still get the same error. The file looks fine when accessed through Fedora (http://localhost:8080/fedora/objects/iid:1/datastreams/mods/content).

Below is the error from Fedora's console:


proai.error.ServerException: Error parsing record xml
        at proai.cache.ParsedRecord.<init>(ParsedRecord.java:70)
        at proai.cache.Worker.attempt(Worker.java:111)
        at proai.cache.Worker.run(Worker.java:51)
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
        at proai.cache.ParsedRecord.<init>(ParsedRecord.java:62)
        ... 2 more
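
A failure of this kind can be reproduced outside ProAI with the JDK's own SAX parser. The sketch below is only an illustration under stated assumptions: the class Utf8ParseRepro, the title element, and the sample bytes are hypothetical, and the non-UTF-8 bytes assume a windows-1253 platform default, which the thread does not confirm. The document declares UTF-8 in both cases; only the bytes of the text content differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reproduction class; not part of proai.
class Utf8ParseRepro {

    // Wraps raw text bytes in a tiny document that declares itself UTF-8.
    public static byte[] wrap(byte[] textBytes) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><title>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</title>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(textBytes, 0, textBytes.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    // Returns true if the default SAX parser reads the document without error.
    public static boolean parses(byte[] xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml), new DefaultHandler());
            return true;
        } catch (IOException | SAXException | ParserConfigurationException e) {
            System.out.println("Parse failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Delta, eta, mu encoded as real UTF-8.
        byte[] utf8 = "\u0394\u03B7\u03BC".getBytes(StandardCharsets.UTF_8);
        // The same letters as single windows-1253 bytes (assumption).
        byte[] cp1253 = {(byte) 0xC4, (byte) 0xE7, (byte) 0xEC};

        System.out.println("UTF-8 content parses:        " + parses(wrap(utf8)));
        System.out.println("windows-1253 content parses: " + parses(wrap(cp1253)));
    }
}
```

The exact exception class and message depend on the parser implementation in use, so the sketch only reports whether parsing succeeded.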




Dimitris Gavrilis

Nov 25, 2010, 6:06:20 AM
to Support and info exchange list for Fedora users.
I also added -Dfile.encoding=utf-8 to CATALINA_OPTS as you suggested, but I still get the same error.

Dimitris.


West, Graeme

Nov 25, 2010, 6:11:03 AM
to Support and info exchange list for Fedora users.
Hi Dimitris,
Did this error occur after making the encoding change?

It may be a good idea to stop your servlet container, drop/truncate the tables from the ProAI database, delete the ProAI temporary files directory (by default /tmp/proai ), and then restart your servlet container. This will rebuild the ProAI database completely and ensure that you're not seeing cached errors.

Regards,

Graeme
graem...@gcu.ac.uk




Dimitris Gavrilis

Nov 26, 2010, 4:41:28 AM
to Support and info exchange list for Fedora users.
Hi Steve,

I did delete the tmp/proai folder and truncated the proai database but I still get the same error (see the log below).




proai.error.ServerException: Error parsing record xml
        at proai.cache.ParsedRecord.<init>(ParsedRecord.java:70)
        at proai.cache.Worker.attempt(Worker.java:111)
        at proai.cache.Worker.run(Worker.java:51)
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
        at proai.cache.ParsedRecord.<init>(ParsedRecord.java:62)
        ... 2 more




Steve Bayliss

Nov 26, 2010, 5:02:50 AM
to Support and info exchange list for Fedora users.
Hi Dimitris
 
Just to confirm,
 
- were you specifying the encoding to the Tomcat JVM (ie using -Dfile.encoding=utf-8)?
- which SQL database (and version) are you using?

Dimitris Gavrilis

Nov 26, 2010, 6:29:19 AM
to Support and info exchange list for Fedora users.
Dear Steve,

It finally worked. I think it was the -Dfile.encoding=utf-8 in the JAVA_OPTS.

Thank you very much for your assistance,
Dimitris.

Steve Bayliss

Nov 26, 2010, 7:18:27 AM
to Support and info exchange list for Fedora users.
Hi Dimitris
 
Thanks very much for confirming this.
 
From my inspection of the source, there is some "unsafe" code where the platform default encoding will be used rather than UTF-8, which in some cases could cause this problem (this is particularly likely on Windows).
 
Out of interest, what OS are you running on?
 
Are you able to identify the default encoding? In Java this is java.nio.charset.Charset.defaultCharset(); I can send you a small utility to find this out if you are willing to provide this feedback. This would be useful information so that (a) the bug can be reproduced and (b) the fix can be correctly tested.
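
A utility of the kind mentioned could look roughly like the sketch below. This is not the tool offered in this message; the class name ShowDefaultEncoding and the output format are assumptions made for illustration. It only reads standard JDK properties and Charset.defaultCharset().

```java
import java.nio.charset.Charset;

// Hypothetical utility; the actual tool is not shown in this thread.
class ShowDefaultEncoding {

    // Collects the values that are useful when reporting an encoding problem.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name()
                + ", os.name=" + System.getProperty("os.name");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with the same options as the Tomcat JVM and once without -Dfile.encoding=utf-8 would show whether the flag changes the default on that machine.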
 
Now that you have verified that setting the encoding resolves the issue, this indicates that there is a bug. I have raised https://jira.duraspace.org/browse/FCREPO-832 for this and attached your sample XML.
 
The work-around is to set the JVM file.encoding - I have added a note to the oaiprovider page about doing this.
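
The workaround changes the default for the whole JVM. A fix in code would instead name the charset wherever text is turned into bytes. The sketch below shows that general pattern only; it is not the change made for FCREPO-832, and the class ExplicitCharsetWrite and its methods are hypothetical. By contrast, new FileWriter(file) encodes with the platform default charset.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical example class; not taken from the oaiprovider source.
class ExplicitCharsetWrite {

    // Writes text as UTF-8 regardless of the platform default encoding.
    public static void writeUtf8(Path file, String text) throws IOException {
        try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // Writes the text to a temporary file and checks the bytes on disk
    // are exactly the UTF-8 encoding of the text.
    public static boolean roundTripsAsUtf8(String text) {
        try {
            Path tmp = Files.createTempFile("record", ".xml");
            try {
                writeUtf8(tmp, text);
                return Arrays.equals(Files.readAllBytes(tmp),
                        text.getBytes(StandardCharsets.UTF_8));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Delta, eta, mu: the kind of text that triggered the error.
        System.out.println(roundTripsAsUtf8("\u0394\u03B7\u03BC"));
    }
}
```

The same applies on the reading side: a Reader built with an explicit charset does not depend on file.encoding.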
 
Regards

Dimitris Gavrilis

Nov 26, 2010, 8:45:47 AM
to Support and info exchange list for Fedora users.
Dear Steve,

I have two installations (one on Debian Linux and one on Windows 7). The problem occurred on the Windows installation (I haven't tried it on Linux yet). You can send me the tool whenever you want.

Thanks again,
Dimitris.