Tikal defaulting to windows-1252 on Windows...

jimbo

Sep 2, 2022, 12:42:27 PM
to Group: okapi-devel
If no encoding is specified, Tikal uses "Charset.defaultCharset()",
which on Windows returns windows-1252. This is causing an XML parsing
error for some of our docx documents:

An error occurred during extraction
Unexpected first character (char code 0xEF), not valid in xml document:
could be mangled UTF-8 BOM marker. Make sure that the Reader uses
correct encoding or pass an InputStream instead

Should we change the "Charset.defaultCharset()" code and simply force
utf-8 as the default encoding? I think that's what we do for Rainbow now.
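
A minimal sketch of the kind of fallback I have in mind. This is not the
actual Tikal code; the class and the resolveEncoding helper are hypothetical
names used only for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tikal or Okapi.
class EncodingFallback {

    // Returns the charset requested by the user, or UTF-8 when none is given,
    // instead of falling back to the platform default.
    static Charset resolveEncoding(String requested) {
        if (requested == null || requested.isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName(requested);
    }

    public static void main(String[] args) {
        // Platform dependent: this is the value that differs between Windows and Linux.
        System.out.println("platform default: " + Charset.defaultCharset());
        // Independent of the platform.
        System.out.println("no encoding given: " + resolveEncoding(null));
        System.out.println("explicit encoding: " + resolveEncoding("ISO-8859-1"));
    }
}
```

With something like this, the result no longer depends on the machine the
tool runs on unless the user explicitly asks for another encoding.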

Jim

jimbo

Sep 2, 2022, 1:31:13 PM
to Group: okapi-devel
There must be an issue with how the Tikal extract pipeline is set up.
Even when we force UTF-8 with "-ie utf-8 -oe utf-8" we still get the
same error as above.

jimbo

Sep 6, 2022, 1:50:42 PM
to Group: okapi-devel
Forgot to mention that the same file runs fine on Linux.

jimbo

Sep 6, 2022, 1:38:14 PM
to Group: okapi-devel
I found the problem: the OpenXML filter depends on the default charset,
which on Windows is still windows-1252. This fails on some documents
that have BOMs.

See ticket:
https://bitbucket.org/okapiframework/okapi/issues/1162/openxml-filter-fails-on-windows
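
To illustrate where the 0xEF in the error message comes from, here is a
small standalone sketch. It is not the OpenXML filter code; the class name
and the sample bytes are made up for the demonstration. It decodes a UTF-8
BOM followed by the start of an XML declaration with two different charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration only; not taken from the Okapi code base.
class BomDecodingDemo {

    // UTF-8 BOM (EF BB BF) followed by the first bytes of an XML declaration.
    static final byte[] DATA = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'
    };

    // Decodes the bytes with the given charset and returns the code of the first char.
    static int firstCharCode(byte[] data, Charset charset) {
        return new String(data, charset).charAt(0);
    }

    public static void main(String[] args) {
        // Each BOM byte becomes its own character, so the text starts with 0xEF.
        System.out.printf("windows-1252: 0x%X%n",
                firstCharCode(DATA, Charset.forName("windows-1252")));
        // The three BOM bytes decode to the single character U+FEFF.
        System.out.printf("UTF-8: 0x%X%n",
                firstCharCode(DATA, StandardCharsets.UTF_8));
    }
}
```

When the bytes go through a Reader built on the windows-1252 default, the
XML parser sees 0xEF as the first character, which matches the reported
error. Note that Java's UTF-8 decoder keeps the BOM as U+FEFF rather than
stripping it, which is presumably why the error message suggests passing an
InputStream so the parser can handle the BOM itself.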

Jim