Reason why the MetaLex translation may fail?


Ashok Hariharan

Jul 30, 2009, 3:53:18 AM
to akomantoso-dev
Hi Luca,

I sometimes get a strange problem: the translation fails at the point
of the MetaLex transform.

I get an error like:

Error on line 1 column 1295 of temp7544657180364232322.xml:
SXXP0003: Error reported by XML parser: Invalid byte 2 of 2-byte
UTF-8 sequence.
There was a problem during the translation to the METALEX metaformat

In practice this means the MetaLex output is empty.

I have attached an example for which the translation fails. (Does it
fail for you for that file?)

Ashok

ke_debaterecord_2009-7-24_eng.odt

Ashok Hariharan

Jul 30, 2009, 3:57:28 AM
to akomantoso-dev
Just to give more perspective: the MetaLex translation does not fail
for the file attached to this message, but it does fail for the file
attached to the previous email.

Ashok

ke_debaterecord_2009-7-26_eng.odt

Ashok Hariharan

Jul 30, 2009, 4:38:43 AM
to akomantoso-dev
Hi Luca,

I realized what the problem is.

The file that is causing the error has text that was imported from a
Word document. This causes the SAX parser to fail because some of the
imported characters are not valid UTF-8 (most likely they are in a
single-byte encoding such as Windows-1252). See the attached file and
the _copy_3 file: the _copy_3 file translates successfully because I
removed the text imported from Word.

The best way to resolve this problem is to fix the way we read XML
documents for parsing.

Right now we have calls like:

File mergedOdfFile = odfUtil.mergeODF(aDocumentPath);
StreamSource ODFDocument = new StreamSource(mergedOdfFile);

The StreamSource is then passed on to the XML parser.

If we construct the StreamSource from a BufferedReader instead of a
File handle, the problem goes away for most encoding-related issues:
the underlying Reader decodes the bytes into Java's UTF-16 characters
before the parser sees them, so the parser no longer has to deal with
mixed encodings in the byte stream.

Something like:

File mergedOdfFile = odfUtil.mergeODF(aDocumentPath);
StreamSource ODFDocument = new StreamSource(
    new BufferedReader(new FileReader(mergedOdfFile)));

or

StreamSource ODFDocument = new StreamSource(
    new BufferedReader(new InputStreamReader(
        new FileInputStream(mergedOdfFile))));
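
To illustrate the difference, here is a minimal, self-contained sketch.
It rests on several assumptions that are not part of the translator: the
class name ReaderSourceDemo, the in-memory sample document (valid UTF-8
apart from one stray 0xC9 byte), and the JDK's built-in identity
transform standing in for the real pipeline (the SXXP0003 code in the
error above comes from Saxon). The explicit UTF-8 charset on the
InputStreamReader is also an addition of this sketch.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of the translator source.
class ReaderSourceDemo {

    // Builds a document that declares UTF-8 but contains one byte (0xC9)
    // that does not start a valid UTF-8 sequence in this position.
    static byte[] sampleBytes() {
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>caf"
                .getBytes(StandardCharsets.UTF_8);
        byte[] tail = "</p>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(head, 0, head.length);
        buf.write(0xC9); // stray single-byte character, as left by an import
        buf.write(tail, 0, tail.length);
        return buf.toByteArray();
    }

    // Runs an identity transform; returns the output, or null on failure.
    static String identityTransform(StreamSource source) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            t.transform(source, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            return null;
        }
    }

    // Byte-stream source: the XML parser decodes the bytes itself.
    static String transformFromBytes(byte[] bytes) {
        return identityTransform(
                new StreamSource(new ByteArrayInputStream(bytes)));
    }

    // Reader source: the bytes are decoded before the parser sees them.
    static String transformFromReader(byte[] bytes) {
        Reader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        return identityTransform(new StreamSource(reader));
    }

    public static void main(String[] args) {
        byte[] bytes = sampleBytes();
        System.out.println("byte stream: "
                + (transformFromBytes(bytes) == null ? "failed" : "ok"));
        System.out.println("reader:      "
                + (transformFromReader(bytes) == null ? "failed" : "ok"));
    }
}
```

Two caveats on this approach. The Reader does not recover the original
characters: InputStreamReader substitutes U+FFFD for bytes it cannot
decode, so the parse succeeds but the affected characters are lost.
Also, FileReader decodes with the platform default charset, so naming
the charset explicitly (as above) is the more predictable variant.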

If you have checked in your latest changes, I can make this fix in the
translator source.

Ashok

ke_debaterecord_2009-8-1_eng.odt
ke_debaterecord_2009-8-1_eng_copy_3.odt

Luca Cervone

Jul 30, 2009, 5:45:36 AM
to akomant...@googlegroups.com
Dear Ashok,
Yes, the problem is with the encoding.
Basically, every time the translator returns a blank document, it means that there is a problem with the encoding or with the source document.
I have updated the code online. You can make this modification yourself, or I'll do it tomorrow, as you prefer.
Let me know.

Ciao
Luca



Luca Cervone
Web and XML solutions designer

e-mail:     cervo...@gmail.com

mobile phone:    0039 348 26 27 545
home   phone:  0039 051 199 82 854

skype:   cervoneluca



Ashok Hariharan

Jul 30, 2009, 9:16:10 AM
to akomant...@googlegroups.com
Hi Luca,

I am making this modification myself right now.

Ashok

Ashok Hariharan

Jul 31, 2009, 8:58:19 AM
to akomant...@googlegroups.com
Luca,

I have updated the translator with the BufferedReader fix for file
inputs.

I also fixed broken references to the minixslt files in the bill
pipeline.

Ashok
