I am sometimes hitting a strange problem: the translation fails at
the metalex transform step.
I get an error like:
Error on line 1 column 1295 of temp7544657180364232322.xml:
SXXP0003: Error reported by XML parser: Invalid byte 2 of 2-byte
UTF-8 sequence.
There was a problem during the translation to the METALEX metaformat
The net effect is that the metalex output is empty.
I have attached an example for which the translation fails. (Does
it fail for you with that file?)
Ashok
I realized what the problem is.
The file that is causing the error has text that was imported from a
Word document. This causes the SAX parser to fail, because some of the
bytes in that text are not valid UTF-8 sequences, which is what the
parser error reports (see the attached file and the _copy_3 file; the
_copy_3 file translates successfully because I have removed the text
imported from Word).
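To confirm that a merged file really contains bytes that are not valid
UTF-8, and to see where, a small check along these lines could help. It
is only a sketch: the class name Utf8Check and the method
firstInvalidUtf8Offset are made up for illustration and are not part of
the translator; it uses only the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the translator source.
class Utf8Check {

    /**
     * Returns the byte offset of the first malformed UTF-8 sequence,
     * or -1 if the whole array decodes cleanly as UTF-8.
     */
    static int firstInvalidUtf8Offset(byte[] bytes) {
        // REPORT makes the decoder stop at bad input instead of replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position is left at the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // output buffer full: discard decoded chars and continue
        }
    }
}
```

Calling it on the bytes of the temporary XML file (for example via
java.nio.file.Files.readAllBytes) gives a byte offset that can be
compared with the column reported by the parser.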
The best way to resolve this problem is to fix the way we are reading
XML documents for parsing.
Right now we have calls like:
File mergedOdfFile = odfUtil.mergeODF(aDocumentPath);
StreamSource ODFDocument = new StreamSource(mergedOdfFile);
The StreamSource is then passed on to the XML parser.
If, instead of constructing the StreamSource from a File handle, we use
a BufferedReader, the problem is solved for most encoding-related
issues: the Reader decodes the bytes into Java's UTF-16 characters
before the parser sees them, so the parser works on characters and no
longer has to deal with mixed encodings in the raw bytes.
Something like:
File mergedOdfFile = odfUtil.mergeODF(aDocumentPath);
StreamSource ODFDocument = new StreamSource(new BufferedReader(new
FileReader(mergedOdfFile)));
or
StreamSource ODFDocument = new
StreamSource(new BufferedReader(new InputStreamReader(new
FileInputStream(mergedOdfFile))));
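One caveat: FileReader, and InputStreamReader without a charset
argument, decode with the platform default encoding, so the result can
differ between machines. A variant that names the charset explicitly
might look like the sketch below. The class ReaderSourceSketch and the
method openAsCharacters are hypothetical names, not existing translator
code, and passing UTF-8 is an assumption about what the merged ODF file
is supposed to contain.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the translator source.
class ReaderSourceSketch {

    /**
     * Wraps the file in a Reader that decodes with the given charset, so
     * the XML parser receives characters instead of raw bytes.
     */
    static StreamSource openAsCharacters(File xmlFile, Charset charset)
            throws IOException {
        Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(xmlFile), charset));
        StreamSource source = new StreamSource(reader);
        // Keep the system id so relative references can still be resolved.
        source.setSystemId(xmlFile);
        return source;
    }
}
```

It would be called as
ReaderSourceSketch.openAsCharacters(mergedOdfFile, StandardCharsets.UTF_8).
Note that a Reader built this way substitutes the replacement character
U+FFFD for malformed bytes instead of failing, so the parse goes through
but the offending characters from the Word text are not recovered.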
If you have checked in your latest changes, I can make this fix
in the translator source?
Ashok
<ke_debaterecord_2009-8-1_eng.odt><ke_debaterecord_2009-8-1_eng_copy_3.odt>
Hi Luca,
I am making this modification myself right now
Ashok
I have updated the translator with the BufferedReader fix for
file inputs.
I also fixed broken references to minixslt files in the bill pipeline.
Ashok