New user, first impressions

DL

Jan 17, 2013, 1:55:19 PM
to bitex...@googlegroups.com, rdo...@yahoo.com
Hello,
As a prospective new user of Bitext, I thought these few comments might help the development process:
1) Great tool in its overall goals and principles
2) I have however experienced serious difficulties on my first attempts to create a tmx file from 2 documents (source and target) I had just translated. Here are the steps I took :
a) Source.doc and Target.doc were first saved as Source.txt and Target.txt using Unicode UTF-8 format.
b) bitext was then started and the 2 txt files used as source and target respectively.
c) I then went down the Bitext2tmx 1.0M0 editing window to adjust the alignments that were off, due either to differences in punctuation between the 2 documents or to non-grammatical full-stop marks such as decimal points, reference numbers, etc.
d) Once the 2 sides were (or seemed) perfectly aligned, I saved a xxxx.tmx file, using Unicode UTF-8 format, hoping to be able to use it in an OmegaT Project.
e) I copied the newly created tmx file into the TM subdirectory of my OmegaT project and opened the project ... I got an error message saying: Failed to load translation memory xxxx.tmx for project! ParseError at [row,col]:[27,78] Message: an invalid XML character (unicode 0x1e) was found in the element content of the document.
f) I examined the Bitext editor again, saw no apparent anomaly in the alignment, and saved the file in ISO-8859-1 format this time, but got the same error message in OmegaT.
g) I tried to load the TMX file in Benchmark to check whether the problem was specific to OmegaT, but no: Benchmark reported the same parse error in the xxxx.tmx file.
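
For readers hitting the same ParseError, a short script can list every character in a TMX file that XML 1.0 does not allow. This is only a sketch: the function name `find_invalid_xml_chars` is made up for illustration, the character class follows the `Char` production of the XML 1.0 specification, and the reported columns count characters, so they may not match a given parser's [row,col] exactly.

```python
import re

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_xml_chars(text):
    """Return a list of (row, col, codepoint) tuples, 1-based, for forbidden characters."""
    hits = []
    # split("\n") is deliberate: str.splitlines() would also break lines at
    # control characters such as U+001E and shift the row numbers.
    for row, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHAR.finditer(line):
            hits.append((row, match.start() + 1, hex(ord(match.group()))))
    return hits
```

To use it, read the file with `open(path, encoding="utf-8", newline="").read()` and pass the resulting string to the function; an empty list means no forbidden characters were found.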

Conclusion and comments :
My first impressions are :
1) The fact that the Bitext windows are not resizable, especially the bottom windows that contain the text of the selected segments, severely limits its user-friendliness, as one cannot see the end of longer segments in these windows.

2) The text I used as a test was rather straightforward and short (English-French pair). Getting a fatal error on such a simple trial is a bit frustrating and does not encourage adopting Bitext to create larger TMX files from previously translated texts.

3) I hope this info can help the developers because the concept is great and the overall configuration of the tool is rather simple and easy to learn. A very worthwhile tool to develop further.

Any insight and suggestions on what this ParseError may be and/or what alternative methods or tools are available to create sound TMX files will be much appreciated.

Thank you!

laseray

Jan 17, 2013, 4:10:33 PM
to bitext2tmx - bitext aligner/converter
Try the Validator from the OmegaT+ project, applications section
(http://omegatplus.sf.net).

You will also note that OmegaT is not always completely bug free. As
the developer of OmegaT+ I have found bugs in OmegaT when it loads
certain TMX, and these errors have not been corrected for years. This
may or may not be the case with your TMX.

Check it with the validator first and see what happens.

Raymond

DL

Jan 17, 2013, 4:52:10 PM
to bitex...@googlegroups.com
Thanks for the tip. I did use OmegaT+ and still got a fatal error.
I have a clue: I tried to open the faulty TMX file with MS Word and got exactly the same error, referring to an unrecognized character. I then tried to open the TMX file with Firefox, and there it was: not only did it give me the same error message (ParseError at [row,col]:[27,78]), but it also printed the offending segment in red, with a red arrow underneath identifying a hyphen as the culprit. It was in a date like "23-24 March". I went back to the source, replaced this hyphen with a word (23 to 24 March), recreated the TMX with Bitext, copied it to the TM directory of OmegaT and reopened my project. Sure enough, the fatal error was gone, BUT there was another one pointing to a similar error at different coordinates (ParseError at [row,col]:[xx,yy]), this time in the target file. I went back to Firefox, and this time it pointed to another similar hyphen further down in the target text.
The logical conclusion is that Bitext has difficulty handling the regular hyphen, even when the source file was saved as Unicode UTF-8. I wonder if this is a known bug, and whether there are fixes other than replacing every "-" with another character in the source and target files before Bitexting them.
I'll keep this forum posted on progress.
Thanks
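
One possible explanation, stated here as an assumption and not confirmed anywhere in this thread: the character code in the error message appears to be U+001E, a control character and not the ordinary hyphen (U+002D). Word is known to use that code internally for its non-breaking hyphen, and U+001F for its optional hyphen, so the "hyphen" may have been one of those special characters surviving the export to .txt. Under that assumption, a sketch of a cleanup pass to run on Source.txt and Target.txt before alignment could look like this (the function name `clean_word_hyphens` is hypothetical):

```python
# Assumption: U+001E is Word's non-breaking hyphen and U+001F its optional
# (soft) hyphen, both left over from the .doc to .txt export.
REPLACEMENTS = {
    "\u001e": "-",  # assumed non-breaking hyphen -> regular hyphen
    "\u001f": "",   # assumed optional hyphen -> dropped
}
TABLE = str.maketrans(REPLACEMENTS)

def clean_word_hyphens(text):
    """Replace the assumed Word hyphen control characters with XML-safe text."""
    return text.translate(TABLE)
```

This keeps the date readable as "23-24 March" instead of requiring the hyphen to be rewritten as a word.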

laseray

Jan 18, 2013, 1:00:04 AM
to bitext2tmx - bitext aligner/converter
Okay. You really should have tried Validator rather than using a browser to
track this down. It can validate your TMX, find the characters in
error, and clean them.

Anyway, this doesn't necessarily mean there is something wrong with
B2T because I cannot see the files you used with it. There might
actually be something wrong with how you converted them to text from
doc format. Did you validate whether your text files were in proper
UTF-8? By this I mean not merely converting with a program, like a
wordprocessor, but running a program that checks the encoding
specifically before you use B2T.

You can also try opening the TMX you created with B2T with B2T. If it
cannot open the thing it created then it certainly is part of the
problem.
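
The encoding check suggested above can be approximated with a strict decode. This is a minimal sketch, not the Validator's method; `check_utf8` is a hypothetical helper name. Note that a file can pass this check and still contain control characters that XML forbids, because bytes such as 0x1E are valid UTF-8.

```python
def check_utf8(data: bytes):
    """Return None if data is valid UTF-8, else (byte_offset, reason)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded
        return (err.start, err.reason)
    return None
```

Read the file in binary mode, for example `open("Source.txt", "rb").read()`, and pass the bytes to the function.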

DL

Jan 19, 2013, 1:08:28 AM
to bitex...@googlegroups.com
Well, I did follow your suggestion and downloaded Validator (with Swordfish). I started the TMX generation from scratch with the same 2 files in Bitext, then ran Validator on the resulting TMX file. It pointed to the same error at the same coordinates in the TMX file. I asked Validator to clean the file and it said it did.
I then started a project with OmegaT+, put the clean TMX file in the TMS directory and the source file in the "originals" directory, and opened the project. This time the error was "java.lang.NullPointerException" and no fuzzy match appeared in the correspondences window.
I will most likely give up on Bitext, because in parallel I did the very same exercise with LF-Aligner-3.11 and it worked really well. It produced a much more accurate alignment of my source and target documents, especially by distinguishing decimal points from grammatical full stops and using only the latter as segment boundaries, whereas Bitext segmented at every dot it encountered in the file, even inside parentheses, as in a reference like (John Doe, manual, vol. 6. p.4). I believe the algorithm LF-aligner uses to segment the aligned files is quite sophisticated. With the same 2 files, the number of manual adjustments I had to make to finalize the alignment was minimal compared to Bitext, for the reason given above. Then I used OmegaT with the source document as a project and the TMX file as a memory: exact matches from the TMX were proposed for almost all segments, and I could redo almost the entire translation in minutes.
Well, I'll continue along my learning curve with LF-aligner and keep an eye on future versions of Bitext to see if the robustness improves.
Thank you very much for your help and so long for now.
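
To illustrate the segmentation difference described above, here is a minimal sketch of a splitter that breaks only at sentence-final punctuation. It is not LF Aligner's or Bitext2tmx's actual algorithm; the rule (break only when the mark is followed by whitespace and an uppercase letter, and the preceding word is not a known abbreviation) and the abbreviation list are assumptions chosen for the example.

```python
import re

# Illustrative abbreviation list; a real aligner would need a per-language one.
ABBREVIATIONS = {"vol", "p", "pp", "no", "etc", "mr", "mrs", "dr"}

# A candidate boundary: ., ! or ? followed by whitespace and an uppercase letter.
# Decimal points ("3.5") never match because no whitespace follows the dot.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z\u00C0-\u00D6\u00D8-\u00DE])")

def split_sentences(text):
    """Split text into sentences at true full stops only (heuristic)."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1].lstrip("(\"'[").lower() if words else ""
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue  # "Dr. Smith": the dot belongs to an abbreviation
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With this rule, a reference such as "(John Doe, Manual, vol. 6. p.4)" stays inside one segment, because none of its dots is followed by whitespace and an uppercase letter.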