Of course, OCRing the PDFs I usually work from introduces all sorts of
interesting quirks that need to be cleaned up or ignored, but that's not
the fault of OmegaT+.
Jon Johanning // jjoha...@igc.org
Thanks for your immediate feedback.
I should have clarified that OT+ does give some TM results, but I think those are based on sub-segments within a segment. For example (I translate here so you can see the problem):
Segment 1: He said: Hi there, how are you doing?
Segment 2: He said: Hi there yourself, how are you doing yourself?
Segment 3: He said: What do you think, should we go home and cook?
Imagine those segments are in Chinese. I would translate the first segment and get 33% matches for the other two, because all three share one identical part, even though the second segment is much more similar to the first than the third is. My guess is that OmegaT+ thinks each sentence consists of only three words (e.g. "Hesaid", "Hithere" and "howareyoudoing" in segment 1), with one "word" being identical and the other two completely different.
I've actually created a project to demonstrate the above situation. I've also included a second test file with a number of similar example sentences; I've translated the first one just to give you an idea of how OmegaT+ will match them. There's a third test file with the same simple sentence ("How are you?"), broken up with commas in different ways.
I have attached the zipped project file to this mail.
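To make the guess above concrete, here is a minimal Python sketch. It is not OmegaT+'s actual matching code; the function names, the scoring formulas and the space-free English stand-ins for the Chinese segments are all assumptions made for illustration. It compares a naive word-token score with a character-based score from Python's standard `difflib`.

```python
import re
from difflib import SequenceMatcher


def word_tokens(segment):
    # Naive tokenizer: every run of letters between punctuation or spaces
    # counts as one "word". Text written without spaces (as Chinese is)
    # therefore collapses into a few very long tokens.
    return re.findall(r"\w+", segment)


def token_score(a, b):
    # Hypothetical score: shared tokens divided by the larger token count.
    ta, tb = set(word_tokens(a)), set(word_tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / max(len(ta), len(tb))


def char_score(a, b):
    # Character-based similarity, ignoring any notion of "words".
    return SequenceMatcher(None, a, b).ratio()


# Space-free stand-ins for the three example segments (assumed test data).
seg1 = "Hesaid: Hithere, howareyoudoing?"
seg2 = "Hesaid: Hithereyourself, howareyoudoingyourself?"
seg3 = "Hesaid: Whatdoyouthink, shouldwegohomeandcook?"

for other in (seg2, seg3):
    print(f"token={token_score(seg1, other):.2f}  "
          f"char={char_score(seg1, other):.2f}  {other}")
```

With these stand-ins, the token score is one third for both comparisons, because only "Hesaid" is shared, while the character-based score ranks segment 2 clearly above segment 3.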
Could you also reproduce the search function issue I mentioned?
jo.
On 2010/10/16 at 18:23, laseray wrote:
> --
> The OmegaT+ project is hosted on sourceforge at: http://sourceforge.net/projects/omegatplus
> Homepage is: http://omegatplus.sourceforge.net
>
> Note: members who post messages that are considered abusive or go beyond proper conduct will be banned without notice.
>
> You received this message because you are subscribed to the Google Groups "OmegaT+" group.
> To post to this group, send email to omega...@googlegroups.com
> To unsubscribe from this group, send email to omegatplus-...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/omegatplus
jo.
On 2010/10/16 at 22:46, laseray wrote:
> I was able to search, even down to one glyph in Chinese.
>
> Raymond
>
I don't know what 'tokenization' means, but character (ideograph) based matching is probably all OTP needs to do. Guessing 'word tokens', if I understand correctly, would mean it somehow had to work out which characters (ideographs) together form a word, which is close to impossible for a computer to do. (Try Google-translating any more complex Chinese text...)
So this is great news!
Cheers,
jo.
On 2010/11/7 at 8:32, laseray wrote:
Well, I get a lot of useless matches now, too. For example: "He said:" followed by any text will be matched with lots and lots of segments that OTP thinks are similar, but which actually only have "He said:" in common. (As I mentioned before, I mostly translate literature.)
For matching to be more useful, it would of course have to include matching strings of characters. This could be done by splitting a segmented Chinese text into sub-segments, e.g. the text between commas or other punctuation marks, and matching those against each other. But I don't understand enough about how matching works and how it works best, so I'm not sure if that makes any sense.
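The sub-segment idea could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not how OTP works: the punctuation set, the function names and the length-weighted scoring are all hypothetical. Each segment is split at ASCII or full-width punctuation, and each sub-segment of the new text is compared character by character with its closest counterpart in the old text.

```python
import re
from difflib import SequenceMatcher

# Assumed punctuation set: ASCII marks plus common full-width Chinese marks.
PUNCT = r"[,.:;!?，。：；！？、]+"


def split_subsegments(segment):
    # Cut the segment at punctuation and drop empty pieces.
    return [part.strip() for part in re.split(PUNCT, segment) if part.strip()]


def subsegment_score(new, old):
    # For each sub-segment of `new`, take its best character-level match
    # among the sub-segments of `old`, then average weighted by length,
    # so a short shared opener like "He said" cannot dominate the score.
    new_parts = split_subsegments(new)
    old_parts = split_subsegments(old)
    if not new_parts or not old_parts:
        return 0.0
    total = sum(len(p) for p in new_parts)
    weighted = sum(
        len(p) * max(SequenceMatcher(None, p, q).ratio() for q in old_parts)
        for p in new_parts
    )
    return weighted / total


# Space-free stand-ins for Chinese segments (assumed test data).
seg1 = "Hesaid: Hithere, howareyoudoing?"
seg2 = "Hesaid: Hithereyourself, howareyoudoingyourself?"
seg3 = "Hesaid: Whatdoyouthink, shouldwegohomeandcook?"

print(split_subsegments("他说：你好，你好吗？"))
print(round(subsegment_score(seg1, seg2), 2))
print(round(subsegment_score(seg1, seg3), 2))
```

Weighting by length is one possible design choice here: segments that share only a short opener score lower than segments whose longer sub-segments also overlap.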
Anyway, don't forget to get some rest and stay healthy!
Cheers,
jo.
PS: I'll gladly do some testing, as long as my current project won't be affected by it. But as I understand it, the translation and saving process would stay the same, and the only 'danger' is OTP getting sluggish or crashing, not destroying already translated segments, right?
On 2010/11/7 at 14:31, laseray wrote: