OmegaT+ and Chinese / Concordance

jotheman

Oct 14, 2010, 3:44:03 PM

to OmegaT+ CAT Tools

Hello Ray,

before I start, I should mention that I used to translate with OmegaT+
mostly for the sake of having source text and translation nicely
segmented in one editor. For the types of translations that I do
(mostly literature) I didn't really need the translation memory, as
there weren't many repetitive parts where it could come in handy,
anyway.

Still, since nobody seems to have mentioned this here before, I feel
like I should point out a big problem with using OmegaT+ for Chinese
texts, one that almost prevents it from working as a TM at all: it
seems to treat a sequence of Chinese characters as one word, looking
for spaces and punctuation marks to separate them. But there are no
spaces between words in Chinese, since each character is itself a word
that can carry meaning on its own or in combination with other
characters.

I guess the Chinese colleagues that kindly helped with the Chinese
localisation didn't encounter this problem mainly for one reason: they
probably translate into Chinese, not from Chinese.

For translations from Chinese into any other language, OmegaT+ needs
to be taught to treat every Chinese character as a single word, which
probably isn't hard to fix, I think.

I'll gladly assist with beta testing, if needed. Please don't hesitate
to contact me directly through E-Mail.

Cheers,

jo.

laseray

Oct 15, 2010, 11:45:07 AM

to OmegaT+ CAT Tools

Hi,

Okay, I checked this out a little, and I find that matching for
Chinese original segments does work to some extent. I think perhaps
you are expecting it to work in a different manner than it does.

The key thing to understand is that there has to be enough similarity
between segments for matches to be returned. If they differ too much
then there will not be any matches given. And this applies to all
languages.

As far as I can tell right now, spaces have little to do with any
issue. Just try making an experiment with a small text document. Add
in a number of similar lines of text with small variations so that
matches have to occur. Try some lines with just a few glyphs/
characters. Add spaces between them also on other lines. Open the
project, make some translations, generate the translated documents/
TMX, and activate the segments that should have matches. You will get
some kind of results. Maybe not exactly optimal or as expected, but
something.

Let us all know what you find and how you think it can be improved. We
can see where it goes from there.

Thanks.

Raymond

laseray

Oct 15, 2010, 11:54:36 AM

to OmegaT+ CAT Tools

Hi,

Excuse my other message. I see the problem. The Chinese goes into the
TMX, but matching returns no results for it. Probably a character
encoding problem. More testing needed.

Thanks.

Raymond

Jon Johanning

Oct 14, 2010, 11:24:36 PM

to omega...@googlegroups.com

I am having no trouble using OmegaT+ on Japanese, which also has no
spaces between words. I'm segmenting sentences, which end in the
equivalent of a period, and that works fine. The lack of
European-language-style spaces within the segments has no effect on the
matching. However, there might be something about Chinese which makes a
difference from Japanese; I don't know about that.

Of course, OCRing the PDFs I usually work from introduces all sorts of
interesting quirks that need to be cleaned up or ignored, but that's not
the fault of OmegaT+.

Jon Johanning // jjoha...@igc.org

laseray

Oct 15, 2010, 2:43:16 PM

to OmegaT+ CAT Tools

Hi Jon,

Okay, thanks for the info.

I checked again with both Chinese and Japanese documents and the
matching does work. I got a little mixed up there somewhere! Anyhow,
if there are some real problems somewhere, please report them with
specific details so they can be tracked down properly.

Just make sure to check for yourself with segments that are similar
enough. And note that matching cannot work down to the most detailed
level of glyph/character results, or else it would take extremely long
to return any matches and then most of them would be useless.

Raymond

jo.

Oct 16, 2010, 9:02:31 AM

to omega...@googlegroups.com, lase...@gmail.com

Hello Ray,

Thanks for your immediate feedback.

I should have clarified that OT+ does give some TM results, but that those are based on segments within a segment, I think, for example (and I translate here, so you can see the problem):

Segment 1: He said: Hi there, how are you doing?
Segment 2: He said: Hi there yourself, how are you doing yourself?
Segment 3: He said: What do you think, should we go home and cook?

Imagine those segments to be Chinese. I would translate the first segment and get 33% matches for both of the others, because all three share one identical part, even though the second segment is much more similar to the first than the third is. My guess is that OmegaT+ thinks each sentence consists of only three words (e.g. "Hesaid", "Hithere" and "howareyoudoing" in segment 1), with one "word" being identical and the other two completely different.
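
The guess above can be illustrated with a minimal sketch. It assumes a tokenizer that splits only on whitespace and punctuation, and a toy overlap score; neither is taken from OmegaT+'s actual code. The Chinese sentences are illustrative renderings of the three English examples, and the names `DELIMITERS`, `word_tokens` and `token_overlap` are hypothetical.

```python
import re

# Assumption: the tokenizer breaks text only at whitespace and punctuation
# (ASCII and full-width), never between two Chinese characters.
DELIMITERS = r"[\s:,.?!;：，。？！、；]+"

# Illustrative renderings of the three example segments.
SEG1 = "他说：你好，你怎么样？"
SEG2 = "他说：你也好，你自己怎么样？"
SEG3 = "他说：你觉得呢，我们回家做饭吧？"


def word_tokens(segment):
    """Split a segment at delimiters; each run of characters becomes one 'word'."""
    return [tok for tok in re.split(DELIMITERS, segment) if tok]


def token_overlap(a, b):
    """Toy score: shared tokens divided by the larger token count."""
    tokens_a, tokens_b = word_tokens(a), word_tokens(b)
    shared = len(set(tokens_a) & set(tokens_b))
    return shared / max(len(tokens_a), len(tokens_b))


if __name__ == "__main__":
    print(word_tokens(SEG1))
    print(token_overlap(SEG1, SEG2), token_overlap(SEG1, SEG3))
```

Under these assumptions every segment collapses into three long "words", and both comparisons share only the first one, so the second and third segments score the same against the first.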

I've actually created a project to demonstrate the above situation. I've also included a second test file with a number of similar example sentences; I've translated the first one just to give you an idea of how OmegaT+ will match them. There's a third test file with the same simple sentence ("How are you?"), broken up with commas in different ways.

I have attached the zipped project file to this mail.

OTP-test.zip

laseray

Oct 16, 2010, 12:23:51 PM

to OmegaT+ CAT Tools

Okay, looked at your project. I do see an issue.

I took the first of the three segments you gave, with a translation
added. Just by adding one space to another copy of the segment and
loading it in the project, no matches are returned. This is wrong,
because it should return a match up in the ~90% range. Not sure what
is going on with that; it is nonsense. You do not even get the 33% you
pointed out, which is just coincidental to the underlying problem. The
run of characters in a segment is basically matched one by one up
until the end. If a minimum matching threshold is not exceeded, then
no matches are returned. Being off by one character should not result
in no match in this case.
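
To make the threshold idea concrete, here is a minimal sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for the matcher, which is an assumption and not OmegaT+'s own algorithm; the function name `fuzzy_matches`, the sample sentence and the threshold value of 0.3 are likewise hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_matches(segment, memory, threshold=0.3):
    """Return (score, stored_segment) pairs scoring above threshold, best first.

    The score is difflib's character-level similarity ratio in [0, 1];
    the threshold value is an arbitrary choice for illustration.
    """
    scored = [(SequenceMatcher(None, segment, stored).ratio(), stored)
              for stored in memory]
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)


if __name__ == "__main__":
    memory = ["他说：你好，你怎么样？"]
    # Same sentence with a single space inserted after the comma.
    print(fuzzy_matches("他说：你好， 你怎么样？", memory))
    # A sentence with almost nothing in common falls below the threshold.
    print(fuzzy_matches("今天天气很好", memory))
```

With a character-level comparison like this one, a copy that differs by a single inserted space still scores well above 90%, which is the behaviour expected in the paragraph above.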

The problem has something to do with matching in regard to collating
sequence, character encoding or similar. Not sure if this is fixable
as of yet.
Please file a bug report on the project site so this can be tracked.

Raymond

jo.

Oct 16, 2010, 4:28:17 PM

to omega...@googlegroups.com

I would think that it might be an encoding issue, but that's just an educated guess.

Could you also see the search function issue I mentioned?

jo.

On 2010/10/16 at 18:23, laseray wrote:

> --
> The OmegaT+ project is hosted on sourceforge at: http://sourceforge.net/projects/omegatplus
> Homepage is: http://omegatplus.sourceforge.net
>
> Note: members who post messages that are considered abusive or go beyond proper conduct will be banned without notice.
>
> You received this message because you are subscribed to the Google Groups "OmegaT+" group.
> To post to this group, send email to omega...@googlegroups.com
> To unsubscribe from this group, send email to omegatplus-...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/omegatplus

laseray

Oct 16, 2010, 4:46:56 PM

to OmegaT+ CAT Tools

I was able to search, even down to one glyph in Chinese.

Raymond

jo.

Oct 16, 2010, 4:55:46 PM

to omega...@googlegroups.com

What could be the cause of the search not working for me then?

jo.

On 2010/10/16 at 22:46, laseray wrote:

> I was able to search, even down to one glyph in Chinese.
>
> Raymond
>

laseray

Oct 16, 2010, 5:21:06 PM

to OmegaT+ CAT Tools

No idea, but check your search options. Make sure something is there
to search. Try in a different language.

Raymond

jotheman

Nov 4, 2010, 4:40:20 PM

to OmegaT+ CAT Tools

Hi Ray,

any news on the Chinese TM matching front?

Cheers,

jo.

laseray

Nov 4, 2010, 8:04:58 PM

to OmegaT+ CAT Tools

Hi jo,

On Nov 4, 4:40 pm, jotheman <hims...@jotheman.de> wrote:
> any news on the Chinese TM matching front?

Sorry, have not looked at this yet.

I will take a look at this soon, as I am deciding on what to do for
the next version now.
Just so you know, this does not mean that it can necessarily be fixed
or fixed in the short term though. I may be able to pinpoint the
cause, but this has the potential to be a serious problem to fix if it
is related to some core part of Java and how it has implemented
parsing, collating, and so forth for Chinese or Asian languages in
general. Hopefully not. I just do not want to get your hopes up before
I have made a serious attempt to work on it.

Keep nudging me on it once in a while if I do not get to it. I also
want to go over other bugs/feature requests and see what can be done
in general.

Raymond

laseray

Nov 7, 2010, 2:32:51 AM

to OmegaT+ CAT Tools

Hi jo,

Over the last day I took a look at this problem you brought up and I
do see some way out of this. What you have brought to my attention has
to do with the general issue of tokenization.

Now there is good news and bad news. The bad news is that it will take
quite a while for me to fix the tokenization code for the general case
covering most languages, towards a major version of the program. The
good news is that I think I can make a workaround for this problem in
the interim that may be at least satisfactory. That is, I already was
able to experiment and get matches on Chinese at the level of
ideographs instead of what it thinks are word tokens. Now I just have
to work out the details.

jo.

Nov 7, 2010, 7:59:47 AM

to omega...@googlegroups.com

Hi Raymond,

I don't know what 'tokenization' means, but character (ideograph) based matching is probably everything OTP needs to do. Guessing 'word tokens', if I understand correctly, would mean it somehow had to work out which characters (ideographs) together form a word, which is next to impossible for a computer to do. (Try google-translating any more complex Chinese text...)

So this is great news!

Cheers,

jo.

On 2010/11/7 at 8:32, laseray wrote:

laseray

Nov 7, 2010, 8:31:14 AM

to OmegaT+ CAT Tools

Hi jo,

Excuse my computerese. Tokenization is just the process of breaking up
a piece of text into smaller parts (tokens). The program uses these
tokens to do the actual matching. OTP was doing it by word as the
default. By character can work also, but it seems to return irrelevant
matches when there are no segments in the project or TMs that are
similar enough. It can work, although perhaps somewhat annoyingly if
you keep seeing too many unrelated matches. It also makes matching
slower. Not by that much on what I tested. It could become a problem
in large projects, so that will have to be tested.
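
As a rough illustration of matching by character instead of by word, the sketch below treats every non-punctuation character as its own token and compares token sequences with Python's standard `difflib.SequenceMatcher`. This is only an assumed stand-in, not the code in OTP; the sample sentences and the names `char_tokens` and `char_similarity` are hypothetical.

```python
from difflib import SequenceMatcher

# Assumption: whitespace and these punctuation marks are not tokens themselves.
PUNCTUATION = set(" \t\n:,.?!;：，。？！、；")

SEG1 = "他说：你好，你怎么样？"
SEG2 = "他说：你也好，你自己怎么样？"
SEG3 = "他说：你觉得呢，我们回家做饭吧？"


def char_tokens(segment):
    """Treat each remaining character (ideograph) as one token."""
    return [ch for ch in segment if ch not in PUNCTUATION]


def char_similarity(a, b):
    """Similarity ratio in [0, 1] computed over the two character-token lists."""
    return SequenceMatcher(None, char_tokens(a), char_tokens(b)).ratio()


if __name__ == "__main__":
    print(char_tokens(SEG1))
    print(char_similarity(SEG1, SEG2))  # closely related sentences
    print(char_similarity(SEG1, SEG3))  # only the opening is shared
```

At this granularity the closely related pair scores clearly higher than the pair that shares only its opening, though short common runs still contribute a little to every score, which fits the irrelevant matches mentioned above. Comparing many more tokens per segment is also where the extra cost would come from.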

I can send you a version with character matching to see if it works
for you. Very preliminary, so more work will be needed towards a
release version to sort out user controllable options for this and to
make sure it does not introduce bugs. Anyway, I can send that later
on. I have some other things to do right now, like sleep. Ha!

If anybody else wants a copy, just ask.

jo.

Nov 7, 2010, 9:07:19 AM

to omega...@googlegroups.com

Hi Ray,

well, I get a lot of useless matches now, too. For example: "He said:" followed by any text will be matched with lots and lots of segments that OTP thinks are similar, but which actually only have "He said:" in common. (As I mentioned before, I mostly translate literature.)

To make matching more useful, it would of course also have to match strings of characters. This could be done by matching the subsegments of a segmented Chinese text, e.g. the text between commas or other punctuation marks, against each other. But I don't understand enough about how matching works and how it works best, so I'm not sure if that makes any sense.
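
One way to picture the subsegment idea is the sketch below. It splits each segment at punctuation and, for every clause of the new segment, looks up the most similar clause of a stored segment. The clause-break set, the use of Python's standard `difflib.SequenceMatcher` as the similarity measure, and the names `subsegments` and `best_subsegment_matches` are all assumptions for illustration, not anything OTP is known to do.

```python
import re
from difflib import SequenceMatcher

# Assumption: clauses end at these ASCII or full-width punctuation marks.
CLAUSE_BREAKS = r"[:,.?!;：，。？！、；]+"


def subsegments(segment):
    """Split a segment into the runs of text between punctuation marks."""
    return [part.strip() for part in re.split(CLAUSE_BREAKS, segment)
            if part.strip()]


def best_subsegment_matches(new_segment, stored_segment):
    """For each clause of new_segment, return (clause, best stored clause, score)."""
    stored = subsegments(stored_segment)
    if not stored:
        return []
    results = []
    for clause in subsegments(new_segment):
        # Pick the stored clause with the highest character-level similarity.
        best = max(stored, key=lambda s: SequenceMatcher(None, clause, s).ratio())
        score = SequenceMatcher(None, clause, best).ratio()
        results.append((clause, best, score))
    return results


if __name__ == "__main__":
    pairs = best_subsegment_matches("他说：你也好，你自己怎么样？",
                                    "他说：你好，你怎么样？")
    for clause, best, score in pairs:
        print(clause, "->", best, round(score, 2))
```

Reporting clause-level scores like this would let a short shared opening such as "He said:" be told apart from a clause that is genuinely close to an earlier translation.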

Anyway, don't forget to get some rest and stay healthy!

Cheers,

jo.

PS: I'll gladly do some testing, as long as my current project won't be affected by it. But as I understand it, the translation and saving process would stay the same and there's only the 'danger' of OTP getting sluggish or crashing, not destroying already translated segments, right?

On 2010/11/7 at 14:31, laseray wrote:
