Some thoughts about enhancing Japanese language search results in Felix


Steven P. Venti

Jun 15, 2009, 9:11:04 PM
to felix...@googlegroups.com
I just finished a five-month-long, 350,000-character project, during
which I used Felix from start to finish with the aim of maintaining
consistency of terminology. I think of Felix mainly as a tool for
maintaining quality rather than increasing throughput, a view that comes
mainly from the nonrepetitive nature of most of the work I do. In this
respect, Felix really works wonderfully for me, as it
enables me to go back and dredge up expressions that I translated
some time ago with practically no effort at all.

This means that what I really find most useful about Felix is its
ability to search both translation memories for past collocations and
glossaries for standard vocabulary. One issue I have found, however, is
that using the same search algorithm for both translation memories and
glossaries has its limitations.

It appears to a layman like myself that Felix calculates the "accuracy"
of Japanese-language search results simply by dividing the number of
characters in the search string that also appear in a given target
string by the total number of characters in the search string. So, if
there are 10 characters in the search string, and three of those
characters are found in the target string, then the accuracy is 30%.
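
Just to make sure I'm describing the same thing, here is a rough sketch
in Python of the kind of scoring I mean. This is only my guess at the
behavior based on the results I see, not the actual Felix code:

    def overlap_score(query, target):
        """Fraction of query characters that also appear somewhere in the
        target, ignoring order and position entirely."""
        if not query:
            return 0.0
        hits = sum(1 for ch in query if ch in target)
        return hits / len(query)

    # Ten-character query with three of its characters present in the
    # target gives 3/10 = 30%.
    print(overlap_score("ABCDEFGHIJ", "xxCxxFxxJx"))  # 0.3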

While I find this approach more or less acceptable for searching
translation memories, I find it wholly inadequate for searching
glossaries, for the simple reason that no weight is placed on the
sequence or relative position of the characters within the string.
Using a relatively high accuracy setting filters out far too many
relevant hits, while using a relatively low accuracy setting lets
through far too many spurious hits. The old "feast or famine" syndrome.

What I would like to suggest to improve this situation is that the
search paradigm for glossary items be enhanced so that the user can
select the number of characters to use as a base unit.

You probably already understand what I'm getting at, but just to explain
it more clearly:

At the moment, if the search string is ABCD, the search paradigm
apparently searches for all As, then all Bs, then all Cs, and finally
all Ds; that is, it uses a base unit of one character. What I'm
suggesting is that, if the user were to specify two characters as the
base unit, the paradigm should search for all ABs, all BCs, and all CDs.
I don't know whether this kind of approach would make any difference
when searching in alphabet-based languages, but I think you will
recognize immediately that it would probably improve accuracy for things
like yoji-jukugo (four-character compounds), since four-character
strings in CJK languages can quite often be parsed as two two-character
units. Obviously, this doesn't improve relevance for 100% matches, but I
think it would produce significantly more relevant results for matches
of less than 100%, especially those in the 50% or higher range.
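
To make the idea concrete, here is a rough sketch of what I mean by a
two-character base unit. The bigram approach and the function names are
just my own illustration, not a description of how Felix works now:

    def ngrams(text, n):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def ngram_score(query, target, n=2):
        """Fraction of the query's n-grams that also occur in the target."""
        grams = ngrams(query, n)
        if not grams:
            return 0.0
        target_grams = set(ngrams(target, n))
        hits = sum(1 for g in grams if g in target_grams)
        return hits / len(grams)

    # With n=2, the query 一石二鳥 is scored on 一石, 石二, and 二鳥, so a
    # target containing 一石 and 二鳥 scores 2/3, while a target that merely
    # reuses the same four characters in a different order scores much lower.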

I'm sure it would take a little bit of experimentation before it could
be implemented successfully, but if something like this were available
for glossary searches, I think it would make the results much more
relevant for matches of less than 100%. I don't know whether it would
produce any improvements for the longer strings found in translation
memories, although I tend to think it would, especially if different
settings could be provided for kanji and kana. In other words, a base
unit of two characters for kanji words, but a base unit of three or more
characters for hiragana and katakana, would probably increase the
relevance of search results in translation memories too.
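
As a rough illustration of what I mean by different base units for the
different scripts, something like the following could pick the n-gram
size per run of text. The Unicode ranges are standard, but the unit
sizes are just the assumptions described above:

    import itertools

    BASE_UNIT = {"kanji": 2, "hiragana": 3, "katakana": 3, "other": 1}

    def script_of(ch):
        code = ord(ch)
        if 0x3040 <= code <= 0x309F:
            return "hiragana"
        if 0x30A0 <= code <= 0x30FF:
            return "katakana"
        if 0x4E00 <= code <= 0x9FFF:
            return "kanji"
        return "other"

    def ngrams_by_script(text):
        """Split the text into same-script runs and take n-grams within each
        run, using that script's base unit (capped at the run's length)."""
        grams = []
        for script, run in itertools.groupby(text, key=script_of):
            chunk = "".join(run)
            n = min(BASE_UNIT[script], len(chunk))
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
        return grams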

Also, one more suggestion, ancillary to the first, is to give the user
the ability to customize the way that Felix lists (sorts) the results in
the glossary window. Sometimes I find that there are 100% matches way
down at the bottom of a long list of spurious matches. Allowing the user
to sort results by 前方一致 (prefix match), 後方一致 (suffix match), or
other conditions (à la Jamming) would also produce a considerable
improvement in usability.
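
For example, a user-configurable sort along these lines, with exact
matches first, then prefix matches, then suffix matches, then score.
The data shape and field names here are purely hypothetical:

    def glossary_sort_key(query):
        def key(hit):
            term, score = hit["term"], hit["score"]
            exact = term == query
            prefix = term.startswith(query) or query.startswith(term)
            suffix = term.endswith(query) or query.endswith(term)
            # False sorts before True, so negate the flags we want first.
            return (not exact, not prefix, not suffix, -score)
        return key

    hits = [
        {"term": "営業利益率", "score": 0.6},
        {"term": "利益", "score": 0.4},
        {"term": "営業利益", "score": 0.8},
    ]
    hits.sort(key=glossary_sort_key("営業利益"))
    # -> 営業利益 (exact), 営業利益率 (prefix match), 利益 (suffix match)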

Anyway, it would be nice if we could have some discussion about
improving search results and how they are displayed.

-----------------------------------------------------------------
Steven P. Venti
Mail: spv...@bhk-limited.com
Songs to Aging Children
http://www.youtube.com/profile?user=spventi&view=playlists
-----------------------------------------------------------------

Ginstrom IT Solutions (GITS)

Jun 15, 2009, 11:46:50 PM
to felix...@googlegroups.com
> [mailto:felix...@googlegroups.com] On Behalf Of Steven P. Venti

> What I would like to suggest to improve this situation is
> that the search paradigm for glossary items be enhanced so
> that the user can select the number of characters to use as a
> base unit.

Thanks for posting this to the list, Steven.

I'm now working on an improved search and replace feature for Felix, which
I'm planning to add to the 1.5 release. I think this is therefore a good
time to discuss glossary and other searches as well.

To describe how glossary matching works, and the order in which matches are
displayed, would make this post a bit long, so I've put this information
into a blog post:
http://felix-cat.com/blog/2009/06/16/how-glossary-matching-works-in-felix/

I'd characterize the matching algorithm that you proposed as based on
"closeness" or "stickiness." That is, we want to give more weight to
continuous runs of matching characters. I think that this is true for
languages like English as well as Japanese, because different letters at
the ends of words can mean that the words are still similar, while
different letters in the middle of words usually mean that the words are
different. (Of course this doesn't hold for languages that insert
material into the middle of words.)
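
As a very rough sketch of what I mean by stickiness, one option would be
to weight runs of consecutive matching characters more heavily than
scattered single-character matches. This is just one possible
formulation for the experiments, not how Felix currently scores matches:

    from difflib import SequenceMatcher

    def sticky_score(query, target):
        """Sum the squares of the lengths of matching runs, then normalize,
        so one run of four characters outweighs four scattered runs of one."""
        if not query:
            return 0.0
        blocks = SequenceMatcher(a=query, b=target).get_matching_blocks()
        weighted = sum(block.size ** 2 for block in blocks)
        return min(1.0, weighted / len(query) ** 2)

    print(sticky_score("ABCD", "xABCDx"))    # 1.0  -- one continuous run of 4
    print(sticky_score("ABCD", "AxBxCxDx"))  # 0.25 -- four scattered runs of 1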

Technically, I see no reason why this wouldn't be possible. I'm going to
plan some experiments to see how this would work in practice.

I'd appreciate any other feedback that users have about these features.

Regards,
Ryan

=================================
Ryan Ginstrom
Felix Translation Memory Software
sup...@felix-cat.com
http://felix-cat.com/
=================================

Charles Aschmann

Jun 16, 2009, 6:40:44 AM
to felix...@googlegroups.com
I like the way the concordance works in the TM part. I see the point of
the string weighting, but I advise caution in implementing any
complicated algorithm. This is what ruined the Trados concordance
search: they implemented some sort of algorithm that distinguished
between hiragana, katakana, and kanji, which broke all mixed-script
results and made it impossible to find many useful strings. My
conclusion was that, for Japanese, the dumber the better. You have to be
careful not to eliminate too much. Perhaps ordering the results by this
weighting principle would be better than filtering the search by it.

Charles Aschmann

Steven P. Venti

Jun 21, 2009, 7:27:43 PM
to felix...@googlegroups.com
Charles Aschmann <asch...@gmail.com> wrote:
> I advise caution in implementing any complicated algorithm.

The points I had hoped to discuss in this thread were:

1) Would it improve usability to use different algorithms for TM and
glossary searches?

2) What can be done to make glossary search results more usable (relevant)?

The latter point might even be rephrased to ask what level of
customization users would like to see in the way glossary searches are
performed and displayed.

Having said that, I would be curious to know if the lack of discussion
indicates that most users are satisfied with the way glossary searches
now work.

-----------------------------------------------------------------
Steven P. Venti
Mail: spv...@bhk-limited.com

Rockport Sunday
http://www.youtube.com/watch?v=bCPpd20CgXE
-----------------------------------------------------------------

Ginstrom IT Solutions (GITS)

Jun 21, 2009, 9:41:12 PM
to felix...@googlegroups.com
> [mailto:felix...@googlegroups.com] On Behalf Of Steven P. Venti
> 1) Would it improve usability to use different algorithms for
> TM and glossary searches?

The glossary-search algorithm is slightly different from the TM algorithm
now, but I do see your point. I think that adding a rule like "the first X
characters of the term must match" could be useful, and it wouldn't hurt as
an option. Another possibility is "cohesion" or stickiness -- giving better
scores to strings of continuous matching characters.
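
As a sketch, the prefix rule might work as an optional filter applied
before whatever base score is used. The parameter name here is just for
illustration:

    def passes_prefix_rule(query, term, prefix_len):
        """Require the first prefix_len characters of the glossary term and
        the query to agree (capped at the length of the shorter string)."""
        n = min(prefix_len, len(term), len(query))
        return query[:n] == term[:n]

    # With prefix_len=2, the term 営業利益 still matches the query 営業利益率,
    # but 利益率 (overlapping characters, different start) is filtered out.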

I'm planning to do some testing to see how this works out in practice.

> 2) What can be done to make glossary search results more
> usable (relevant)?
>
> The latter point might even be rephrased to ask what level of
> customization users would like to see in the way glossary searches
> are performed and displayed.

I think that allowing customization of the order in which results are
displayed could be very useful, and I plan to implement this feature in
an upcoming release.

Charles Aschmann

Jun 21, 2009, 10:22:09 PM
to felix...@googlegroups.com
Ginstrom IT Solutions (GITS) wrote:
> I think that adding a rule like "the first X
> characters of the term must match" could be useful, and it wouldn't hurt as
> an option. Another possibility is "cohesion" or stickiness -- giving better
> scores to strings of continuous matching characters.
>
> I'm planning to do some testing to see how this works out in practice.
>
A good bit of this depends on how people use glossaries, so I would
definitely make anything like this an option rather than a hardwired
part of the algorithm. If people are using glossaries in their simplest
form, for single-word matches, it makes more sense. However, if people
use phrase glossaries, it might make less sense. Treating a glossary
like a TM works very well: you fill it with common phrases that you
encounter and use it to just drop them in, rather than limiting it to
single words. But that might require a slightly different alignment of
things.

Charles Aschmann
