This means that what I really find most useful about Felix is its
ability to search both translation memories for past collocations and
glossaries for standard vocabulary. One issue I have found, however, is
that using the same search algorithm for both translation memories and
glossaries has its limitations.
It appears to a layman like myself that Felix calculates the "accuracy"
of Japanese-language search results simply by dividing the number of
characters in the search string that are also found in any given target
string by the total number of characters in the search string. So, if
there are 10 characters in the search string, and three of those
characters are found in the target string, then the accuracy is 30%.
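If that understanding is right, the current scoring could be sketched
like this (a guess at the behavior described above, not Felix's actual
code):

```python
def overlap_accuracy(query: str, target: str) -> float:
    """Fraction of query characters that also appear somewhere in target.
    Position and order are ignored -- which is the limitation at issue."""
    if not query:
        return 0.0
    common = sum(1 for ch in query if ch in target)
    return common / len(query)

# 3 of 10 query characters appear in the target -> 0.3 (30%)
print(overlap_accuracy("ABCDEFGHIJ", "xAxBxCx"))
# A completely reversed string still scores 100%:
print(overlap_accuracy("ABCD", "DCBA"))  # -> 1.0
```

Note how the second call shows the problem: the score is blind to
character order.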
While I find this approach more or less acceptable for searching
translation memories, I find it wholly inadequate for searching
glossaries, for the simple reason that no weight is placed on the
sequence or relative position of the characters within the string. As a
result, a relatively high accuracy setting filters out far too many
relevant hits, while a relatively low setting lets in far too many
spurious ones. The old "feast or famine" syndrome.
What I would like to suggest to improve this situation is that the
search paradigm for glossary items be enhanced so that the user can
select the number of characters to use as a base unit.
You probably already understand what I'm getting at, but just to explain
it more clearly:
At the moment, if the search string is ABCD, the search paradigm
apparently searches for all As, then all Bs, then all Cs, and finally
all Ds, which is to say, it uses a base unit of one character. What I'm
suggesting is that, if the user were to specify two characters as the
base unit, the paradigm should search for all ABs, all BCs, and all CDs.
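The suggestion above amounts to matching on overlapping n-grams rather
than single characters. A minimal sketch of what that might look like
(the function names are my own, purely illustrative):

```python
def ngrams(s: str, n: int) -> list[str]:
    """All overlapping substrings of length n, e.g. ABCD -> AB, BC, CD."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_accuracy(query: str, target: str, n: int = 2) -> float:
    """Fraction of the query's n-grams that occur in the target."""
    grams = ngrams(query, n)
    if not grams:
        return 0.0
    target_grams = set(ngrams(target, n))
    hits = sum(1 for g in grams if g in target_grams)
    return hits / len(grams)

print(ngrams("ABCD", 2))                  # -> ['AB', 'BC', 'CD']
print(ngram_accuracy("ABCD", "DCBA", 2))  # -> 0.0 (reversed string now fails)
print(ngram_accuracy("ABCD", "xABCDy", 2))  # -> 1.0
```

With a base unit of two, the reversed string that scored 100% under
character-level matching now scores zero, which is exactly the
discrimination being asked for.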
I don't know whether this kind of approach would make any difference
when searching in alphabet-based languages, but I think you will
recognize immediately that it would probably improve accuracy for
things like yoji-jukugo (four-character idioms), since four-character
strings in CJK languages are quite often parsable as two two-character
strings.
Obviously, this doesn't improve relevance for 100% matches, but I think
it would produce significantly more relevant results for matches that
are less than 100%, especially those in the 50% or higher range.
I'm sure it would take a little bit of experimentation before it could
be implemented successfully, but if something like this were available
for glossary searches, I think it would make the results much more
relevant for matches of less than 100%. I don't know if it would produce any
improvements for longer strings found in translation memories, although
I tend to think it would, especially if different settings could be
provided for kanji and kana. In other words, a base unit of two
characters for kanji words, but three or more characters for hiragana
and katakana, would probably increase the relevance of search results
in translation memories too.
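Choosing the base unit per script could be done from Unicode character
names. A sketch, with illustrative sizes following the suggestion above
(2 for kanji, 3 for kana); these defaults are assumptions, not Felix
settings:

```python
import unicodedata

def base_unit(ch: str) -> int:
    """Pick an n-gram size from a character's script:
    2 for kanji (CJK ideographs), 3 for hiragana/katakana, 1 otherwise."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return 2
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return 3
    return 1

print(base_unit("漢"))  # -> 2
print(base_unit("あ"))  # -> 3
print(base_unit("a"))   # -> 1
```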
Also, one more suggestion that is ancillary to the first is to provide
the user with the capability to customize the way that Felix lists
(sorts) the results in the glossary window. Sometimes I find that there
are 100% matches way down at the bottom of a long list of spurious
matches. Allowing the user to sort results by prefix match (前方一致),
suffix match (後方一致), or some other conditions (à la Jamming) would
also produce a considerable improvement in usability.
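One way such a user-configurable sort could work is a ranked key: exact
matches first, then prefix matches, then suffix matches, with score
breaking ties. A sketch (the tuple-of-(term, score) shape is an
assumption for illustration):

```python
def sort_key(query: str):
    """Build a sort key ranking hits: exact match, then prefix match
    (前方一致), then suffix match (後方一致), then the rest; ties are
    broken by score, highest first."""
    def key(hit):
        term, score = hit
        if term == query:
            rank = 0
        elif term.startswith(query) or query.startswith(term):
            rank = 1
        elif term.endswith(query) or query.endswith(term):
            rank = 2
        else:
            rank = 3
        return (rank, -score)
    return key

hits = [("国際化対応", 0.6), ("国際", 1.0), ("際限", 0.4)]
print(sorted(hits, key=sort_key("国際")))
# -> [('国際', 1.0), ('国際化対応', 0.6), ('際限', 0.4)]
```

This would keep a 100% match from landing at the bottom of a long list
of spurious hits.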
Anyway, it would be nice if we could have some discussion about
improving search results and how they are displayed.
-----------------------------------------------------------------
Steven P. Venti
Mail: spv...@bhk-limited.com
Songs to Aging Children
http://www.youtube.com/profile?user=spventi&view=playlists
-----------------------------------------------------------------
Thanks for posting this to the list, Steven.
I'm now working on an improved search and replace feature for Felix, which
I'm planning to add to the 1.5 release. I think this is therefore a good
time to discuss glossary and other searches as well.
To describe how glossary matching works, and the order in which matches are
displayed, would make this post a bit long, so I've put this information
into a blog post:
http://felix-cat.com/blog/2009/06/16/how-glossary-matching-works-in-felix/
I'd characterize the matching algorithm that you proposed as based on
"closeness" or "stickiness." That is, we want to give more weight to
continuous strings of characters. I think that this is true for languages
like English as well as Japanese, because different letters at the ends of
words could mean that the words are still similar, while different letters
in the middle of words usually mean that the words are different. (Of
course this doesn't hold for languages that insert things into the middle of
words.)
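One way to quantify that "stickiness" is to score by the longest run of
characters the two strings share. A minimal sketch under that
assumption (not the algorithm Felix uses):

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b,
    via a simple O(len(a) * len(b)) dynamic-programming pass."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def cohesion_score(query: str, target: str) -> float:
    """Fraction of the query covered by the single longest shared run."""
    if not query:
        return 0.0
    return longest_common_substring(query, target) / len(query)

print(cohesion_score("ABCD", "DCBA"))      # -> 0.25 (no run longer than 1)
print(cohesion_score("ABCD", "xxABCDyy"))  # -> 1.0
```

Continuous matches are rewarded and scattered single-character hits are
not, which is the behavior described above.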
Technically, I see no reason why this wouldn't be possible. I'm going to
plan some experiments to see how this would work in practice.
I'd appreciate any other feedback that users have about these features.
Regards,
Ryan
=================================
Ryan Ginstrom
Felix Translation Memory Software
sup...@felix-cat.com
http://felix-cat.com/
=================================
The points I had hoped to discuss in this thread were:
1) Would it improve usability to use different algorithms for TM and
glossary searches?
2) What can be done to make glossary search results more usable (relevant)?
The latter point might even be rephrased to ask what level of
customization users would like to see in the way glossary searches are
performed and displayed.
Having said that, I would be curious to know if the lack of discussion
indicates that most users are satisfied with the way glossary searches
now work.
-----------------------------------------------------------------
Steven P. Venti
Mail: spv...@bhk-limited.com
Rockport Sunday
http://www.youtube.com/watch?v=bCPpd20CgXE
-----------------------------------------------------------------
The glossary-search algorithm is slightly different from the TM algorithm
now, but I do see your point. I think that adding a rule like "the first X
characters of the term must match" could be useful, and it wouldn't hurt as
an option. Another possibility is "cohesion" or stickiness -- giving better
scores to strings of continuous matching characters.
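The "first X characters must match" rule could be as simple as a filter
pass before scoring. An illustrative sketch (the function name and the
default of 2 are my own, not a proposed Felix setting):

```python
def prefix_filter(query: str, terms: list[str], x: int = 2) -> list[str]:
    """Keep only glossary terms whose first x characters match the query's."""
    head = query[:x]
    return [t for t in terms if t[:x] == head]

print(prefix_filter("国際化", ["国際", "学際", "国際法"], 2))
# -> ['国際', '国際法']
```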
I'm planning to do some testing to see how this works out in practice.
> 2) What can be done to make glossary search results more
> usable (relevant)?
>
> The latter point might even be rephrased to ask what level of
> customization users like to see in the way glossary searches
> are performed and displayed.
I think that allowing customization of the order in which results are
displayed can be very useful, and plan to implement this feature in an
upcoming release.
Charles Aschmann