2. How can I understand your selection of sources for
translation? The user interfaces of Wikipedia.org and
Librarything.com are not listed on your code.google.com
page. Lots of translations could be derived from Wiktionary.
3. Inflected words, e.g. Swedish "senaste" and "senast",
which are true synonyms, yield entirely different
translations. Is the database lookup too literal, and
don't you need a synonym grouping step first?
--
Lars Aronsson (la...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
Start by downloading http://open-tran.eu/stuff/en.tar.gz and
uncompressing it. If you have any problems with it, let me know and we
will work out the best way to send you its contents.
The archive contains HTML files that need to be translated. UTF-8
encoding is recommended. If you prefer Latin1, just let me know and
I'll convert.
You can rename the files, if you wish, but their names need to be
composed of Latin characters - I need to be able to type the names of
the files on my keyboard :)
Update the contact information so that it no longer encourages people
to translate the site. I would also appreciate it if you added your
contact information for people who want to report problems in Swedish
(and a link to your site if you wish to be named in our Hall of Fame).
Finally, we need a translation of the phrase "Translation suggestions"
into Swedish. "Översättning Förslag"?
> 2. How can I understand your selection of sources for
> translation? The user interfaces of Wikipedia.org and
> Librarything.com are not listed on your code.google.com
> page. Lots of translations could be derived from Wiktionary.
All the sources are listed on the http://open-tran.eu/projects.html
page. Importing from any other source is currently out of the question
due to resource constraints.
> 3. Inflected words, e.g. Swedish "senaste" and "senast",
> which are true synonyms, yield entirely different
> translations. Is the database lookup too literal, and
> don't you need a synonym grouping step first?
You are right. While it's not a problem for English, it is an important
issue for highly inflected languages. However, I don't know of any
multi-language inflection dictionary that could be easily integrated
with the search.
Jacek
The Snowball stemmer supports a number of languages, including Swedish.
http://snowball.tartarus.org/texts/stemmersoverview.html
Integrating this with the search would probably require a pretty loose
definition of "easily" (and an even looser definition of "dictionary" to
include algorithmic approaches) to meet your criteria.
There was a Google Summer of Code proposal idea a couple of years back
to integrate Snowball with the Pootle/Translate Toolkit but no student
was brave enough to take it on.
@alex
--
mailto:alex....@mac.com
I tried feeding the demo with "mice" and "women", but apparently it
didn't like them :)
But I had no idea about this project and now it seems very interesting.
I'll poke around with it.
Thanks!
Jacek
That source code looks rather old and rough: it makes
special defines for Latin-1 letters with umlauts instead
of just using UTF-8 as it is. Perhaps nobody has cared much
about this software in the last ten years? I guess today
you would just list some regexp patterns that can be
used in Perl, PHP or Python. Inventing a separate
language for such a limited task seems like overkill.
Let's instead focus on the set of rules for stemming.
The rules given here would stem "senaste" and "senast"
into "sen" (i.e. "latest" into "late"), which is clearly
too aggressive. It would lead to false translations.
However, the regex s/aste$/ast/ would be useful.
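To make the s/aste$/ast/ idea concrete, here is a minimal Python sketch
of a rule list applied to a search term. Only the one rule comes from
this thread; the RULES structure and the normalize() name are
assumptions made for illustration, not anything Open-Tran or Snowball
actually uses.

```python
import re

# Hypothetical rule list: (pattern, replacement) pairs, tried in order.
# Only the first rule is taken from the thread; the rest is illustrative.
RULES = [
    (re.compile(r"aste$"), "ast"),
]

def normalize(word: str) -> str:
    """Lowercase the word and apply the first suffix rule that matches."""
    word = word.lower()
    for pattern, replacement in RULES:
        new_word, count = pattern.subn(replacement, word)
        if count:
            return new_word
    return word

for w in ("senaste", "senast", "sen"):
    print(w, "->", normalize(w))
```

With this single rule, "senaste" and "senast" end up under the same
key, while "sen" stays separate, which is the conservative behaviour
asked for above.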
The usefulness of the ruleset needs to be tested on some
test vocabulary. In the case of open-tran, that would be
all the input that users enter in the search box. If
they never enter "senaste", we don't need a stemmer for
that. So, are the search expressions logged and could
that log be published? Or would that violate the privacy
of the users? The second best alternative for a test
vocabulary would be the stored translations.
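A rough way to run such a test, sketched in Python: group a vocabulary
by stem and list every group where more than one word was merged, so a
human can check for false merges. The SAMPLE list is a made-up stand-in
for the search log or the stored translations, and aggressive() is only
a toy rule that mimics over-stemming, not the actual Snowball ruleset.

```python
import re
from collections import defaultdict

def conservative(word):
    # The single rule suggested in this thread: s/aste$/ast/
    return re.sub(r"aste$", "ast", word.lower())

def aggressive(word):
    # Toy stand-in for an over-eager ruleset (NOT the real Snowball rules):
    # strips both endings, so "senaste" and "senast" collapse into "sen".
    return re.sub(r"(aste|ast)$", "", word.lower())

def merged_groups(vocabulary, stem):
    """Map each stem to the words it absorbed, keeping only real merges."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[stem(word)].add(word.lower())
    return {s: sorted(ws) for s, ws in groups.items() if len(ws) > 1}

# Hypothetical sample; a real test would read logged queries instead.
SAMPLE = ["senaste", "senast", "sen", "fil", "filer"]

for name, stem in (("conservative", conservative), ("aggressive", aggressive)):
    print(name, merged_groups(SAMPLE, stem))
```

Groups that mix words with different meanings, such as "sen" together
with "senast", point at rules that are too aggressive for this
vocabulary.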
Actually, I don't think this is so much of an overkill. Snowball
stemmers have been integrated with NLTK and I am working on using them
in Open-Tran. One important side effect will be a smaller database,
because I expect the number of distinct words to decrease.
Jacek
Fixed :)
Jacek