[gcd-software] I18N / L10N

Alexandros Diamantidis

May 21, 2010, 4:42:57 PM
to gcd-softwar...@googlegroups.com
Since Henry asked me specifically about internationalization, I'll post
some thoughts about the topic and we can discuss from here.

First of all, in an older message I'd mentioned that INDUCKS has some
nice i18n capabilities: For example, take a look at a character's page
there:

http://coa.inducks.org/character.php?c=JOC

Apart from all the other links to information, you can see a list of all
names used in various languages the character has appeared in. The same
piece of data appearing in different languages is quite frequent when
comics are translated.

In fact, the same character can have different names in the same
language depending on the translation. For example, older translations
of Asterix comics in Greek used different names for the secondary
characters than the newer translations, and as for Disney comics, there
are characters there that have appeared with more than five different
translated names through the years. This applies to many other
translated comics as well, those are just some examples.

Now, the same obviously goes for creator names as well. Since a creator
can appear with different names even in the same language (aliases,
different spellings, etc), I think using the same mechanism here as well
would be nice.

In the two cases above, I think there should be a way to mark a name as
primary or canonical for a specific language.
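
To make the idea concrete, here is a minimal Python sketch of name
variants with one primary name per language. The class and field names
(NameVariant, NamedEntity, is_primary) are illustrative assumptions, not
GCD's or INDUCKS's actual schema, and the names used are placeholders.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class NameVariant:
    text: str
    language: str            # language code such as "el" or "en"
    is_primary: bool = False


@dataclass
class NamedEntity:
    """A character or creator with any number of names per language."""
    key: str
    names: List[NameVariant] = field(default_factory=list)

    def add_name(self, text: str, language: str, primary: bool = False) -> None:
        if primary:
            # Keep at most one primary name per language: demote the others.
            self.names = [
                replace(n, is_primary=False) if n.language == language else n
                for n in self.names
            ]
        self.names.append(NameVariant(text, language, primary))

    def all_names(self, language: str) -> List[str]:
        return [n.text for n in self.names if n.language == language]

    def primary_name(self, language: str) -> Optional[str]:
        candidates = [n for n in self.names if n.language == language]
        for n in candidates:
            if n.is_primary:
                return n.text
        # No name marked primary: fall back to the first one recorded.
        return candidates[0].text if candidates else None


# Placeholder data: two Greek translations of the same character's name.
character = NamedEntity("character-1")
character.add_name("Old Name", "el", primary=True)
character.add_name("New Name", "el", primary=True)   # newer translation wins
```

The same structure would serve creators, where the variants are aliases
and alternative spellings rather than translations.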

For other fields, we currently mostly have them as text in the book's
native language. This applies to format, genre, notes (both series,
issue and story level notes, as well as parenthetical notes on creators,
characters, prices, dates, etc). This is good, in that it can support a
language-specific subcommunity in GCD, but makes things more difficult
for anyone searching for info originating in a language they're not
fluent in (for example, while searching for info on a creator's foreign
work, or on reprints of some story). Automatic translation can help, but
it would be worthwhile, I think, to have a way to allow translations for
parts of the text stored in the database, and not only for the web site,
for example, story titles or synopses: any story reprinted in a
different language will be able to have translations of those fields
stored both in the original and in the reprint language, but I think it
would be useful to allow this even for stories that haven't been
reprinted anywhere.

Now, there are also cases where something may not be usefully
translatable. For example, country-specific genres, publication formats,
etc. This probably applies more to small tags or labels and not to
longer text like titles.

Moving on to some other points raised in the agenda:

Presenting the site differently for different languages or communities:
I don't think much is needed here, if at all. From what I've seen from
sites in other languages (blogs, wikis etc), whatever differences depend
on language or country arise organically, from the behaviour of
different communities, and not so much from technical implementation
differences.

There are areas where GCD's data organisation doesn't fit so well
because of differences in how separate comics "scenes" operate, and I
think these differences can be as great or greater within one country
than across countries. For example, Henry mentioned how creator roles
are different in the production of manga, but actually creator roles are
equally different in the production of single-creator self-published
minicomics, translated European comics albums, comic strip collections,
and so on. So, I think that if we solve the problem of recording,
completely but without undue strain on the indexers, the whole spectrum
of different models used, for instance, in the USA, it will be easy to
extend this to Japan or wherever else.

Now, moving on from the easy stuff... ;-) I don't know if you remember
a discussion on GCD and oriental languages a couple of years ago...
required reading for anyone pondering the i18n aspects of the project:
(Marc Miyake's messages in particular)

http://groups.google.com/group/gcd-main/browse_thread/thread/63edaaa0bf324889/d4b649edc99784d8

Alexandros

Henry Andrews

May 22, 2010, 3:07:13 PM
to gcd-softwar...@googlegroups.com
Hi Alexandros- great ideas here, moving the I18N discussion beyond just basic translation and localization of the site. This is the sort of stuff we should be considering on the multi-year time scale. The name translation feature from INDUCKS and your related translated data field ideas seem like valuable features to me, with a broad impact on how we should structure things to support different translations of data. I'm not quite sure what exact structural changes are needed but we don't need to solve that in this committee- we do need to discuss and decide on what results we want those changes to produce.

A few more specific comments below:

> Presenting the site differently for different
> languages or communities: I don't think much is needed here, if at all. From
> what I've seen from sites in other languages (blogs, wikis etc), whatever
> differences depend on language or country arise organically, from the
> behaviour of different communities, and not so much from technical
> implementation differences.


You are probably right that there's a recruiting component that's more important here.

> There are areas where GCD's data
> organisation doesn't fit so well because of differences on how separate
> comics "scenes" operate, and I think these differences can be as great or
> greater in the same country than cross-country. For example, Henry mentioned
> about how creator roles are different in the production of manga, but
> actually creator roles are equally different in the production of
> single-creator self-published minicomics, translated European comics albums,
> comic strip collections, and so on. So, I think that if we solve the problem
> of recording completely but without undue strain on the indexers all the
> spectrum of different models used, for instance, in the USA, it will be easy
> to extend this to Japan or wherever else.


Good comparison. The flexible model of credits we have planned in the "New Fun" schema should make this example fairly easy to solve.

> Now, moving on from the easy
> stuff... ;-) I don't know if you remember a discussion on GCD and oriental
> languages a couple of years ago... required reading for anyone pondering the
> i18n aspects of the project: (Marc Miyake's messages in
> particular)
>
> http://groups.google.com/group/gcd-main/browse_thread/thread/63edaaa0bf324889/d4b649edc99784d8



Yes I remember that thread very well. While Marc raises a number of worthwhile points, his relentlessly negative conclusions are not warranted. And quite frankly really piss me off (and did at the time as well, but as we couldn't even implement unicode support at the time I decided not to bother replying in depth). If we break down his concerns, it looks like this:

* Character encodings in Asia are complicated.
-> This part's easy. All of the tools we use support dealing with this.

* Localizing the site is hard.
-> Technically, we just need to make sure the layout still works with character sets that use larger/blockier/more complex characters than those found in the Latin (or Greek or Cyrillic) alphabets. I tried to make the layouts pretty robust to that, and we'd like to improve the design anyway. This is not all that hard. Supporting right-to-left languages is harder, but even that's not bad with a decent set of CSS files.
-> Recruiting volunteers to do this will be challenging, but it's similar for any language we want to localize that doesn't have some really active volunteers here already. We just need to go out and do some recruiting. We don't need a huge community of folks to localize- just one or two dedicated bilingual folks that will stick with it long enough to get us started. Two per language being ideal so that they can check each other's work.

* Multiple ways of writing things in Asian languages (or transliteration to Latin alphabets, or other alphabets) makes everything complicated.
-> This is no more complicated than the scenario Alexandros just pointed out at INDUCKS. Yes, there are more options, but if we support a concept of localized names at all, supporting the full complexity of Asian language localization is not substantially more difficult.

* Marc thinks it's a big project needing lots of international volunteers and he's "intimidated"
-> This is where I got mad. This is entirely Marc's problem, and not the GCD's. If folks had had this attitude years ago, the GCD would never have been started at all. It's not like there aren't a ton of "Western" comics. Yes, we'll need to recruit the right people, and it will never be a large resource if we don't do that. And we'll most likely have to deal with some existing projects that are way ahead of us in those languages. That's also an issue in Greece IIRC. And no doubt in many other countries.

But really, Marc's negativity boils down to his own problems in not wanting to tackle a large complex problem. None of the issues he brings up are insurmountable. None of the technical ones are all that hard- people have been solving these specific problems regularly for a while now. The recruiting challenges are larger, but no different than what we've ever faced. (I'm not sure why Marc thinks there aren't enough English-speakers to serve as a bridge to non-English speakers. I've certainly met plenty, and the ability of software firms to outsource to places like China is firm evidence of a base of technical English-speakers which is what we would need to get started.)

So yes, Marc raises all of the right points. But his conclusions are just based on his own evaluation that large projects are hopeless, which is completely contrary to the spirit of the GCD in the first place.

thanks,
-henry

Alexandros Diamantidis

May 24, 2010, 4:18:04 PM
to gcd-softwar...@googlegroups.com
* Henry Andrews [2010-05-22 12:07]:
> I'm not quite sure what exact structural changes are needed but we
> don't need to solve that in this committee- we do need to discuss and
> decide on what results we want those changes to produce.

Right, I don't think it's possible to see exactly how something like
this should best be implemented before you try to implement it ;-) and the
current code and db schema will need a few more iterations to offer the
necessary infrastructure. Our older release schedule doesn't apply any
more, but the "Add creator and character tables and tools to assist in
migration." goal from "The Dandy" is a prerequisite.

> > So, I think that if we solve the problem of recording completely but
> > without undue strain on the indexers all the spectrum of different
> > models used, for instance, in the USA, it will be easy to extend
> > this to Japan or wherever else.
>
> Good comparison. The flexible model of credits we have planned in the "New Fun" schema should make this example fairly easy to solve.

Yeah, that's what I was thinking about. Not trivial, especially if you
consider searching and migrating existing data, but not impossible
either.

> While Marc raises a number of worthwhile points, his relentlessly negative conclusions are not warranted. And quite frankly really piss me off

I agree with what you're saying, but you shouldn't be pissed-off! Given
the time that discussion took place, and how GCD's development was
progressing then, Marc was justified in being intimidated by the scope
of what will be needed.

> We just need to go out and do some recruiting. We don't need a huge community of folks to localize

Right, and I think as long as progress (even slow) is not halted, the
people who can help will find GCD and get involved. So while some direct
recruiting might not hurt, I believe the most important thing is to continue
to improve GCD and publicize it and any improvements through various
channels.

> the ability of software firms to outsource to places like China is firm evidence of a base of technical English-speakers which is what we would need to get started.

That's somewhat different - of course there are competent technical
English-speakers in China (or India, Eastern Europe, Japan, etc) but
they're probably more willing to talk to you when it's about a business
rather than a non-profit project. On the other hand, people from those
somewhat isolated countries join open-source projects all the time (not
in great numbers, but enough to make a difference) so I'm also
optimistic about GCD expanding there, too.

By the way, there is a commercial manga database:

Post-War Japanese Shōnen and Shōjo Magazine Database
http://manga-db.fms.co.jp/bgmag/

It offers some basic info for free, and then there is a scale of further
pay options - see Google's translation here of the options on offer:

http://translate.googleusercontent.com/translate_c?hl=el&ie=UTF-8&sl=ja&tl=en&u=http://manga-db.fms.co.jp/bgmag/purchase.aspx&prev=_t

As for French comics, the most complete DB I'm aware of is Bedetheque:
http://www.bedetheque.com/

It's associated with a commercial comics collection tracking app (BD
Gest').

Alexandros

Lionel English

May 24, 2010, 7:03:25 PM
to gcd-softwar...@googlegroups.com
On Fri, May 21, 2010 at 1:42 PM, Alexandros Diamantidis <ad...@hellug.gr> wrote:
> For other fields, we currently mostly have them as text in the book's
> native language. This applies to format, genre, notes (both series,
> issue and story level notes, as well as parenthetical notes on creators,
> characters, prices, dates, etc). This is good, in that it can support a
> language-specific subcommunity in GCD, but makes things more difficult
> for anyone searching for info originating in a language they're not
> fluent in (for example, while searching for info on a creator's foreign
> work, or on reprints of some story). Automatic translation can help, but
> it would be worthwhile, I think, to have a way to allow translations for
> parts of the text stored in the database, and not only for the web site,
> for example, story titles or synopses: any story reprinted in a
> different language will be able to have translations of those fields
> stored both in the original and in the reprint language, but I think it
> would be useful to allow this even for stories that haven't been
> reprinted anywhere.
>
> Now, there are also cases where something may not be usefully
> translatable. For example, country-specific genres, publication formats,
> etc. This probably applies more to small tags or labels and not to
> longer text like titles.
 
Question:  There is a "rule" in the GCD that comics should be indexed in their original language, but it's always also been an option to index the comic in English, meaning you can't reliably predict what language the notes/synopsis/etc are actually indexed in.  Does this present a problem in terms of translation?  For example, say a French comic was indexed in English, and we want to translate it to German.  Are our (theoretical) tools smart enough to realize that the existing text is actually in English rather than French?
 
For Types and Genres, I know there are some genres (I'm thinking specifically of manga) where the original language genre name is used in other countries, rather than a translated name.  For example, yaoi manga are more commonly known in the US as yaoi rather than as ... "gay love stories for women" or whatever an appropriate translation might be.  So genre in particular might be an area where we want to have direct control over the translation rather than relying on tools that automatically translate.

--
Lionel English
lio...@beanmar.net

Alexandros Diamantidis

May 24, 2010, 9:20:08 PM
to gcd-softwar...@googlegroups.com
* Lionel English [2010-05-24 16:03]:
> There is a "rule" in the GCD that comics should be indexed in their
> original language, but it's always also been an option to index the
> comic in English, [...]
> Does this present a problem in terms of translation?

Well, of course: if you want to present the users a piece of text as
German when they ask for that language, you have to know it's actually
German ;-) So, we either need a way to explicitly tag text in the DB
with language information, or some set of conventions to tell reliably
the language in the absence of tagging.

> Are our (theoretical) tools smart enough to realize that the existing
> text is actually in English rather than French?

There is existing code that can identify the language from the
statistical properties of the text. I've bookmarked TextCat
(http://odur.let.rug.nl/~vannoord/TextCat/) which also has a Python
implementation. But we don't want to always try to identify the
language, we should do it once when we reach the stage when we offer the
option of having the same index in multiple languages, and thereafter
have the indexers specify what language they're working in.
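
A rough Python sketch of that one-time pass, under stated assumptions:
records are plain dicts with "text" and "language" keys (a made-up
layout, not GCD's schema), and the detector is passed in as a function.
The toy_detect function is only a placeholder for a real statistical
identifier such as TextCat; it is not how TextCat works.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def backfill_language(records: List[Record],
                      detect: Callable[[str], str]) -> int:
    """Tag records that have text but no language; return how many were tagged.

    Records that already carry a language (for example, one chosen by an
    indexer) are never touched, so running the pass again is harmless.
    """
    tagged = 0
    for record in records:
        if record.get("language") is None and record.get("text"):
            record["language"] = detect(record["text"])
            # Remember that this tag is a guess, so it can be reviewed later.
            record["language_source"] = "detected"
            tagged += 1
    return tagged


def toy_detect(text: str) -> str:
    """Placeholder detector: NOT a real language identifier."""
    return "fr" if " le " in f" {text.lower()} " else "en"


notes = [
    {"text": "Le chat dort", "language": None},
    {"text": "Reprinted from an earlier issue", "language": "en"},
]
backfill_language(notes, toy_detect)
```

Marking guessed tags as "detected" keeps them distinguishable from
languages that indexers specify explicitly afterwards.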

Probably the language an indexer is using should be specified implicitly
but with a way to override when needed - they shouldn't constantly be
asked for this when 99% of the time they'll have some personal
preferences. For example, everything in English, or everything in
English except for Greek books where I always use Greek, or maybe for
Greek books I want to enter synopses and notes both in Greek and
English (the last would be nice for me personally - I don't know how
many others would want to do that though). Always having the option of
English is good, by the way, since it's the current international
language...
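
As a sketch of what "implicit with an override" could look like, the
Python below resolves the entry language from an indexer's stored
preferences. IndexerPrefs and entry_language are hypothetical names used
only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IndexerPrefs:
    # Language used when nothing more specific applies.
    default_language: str = "en"
    # Exceptions keyed by the book's language, e.g. {"el": "el"} for
    # "everything in English except Greek books, where I use Greek".
    per_book_language: Dict[str, str] = field(default_factory=dict)


def entry_language(prefs: IndexerPrefs,
                   book_language: str,
                   explicit: Optional[str] = None) -> str:
    """Pick the language a new note or synopsis will be tagged with."""
    if explicit is not None:
        # The indexer overrode the default for this one entry.
        return explicit
    return prefs.per_book_language.get(book_language, prefs.default_language)


prefs = IndexerPrefs(default_language="en", per_book_language={"el": "el"})
greek_book_choice = entry_language(prefs, "el")
italian_book_choice = entry_language(prefs, "it")
```

Entering the same synopsis in two languages would simply mean two
entries, one using the implicit choice and one with an explicit override.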

> For Types and Genres, I know there are some genres (I'm thinking
> specifically of manga) where the original language genre name is used in
> other countries, rather than a translated name. For example, yaoi manga are
> more commonly known in the US as yaoi rather than as ... "gay love stories

Right - another example is the specifically Italian genre of "giallo" (a
mix of crime, mystery and horror), or the French "polar" (a mix of
police and noir). I don't think this will be much of a problem if we
implement multilingual Types and Genres. We can just display the
language requested by the user, with a fall-back to the book's language
or to English.

For example, for a common genre like, say, "Superhero": Let's say I'm an
English speaker and I end up in an Italian book where the Italian
indexer specified "supereroi" - no problem, it's displayed in English
because at some point we specified these as the same genre in two
different languages.

Now let's say I'm still browsing in English, and I view an Italian book
where the indexer specified the genre "giallo", and there is no English
version of this genre. I'll get the original Italian word - it might be
a bit confusing, but there's no great harm done, and I might just Google
it and learn its meaning.

Now, let's say I'm browsing in Greek, and I end up in the index of a
Japanese manga where the Japanese indexer selected "やおい" (yaoi) as
the genre. Furthermore, let's say the database contains "Yaoi" as the
English translation of this genre, but no Greek translation.

We have two options: show the original Japanese genre, or fall back to
the English translation as more readable to an international audience.
I'm not sure what's the best course... Browsers can be configured to
send an "Accept-Language" header, by which they specify a list of
languages to the server in order of preference. So, if I've configured
my browser correctly to ask preferably for Greek, then English, then
French, the site can deduce what to do. I'm not sure we want to honour
this preference by field in this case, though - it seems like it would
take an awful lot of coding work for not much gain.
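
For illustration, here is a small Python sketch of the fallback chain
discussed above. The function names and the dict-of-translations layout
are assumptions, and trying English before the book's own language is
just one of the two orderings mentioned, not a decision. The header
parsing is simplified: it reads q-values but matches tags exactly (no
"en-US" to "en" matching, no wildcard handling).

```python
from typing import Dict, List, Optional


def parse_accept_language(header: str) -> List[str]:
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0                      # default weight when q is absent
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0              # unreadable weight: skip the tag
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            weighted.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(weighted)]


def display_label(translations: Dict[str, str],
                  book_language: str,
                  preferred: List[str],
                  fallback: str = "en") -> Optional[str]:
    """Pick a label: user's languages, then English, then the original."""
    for language in list(preferred) + [fallback, book_language]:
        if language in translations:
            return translations[language]
    return None


# The three cases from the examples above, for a reader whose browser
# asks for Greek, then English, then French.
wanted = parse_accept_language("el, en;q=0.8, fr;q=0.5")
yaoi = display_label({"ja": "やおい", "en": "Yaoi"}, "ja", wanted)
giallo = display_label({"it": "giallo"}, "it", wanted)
superhero = display_label({"en": "Superhero", "it": "supereroi"}, "it", ["en"])
```

Swapping the last two entries of the fallback list gives the other
ordering, where the original-language label is shown before English.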

An example with creator names: Vasilis Lolos has published books with
US publishers, where he was credited like this, but also with Greek
publishers, where he was credited with the native version of his name
("Βασίλης Λώλος"). Should we always show his name as it appeared on the
book, or always in the preferred language of the user? And what if
someone is using the site in German? Do we display the English or the
Greek version? Maybe we should always ask indexers to enter a Latin
transliteration of the name when entering a new creator whose name isn't
written in the Latin alphabet...

I think we should revisit these questions when and if we implement
multilingual display for those fields, after seeing if it's a problem in
practice.

Alexandros

Lionel English

May 24, 2010, 11:33:37 PM
to gcd-softwar...@googlegroups.com
On Sat, May 22, 2010 at 12:07 PM, Henry Andrews <hh...@cornell.edu> wrote:
> Hi Alexandros- great ideas here, moving the I18N discussion beyond just basic translation and localization of the site.  This is the sort of stuff we should be considering on the multi-year time scale.  The name translation feature from INDUCKS and your related translated data field ideas seem like valuable features to me, with a broad impact on how we should structure things to support different translations of data.  I'm not quite sure what exact structural changes are needed but we don't need to solve that in this committee- we do need to discuss and decide on what results we want those changes to produce.

 
We've talked in the past about how hard it will be to normalize the credits and character appearances data.  And how we'll most likely want to start by building the creator and character tables independently and then hook them up to the main database.

I think the hooking them up part might be outside the time frame of the current discussion (it's a *very* long term goal; I don't think we'll get there that quickly).  Do we think that the creation of the creator table (which is simpler than the character table)  is something that's within our time frame and within the scope of this committee?  It seems like it would give us a nice sandbox for some new features, such as working out some of our I18N / L10N issues.

--
Lionel English
lio...@beanmar.net