Chinese language translation

drzax

Jan 2, 2013, 2:17:44 AM
to silverst...@googlegroups.com
We've recently been building a site which uses the translation module and has translations in both Simplified and Traditional Chinese. This process has taught me a lot about translations, and Chinese translation in particular, but I'm still far from being an expert on the subject. However, there seem to be significant deficiencies in the Translation module which relate specifically to Chinese.

Chinese is a complicated use case for translation because there are multiple versions of both the written and spoken language and there are additional complicating geographic inconsistencies. This makes the standard way of defining translations (the combination of language and region) less than ideal. This seems to be a reasonable summary. The W3C recommends using zh-Hant and zh-Hans for the lang attribute of (x)html content which currently isn't really supported.

As I said at the top, I'm far from being an expert on this topic, but I wanted to open the discussion and see if anyone with more expertise would weigh in, because I think this aspect of the translation module could really do with some work. It certainly would have made our current project easier, and Chinese translations are becoming a more regular requirement in our studio.

Cheers,
Simon

Ingo Schommer

Jan 2, 2013, 3:07:30 AM
to silverst...@googlegroups.com
Hey Simon, how about just overriding SiteTree->MetaTags() in your Page class?
Or for HTML5 usage with <html lang="…">, create a getter in ContentController
which uses SiteTree->Locale and transforms it accordingly.

I don't really understand why you can't use the correct locales in the first place though? 
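
The getter idea above can be sketched roughly as follows. This is a language-neutral illustration in Python, not SilverStripe code: the function name `html_lang` and the `SCRIPT_OVERRIDES` table are hypothetical, and the zh_CN/zh_TW mappings are a per-site assumption rather than anything the framework defines.

```python
# Hypothetical per-site table: stored locale -> value for the HTML lang attribute.
# These two entries are assumptions for illustration only.
SCRIPT_OVERRIDES = {
    "zh_CN": "zh-Hans",
    "zh_TW": "zh-Hant",
}


def html_lang(locale):
    """Return a value for <html lang="..."> from a stored locale such as 'en_US'."""
    if locale in SCRIPT_OVERRIDES:
        return SCRIPT_OVERRIDES[locale]
    # Default: HTML language tags use a hyphen where the stored locale uses an underscore.
    return locale.replace("_", "-")
```

Whether a given region should map to Simplified or Traditional is a content decision, which is why the table is kept separate from the default transformation.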

--
You received this message because you are subscribed to the Google Groups "SilverStripe Core Development" group.
To view this discussion on the web visit https://groups.google.com/d/msg/silverstripe-dev/-/XPZiXTfkvVsJ.
To post to this group, send email to silverst...@googlegroups.com.
To unsubscribe from this group, send email to silverstripe-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/silverstripe-dev?hl=en.

drzax

Jan 3, 2013, 2:39:46 AM
to silverst...@googlegroups.com
Hi Ingo,

Thanks for the reply. Just to be clear, I'm not looking for technical assistance here per se - this issue is solved as far as my current project is concerned. I guess my point was that it would be good to address this in a more generalised way. It would be good to support translation in both simplified and traditional Chinese out of the box and in the way recommended by W3C.

Simply adding new locales as suggested in the docs doesn't quite do it unfortunately. For example, the Translatable->getContentLanguage function won't work when using zh-Hant and zh-Hans as a locale. I think it's also worth noting that they aren't actually locales so simply adding them in this way mislabels them and potentially adds confusion to the code. I'm not sure how they should be described, but they're more like identifiers of a written language.

Simply using a locale to describe what language the document is in (which is sufficient in most cases and seems to be a fairly standard practice) just doesn't really work for Chinese. I realise this isn't a problem unique to Silverstripe, and I haven't done a great deal of research into how others are solving it.

I just wanted to raise the topic here because I think it's an issue worth examining in more depth (maybe others don't agree?). I've submitted the odd patch here and there for the Translation module, but I don't know enough about i18n or translation to be unilaterally making the kinds of changes to the translation module code which I think are required to solve this - hence opening the discussion on the Core Dev list.

Cheers,
Simon

Ingo Schommer

Jan 3, 2013, 2:58:45 AM
to silverst...@googlegroups.com
On 3/01/2013, at 8:39 AM, drzax <si...@elvery.net> wrote:

> Simply adding new locales as suggested in the docs doesn't quite do it unfortunately. For example, the Translatable->getContentLanguage function won't work when using zh-Hant and zh-Hans as a locale. I think it's also worth noting that they aren't actually locales so simply adding them in this way mislabels them and potentially adds confusion to the code. I'm not sure how they should be described, but they're more like identifiers of a written language.
There's no Translatable->getContentLanguage(), so I can't comment on that issue.
Regardless of whether zh-Hans (vs. zh-CN etc.) is a "locale" in the strictest sense,
the i18n locales in SS are stored as simple key-value maps, so they should in theory
support all kinds of arbitrary values.

Date/time format auto-detection won't work (accurately) with arbitrary locales,
since Zend follows its own locale naming. I'd suggest that you set the formats manually in this case, via i18n::set_date_format() and i18n::set_time_format().

String translations through _t() have the same issue, just with PHP file naming.
There's no solution at the moment, although using the language tag ("zh")
in 3.0 by default rather than the combined locale ("zh_CN") makes a better fallback.
We could use some kind of "fallback lookup" system in core, true.

Did I miss anything? I'm still not sure I understand your problem as
far as actual technical restrictions in SS go...

Sam Minnée

Jan 6, 2013, 4:35:21 PM
to silverst...@googlegroups.com
> Thanks for the reply. Just to be clear, I'm not looking for technical assistance here per se - this issue is solved as far as my current project is concerned. I guess my point was that it would be good to address this in a more generalised way. It would be good to support translation in both simplified and traditional Chinese out of the box and in the way recommended by W3C.

So, if I understand correctly, this means that we need to have a way of using "zh-Hant" and "zh-Hans" as locale identifiers? In addition to that, the use of these locales would need to failover to "zh_CN" and/or "zh".

- Are "zh-Hant" and "zh-Hans" recognised by Zend as locales?
- Can we add locales that aren't supported by Zend's i18n stuff?
- Can we set up customised "failover chains" more complex than "XX_YY fails over to XX fails over to en_US fails over to en", which I believe is the current hard-coded default?
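
To make the third question concrete, here is a small sketch of the default chain as described above. It is written in Python purely for illustration and assumes the hard-coded default really is "XX_YY, then XX, then en_US, then en"; `default_failover_chain` is a hypothetical name, not an existing function.

```python
def default_failover_chain(locale):
    """Build the assumed default chain: XX_YY -> XX -> en_US -> en, without duplicates."""
    # The language part is whatever precedes the first underscore.
    candidates = [locale, locale.split("_")[0], "en_US", "en"]
    chain = []
    for candidate in candidates:
        if candidate not in chain:
            chain.append(candidate)
    return chain
```

Under that assumption "zh_CN" reaches "zh" before falling through to English, whereas a hyphenated identifier such as "zh-Hant" has no underscore to split on and so skips "zh" entirely, which is the gap a customised chain would need to close.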

> Simply adding new locales as suggested in the docs doesn't quite do it unfortunately. For example, the Translatable->getContentLanguage function won't work when using zh-Hant and zh-Hans as a locale. I think it's also worth noting that they aren't actually locales so simply adding them in this way mislabels them and potentially adds confusion to the code. I'm not sure how they should be described, but they're more like identifiers of a written language.
> Simply using a locale to describe what language the document is in (which is sufficient in most cases and seems to be a fairly standard practice) just doesn't really work for Chinese. I realise this isn't a problem unique to Silverstripe, and I haven't done a great deal of research into how others are solving it.

In principle, both the core i18n system and the translatable module should be able to use arbitrary strings to identify their translations. For example, we had something along the lines of "en-lolcats". When we based our i18n code on Zend, however, I believe this got locked down. I would expect that the best way of solving this would be to have a way of adding additional strings to the Zend list of legal locales.

Right now, a locale string is used to define both the language a site is presented in, and a bunch of other settings like date/time formats. We don't need to change that if we set up more sophisticated Chinese configuration by having more verbose locale strings. For example, zh-Hant-CN, zh-Hant-TW, zh-Hans-CN, zh-Hans-TW. We would need some kind of getter that returned the relevant value of the HTML LANG attribute, since this would no longer match the locale string.

Alternatively, we can have set-locale and set-lang as separate settings, along with (presumably) set-time-format, set-date-format, set-number-format, etc. A locale would provide defaults for all of these things, but allow for them to be overridden. For simplicity, a locale string can always be used to define a lang (en_US vs en_GB being the most well-known example of this), but the lang may also be another string - for Chinese, we would have zh_CN, zh_TW, zh-Hant, and zh-Hans options.
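
A rough sketch of this "locale supplies defaults, individual settings can be overridden" idea, in Python for illustration only. The function `resolve_settings`, the setting keys, and the format string in the usage note are all hypothetical placeholders, not existing SilverStripe configuration or CLDR data.

```python
def resolve_settings(locale, locale_defaults, overrides=None):
    """Merge settings in order: derived lang, then per-locale defaults, then overrides."""
    # By default the lang is derived from the locale string itself.
    settings = {"locale": locale, "lang": locale.replace("_", "-")}
    # Per-locale defaults (date format, time format, ...) come next.
    settings.update(locale_defaults.get(locale, {}))
    # Explicit overrides win, so lang can differ from the locale.
    settings.update(overrides or {})
    return settings
```

For example, `resolve_settings("zh_TW", {"zh_TW": {"date_format": "yyyy-MM-dd"}}, {"lang": "zh-Hant"})` keeps the zh_TW date format while reporting zh-Hant as the language.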

In either case, it would be worth looking to see if Zend has solved this already (or is planning to in an upcoming release) and pulling that work in. This seems relevant: http://stackoverflow.com/questions/6722948/why-doesnt-the-zend-locale-honor-abbreviated-formats-like-zh-hk-or-zh-cn

> I just wanted to raise the topic here because I think it's an issue worth examining in more depth (maybe others don't agree?). I've submitted the odd patch here and there for the Translation module, but I don't know enough about i18n or translation to be unilaterally making the kinds of changes to the translation module code which I think are required to solve this - hence opening the discussion on the Core Dev list.

Yeah, that's definitely the right way to raise the issue. :-)

Ingo Schommer

Jan 6, 2013, 5:00:32 PM
to silverst...@googlegroups.com
Zend_Locale does indeed have zh_Hant and zh_Hans,
with further regional variations such as zh_Hant-TW and zh_Hans-CN.

It also has zh_CN, which is implemented as an alias of zh_Hans-CN,
making it quite flexible in terms of language declaration.

Zend_Translate (the system powering i18n::_t()) allows for routing:
We could implement this through a new i18n::add_locale_routing(<locale>, <fallback>) API, which can be chainable.
i18n::_t() would need to be fixed accordingly. 

The language routing doesn't always overlap with the preferred Zend_Locale routing though,
which is used for date/time format autosetting. That was why we had to remove something like en_Lolcat.
i18n::_t() would correctly fall back to en, but Zend_Locale complains. That's only fixed 
by adding a new XML file - you can't register arbitrary locales in Zend without providing a spec for them.
Enough of an edge case to ignore, and simply reuse the add_locale_routing() API?
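
As a sketch of how a chainable routing API along these lines might behave (Python for illustration; `LocaleRouter`, `chain` and `translate` are hypothetical names, and only the `add_locale_routing` name comes from the proposal above):

```python
class LocaleRouter:
    """Hypothetical registry of locale -> fallback routes."""

    def __init__(self):
        self._routes = {}

    def add_locale_routing(self, locale, fallback):
        self._routes[locale] = fallback
        return self  # returning self makes the call chainable

    def chain(self, locale):
        """Follow the routes from locale, stopping at a dead end or a cycle."""
        seen = []
        while locale is not None and locale not in seen:
            seen.append(locale)
            locale = self._routes.get(locale)
        return seen

    def translate(self, messages, locale, key, default=None):
        """Return the first message found for key along the fallback chain."""
        for candidate in self.chain(locale):
            if key in messages.get(candidate, {}):
                return messages[candidate][key]
        return default


# Example registration: zh_Hant_TW falls back to zh_Hant, which falls back to zh.
router = (
    LocaleRouter()
    .add_locale_routing("zh_Hant_TW", "zh_Hant")
    .add_locale_routing("zh_Hant", "zh")
)
```

This only covers string lookup. Date/time format detection would still depend on whether Zend_Locale knows the locale, as noted above.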


Simon Elvery

Jan 7, 2013, 2:24:12 AM
to silverst...@googlegroups.com
Your suggestions seem sensible, Sam. Thanks for weighing in, I was starting to think I was the only one who cared/saw an issue here. 

Ingo, you didn't miss anything. Like I said, I don't have a particular problem or technical issue which can't be solved with a little creativity - I just think Simplified/Traditional Chinese should be better handled straight out of the box. (Also, you're right about the Translatable->getContentLanguage function. I was working in an old branch of my fork when I looked that up, one I used before I discovered the functions in the DBLocale class.) 

The current example in the docs for adding a new locale is incorrect (see pull request: https://github.com/silverstripe/silverstripe-translatable/pull/77). In addition, for some CMS functions to work as expected, at least the i18n::$common_languages and i18n::$all_locales arrays would also need updating.

From my cursory look at the Zend stuff, it seems like the translation module should be moving toward using Zend for all the Language/Locale mapping since it seems more comprehensive and would remove the necessity of maintaining these. Does that seem feasible to you?

Cheers,

Ingo Schommer

Jan 7, 2013, 3:22:52 AM
to silverst...@googlegroups.com
PHP 5.3 comes with its own Locale class (http://de2.php.net/locale).
It's just a small collection of static methods, so we can't pass
around a locale as an object like with Zend_Locale.
Its power comes from the underlying CLDR definitions built into PHP,
e.g. date formats when also using IntlDateFormatter.

One mundane reason to switch to PHP's own handling:
We'd reduce the SS filesize by about 10MB when extracted
because we wouldn't need those CLDR definitions in XML format.
But that'd also mean replacing Zend_Translate underneath i18n::_t()
since it relies on Zend_Locale. PHP's own handler (MessageFormatter) is much less
capable, pretty much a message map with some string formatting options,
nothing around handling formats like LXML or YML.

So realistically, we'll stick with Zend for the time being.

Concerning using Zend_Locale more in our own i18n class,
I think it makes sense to use Zend's locale list in favour
of our i18n::$all_locales and i18n::$common_locales.
We've used these statics directly, so deprecation is a bit of an issue.
Defining new locales would mean creating an XML definition then,
but that's really an edge case given the exhaustive list of locales in the CLDR project.
Simon, do you want to create a patch for this?