_t() and translate.ss.org in 3.0


Ingo Schommer

Nov 18, 2011, 10:49:06 AM
to silverst...@googlegroups.com
Hello there!

As we're developing more of the 3.0 UI, we also need to start thinking about
the required translation effort. At the moment we're in a bind where the 
new translation strings can't be imported easily to translate.silverstripe.org without 
cutting off the ability to safely export those translations to 2.4 again.

There are certain hacks around this which I'm looking at in the short term,
but in general I don't think the translate.silverstripe.org project has a future.
I've asked for maintenance help numerous times on this list, with little feedback.
The 3.0 dilemma was just the last straw, so to speak.

So, here's what I'm proposing: 
1. Zend_Translate to replace parts of the i18n class (and the _t() method).
3. Switch the default lang format in core to Gettext PO files.

There are numerous considerations with this suggestion -
I'm mainly concerned about getting this over the line quickly
even if it means some unpleasant compromises.
This decision has been hanging around way too long.

# Translator Platform
- Free project hosting and translator signup (unlimited translators)
- Web-based translation
- JavaScript translations
- Automated import/export onto github
- Manual approval of languages (might be achieved through selective export scripts as well)
- Context strings (an estimated 5% of core entities have more context to clarify the string itself)
- Optional: Collaboration tools like voting or flagging a translation
- Optional: API to retrieve/update strings (maybe even collaborator stats)
- Optional: Import of existing translation stats so we can credit translators for past work (icing on the cake)

# Entity retrieval
- Existing i18nTextCollector should be easy to retrofit to any format we decide on. It's still an ugly mess of regexes though.
- Hard to replace completely, unless we do away with the concept of i18nEntityProvider and rely fully on static code parsing.

# Backend requirements
Zend_Translate fits, as we already use Zend_Locale, the underlying CLDR codebase and Zend_Cache,
so no new dependencies on top of that. Has many file backends, plus we can write our own one for legacy file support.
Will need to investigate caching abilities, we don't want to parse large amounts of YML or XML on every request.

# File format requirements
- Support for plural forms, so multiple translations for the same string based on quantity.
That's not possible in the current i18n class, but it's an essential localization technique that we missed.
- I think we can safely ditch the "priority" parameter currently present in _t(). Nobody uses it.
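
As a rough illustration of the plural-form requirement, here is a minimal Python sketch. The entity name `Cart.ITEMS`, the `one`/`other` categories and the `{count}` placeholder are assumptions invented for this example; none of them come from the current i18n class or from a format proposed in this thread.

```python
from typing import Dict

# Hypothetical plural-aware message table: each entity maps plural
# categories to a translated string.
TRANSLATIONS: Dict[str, Dict[str, Dict[str, str]]] = {
    "en": {"Cart.ITEMS": {"one": "{count} item", "other": "{count} items"}},
    "de": {"Cart.ITEMS": {"one": "{count} Artikel", "other": "{count} Artikel"}},
}


def plural_category(locale: str, count: int) -> str:
    """Pick a plural category.

    This sketch ignores the locale and hardcodes the English/German rule
    (1 is singular, everything else is plural).
    """
    return "one" if count == 1 else "other"


def translate_plural(locale: str, entity: str, count: int) -> str:
    """Look up the form matching the quantity and fill in the number."""
    forms = TRANSLATIONS[locale][entity]
    return forms[plural_category(locale, count)].format(count=count)
```

Languages with more than two plural forms would need a per-locale rule inside `plural_category`; that part is deliberately left out here.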

# File formats: Pro/Con

## Rails 2.2 YML
- Pro: Simple and readable file format. Arbitrary nesting. Simple context definitions through YML comments.
More likely to have usable and cutting-edge tools for translators.
- Con: We'd need to build a custom Zend_Translate adapter (a similar adapter for *.ini is about 20 LOC though).
Tooling will be most likely around Ruby ecosystem, not PHP.
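
To make the "arbitrary nesting" point concrete, the Python sketch below flattens a nested structure, as a YAML parser might return it, into dotted entity keys. The layout (locale at the top, then namespace, then entity) and the sample strings are assumptions for illustration, not a confirmed file layout.

```python
from typing import Dict, Iterator, Tuple

# What a YAML parser might return for a Rails-style file such as:
#   de:
#     Member:
#       SINGULARNAME: Mitglied
#       PLURALNAME: Mitglieder
parsed = {"de": {"Member": {"SINGULARNAME": "Mitglied", "PLURALNAME": "Mitglieder"}}}


def flatten(tree: dict, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted.key, value) pairs for every leaf in a nested dict."""
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


# One flat lookup table per locale, keyed like "Member.SINGULARNAME".
entities: Dict[str, Dict[str, str]] = {
    locale: dict(flatten(tree)) for locale, tree in parsed.items()
}
```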

## Gettext
- Pro: Wide tooling support (e.g. Zend_Translate)
- Con: A bit arcane file format, will be hard to generate from current i18nTextCollector implementation.
Not sure about native Gettext parser support for our custom format with _t(id, string, priority, context).
Only reliably works with specialized parsers, although they're a'plenty in PHP.

## XLIFF
- Pro: Wide tooling support (e.g. Zend_Translate), open standard
- Con: XML verbosity and ugliness. Will take uncompressed core lang files from 2.8MB currently to an estimated 10MB.

## PHP Arrays
- Pro: No caching required, close to the current format.
- Con: Inconsistent tooling support, no standard format. E.g. getlocalization supports it, but not with context or pluralization.

Links to i18n topics:
http://www.oasis-open.org/committees/xliff/documents/cs-xliff-core-1.1-20031031.htm#context

Links to alternative translation web tools:

OK, a lot of info :) Let me know what you think.

Ingo

---
Ingo Schommer | Senior Developer

Sam Minnée

Nov 18, 2011, 3:06:07 PM
to silverst...@googlegroups.com
My initial thought on this is that replacing translate.silverstripe.org is a really great idea but that replacing _t() sounds like a red herring. In particular, it seems that leaving textcollector and entityprovider as-is, except for changing them to write a file format compatible with a 3rd party tool, might be a better step?
--
You received this message because you are subscribed to the Google Groups "SilverStripe Core Development" group.
To post to this group, send email to silverst...@googlegroups.com.
To unsubscribe from this group, send email to silverstripe-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/silverstripe-dev?hl=en.

Ingo Schommer

Nov 18, 2011, 7:34:09 PM
to silverst...@googlegroups.com
I'm actually quite far along with this whole YML conversion already, created a little script here:

Converted the files in a branch for now:
The YML looks a lot cleaner than even the PHP arrays, good stuff!

Also used the opportunity to download all new translations from translate.ss.org,
and migrate the entities to their proper module (the cms<->sapphire file migrations have caused a lot of wrong translations).
And removed orphaned translations which no longer have any representations in the master table.
So all in all, a good starting point for translators. 

No data loss apart from the useless priority flag,
the context is simply a YML comment. Hard to tell if any translations are lost (it's 13k in total...),
raw counts don't work that well as I do duplicate checks, move stuff around etc.
It *roughly* matches up though, I've done some entity-based diffing.
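
The kind of conversion described here (drop the priority, keep the context as a YML comment) could be sketched in Python as follows. This is not the script mentioned above. It assumes a legacy entry is either a plain string or a `(string, priority, context)` tuple, and the function names are made up for the example.

```python
from typing import Dict, Tuple, Union

# Assumed shape of a legacy entry: plain text, or (text, priority, context).
Legacy = Union[str, Tuple[str, int, str]]


def yaml_quote(text: str) -> str:
    """Single-quote a scalar; YAML escapes a single quote by doubling it."""
    return "'" + text.replace("'", "''") + "'"


def to_yaml(locale: str, lang: Dict[str, Dict[str, Legacy]]) -> str:
    """Render one locale as nested YAML text, with context as a comment."""
    lines = [f"{locale}:"]
    for namespace in sorted(lang):
        lines.append(f"  {namespace}:")
        for entity in sorted(lang[namespace]):
            value = lang[namespace][entity]
            if isinstance(value, tuple):
                text, _priority, context = value  # the priority is dropped
                if context:
                    lines.append(f"    # {context}")
            else:
                text = value
            lines.append(f"    {entity}: {yaml_quote(text)}")
    return "\n".join(lines) + "\n"
```

Multi-line strings and multi-line contexts are not handled; a real converter would need to deal with both.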

So, here's the interesting bit: I've imported these YML files into http://www.getlocalization.com/sapphire/.
Had to import the ~60 language files by hand, so I haven't done the cms module just yet.
Overall it works freaking great, amazing platform.

But here's the kicker: They have an "in page editor", and it works in the 3.0 CMS out of the box!
That's like the holy grail of translations :) Need to suss out the details on how it identifies
strings, but very encouraging already.

TODO:
- Write Zend_Translate YML adapter
- Convert remaining modules from translate.ss.org (not urgent)
- Fix LOLCAT translation, locale not detected (very urgent! *g*)
- Talk to getlocalization in order to get context comments showing up in editor

@Sam, regarding _t(): Yeah it would most likely just route through to Zend_Translate->translate().
We might want to consider dropping the priority argument for 3.0 though as an API change, it's completely pointless.
In all its five years of existence, I haven't seen it being used ONCE ;)

Jeremy Shipman

Nov 19, 2011, 6:03:16 PM
to silverst...@googlegroups.com
Without wanting to digress from the main discussion, I just want to
point this out:

There has been some discussion recently about the community making an
effort to improve the modules database. I'm not sure where things are at
currently with that effort, but I think it would be valuable to think
about how translations could be tied in with that work.

It would be great to submit a module to the database, and it is
immediately available for translations to come in.

Brice Burgess

Nov 19, 2011, 6:48:35 PM
to silverst...@googlegroups.com
On 11/18/2011 9:49 AM, Ingo Schommer wrote:
> So, here's what I'm proposing:
> 1. Zend_Translate to replace parts of the i18n class (and the _t() method).
> 3. Switch the default lang format in core to Gettext PO files.


Ingo,

I like this a lot. Gettext is a very fast solution with little system overhead -- most PHP distributions come with gettext support compiled in, it's the standard, and widely known.  PO files are actually a pleasure to deal with and there is an abundance of viewers and editors out there which help translators manage and collaborate on translations.

I'm not as familiar with getlocalization.com -- but having an organized means for translators to collaborate on a particular locale is key... and it would be nice to expand translate.silverstripe.org or getlocalization.com to cover SilverStripe modules utilizing _t() (similar to Drupal's http://localize.drupal.org/).

~ Brice

Sam Minnée

Nov 19, 2011, 10:45:31 PM
to silverst...@googlegroups.com
Hey Ingo,

Glad to hear that you've made progress on this! :-)

 * I think that _t() (with the priority argument dropped) should still be the primary API.  I think we should discourage the use of Zend_Translate->translate() in regular code because it's way too verbose for normal people to use.  It's critical that the translation function is terse.


 * I fail to see the value in having separate http://www.getlocalization.com/silverstripe-framework/ and http://www.getlocalization.com/silverstripe-cms/ projects, it just seems to make it harder for a would-be translator to provide a translation. 

 * Is there any way that we can automate the insertion of entities into www.getlocalization.com?  Ideally the uploading of entities and downloading of translations would be something that we could make a daily automated process, so that master is as close to "production ready" as possible in this regard.


Hamish Friedlander

Nov 20, 2011, 3:29:25 PM
to silverst...@googlegroups.com
Just to add more confusion to the mix, somehow we missed merging Julian's ssviewer enhancements for the alpha, but that includes the new _t replacement for templates (I suspect there'll be merge clashes with whatever you've done Ingo, sorry).


This change also improves the php version of _t to support the named arguments in the replacement string.

We had a long discussion at that stage about how translate calls should look. I don't think we need to have another one. If Zend_Translate can't be made to support the syntax from that discussion and patch, I'm against using it. We already have a working translation system, and I'm against replacing a working component with a third party one just to reduce our maintenance surface.

However I think replacing the data structure with yaml or something else is perfectly sensible. The config system uses yaml, so if that looks good I'd vote for that.

Hamish Friedlander

Sam Minnée

Nov 20, 2011, 4:12:22 PM
to silverst...@googlegroups.com
Zend_Translate should be able to be plugged into that without too much bother, but this is another argument for leaving all the existing APIs as the primary APIs rather than deprecating them in favour of some Zend thing.

It does raise a question, however:

 - Does the new template system work with TextCollector?
 - Does TextCollector have a plug-in API so that text collection for the template system can be part of the template system?  I can see need for 3 text collector modules so far: PHP, JavaScript, and Templates.
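
A plug-in API of the kind asked about here might look like the Python sketch below: each source type registers its own collector, and a dispatcher picks one by file extension. The registry, the decorator and the deliberately simplistic regular expressions are assumptions for illustration; they are not the existing i18nTextCollector code or the real template syntax rules.

```python
import re
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (entity id, default string)
Collector = Callable[[str], List[Entity]]

_COLLECTORS: Dict[str, Collector] = {}


def register(extension: str) -> Callable[[Collector], Collector]:
    """Decorator that registers a collector for one file extension."""
    def decorator(func: Collector) -> Collector:
        _COLLECTORS[extension] = func
        return func
    return decorator


# Simplified pattern for _t('Entity', 'Default') style calls.
_CALL = re.compile(r"""_t\(\s*['"]([\w.]+)['"]\s*,\s*['"]([^'"]*)['"]""")


@register(".php")
@register(".js")
def collect_calls(source: str) -> List[Entity]:
    """Collector shared by the PHP and JavaScript source types."""
    return _CALL.findall(source)


# Simplified pattern for <% _t('Entity', 'Default') %> in templates.
_TEMPLATE = re.compile(
    r"""<%\s*_t\(\s*['"]([\w.]+)['"]\s*,\s*['"]([^'"]*)['"]\s*\)\s*%>"""
)


@register(".ss")
def collect_template(source: str) -> List[Entity]:
    """Collector owned by the template system."""
    return _TEMPLATE.findall(source)


def collect(filename: str, source: str) -> List[Entity]:
    """Dispatch to the collector registered for the file's extension."""
    for extension, collector in _COLLECTORS.items():
        if filename.endswith(extension):
            return collector(source)
    return []
```

With this shape, the template system could ship its own collector without the core collector knowing anything about template syntax.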

Ingo Schommer

Nov 21, 2011, 4:24:35 AM
to silverst...@googlegroups.com
Hey guys,

Renaming "sapphire": Of course, it's renamed (doh).
Project name updated fine, but URL seems to be "stuck", I've emailed support.

We need two different projects though, to avoid confusion with the master files. 
There can only be one "en.yml" in the system, demonstrated through the API
We *could* rename them automatically to "en-framework.yml" and "en-cms.yml" before upload,
but still leaves us with the problem that translated languages will be downloaded as one file. 

On Sunday, November 20, 2011 10:12:22 PM UTC+1, Sam Minnée wrote:
> Zend_Translate should be able to be plugged into that without too much bother, but this is another argument for leaving all the existing APIs as the primary APIs rather than deprecating them in favour of some Zend thing.

Yep, not advocating to change the public API, _t() is fine (hence I said "route through Zend_Translate").
In the end, we're just changing the way $transEntity is retrieved (at the bottom of Hamish's linked patch).
The Zend API doesn't even try to solve this. I'm not quite sure how the named arguments are actually passed through though.
Are there any docs on this? Mark's SSViewer docs pull request doesn't have it, and I can't find anything in core.

As you can reuse translation entities created in templates elsewhere in PHP,
the PHP version needs to support the same notation, right?
That'd be a good way to get argument ordering support as well,
as it varies based on language.
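
To show why named arguments help with argument ordering, here is a minimal Python sketch. The `{name}` placeholder syntax and the `inject` helper are assumptions chosen for the example; the syntax used by the actual patch is not documented in this thread.

```python
import re
from typing import Mapping

# Assumed placeholder syntax: a word wrapped in curly braces.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def inject(template: str, args: Mapping[str, object]) -> str:
    """Replace {name} placeholders; unknown names are left untouched."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        return str(args[name]) if name in args else match.group(0)
    return _PLACEHOLDER.sub(substitute, template)


# The same arguments serve both strings although the order differs.
english = "{count} pages by {author}"
german = "Von {author}: {count} Seiten"
args = {"count": 3, "author": "Ingo"}
```

Positional placeholders would force every translation to keep the source-language order, which is the problem named arguments avoid.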

On the "we already have a working translation system": It's missing pluralization,
as well as locale fallbacks (de_AT.yml Austrian German falling back to de.yml, rather than the completely separate de_DE.php and de_AT.php in the current system).
Both possible to retrofit onto our system, I need to look into this in more detail on the amount of work required.
We need to write a YAML loader anyway, but for custom code we'd also need to write the caching layer (built-in to Zend_Translate).
Also, while supporting multiple adapters in addition to YML doesn't add much value to the immediate core usage,
it makes things easier for projects with existing translations to migrate over to SilverStripe.
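
The locale fallback mentioned above (de_AT falling back to de) could be sketched in Python like this. The function names and sample strings are invented for the example, and a real implementation may use a longer fallback chain.

```python
from typing import Dict, List


def fallback_chain(locale: str) -> List[str]:
    """Return the locale followed by its bare language, e.g. de_AT then de."""
    chain = [locale]
    if "_" in locale:
        chain.append(locale.split("_", 1)[0])
    return chain


def lookup(messages: Dict[str, Dict[str, str]], locale: str,
           entity: str, default: str) -> str:
    """Try each locale in the chain, then fall back to the default string."""
    for candidate in fallback_chain(locale):
        value = messages.get(candidate, {}).get(entity)
        if value is not None:
            return value
    return default


messages = {
    "de": {"Page.SAVE": "Speichern", "Page.JANUARY": "Januar"},
    "de_AT": {"Page.JANUARY": "Jänner"},  # only the regional difference
}
```

With fallbacks, a regional file only needs the strings that actually differ from the base language.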

> It does raise a question, however:
>
>  - Does the new template system work with TextCollector?

Nope.

>  - Does TextCollector have a plug-in API so that text collection for the template system can be part of the template system?  I can see need for 3 text collector modules so far: PHP, JavaScript, and Templates.

Nope. Good idea though. Who's implementing it? ;)

Daniel Lindkvist

Nov 21, 2011, 6:01:03 AM
to silverst...@googlegroups.com

Hi,

I'd just like to give my input on translations.

We've made a translation module which enables in-context editing as well as editing translations in a list view. I haven't been able to publish it as a proper module since it contains a few kinks, such as:

- Copying the cached templates, but with rewritten versions of _t calls (with input capabilities to enable in-site editing), to a new directory
- Selecting said directory when an admin user is performing in-site translation editing
- Requiring a _ss_environment.php file to do the above-mentioned switch

Previously we saved any changes directly into the lang files. This was a horrible mistake since it means that any changes made live would be overwritten on a new deploy, or it would require keeping the live lang files while deploying and putting them back when you're done.

The approach we're doing now is to store any changes in the database and then add the ability to import, export to and from the database.

Also we're hooking into controller onBeforeInit to overwrite the lang values with any changes from the database enabling translations edited live to be shown instantly.

Is there any noticeable performance impact from using yml instead of the php lang files? We're having performance issues as it is and would like to avoid more of them :)

Best Regards

Daniel Lindkvist


Sam Minnée

Nov 21, 2011, 3:47:11 PM
to SilverStripe Core Development
> Previously we saved any changes directly into the lang files. This was a horrible mistake since it means that any changes made live would be overwritten on a new deploy, or it would require keeping the live lang files while deploying and putting them back when you're done.
>
> The approach we're doing now is to store any changes in the database and then add the ability to import, export to and from the database.
> Also we're hooking into controller onBeforeInit to overwrite the lang values with any changes from the database enabling translations edited live to be shown instantly.

As I understand it, you could provide an alternative Zend_Translate
back-end that reads a database rather than a YML file, so you should
be able to do this without hacks.

Your module sounds cool! The one thing that I would suggest is to only
store *modifications from the default* in your database, so that if
the default strings in the PHP files are updated by developers,
those updates still come through. You might also wish to store the "default en_US value" that
was provided in the code for each of the modified strings - if the
default value in the code changes, you know that your translation
might be out of date.
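
This suggestion can be sketched in Python as follows: keep only the edited strings, remember the in-code default at the time of the edit, and flag an override once that default changes. `OverrideStore` is an in-memory stand-in for a database table, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Override:
    text: str             # the translation as edited live
    default_at_edit: str  # the in-code default when the edit was made


class OverrideStore:
    """In-memory stand-in for a database table of edited translations."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], Override] = {}

    def save(self, locale: str, entity: str, text: str, code_default: str) -> None:
        """Store a modification together with the default it was based on."""
        self._rows[(locale, entity)] = Override(text, code_default)

    def resolve(self, locale: str, entity: str, shipped: str) -> str:
        """Prefer an override, otherwise use what ships in the lang files."""
        row = self._rows.get((locale, entity))
        return row.text if row else shipped

    def is_stale(self, locale: str, entity: str, code_default: str) -> bool:
        """True when the in-code default changed after the override was saved."""
        row = self._rows.get((locale, entity))
        return row is not None and row.default_at_edit != code_default
```

Entities without an override never touch the store's data, so updated lang files flow through unchanged.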

> Is there any noticeable performance impact from using yml instead of the php lang files? We're having performance issues as it is and would like to avoid more of them :)

If there is we could always cache as a PHP array. I believe that the
yml config system is also going to do this.

Ingo Schommer

Nov 27, 2011, 6:33:45 AM
to silverst...@googlegroups.com
Hey guys,

I've spent the better part of my weekend on this stuff, here's where I'm at:

Also created a Translate_Adapter_RailsYAML module for this purpose,
which is independent of sapphire and included in sapphire/thirdparty via piston:

It's not finished, a teeny tiny bug is that the CMS doesn't load ;)
But from a code and unit test perspective it's pretty much there.
So I would appreciate some peer review, particularly:

- Can we remove (not deprecate) the i18n plugin mechanism in favour of Zend_Translate? 
Same idea, but the plugin is a bit more hardwired to assume the $lang global (no return values).
It was added fairly recently in 2.4, so don't think it has usage beyond Mark's module.
- Does the $exclusive flag in SS_ClassLoader make sense to you? 0e885467cd98ca6a5c5619a1376822f56a4e4540
Or can you think of another way to test "autoloading" of lang files in a submodule without removing references to all core files from the manifest?
The test requiring this patch is i18nTest::testIncludeByLocale()
- If the YAML format has any conceptual flaws.
- If you understand how the different translation adapters interact and fall back on each other in i18n::_t(), i18n::include_by_locale(), i18n::get_translators().
The "legacy" adapter for 3.0 is added by default for now, so you can use modules and custom code with unconverted PHP lang files.
As PHP and YML can co-exist in one folder, that might not be necessary if we require all modules to convert their stuff for 3.0 compat.
Note: In theory modules with PHP-only lang files will still work in 3.0 without the legacy adapter in the default language because of the _t() default strings.

TODO:
- Fix infinite loop when loading the CMS (doh)
- Performance tests: Should be fairly similar, still only loads locales on demand (which was a pain to implement, the concept is not available in Zend as such). 
Plus the YAML parsing overhead, cache deserialization and more Zend code of course.
- Fix i18nTextCollectorTest and weird manifest issues after i18nText has run
- More transparent cache invalidation (currently time based only)
- Update i18nTextCollector to output YAML (ideally through a new adapter system)
- Test fallback adapter with a couple of popular modules containing PHP files
- Write upgrade notices: How to convert to YAML, rewrite any hardcoded $lang overrides
- Write documentation: Note about fallback languages (de_DE to de)
- Talk to Mark about how his DB-driven translation module can become a translation adapter

Thanks
Ingo

Ingo Schommer

Dec 4, 2011, 3:55:54 PM
to silverst...@googlegroups.com
I've fixed the CMS loading, and looked at performance. There's a bunch of new commits on the branch (which I'm planning to squash eventually).
https://github.com/chillu/sapphire/commits/yml-lang-files

Performance results are mixed - currently a CMS page load is about 22% slower (462ms instead of 379ms).
That's due to the massive number of entities it loads (570), so the _t() system is a major contributor to the overall performance.
These results are with Zend_Cache primed, so it's not parsing the YAML each time but rather unserializing cached PHP arrays.
It's not entirely unexpected for a library to be slower than effectively raw PHP array lookups in $lang; the question is whether it's good enough.

The uncached parsing stage brings the CMS load time from 10.7s to 12.4s, with more or less the same memory usage.
As the cache can basically exist forever (translations rarely change), I don't think that's a big deal as long
as we don't exceed critical values like 64M of RAM usage or 30s of execution time on average hosts.

One simple way to optimize this is to return the default string for en_US, bypassing the whole library.
That would disregard any overwritten translations though, and obviously only work for English-speaking users.
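
The shortcut and its drawback can be shown in a few lines of Python. `SOURCE_LOCALE`, `translate` and the `lookup` callback are hypothetical names; the sketch only assumes that default strings in code are written in en_US.

```python
from typing import Callable, Optional

SOURCE_LOCALE = "en_US"  # assumed locale of the default strings in code


def translate(entity: str, default: str, locale: str,
              lookup: Callable[[str, str], Optional[str]]) -> str:
    """Return the default string directly for the source locale."""
    if locale == SOURCE_LOCALE:
        # The backend is never consulted, so any en_US override is ignored.
        return default
    translated = lookup(entity, locale)
    return translated if translated is not None else default
```

The early return is what saves the time, and it is also exactly why customised en_US strings would stop working.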

Note on my environment: MySQL, PHP caches and webroot are on an SSD; xcache is on, but without opcode caching.
Keen to see how it performs on Debian with APC caching as well.

Ingo

Ingo Schommer

Dec 4, 2011, 3:57:58 PM
to silverst...@googlegroups.com
Oh, and benchmark methodology: A mix of XHProf diff'ing between the master and yml-lang-files branches (both cached and uncached),
as well as ApacheBench: ab -n 50 -C alc_enc=1%3A773e9e2fd1cb3ec48f854f0d8de302a4bb06bf4a -C PastMember=1 -C PHPSESSID=hr3kr0k74llkajufh48aisjo56 http://localhost/admin/page/edit/show/1
ApacheBench results are attached (first is the YML branch, second is master).
benchmark.txt

Sam Minnée

Apr 3, 2012, 8:30:01 PM
to silverst...@googlegroups.com
OK, this got stalled a bit.

I think we can go ahead with this approach; if the performance problem is an issue then we can optimise while still using YML files (we might end up bypassing Zend stuff if that's the bottleneck).

I've created a ticket http://open.silverstripe.org/ticket/7104 but I have no idea what extra work is required.

Ingo, could you amend the ticket with relevant bullet points?

Thanks,
Sam

Ingo Schommer

Apr 13, 2012, 7:52:43 PM
to silverst...@googlegroups.com
Hello!

I'm trying to get this over the line, and think I'm pretty close:

I've reduced the number of _t() calls required in a typical CMS load from 600 to 300
through some in-memory and disk caching, which also reduces the overall performance impact.
Looking at XHProf, it's roughly 5% of the overall CMS execution time (3% of a homepage request),
so the impact of any further performance work will be limited. That being said,
it still parses the (disk-cached) YAML data on every request, which isn't strictly required.
I'll look into adding another caching layer for PHP arrays on top of that.

In the meantime, I've switched the i18nTextCollector to output YAML
(through a new pluggable writer API). While I was at it, also removed
support for the $priority argument to _t(), which I've never seen used.
And further down the rabbit hole, I ended up rewriting the regexes of hell (tm)
in i18nTextCollector into a more maintainable implementation based on the PHP Tokenizer.
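
The benefit of a tokenizer over regexes can be shown with a Python analogue. The sketch below uses Python's own `tokenize` module on Python source, purely to illustrate the technique; it is not the PHP implementation, and `collect_entities` is a made-up name.

```python
import ast
import io
import tokenize
from typing import List, Tuple


def collect_entities(source: str) -> List[Tuple[str, str]]:
    """Collect (entity, default) pairs from _t("Entity", "Default") calls."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    found = []
    for index, token in enumerate(tokens):
        if token.type != tokenize.NAME or token.string != "_t":
            continue
        # Expect exactly: "(" STRING "," STRING right after the name.
        window = tokens[index + 1:index + 5]
        if (len(window) == 4
                and window[0].string == "("
                and window[1].type == tokenize.STRING
                and window[2].string == ","
                and window[3].type == tokenize.STRING):
            found.append((ast.literal_eval(window[1].string),
                          ast.literal_eval(window[3].string)))
    return found


sample = (
    'title = _t("Page.TITLE", "Title")\n'
    '# _t("Fake.COMMENT", "ignored")\n'
    'text = "_t(\'Fake.STRING\', \'ignored\')"\n'
)
entities = collect_entities(sample)
```

Because comments and string literals arrive as single tokens, the lookalike calls inside them are skipped without any extra pattern work, which is hard to guarantee with regexes alone.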

Feedback welcome, but please be quick as I'm planning to get this into beta2 in the next few days.
Ingo

Ingo Schommer

Apr 15, 2012, 3:57:33 PM
to silverst...@googlegroups.com
This is now in master. I've written up some outstanding tasks at http://open.silverstripe.org/ticket/7104, but none of them are beta-release critical.

Ingo Schommer

Apr 29, 2012, 5:46:38 PM
to silverst...@googlegroups.com
We've now switched our community translation system from the homebrew solution on translate.silverstripe.org,
to an externally maintained project, getlocalization.com. This was made possible by the new YML
format discussed earlier, and hopefully will enable a smoother translation workflow.


In the last few days, I've emailed the 190 translators registered on the old platform
who were actively participating in the last three years (thanks guys!),
and some have already started translating on the new system.

At the moment, getting translations into master is not automated.

We're now encouraging module owners to convert their language files, and create their own getlocalization projects.
Our own maintained modules still need to go through that process as well: http://open.silverstripe.org/ticket/7234

Thanks!
Ingo