Proposal: Improving Unicode/i18n support by binding ICU


Erick Guan

Mar 30, 2017, 4:07:36 PM
to rubyonrails-gsoc

This is an incomplete draft, and it's in Markdown rather than rich HTML.


## Project Name

Improving Unicode/i18n support by binding ICU

 

## Project Description

ActiveSupport hosts a collection of utilities for Rails, including some i18n and Unicode functionality. It is often described as complete, because some parts, such as `Inflector`, reject further changes. Still, I see both the benefit and the possibility of working on i18n support at the Rails level without over-engineering it. Not every web programmer is lucky enough to only have to learn UTF-8. Even with the current i18n support in Rails, more features are being asked for. Ruby and Rails have evolved quickly this decade. Importantly, many i18n and Unicode problems have been tackled by International Components for Unicode (ICU) for decades. The catch is that the Ruby and Rails community has its own way of approaching i18n. Even so, I believe it would be better if someone worked in this direction, so I dare to propose this project and see what happens.

 

## Why did you choose this idea?

The story behind this is that I tried to support Unicode usernames in Discourse, but Sam Saffron pointed out security risks such as visual confusables and invisible characters, as well as the difficulty of building a slug. Besides, ActiveSupport's `transliterate` always returns nothing when I feed it Chinese. As a Chinese speaker, I see no chance of providing a list that covers thousands of Chinese characters in a locale file (a nightmare to publish to Transifex for localization). Yet these characters have Latin representations that are useful to the end user. And there are experts devoted to solving these problems, as well as existing libraries.
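
To make the visual-confusables risk concrete, here is a minimal sketch in plain Ruby. It is not the API of any gem mentioned in this proposal: the `CONFUSABLE_PROTOTYPES` table, `skeleton` and `confusable?` are hypothetical names, the table is a tiny hand-picked subset of the kind of data Unicode TR39 publishes, and the skeleton step is simplified compared with the full algorithm.

```ruby
# Hypothetical, hand-picked subset of a confusables table. The real data set
# published with Unicode TR39 is far larger.
CONFUSABLE_PROTOTYPES = {
  "\u0430" => "a", # CYRILLIC SMALL LETTER A
  "\u0435" => "e", # CYRILLIC SMALL LETTER IE
  "\u043E" => "o", # CYRILLIC SMALL LETTER O
  "\u0440" => "p", # CYRILLIC SMALL LETTER ER
}.freeze

# Simplified skeleton: decompose with NFD, then replace each character by its
# prototype when the table has one.
def skeleton(string)
  string.unicode_normalize(:nfd)
        .each_char
        .map { |char| CONFUSABLE_PROTOTYPES.fetch(char, char) }
        .join
end

# Two different strings are confusable when their skeletons are equal.
def confusable?(first, second)
  first != second && skeleton(first) == skeleton(second)
end

latin = "paypal"
mixed = "p\u0430yp\u0430l" # both "a" characters are Cyrillic

puts latin == mixed             # false: the code points differ
puts confusable?(latin, mixed)  # true: the skeletons match
```

A username check built this way would reject `mixed` if `latin` is already taken, even though a plain string comparison sees two unrelated names.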

There is a history behind why ICU doesn't appear in Ruby. Starting from [Ruby 1.9](http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/), MRI built a codeset-independent (CSI) model for its string implementation instead of an internal UTF-16/32 representation. Since then, work on Unicode in MRI has been slow.
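
The codeset-independent model can be observed from plain Ruby. The sketch below uses only core `String` and `Encoding` methods; the variable names are just for illustration. Each string carries its own encoding, and the same text has different bytes depending on that encoding.

```ruby
utf8   = "caf\u00E9"                        # "\u" escapes produce a UTF-8 string
latin1 = utf8.encode(Encoding::ISO_8859_1)  # same text, transcoded

puts utf8.encoding    # UTF-8
puts latin1.encoding  # ISO-8859-1
p utf8.bytes          # [99, 97, 102, 195, 169]
p latin1.bytes        # [99, 97, 102, 233]

# The text only compares equal once both sides share an encoding.
same_text = utf8 == latin1.encode(Encoding::UTF_8)

# Mixing non-ASCII strings with different encodings is an error, because there
# is no single internal representation to fall back on.
mixing_failed =
  begin
    utf8 + latin1
    false
  rescue Encoding::CompatibilityError
    true
  end

puts same_text      # true
puts mixing_failed  # true
```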

ICU is built for processing Unicode and solving some i18n problems. The robustness of the library and the DRY principle drive me to this idea. twitter-cldr-rb has to implement all the algorithms defined in the Unicode standard by itself, and it doesn't get much attention now.

 

## Please describe an outline project architecture or an approach to it

ICU has C/C++ and Java interfaces. I would [continue with the C binding gem](https://github.com/fantasticfears/icu4r-next). In the meantime, I will investigate how the JRuby community works with Rails, so that I can construct a similar API on par with the Java interfaces. JRuby users are lucky enough to be able to invoke ICU classes directly. Then I am going to work on a backend for the i18n gem using the ICU library. I don't expect to change ActiveSupport's normalization or its utilities in this project, for platform support and compatibility reasons. But we can have a steady ICU gem first, and then better ideas for improving i18n and Unicode support in ActiveSupport. It takes time to learn Unicode; the more I learn about it, the more I am impressed by its complexity.

 

## Give us details about the milestones for this project

1. Communication. I am going to spend the first week communicating with the community to reason about the current work, as well as researching Rails' prospects for i18n support.

2. Binding the ICU library. It's going to take 4 weeks to continue my [proof of concept](https://github.com/fantasticfears/icu4r-next). Besides the glue code, I need to focus on thread safety and memory management. The binding should include conversion, locales, normalization, calendars, date and time formatting, transliteration, collation and spoof detection.

3. Writing documentation and evaluating with the JRuby community. With the binding finished, it's important to have more eyes on the API exposed to the Ruby world. The API should be low-level, stable and similar to ICU4J where applicable. This should take 2 weeks.

4. In the next 3 weeks, I will work on an `i18n` backend gem that leverages ICU.

5. In the last two weeks, I'll draft some patches for Rails to demonstrate the possibilities for changing ActiveSupport.
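
As an illustration of one service named in milestone 2, normalization, the sketch below shows the behaviour in question using MRI's built-in `String#unicode_normalize`. This is not the binding's API, which is still to be designed; it only shows the kind of result an ICU-backed implementation would also need to produce.

```ruby
composed   = "\u00E9"   # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but are different code point sequences.
puts composed == decomposed  # false

# After normalizing both to NFC they compare equal.
puts composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)  # true

# NFD splits the precomposed character back into base letter plus accent.
p composed.unicode_normalize(:nfd).codepoints  # [101, 769]

# Compatibility forms also fold presentation variants such as ligatures.
ligature = "\uFB01"                     # LATIN SMALL LIGATURE FI
p ligature.unicode_normalize(:nfkc)     # "fi"
```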

 

## Why will your proposal benefit Ruby on Rails?

 

It will bring an industrial-strength, battle-tested and robust ICU gem to the Ruby ecosystem. If ActiveSupport can make use of the binding, it should be significantly faster than plain Ruby code as well as conformant to the Unicode standard. Plus, no one will need to reinvent the wheel as twitter-cldr-rb did. And I doubt no one will like w

 

## Please describe any previous Open Source development experience

 

I contribute to Discourse quite often. I have also made a few one-off contributions to open source projects I'm interested in, including Pundit and some Wikipedia-related projects, as well as translations.

I have also published the code I use on GitHub.

For this particular issue, I built a simple gem called [tr39_confusables](https://github.com/fantasticfears/tr39_confusables), which implements the algorithm for detecting visual confusables. I also started [the proof of concept for the binding](https://github.com/fantasticfears/icu4r-next), which is working and [executing faster than the alternatives on MRI](https://github.com/fantasticfears/icu4r-next/blob/master/benchmark/normalization_result.txt).

 

## Why are you interested in Open Source?

 

Open source builds common software building blocks, and sharing them benefits the world. [Writing less code means fewer errors](https://blog.codinghorror.com/the-best-code-is-no-code-at-all/) in programming, which seems right to me.

Besides, it's an opportunity for me to learn and get involved.

 

## How long will the project take? When can you begin?

 

3 months

 

## How much time do you expect to dedicate to this project? (weekly)

 

40

 

## Where will you be based during the summer?

 

Sundsvall, Sweden

 

## What timezone will you be working in and what hours do you plan to work? (so we can best match mentors)

UTC+2. I work during the daytime.

 

## Do you have any commitments for the summer? (holidays/work/summer courses)

 

I may take a one-week trip at the end of June. I'll still be reachable if I decide to go.

 

## Have you ever participated in a previous GSoC? If yes, describe your project.

 

Yes, with Discourse. I built the webhooks feature.

 

## Why did you apply for the Google Summer of Code?

 

Making a contribution and getting paid for it is a fulfilling process. I have to wait for a visa :( during the summer, and GSoC is remote.

 

## Why did you choose Ruby on Rails as a mentoring organisation?

 

The Rails team has perhaps the biggest impact on the Ruby ecosystem. Unicode/i18n is not an easy task, and few people know a lot about it. Most people learn and use Rails happily and tolerate the need to reinvent a custom gem for their problem. It's also extremely expensive to improve Unicode support in MRI because of the regexp engine Onigmo and the string implementation, so the MRI team hesitated and moved to the CSI model. However, the Rails community really needs i18n support, since our projects rely on it. On one hand, an open source project tanks without enough attention; driven by the community's needs, this project is more likely to be sustained and useful. On the other hand, it's more likely that I can find a mentor.

 

## Why do you want to participate and why should Ruby on Rails choose you?

 

Years ago, I learnt Rails and was encouraged by the Chinese Ruby community. There are also some serious local businesses built on Rails. However, they have kept coming up with new generations of localization solutions over the years. It feels like reinventing the exact same thing, with little knowledge shared by all. It bites them, and it also bites other projects that value an internationalization strategy. I hope I can pay it forward, and I feel the need to communicate the i18n problems to everyone.
