Questions on plug-ins and the "cin" format in Leopard

Eric Rasmussen

Nov 8, 2007, 11:25:45 AM

to Chinese Mac

Well, updating the Yale site for Leopard turned out to only take a few
hours, since I no longer need to do little intros and guides to the
input methods. But it's not yet uploaded, and won't be for a while,
because of one remaining obstacle:

Plug-in input methods.

This has been changed somewhat in Leopard -- we've already seen what
may be a fairly serious bug possibly related to wrongly assigning the
".cin" extension to a file that should have the ".inputplugin"
extension. My current thinking is that ".inputplugin" is for the Apple
format that used to be converted into a ".dat" file by the Input
Method Plug-in Converter. But I don't know anything about what
constitutes a valid ".cin" format file -- how are they different from
the Apple format? Do they really work in Leopard? Or are there
limitations?

I know ".cin" originated with the Xcin project for X11. I also know
OpenVanilla is designed to use the ".cin" format.

Also, I'm moving OpenVanilla to its own section on the input methods
page. It doesn't quite fit in with the other input methods, since it
is a framework. I may put it in a new section called "Plug-in
Frameworks", containing OV and the Apple plug-in mechanism. Does that
make sense?

Anyhow, any advice, wisdom, and/or links to information about the
".cin" format, OV, and the like would be most welcome!

Eric

Lukhnos D. Liu

Nov 9, 2007, 12:25:00 PM

to Chinese Mac

On Nov 9, 12:25 am, Eric Rasmussen <keras...@mac.com> wrote:
> This has been changed somewhat in Leopard -- we've already seen what
> may be a fairly serious bug possibly related to wrongly assigning the
> ".cin" extension to a file that should have the ".inputplugin"
> extension. My current thinking is that ".inputplugin" is for the Apple
> format that used to be converted into a ".dat" file by the Input
> Method Plug-in Converter. But I don't know anything about what
> constitutes a valid ".cin" format file -- how are they different from
> the Apple format? Do they really work in Leopard? Or are there
> limitations?

I'll begin with some history that I know.

.cin was first introduced by Xcin, an input method framework for X11
developed in the mid 1990s, as a data format for table-based input
methods. By table-based I mean input methods that can be implemented,
or seen, as a table look-up mechanism. Around 90% of input methods
(Chinese and beyond) can be implemented that way. Apple's .inputplugin
also belongs to that category. Almost every mainstream input method
framework supports at least one form of user-customizable IME
creation. .cin seems to have become one of the standard data formats
because it's simple and many user-generated tables are already in wide
circulation.

I have very limited knowledge of Xcin and other frameworks, but in the
early days, .cin was intended as a source format, not to be consumed
directly by the input method framework (or more precisely, by the
table-based input method "generator"). Back then a .cin could also use
any encoding recognized by the framework. So phone.cin (renamed to
bpmf.cin in OV) was encoded in Big5, pinyin.cin in GB, and so on.

When we were developing the "generic" module (first named OVIMXcin,
later renamed to OVIMGeneric) to support .cin in OpenVanilla, we made
two decisions. First, we no longer require the user to run a
compiler/converter to turn a .cin into a binary format, as used to be
the case; the .cin is consumed by the input method module directly.
Second, all .cin files must use UTF-8 encoding. This opened the door
to larger character sets and the famous "♨" input method.
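For anyone holding an older table in a legacy encoding, here is a
minimal sketch of re-encoding it to UTF-8. The function name is
hypothetical, and the codec you need is an assumption you have to
check per file (a "Big5" table may really be cp950 or big5hkscs):

```python
def reencode_to_utf8(data, source_encoding):
    """Return the bytes of a legacy-encoded .cin re-encoded as UTF-8.

    `data` is the raw file content; `source_encoding` is a Python codec
    name such as "big5" or "gb2312" (an assumption about the file).
    Raises UnicodeDecodeError if the bytes do not match that codec.
    """
    return data.decode(source_encoding).encode("utf-8")


# One made-up chardef line, as it would be stored in a Big5 table.
legacy_line = "a 日\n".encode("big5")
utf8_line = reencode_to_utf8(legacy_line, "big5")
```

A decode error here is useful: it usually means the guess about the
source encoding was wrong, not that the table is broken.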

So what constitutes a valid .cin file? For OpenVanilla, a .cin file
consists of three sections:

1. a header consisting of directives beginning with "%", like %ename,
%selkey, %endkey. Some of them are metadata; others are control
directives.

2. a keyname block between the directives "%keyname begin" and
"%keyname end". This tells the generic input method how to map each
key typed to the character displayed in the composing stage (mostly to
represent radicals in radical-based input methods).

3. a chardef block between the directives "%chardef begin" and
"%chardef end". This is the body of the data table. "chardef" is
something of an anachronistic misnomer. It used to define the mapping
from key sequences to characters (hence the name), but modern
implementations like OV and gcin allow phrases in this block.
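The three sections above can be illustrated with a small parser. This
is only a sketch under assumptions, not OV's or Apple's actual code:
the function name, the sample table, and the whitespace-separated
"key value" layout of each line are all made up for illustration.

```python
def parse_cin(text):
    """Parse .cin text into (header, keynames, chardefs)."""
    header = {}      # directive name -> value, e.g. "ename" -> "Demo"
    keynames = {}    # typed key -> symbol shown while composing
    chardefs = {}    # key sequence -> list of characters or phrases
    block = None     # "keyname" or "chardef" while inside a block

    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if block is None:
            if not line.startswith("%"):
                continue  # outside a block, skip comments and stray lines
            name = parts[0][1:]
            value = parts[1] if len(parts) > 1 else ""
            if name in ("keyname", "chardef") and value == "begin":
                block = name
            else:
                header[name] = value
        elif parts[0] == "%" + block and parts[1:] == ["end"]:
            block = None
        elif len(parts) == 2:
            if block == "keyname":
                keynames[parts[0]] = parts[1]
            else:
                # One key sequence may map to several outputs.
                chardefs.setdefault(parts[0], []).append(parts[1])
    return header, keynames, chardefs


# Made-up sample table; the last chardef line shows a phrase entry.
SAMPLE = """%ename Demo
%selkey 123456789
%keyname begin
a 日
b 月
%keyname end
%chardef begin
a 日
ab 明
ab 明天
%chardef end
"""

header, keynames, chardefs = parse_cin(SAMPLE)
```

Keeping a list per key sequence is what lets the same keys offer
several candidates, including phrases.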

Different frameworks have implemented the details somewhat
differently. OV's implementation disallows Windows-style CR LF line
endings (only the UNIX-style \n is accepted, which is also what OS X
uses), and comment lines (beginning with #) are not allowed in the
chardef block.
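A quick pre-flight check for those two restrictions (plus the UTF-8
requirement) might look like the sketch below. The function name and
messages are hypothetical; it only mirrors the rules as described
here and is not taken from OV's source.

```python
def check_cin_for_ov(data):
    """Return a list of problems found in the raw bytes of a .cin file."""
    problems = []
    if b"\r" in data:
        problems.append("file contains CR characters; use LF-only line endings")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
        return problems

    in_chardef = False
    for number, line in enumerate(text.split("\n"), start=1):
        fields = line.split()
        if fields[:2] == ["%chardef", "begin"]:
            in_chardef = True
        elif fields[:2] == ["%chardef", "end"]:
            in_chardef = False
        elif in_chardef and line.startswith("#"):
            # Assumption: any line starting with "#" here was meant as a comment.
            problems.append(f"line {number}: comment inside chardef block")
    return problems


good = "%ename Demo\n%chardef begin\na 日\n%chardef end\n".encode("utf-8")
bad = b"%chardef begin\r\n# note\r\na b\r\n%chardef end\r\n"
```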

Although .cin contains enough information for key-character/phrase
mapping, many input methods (like Cangjie/"Changjei" or Simplex/
Jianyi) require finer control. For OpenVanilla, the control is
provided in the form of input method preferences (with some
mind-boggling names like "force composition when reaching maximum
length of radical" or "use space to select the 1st candidate").
Different input methods require different controls (and those are a
must -- failure to provide them yields barely usable input methods).
gcin differs from OV's implementation in that it allows those control
directives to be expressed in the .cin header, with its own directive
extensions.

OpenVanilla's repository of .cin is available at:

http://openvanilla.googlecode.com/svn/trunk/Modules/SharedData/

Zonble has written an excellent tutorial (in Chinese) on how to create
your own input method by writing a .cin, which has become something of
a standard text:

http://docs.google.com/View?docid=ah6d8th954vw_201fd5dkx

Technically .cin is really just a set of key-value pairs with its own
conventions. OV makes heavy use of .cin as a format. Things like
reverse radical/pinyin lookup or associated phrases are also done
with .cin-based data tables. I see it as a good sign that Apple has
adopted a popular (and mostly consistent and cross-framework
compatible) data format for Leopard.
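Since the format is just key-value pairs, a reverse lookup table is
easy to picture. The sketch below is an illustration only: it inverts
an in-memory chardef mapping, whereas OV is described above as using
separate .cin-based data tables for this. The function name and the
sample entries are made up.

```python
def build_reverse_table(chardefs):
    """Invert {key sequence: [outputs]} into {output: [key sequences]}."""
    reverse = {}
    for keys, outputs in chardefs.items():
        for output in outputs:
            reverse.setdefault(output, []).append(keys)
    return reverse


# Made-up entries: the same character reachable from two key sequences.
chardefs = {"a": ["日"], "ab": ["明"], "xy": ["明"]}
reverse = build_reverse_table(chardefs)
```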

So what about Leopard? As far as I know, dropping a UTF-8-encoded
.cin into ~/Library/Input Methods or /Library/Input Methods and then
logging in again just works. A new input method, using the name
defined in the .cin, shows up in the Input Menu tab of the
International preferences panel. I'm not aware of any per-method
control so far (I may well be ignorant on this).
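The drop-in step can be scripted. This is a sketch under assumptions:
the helper name and its dest_dir parameter are hypothetical, the
per-user folder is the one named above, and the UTF-8 check is just a
guard added here. You still need to log out and back in afterwards.

```python
from pathlib import Path
import shutil


def install_cin(cin_path, dest_dir=None):
    """Copy a UTF-8 .cin into an Input Methods folder; return the new path."""
    cin_path = Path(cin_path)
    if dest_dir is None:
        # Per-user location mentioned in the thread.
        dest_dir = Path.home() / "Library" / "Input Methods"
    dest_dir = Path(dest_dir)

    # Raises UnicodeDecodeError if the file is not valid UTF-8.
    cin_path.read_bytes().decode("utf-8")

    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(cin_path, dest_dir / cin_path.name))
```

Calling install_cin("mytable.cin") with no second argument targets the
per-user folder; pass another directory to try it out safely first.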

As for limitations, I'm not aware of any either. OV's own
implementation (and many others) is only limited by memory and your
patience (loading a .cin with 200,000 entries on a G3 is no small
thing; a database-backed design would solve that problem). Leopard's
own take should not differ much. So it should be very flexible and
easily customizable.

d.

Lukhnos D. Liu

Nov 9, 2007, 1:11:13 PM

to Chinese Mac

On Nov 10, 1:25 am, "Lukhnos D. Liu" <lukh...@gmail.com> wrote:
> In terms of limitation, I'm not aware of that either. OV's own
> implementation (and many others) is only limited by memory and your
> patience (loading a .cin with 200,000 entries on a G3 is no small
> thing; a database-backed design will solve the problem). Leopard's own
> take should not differ much. So it should be very flexible and easily
> customizable.

I should quickly point out that both OV's and Leopard's .cin
implementations are fast. Loading 200k entries on a G3 is an extreme
case.

d.

傅可恩

Nov 9, 2007, 9:08:11 PM

to Chinese Mac

Wow. Thanks for that Lukhnos!

I remember trying to implement Xcin on X11 when OS X first came out
and didn't have good Chinese support. I couldn't get anywhere, so I
really appreciate how much work has gone into making it easy to do
things with OV!

Cheers,

Kerim
