Cross-variety English Spellchecking

David Chan

Jun 15, 2013, 6:43:34 PM
to dev-...@lists.mozilla.org
Hello,

I am in the process of open-sourcing a cross-variety English
spellchecker project at http://github.com/divec/en-global .

This "global" spellchecker allows all national variant English
spellings (so "color" and "colour" are both accepted) and has better
global placename and personal name support than region-specific
spellcheckers. It is helpful whenever cross-regional forms should be
supported; e.g. for English learners or international collaboration.

I am packaging it as an XPI extension. Currently I have to do this by
creating multiple identical files, which each pretend to be a national
variant; i.e.:

en-AU.aff en-AU.dic
en-CA.aff en-CA.dic
...
en-US.aff en-US.dic

This is necessary because Mozilla identifies hunspell dictionary
languages via an xx-YY naming convention, where xx is the language
code and YY is the ISO 3166 country code.

Obviously this is inefficient and misleading (because we're checking
"en", not "en-AU" or "en-CA"). This same issue would affect "global"
(multi-variety) spellcheckers for other languages such as Spanish.

At the moment, I think the best way to fix this would be for Mozilla
to allow BCP 47-style codes for hunspell dictionaries. It would be an
opportunity to support language varieties such as these:

en
nan
sr-Latn
es-419
sl-IT-nedis
az-Arab-x-AZE-derbend

Then "global" spellcheckers would work via BCP 47's fallback rules
(en-US -> en etc). At the same time, other varieties such as Serbian
written in Latin would be supported.
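
For illustration, the fallback mentioned above (en-US -> en) can be
sketched in Python along the lines of the "lookup" scheme in RFC 4647.
This is only a minimal sketch, not Mozilla's actual code: the function
names fallback_chain and pick_dictionary are hypothetical, and it
assumes installed dictionaries are identified by their tag alone.

```python
from typing import Iterable, List, Optional


def fallback_chain(tag: str) -> List[str]:
    """Return the tag followed by progressively shorter prefixes."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag (such as "x") cannot end a tag,
        # so it is dropped together with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def pick_dictionary(requested: str, installed: Iterable[str]) -> Optional[str]:
    """Hypothetical helper: first installed dictionary on the fallback chain."""
    # Language tags are case-insensitive, so compare lowercased forms.
    by_lower = {name.lower(): name for name in installed}
    for candidate in fallback_chain(requested):
        match = by_lower.get(candidate.lower())
        if match is not None:
            return match
    return None
```

With only "en" and "sr-Latn" installed, pick_dictionary("en-US", ...)
would return "en", so a single global dictionary could serve every
regional request without duplicated files.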

Does everyone agree that this would be a good approach?

Thanks,
--
David Chan

Jesper Kristensen

Jun 16, 2013, 4:14:24 AM
Doesn't this work today already?

For the Danish spell check dictionary I use da.aff and da.dic, and it works.

Gordon P. Hemsley

Jun 16, 2013, 1:49:38 PM
I agree that this would be a good approach, and it's one which probably already works for locales like Danish that don't use region tags.

The problem is likely that the fallback is not implemented for locales which use more complex language tags, as David has suggested.

There is probably already a long-standing bug on file for this (or something similar), but I don't know it offhand.

HTH,
Gordon

Axel Hecht

Jun 16, 2013, 5:28:26 PM
I don't think that fallback matters here. spellchecker.dictionary is
just set to the locale code of your dictionary, no matter how it's
formatted.
Judging by
http://mxr.mozilla.org/mozilla-central/search?string=spellchecker.dictionary
that stuff should be pretty straightforward.

There's the long-standing problem of using language codes for which we
don't have a name in our setup, but that's different.

Axel

Mark Tyndall

Jun 18, 2013, 2:35:38 AM
David Chan wrote:
> Hello,
>
> I am in the process of open-sourcing a cross-variety English
> spellchecker project at http://github.com/divec/en-global .
>
> This "global" spellchecker allows all national variant English
> spellings (so "color" and "colour" are both accepted) and has better
> global placename and personal name support than region-specific
> spellcheckers. It is helpful whenever cross-regional forms should be
> supported; e.g. for English learners or international collaboration.

Hi,

Is this an actual requirement anywhere? Any "international" setting
I've encountered either implicitly or explicitly defines whether English
or American English is the preferred language.

And I don't believe that English learners are helped by not being
corrected if they are spelling words wrongly* in the chosen regional
version of English they are learning.

* NB: exactly one version of "color" or "colour" is correct, not both,
for -AU, -CA, -GB and -US. (Contrast with organise/organize, which is
ruled in en-GB by preference alone.)

regards,

Mark.
en-GB dictionary packager

John Wilcock

Jun 18, 2013, 4:33:31 AM
On 18/06/2013 08:35, Mark Tyndall wrote:
> Is this an actual requirement anywhere? Any "international" setting
> I've encountered either implicitly or explicitly defines whether English
> or American English is the preferred language.
>
> And I don't believe that English learners are helped by not being
> corrected if they are spelling words wrongly* in the chosen regional
> version of English they are learning.
>
> * NB: exactly one version of "color" or "colour" is correct, not both,
> for -AU, -CA, -GB and -US. (Contrast with organise/organize, which is
> ruled in en-GB by preference alone.)

Indeed - while people in some "international" settings don't really care
which variety of English spelling is used, consistency *is* (or should
be) important.

For that matter, consistency of -ize/-ise spellings is also important; a
far more useful project IMO would be to create en-GB-Oxford and
en-GB-standard dictionaries that only accept one or the other but not both.

--
John

Mark Tyndall

Jun 18, 2013, 6:07:41 AM
On 18/06/2013 09:33, John Wilcock wrote:
>[...]
> For that matter, consistency of -ize/-ise spellings is also important; a
> far more useful project IMO would be to create en-GB-Oxford and
> en-GB-standard dictionaries that only accept one or the other but not both.

David Bartlett, the last maintainer of the en-GB dictionary, did produce
an en-GB-OED version.

http://en-gb.pyxidium.co.uk/dictionary/mozilla.php lists it as being
considered "beta".


Mark.

John Wilcock

Jun 18, 2013, 11:20:55 AM
On 18/06/2013 12:07, Mark Tyndall wrote:
>> For that matter, consistency of -ize/-ise spellings is also important; a
>> far more useful project IMO would be to create en-GB-Oxford and
>> en-GB-standard dictionaries that only accept one or the other but not
>> both.
>
> David Bartlett, the last maintainer of the en-GB dictionary, did produce
> an en-GB-OED version.
>
> http://en-gb.pyxidium.co.uk/dictionary/mozilla.php lists it as being
> considered "beta".

Thanks, I wasn't aware of that - though unfortunately the main en-GB
dictionary still accepts both -ize and -ise variants.