Could you tell me how the charset alias works now?
Let me give you some background on why I want to know. I previously
wrote a small utility for Mac OS X that tweaks the
charsetalias.properties file to handle the many "big5-HKSCS" encoded
web sites that incorrectly declare themselves as "big5". It simply
replaces the target encoding of big5 with big5-HKSCS. As big5-HKSCS
is a superset of big5, correctly labelled big5 sites still display
properly, and so do the mislabelled big5-HKSCS sites. I have just
downloaded Firefox 4.0 beta 7 and discovered that the
charsetalias.properties file is no longer there. Could you tell me how
charset handling works in Gecko 2.0?
My little utility can be downloaded from here:
http://www.macupdate.com/info.php/id/19216/i-speak-cantonese
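For illustration only, the kind of edit described above could be sketched in Python roughly like this. It is not the utility's actual code. The "alias=Target" line format and the sample excerpt are assumptions about what charsetalias.properties looked like, and retarget_alias is a hypothetical helper name. The last lines use Python's own "big5" and "big5hkscs" codecs (not Gecko's decoders) to show that ordinary big5 bytes decode the same way under the superset.

```python
def retarget_alias(properties_text, alias, new_target):
    """Return properties_text with the entry for `alias` pointing at `new_target`.

    Assumes one "alias=Target" pair per line; comments, blank lines and
    anything without "=" are passed through unchanged.
    """
    out = []
    for line in properties_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key, _, _old_target = stripped.partition("=")
            if key.strip().lower() == alias.lower():
                line = f"{key.strip()}={new_target}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical excerpt, not copied from the real file.
sample = "# hypothetical excerpt\nbig5=Big5\nbig5-hkscs=Big5-HKSCS\n"
patched = retarget_alias(sample, "big5", "Big5-HKSCS")

# Plain big5 text still decodes correctly when treated as big5-HKSCS
# (checked here with Python's codecs, as a stand-in for the browser's).
big5_bytes = "中文".encode("big5")
roundtrip = big5_bytes.decode("big5hkscs")
```

Only the entry for the named alias is rewritten. Every other line, including the existing big5-hkscs entry, is left as it was.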
That moved into the compiled code,
https://bugzilla.mozilla.org/show_bug.cgi?id=563536.
No idea if there's anything left that allows you to tweak it.
That said, is there a bug filed on what you're trying to fix? Add-ons
like yours sound like something we shouldn't need.
Axel
Hard-coding those identifiers is not a great idea :-(
It's not just a matter of something Mozilla got wrong that simply
needs to be fixed. There will be some identifiers that almost nobody
uses and that make no sense to include by default, and there will be a
few cases where things have been done wrong, so it's useful to be able
to override the default even when the default is correct.
big5 is a case that can't be solved satisfactorily with a hard-coded
solution. http://en.wikipedia.org/wiki/Big5 lists at least 10
different extensions to big5, any of which could actually have been
used when a page is tagged as big5.
Of those, HKSCS is probably the only one that's a real standard and is
currently in wide use. But what if someone wants a warning when some
of the characters are not truly big5 and use the HKSCS extension (a
Taiwanese user, for example, for whom HKSCS is not a standard)? And
what if they are handling some old content that's genuinely
incompatible with HKSCS, where they'd love to have big5 interpreted as
something other than HKSCS?
Having the identifiers defined in a resource file that can be
overwritten to change them makes the situation significantly easier.
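To make the idea concrete, here is a minimal Python sketch of built-in defaults that a user-editable resource file can override. It does not reflect how Gecko actually resolves charset labels. BUILTIN_ALIASES, parse_overrides and resolve_charset are hypothetical names, and the "alias=Target" file format is an assumption.

```python
# Hypothetical built-in table, standing in for aliases compiled into the binary.
BUILTIN_ALIASES = {
    "big5": "Big5",
    "big5-hkscs": "Big5-HKSCS",
}


def parse_overrides(text):
    """Parse "alias=Target" lines; skip blanks, comments and malformed lines."""
    overrides = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        alias, _, target = line.partition("=")
        overrides[alias.strip().lower()] = target.strip()
    return overrides


def resolve_charset(label, overrides):
    """Look the label up in the user overrides first, then in the defaults.

    Returns None when the label is unknown to both tables.
    """
    key = label.strip().lower()
    if key in overrides:
        return overrides[key]
    return BUILTIN_ALIASES.get(key)


# A user in Hong Kong might ship this one-line override file.
user_overrides = parse_overrides("# treat big5 as its superset\nbig5=Big5-HKSCS\n")
default_result = resolve_charset("BIG5", {})
overridden_result = resolve_charset("BIG5", user_overrides)
```

With an empty override table the default stays in force, so someone who wants strict big5 keeps it, while someone else can point the same label at HKSCS or at another extension without a rebuild.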
I've found another case that's a lot more pertinent: proprietary
extensions to the Japanese S-JIS encoding for emoji characters.
There currently appear to be three such extensions: DoCoMo, KDDI and
SoftBank. Basically, every Japanese cell network has its own proprietary
way of encoding emoji, documented in
http://www.unicode.org/Public/UNIDATA/EmojiSources.txt
As those emoji characters are now encoded in the Unicode standard,
those extensions can't simply be ignored.