> On Wed, 28 Mar 2012 12:18:41 +0200, Anne van Kesteren <ann...@opera.com>
> wrote:
>> I'm not sure what to do with big5 and big5-hkscs. After generating all
>> possible byte sequences (lead bytes 0x81 to 0xFE, trail bytes 0x40 to
>> 0x7E and 0xA1 to 0xFE) and getting the code points for those in various
>> browsers there does not seem to be that much interoperability.
>>
>> http://html5.org/temp/big5.json has all the code points for Internet
>> Explorer ("internetexplorer", same for big5 and hkscs), Firefox
>> ("firefox" and "firefox-hk"), Opera ("opera" and "opera-hk"), and
>> Chrome ("chrome" and "chrome-hk"). "internetexplorer" and "chrome" are
>> quite close, the rest is a little further apart.
>>
>> Some help as to how best to proceed would be appreciated.
>
> To give some more context, IE treats big5 and big5-hkscs identically. Out
> of the total 19782 code points, 6217 of them map to the Private Use Area
> (PUA) in IE. Chrome does the same for big5, but has a different mapping
> for big5-hkscs. To deal with HKSCS Microsoft brought out this patch:
> http://www.microsoft.com/hk/hkscs/ Basically people living in the Hong
> Kong area are expected to have that installed and therefore the PUA code
> points map to different glyphs. I'm not sure what the situation is like
> on Mac or Linux, but given the market share statistics I saw the market
> is pretty heavily dominated by Microsoft.
>
> Gecko seems to use a combination of things as documented in
> https://bugzilla.mozilla.org/show_bug.cgi?id=310299 though it is unclear
> how successful that approach is.
>
> There are also various threads online such as
> http://www.google.com/support/forum/p/Chrome/thread?tid=466c210af3fb6d08
> that seem to indicate "pages in the Hong Kong area" are not using the
> big5-hkscs label and therefore rely on what IE and Chrome do for big5
> and rely on users having the compatible fonts.
Making big5 and big5-hkscs aliases sounds like a good idea, on the
assumption that big5-hkscs is a pure extension of Big5.
To make this more concrete, here are a few fairly common characters that I
think are in big5-hkscs but not in big5, with their Unicode code point and
byte representation in big5-hkscs when converted using Python:
啫 U+556B '\x94\xdc'
嗰 U+55F0 '\x9d\xf5'
嘅 U+5605 '\x9d\xef'
I'm not sure how to use big5.json, so perhaps you can tell me what these
map to in various browsers? If they all map to the same code points,
examples of byte sequences that do not would be interesting.
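A minimal sketch of how the byte values above can be reproduced, assuming
Python 3 and its standard "big5hkscs" and "big5" codecs. The helper name
describe() is made up for illustration. Whether a character also encodes
under plain "big5" depends on the codec tables in the Python build, so the
code reports that rather than assuming it.

```python
def describe(ch):
    """Return (code point label, big5-hkscs bytes, encodable in plain big5?)."""
    hkscs = ch.encode("big5hkscs")
    try:
        ch.encode("big5")
        in_big5 = True
    except UnicodeEncodeError:
        # Plain big5 codec has no mapping for this character.
        in_big5 = False
    return "U+%04X" % ord(ch), hkscs, in_big5


# The three characters listed above: U+556B, U+55F0, U+5605.
for ch in "\u556b\u55f0\u5605":
    print(describe(ch))
```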
It seems fairly obvious that the most sane solution would be to just use a
more correct mapping that doesn't involve the PUA, but:
1. What is the compatible subset of all browsers?
2. Does that subset include anything mapping to the PUA?
3. Do Hong Kong or Taiwan sites depend on charCodeAt returning values in
the PUA?
4. Would hacks be needed on the font-loading side if browsers started
using a more correct mapping?
--
Philip Jägenstedt
Core Developer
Opera Software
I believe they are not, but given that a) Windows treats them identically
and b) reportedly has no different default setup for Hong Kong and Taiwan
users (and no longer offers an HKSCS download), they can probably be
considered the same.
For more details on Windows and Internet Explorer, see:
http://lists.w3.org/Archives/Public/www-archive/2012Mar/thread.html#msg46
> To make this more concrete, here are a few fairly common characters that
> I think are in big5-hkscs but not in big5, with their Unicode code point
> and byte representation in big5-hkscs when converted using Python:
>
> 啫 U+556B '\x94\xdc'
> 嗰 U+55F0 '\x9d\xf5'
> 嘅 U+5605 '\x9d\xef'
>
> I'm not sure how to use big5.json, so perhaps you can tell me what these
> map to in various browsers? If they all map to the same code points,
> examples of byte sequences that do not would be interesting.
big5.json is the result of outputting all possible lead/trail byte
combinations and then running charCodeAt over the resulting string, while
accounting for surrogates and working around a minor problem in Opera.
Running the following (Python):
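import json
# big5.json maps each browser label to a flat list of code points
with open("big5.json", "r") as f:
    data = json.load(f)
lead = 0x9D
trail = 0xF5
# Cells per lead byte: trail bytes 0x40-0x7E plus 0xA1-0xFE
row = 0xFE-0xA1 + 0x7E-0x40 + 2
# Position of the trail byte within its lead-byte row
cell = (trail-0xA1 + 0x7E-0x40 + 1) if trail > 0x7E else trail - 0x40
index = (lead-0x81) * row + cell
for x in data:
    print(x, hex(data[x][index]))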
I get
opera-hk 0x55f0
firefox 0x9c1f
chrome 0xecd7
firefox-hk 0x55f0
opera 0xfffd
chrome-hk 0x55f0
internetexplorer 0xecd7
indicating browsers agree for big5-hkscs and not at all for big5. Similar
results for your other examples.
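The index arithmetic in the script above can be wrapped in a pair of
helpers. This is only a sketch, assuming the big5.json layout described
earlier (157 cells per lead byte, lead bytes starting at 0x81). The names
big5_index() and big5_bytes() are hypothetical.

```python
TRAILS_LOW = 0x7E - 0x40 + 1     # 63 trail bytes in 0x40..0x7E
TRAILS_HIGH = 0xFE - 0xA1 + 1    # 94 trail bytes in 0xA1..0xFE
ROW = TRAILS_LOW + TRAILS_HIGH   # 157 cells per lead byte


def big5_index(lead, trail):
    """Map a (lead, trail) byte pair to its position in a big5.json list."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if 0x40 <= trail <= 0x7E:
        cell = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        cell = trail - 0xA1 + TRAILS_LOW
    else:
        raise ValueError("trail byte out of range")
    return (lead - 0x81) * ROW + cell


def big5_bytes(index):
    """Inverse of big5_index: recover the (lead, trail) byte pair."""
    lead, cell = divmod(index, ROW)
    if cell < TRAILS_LOW:
        trail = cell + 0x40
    else:
        trail = cell - TRAILS_LOW + 0xA1
    return lead + 0x81, trail


# The three byte sequences from the examples in this thread.
for lead, trail in [(0x94, 0xDC), (0x9D, 0xF5), (0x9D, 0xEF)]:
    print(hex(lead), hex(trail), big5_index(lead, trail))
```

The inverse is handy when scanning the lists for disagreements and
reporting them as byte sequences rather than as list positions.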
> It seems fairly obvious that the most sane solution would be to just use
> a more correct mapping that doesn't involve the PUA, but:
>
> 1. What is the compatible subset of all browsers?
> 2. Does that subset include anything mapping to the PUA?
This depends on whether or not you include big5-hkscs results. Opera never
maps to PUA, but whether that is compatible enough is unclear.
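Questions 1 and 2 could be answered from big5.json with something like the
following. It is a sketch only, assuming each label maps to a flat list of
code points as in the script above. It treats U+E000 to U+F8FF plus the
private use ranges in planes 15 and 16 as PUA. compatible_subset() and
is_pua() are illustrative names, and excluding U+FFFD from "agreement" is a
choice made here, not something settled in this thread. The sample data in
the block is made up.

```python
def is_pua(cp):
    """True if the code point lies in a Unicode Private Use Area."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def compatible_subset(data, labels):
    """Return {index: code point} for cells where all labels agree."""
    columns = [data[label] for label in labels]
    agreed = {}
    for index, values in enumerate(zip(*columns)):
        # Agreement on U+FFFD (decode failure) is not counted here.
        if len(set(values)) == 1 and values[0] != 0xFFFD:
            agreed[index] = values[0]
    return agreed


# Made-up sample in the same shape as big5.json, NOT real browser data.
sample = {
    "a": [0x4E00, 0xE000, 0xFFFD, 0x55F0],
    "b": [0x4E00, 0xE000, 0xFFFD, 0xECD7],
}
subset = compatible_subset(sample, ["a", "b"])
pua_cells = [i for i, cp in subset.items() if is_pua(cp)]
print(len(subset), "cells agree;", len(pua_cells), "of those map to the PUA")
```

Passing only the non-"-hk" labels, or all of them, gives the two variants
of the answer mentioned above.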
> 3. Do Hong Kong or Taiwan sites depend on charCodeAt returning values in
> the PUA?
>
> 4. Would hacks be needed on the font-loading side if browsers started
> using a more correct mapping?
Don't know.
Mozilla has done a number of interesting things here that nobody else
does, but that was all a big deal in 2005 or earlier.
https://bugzilla.mozilla.org/show_bug.cgi?id=9686
https://bugzilla.mozilla.org/show_bug.cgi?id=310299
How relevant that is today, given that they are not the market leader
there, is unclear.
Given the information from Microsoft indicated at the start of this email,
I think just following Internet Explorer here is probably the best way
forward, combined with strongly discouraging the use of big5.
--
Anne van Kesteren
http://annevankesteren.nl/
The byte ranges (lead, trail) that IE and Chrome map to the PUA are 0x8140
to 0xA0FE and 0xC6A2 to 0xC8FE. In all browsers other than Internet
Explorer, the big5-hkscs label gives different results for these byte
sequences (though not always for all of them). It also differs for the
lead byte 0xA3 row, but that row is mapped in an incompatible way among
big5 implementations as well.
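As a rough illustration, the two ranges can be turned into a predicate on
byte pairs. This is a sketch that takes the range boundaries exactly as
stated above. PUA_RANGES and in_pua_byte_range() are made-up names, and no
claim is made that these ranges cover every PUA mapping in big5.json.

```python
# (first, last) pointers, written as lead byte << 8 | trail byte.
PUA_RANGES = [(0x8140, 0xA0FE), (0xC6A2, 0xC8FE)]


def is_valid_trail(trail):
    """Trail bytes used when generating big5.json: 0x40-0x7E, 0xA1-0xFE."""
    return 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE


def in_pua_byte_range(lead, trail):
    """True if (lead, trail) falls in one of the ranges given above."""
    if not is_valid_trail(trail):
        return False
    pointer = (lead << 8) | trail
    return any(first <= pointer <= last for first, last in PUA_RANGES)


# 0x9D 0xF5 is the byte sequence looked up earlier in this thread.
print(in_pua_byte_range(0x9D, 0xF5))
```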
Ideally someone does detailed content analysis to figure out what the best
path forward is here, though I'm not entirely sure how.