Re: Unicode support for Character.is* methods

j...@google.com

Mar 16, 2010, 10:51:06 PM

to pmuet...@google.com, google-web-tool...@googlegroups.com

A few issues:

- the way this is divided, all of the code will get pulled into every
app that calls any of these methods.
- this is incomplete and doesn't have the other properties, such as
getDirection, toLower, etc.

I had written a full implementation a while ago (it is still available
in svn at changes/jat/ucd), which encoded each table separately with a
combination of run-length encoding and huffman coding the runs, which
got the size of individual tables down to a few hundred bytes each, and
you only paid for the tables that were used. The decompression code was
of course larger, so maybe there is room for a simpler encoding
mechanism that takes less code even if the data is larger.
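
To illustrate the run-length half of that idea (the Huffman stage is left out), here is a minimal Java sketch. The class name `RunLengthTable`, its methods, and the toy property are hypothetical; none of it is taken from changes/jat/ucd.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a boolean character property stored as alternating run lengths. */
final class RunLengthTable {
  // runs[0] is the length of the initial "false" run, runs[1] the following "true" run, etc.
  private final int[] runs;

  private RunLengthTable(int[] runs) {
    this.runs = runs;
  }

  /** Encodes the property as alternating false/true run lengths, starting with false. */
  static RunLengthTable encode(boolean[] property) {
    List<Integer> out = new ArrayList<>();
    boolean current = false;
    int length = 0;
    for (boolean value : property) {
      if (value == current) {
        length++;
      } else {
        out.add(length); // close the current run (may be 0 if the table starts with true)
        current = value;
        length = 1;
      }
    }
    out.add(length);
    int[] runs = new int[out.size()];
    for (int i = 0; i < runs.length; i++) {
      runs[i] = out.get(i);
    }
    return new RunLengthTable(runs);
  }

  /** Walks the runs until the one containing codeUnit is found. */
  boolean test(int codeUnit) {
    int end = 0;
    for (int i = 0; i < runs.length; i++) {
      end += runs[i];
      if (codeUnit < end) {
        return (i & 1) == 1; // odd-numbered runs are the "true" runs
      }
    }
    return false; // beyond the encoded range
  }

  int runCount() {
    return runs.length;
  }
}
```

A 128-entry table that is true only for 'A' to 'Z' collapses to three run lengths. The lookup here is a linear walk, which keeps the decoding code tiny at the cost of speed.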

That effort was complete but was never merged in because some people
objected to the code size increase. Given the synchronous nature of the
API, it isn't feasible to fetch the tables on-demand from a server, so
they have to be downloaded with the code (they can go into different
runAsync fragments though).

I hope to work on that and other i18n issues next quarter, but I am not
sure how much time I will have to work on it.

http://gwt-code-reviews.appspot.com/226801

Pascal Muetschard

Mar 31, 2010, 5:43:48 PM

to Google Web Toolkit Contributors

I have uploaded another patch set to http://gwt-code-reviews.appspot.com/226801
to address the concerns raised. See inline messages below.

This latest version has an ASCII-only option for the is*() methods,
which has an overhead of a couple hundred bytes. See below for the
size penalties for the tables:

is*() methods: 5396 (all of them, not each)
getDirectionality(): 2627
getType(): 4112
getNumericValue: 2163
digit(char,int): 6779 (uses isDigit() and getNumericValue())

TOTAL: 11681 (a savings of 2617 - the code shared between them is
about 700 bytes)

I feel the size concern is well met: with all the tables in use, the
penalty is a mere 11k. Compared with the bootstrap code, usually at 5k,
and HashMap at 15k, that's quite small.

On Mar 16, 7:51 pm, j...@google.com wrote:
> A few issues:
>
> - the way this is divided, all of the code will get pulled into every
> app that calls any of these methods.

I had thought about separating the tables out for each of the is*()
methods. However, the extra code that each of the objects adds quickly
becomes much larger than the data of the tables. This means that we
would sacrifice the runtime size of the common case for the corner
case where only a single is*() method is used. Also, the inherent
relationship (i.e. if isDefined is false, all others are false as
well) and mutual exclusion (isUpperCase vs isLowerCase vs isDigit)
between the attributes cut out a lot of duplication when the tables
are combined.
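
As a rough illustration of why combining helps, the sketch below stores one small category code per character and derives several is*() answers from it. The class `CombinedAttributes`, its constants, and the ASCII-only toy table are assumptions made for this example and do not reflect the patch's actual layout.

```java
/** Hypothetical sketch: one combined table answers several is*() queries. */
final class CombinedAttributes {
  // Mutually exclusive categories share one small code instead of needing a table each.
  static final byte UNDEFINED = 0; // isDefined() is false, so every other is*() is false too
  static final byte OTHER = 1;
  static final byte UPPER = 2;
  static final byte LOWER = 3;
  static final byte DIGIT = 4;

  // Toy table covering ASCII only. Assumption: a real table would cover the whole
  // Unicode range and be stored compressed.
  private static final byte[] TABLE = new byte[128];

  static {
    for (int c = 0; c < TABLE.length; c++) {
      if (c >= 'A' && c <= 'Z') {
        TABLE[c] = UPPER;
      } else if (c >= 'a' && c <= 'z') {
        TABLE[c] = LOWER;
      } else if (c >= '0' && c <= '9') {
        TABLE[c] = DIGIT;
      } else {
        TABLE[c] = OTHER;
      }
    }
  }

  // In this toy version everything outside ASCII is reported as undefined.
  private static byte lookup(char ch) {
    return ch < TABLE.length ? TABLE[ch] : UNDEFINED;
  }

  static boolean isDefined(char ch) {
    return lookup(ch) != UNDEFINED;
  }

  static boolean isUpperCase(char ch) {
    return lookup(ch) == UPPER;
  }

  static boolean isLowerCase(char ch) {
    return lookup(ch) == LOWER;
  }

  static boolean isDigit(char ch) {
    return lookup(ch) == DIGIT;
  }
}
```

Because a character can hold only one code, the mutual exclusion comes for free, and an undefined character fails every other test without any extra data.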

I have also added the deferred property and your "ASCII version."
However, I've made Unicode the default, as I feel that's what
people expect from GWT - to be i18n compatible by default.

> - this is incomplete and doesn't have the other properties, such as
> getDirection, toLower, etc.

I've added getDirectionality(), getType(), getNumericValue() and
digit(char,int). I have excluded the to*Case() methods on purpose, as
their definition is not i18n correct - there are upper case characters
that need more than one character in lower case and vice versa.
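
The point about case mappings can be seen on desktop Java (this says nothing about GWT's emulation): the string-level mapping of the German sharp s expands to two letters, which a char-to-char method cannot express. The class name `CaseMappingDemo` and its helper are made up for this illustration.

```java
import java.util.Locale;

/** Hypothetical demo of a case mapping that does not fit in a single char. */
final class CaseMappingDemo {
  /** Returns true when upper-casing the string changes its length in UTF-16 code units. */
  static boolean upperCaseChangesLength(String s) {
    return s.toUpperCase(Locale.ROOT).length() != s.length();
  }

  public static void main(String[] args) {
    String sharpS = "\u00DF"; // LATIN SMALL LETTER SHARP S
    // The String method applies the full mapping and yields two letters.
    System.out.println(sharpS.toUpperCase(Locale.ROOT));
    // The char method has nowhere to put a second letter, so it returns its input.
    System.out.println(Character.toUpperCase('\u00DF') == '\u00DF');
  }
}
```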

>
> I had written a full implementation a while ago (it is still available
> in svn at changes/jat/ucd), which encoded each table separately with a
> combination of run-length encoding and huffman coding the runs, which
> got the size of individual tables down to a few hundred bytes each, and
> you only paid for the tables that were used. The decompression code was
> of course larger, so maybe there is room for a simpler encoding
> mechanism that takes less code even if the data is larger.

I have looked at this and mine is similar. My version also encodes run
lengths and makes sure that the most common "tokens" have the smallest
representation. It also uses LZW to compress the data. This
compression is simple and needs a lot less code to decompress, but
still provides a good compression ratio.
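
For readers unfamiliar with LZW, here is a textbook version in Java showing how little code the decompressor needs. It is only a sketch: the class `LzwSketch`, the 0..255 starting alphabet, and the use of a list of int codes are assumptions for this example, not the encoding used in the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical textbook LZW over characters 0..255, emitting int codes. */
final class LzwSketch {
  private static final int FIRST_CODE = 256; // codes 0..255 stand for single characters

  static List<Integer> compress(String input) {
    Map<String, Integer> dict = new HashMap<>();
    for (int i = 0; i < FIRST_CODE; i++) {
      dict.put(String.valueOf((char) i), i);
    }
    List<Integer> out = new ArrayList<>();
    String w = "";
    for (char c : input.toCharArray()) {
      if (c >= FIRST_CODE) {
        throw new IllegalArgumentException("sketch only handles characters 0..255");
      }
      String wc = w + c;
      if (dict.containsKey(wc)) {
        w = wc; // keep extending the current phrase
      } else {
        out.add(dict.get(w)); // emit the longest known phrase
        dict.put(wc, dict.size()); // and learn the new one
        w = String.valueOf(c);
      }
    }
    if (!w.isEmpty()) {
      out.add(dict.get(w));
    }
    return out;
  }

  /** Rebuilds the dictionary on the fly; assumes the codes came from compress(). */
  static String decompress(List<Integer> codes) {
    if (codes.isEmpty()) {
      return "";
    }
    List<String> dict = new ArrayList<>();
    for (int i = 0; i < FIRST_CODE; i++) {
      dict.add(String.valueOf((char) i));
    }
    String w = dict.get(codes.get(0));
    StringBuilder out = new StringBuilder(w);
    for (int k = 1; k < codes.size(); k++) {
      int code = codes.get(k);
      // A code equal to dict.size() refers to the phrase that is about to be added.
      String entry = code < dict.size() ? dict.get(code) : w + w.charAt(0);
      out.append(entry);
      dict.add(w + entry.charAt(0));
      w = entry;
    }
    return out.toString();
  }
}
```

The decompressor is a single loop over the codes with no dictionary transmitted alongside the data, which is where the small code size comes from.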

>
> That effort was complete but was never merged in because some people
> objected to the code size increase. Given the synchronous nature of the
> API, it isn't feasible to fetch the tables on-demand from a server, so
> they have to be downloaded with the code (they can go into different
> runAsync fragments though).
>
> I hope to work on that and other i18n issues next quarter, but I am not
> sure how much time I will have to work on it.

Which is why I'm trying to help with this :)

>
> http://gwt-code-reviews.appspot.com/226801

John Tamplin

Apr 1, 2010, 11:11:49 AM

to google-web-tool...@googlegroups.com

On Wed, Mar 31, 2010 at 5:43 PM, Pascal Muetschard <pmuetsc...@google.com> wrote:

> I have uploaded another patch set to http://gwt-code-reviews.appspot.com/226801
> to address the concerns raised. See inline messages below.

Thanks for your efforts -- it will be next week before I can look closely at it.

--
John A. Tamplin
Software Engineer (GWT), Google

Mark Proctor

Apr 28, 2020, 7:07:09 AM

to GWT Contributors

This was never merged in the end. Are the patch sets still available?

On Thursday, 1 April 2010 16:11:49 UTC+1, John Tamplin wrote:

Goktug Gokdogan

Apr 28, 2020, 3:33:36 PM

to google-web-toolkit-contributors

I don't have access to that patch, but even if it was correct, I'm sure it was very costly.

A recent attempt looked like the following:

  private static class RegExps {
    /* Formatted strings for the fallback regular expressions were generated with the following
     * Python 3 script.

    from unicodedata import bidirectional, category
    from itertools import *
    condition = lambda x: category(chr(x)).startswith("L")  # adjust this
    uchr = lambda x: "\\\\u%04X" % x if bidirectional(chr(x)) in ("R", "AN", "AL") else chr(x)
    ranges = []
    codepoints = [x for x in range(0, 0xFFFF) if condition(x)]
    for _, group in groupby(enumerate(codepoints), lambda t: t[0] - t[1]):
        l = list(item for (index, item) in group)
        ranges += [uchr(l[0]) + ("-" if len(l) > 2 else "") + (uchr(l[-1]) if len(l) > 1 else "")]
    formatted = "        \"["
    for r in ranges:
        if len(formatted) + len(r) > 99:
            print(formatted + "\"")
            formatted = "            + \""
        formatted += r
    print(formatted + "]\";")
     */
    private static final String LETTER_FALLBACK =
        "[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙա-և\\u05D0-\\u05EA\\u05F0-\\u05F2"
            + "\\u0620-\\u064A\\u066E\\u066F\\u0671-\\u06D3\\u06D5\\u06E5\\u06E6\\u06EE\\u06EF"
            + "\\u06FA-\\u06FC\\u06FF\\u0710\\u0712-\\u072F\\u074D-\\u07A5\\u07B1\\u07CA-\\u07EA"
            + "\\u07F4\\u07F5\\u07FA\\u0800-\\u0815\\u081A\\u0824\\u0828\\u0840-\\u0858"
            + "\\u08A0-\\u08B4\\u08B6-\\u08BDऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএঐও-নপ-রলশ-হঽৎড়ঢ়য়-ৡৰৱਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹ"
            + "ਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલળવ-હઽૐૠૡૹଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହଽଡ଼ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹௐఅ-ఌఎ-ఐ"
            + "ఒ-నప-హఽౘ-ౚౠౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠೡೱೲഅ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาำเ-ๆກຂຄງຈຊຍ"
            + "ດ-ທນ-ຟມ-ຣລວສຫອ-ະາຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍ"
            + "ነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡷᢀ-ᢄᢇ-ᢨᢪ"
            + "ᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᳩ-ᳬᳮ-ᳱᳵᳶᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώ"
            + "ᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯ"
            + "ⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々〆〱-〵〻〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄭㄱ-ㆎㆠ-ㆺㇰ-ㇿ㐀-䶵一-鿕ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪꘫꙀ-ꙮ"
            + "ꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞮꞰ-ꞷꟷ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵꪶ"
            + "ꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭥꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ﬀ-ﬆﬓ-ﬗ\\uFB1D\\uFB1F-\\uFB28"
            + "\\uFB2A-\\uFB36\\uFB38-\\uFB3C\\uFB3E\\uFB40\\uFB41\\uFB43\\uFB44\\uFB46-\\uFBB1"
            + "\\uFBD3-\\uFD3D\\uFD50-\\uFD8F\\uFD92-\\uFDC7\\uFDF0-\\uFDFB\\uFE70-\\uFE74"
            + "\\uFE76-\\uFEFCＡ-Ｚａ-ｚｦ-ﾾￂ-ￇￊ-ￏￒ-ￗￚ-ￜ]";
    private static final String DIGIT_FALLBACK =
        "[0-9\\u0660-\\u0669۰-۹\\u07C0-\\u07C9०-९০-৯੦-੯૦-૯୦-୯௦-௯౦-౯೦-೯൦-൯෦-෯๐-๙໐-໙༠-༩၀-၉႐-႙០-៩᠐-᠙"
            + "᥆-᥏᧐-᧙᪀-᪉᪐-᪙᭐-᭙᮰-᮹᱀-᱉᱐-᱙꘠-꘩꣐-꣙꤀-꤉꧐-꧙꧰-꧹꩐-꩙꯰-꯹０-９]";
    private static final String LOWER_CASE_FALLBACK =
        "[a-zµß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĳĵķĸĺļľŀłńņňŉŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌƍƒƕƙ-ƛƞơƣ"
            + "ƥƨƪƫƭưƴƶƹƺƽ-ƿǆǉǌǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰǳǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ-ʓʕ-ʯͱ"
            + "ͳͷͻ-ͽΐά-ώϐϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊ"
            + "ӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯա-ևᏸ-ᏽᲀ-ᲈᴀ-ᴫᵫ-ᵷᵹ-ᶚḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝ"
            + "ḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉ"
            + "ịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰ-ώᾀ-ᾇᾐ-ᾗᾠ-ᾧᾰ-ᾴᾶᾷιῂ-ῄῆῇῐ-ΐῖῗῠ-ῧῲ-ῴῶῷℊ"
            + "ℎℏℓℯℴℹℼℽⅆ-ⅉⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⳬⳮⳳ"
            + "ⴀ-ⴥⴧⴭꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙡꙣꙥꙧꙩꙫꙭꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꚙꚛꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯ"
            + "ꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞌꞎꞑꞓ-ꞕꞗꞙꞛꞝꞟꞡꞣꞥꞧꞩꞵꞷꟺꬰ-ꭚꭠ-ꭥꭰ-ꮿﬀ-ﬆﬓ-ﬗａ-ｚ]";
    private static final String UPPER_CASE_FALLBACK =
        "[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĲĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƉ-ƋƎ-ƑƓƔƖ-Ƙ"
            + "ƜƝƟƠƢƤƦƧƩƬƮƯƱ-ƳƵƷƸƼǄǇǊǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮǱǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃ-ɆɈɊɌ"
            + "ɎͰͲͶͿΆΈ-ΊΌΎΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼ"
            + "ҾӀӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-ՖႠ-ჅჇჍᎠ-ᏵḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞ"
            + "ḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎ"
            + "ỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ΆῈ-ΉῘ-ΊῨ-ῬῸ-Ώℂℇℋ-ℍℐ-ℒℕℙ-ℝℤΩℨK-ℭℰ-ℳℾℿ"
            + "ⅅↃⰀ-ⰮⱠⱢ-ⱤⱧⱩⱫⱭ-ⱰⱲⱵⱾ-ⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖ"
            + "ꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦ"
            + "ꞨꞪ-ꞮꞰ-ꞴꞶＡ-Ｚ]";
    private static final String TITLE_CASE_FALLBACK = "[ǅǈǋǲᾈ-ᾏᾘ-ᾟᾨ-ᾯᾼῌῼ]";

    // The expressions below were generated by looping over all valid Unicode code points using the
    // desktop version of Java.
    private static final NativeRegExp WHITESPACE =
        new NativeRegExp(
            "[\\u0009-\\u000D\\u001C-\\u0020\\u1680\\u180E\\u2000-\\u2006\\u2008-\\u200A\\u2028"
                + "\\u2029\\u205F\\u3000]");
    private static final NativeRegExp SPACE_CHAR =
        new NativeRegExp(
            "[\\u0020\\u00A0\\u1680\\u180E\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]");

    // Use Unicode category matcher to identify letters, digits, etc.
    // If not available, use an explicit expression that's valid for the Basic Multilingual Plane.
    private static final NativeRegExp DIGIT =
        createRegExpWithFallback("0", "\\p{Nd}", DIGIT_FALLBACK);
    private static final NativeRegExp LETTER =
        createRegExpWithFallback("a", "\\p{L}", LETTER_FALLBACK);
    private static final NativeRegExp LOWER_CASE =
        createRegExpWithFallback("a", "\\p{Ll}", LOWER_CASE_FALLBACK);
    private static final NativeRegExp UPPER_CASE =
        createRegExpWithFallback("A", "\\p{Lu}", UPPER_CASE_FALLBACK);
    private static final NativeRegExp TITLE_CASE =
        createRegExpWithFallback("\u01C5", "\\p{Lt}", TITLE_CASE_FALLBACK);

    private static NativeRegExp createRegExpWithFallback(
        String testMatch, String pattern, String fallbackPattern) {
      // MS Edge v40 is buggy: it will accept regexes that contain Unicode property matchers as
      // valid, but the regex won't match anything. For this reason, we must verify the Unicode
      // version of the regex on a test input that is known to match.
      try {
        NativeRegExp result = new NativeRegExp(pattern, "u");
        if (result.test(testMatch)) {
          return result;
        }
      } catch (JsException expectedUnicodePatternError) {}
      return new NativeRegExp(fallbackPattern);
    }
  }
  public static boolean isISOControl(char ch) {
    // char cannot be negative.
    return ch <= 0x1F || (ch >= 0x7F && ch <= 0x9F);
  }

  public static boolean isISOControl(int codePoint) {
    return (codePoint >= 0 && codePoint <= 0x1F) || (codePoint >= 0x7F && codePoint <= 0x9F);
  }

  public static boolean isLetter(char ch) {
    return RegExps.LETTER.test(String.valueOf(ch));
  }

  public static boolean isLetter(int codePoint) {
    return RegExps.LETTER.test(String.fromCodePoint(codePoint));
  }

  public static boolean isLetterOrDigit(char ch) {
    String s = String.valueOf(ch);
    return RegExps.LETTER.test(s) || RegExps.DIGIT.test(s);
  }

  public static boolean isLetterOrDigit(int codePoint) {
    String s = String.fromCodePoint(codePoint);
    return RegExps.LETTER.test(s) || RegExps.DIGIT.test(s);
  }

  public static boolean isLowerCase(char ch) {
    return RegExps.LOWER_CASE.test(String.valueOf(ch));
  }
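
The comments in the excerpt mention generating the explicit expressions by looping over code points on desktop Java. A generator along those lines might look roughly like the sketch below, which mirrors the range-collapsing rule of the Python script (a dash only for runs of three or more). The class `CharClassGenerator` and its methods are hypothetical, it escapes every character rather than only right-to-left ones, and its output for real predicates depends on the Unicode version of the JDK it runs on.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch: builds a BMP-only regex character class from a predicate. */
final class CharClassGenerator {
  /** Returns a bracketed class listing every BMP code unit accepted by the predicate. */
  static String charClass(IntPredicate matches) {
    StringBuilder sb = new StringBuilder("[");
    int start = -1; // first code unit of the current run, or -1 when outside a run
    // Iterate one step past 0xFFFF so that a run reaching the end still gets flushed.
    for (int c = 0; c <= 0x10000; c++) {
      boolean in = c <= 0xFFFF && matches.test(c);
      if (in && start < 0) {
        start = c;
      } else if (!in && start >= 0) {
        appendRange(sb, start, c - 1);
        start = -1;
      }
    }
    return sb.append(']').toString();
  }

  private static void appendRange(StringBuilder sb, int first, int last) {
    sb.append(escape(first));
    if (last > first + 1) {
      sb.append('-'); // runs of three or more become "first-last"
    }
    if (last > first) {
      sb.append(escape(last)); // runs of exactly two are listed as two characters
    }
  }

  // Emits a backslash, the letter u, and four upper-case hex digits.
  private static String escape(int c) {
    return String.format("\\u%04X", c);
  }
}
```

Calling `CharClassGenerator.charClass(Character::isWhitespace)` would produce a class comparable to the WHITESPACE expression above, though the exact members can differ between JDK releases.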
