WikiWords with non-Ascii characters


fuxoft

Sep 6, 2005, 1:22:34 PM
to TiddlyWikiDev
Hello there,

I am looking at TiddlyWiki as a possible replacement for my current
homepage. However, I have one small problem:

I tried TiddlyWiki and found it has no problems with non-Ascii Tiddler
names, e.g. "FrantišekFuka" (the character after "i" is non-Ascii).

However, "FrantišekFuka" is not automatically recognized as a WikiWord
("FrantisekFuka" is) so I have to use double square brackets all the
time.

Can I somehow change the configuration so that WikiWords are
automatically defined as "String of non-whitespace Unicode characters
containing at least one uppercase letter in the middle of the string"
instead of the current definition ("String of non-whitespace ASCII
characters...")?

If this is not configurable, can you at least point me to the relevant
part of Javascript which handles this? It seems to me it shouldn't be
hard to change the relevant code myself.

Thanks.

Alan Watson

Sep 6, 2005, 1:30:13 PM
to Tiddly...@googlegroups.com
You should be able to figure something out by hacking on my
recently-posted "ignore WikiWord" plugin. You need to modify the plugin
to set the regular expression text to

wikiNamePattern = "(~?)(...)";

where ... is whatever you want to match. Search for "wikiNamePattern" in
the original JavaScript for a starting point.

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México

Jeremy Ruston

Sep 6, 2005, 1:32:54 PM
to Tiddly...@googlegroups.com
Search for these lines:

var upperLetter = "[A-Z\u00c0-\u00de\u0150\u0170]";
var lowerLetter = "[a-z\u00df-\u00ff_0-9\\-\u0151\u0171]";
var anyLetter =
"[A-Za-z\u00c0-\u00de\u00df-\u00ff_0-9\\-\u0150\u0170\u0151\u0171]";

and then a couple of lines further down there's:

var wikiNamePattern = "(~?)((?:" + upperLetter + "+" + lowerLetter +
"+" + upperLetter + anyLetter + "*)|(?:" + upperLetter + "{2,}" +
lowerLetter + "+))";

So, the rules at the moment are that a Wiki word has two possibilities:
- one or more upper case letters, followed by one or more lower case
letters, followed by one upper case letter, followed by any
combination of upper and lower case
- two or more upper case letters followed by one or more lower case letters

And the definition of what constitutes upper and lower case lies in
those definitions for upperLetter and lowerLetter. I'd be delighted to
extend them to be more complete - I just need to know what extra
characters are required.
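
To see those two rules in action, here is a minimal sketch that compiles the quoted pattern and tries it on the two names from the original question. The isWikiWord helper is hypothetical (it is not part of TiddlyWiki) and anchors the pattern so that the whole string has to be one Wiki word.

```javascript
// Character classes and pattern as quoted above from the TiddlyWiki source.
var upperLetter = "[A-Z\u00c0-\u00de\u0150\u0170]";
var lowerLetter = "[a-z\u00df-\u00ff_0-9\\-\u0151\u0171]";
var anyLetter = "[A-Za-z\u00c0-\u00de\u00df-\u00ff_0-9\\-\u0150\u0170\u0151\u0171]";
var wikiNamePattern = "(~?)((?:" + upperLetter + "+" + lowerLetter +
    "+" + upperLetter + anyLetter + "*)|(?:" + upperLetter + "{2,}" +
    lowerLetter + "+))";

// Hypothetical helper, for illustration only: true if the whole
// string matches the Wiki word pattern.
function isWikiWord(text) {
    return new RegExp("^(?:" + wikiNamePattern + ")$").test(text);
}

console.log(isWikiWord("FrantisekFuka"));       // true
console.log(isWikiWord("Franti\u0161ekFuka"));  // false: U+0161 is in neither class
```

The second name fails because U+0161 falls outside both upperLetter and lowerLetter, so the run of lower case letters stops before the second capital is reached.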

From earlier discussions, though, I have gathered that there are some
tricky letters that are considered to be of a different case in
different languages. If that's true, it may never be possible to
arrive at a definitive, language independent definition.
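
For comparison, a hedged sketch of a language-independent variant. It assumes a JavaScript engine that supports Unicode property escapes (ES2018, "u" flag), which did not exist when this thread was written. The variable names are illustrative and the rule structure is copied from the pattern above.

```javascript
// Assumption: ES2018 Unicode property escapes are available.
// \p{Lu} = upper case letter, \p{Ll} = lower case letter, \p{L} = any letter.
var upper = "\\p{Lu}";
var lower = "[\\p{Ll}_0-9\\-]";
var any = "[\\p{L}_0-9\\-]";

// Same two alternatives as wikiNamePattern, anchored to the whole string.
var unicodeWikiName = new RegExp(
    "^(~?)((?:" + upper + "+" + lower + "+" + upper + any + "*)|(?:" +
    upper + "{2,}" + lower + "+))$", "u");

console.log(unicodeWikiName.test("Franti\u0161ekFuka")); // true
console.log(unicodeWikiName.test("Franti\u0161ekfuka")); // false: no second capital
```

This relies on the general category from the Unicode character database, so it does not depend on the locale of the browser.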

Cheers

Jeremy


--
Jeremy Ruston
mailto:jer...@osmosoft.com
http://www.tiddlywiki.com

Frantisek Fuka

Sep 6, 2005, 1:58:00 PM
to TiddlyWikiDev
Of course, I don't know all the existing languages. I am Czech, and in
the Czech language (and other Central European languages, e.g. Slovak,
Polish, Hungarian, I am sure), all the letters look like standard Ascii
letters, except that some of them have some special "diacritics"
above/below them. But all of them exist in both upper and lower case.

I looked at the Unicode table here:
http://free.prohosting.com/~vitivas/js/UniCode/CharTab.html

It's rather messy. The uppercase characters C0-DD correspond to
lowercase characters E0-FD (e.g. C5 is uppercase version of E5
lowercase character). Then, from 100 to 233, even characters are
uppercase equivalents of odd lowercase character (e.g. 10E is
uppercase, 10F is the same character lowercase). Using these rules
should take care of all languages I know.

Note that I am not saying everything between C0 and 233 are existing
European letters. If you implement the rule in the paragraph above, it
would mean that some special non-letter characters (e.g. 1c0 to 1c3)
would be incorrectly recognized as uppercase/lowercase letters. But I
think it's a small price to pay if there is not any existing
"isUppercase?()" function for unicode characters.
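
JavaScript has no isUppercase() as such, but a test can be sketched from toUpperCase() and toLowerCase(), which use the engine's own case tables. This also avoids relying on the even/odd rule, which has exceptions in the code charts: U+017D is the upper case form of U+017E, for example. The helper names below are hypothetical, and the result depends on how complete the engine's case tables are.

```javascript
// Hypothetical helpers: a character counts as upper case if it is its
// own upper case form and has a different lower case form, and vice versa.
function isUpperCaseChar(ch) {
    return ch === ch.toUpperCase() && ch !== ch.toLowerCase();
}
function isLowerCaseChar(ch) {
    return ch === ch.toLowerCase() && ch !== ch.toUpperCase();
}

// Collect the cased letters between two code points (inclusive).
// Characters without a case counterpart (e.g. U+00D7, U+01C0) are skipped.
function buildLetterClasses(first, last) {
    var upper = "", lower = "";
    for (var code = first; code <= last; code++) {
        var ch = String.fromCharCode(code);
        if (isUpperCaseChar(ch)) upper += ch;
        else if (isLowerCaseChar(ch)) lower += ch;
    }
    return { upper: upper, lower: lower };
}

// The range discussed above: C0 to 233.
var extra = buildLetterClasses(0x00C0, 0x0233);
var upperLetter = "[A-Z" + extra.upper + "]";
var lowerLetter = "[a-z_0-9\\-" + extra.lower + "]";
```

The two strings could then replace the upperLetter and lowerLetter definitions used to build wikiNamePattern.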

Frantisek Fuka

Sep 6, 2005, 2:20:14 PM
to TiddlyWikiDev
Maybe the most elegant solution would be to be able to define two
special system Tiddlers ExtraUpperCaseLetters and ExtraLowerCaseLetters
which could be set by the user.

Or one Tiddler called "ExtraLocalCharacters" which would list all the
characters in 2 lines like this:

ešcržýáíé
EŠCRŽÝÁÍÉ
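
A rough sketch of how such a tiddler could be turned into character classes. The tiddler name, the two-line layout (lower case first, upper case second) and both helper functions are assumptions taken from the proposal above, not existing TiddlyWiki features.

```javascript
// Hypothetical: split the tiddler text into its two lines,
// lower case characters first, upper case characters second.
function parseExtraLocalCharacters(text) {
    var lines = text.split("\n");
    return {
        lower: (lines[0] || "").replace(/\s/g, ""),
        upper: (lines[1] || "").replace(/\s/g, "")
    };
}

// Escape the characters that are special inside a [...] class.
function escapeForClass(chars) {
    return chars.replace(/[\\\]\^\-]/g, "\\$&");
}

// Example text as given above.
var extra = parseExtraLocalCharacters("ešcržýáíé\nEŠCRŽÝÁÍÉ");
var upperLetter = "[A-Z" + escapeForClass(extra.upper) + "]";
var lowerLetter = "[a-z_0-9\\-" + escapeForClass(extra.lower) + "]";
```

Keeping the list in a tiddler would let each user add only the letters their language needs, at the cost of maintaining the two lines by hand.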

JÁROLI József

Sep 6, 2005, 6:05:20 PM
to TiddlyWikiDev
Ahoj František! And what about Uu , Cc ... ;)

You can find my related thread / post here:
http://groups.google.co.hu/group/TiddlyWiki/tree/browse_frm/thread/4ab10453ff0ef077/b5faeb8102dac85a?rnum=21&hl=hu&q=hungarian&_done=%2Fgroup%2FTiddlyWiki%2Fbrowse_frm%2Fthread%2F4ab10453ff0ef077%3Fpage%3Dend%26q%3Dhungarian%26hl%3Dhu%26&page=end#doc_b5faeb8102dac85a

Jeremy!

As far as I can see, you can either include all special characters from
the Latin-1 Supplement code chart at once
(http://www.unicode.org/charts/PDF/U0080.pdf), add them step by step
as your users request them, or provide a modular way of specifying
special upper/lowercase pairs with config tiddlers :).

Feel free to drop me a mail if you need further help.
Cheers:
József

Frantisek Fuka

Sep 6, 2005, 6:35:32 PM
to TiddlyWikiDev
My little selfish caveat: Latin-1 does NOT contain all Czech
characters! You need Latin-2 (or, better yet, Unicode) for that.

Paul Petterson

Sep 6, 2005, 6:44:42 PM
to Tiddly...@googlegroups.com
You can check out http://www.unicode.org and in particular http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt which declares what all the characters are... Here's the regex for lower case letters - for Latin, Cyrillic, Greek, Coptic, etc, etc, etc...
 
[\u0061..\u007A\u00AA\u00B5\u00BA\u00DF..\u00F6\u00F8..\u00FF\u0101\u0103\u0105\u0107\u0109\u010B\u010D\u010F\u0111\u0113\u0115\u0117\u0119\u011B\u011D\u011F\u0121\u0123\u0125\u0127\u0129\u012B\u012D\u012F\u0131\u0133\u0135\u0137..\u0138\u013A\u013C\u013E\u0140\u0142\u0144\u0146\u0148..\u0149\u014B\u014D\u014F\u0151\u0153\u0155\u0157\u0159\u015B\u015D\u015F\u0161\u0163\u0165\u0167\u0169\u016B\u016D\u016F\u0171\u0173\u0175\u0177\u017A\u017C\u017E..\u0180\u0183\u0185\u0188\u018C..\u018D\u0192\u0195\u0199..\u019B\u019E\u01A1\u01A3\u01A5\u01A8\u01AA..\u01AB\u01AD\u01B0\u01B4\u01B6\u01B9..\u01BA\u01BD..\u01BF\u01C6\u01C9\u01CC\u01CE\u01D0\u01D2\u01D4\u01D6\u01D8\u01DA\u01DC..\u01DD\u01DF\u01E1\u01E3\u01E5\u01E7\u01E9\u01EB\u01ED\u01EF..\u01F0\u01F3\u01F5\u01F9\u01FB\u01FD\u01FF\u0201\u0203\u0205\u0207\u0209\u020B\u020D\u020F\u0211\u0213\u0215\u0217\u0219\u021B\u021D\u021F\u0221\u0223\u0225\u0227\u0229\u022B\u022D\u022F\u0231\u0233..\u0239\u023C\u023F..\u0240\u0250..\u02AF\u02B0..\u02B8\u02C0..\u02C1\u02E0..\u02E4\u0345\u037A\u0390\u03AC..\u03CE\u03D0..\u03D1\u03D5..\u03D7\u03D9\u03DB\u03DD\u03DF\u03E1\u03E3\u03E5\u03E7\u03E9\u03EB\u03ED\u03EF..\u03F3\u03F5\u03F8\u03FB..\u03FC\u0430..\u045F\u0461\u0463\u0465\u0467\u0469\u046B\u046D\u046F\u0471\u0473\u0475\u0477\u0479\u047B\u047D\u047F\u0481\u048B\u048D\u048F\u0491\u0493\u0495\u0497\u0499\u049B\u049D\u049F\u04A1\u04A3\u04A5\u04A7\u04A9\u04AB\u04AD\u04AF\u04B1\u04B3\u04B5\u04B7\u04B9\u04BB\u04BD\u04BF\u04C2\u04C4\u04C6\u04C8\u04CA\u04CC\u04CE\u04D1\u04D3\u04D5\u04D7\u04D9\u04DB\u04DD\u04DF\u04E1\u04E3\u04E5\u04E7\u04E9\u04EB\u04ED\u04EF\u04F1\u04F3\u04F5\u04F7\u04F9\u0501\u0503\u0505\u0507\u0509\u050B\u050D\u050F\u0561..\u0587\u1D00..\u1D2B\u1D2C..\u1D61\u1D62..\u1D77\u1D78\u1D79..\u1D9A\u1D9B..\u1DBF\u1E01\u1E03\u1E05\u1E07\u1E09\u1E0B\u1E0D\u1E0F\u1E11\u1E13\u1E15\u1E17\u1E19\u1E1B\u1E1D\u1E1F\u1E21\u1E23\u1E25\u1E27\u1E29\u1E2B\u1E2D\u1E2F\u1E31\u1E33\u1E35\u1E37\u1E39\u1E3B\u1E3D\u1E3F\u1E41\u1E43\u1E45\u1E47\u1E49\u1E4
B\u1E4D\u1E4F\u1E51\u1E53\u1E55\u1E57\u1E59\u1E5B\u1E5D\u1E5F\u1E61\u1E63\u1E65\u1E67\u1E69\u1E6B\u1E6D\u1E6F\u1E71\u1E73\u1E75\u1E77\u1E79\u1E7B\u1E7D\u1E7F\u1E81\u1E83\u1E85\u1E87\u1E89\u1E8B\u1E8D\u1E8F\u1E91\u1E93\u1E95..\u1E9B\u1EA1\u1EA3\u1EA5\u1EA7\u1EA9\u1EAB\u1EAD\u1EAF\u1EB1\u1EB3\u1EB5\u1EB7\u1EB9\u1EBB\u1EBD\u1EBF\u1EC1\u1EC3\u1EC5\u1EC7\u1EC9\u1ECB\u1ECD\u1ECF\u1ED1\u1ED3\u1ED5\u1ED7\u1ED9\u1EDB\u1EDD\u1EDF\u1EE1\u1EE3\u1EE5\u1EE7\u1EE9\u1EEB\u1EED\u1EEF\u1EF1\u1EF3\u1EF5\u1EF7\u1EF9\u1F00..\u1F07\u1F10..\u1F15\u1F20..\u1F27\u1F30..\u1F37\u1F40..\u1F45\u1F50..\u1F57\u1F60..\u1F67\u1F70..\u1F7D\u1F80..\u1F87\u1F90..\u1F97\u1FA0..\u1FA7\u1FB0..\u1FB4\u1FB6..\u1FB7\u1FBE\u1FC2..\u1FC4\u1FC6..\u1FC7\u1FD0..\u1FD3\u1FD6..\u1FD7\u1FE0..\u1FE7\u1FF2..\u1FF4\u1FF6..\u1FF7\u2071\u207F\u2090..\u2094\u210A\u210E..\u210F\u2113\u212F\u2134\u2139\u213C..\u213D\u2146..\u2149\u2170..\u217F\u24D0..\u24E9\u2C30..\u2C5E\u2C81\u2C83\u2C85\u2C87\u2C89\u2C8B\u2C8D\u2C8F\u2C91\u2C93\u2C95\u2C97\u2C99\u2C9B\u2C9D\u2C9F\u2CA1\u2CA3\u2CA5\u2CA7\u2CA9\u2CAB\u2CAD\u2CAF\u2CB1\u2CB3\u2CB5\u2CB7\u2CB9\u2CBB\u2CBD\u2CBF\u2CC1\u2CC3\u2CC5\u2CC7\u2CC9\u2CCB\u2CCD\u2CCF\u2CD1\u2CD3\u2CD5\u2CD7\u2CD9\u2CDB\u2CDD\u2CDF\u2CE1\u2CE3..\u2CE4\u2D00..\u2D25\uFB00..\uFB06\uFB13..\uFB17\uFF41..\uFF5A\u10428..\u1044F\u1D41A..\u1D433\u1D44E..\u1D454\u1D456..\u1D467\u1D482..\u1D49B\u1D4B6..\u1D4B9\u1D4BB\u1D4BD..\u1D4C3\u1D4C5..\u1D4CF\u1D4EA..\u1D503\u1D51E..\u1D537\u1D552..\u1D56B\u1D586..\u1D59F\u1D5BA..\u1D5D3\u1D5EE..\u1D607\u1D622..\u1D63B\u1D656..\u1D66F\u1D68A..\u1D6A5\u1D6C2..\u1D6DA\u1D6DC..\u1D6E1\u1D6FC..\u1D714\u1D716..\u1D71B\u1D736..\u1D74E\u1D750..\u1D755\u1D770..\u1D788\u1D78A..\u1D78F\u1D7AA..\u1D7C2\u1D7C4..\u1D7C9]

I haven't tested this - I just cut/pasted and did a quick regex substitution on the doc above... your mileage may vary.
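
A note on using the list as pasted: ".." is the range notation of the Unicode data file, while a JavaScript character class needs "-", and a \u escape takes exactly four hex digits, so five-digit values such as \u10428 would be read as \u1042 followed by "8". Below is a minimal sketch of a converter, assuming the list is available as a string; toCharacterClass and hex4 are hypothetical names, and code points above FFFF are simply dropped.

```javascript
// Format a code point as a four-digit \uXXXX escape.
function hex4(code) {
    return "\\u" + ("0000" + code.toString(16).toUpperCase()).slice(-4);
}

// Hypothetical helper: convert "\u0061..\u007A\u00AA" style text into
// a character class that new RegExp() accepts.
function toCharacterClass(spec) {
    var item = /\\u([0-9A-Fa-f]+)(?:\.\.\\u([0-9A-Fa-f]+))?/g;
    var cleaned = spec.replace(/\s+/g, ""); // undo line wraps from e-mail
    var body = "";
    var m;
    while ((m = item.exec(cleaned)) !== null) {
        var first = parseInt(m[1], 16);
        var last = m[2] ? parseInt(m[2], 16) : first;
        if (first > 0xFFFF || last > 0xFFFF) continue; // beyond \uXXXX
        body += (first === last) ? hex4(first) : hex4(first) + "-" + hex4(last);
    }
    return "[" + body + "]";
}

var lowerClass = toCharacterClass("[\\u0061..\\u007A\\u00AA\\u10428..\\u1044F]");
console.log(lowerClass); // [\u0061-\u007A\u00AA]
```

The returned string can be passed to new RegExp() or concatenated into a larger pattern in the same way as lowerLetter.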
 
Paul

JÁROLI József

Sep 7, 2005, 3:40:16 PM
to TiddlyWikiDev
I am sorry, I was way too tired. I meant to write Latin Extended-A:
http://www.unicode.org/charts/PDF/U0100.pdf
