Unicode has "properties", which are essentially attributes of the
characters. For example, "A" has the "general category" property with
the value "Lu", which means "Letter, uppercase". It also belongs to the
broader class "L" ("Letter"), which can be treated as a binary property
with the value "true".
One way to look at this is that the Unicode properties are an extension
of the character classes of the 8-bit world, or of the isXXX() functions
of C.
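
To make the isXXX() analogy concrete, here is a small sketch. It is in
Python rather than Perl only because Python's standard library exposes
the general category directly; the two helper names are made up for
this illustration.

```python
import unicodedata

def general_category(ch):
    """Return the two-letter general category (gc) value of one character."""
    return unicodedata.category(ch)

def is_letter(ch):
    # The one-letter class "L" groups every gc value that starts with "L"
    # (Lu, Ll, Lt, Lm, Lo), much like isalpha() groups letters in C.
    return general_category(ch).startswith("L")

print(general_category("A"))  # Lu
print(general_category("a"))  # Ll
print(is_letter("A"), is_letter("1"))
```

The answers come from whatever Unicode version the Python build ships,
which need not match the tables under lib/unicore.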
In regular expressions, properties are written with \p (negated with \P):
for example, \p{Ll} matches lowercase letters.
Many of these properties have aliases: it is quicker to type "Sc" than
"Currency_Symbol".
The task I have in mind is this: we are missing quite a few of these
aliases because, back in the day, the property alias definitions weren't
available in a machine-readable format. Since we wanted to have some of
them anyway, they were hardwired into the lib/unicore/mktables script.
The script is run as explained in lib/unicore/README.perl to generate
the various lib/unicore/{*.pl,*/*.pl} files, which are loaded on demand
at run time to implement the various \p bits.
Currently the script has the property aliases hardwired; look for
"assumedly". But now we do have machine-readable tables for the
aliases: PropertyAliases.txt and PropValueAliases.txt (the original
Unicode Consortium name of the latter was PropertyValueAliases.txt,
but we are 8.3-nice).
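
As a rough idea of what reading such a table involves, here is a
minimal sketch in Python. It assumes the semicolon-separated layout
"property ; short name ; long name" with "#" comments. The sample lines
are illustrative rather than copied from any particular release, the
function name is hypothetical, and the real files contain variants
(extra alias fields, numeric fields) that this ignores.

```python
# Illustrative sample in the assumed PropValueAliases.txt layout.
SAMPLE = """\
# property ; short name ; long name
gc ; Lu   ; Uppercase_Letter
gc ; Ll   ; Lowercase_Letter
gc ; LC   ; Cased_Letter      # Ll | Lt | Lu
gc ; Sc   ; Currency_Symbol
sc ; Latn ; Latin
"""

def parse_value_aliases(text):
    """Map property -> {short name: [long names...]} (hypothetical helper)."""
    aliases = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3:
            continue  # not a "property ; short ; long" line, skip it
        prop, short, long_names = fields[0], fields[1], fields[2:]
        aliases.setdefault(prop, {})[short] = long_names
    return aliases

table = parse_value_aliases(SAMPLE)
print(table["gc"]["Sc"])  # ['Currency_Symbol']
```

A table like this is what would replace the hardwired alias lists.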
The exact rules for using these alias names and values are in the
latest edition of Unicode TR#18:
http://www.unicode.org/reports/tr18/#Categories
Your task, should you choose to accept it, is to tweak the mktables
script so that the *.pl files are built correctly from the aliases *.txt
files. Note that there will be more *.pl files than there are now. For
example, we don't do \p{LC} now, but I assume that once the script is
fixed there will be a lib/unicore/lib/LC.pl.
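
To show what a table for LC would hold: LC (Cased_Letter) is the union
of Lu, Ll and Lt, so its contents can be derived from the existing
general category data. The Python sketch below collapses the matching
code points into inclusive ranges. The function name is made up, and
the tab-separated hex output is only an assumed stand-in for whatever
format mktables really writes.

```python
import unicodedata

def ranges_for(categories, limit=0x110000):
    """Collapse code points whose gc is in `categories` into inclusive ranges."""
    ranges = []
    start = None
    for cp in range(limit):
        inside = unicodedata.category(chr(cp)) in categories
        if inside and start is None:
            start = cp                      # a new range opens here
        elif not inside and start is not None:
            ranges.append((start, cp - 1))  # the open range closed at cp - 1
            start = None
    if start is not None:
        ranges.append((start, limit - 1))   # range still open at the end
    return ranges

LC = {"Lu", "Ll", "Lt"}
for lo, hi in ranges_for(LC, limit=0x80):   # ASCII only, to keep it short
    print(f"{lo:04X}\t{hi:04X}")
```

For the ASCII slice this prints the two ranges 0041..005A and
0061..007A; dropping the limit argument walks the whole code space.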
There is a second part, but I think that would be rather more involved,
since it would mean implementing the \p{A=B} (or \p{A:B}) syntax
proposed in UTR#18. In other words, the regex runtime would need
changes. I am not stopping anyone from entering the Unicode dragon's
lair; I'm just saying that asbestos longjohns, while itchy, may be a
good idea.
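
Just to show the shape of the parsing side of that second part (the
easy bit, not the regex runtime changes), here is a Python sketch that
splits the inside of \p{...} into a property and a value. It assumes
that names are compared loosely, ignoring case, whitespace, underscores
and hyphens; check the TR for the exact rules. Both function names are
hypothetical.

```python
import re

def loose(name):
    # Assumed loose matching: ignore case, whitespace, underscores, hyphens.
    return re.sub(r"[\s_\-]", "", name).lower()

def split_property(spec):
    # Split the text inside \p{...} into (property, value).
    # A bare form such as "Lu" has no property part, so None is returned
    # for it and the caller decides which property is implied.
    parts = re.split(r"[=:]", spec, maxsplit=1)
    if len(parts) == 2:
        return loose(parts[0]), loose(parts[1])
    return None, loose(spec)

print(split_property("gc=Lu"))
print(split_property("General_Category : Uppercase_Letter"))
print(split_property("Lu"))
```

The normalized names would then be looked up in the alias tables to
find the table to load.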
Therefore I think only the general category (gc) and script (sc)
property aliases could be done trivially, simply by fixing mktables
to generate the right *.pl files based on the aliases *.txt files.
I do not expect *that* much change in the *.pl files, since most of the
aliases are already covered by our hardwiring, but it would still be
good to have this thing automated.
--
Jarkko Hietaniemi <j...@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen