-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Pygments users,
There's been a bug in Pygments for a while now in the way the XQuery
lexer handles non-BMP[1] characters. I don't think many people use
this lexer, and there were no testcases for this particular issue, so
I don't think anybody noticed.
Before March, when I "fixed" it, the behavior was that non-BMP
characters would never match.
After I "fixed" it[2], non-BMP characters would match on wide Python
builds. But on narrow builds, such as the ones that apparently ship
on OS X, the regex would fail to compile.
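For anyone who wants to see the difference, here is a small sketch
(not part of the patch). The helper names and the example range are
mine, picked for illustration only. It checks which kind of build you
have and shows the 16-bit code units a narrow build would store for a
character class written as a non-BMP range.

```python
import struct
import sys


def is_wide_build():
    # Wide builds hold any code point in one unit; narrow builds stop at
    # 0xFFFF. Python 3.3 and later always report 0x10FFFF here.
    return sys.maxunicode > 0xFFFF


def utf16_units(text):
    # The 16-bit code units a narrow build would store for this text.
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


# Example range, chosen for illustration only.
units = utf16_units(u"\U00010000-\U000EFFFF")
print(is_wide_build())
print([hex(u) for u in units])
```

The units come out as d800 dc00, '-', db7f dfff, so inside a
character class a narrow build sees a range running from dc00 down to
db7f. As far as I understand it, that backwards range is what makes
the regex uncompilable there.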
I'd like to fix it properly before the next release goes out.
Attached is a patch I'd like to propose, which adds a new function
'unirange' which will construct the appropriate regex to match a
non-BMP range against the internal representation of a string.
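To give a rough idea of the approach without opening the patch, here
is a simplified sketch of the narrow-build case. It is not the code
in the patch: the names surrogate_pair and narrow_range_pattern are
made up for this mail, and it emits \uXXXX escapes as text (which the
re module on Python 3.3+ understands) so that it can be tried on any
current interpreter. On a wide build a plain [a-b] class is enough.

```python
import re


def surrogate_pair(cp):
    # Split a non-BMP code point into its UTF-16 (high, low) surrogates.
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def _cls(lo, hi):
    # Character class over 16-bit units, written with \uXXXX escapes.
    return "[\\u%04x-\\u%04x]" % (lo, hi)


def narrow_range_pattern(a, b):
    # Hypothetical helper: regex text matching code points a..b when each
    # one is stored as a surrogate pair.
    if not 0x10000 <= a <= b <= 0x10FFFF:
        raise ValueError("expected a non-BMP range with a <= b")
    ah, al = surrogate_pair(a)
    bh, bl = surrogate_pair(b)
    if ah == bh:
        # Same high surrogate: one class over the low surrogates.
        return "(?:%s%s)" % (_cls(ah, ah), _cls(al, bl))
    # First high surrogate: from the starting low surrogate upward.
    parts = [_cls(ah, ah) + _cls(al, 0xDFFF)]
    if bh - ah > 1:
        # High surrogates strictly in between: any low surrogate.
        parts.append(_cls(ah + 1, bh - 1) + _cls(0xDC00, 0xDFFF))
    # Last high surrogate: up to the ending low surrogate.
    parts.append(_cls(bh, bh) + _cls(0xDC00, bl))
    return "(?:%s)" % "|".join(parts)


pattern = re.compile(narrow_range_pattern(0x10000, 0xEFFFF))
high, low = surrogate_pair(0x1F600)
print(bool(pattern.match(chr(high) + chr(low))))
```

The range is split into up to three alternatives because only the
first and last high surrogates restrict which low surrogates may
follow them.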
I've tested against
- - narrow python 2.6.7/2.7.1 (OS X Lion)
- - wide python 2.7.3 (Arch Linux)
- - wide python 3.2.3 (Arch Linux)
- - jython 2.5 [3]
Please test on other platforms/versions, give it a look, and comment,
especially with regard to 2to3 compatibility. I hope this is an
acceptable level of ugliness in order to have portable support for all
of Unicode. And sorry for the breakage. I'm updating regexlint[4] to
detect cases like this explicitly.
Tim
[1] in a nutshell, BMP = \u0000-\uffff, non-BMP is > \uffff
see http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane
[2] http://code.timhatch.com/hg/pygments-tim/rev/7464c9b8c04e
[3] there's an unrelated change in vimbuiltins that is necessary to
work around a code size limitation in jython. One test still fails
there, due to what appears to be a limit on alternation repetitions
in the way we typically match strings; I intend to report a bug for
that.
[4] http://timhatch.com/projects/regexlint/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAlAQp88ACgkQgwVGtvGz4EcSuwCfRwg2HcEiW1xxTRGcIbtfxWyU
w94AoIYg7IyJwUOI7a3DKsfQfdO6uMNl
=l50l
-----END PGP SIGNATURE-----