Pygments support for wide unicode characters
Tim Hatch  
From: Tim Hatch <t...@timhatch.com>
Date: Wed, 25 Jul 2012 19:13:36 -0700
Local: Wed, Jul 25 2012 10:13 pm
Subject: Pygments support for wide unicode characters

Hi Pygments users,

There's been a bug in Pygments for a while now, with the way it
handled non-BMP[1] characters in the XQuery lexer.  I don't think many
people used this lexer, and there were no testcases for this
particular issue, so I don't think anybody noticed.

The behavior, before March when I "fixed" it, was that non-BMP
characters would never match.

The behavior, after I "fixed" it[2], was that non-BMP characters would
match on wide Python builds.  But on narrow builds, such as the ones
that apparently ship with OS X, the regex would fail to compile.
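
To illustrate the narrow-build failure (a sketch, not taken from the patch): a narrow build stores each non-BMP character as a UTF-16 surrogate pair, so a character class written as a non-BMP range ends up containing a range that runs from a low surrogate down to a high surrogate, which `re` rejects. The helper names below are hypothetical, the range U+10000..U+EFFFF is only an assumed example, and the narrow storage is emulated by hand so the snippet runs on any Python 3.

```python
import re


def utf16_units(code_point):
    """Return the code units a narrow build would store for one code point."""
    if code_point < 0x10000:
        return chr(code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return chr(high) + chr(low)


def narrow_view_of_class(start, end):
    """Emulate how a narrow build sees the character class [start-end]."""
    return "[" + utf16_units(start) + "-" + utf16_units(end) + "]"


def compiles(pattern):
    """Report whether re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# On a wide build the class is a plain two-endpoint range.
wide_ok = compiles("[" + chr(0x10000) + "-" + chr(0xEFFFF) + "]")

# The emulated narrow view contains the range \udc00-\udb7f, which is reversed.
narrow_ok = compiles(narrow_view_of_class(0x10000, 0xEFFFF))
```

Here `wide_ok` is true and `narrow_ok` is false. Because a low surrogate always sorts above a high surrogate, any non-BMP range written this way hits the same reversed-range error in the emulated narrow view.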

I'd like to fix it properly before the next release goes out.
Attached is a patch I'd like to propose, which adds a new function
'unirange' which will construct the appropriate regex to match a
non-BMP range against the internal representation of a string.
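
As a rough sketch of the idea (not the attached patch; `unirange_sketch`, `surrogate_pair`, and the `narrow` parameter are hypothetical names chosen for illustration): on a wide build a non-BMP range can stay an ordinary character class, while on a narrow build it has to be expanded into alternatives over surrogate pairs. The sketch emits `\u`/`\U` escapes that `re` understands from Python 3.3 on, and takes `narrow` explicitly so the surrogate branch can be inspected on a modern interpreter, where every build is wide.

```python
import re
import sys


def surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def unirange_sketch(start, end, narrow=None):
    """Return regex source matching the non-BMP code points start..end."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected a non-BMP range with start <= end")
    if narrow is None:
        # Assumption: a narrow build is recognised by sys.maxunicode == 0xFFFF.
        narrow = sys.maxunicode == 0xFFFF
    if not narrow:
        return "[\\U%08x-\\U%08x]" % (start, end)

    start_hi, start_lo = surrogate_pair(start)
    end_hi, end_lo = surrogate_pair(end)
    if start_hi == end_hi:
        # Both endpoints share a high surrogate: one low-surrogate range.
        return "(?:\\u%04x[\\u%04x-\\u%04x])" % (start_hi, start_lo, end_lo)

    # First high surrogate: from start_lo up to the last low surrogate.
    parts = ["\\u%04x[\\u%04x-\\udfff]" % (start_hi, start_lo)]
    if end_hi - start_hi > 1:
        # High surrogates strictly in between: any low surrogate.
        parts.append("[\\u%04x-\\u%04x][\\udc00-\\udfff]"
                     % (start_hi + 1, end_hi - 1))
    # Last high surrogate: from the first low surrogate up to end_lo.
    parts.append("\\u%04x[\\udc00-\\u%04x]" % (end_hi, end_lo))
    return "(?:" + "|".join(parts) + ")"


wide_pattern = unirange_sketch(0x10000, 0xEFFFF, narrow=False)
narrow_pattern = unirange_sketch(0x10000, 0xEFFFF, narrow=True)
```

The narrow pattern can be checked by matching it against a hand-built surrogate pair such as `chr(0xD83D) + chr(0xDE00)` (U+1F600), which is what a narrow build would hold internally.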

I've tested against
- narrow Python 2.6.7/2.7.1 (OS X Lion)
- wide Python 2.7.3 (Arch Linux)
- wide Python 3.2.3 (Arch Linux)
- Jython 2.5 [3]

Please test on other platforms/versions, give it a look, and comment,
especially with regard to 2to3 compatibility.  I hope this is an
acceptable level of ugliness in order to have portable support for all
of Unicode.  And sorry for the breakage.  I'm updating regexlint[4] to
detect cases like this explicitly.

Tim

[1] in a nutshell, BMP = \u0000-\uffff, non-BMP is > \uffff
see http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane
[2] http://code.timhatch.com/hg/pygments-tim/rev/7464c9b8c04e
[3] There's an unrelated change in vimbuiltins that is necessary to
work around a code size limitation in Jython.  One test also still
fails there, apparently because of a limit on alternation repetitions
in the way we typically match strings; I intend to report a bug for
that.
[4] http://timhatch.com/projects/regexlint/

Attachment: unirange.diff (86K)

 
Tim Hatch  
From: Tim Hatch <t...@timhatch.com>
Date: Mon, 27 Aug 2012 00:58:53 -0700
Local: Mon, Aug 27 2012 3:58 am
Subject: Re: Pygments support for wide unicode characters
On 7/25/12 7:13 PM, Tim Hatch wrote:

> Hi Pygments users,

> There's been a bug in Pygments for a while now, with the way it
> handled non-BMP[1] characters in the XQuery lexer.  I don't think
> many people used this lexer, and there were no testcases for this
> particular issue, so I don't think anybody noticed.
...
> I'd like to fix it properly before the next release goes out.
> Attached is a patch I'd like to propose, which adds a new function
> 'unirange' which will construct the appropriate regex to match a
> non-BMP range against the internal representation of a string.

This code, slightly modified, is now in
http://code.timhatch.com/hg/pygments-tim

The tests still pass for me; please verify on your end, and object
loudly if you see any problems.  I also moved the escaping for
unistring into the generation step, since that seemed more natural now
that it has to be there for the surrogate changes.

Tim


 