Matching certain unicode characters with REGEXP

Kai Schaetzl

unread,

Feb 12, 2011, 2:41:47 PM2/12/11

to

I'm trying to match certain Unicode code points in a select query but all
my attempts and lots of searching on the net failed. (Actually, I want to
replace something and I want to check with this select if the replace
operation worked correctly.)

I tried for instance

WHERE col REGEXP '\u004c'

which should find any occurences of 'L', but it doesn't.

'\x4c' fails as well.

What's the correct syntax for MySQL?

The field is of type text collation utf8_roman_ci (or similar).

(Of course, I do not want to find 'L' this way. It's just a simplified
query to get a working syntax.)

Kai

Peter H. Coffin

unread,

Feb 14, 2011, 5:31:54 PM2/14/11

to

Where on

http://dev.mysql.com/doc/refman/5.1/en/regexp.html

or

http://dev.mysql.com/doc/refman/5.1/en/pattern-matching.html

or

http://dev.mysql.com/doc/refman/5.1/en/string-syntax.html

is \u or \x discussed? Outside of the case on the last one where it says
explicitly 'For example, "\x" is just "x".'

--
Revenge is an integral part of forgiving and forgetting.
-- The BOFH

Kai Schaetzl

unread,

Feb 15, 2011, 5:58:01 AM2/15/11

to

Peter H. Coffin schrieb am Mon, 14 Feb 2011 16:31:54 -0600:

> Where on
>
..

>
> is \u or \x discussed?

\u and \x is standard syntax in various regular expression
implementations. I gave it as an example of what obvious things I tried.
That this is not supported is *exactly* the problem! How do I specify a
certain Unicode code point either for matching or for insertion if not
this way?

Kai
--
Conactive Internet Services, Berlin, Germany

Peter H. Coffin

unread,

Feb 15, 2011, 8:14:25 PM2/15/11

to

On Tue, 15 Feb 2011 11:58:01 +0100, Kai Schaetzl wrote:
> Peter H. Coffin schrieb am Mon, 14 Feb 2011 16:31:54 -0600:
>
>> Where on
>>
> ..
>>
>> is \u or \x discussed?
>
> \u and \x is standard syntax in various regular expression
> implementations. I gave it as an example of what obvious things I tried.
> That this is not supported is *exactly* the problem!

It's an inconvenience. But at least the manual spells out exactly which
method of parsing is behind REGEXP, and details what is supported, so
there's not much point in being unhappy that something that wasn't
suggested would work does not, in fact, work.

> How do I specify a
> certain Unicode code point either for matching or for insertion if not
> this way?

There's a couple of ways that would likely work. The most
straightforward way is to simply set the connection character set
correctly, and construct the pattern using the explicit characters that
you're looking for. If you're looking for 'L', send an L. If you're
looking for a 'þ', send a þ.

Another method would be to REGEXP on a HEX(my_col), which might be
entirely reasonable for small sets of data. It's a little ... suboptimal
on large datasets because it'll force a tablescan.

And while you're playing with those, remember that MySQL uses 0x
notation to express hex constants, and you might find the
CHAR(N,... [USING charset_name]) function pretty handy....

Kai Schaetzl

unread,

Feb 21, 2011, 8:31:18 AM2/21/11

to

Peter H. Coffin schrieb am Tue, 15 Feb 2011 19:14:25 -0600:

> And while you're playing with those, remember that MySQL uses 0x
> notation to express hex constants, and you might find the
> CHAR(N,... [USING charset_name]) function pretty handy....

Thanks for this. The solution was to use something like

UPDATE table SET fr=REPLACE(fr, ' ?',CONCAT(CHAR(0xc2a0),'?')) WHERE fr
LIKE '%?%'

and

SELECT fr FROM table WHERE fr LIKE CONCAT('%',CHAR(0xc2a0),'?','%')

I didn't test REGEXP.