Re: regex matches Chinese characters


Shlomi Fish

Jul 27, 2018, 2:00:04 AM
to Lauren C., Perl Beginners
Hi Lauren,

On Fri, 27 Jul 2018 11:28:42 +0800
"Lauren C." <lau...@miscnote.net> wrote:

> greetings,
>
> I was doing the log statistics stuff using perl.
> There are chinese characters in log items.
> I tried with regex to match them, but got no luck.
>
> $ perl -mstrict -le 'my $char="汉语"; print "it is chinese" if $char =~ /\p{Han}+/'
>
> $ perl -mstrict -mutf8 -le 'my $char="汉语"; print "it is chinese" if $char =~ /\p{Han}+/'
>
> both output nothing.
>
> My terminal is UTF-8:
>

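According to http://perldoc.perl.org/perlrun.html , you need -Mstrict and
-Mutf8 instead of the lowercase -m. The lowercase -mMODULE runs
"use MODULE ();", which loads the module but never calls its "sub import", so
the utf8 pragma has no effect and the literal is matched as raw bytes. The
uppercase -MMODULE runs "use MODULE;", so "sub import" does get called: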

shlomif@telaviv1:~$ perl -Mstrict -Mutf8 -le 'my $char="汉语"; print "it is chinese" if $char =~ /\p{Han}+/'
it is chinese
shlomif@telaviv1:~$

HTH,

Shlomi

> $ locale
> LANG=en_US.UTF-8
> LANGUAGE=
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=
>
>
> Can you help? thanks in advance.
>



--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
https://github.com/sindresorhus/awesome - curated list of lists

Cats are smarter than dogs. You can’t get eight cats to pull a sled through
snow. — Source unknown, via Nadav Har’El.

Please reply to list if it's a mailing list post - http://shlom.in/reply .