How to determine if a word has an extended character?

ambaris...@gmail.com

May 20, 2008, 8:54:17 AM

I have a file which contains just one word. My task is just to find
out if the word has any extended character. That's all.

I can use a regex, but I am not able to work out a pattern for
extended characters. Any hints?

For example, if the file content is "sample", the Perl code should print
false; and if the file content is "samplé", the Perl code should print
true.

Thanks.

Jürgen Exner

May 20, 2008, 9:50:56 AM

ambaris...@gmail.com wrote:
>I have a file which contains just one word. My task is just to find
>out if the word has any extended character. That's all.
>
>I can use regex, but am not able to find out a regex pattern for
>extended character. Any hints?

[Interpreting 'extended' as non-ASCII]

You could simply use the POSIX character class [:ascii:]; for example,
/[^[:ascii:]]/ matches any character outside the ASCII range.

Another way would be to check, for each character, whether its ord() is
less than 128. That should work at least for the most common encodings like
ISO-Latin-1, Windows-1252, ...
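
The two checks just described can be sketched as follows. This is only an illustration, not tested against the poster's actual file: the helper names has_non_ascii_ord and has_non_ascii_re are made up here, and the é is written as the escape \x{E9} so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any character has a code point of 128 or above.
sub has_non_ascii_ord {
    my ($word) = @_;
    for my $char (split //, $word) {
        return 1 if ord($char) >= 128;
    }
    return 0;
}

# Hypothetical helper: the same test as one regex, using the negated POSIX class.
sub has_non_ascii_re {
    my ($word) = @_;
    return $word =~ /[^[:ascii:]]/ ? 1 : 0;
}

# "sampl\x{E9}" stands for "samplé".
for my $word ("sample", "sampl\x{E9}") {
    print has_non_ascii_ord($word) ? "true\n" : "false\n";
}
```

Both helpers should agree on any input; the regex form is shorter, while the ord() loop makes the threshold explicit.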

Or: [untested]
    if (/^[A-Za-z]*$/) {
        print 'false';
    } else {
        print 'true';
    }

You could probably also set your locale to en_US, put 'use locale' in
effect, and use
    if (/\W/) {
        print 'true';
    } else {
        print 'false';
    }

All of these do somewhat different things, so you have some options to
choose the one that most closely matches your needs.
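
To make the differences concrete, here is a rough comparison of three of the patterns on a few sample strings. It is a sketch under assumptions: the %checks table and its key names are invented for illustration, and the /a modifier (which needs Perl 5.14 or later) is used in place of a locale setting so that \W behaves the same on every system.

```perl
use strict;
use warnings;

# Hypothetical table of checks; each returns 1 if the word counts as "extended".
my %checks = (
    non_ascii   => sub { $_[0] =~ /[^[:ascii:]]/ ? 1 : 0 },  # anything outside ASCII
    not_letters => sub { $_[0] !~ /^[A-Za-z]*$/  ? 1 : 0 },  # anything but A-Z, a-z
    non_word    => sub { $_[0] =~ /\W/a          ? 1 : 0 },  # anything outside ASCII \w
);
my @names = qw(non_ascii not_letters non_word);

# Each entry is [label to print, string to test]; \x{E9} stands for é.
my @words = (
    ['sample',      "sample"],
    ['sampl\x{E9}', "sampl\x{E9}"],
    ['sample_1',    "sample_1"],
    ['two words',   "two words"],
);

printf "%-12s %s\n", 'word', join(' ', @names);
for my $entry (@words) {
    my ($label, $word) = @$entry;
    my @flags = map { $checks{$_}->($word) ? 'true' : 'false' } @names;
    printf "%-12s %s\n", $label, join(' ', @flags);
}
```

Only the first check is limited to non-ASCII characters; the letters-only pattern also flags digits and '_', and \W also flags spaces and punctuation.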

jue

Hartmut Camphausen

May 20, 2008, 9:55:11 AM

In <<405f2950-fa4a-4a3e...@k1g2000prb.googlegroups.com>>
wrote ...

$string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";

should do the trick.

This prints "has extended" if $string contains any characters other
([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
character class).

If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
If you want to include more "valid" characters, expand the [^...]
accordingly (note: if you want to include '-' as a valid character, put it
at the very end of the character list).
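
As a small sketch of expanding the class this way: the helper name has_extended is hypothetical, and the explicit ranges are used instead of \w so the result does not depend on locale or Unicode settings.

```perl
use strict;
use warnings;

# Hypothetical helper: letters, digits, '_' and '-' count as valid;
# any other character makes the string "extended".
# The '-' sits at the very end of the class, so it is taken literally
# instead of forming a range.
sub has_extended {
    my ($string) = @_;
    return $string =~ /[^a-zA-Z0-9_-]/ ? 1 : 0;
}

for my $string ("foo-bar_1", "foo bar", "sampl\x{E9}") {
    print has_extended($string) ? "has extended.\n" : "OK.\n";
}
```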

See
perldoc perlre
perldoc perlrequick
perldoc perlreref
perldoc perlretut

Hope this helps, Hartmut

--
------------------------------------------------
Hartmut Camphausen h.camp[bei]textix[punkt]de

John W. Krahn

May 20, 2008, 7:29:08 PM

Hartmut Camphausen wrote:
> In <<405f2950-fa4a-4a3e...@k1g2000prb.googlegroups.com>>
> wrote ...
>> I have a file which contains just one word. My task is just to find
>> out if the word has any extended character. That's all.
>>
>> I can use regex, but am not able to find out a regex pattern for
>> extended character. Any hints?
>>
>>
>> For example, if the file content is: sample, then the Perl code prints
>> false; and if the file content is samplé, then the Perl code prints
>> true.
>
>
> $string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";

[^\w] is usually written as \W.

> should do the trick.
>
> This prints "has extended" if $string contains any characters other
> ([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
> character class).

From perlre.pod:

<QUOTE>
If "use locale" is in effect, the list of alphabetic characters
generated by "\w" is taken from the current locale. See perllocale.
</QUOTE>

In other words, if your locale supports it then 'é' will be included in \w.

> If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]

[^a-zA-Z0-9] means any character that is *not* alphanumeric. You
probably meant [a-zA-Z0-9].

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Ben Bullock

May 21, 2008, 12:28:32 AM

On Tue, 20 May 2008 23:29:08 +0000, John W. Krahn wrote:

> Hartmut Camphausen wrote:

>>
>> $string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";
>
> [^\w] is usually written as \W.

Hartmut mentioned that one could add more characters to the [^\w] class in
the following part of his post, which may explain why he chose this notation
rather than \W.

>> This prints "has extended" if $string contains any characters other
>> ([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
>> character class).
>
> From perlre.pod:
>
> <QUOTE>
> If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale. See perllocale.
> </QUOTE>
>
> In other words, if your locale supports it then 'é' will be included in
> \w.

Or if you use Unicode:

#!/usr/bin/perl
use warnings;
use strict;
use Unicode::UCD 'charinfo';

# Count the BMP code points (surrogates and noncharacters are skipped)
# whose character matches the given regex.
sub count_match
{
    my ($re) = @_;
    my $c = 0;    # start at 0 so the final print never sees undef
    for my $n (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0 .. 0xFFFD) {
        if (chr($n) =~ /$re/) {
            my $ci = charinfo($n);
            # print sprintf('%02X', $n), " which is ", $ci->{name}, " matches\n";
            $c++;
        }
    }
    print "There are $c characters matching \"$re\".\n";
}
count_match('\w');

Uncommenting the "print" statement will produce a lot of output.
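
Applied to the original question, a sketch along these lines might read the word from the file as UTF-8 and test it. The file encoding is an assumption (the original post does not say), the helper names read_word and file_has_non_ascii are invented, and a temporary file stands in for the real one.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the first line of a UTF-8 encoded file, decoded.
sub read_word {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    my $word = <$in>;
    close $in;
    $word = '' unless defined $word;
    chomp $word;
    return $word;
}

# Hypothetical helper: true if the word in the file has a non-ASCII character.
sub file_has_non_ascii {
    my ($path) = @_;
    return read_word($path) =~ /[^[:ascii:]]/ ? 1 : 0;
}

# Demo: write "samplé" (é given as the escape \x{E9}) to a temporary file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "sampl\x{E9}\n";
close $out;

print file_has_non_ascii($path) ? "true\n" : "false\n";
```

Once the word has been decoded like this, /\W/ no longer flags the é, because é is a word character under Unicode rules; that is why the sketch tests against [:ascii:] instead.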

>> If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
>
> [^a-zA-Z0-9] means any character that is *not* alphanumeric. You
> probably meant [a-zA-Z0-9].

I think he meant what he said, [^\w] matches _ but [^a-zA-Z0-9] doesn't.

Ben Bullock

May 21, 2008, 1:24:51 AM

On Wed, 21 May 2008 04:28:32 +0000, Ben Bullock wrote:

>>> If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
>>
>> [^a-zA-Z0-9] means any character that is *not* alphanumeric. You
>> probably meant [a-zA-Z0-9].
>
> I think he meant what he said, [^\w] matches _ but [^a-zA-Z0-9] doesn't.

Sorry, I meant to say "[^\w] doesn't match _, but [^a-zA-Z0-9] does."

Hartmut Camphausen

May 21, 2008, 6:51:31 AM

In <<89JYj.3989$KB3.3516@edtnps91>> John W. Krahn wrote...

> > $string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";
>
> [^\w] is usually written as \W.

You are right. But I chose this notation to make it easy to expand the
list of characters not to match on (Ben B., your crystal ball worked
well :-)

> From perlre.pod:
>
> <QUOTE>
> If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale. See perllocale.
> </QUOTE>
>
> In other words, if your locale supports it then 'é' will be included in \w.

Arrgh. Right again. Well, I never 'use locale'... and so didn't even
think of it.
I'll consider this in my future posts (hopefully).

> > If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
>
> [^a-zA-Z0-9] means any character that is *not* alphanumeric. You
> probably meant [a-zA-Z0-9].

Not quite.
Maybe the term 'exclude' is a bit misleading; I wanted to say 'exclude
from the list of characters NOT to match on'. That is, [^\w] won't match on
'_', while [^a-zA-Z0-9] will, making '_' an 'extended' character.
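
A quick check of that difference, as a sketch only (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;

my $word = 'foo_bar';

# '_' belongs to \w, so the negated class finds nothing to match.
my $with_w = $word =~ /[^\w]/ ? 1 : 0;

# '_' is outside a-z, A-Z and 0-9, so this negated class matches it.
my $with_range = $word =~ /[^a-zA-Z0-9]/ ? 1 : 0;

printf "%s: %d, %s: %d\n", '[^\w]', $with_w, '[^a-zA-Z0-9]', $with_range;
```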

Thanks for your hints anyway.

Regards, Hartmut

Dr.Ruud

May 21, 2008, 3:05:51 PM

ambaris...@gmail.com wrote:

> I have a file which contains just one word. My task is just to find
> out if the word has any extended character. That's all.

Define "extended character". Any character sorted after "\x7F" maybe?

--
Affijn, Ruud

"Gewoon is een tijger."