I can use a regex, but I have not been able to work out a pattern that
matches extended characters. Any hints?
For example, if the file content is: sample, then the Perl code prints
false; and if the file content is samplé, then the Perl code prints
true.
Thanks.
[Interpreting 'extended' as non-ASCII]
You could simply use the POSIX character class [:ascii:] (lowercase in
Perl, and only valid inside a bracketed class, e.g. /[^[:ascii:]]/).
Another way would be to check each character and see whether its ord()
is less than 128. That should work at least for the most common
encodings like ISO-Latin-1, Windows-1252, ...
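
A minimal sketch of both suggestions, assuming the text is already in a
scalar. The sub names has_non_ascii_re and has_non_ascii_ord are made up
for illustration and are not from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string holds any non-ASCII character.
# [:^ascii:] is the negated POSIX class; it must sit inside [...].
sub has_non_ascii_re {
    my ($text) = @_;
    return $text =~ /[[:^ascii:]]/ ? 1 : 0;
}

# Hypothetical helper: the same test, character by character, with ord().
sub has_non_ascii_ord {
    my ($text) = @_;
    for my $char (split //, $text) {
        return 1 if ord($char) > 127;
    }
    return 0;
}

# "\xE9" is 'é' in ISO-Latin-1 / Windows-1252.
for my $word ("sample", "sampl\xE9") {
    print has_non_ascii_re($word) ? "true\n" : "false\n";
}
```

Both helpers should agree on any input; the regex version is shorter,
while the ord() loop makes the threshold explicit.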
Or: [untested]
    if (/^[A-Za-z]*$/) {
        print 'false';
    } else {
        print 'true';
    }
You could probably also set your locale to EN-US, enable 'use locale',
and use
    if (/\W/) {
        print 'true';
    } else {
        print 'false';
    }
All of these do somewhat different things, so you have some options to
choose the one that most closely matches your needs.
jue
$string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";
should do the trick.
This prints "has extended" if $string contains any characters other
([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
character class).
If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
If you want to include more "valid" characters, expand the [^...]
accordingly (note: if you want to include '-' as a valid character, put
it at the very end of the character list).
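
As an illustration of expanding the class, the sketch below also treats
'.' and '-' as valid. That choice of characters and the sub name
has_invalid_char are assumptions made for the example only. The hyphen
sits last so it is taken literally rather than as a range operator.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $string contains a character that is
# neither a word character (\w), a dot, nor a hyphen.
sub has_invalid_char {
    my ($string) = @_;
    return $string =~ /[^\w.-]/ ? 1 : 0;
}

print has_invalid_char("file-name_1.txt") ? "has extended.\n" : "OK.\n";
print has_invalid_char("file name")       ? "has extended.\n" : "OK.\n";
```

The first call should report "OK." and the second "has extended.",
because the space is not in the list of valid characters.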
See
perldoc perlre
perldoc perlrequick
perldoc perlreref
perldoc perlretut
hth, Hartmut
--
------------------------------------------------
Hartmut Camphausen h.camp[bei]textix[punkt]de
[^\w] is usually written as \W.
> should do the trick.
>
> This prints "has extended" if $string contains any characters other
> ([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
> character class).
From perlre.pod:
<QUOTE>
If "use locale" is in effect, the list of alphabetic characters
generated by "\w" is taken from the current locale. See perllocale.
</QUOTE>
In other words, if your locale supports it then 'é' will be included in \w.
> If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
[^a-zA-Z0-9] means any character that is *not* alphanumeric. You
probably meant [a-zA-Z0-9].
John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
> Hartmut Camphausen wrote:
>>
>> $string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";
>
> [^\w] is usually written as \W.
Hartmut mentioned that one could add more characters to the [^\w] class
in the following part of his post, which may explain why he chose this
method rather than using \W.
>> This prints "has extended" if $string contains any characters other
>> ([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
>> character class).
>
> From perlre.pod:
>
> <QUOTE>
> If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale. See perllocale.
> </QUOTE>
>
> In other words, if your locale supports it then 'é' will be included in
> \w.
Or if you use Unicode:
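#!/usr/bin/perl
use warnings;
use strict;
use Unicode::UCD 'charinfo';

# Count the code points in the listed ranges whose character matches $re.
sub count_match
{
    my ($re) = @_;
    my $c = 0;    # start at 0 so the final print is safe with no matches
    # The ranges skip the surrogates (D800-DFFF) and the noncharacters
    # FDD0-FDEF, FFFE and FFFF.
    for my $n (0x00 .. 0xD7FF, 0xE000 .. 0xFDCF, 0xFDF0 .. 0xFFFD) {
        if (chr($n) =~ /$re/) {
            my $ci = charinfo($n);
            # print sprintf ('%02X', $n), " which is ", $$ci{name}, " matches\n";
            $c++;
        }
    }
    print "There are $c characters matching \"$re\".\n";
}

count_match('\w');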
Uncommenting the "print" statement will produce a lot of output.
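
To make the consequence concrete: once a string is handled with Unicode
rules, 'é' counts as a word character, so \W no longer flags it, whereas
an explicit ASCII range still does. A small sketch, using the core
function utf8::upgrade to force character semantics on the test string;
the sub names unicode_word and is_flagged are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: build "samplé" and store it internally as UTF-8,
# so that regex matching applies Unicode rules to it.
sub unicode_word {
    my $word = "sampl\x{E9}";
    utf8::upgrade($word);
    return $word;
}

# Hypothetical helper: 1 if the pattern matches the word, else 0.
sub is_flagged {
    my ($word, $re) = @_;
    return $word =~ $re ? 1 : 0;
}

my $word = unicode_word();
print "flagged by \\W: ",           is_flagged($word, qr/\W/),           "\n";
print "flagged by [^\\x00-\\x7F]: ", is_flagged($word, qr/[^\x00-\x7F]/), "\n";
```

The explicit range does not depend on locale or Unicode settings, which
makes it the more predictable choice if "extended" means non-ASCII.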
>> If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
>
> [^a-zA-Z0-9] means any character that is *not* alphanumeric. You
> probably meant [a-zA-Z0-9].
I think he meant what he said, [^\w] matches _ but [^a-zA-Z0-9] doesn't.
>>> If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
>>
>> [^a-zA-Z0-9] means any character that is *not* alphanumeric. You
>> probably meant [a-zA-Z0-9].
>
> I think he meant what he said, [^\w] matches _ but [^a-zA-Z0-9] doesn't.
Sorry, I meant to say "[^\w] doesn't match _, but [^a-zA-Z0-9] does."
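
A quick way to check the underscore behaviour for yourself; the helper
name matches and the sample string are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the pattern matches the text, else 0.
sub matches {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

my $name = "my_var";

# '_' belongs to \w, so the negated class [^\w] finds nothing here.
print "[^\\w] matches: ", matches($name, qr/[^\w]/), "\n";

# '_' is not in a-zA-Z0-9, so this negated class does match it.
print "[^a-zA-Z0-9] matches: ", matches($name, qr/[^a-zA-Z0-9]/), "\n";
```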
> > $string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";
>
> [^\w] is usually written as \W.
You are right. But I chose this notation to make it easy to expand the
list of characters not to match on (Ben B., your crystal ball worked
well :-)
> From perlre.pod:
>
> <QUOTE>
> If "use locale" is in effect, the list of alphabetic characters
> generated by "\w" is taken from the current locale. See perllocale.
> </QUOTE>
>
> In other words, if your locale supports it then 'é' will be included in \w.
Arrgh. Right again. Well, I never 'use locale'... and so didn't even
think of it.
I'll consider this in my future posts (hopefully).
> > If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]
>
> [^a-zA-Z0-9] means any character that is *not* alphanumeric. You
> probably meant [a-zA-Z0-9].
Not quite.
Maybe the term 'exclude' is a bit misleading; I wanted to say 'exclude
from the list of characters NOT to match on'. That is, [^\w] won't match
on '_', while [^a-zA-Z0-9] will, making '_' an 'extended' character.
Thanks for your hints anyway.
mfg, Hartmut
> I have a file which contains just one word. My task is just to find
> out if the word has any extended character. That's all.
Define "extended character". Any character sorted after "\x7F" maybe?
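
Taking that definition (anything above "\x7F"), the original task could
be sketched as below. The sub name file_has_extended is hypothetical,
and the file is deliberately read as raw bytes: in ISO-Latin-1,
Windows-1252 and UTF-8 alike, every non-ASCII character is stored as one
or more bytes above 0x7F, so no decoding is needed for a yes/no answer.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes and report whether
# any byte lies above "\x7F".
sub file_has_extended {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                        # slurp mode: read everything at once
    my $content = <$fh> // '';       # an empty file yields undef
    close $fh;
    return $content =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Usage: perl script.pl FILE
if (@ARGV) {
    print file_has_extended($ARGV[0]) ? "true\n" : "false\n";
}
```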
--
Affijn, Ruud
"Gewoon is een tijger."