finding & saving accented (unicode) chars with perl 5.6.1

jcm...@gmail.com

May 21, 2008, 7:57:15 AM

Hello fellow perl addicts,

I have the following issue:

I want to find words containing accented chars,
based on a simple regex having a placeholder for the accented chars

I read everywhere that perl internally is fully unicode, so this
should be no problem
however my query doesn't work

setup :
given a list of words with '?' symbols (the chars to look for in a ref
list) "@AllQstLines" &
a reference list of correct accented words "@AllRefLines"

#### Begin code
foreach $wrd (@AllQstLines)
{
    # Bir?ebbuga <----- example of a $wrd
    next if ($wrd !~ /\?/);     # skip if no unknown/illegal chars
    next if ($WrdSeen{$wrd});   # skip if already in replace list
    $WrdSeen{$wrd}=1;
    $wrd =~ s/\?/\./g; chomp($wrd);   # turn each '?' into the regex wildcard '.'
    $found=0;

    @Matches=grep /$wrd/, @AllRefLines;
    foreach $match (@Matches)
    {
        $match =~ /$wrd/; $correct= $&;
        print "DBG: $wrd, $correct\n";
        $Correction{$wrd}=$correct;
        $found=1;
        last if ($found); ## only consider first match
    }
    if (!$found) {print "DBG: $wrd, NOTHING FOUND\n";}
}
#### End code

The problem: this way ?ejtun never matches the reference "Żejtun",
and Lu?ija does not seem to match "Luċija" either.

what am I doing wrong ?

thx in advance
kind regards
jc

Jim Gibson

May 21, 2008, 3:36:08 PM

In article
<e94b4949-5249-4bfe...@z72g2000hsb.googlegroups.com>,
<"jcm...@gmail.com"> wrote:

> "?ejtun"
> neither does Lu?ija seem to match "Lu?ija"

>
> what am I doing wrong ?

Several things:

1. You are using Perl 5.6. Unicode support exists in 5.6, but is not
recommended. From perluniintro:

"Perl's Unicode Support
Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively. Perl 5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example regular
expressions still do not work with Unicode in 5.6.1."

2. You have not provided a complete, working program that someone can
run. We can only guess at what data you are using to test your program.
Unicode support by newsreaders is variable, so I am not sure if you are
using actual '?' characters in your post or they are something else and
my newsreader is rendering them as '?'.

3. You are using Google Groups to post. Many of the most knowledgeable
people reading this newsgroup filter out posts from Google. If you are
serious about using Usenet for help, you should get yourself a real
news reader and news provider. (By posting your entire article, I am
providing those who don't read posts from Google the opportunity to
read yours.)

If you wish to read more about Perl's unicode support, read perlunitut,
perlunicode, and perluniintro, available with your Perl distribution or
at <http://www.cpan.org>.

Good luck!

--
Jim Gibson

jcm...@gmail.com

May 21, 2008, 4:41:29 PM

On May 21, 9:36 pm, Jim Gibson <jimsgib...@gmail.com> wrote:
> In article
> <e94b4949-5249-4bfe-842f-e426d7f24...@z72g2000hsb.googlegroups.com>,

Hello Jim & thx for your reaction

Google & newsreaders exactly show my code (but indeed changed the
looks of the examples of reference text) :

i am using the Questionmark symbol as placeholder in my QstnLines &
thus the var $wrd
iow $wrd = Lu?ija (a qstn-mark as 3rd char)

the referenceLines & thus the var $match contains the real life &
correctly written word containing accented chars
iow $match = Luċija (3rd char = c-with-dot-above ; total word
looks like Lucija but with a different c)

so rephrased, the PERL problem seems to be that the statement
$match =~ /$wrd/
Luċija =~ /Lu?ija/
doesn't seem to be TRUE

what the total program wants to achieve, is find the real/correct
written word in a reference file,
given a list of 'simplified' keywords that do not contain any accented
chars, those are all replaced by a Questionmark

rgds
jc

Jim Gibson

May 21, 2008, 9:09:38 PM

In article
<320d2308-cfe5-4bcc...@d1g2000hsg.googlegroups.com>,
<"jcm...@gmail.com"> wrote:

> Google & newsreaders exactly show my code (but indeed changed the
> looks of the examples of reference text) :
>
> i am using the Questionmark symbol as placeholder in my QstnLines &
> thus the var $wrd
> iow $wrd = Lu?ija (a qstn-mark as 3rd char)
>
> the referenceLines & thus the var $match contains the real life &
> correctly written word containing accented chars
> iow $match = Luċija (3rd char = c-with-dot-above ; total word
> looks like Lucija but with a different c)
>
> so rephrased, the PERL problem seems to be that the statement
> $match =~ /$wrd/
> Luċija =~ /Lu?ija/
> doesn't seem to be TRUE
>
> what the total program wants to achieve, is find the real/correct
> written word in a reference file,
> given a list of 'simplified' keywords that do not contain any accented
> chars, those are all replaced by a Questionmark

The question mark in Perl regular expressions is a quantifier that
means "0 or 1 of the previous character". Thus the RE /Lu?ija/ will
match any string containing the substring 'Lija' or 'Luija'. It will
not match any string having any character between the 'u' and the 'i'.
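
A minimal sketch of the quantifier behaviour described above. The three candidate words are made up for illustration; only the pattern /Lu?ija/ comes from the thread.

```perl
use strict;
use warnings;

# 'u?' means "zero or one 'u'", so /Lu?ija/ accepts 'Lija' or 'Luija'
# but nothing that has an extra character between the 'u' and the 'i'.
my @candidates = ('Lija', 'Luija', 'Lucija');   # hypothetical test words
my %matched;
for my $word (@candidates) {
    $matched{$word} = ($word =~ /Lu?ija/) ? 1 : 0;
    print "$word: ", ($matched{$word} ? "match" : "no match"), "\n";
}
```

Only 'Lucija' is rejected, which is why an unescaped '?' cannot serve as a one-character placeholder.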

The Perl meta-character for matching a single character is period (.).

You need an expression that will match a single accented character. You
can use character classes enclosed in brackets ([...]), control
character codes (\cX), numerical character codes (\NNN), POSIX style
character classes ([[:ascii:]]), or one of the fancier Unicode
constructs (see pp 167-174 in Programming Perl, 3rd Ed.). I do not have
experience with those and can't help you. Be aware that the Unicode
constructs probably do not exist in Perl 5.6.
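
One plausible reason for a failing '.' match (not confirmed anywhere in this thread) is that the reference word is held as undecoded UTF-8 bytes, so the accented letter occupies two positions instead of one. The sketch below assumes Perl 5.8 or later, the core Encode module, and UTF-8 input; the thread does not state which encoding the reference file uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for 'Luċija'; U+010B (c with dot above) is encoded as 0xC4 0x8B.
my $bytes = "Lu\xC4\x8Bija";

# On the raw bytes, '.' consumes a single byte, so the two-byte letter cannot fit.
my $byte_match = ($bytes =~ /^Lu.ija$/) ? 1 : 0;

# After decoding, the string holds six characters and '.' consumes the whole letter.
my $chars        = decode('UTF-8', $bytes);
my $char_match   = ($chars =~ /^Lu.ija$/)     ? 1 : 0;
my $letter_match = ($chars =~ /^Lu\p{L}ija$/) ? 1 : 0;   # \p{L}: any Unicode letter

printf "bytes: %d long, match=%d\n", length($bytes), $byte_match;
printf "chars: %d long, match=%d, letter match=%d\n",
    length($chars), $char_match, $letter_match;
```

Using \p{L} instead of '.' narrows the placeholder to letters, which may be closer to what the '?' is meant to stand for.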

jcm...@gmail.com

May 22, 2008, 3:35:27 AM

On May 22, 3:09 am, Jim Gibson <jimsgib...@gmail.com> wrote:
> In article

> <320d2308-cfe5-4bcc-b6a4-8e21666fe...@d1g2000hsg.googlegroups.com>,

Hi Jim,
thx again

I do swap the ?s with dots (see $wrd =~ s/\?/\./g;)

but I got it working,
your first guess was correct :
i upgraded my perl environment last night to 5.8.8 and bingo
everything now works fine !

great suggestion, thank you

rgd
jc
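
For readers on Perl 5.8 or later, here is a hedged sketch of reading and saving accented words through an :encoding layer. The in-memory "files" and the sample data are stand-ins for the real reference and output files, which the thread never shows, and UTF-8 is an assumed encoding.

```perl
use strict;
use warnings;

# Hypothetical reference data: the UTF-8 bytes of "Żejtun\nLuċija\n",
# standing in for the contents of the reference file.
my $ref_bytes = "\xC5\xBBejtun\nLu\xC4\x8Bija\n";

# Read through an :encoding layer so each line arrives as characters, not bytes.
open my $in, '<:encoding(UTF-8)', \$ref_bytes or die "cannot open input: $!";
chomp(my @AllRefLines = <$in>);
close $in;

# Encode again on output so the accented characters are saved as valid UTF-8.
my $out_bytes = '';
open my $out, '>:encoding(UTF-8)', \$out_bytes or die "cannot open output: $!";
print {$out} "$_\n" for grep { /^.ejtun$/ } @AllRefLines;
close $out;

print "lines read: ", scalar(@AllRefLines), "\n";
```

With real files, the scalar references would be replaced by file names; the layer names stay the same.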

jcm...@gmail.com

May 22, 2008, 4:54:44 AM

On May 22, 3:09 am, Jim Gibson <jimsgib...@gmail.com> wrote:
> In article

> <320d2308-cfe5-4bcc-b6a4-8e21666fe...@d1g2000hsg.googlegroups.com>,

Hi Jim,

Hartmut Camphausen

May 22, 2008, 5:02:48 AM

Hello jcmmat,

If I got it correctly, you
1. replaced any accented character in the given words with '?'
2. put the result into @AllQstLines
3. use the inserted '?' in $wrd as a placeholder for the
desired '&#nnn;', contained in the respective word in @AllRefLines:

@Matches=grep /$wrd/, @AllRefLines;

4. use the first $wrd-matching @Matches as the desired result
If so then:

In <<320d2308-cfe5-4bcc...@d1g2000hsg.googlegroups.com>>
jcm...@gmail.com wrote:

> so rephrased, the PERL problem seems to be that the statement
> $match =~ /$wrd/
> Luċija =~ /Lu?ija/
> doesn't seem to be TRUE

How could it be ;-)
Denoted like this, Jim's hint on '?' being a quantifier is correct.

BUT in your code example, you previously said:

$wrd =~ s/\?/\./g;

so the 'rephrased' expression should look like

Luċija =~ /Lu.ija/;

As you see, the to-be-matched 'ċ' (stored as an entity like '&#nnn;')
is quite a bit longer than just one single character (that's what the
'.' in 'Lu.ija' represents).

So you should say

$wrd =~ s/\?/.+?/g;

to make $wrd a regex that can match the longer character representation
in $AllRefLines[n].

NOTE that this RE will find any words in @AllRefLines that contain the
constant parts of $wrd plus ANY chars before/between/after them,
depending on the occurrence(s) of '?':

/Lu.+?ija/

will match Luċija
as well as Lu_cyinthesky_ija

You thus may want to be more specific about what to match.
If you know how your '?' is denoted, you could e.g. say

/Lu&.+?;ija/

that is

$wrd =~ s/\?/&.+?;/g
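
A small sketch comparing the loose and the stricter replacement. It assumes the reference words spell the accented letter as a numeric entity such as '&#267;' (an illustrative spelling, not taken from the poster's data), and it anchors the patterns with ^ and $ to make the difference visible.

```perl
use strict;
use warnings;

my $wrd = 'Lu?ija';

# Loose version from the post: each '?' may swallow any run of characters.
(my $loose = $wrd) =~ s/\?/.+?/g;

# Stricter version, assuming numeric entities: each '?' must be '&#', digits, ';'.
(my $strict = $wrd) =~ s/\?/&#\\d+;/g;

for my $candidate ('Lu&#267;ija', 'Lu_cyinthesky_ija') {
    printf "%-18s loose=%d strict=%d\n", $candidate,
        ($candidate =~ /^$loose$/  ? 1 : 0),
        ($candidate =~ /^$strict$/ ? 1 : 0);
}
```

The loose pattern accepts both candidates, while the strict one accepts only the entity form.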

Hint: You are using $wrd as a RE several times. For the sake of
efficiency you should compile it as a RE

$wrd = qr/\?/&.+?;/

before using it.

Hint: To get the first match of $wrd in @AllRefLines you can simply say

foreach (@AllRefLines){
    m/($wrd)/ ? ($correct = $1, last) : ($correct = undef)
} ;

No extra @Matches-array needed, use of $& avoided... (see perldoc
perlre, line 387 ff.)
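
An alternative sketch of the same "first match only" idea, using first() from the core List::Util module (bundled with Perl 5.8). The reference lines here are invented for illustration.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical reference lines; only the second and third contain the word.
my @AllRefLines = ('Lucia', 'Lu&#267;ija', 'Lu&#267;ija again');
my $wrd = qr/Lu&.+?;ija/;

# first() stops scanning at the first element for which the block is true.
my $line = first { /$wrd/ } @AllRefLines;

# Capture the matched text with parentheses instead of relying on $&.
my ($correct) = defined $line ? $line =~ /($wrd)/ : ();

print defined $correct ? "found: $correct\n" : "NOTHING FOUND\n";
```

If nothing matches, $correct stays undefined, mirroring the behaviour of the loop above.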

hth + kind regards, Hartmut

--
------------------------------------------------
Hartmut Camphausen h.camp[bei]textix[punkt]de

John W. Krahn

May 22, 2008, 6:22:43 AM

Hartmut Camphausen wrote:
>
> Hint: You are using $wrd as a RE several times. For the sake of
> efficiency you should compile it as a RE
>
> $wrd = qr/\?/&.+?;/

That will produce a syntax error because of the '&.+?;/' after the
compiled expression qr/\?/

$ perl -le'$wrd = qr/\?/&.+?;/'
syntax error at -e line 1, near "&."
Search pattern not terminated or ternary operator parsed as search
pattern at -e line 1.

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Hartmut Camphausen

May 22, 2008, 6:47:59 AM

Hello John,

In <<TPbZj.3530$Yp.2679@edtnps92>> John W. Krahn wrote:

> Hartmut Camphausen wrote:
> >
> > Hint: You are using $wrd as a RE several times. For the sake of
> > efficiency you should compile it as a RE
> >
> > $wrd = qr/\?/&.+?;/
>
> That will produce a syntax error because of the '&.+?;/' after the
> compiled expression qr/\?/

Arrgh. Right. Wrong logic.
The above is quite senseless anyway :-p
(I didn't want to make $wrd a RE for just '?')

In my test script, I had

$wrd =~ s/\?/&.+?;/g; # prepare $wrd and THEN...
$wrd = qr/$wrd/; # ...compile it as a RE

which worked as expected, giving $wrd as e.g.

(?-xism:Lu&.+?;ija)

Thanks for the hint!

Hartmut Camphausen

May 22, 2008, 6:56:38 AM

In <<MPG.229f542a1...@news.t-online.de>> Hartmut
Camphausen wrote:
> ...you should compile it as a RE

>
> $wrd = qr/\?/&.+?;/
>
> before using it.

WRONGWRONGWRONG!
Thanks to John's hint, I paste(!) the correct lines here:

$wrd =~ s/\?/&.+?;/g; # FIRST prepare $wrd...
$wrd = qr/$wrd/; # ...and THEN compile it as a RE.

This will compile ok, and

m/$wrd/

works as expected.
Uff.
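
One further precaution, sketched under the assumption that query words may contain regex metacharacters such as '.' or '(': escape the literal pieces with quotemeta before compiling. The helper name placeholder_to_regex and the sample word are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a '?'-placeholder word into a compiled regex.
sub placeholder_to_regex {
    my ($word, $wildcard) = @_;
    # Escape the literal pieces so characters such as '.' match only themselves,
    # then join them with the regex fragment that stands in for each '?'.
    # The -1 limit keeps a trailing empty piece when the word ends in '?'.
    my $body = join $wildcard, map { quotemeta($_) } split /\?/, $word, -1;
    return qr/$body/;
}

my $re = placeholder_to_regex('St. Lu?ija', '&.+?;');

for my $line ('St. Lu&#267;ija', 'StX Lu&#267;ija') {
    print "$line: ", ($line =~ $re ? "match" : "no match"), "\n";
}
```

Without the escaping, the '.' in 'St.' would act as a wildcard and the second line would match as well.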