
regular expression for special html characters


Shlomit Afgin

Feb 2, 2011, 4:25:29 AM
to begi...@perl.org


Hello,

I am trying to convert HTML special characters (numeric entities) to the characters they represent.
For example, converting &#8221; to " .

I have the string
$str = "&#8220; test &#8221; ניסיון ";
The string also contains Hebrew letters.

1. First I did:
$str = decode_entities($str);
It converts the special characters correctly.
The problem is that the Hebrew comes out wrong.
When I print the value of $str, I get the Hebrew as יסיון

2. Then I decided to write a regular expression that changes only the
HTML special characters.
I wrote:
$str = "&#8220; test &#8221; ניסיון ";
$str =~ s/(&#(?=[0-9])*.{2,5};)/decode_entities($1)/ge;
Even though it should work only on the matched substrings, it seems to
affect the Hebrew letters as well.
The Hebrew letters again come out as יסיון
Parts 1 and 2 give the same output.

3. I decided to check the regular expression, so I removed the 'e' at the
end of the substitution to see what gets matched.
I wrote:
$str = "&#8220; test &#8221; ניסיון ";
$str =~ s/(&#(?=[0-9])*.{2,5};)/decode_entities($1)/g;
The output was:
decode_entities(&#8220;) test decode_entities(&#8221;) ניסיון
The Hebrew came out okay, of course.

4. I can do:
$str =~ s/&#8220;|&#8221;/"/g;
which doesn't affect the Hebrew and converts the HTML characters.
The problem is that there are other HTML special characters in
the data.
I would like to do something more generic that will also work in the
future.
Any ideas are welcome!!
Shlomit.
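
A possible explanation for what steps 1 and 2 show, offered as a sketch rather than a confirmed diagnosis: without `use utf8`, the Hebrew in the string literal is held as raw UTF-8 bytes, while decoding an entity such as &#8220; adds a real character above 255. Once the two are mixed, Perl treats each Hebrew byte as a Latin-1 character, and the Hebrew gets encoded a second time on output. The snippet below reproduces that effect with only the core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of the Hebrew letter nun (U+05E0), as they sit in a
# source file that has no "use utf8".
my $hebrew_bytes = "\xD7\xA0";

# What an entity decoder produces for &#8220;: one wide character.
my $left_quote = chr(8220);

# Mixing raw bytes with a wide character: each byte is treated as a
# Latin-1 character, so encoding for output double-encodes the Hebrew.
my $broken = encode('UTF-8', $hebrew_bytes . $left_quote);

# Decoding the bytes to characters first keeps the Hebrew intact.
my $fixed = encode('UTF-8', decode('UTF-8', $hebrew_bytes) . $left_quote);

printf "broken: %d bytes, fixed: %d bytes\n", length $broken, length $fixed;
```

Run as written, this prints "broken: 7 bytes, fixed: 5 bytes": the two Hebrew bytes grow to four in the broken case and stay at two once they are decoded first.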

Jeff Pang

Feb 3, 2011, 5:52:07 AM
to Shlomit Afgin, begi...@perl.org
2011/2/2 Shlomit Afgin <Shlomi...@weizmann.ac.il>:

>
>
> Hello,
>
> I tried to convert html special characters to their real character.
> For example, converting &#8221; to " .
>
> I had the string
> $str = "&#8220; test &#8221; ניסיון ";
> The string contain also Hebrew letters.
>

Could Encode work on it?

use Encode;
$new = encode("iso-8859-1",decode("iso-8859-8",$str));

Regards.

Shawn H Corey

Feb 3, 2011, 8:45:47 AM
to begi...@perl.org
On 11-02-02 04:25 AM, Shlomit Afgin wrote:
> I tried to convert html special characters to their real character.
> For example, converting &#8221; to " .
>
> I had the string
> $str = "&#8220; test &#8221; ניסיון ";

> The string contain also Hebrew letters.

This seems to work:

#!/usr/bin/perl

use strict;
use warnings;

use encoding( 'utf8' );
use HTML::Entities;

my $str = "&#8220; test &#8221; ניסיון ";
$str = decode_entities( $str );
print "$str\n";

__END__


--
Just my 0.00000002 million dollars worth,
Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software: Fail early & often.

Eliminate software piracy: use only FLOSS.
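
Shawn's script above relies on the `encoding` pragma, which has been deprecated since Perl 5.18, and newer Perls no longer support its default mode. A rough present-day equivalent, assuming the source file is saved as UTF-8 and HTML::Entities is installed from CPAN, might look like this (the `$decoded` variable is only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                # string literals in this file are UTF-8 text
use HTML::Entities qw(decode_entities);

binmode STDOUT, ':encoding(UTF-8)';      # encode characters when printing

my $str = "&#8220; test &#8221; ניסיון ";

# With "use utf8" the Hebrew is already character data, so the decoded
# entities and the Hebrew letters live in the same kind of string.
my $decoded = decode_entities($str);
print "$decoded\n";
```

If the text comes from a file or a database instead of a literal, the same idea applies: decode the input to characters first (for example with Encode::decode), then call decode_entities.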

John Delacour

Feb 3, 2011, 9:56:17 AM
to begi...@perl.org
At 18:52 +0800 03/02/2011, Jeff Pang wrote:

>2011/2/2 Shlomit Afgin <Shlomi...@weizmann.ac.il>:


>
>
> > I tried to convert html special characters to their real character.
> > For example, converting &#8221; to " .
> >
> > I had the string

> > $str = "&#8220; test &#8221; ניסיון ";


> > The string contain also Hebrew letters.
>
>Could Encode work on it?

>use Encode;
>$new = encode("iso-8859-1",decode("iso-8859-8",$str));

Heaven forbid!

The HTML entities are decimal Unicode code points, so all you need to do in
this case is capture the number n and then evaluate chr n in a substitution:


#!/usr/local/bin/perl
use strict;
binmode STDOUT, 'utf8';
$_ = "&#8220;&#1488;&#8221;";
s~&#([\d]+);~chr $1~eg;
print; # -=> “א”

JD
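
Building on the substitution above, here is a minimal sketch applied to the original string, assuming the source is saved as UTF-8 and that hexadecimal references (&#x...;) may also turn up in the data. The helper name decode_numeric_entities is hypothetical, and named entities such as &amp; are deliberately left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                # the Hebrew literal below is character data

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: decodes numeric character references only,
# both decimal (&#8220;) and hexadecimal (&#x201C;), in a single pass.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

my $str = "&#8220; test &#8221; ניסיון ";
my $out = decode_numeric_entities($str);
print "$out\n";
```

Doing both forms in one substitution avoids decoding the same text twice. For data that also contains named entities, HTML::Entities remains the more complete option.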
