Arabic presentation forms -> normal UTF-8

Mohammad Ali Safari

Jan 22, 2012, 5:33:18 PM
to Persian Computing
Hi
I have some text that contains characters from the Arabic Presentation Forms-B block (U+FE70 - U+FEFF), such as ﺵﺝﺭ.
Does anybody have PHP code that converts this to normal UTF-8 text?

cheers,
-Mohammad

Behdad Esfahbod

Jan 22, 2012, 6:39:13 PM
to Mohammad Ali Safari, Persian Computing

Do you actually need ZWJ / ZWNJ inserted to preserve the forms too, or do you
just want three letters out for three letters in? If the latter, I can generate
the mapping table for you in two minutes. The former needs more work if you
want to remove excess ZWJ / ZWNJ.

Anyway, you can generate the table yourself. Here's all you need:

1. Get the UnicodeData.txt from:

http://www.unicode.org/Public/6.1.0/ucd/

2. grep 'initial\|medial\|final\|isolated' UnicodeData.txt

That's the data; massage it into the format you need.
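
For readers following along, here is a minimal sketch of that "massage" step in PHP, not the poster's own script. It assumes the standard semicolon-separated layout of UnicodeData.txt, where the sixth field holds the decomposition (for example "<initial> 0628"). The name build_presentation_map is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): builds a table of
// presentation-form code point => list of base code points
// from lines of UnicodeData.txt.
function build_presentation_map(array $lines): array
{
    $map = [];
    foreach ($lines as $line) {
        $fields = explode(';', trim($line));
        if (count($fields) < 6) {
            continue; // not a UnicodeData.txt record
        }
        // Field 5 is the decomposition mapping, e.g. "<initial> 0628".
        if (!preg_match('/^<(initial|medial|final|isolated)>\s+(.+)$/', $fields[5], $m)) {
            continue; // no presentation-form tag on this line
        }
        // A ligature decomposes to several code points, so keep a list.
        $map[hexdec($fields[0])] = array_map('hexdec', preg_split('/\s+/', trim($m[2])));
    }
    return $map;
}
```

Something like build_presentation_map(file('UnicodeData.txt')) would then give the full table. Note that ligatures such as lam-alef map to two code points, so the values are lists rather than single numbers.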

behdad

> cheers,
> -Mohammad

Mohammad Ali Safari

Jan 22, 2012, 11:34:20 PM
to Behdad Esfahbod, Persian Computing
Thanks Behdad. I ended up writing a simple mapping function:

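function purify_value($v){
    // Code points below U+FE8F (including the forms at U+FE70-U+FE8E) are returned unchanged.
    if ($v < 0xFE8F) return $v;
    // Each key is the last presentation form of a letter; the value is the base letter.
    // Kaf forms map to Persian keheh (U+06A9); the range ending at U+FEF4 maps to U+0649.
    $cv_table = array(0xFE92 => 0x628, 0xFE94 => 0x629 , 0xFE98 => 0x62A, 0xFE9C => 0x62B, 0xFEA0 => 0x62C, 0xFEA4 => 0x62D, 0xFEA8=> 0x62E, 0xFEAA=>0x62F, 0xFEAC=>0x630,
    0xFEAE=>0x631, 0xFEB0=>0x632, 0xFEB4=>0x633, 0xFEB8=>0x634, 0xFEBC=>0x635, 0xFEC0=>0x636, 0xFEC4=>0x637, 0xFEC8=>0x638, 0xFECC=>0x639, 0xFED0=>0x63A, 0xFED4=>0x641, 0xFED8=>0x642,
    0xFEDC=>0x6a9, 0xFEE0=>0x644, 0xFEE4=>0x645, 0xFEE8=>0x646, 0xFEEC=>0x647, 0xFEEE=>0x648, 0xFEF4=>0x649);
    // Keys are in ascending order, so the first key >= $v identifies the letter.
    foreach ($cv_table as $fr=>$t)
        if ($v <= $fr) return $t;
    // Anything above U+FEF4 (e.g. the lam-alef ligatures) has no entry; return it unchanged
    // instead of falling off the end of the function.
    return $v;
}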
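
A function like this works on one code point at a time, so it still has to be applied to each character of a UTF-8 string. The following is a rough sketch of that step, assuming the mbstring extension (mb_ord / mb_chr, PHP 7.2+). The names map_code_points and $demo are hypothetical, and $demo is only a three-entry stand-in table for the isolated forms of sheen, jeem and reh.

```php
<?php
// Sketch: apply an int => int code-point mapping to every character
// of a UTF-8 string. Assumes the mbstring extension is loaded.
function map_code_points(string $utf8, callable $map): string
{
    $out = '';
    // The /u flag makes preg_split cut between code points, not bytes.
    // It returns false on invalid UTF-8, hence the fallback to [].
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($chars as $ch) {
        $out .= mb_chr($map(mb_ord($ch, 'UTF-8')), 'UTF-8');
    }
    return $out;
}

// Stand-in mapping for illustration only (not the full table).
$demo = function (int $cp): int {
    $table = [0xFEB5 => 0x634, 0xFE9D => 0x62C, 0xFEAD => 0x631];
    return $table[$cp] ?? $cp; // leave everything else untouched
};

echo map_code_points("\u{FEB5}\u{FE9D}\u{FEAD}", $demo), "\n";
```

Any mapping function that takes and returns an integer code point could be passed as the second argument in place of $demo.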
