
Text Conversion in Perl


Mario Thomas

Dec 26, 2000, 4:03:30 AM
Hi All,

I'm receiving a text file from another department in my company which
contains an article which has previously been published in a newspaper. The
purpose of the file is to upload it to the net, thereby replicating our
newspaper content online. The problem I have is that the file contains all
sorts of strange characters. I can only assume they have been put there by
Quark on the Mac. Is there any way I can convert these characters to PC
format using Perl? I have pasted a sample below:
<START TEXT>

The Phillips Report into the BSE crisis found there had been Ňa clear policy
restricting the disclosure of information about BSEÓ that robbed those with
an interest of any power to react. MP Tony Benn reckons backbench MPs need a
Freedom of Information Act as much as anybody, such is the ethos of secrecy
around the higher Zchelons of government. But, say disappointed campaigners,
this Bill is fairly toothless.

<ENDTEXT>

Any help or suggestions would be very much appreciated.

Thanks in advance

Mario


Abigail

Dec 26, 2000, 6:16:48 AM
Mario Thomas (ma...@alamar.net) wrote on MMDCLXXIV September MCMXCIII in
<URL:news:J6Z16.2530$I5.28530@stones>:
==
== The problem i have is that the file contains all
== sorts of strange characters. I can only
== assume they have been put there by Quark on the Mac. Is there anyway i can
== convert these characters to PC format using Perl?


tr/// will do. You have to make your own mapping of course.
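
For instance, a minimal sketch (the byte values here are my reading of
Apple's MacRoman-to-Unicode table, so double-check them against it):

```perl
# Map a few MacRoman bytes to their Latin-1 equivalents with tr///.
# In MacRoman, 0x8A is a-umlaut, 0x9A is o-umlaut and 0xCA is the
# no-break space; 0xE4, 0xF6 and 0xA0 are the same characters in Latin-1.
my $text = "Br\x8At ist sch\x9An";   # MacRoman bytes for "Brät ist schön"
(my $latin1 = $text) =~ tr/\x8A\x9A\xCA/\xE4\xF6\xA0/;
print $latin1, "\n";
```

The full table has over a hundred such pairs, so in practice you would
generate the two tr/// lists from the mapping file rather than type them in.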


Abigail
--
use lib sub {($\) = split /\./ => pop; print $"};
eval "use Just" || eval "use another" || eval "use Perl" || eval "use Hacker";

Mario Thomas

Dec 26, 2000, 6:27:42 AM
I thought about using tr///; the problem with that is that I will have to
create the mappings as they happen. I would prefer to find out which
character sets I am translating between. Is there a way to do this, or is
there a list of character sets somewhere?

"Mario Thomas" <ma...@alamar.net> wrote in message
news:J6Z16.2530$I5.28530@stones...

egw...@netcom.com

Dec 26, 2000, 1:44:14 PM
Mario Thomas <ma...@alamar.net> wrote:
> I thought about using tr/// the problem with that is that i will have to
> create the mappings as they happen. I would prefer to find out what
> character sets i am translating between. Is there a way to do this or is
> there a list of character sets somewhere?

Yes, you call up the person from the other department and say, "Hey,
what character mapping are you using?" Then you can go to somewhere
like
http://www.unicode.org/Public/MAPPINGS/

and create your tr/// (or maybe s/// if you're going to convert to
HTML character entities). Then take the rest of the day off.
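
A sketch of the s///-to-entities route (the MacRoman byte values 0xD2 and
0xD3 for the double curly quotes are my assumption to verify against the
mapping file):

```perl
# Replace known MacRoman bytes with named HTML entities; any byte above
# 0x7F that we have no mapping for is left untouched.
my %entity = ("\xD2" => '&ldquo;', "\xD3" => '&rdquo;');
my $text = "\xD2a clear policy\xD3";   # curly quotes as MacRoman bytes
$text =~ s/([\x80-\xFF])/$entity{$1} || $1/ge;
print $text, "\n";
```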


Bart Lateur

Dec 26, 2000, 5:45:52 PM
Mario Thomas wrote:

>The problem i have is that the file contains all
>sorts of strange characters. I can only
>assume they have been put there by Quark on the Mac. Is there anyway i can
>convert these characters to PC format using Perl?

Yup, you're right. The reason is that the Mac's character set and the
PC's (Windows) character set are different. Since you want to place the
texts on the Internet, converting the characters to ISO-Latin-1 will be
good enough. And ISO-Latin-1 is a subset of Unicode. Somebody suggested
getting the character set tables from Unicode.org's FTP site, and that's
a good suggestion. All you need is this file:
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT>.

Now, about the file format: these files are plain (ASCII) text, with each
line either just a comment, starting with "#", or the encoding for one
character. Let's take an example:

0xCA	0x00A0	# NO-BREAK SPACE

There are three columns in each line, separated by tabs. The first
column is the character code in the proprietary set, the second column is
the character code in Unicode (ISO-Latin-1, if it's less than 256), both
in hexadecimal; and the third column is a comment, a description of the
character in plain text. This one character is the non-breaking space,
0x00A0 = 160 in ISO-Latin-1, and 0xCA = 202 on the Mac. So what you
need to do is replace characters with code 202 by characters with code
160. Amongst others.

So, here's some more complete code to do this conversion for you.

open IN, "Apple/Roman.txt" or die "Cannot open file: $!";
# substitute the correct path for the file you downloaded
while (<IN>) {
    /^\s*(0x[0-9a-fA-F]+)\t(0x[0-9a-fA-F]+)\t/ or next;
    $replace{chr hex $1} = chr hex $2;
}

# convert the data file:
while (<>) {
    s/([\200-\377])/$replace{$1}/g;
    print;
}

Some final remarks:

* the Mac's characters with codes below 128 are plain ASCII, i.e. the
same as ISO-Latin-1. No need to convert anything there, apart from the
line ends: the Mac uses CR only, the PC takes CR+LF, Unix takes only LF.
Note that in Perl on the PC, "\n" is LF only, but this gets converted into
CR+LF when printing to a text file (the normal file mode).

* I do not check for characters that are not in ISO-Latin-1 (codes 256
and above), or even in Unicode (e.g. the Apple symbol). You won't
encounter too many of those, I hope.
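
The line-end conversion from the first remark can be done with one
substitution (a minimal in-memory sketch; for a real file you would slurp
the data first):

```perl
# Normalise Mac-style bare-CR line ends (and any stray CR+LF) to "\n".
my $data = "line one\rline two\r";   # Mac line ends: CR only
$data =~ s/\r\n?/\n/g;               # CR or CR+LF both become a single LF
print $data;
```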

--
Bart.

Martijn Lievaart

Dec 26, 2000, 7:17:01 PM
"Mario Thomas" <ma...@alamar.net> wrote in <Xd%16.2760$I5.29687@stones>:

[ Requoted, please answer below the quotes. Thanks ]

[ Please reply to the article you are replying to, makes it easier for
others to see what exactly you are replying to. Thanks. ]

There are many character sets out there, and you should get acquainted with
them. However, the example you give doesn't seem to be in any particular
character set; there are just funny characters in between the text. Ask the
people who give you these data sets what the meaning of those characters is
(but first get it clear what character set they are using!).

Point is, you are the user of their input, and they should be able to
describe what input they are delivering to you. Alternatively, you could
state what you want to receive, but that is a luxury we don't often have.
:-(

OK, short primer on character sets. There are a lot of them, but only ASCII
and EBCDIC survived as the basic A-Za-z0-9-and-funny-characters sets. Of
those, only ASCII is worth talking about, as it is the basis for all other
current character sets[1][2].

Now, ASCII only defines codes 0-127, so others have filled in the gap from
128-255. One character set that was used very often is IBM-extended, which
was burned into the ROMs of all IBM compatibles. That is not used a lot
anymore. ANSI/ISO defined some character sets, which are used a lot.
In particular, ISO 8859-1 is now the current standard on the Internet, but
others do exist. I'm talking about character sets that have ASCII as their
lower 128 character codes; they redefine the upper 128 codes.

This doesn't work too well in practice (talk to any European who has had to
convert between character sets), so they thought up something else: Unicode.
That is a 16-bit code[3] that encodes most commonly used characters. The
idea is good and it pretty much works as advertised. Some problems do
surface, though. Most Unices don't have much support for it yet, nor does
any other OS but Windows NT. Most people don't understand all the issues
surrounding Unicode (I certainly don't), but then most people stay
blissfully ignorant of character sets even when they hit them in the face!

Other problems with Unicode are that no font will show all possible
characters, so there will be some trouble displaying random Unicode
strings. Look to your OS for solutions.

However, Unicode is probably the way to go, and Perl support for Unicode is
probably one of the most exciting things that has happened in computing over
the past decade. Oh, by the way, don't get hung up on UTF-8 (or UTF-x in
general); it is just a funny way of encoding Unicode. It's an encoding
only; down below it's just Unicode. You'll probably encounter it more and
more, so just be aware of it.

You asked for a list of character sets. Just search the web for the terms I
gave above; it should turn up plenty. Note that the ISO standards do cost
money, so if you really need them you probably have to buy them. But they
(or the gist of them) should be on the web somewhere.

OK, back to your problem. All those character sets inherited from ASCII.
Although there are some funny Unicode end-of-line, end-of-whatever, etc.
characters, basically nowadays you get either 8-bit or 16-bit input, with
16-bit being very rare. If it is 8-bit, it is either UTF-8-encoded Unicode,
or it is an extended ASCII, probably 8859-1. But the people who give you the
input should be able to tell you what character set the input is in, and if
applicable, what the funny characters mean. In your case, it seems that
there is just ASCII with some funny characters.
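
One crude way to tell those two 8-bit cases apart is to test whether the
data happens to parse as well-formed UTF-8 (a simplified sketch: this
pattern accepts a few sequences a strict validator would reject, such as
overlong encodings):

```perl
# True if the string consists entirely of ASCII plus well-formed two-
# and three-byte UTF-8 sequences; random extended-ASCII text will
# almost always fail this test.
sub looks_like_utf8 {
    return $_[0] =~ /^(?: [\x00-\x7F]
                        | [\xC2-\xDF][\x80-\xBF]
                        | [\xE0-\xEF][\x80-\xBF]{2}
                      )*$/x;
}
```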

I may be a sucker for these cases, but I always /demand/ to know what input
I get. That normally means knowing the character set and the end-of-line
character(s). Without that knowledge (possibly learned from sample files),
I cannot go to work. (There are many more issues, like what the end-of-
string separator is, if any, what an "empty field" means, etc.)

HTH,
M4
[1] This does not mean that EBCDIC (and even others!) are not used anymore,
just that their usage is rare and conversions to ASCII and its derivatives
always exist.

[2] I once was project leader for a big project involving the exchange of
data files. After inquiring what character set they used, I learned that the
input would be translated from IBM-extended to some form of extended
EBCDIC, just to be translated back at the other end to IBM-extended.
Needless to say, 1) I short-circuited things, and 2) the other end said
they worked with IBM-extended while in reality they worked with 8859-1.
Duh!

[3] There also seems to be a 32 bit Unicode, but I don't know anything
about that.

John Delacour

Dec 26, 2000, 7:34:04 PM
At 9:03 am +0000 26/12/00, Mario Thomas wrote:

>I'm receiving a text file from another department in my company which

>contains an article ...Quark on the Mac.... a sample below:

><START TEXT>
>The Phillips Report into the BSE crisis found there had been “a clear
>policy restricting the disclosure of information about BSE” that...
><ENDTEXT>

Some people have suggested using tr/// to convert these curly quotes etc.
from Mac to Latin-1. The problem is that there are no curly quotes in
Latin-1. The curly quotes you now have in Windows belong to the charset
windows-1252.

The normal default for browsers is Latin-1, and unless you write your <HEAD>
correctly with the proper META tag for the charset, many people will not be
able to see the curly quotes properly without manually adjusting the
encoding, which is beyond the wit of most people. Older browsers won't
interpret them properly anyway. If you want curly quotes, you need to use
&lsquo; &rsquo; &ldquo; &rdquo; for ‘ ’ “ ” respectively. Otherwise you can
use Unicode, UTF-7 or UTF-8, but older browsers will produce garbage from
that.

tr/// is no good for anything but transliteration from one table to another
of the same size.

##1
#!perl -w
$mac = '“He’s here again.”, she said.';
(my $windows1252 = $mac) =~ tr/‘’“”/ëíìî/;
print "\n$windows1252\n";
##2
$_ = $mac;
s/‘/&lsquo;/g;
s/’/&rsquo;/g;
s/“/&ldquo;/g;
s/”/&rdquo;/g;
print "\n$mac\n$windows1252\n$_\n";

Method 2 is what you need. I've asked before for a Perl routine to map
from array to array, rather than charlist to charlist, but I've never got
an answer. There MUST be a quicker and easier way.
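
One way to get that array-to-array effect is a hash plus a single s///; a
sketch (the MacRoman byte values 0xD2-0xD5 for the four curly quotes are my
assumption to check against the mapping table):

```perl
# Build a byte-to-entity hash once, then substitute in a single pass.
my %ent = (
    "\xD2" => '&ldquo;', "\xD3" => '&rdquo;',
    "\xD4" => '&lsquo;', "\xD5" => '&rsquo;',
);
my $class = join '', map quotemeta, keys %ent;   # character class of the keys
my $mac = "\xD2He\xD5s here again.\xD3, she said.";
(my $html = $mac) =~ s/([$class])/$ent{$1}/g;
print $html, "\n";
```

This scales to any number of pairs without adding more passes over the
string.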

JD


John Delacour

Dec 27, 2000, 7:55:39 AM
to Mario Thomas
This routine seems to work well. I've not yet tried it on huge files, but I
guess its performance would not be bad. The script prints a sample string
and then reads a Mac file and writes the equivalent Unicode HTML character
entities to another file. I have written the script in such a way that if
the user is working on a Mac, the file "temp.in" will first be written with
a list of all the Mac characters 0x80 and greater. On other systems, just
paste the Mac stuff into your $fin file and save it.

The text stream is read one character at a time. If the character is in
the extended ASCII range, it is converted to an HTML entity.

As I'm only a beginner in Perl, I am sure there is a neater way of doing
this, but I haven't seen it. I hope someone more experienced will supply
it.

JD

##################
@html = split /\s/,
'&#xC4; &#xC5; &#xC7; &#xC9; &#xD1; &#xD6; &#xDC; &#xE1;
&#xE0; &#xE2; &#xE4; &#xE3; &#xE5; &#xE7; &#xE9; &#xE8;
&#xEA; &#xEB; &#xED; &#xEC; &#xEE; &#xEF; &#xF1; &#xF3;
&#xF2; &#xF4; &#xF6; &#xF5; &#xFA; &#xF9; &#xFB; &#xFC;
&#x2020; &#xB0; &#xA2; &#xA3; &#xA7; &#x2022; &#xB6; &#xDF;
&#xAE; &#xA9; &#x2122; &#xB4; &#xA8; &#x2260; &#xC6; &#xD8;
&#x221E; &#xB1; &#x2264; &#x2265; &#xA5; &#xB5; &#x2202;
&#x2211; &#x220F; &#x3C0; &#x222B; &#xAA; &#xBA; &#x3A9;
&#xE6; &#xF8; &#xBF; &#xA1; &#xAC; &#x221A; &#x192;
&#x2248; &#x2206; &#xAB; &#xBB; &#x2026; &#xA0; &#xC0;
&#xC3; &#xD5; &#x152; &#x153; &#x2013; &#x2014; &#x201C;
&#x201D; &#x2018; &#x2019; &#xF7; &#x25CA; &#xFF; &#x178;
&#x2044; &#x20AC; &#x2039; &#x203A; &#xFB01; &#xFB02;
&#x2021; &#xB7; &#x201A; &#x201E; &#x2030; &#xC2; &#xCA;
&#xC1; &#xCB; &#xC8; &#xCD; &#xCE; &#xCF; &#xCC; &#xD3;
&#xD4; &#xF8FF; &#xD2; &#xDA; &#xDB; &#xD9; &#x131;
&#x2C6; &#x2DC; &#xAF; &#x2D8; &#x2D9; &#x2DA; &#xB8;
&#x2DD; &#x2DB; &#x2C7;';

############# Convert a string Mac to Unicode;
$macstring = 'été übel “hello” Stuffit™';
@list = split //, $macstring;
foreach (@list) {
    $ord = ord();
    if ($ord > 127) {
        $n = $ord - 128;
        print $html[$n];
    } else {
        print;
    }
}
print $/;

############# Convert a Mac file to Unicode;
$dir = 'd:documents';    # folder
$fin = "$dir:temp.in";   # file to read
$fout = "$dir:temp.out"; # file to write
$_ = $ENV{'TMPDIR'};
$OS = '';
if (m/:/) { # if the OS is Mac
    $OS = 'mac';
    open FIN, ">$fin"; # Write to $fin only if running on a Mac...
    # ...otherwise work with existing data
    for ($i = 128; $i < 256; $i++) {
        print FIN chr($i) . ' ';
    }
    close FIN;
}
open FIN, "<$fin";
open FOUT, ">$fout";
print FOUT "<html>\n";
while (read FIN, $_, 1) {
    $ord = ord();
    if ($ord > 127) {
        $n = $ord - 128;
        print FOUT $html[$n];
    } else {
        print FOUT;
    }
}
print "\a\n", time - $^T, ' seconds.';
close FIN; close FOUT;

############# View results in editor -- Mac only
if ($OS eq 'mac') {
MacPerl::DoAppleScript(<<END_SCRIPT);
tell app "Finder" to ¬
open {item "$fout", item "$fin"} using ¬
application file id "R*ch"
END_SCRIPT
}

george...@my-deja.com

Dec 27, 2000, 8:15:35 PM

> At 9:03 am +0000 26/12/00, Mario Thomas wrote:
>
> I'm receiving a text file from another department in my company which
> contains an article ...Quark on the Mac...

I'm sorry to say this, but why on earth are you trying to fix this
problem with Perl? Of course it's possible, but it's probably the most
roundabout solution you could possibly use.

These aren't "weird characters put into the file by Quark", they're
perfectly logical characters for a Mac/Print version of the file. To
have left them *out* would have marked the DTP person as an amateur...

Get in contact with the person who sent you the file and ask them for a
plain text version without the smart quotes and em-dashes etc. I bet
you they've come across this problem before!

They can do that from Quark, but if they can't, they can run it through
their version of Word or just about *anything* and have those
characters replaced with non-Mac-specific characters. Hell, send it to
*me* and I'll open the file, make one mouseclick, save the file and
send it back.

It just seems counterproductive to me to have a fantastically
complicated program like Quark produce a very specific file, then to
attack the arcane specificities of that file with Perl. You're using a
sledgehammer to crack a nut, and moreover it's a nut sent to you by
someone who owns a $1000 nutcracker...


Sent via Deja.com
http://www.deja.com/
