Umlaute (ascii > 127?) within external files

Peter Schupp

Oct 11, 2000, 3:00:00 AM

Hi,

unfortunately I've got a real German problem: does somebody know how to
treat Umlaute (ÄÖÜ äöü etc.)?

I would like to read strings from an external file (using "with-open-file"
and "read-line") and move all words (separated by blanks) into a list.
Afterwards I would like to compare the list's entries with components.

When I read the string from an external file and move it character by
character into a list, I can see that the Umlaute (ÄÖÜ) are read as
double-byte chars.

This may be the content of my external file
-----------------------------------------------------
Arme Hände Füße Beine
Ohren Nase Test Kühe
123 abc def ghi

This is my test code
---------------------------

(defun test-read-char-separated-file (v_in-file v_col-sep)
  ; v_col-sep is not used yet; this test only collects single characters
  (let (v_list v_line v_pos)

    ; ----- testing and debugging purposes -----
    (log-message (format () "\n--- function test-read-char-separated-file started ---\n"))

    ; open file and read lines
    (with-open-file (v_in-stream v_in-file :direction :input)
      (while (not (eql (setq v_line (read-line v_in-stream)) 'eof))
        (setq v_pos 0)
        (log-message (format () ": ~A" (string v_line)))
        ; loop until the pointer reaches the end of the input string
        (while (< v_pos (string-length v_line))
          (log-message (format () "char ~D : ~A" v_pos (char v_line v_pos)))
          (push (char v_line v_pos) v_list)
          ; move on to next position
          (inc v_pos))))

    ; reverse the result list v_list
    (setq v_list (reverse v_list))
    ; ----- test and debug
    (log-message (format () "\n--- function test-read-char-separated-file finished ---\n"))
    ; return the list
    v_list))

--
_______________________________________________________________________

mailto:Peter....@object-it.de Private:

STZ object-IT Tel. 0711 18 39 78 6 | Kirchheimer Str. 18
Postfach 10 43 62 Fax 0711 18 39 68 7 | 73760 Ostfildern-Ruit
D-70038 Stuttgart D2 0172 9 06 71 62 | Tel 0711 44 16 06 5

PGP Key available at: http://wwwkeys.de.pgp.net
_______________________________________________________________________

Rainer Joswig

Oct 11, 2000, 3:00:00 AM

In article <39E48166...@object-it.de>, Peter....@object-it.de wrote:

> Hi,
>
> unfortunately I've got a real German problem: does somebody know how to
> treat Umlaute (ÄÖÜ äöü etc.)?

This "problem" exists for a lot of languages.

You didn't tell us which OS and which Lisp you are using.

> I would like to read strings from an external file (using "with-open-file"
> and "read-line") and move all words (separated by blanks) into a list.
> Afterwards I would like to compare the list's entries with components.
>
> When I read the string from an external file and move it character by
> character into a list, I can see that the Umlaute (ÄÖÜ) are read as
> double-byte chars.

So what is the problem?

Try stuff like this (use one of the SPLIT-STRING functions
that have been posted to comp.lang.lisp recently).

(defun read-file-as-delimited-lines (stream column-character)
  (loop for line = (read-line stream nil nil)
        while line
        collect (ccl::split-string line :item column-character)))

(defun test (string)
  (with-input-from-string (stream string)
    (read-file-as-delimited-lines stream #\space)))

(test "Arme Hände Füße Beine
Ohren Nase Test Kühe
123 abc def ghi")

-> (("Arme" "Hände" "Füße" "Beine") ("Ohren" "Nase" "Test" "Kühe") ("123" "abc" "def" "ghi"))
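CCL::SPLIT-STRING is internal to one implementation, so it may not exist elsewhere. A minimal portable sketch of such a splitter, assuming standard Common Lisp, could look like this. The name SPLIT-ON-CHAR is hypothetical and not one of the functions posted to the newsgroup; whether Interleaf Lisp offers LOOP, POSITION and SUBSEQ in this form is also an assumption.

```lisp
;; Hypothetical helper (not from the thread): split STRING at every
;; occurrence of DELIMITER and drop empty pieces, so that runs of
;; blanks do not produce empty words.
(defun split-on-char (string delimiter)
  (loop with start = 0
        for end = (position delimiter string :start start)
        for piece = (subseq string start end)
        unless (zerop (length piece))
          collect piece
        while end
        do (setf start (1+ end))))
```

Calling (split-on-char "Ohren Nase Test Kühe" #\Space) would then give one string per word, provided the characters were decoded correctly when the line was read.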

--
Rainer Joswig, Hamburg, Germany
Email: mailto:jos...@corporate-world.lisp.de
Web: http://corporate-world.lisp.de/

Lieven Marchand

Oct 11, 2000, 3:00:00 AM

Peter Schupp <Peter....@object-it.de> writes:

> Hi,
>
> unfortunately I've got a real German problem: does somebody know how to
> treat Umlaute (ÄÖÜ äöü etc.)?

What implementation are you using? This stuff is still fairly
implementation-specific. One possibility is to look in your vendor
documentation for the values it accepts for the :EXTERNAL-FORMAT
keyword argument to OPEN.
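As a rough sketch of what that looks like in standard Common Lisp: WITH-OPEN-FILE passes :EXTERNAL-FORMAT through to OPEN. The function name READ-LINES-WITH-FORMAT is made up for illustration, and only :DEFAULT is guaranteed by the standard; any other value is an assumption about a particular implementation.

```lisp
;; Sketch only: the accepted :EXTERNAL-FORMAT values are defined by
;; each implementation; :DEFAULT is the only portable one.
(defun read-lines-with-format (path &optional (external-format :default))
  "Return the lines of the file at PATH as a list of strings."
  (with-open-file (in path :direction :input
                           :external-format external-format)
    (loop for line = (read-line in nil nil)
          while line
          collect line)))
```

Some implementations accept a keyword such as :iso-8859-1 or :utf-8 as the second argument here, but the exact names have to come from the vendor documentation.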

--
Lieven Marchand <m...@bewoner.dma.be>
Lambda calculus - Call us a mad club

Peter Schupp

Oct 12, 2000, 2:50:46 AM

Thanks for your support so far, and for the follow-up questions.

This is my environment:

I'm using Interleaf Lisp within Quicksilver 7, a DTP system that uses,
for sure, a very special implementation of Lisp. I run it on an Intel
Windows NT 4.x / Windows 98 system.

Peter

Pierre R. Mai

Oct 12, 2000, 3:00:00 AM

Peter Schupp <Peter....@object-it.de> writes:

> When I read the string from an external file and move it character by
> character into a list, I can see that the Umlaute (ÄÖÜ) are read as
> double-byte chars.

This would seem to indicate that the external file is encoded in
something like UTF-8 (i.e. Unicode in a variable-length byte encoding).
This is most likely not what you want. What you probably really want
is ISO Latin-1 encoding, or the Windows mangling of said encoding.

You should probably first check whether the external file is in UTF-8
encoding. Furthermore you need to check which implementation of Lisp
you are using, and what kinds of external formats it supports (see
documentation of OPEN). If you are in luck, then it might support
reading UTF-8 directly.
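One way to make that first check without relying on any external format is to look at the raw bytes. The following is a minimal sketch in standard Common Lisp; the names FILE-OCTETS, CONTINUATION-OCTET-P and LOOKS-LIKE-UTF-8-P are hypothetical, and the check is only structural (it does not reject every ill-formed sequence).

```lisp
(defun file-octets (path)
  "Read the file at PATH as a vector of raw bytes."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(defun continuation-octet-p (octet)
  "True if OCTET has the bit pattern 10xxxxxx."
  (= (ldb (byte 2 6) octet) #b10))

(defun looks-like-utf-8-p (octets)
  "True if OCTETS contains at least one multi-byte sequence and every
lead byte is followed by the right number of continuation bytes.
Pure ASCII input yields NIL, since it says nothing either way."
  (let ((index 0)
        (end (length octets))
        (multi-byte-seen nil))
    (loop
      (when (>= index end)
        (return multi-byte-seen))
      (let* ((octet (aref octets index))
             ;; Number of continuation bytes expected after OCTET,
             ;; or NIL if OCTET cannot start a sequence.
             (extra (cond ((< octet #x80) 0)
                          ((< octet #xC2) nil)
                          ((< octet #xE0) 1)
                          ((< octet #xF0) 2)
                          ((< octet #xF5) 3)
                          (t nil))))
        (unless extra
          (return nil))
        (unless (< (+ index extra) end) ; sequence cut off at the end
          (return nil))
        (unless (loop for k from 1 to extra
                      always (continuation-octet-p
                              (aref octets (+ index k))))
          (return nil))
        (when (plusp extra)
          (setf multi-byte-seen t))
        (incf index (1+ extra))))))
```

A Latin-1 file containing Umlaute will normally fail this check, because a byte such as #xE4 is not followed by continuation bytes; a call like (looks-like-utf-8-p (file-octets "words.txt")) uses a made-up file name.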

If you are not in luck, then you might want to convert the external
file into Latin-1 beforehand, or you'll have to do the conversion in
Lisp, for example like so:

(defun convert-utf8-to-latin1 (string)
  (declare (string string) (optimize (speed 3)))
  (with-output-to-string (stream)
    (let ((length (length string))
          (index 0))
      (declare (fixnum length index))
      (loop
        (unless (< index length) (return nil))
        (let* ((char (char string index))
               (code (char-code char)))
          (cond
            ((< code #x80) ; ASCII
             (write-char char stream)
             (incf index 1))
            ((< code #xC0)
             ;; We are in the middle of a multi-byte sequence!
             ;; This should never happen, so we raise an error.
             (error "Encountered illegal multi-byte sequence."))
            ((< code #xC4)
             ;; Two byte sequence in Latin-1 range
             (unless (< (1+ index) length)
               (error "Encountered incomplete two-byte sequence."))
             (let* ((char2 (char string (1+ index)))
                    (code2 (char-code char2)))
               (unless (and (logbitp 7 code2) (not (logbitp 6 code2)))
                 (error "Second byte in sequence is not a continuation."))
               (let* ((upper-bits (ldb (byte 2 0) code))
                      (lower-bits (ldb (byte 6 0) code2))
                      (new-code (dpb upper-bits (byte 2 6) lower-bits)))
                 (write-char (code-char new-code) stream)))
             (incf index 2))
            ((>= code #xFE)
             ;; Ignore stray byte-order markers
             (incf index 1))
            (t
             (error "Multi-byte sequence outside Latin-1 range."))))))))

Note that this is not in any way the most efficient way to do the
conversion (a table driven approach would probably work best). It
also relies on the charset of the implementation being Latin-1 and
char-code/code-char being implemented accordingly.

But it should suffice to test if you are indeed dealing with UTF-8 or
not...
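To see the bit arithmetic of the two-byte branch in isolation, here is a small worked sketch. It uses the same LDB/DPB operations as the function above; the name LATIN-1-CODE-FROM-UTF-8-PAIR is made up for illustration, and it assumes both bytes have already been validated.

```lisp
;; Worked example: a-umlaut is U+00E4, which UTF-8 encodes as the two
;; bytes #xC3 #xA4.  The lead byte 110000xx carries the top two bits,
;; the continuation byte 10xxxxxx carries the low six bits.
(defun latin-1-code-from-utf-8-pair (code code2)
  "Combine lead byte CODE (#xC0..#xC3) and continuation byte CODE2
into a character code in the Latin-1 range."
  (let ((upper-bits (ldb (byte 2 0) code))    ; #xC3 -> #b11
        (lower-bits (ldb (byte 6 0) code2)))  ; #xA4 -> #b100100
    ;; Place the two upper bits above the six lower bits.
    (dpb upper-bits (byte 2 6) lower-bits)))  ; -> #b11100100 = #xE4
```

So (latin-1-code-from-utf-8-pair #xC3 #xA4) evaluates to #xE4, the Latin-1 code of a-umlaut, and #xC3 #x9C gives #xDC, the code of U-umlaut.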

Regs, Pierre.

--
Pierre R. Mai <pm...@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein