
decoding to ascii


John Burke

Mar 31, 2005, 6:24:27 PM

Sometimes I'll move some text from antiword into an iso8 xemacs buffer
to rework, and I end up with character references for long dashes, and
strange, almost superscripted, single quotes (etc.). I'd like to be
able to simply decode into single byte ascii. How do I do that?

BTW, I came across "decode-hz-buffer" (for Chinese encodings?), and it
*almost* does the trick, but not quite. I'd kinda like a
decode-m$-buffer command...

jb

giacomo boffi

Apr 1, 2005, 4:00:48 PM

John Burke <j...@museumca.org> writes:

> Sometimes I'll move some text from antiword into an iso8 xemacs
> buffer to rework, and I end up with character references for long
> dashes, and strange, almost superscripted, single quotes (etc.).
> I'd like to be able to simply decode into single byte ascii.
> How do I do that?

I don't know about single-byte ASCII, but _maybe_ you could ask
antiword to do the conversion:

,---- from antiword(1)
| -m mapping file
| This file is used to map Unicode characters to your
| local character set. The default is UTF-8.txt in
| locales that support UTF-8 and 8859-1.txt in other
`----

--
We made Israel, and in theory it would be up to us to safeguard it.
But Israel behaves like an independent state, and doesn't give a damn
about what we say. -- Termy, in IFQ

John Burke

Apr 1, 2005, 6:40:23 PM

>>>>> "g" == giacomo boffi <giacom...@polimi.it> writes:
g>

g> John Burke <j...@museumca.org> writes:
>> Sometimes I'll move some text from antiword into an iso8 xemacs
>> buffer to rework, and I end up with character references for long
>> dashes, and strange, almost superscripted, single quotes (etc.).
>> I'd like to be able to simply decode into single byte ascii. How
>> do I do that?
g>
g> single byte ascii i don't know, but _maybe_ you could ask
g> antiword to do the conversion

Good idea! I looked into map-files, but came across no-word.el which
uses antiword to bring M$.docs into a buffer (I was using a separate
terminal app). It does such a great job, now I'm cooking with gas!

jb

Aidan Kehoe

Apr 1, 2005, 11:18:33 AM

On the thirty-first of March, John Burke wrote:

You haven’t said what encoding antiword uses, Windows-1252 or UTF-8. To
decode the latter, you’ll need Mule-UCS and a call to (decode-coding-string
... 'utf-8).
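
As a rough illustration of that call (a sketch, assuming a `utf-8' coding
system is available, which on XEmacs 21.4 means Mule-UCS is loaded), the
three bytes below are the UTF-8 encoding of U+2019, the right single
quotation mark. The variable name is made up for the example.

```elisp
;; Hypothetical example: decode raw UTF-8 bytes into one character.
;; "\342\200\231" is the octal spelling of the bytes E2 80 99, which
;; encode U+2019 RIGHT SINGLE QUOTATION MARK.
(defvar my-decoded-quote
  (decode-coding-string "\342\200\231" 'utf-8)
  "RIGHT SINGLE QUOTATION MARK, decoded from its UTF-8 bytes.")
```

For a whole region rather than a string, decode-coding-region takes START,
END and the coding system in the same way.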

If you’ve decoded the UTF-8, here’s some code that may be useful. It
requires Mule-UCS on 21.4--that is, you’ll need a line (require 'un-define)
before the code in your ~/.xemacs/init.el. On 21.5, (fset 'ucs-to-char
'unicode-to-char) should be sufficient. The code transforms some of the
fancy typography to vanilla ASCII equivalents.

(defconst sundry-chars-to-latin-1-map
  (let ((ct (make-char-table 'char))
        (chars-to-map
         #s(hash-table data
                       (#x20AC ?e  ;; EURO SIGN
                        #x201A ?\' ;; SINGLE LOW-9 QUOTATION MARK
                        #x201E ?\" ;; DOUBLE LOW-9 QUOTATION MARK
                        #x2018 ?\' ;; LEFT SINGLE QUOTATION MARK
                        #x2019 ?\' ;; RIGHT SINGLE QUOTATION MARK
                        #x201C ?\" ;; LEFT DOUBLE QUOTATION MARK
                        #x201D ?\" ;; RIGHT DOUBLE QUOTATION MARK
                        #x2022 ?·  ;; BULLET
                        #x2013 ?-  ;; EN DASH
                        #x2014 ?-  ;; EM DASH
                        #x02DC ?~  ;; SMALL TILDE
                        ))))
    ;; Skip any code point for which ucs-to-char returns nil.
    (maphash #'(lambda (key value)
                 (if (setq key (ucs-to-char key))
                     (put-char-table key value ct)))
             chars-to-map)
    ct)
  "Mapping from some random Unicode code points to Latin 1.
To be used when sending mail to non-techie people whose mail clients choke
on UTF-8. ")


(defun trim-buffer-to-latin-1 ()
  "If I'm corresponding with someone who's using a mail client that chokes
on UTF-8, and they're not vaguely techie, there's no reason to give them
hassle with the broken UTF-8. Call this function after writing a mail, in
that case. "
  (interactive)
  (save-excursion
    (save-restriction
      (let (begin end)
        (message "Trimming buffer to Latin 1 ...")
        (goto-char (point-min))
        ;; Skip a run of Latin-1 text, then a run of anything else. END is
        ;; the length of that second run, which is translated in place.
        (while (not (zerop (setq begin (skip-chars-forward "\001-\377")
                                 end (skip-chars-forward "^\001-\377"))))
          (translate-region (point) (- (point) end)
                            sundry-chars-to-latin-1-map))
        ;; Replace any NUL characters left in the buffer with a period.
        (goto-char (point-min))
        (while (search-forward "\0" nil t)
          (replace-match "." nil t))
        (message "Trimming buffer to Latin 1 ... done.")))))

For the Windows-1252 case, here’s some more code, which you should be able
to combine with trim-buffer-to-latin-1 above:

;; begin non-standard-1252.el
;; Make sure we have a unicode transformation function available.
(if (fboundp 'unicode-to-char)
    (fset 'ucs-to-char 'unicode-to-char)
  (require 'un-define))

(defconst non-standard-1252-char-map
  (let ((ct (make-char-table 'char))
        (ucs-code nil) (mule-char nil)
        (windows-1252-extra-chars
         [ #x20AC ;; EURO SIGN
           nil    ;; UNDEFINED
           #x201A ;; SINGLE LOW-9 QUOTATION MARK
           #x0192 ;; LATIN SMALL LETTER F WITH HOOK
           #x201E ;; DOUBLE LOW-9 QUOTATION MARK
           #x2026 ;; HORIZONTAL ELLIPSIS
           #x2020 ;; DAGGER
           #x2021 ;; DOUBLE DAGGER
           #x02C6 ;; MODIFIER LETTER CIRCUMFLEX ACCENT
           #x2030 ;; PER MILLE SIGN
           #x0160 ;; LATIN CAPITAL LETTER S WITH CARON
           #x2039 ;; SINGLE LEFT-POINTING ANGLE QUOTATION MARK
           #x0152 ;; LATIN CAPITAL LIGATURE OE
           nil    ;; UNDEFINED
           #x017D ;; LATIN CAPITAL LETTER Z WITH CARON
           nil    ;; UNDEFINED
           nil    ;; UNDEFINED
           #x2018 ;; LEFT SINGLE QUOTATION MARK
           #x2019 ;; RIGHT SINGLE QUOTATION MARK
           #x201C ;; LEFT DOUBLE QUOTATION MARK
           #x201D ;; RIGHT DOUBLE QUOTATION MARK
           #x2022 ;; BULLET
           #x2013 ;; EN DASH
           #x2014 ;; EM DASH
           #x02DC ;; SMALL TILDE
           #x2122 ;; TRADE MARK SIGN
           #x0161 ;; LATIN SMALL LETTER S WITH CARON
           #x203A ;; SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
           #x0153 ;; LATIN SMALL LIGATURE OE
           nil    ;; UNDEFINED
           #x017E ;; LATIN SMALL LETTER Z WITH CARON
           #x0178 ;; LATIN CAPITAL LETTER Y WITH DIAERESIS
           ]))
    (dotimes (i (length windows-1252-extra-chars))
      (setq ucs-code (aref windows-1252-extra-chars i)
            mule-char (if ucs-code (ucs-to-char ucs-code) nil))
      (if mule-char
          (put-char-table (make-char 'control-1 i) mule-char ct)
        (put-char-table (make-char 'control-1 i) (make-char 'control-1 i) ct)))
    ct)
  "Mapping from the characters in the `control-1' character set to
the corresponding characters in Windows 1252. ")

(defun non-standard-1252-transform (begin end)
  "Translate the control characters to their Windows-1252 equivalents."
  (save-excursion
    (save-restriction
      (narrow-to-region begin end)
      (goto-char begin)
      (while (not (zerop (setq begin (skip-chars-forward "^\200-\237")
                               end (skip-chars-forward "\200-\237"))))
        (translate-region (point) (- (point) end)
                          non-standard-1252-char-map)))))

(provide 'non-standard-1252)
;; end non-standard-1252.el
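
If plain ASCII is the goal, as in the original question, the same idea can
be sketched without char tables, using only search and replace. This is a
minimal sketch and not part of the code above: my-ucs-char,
my-ascii-fallbacks and my-asciify-region are names invented for the
example, the list of replacements is a guess at what Word documents usually
contain, and it assumes the buffer has already been decoded so that each
fancy character is a single character.

```elisp
;; A sketch only: every name starting with my- is invented for this example.

(defun my-ucs-char (code)
  "Return the character for Unicode code point CODE, or nil."
  (cond ((fboundp 'unicode-to-char) (unicode-to-char code)) ; XEmacs 21.5
        ((fboundp 'ucs-to-char) (ucs-to-char code))         ; Mule-UCS
        ((fboundp 'decode-char) (decode-char 'ucs code))    ; GNU Emacs
        (t nil)))

(defvar my-ascii-fallbacks
  '((#x2018 . "'")    ; LEFT SINGLE QUOTATION MARK
    (#x2019 . "'")    ; RIGHT SINGLE QUOTATION MARK
    (#x201C . "\"")   ; LEFT DOUBLE QUOTATION MARK
    (#x201D . "\"")   ; RIGHT DOUBLE QUOTATION MARK
    (#x2013 . "-")    ; EN DASH
    (#x2014 . "--")   ; EM DASH
    (#x2026 . "..."))  ; HORIZONTAL ELLIPSIS
  "Alist of (CODE-POINT . ASCII-STRING) replacements.")

(defun my-asciify-region (begin end)
  "Replace some fancy typography between BEGIN and END with ASCII."
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region begin end)
      (let ((pairs my-ascii-fallbacks) ch)
        (while pairs
          (setq ch (my-ucs-char (car (car pairs))))
          ;; Skip code points this Emacs cannot represent.
          (when ch
            (goto-char (point-min))
            (while (search-forward (char-to-string ch) nil t)
              (replace-match (cdr (car pairs)) t t)))
          (setq pairs (cdr pairs)))))))
```

Select the pasted text and run M-x my-asciify-region. Anything not in the
alist is left alone, so extend my-ascii-fallbacks as new characters turn up.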

--
“I, for instance, am gung-ho about open source because my family is being
held hostage in Rob Malda’s basement. But who fact-checks me, or Enderle,
when we say something in public? No-one!” -- Danny O’Brien
