BTW, I came across "decode-hz-buffer" (for Chinese encodings?), and it
*almost* does the trick, but not quite. I'd kinda like a
decode-m$-buffer command...
jb
> Sometimes I'll move some text from antiword into an iso8 xemacs
> buffer to rework, and I end up with character references for long
> dashes, and strange, almost superscripted, single quotes (etc.).
> I'd like to be able to simply decode into single-byte ASCII.
> How do I do that?
Single-byte ASCII I don't know, but _maybe_ you could ask antiword to
do the conversion:
,---- from antiword(1)
| -m mapping file
| This file is used to map Unicode characters to your
| local character set. The default is UTF-8.txt in
| locales that support UTF-8 and 8859-1.txt in other
`----
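One way to try that from inside XEmacs, as a rough sketch only. The command name my-antiword-insert is made up for illustration, and the code assumes antiword is on your PATH and that the 8859-1.txt mapping file named in the man page is installed where antiword looks for it.

```elisp
;; Hypothetical helper: not part of antiword, XEmacs, or no-word.el.
(defun my-antiword-insert (file)
  "Insert the text of Word document FILE at point, via antiword.
Asks antiword to map Unicode to Latin-1 using the 8859-1.txt mapping
file (assumed to be installed), so fewer fancy characters survive."
  (interactive "fWord document: ")
  ;; BUFFER argument t sends antiword's output to the current buffer.
  (call-process "antiword" nil t nil
                "-m" "8859-1.txt"
                (expand-file-name file)))
```

Run it with M-x my-antiword-insert in the buffer where you want the text.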
Good idea! I looked into map-files, but came across no-word.el which
uses antiword to bring M$.docs into a buffer (I was using a separate
terminal app). It does such a great job, now I'm cooking with gas!
jb
You haven’t said what encoding antiword uses, Windows-1252 or UTF-8. To
decode the latter, you’ll need Mule-UCS and a call to (decode-coding-string
... 'utf-8).
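A minimal sketch of that decoding step, assuming the text arrived in the buffer as undecoded UTF-8 octets and that the utf-8 coding system is available (via Mule-UCS on 21.4). The command name my-decode-buffer-as-utf-8 is made up for illustration; it uses decode-coding-region, the region counterpart of decode-coding-string.

```elisp
;; Hypothetical helper: the name is illustrative, not an XEmacs built-in.
(defun my-decode-buffer-as-utf-8 ()
  "Decode the accessible part of the current buffer as UTF-8, in place.
Assumes the buffer currently holds the raw UTF-8 octets."
  (interactive)
  (decode-coding-region (point-min) (point-max) 'utf-8))
```

If the buffer was already decoded correctly when it was read in, running this again will mangle it, so try it on a copy first.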
Once you’ve decoded the UTF-8, here’s some code that may be useful. It
requires Mule-UCS on 21.4--that is, you’ll need a line (require 'un-define)
before the code in your ~/.xemacs/init.el. On 21.5, (fset 'ucs-to-char
'unicode-to-char) should be sufficient. The code transforms some of the
fancy typography to plain ASCII or Latin-1 equivalents.
(defconst sundry-chars-to-latin-1-map
  (let ((ct (make-char-table 'char))
        (chars-to-map
         #s(hash-table data
                       (#x20AC ?e  ;; EURO SIGN
                        #x201A ?\' ;; SINGLE LOW-9 QUOTATION MARK
                        #x201E ?\" ;; DOUBLE LOW-9 QUOTATION MARK
                        #x2018 ?\' ;; LEFT SINGLE QUOTATION MARK
                        #x2019 ?\' ;; RIGHT SINGLE QUOTATION MARK
                        #x201C ?\" ;; LEFT DOUBLE QUOTATION MARK
                        #x201D ?\" ;; RIGHT DOUBLE QUOTATION MARK
                        #x2022 ?·  ;; BULLET
                        #x2013 ?-  ;; EN DASH
                        #x2014 ?-  ;; EM DASH
                        #x02DC ?~  ;; SMALL TILDE
                        ))))
    ;; Enter each code point that has a Mule character into the table.
    (maphash #'(lambda (key value)
                 (if (setq key (ucs-to-char key))
                     (put-char-table key value ct)))
             chars-to-map)
    ct)
  "Mapping from some random Unicode code points to Latin 1.
To be used when sending mail to non-techie people whose mail clients choke
on UTF-8.")
(defun trim-buffer-to-latin-1 ()
  "Replace some non-Latin-1 typography in the buffer with Latin-1 equivalents.
Call this after writing a mail to a non-technical correspondent whose mail
client chokes on UTF-8."
  (interactive)
  (save-excursion
    (save-restriction
      (let (begin end)
        (message "Trimming buffer to Latin 1 ...")
        (goto-char (point-min))
        ;; Skip a run of Latin-1 characters, then a run of anything else;
        ;; END is the length of the second run, which gets translated.
        (while (not (zerop (setq begin (skip-chars-forward "\001-\377")
                                 end (skip-chars-forward "^\001-\377"))))
          (translate-region (point) (- (point) end)
                            sundry-chars-to-latin-1-map))
        (goto-char (point-min))
        ;; Replace any NUL characters left in the buffer with a period.
        (while (search-forward "\0" nil t)
          (replace-match "." nil t))
        (message "Trimming buffer to Latin 1 ... done.")))))
For the Windows-1252 case, here’s some more code, which you should be able
to combine with the preceding:
;; begin non-standard-1252.el
;; Make sure we have a unicode transformation function available.
(if (fboundp 'unicode-to-char)
    (fset 'ucs-to-char 'unicode-to-char)
  (require 'un-define))
(defconst non-standard-1252-char-map
  (let ((ct (make-char-table 'char))
        (ucs-code nil) (mule-char nil)
        ;; Unicode code points for Windows-1252 bytes #x80 through #x9F.
        (windows-1252-extra-chars
         [ #x20AC ;; EURO SIGN
           nil    ;; UNDEFINED
           #x201A ;; SINGLE LOW-9 QUOTATION MARK
           #x0192 ;; LATIN SMALL LETTER F WITH HOOK
           #x201E ;; DOUBLE LOW-9 QUOTATION MARK
           #x2026 ;; HORIZONTAL ELLIPSIS
           #x2020 ;; DAGGER
           #x2021 ;; DOUBLE DAGGER
           #x02C6 ;; MODIFIER LETTER CIRCUMFLEX ACCENT
           #x2030 ;; PER MILLE SIGN
           #x0160 ;; LATIN CAPITAL LETTER S WITH CARON
           #x2039 ;; SINGLE LEFT-POINTING ANGLE QUOTATION MARK
           #x0152 ;; LATIN CAPITAL LIGATURE OE
           nil    ;; UNDEFINED
           #x017D ;; LATIN CAPITAL LETTER Z WITH CARON
           nil    ;; UNDEFINED
           nil    ;; UNDEFINED
           #x2018 ;; LEFT SINGLE QUOTATION MARK
           #x2019 ;; RIGHT SINGLE QUOTATION MARK
           #x201C ;; LEFT DOUBLE QUOTATION MARK
           #x201D ;; RIGHT DOUBLE QUOTATION MARK
           #x2022 ;; BULLET
           #x2013 ;; EN DASH
           #x2014 ;; EM DASH
           #x02DC ;; SMALL TILDE
           #x2122 ;; TRADE MARK SIGN
           #x0161 ;; LATIN SMALL LETTER S WITH CARON
           #x203A ;; SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
           #x0153 ;; LATIN SMALL LIGATURE OE
           nil    ;; UNDEFINED
           #x017E ;; LATIN SMALL LETTER Z WITH CARON
           #x0178 ;; LATIN CAPITAL LETTER Y WITH DIAERESIS
           ]))
    (dotimes (i (length windows-1252-extra-chars))
      (setq ucs-code (aref windows-1252-extra-chars i)
            mule-char (if ucs-code (ucs-to-char ucs-code) nil))
      ;; Positions with no usable Unicode character map to themselves.
      (if mule-char
          (put-char-table (make-char 'control-1 i) mule-char ct)
        (put-char-table (make-char 'control-1 i)
                        (make-char 'control-1 i) ct)))
    ct)
  "Mapping from the characters in the `control-1' character set to
the corresponding characters in Windows 1252.")
(defun non-standard-1252-transform (begin end)
  "Translate control characters between BEGIN and END to their
Windows-1252 equivalents."
  (save-excursion
    (save-restriction
      (narrow-to-region begin end)
      (goto-char begin)
      ;; Skip a run of ordinary characters, then a run of control-1
      ;; characters; END is the length of that second run.
      (while (not (zerop (setq begin (skip-chars-forward "^\200-\237")
                               end (skip-chars-forward "\200-\237"))))
        (translate-region (point) (- (point) end)
                          non-standard-1252-char-map)))))
(provide 'non-standard-1252)
;; end non-standard-1252.el
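Combining the two might look roughly like this. It is only a sketch: the command name my-demoronise-buffer is made up, and it assumes both trim-buffer-to-latin-1 and non-standard-1252-transform from above are already loaded. The order matters, since the Windows-1252 step has to turn the control characters into real quotes and dashes before the trimming step can flatten them.

```elisp
;; Hypothetical wrapper; relies on the two functions defined earlier
;; in this thread being loaded already.
(defun my-demoronise-buffer ()
  "Turn Windows-1252 typography in the current buffer into plain text.
First map control-1 characters to their Windows-1252 meanings, then
replace the resulting fancy characters with Latin-1 equivalents."
  (interactive)
  (non-standard-1252-transform (point-min) (point-max))
  (trim-buffer-to-latin-1))
```

With that in your init.el, M-x my-demoronise-buffer is about as close to a decode-m$-buffer command as this gets.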