[racket] windows-1252 charset decoding


John Clements

Mar 3, 2015, 7:24:13 PM
to us...@racket-lang.org
I'm trying to process a bunch of e-mail, and I've discovered that lots of it is encoded using the "windows-1252" charset.  It looks pretty straightforward to map this to unicode, but I thought I'd check: has anyone written this code already?

John Clements

Matthew Flatt

Mar 3, 2015, 7:32:30 PM
to John Clements, us...@racket-lang.org
You can use "windows-1252" as an encoding name with, for example,
`reencode-input-port`:

> (read-line (reencode-input-port (open-input-bytes #"\xA3")
                                  "windows-1252"))
"£"

For handling e-mail, see also `generalize-encoding` from `net/unihead`.
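The two suggestions can be combined. The following is a minimal sketch, assuming the platform's converter supports the target encoding; `decode-bytes` is a hypothetical helper name, not something provided by `racket/port` or `net/unihead`.

```racket
#lang racket/base
(require racket/port   ; reencode-input-port, port->string
         net/unihead)  ; generalize-encoding

;; decode-bytes is a hypothetical helper name, not part of any Racket library.
;; It widens the declared charset with `generalize-encoding` and then reads
;; the whole byte string through a re-encoding input port.
(define (decode-bytes bstr charset)
  (define in
    (reencode-input-port (open-input-bytes bstr)
                         (generalize-encoding charset)))
  (port->string in))

;; 0xA3 is the pound sign in both ISO-8859-1 and windows-1252.
(decode-bytes #"\xA3" "iso-8859-1")
```

Passing the declared charset through `generalize-encoding` first means that mail mislabeled as Latin-1 or ASCII is decoded as windows-1252, which is the mailer bug the function exists to compensate for.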

> ____________________
> Racket Users list:
> http://lists.racket-lang.org/users

John Clements

Mar 4, 2015, 3:06:10 PM
to John Clements, Matthew Flatt, us...@racket-lang.org
I see that the documentation suggests that (entity-charset) is supposed to return a symbol. However, it nearly always returns a string. In particular, it appears to me that it returns a symbol only when it returns its default, 'us-ascii.

I feel compelled to repair this, but there are two ways to fix it:
1) make it match the docs and always return a symbol, or
2) change the docs and the default to return a string.

It looks to me like #2 will break (less) code, though it's certainly possible that people depend on the default value's being a symbol.

Opinions? In my tree, I've added contract checks on the structure exports and changed the documentation and default to always return a string. If people like this, I can just submit it as a pull request.
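Until the behavior is settled, calling code can accept both shapes. The following is only a sketch of that workaround; `charset->string` is a hypothetical helper, not part of `net/mime`, and it assumes the charset is always either a symbol or a string, as described above.

```racket
#lang racket/base

;; charset->string is a hypothetical helper, not part of net/mime.
;; It normalizes a charset value to a string, whichever form it arrives in.
(define (charset->string cs)
  (cond
    [(symbol? cs) (symbol->string cs)]
    [(string? cs) cs]
    [else (raise-argument-error 'charset->string
                                "(or/c symbol? string?)"
                                cs)]))

(charset->string 'us-ascii)      ; the default, which is a symbol
(charset->string "windows-1252") ; the string form reported in this thread
```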

John


On Tue, Mar 3, 2015 at 10:11 PM, John Clements <clem...@brinckerhoff.org> wrote:

On Mar 3, 2015, at 4:31 PM, Matthew Flatt <mfl...@cs.utah.edu> wrote:

> You can use "windows-1252" as an encoding name with, for example,
> `reencode-input-port`:
>
>> (read-line (reencode-input-port (open-input-bytes #"\xA3")
>                                   "windows-1252"))
> "£"

Perfect!

I went looking for a place where I might add a “windows-1252” search term, but it looks like it might be hard, since the list of supported encodings is apparently platform dependent. Would it make sense simply to attach a free-floating search tag of “windows-1252” to this part of the documentation?
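Since the supported set varies by platform, a program can probe for an encoding at run time. This is a hedged sketch that relies on `bytes-open-converter` returning `#f` when a conversion pair is unavailable; `encoding-supported?` is a hypothetical name introduced here for illustration.

```racket
#lang racket/base

;; encoding-supported? is a hypothetical helper name.
;; It tries to open a converter from `name` to UTF-8 and reports whether
;; that succeeded, closing the converter again if it did.
(define (encoding-supported? name)
  (define conv (bytes-open-converter name "UTF-8"))
  (and conv
       (begin (bytes-close-converter conv) #t)))

;; The result for "windows-1252" depends on the platform's converter.
(encoding-supported? "windows-1252")
```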


>
> For handling e-mail, see also `generalize-encoding` from `net/unihead`.

That probably saved me another half-hour of searching and head-scratching.

Thanks!

John

(p.s.: no one whose mailer checks DMARC records will get this e-mail, sadly. Can’t wait to change to google groups.)

Sam Tobin-Hochstadt

Mar 4, 2015, 3:15:54 PM
to John Clements, John Clements, Matthew Flatt, us...@racket-lang.org
On Wed, Mar 4, 2015 at 3:06 PM John Clements <johnbc...@gmail.com> wrote:
> I see that the documentation suggests that (entity-charset) is supposed to return a symbol. However, it nearly always returns a string. In particular, it appears to me that it returns a symbol only when it returns its default, 'us-ascii.
>
> I feel compelled to repair this, but there are two ways to fix it:
> 1) make it match the docs and always return a symbol, or
> 2) change the docs and the default to return a string.
>
> It looks to me like #2 will break (less) code, though it's certainly possible that people depend on the default value's being a symbol.

It seems like option #3, document the current behavior, will break the least code, and that we should do that.
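One way to write down the current behavior is as a contract that admits both forms. This is only an illustration of the idea, not the contract used in `net/mime`; `charset/c` is a hypothetical name.

```racket
#lang racket/base
(require racket/contract) ; or/c

;; charset/c is a hypothetical name, not part of net/mime.
;; It states the observed behavior: the value is a symbol or a string.
(define charset/c (or/c symbol? string?))

;; A flat contract built from predicates can itself be applied as a predicate.
(charset/c 'us-ascii)
(charset/c "windows-1252")
(charset/c 42)
```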

Sam 

John Clements

Mar 5, 2015, 1:40:23 PM
to Sam Tobin-Hochstadt, sol...@acm.org, Matthew Flatt, us...@racket-lang.org, John Clements
Urghh.... really? The existing behavior is clearly broken, and this library is--to the best of my knowledge--used by a relatively small number of people. Francisco, as the original author of this code, do you have an opinion?