HTTP Headers use bytes->string/utf-8


Tim Brown

May 5, 2016, 11:28:49 AM
to Racket Users
Folks,

I have a web server, which in production is receiving a user agent
in its header which includes the phrase:

#"FBCR/M\351ditel"

It seems to be a mobile from Morocco. I’m forcing the
request-headers/raw promise on my request; and I get:

“bytes->string/utf-8: string is not a well-formed UTF-8 encoding”

Which it isn’t. I then investigated (searched on StackOverflow[1], more
like) what character set should be used for HTTP headers. In turn I was
pointed to [2]:

> Historically, HTTP has allowed field content with text in the
> ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
> through use of [RFC2047] encoding. In practice, most HTTP header
> field values use only a subset of the US-ASCII charset [USASCII].
> Newly defined header fields SHOULD limit their field values to
> US-ASCII octets. A recipient SHOULD treat other octets in field
> content (obs-text) as opaque data.

Do you agree that the headers’ string conversion for HTTP (at least,
I don’t know about other network protocols) should be done using:
bytes->string/latin-1 -- which should support LATIN-1 as well as its
subset US-ASCII, as we’d hope for in the future?
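
To make the difference concrete, here is a minimal sketch that can be run at a plain Racket prompt, outside the web server. It uses only the byte string quoted above; the names `ua`, `utf-8-result` and `latin-1-result` are made up for the illustration.

```racket
#lang racket/base

;; The user agent fragment from the failing request. \351 is octal for
;; byte 233, which is e-acute in ISO-8859-1.
(define ua #"FBCR/M\351ditel")

;; As UTF-8, byte 233 announces a multi-byte sequence, but the next byte
;; (the letter d) is not a continuation byte, so decoding raises
;; exn:fail:contract. The handler turns that into a symbol.
(define utf-8-result
  (with-handlers ([exn:fail:contract? (lambda (e) 'not-utf-8)])
    (bytes->string/utf-8 ua)))

;; Latin-1 maps every byte 0-255 to the code point with the same number,
;; so this conversion cannot fail and can be reversed exactly.
(define latin-1-result (bytes->string/latin-1 ua))
```

Because Latin-1 decoding is total and reversible, `(string->bytes/latin-1 latin-1-result)` gives back the original bytes, which is as close to "opaque" as a string conversion gets.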


Tim

[1]
http://stackoverflow.com/questions/4400678/http-header-should-use-what-character-encoding
[2] https://tools.ietf.org/html/rfc7230#section-3.2.4

--
Tim Brown CEng MBCS <tim....@cityc.co.uk>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
City Computing Limited · www.cityc.co.uk
City House · Sutton Park Rd · Sutton · Surrey · SM1 2AE · GB
T:+44 20 8770 2110 · F:+44 20 8770 2130
────────────────────────────────────────────────────────────────────────
City Computing Limited registered in London No:1767817.
Registered Office: City House, Sutton Park Road, Sutton, Surrey, SM1 2AE
VAT No: GB 918 4680 96

Jay McCarthy

May 5, 2016, 1:46:47 PM
to Tim Brown, Racket Users
Hi Tim,

I consider this an error. The Web server tries to avoid interpreting anything as UTF-8 unless asked by the servlet. Header comparison incorrectly converted to UTF-8 and I just pushed a fix. Can you verify that it works now with your workload?

Jay

--
Jay McCarthy
Associate Professor
PLT @ CS @ UMass Lowell
http://jeapostrophe.github.io

           "Wherefore, be not weary in well-doing,
      for ye are laying the foundation of a great work.
And out of small things proceedeth that which is great."
                          - D&C 64:33

Tim Brown

May 5, 2016, 2:00:34 PM
to Racket Users
Thanks Jay,

Will check on it during work tomorrow.

Tim

Tim Brown

May 6, 2016, 11:49:40 AM
to Racket Users, Jay McCarthy, Eli Barzilay
Sorry, Jay; I’ve just tested this and I hit:

Servlet (@ /...) exception:
bytes->string/utf-8: string is not a well-formed UTF-8 encoding
string: #"timmeh \351"

context...:

/usr/local/racket/extra-pkgs/web-server/web-server-lib/web-server/http/bindings.rkt:9:7
loop

/usr/local/racket-6.5/share/racket/collects/racket/contract/private/arrow-val-first.rkt:357:18


This is in web-server/http/bindings.rkt (where I count no fewer than five
`bytes->string/utf-8`s); and I really do think that that should be
bytes->string/latin-1 both because it covers all 256 code points AND
it is what HTTP asks for.

That would fix my issue (I hope).



Also, looking at byte-upcase / bytes-ci=? in
web-server-lib/web-server/private/util ; can I make a couple of
suggestions:

1. I think Eli points out an issue where \277 and \276 are not ci=?
to each other. I’m not sure of his specific example, because in
Latin-1 they are an upside-down "?" and "3/4" respectively, which I
wouldn’t personally consider ci=?. But further up the character set, I
would say that \311 E' and \351 e' ARE ci=?, but only in Latin-1.

So should there not be a byte-upcase/latin-1 and byte-upcase/ascii-7
and a bytes-ci=?/latin-1 and bytes-ci=?/ascii-7?

2. Since this is implemented in a web-server / HTTP context (and for the
reasons I set out above w.r.t. the bindings); should util.rkt not use
bytes-ci=?/latin-1 ?


Since I have an ISO-8859-1 table in front of me:

(define (byte-upcase/latin-1 b)
  (if (or (<= 97 b 122)   ; ascii-7: a-z range
          (<= 224 b 246)  ; latin-1: a` to o"
          ; latin-1: -:- (247) is not the lower case of x (215)
          (<= 248 b 254)) ; latin-1: o/ to |p
      (- b 32)            ; 97 - 65 = 32
      b))
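
As a rough sketch of how the proposed bytes-ci=?/latin-1 could sit on top of that function: the block below carries its own copy of byte-upcase/latin-1 so that it runs standalone, and it works byte by byte without ever decoding to a string. This is only an illustration of the suggestion, not what util.rkt currently does.

```racket
#lang racket/base

;; Upcase one byte under ISO-8859-1: a-z, plus the accented lower-case
;; letters 224-246 and 248-254. Byte 247 (division sign) is skipped, and
;; 223 and 255 have no upper-case partner within Latin-1.
(define (byte-upcase/latin-1 b)
  (if (or (<= 97 b 122)
          (<= 224 b 246)
          (<= 248 b 254))
      (- b 32)
      b))

;; Case-insensitive equality on raw byte strings. No string conversion
;; happens, so no input can raise an encoding exception.
(define (bytes-ci=?/latin-1 a b)
  (and (= (bytes-length a) (bytes-length b))
       (for/and ([x (in-bytes a)]
                 [y (in-bytes b)])
         (= (byte-upcase/latin-1 x) (byte-upcase/latin-1 y)))))
```

With this, `(bytes-ci=?/latin-1 #"M\351ditel" #"M\311DITEL")` is true, and comparing `#"\277"` with itself simply answers true.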


On 05/05/16 18:46, Jay McCarthy wrote:
> Hi Tim,
>
> I consider this an error. The Web server tries to avoid interpreting
> anything as UTF-8 unless asked by the servlet. Header comparison
> incorrectly converted to UTF-8 and I just pushed a fix. Can you verify
> that it works now with your workload?
>
> Jay

Jay McCarthy

May 6, 2016, 12:31:06 PM
to Tim Brown, Racket Users, Eli Barzilay
You should not be using request-headers or request-bindings if you
don't want them to be interpreted as UTF-8. The documentation for
web-server/http/bindings explicitly says, "We recommend against their
use, but they are provided for compatibility with old code."
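
For readers following along, a minimal sketch of the byte-level route, assuming the raw `header` structs (as found in request-headers/raw) and `headers-assq*` from web-server/http/request-structs are the intended alternative. The header list here is built by hand for illustration; in a servlet it would come from the request.

```racket
#lang racket/base
(require web-server/http/request-structs)

;; A hand-made stand-in for (request-headers/raw req); field and value
;; are both byte strings.
(define raw-headers
  (list (make-header #"Host" #"example.invalid")
        (make-header #"User-Agent" #"FBCR/M\351ditel")))

;; headers-assq* looks the field name up case-insensitively and returns
;; the header struct (or #f), leaving the value as untouched bytes.
(define ua-header (headers-assq* #"user-agent" raw-headers))

;; The servlet can now choose how, or whether, to decode the value.
(define ua-bytes (and ua-header (header-value ua-header)))
```

The point is only that the value stays a byte string until the servlet picks an encoding itself.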

Jay

Eli Barzilay

May 6, 2016, 12:32:17 PM
to Tim Brown, Racket Users, Jay McCarthy
On Fri, May 6, 2016 at 11:49 AM, Tim Brown <tim....@cityc.co.uk> wrote:
>
> 1. I think Eli points out in issue where \277 and \276 are not ci=?
> to each other.

No -- my comment there is about \277 and \277 (itself), which are
neither `bytes-ci=?` nor not because the implementation assumes that the
two bytes to compare are both in utf-8 and therefore we get an exception
instead of an answer. The source of the comment is that this was (I
think) at some point in unstable, as a candidate to move to racket/bytes
(hence my comment about the memory requirement, which is relevant in
that context).

[BTW, looking at that SO answer and the RFC it seems to me that latin-1
is wrong too for the values, which should remain opaque...]

--
((x=>x(x))(x=>x(x))) Eli Barzilay:
http://barzilay.org/ Maze is Life!

Tim Brown

May 7, 2016, 4:07:38 AM
to Racket Users
Thanks, Jay, for highlighting the warning which pretty well says that if I do what I did then I deserve what happened to me :-)

More generally, is there any means of "annotating" that a function/library is deprecated (I assume you'd consider that library to be deprecated), so that I would have to face the fact every time I ran my software?

Is Racket getting mature enough that more and more things are going to be left in "just for backward compatibility"?

Without too much consideration, I ask is there a place in contracts for that?

Tim

Jay McCarthy

May 7, 2016, 8:45:30 AM
to Tim Brown, Racket Users
I think that there are lots of things that are just for compatibility.
I think that it would be a bad idea to do something like printing to
stderr for using it and I'm not sure that contracts per se are the
right mechanism. I do think that contracts are a good model, as they are
an annotation "on the side". I think the ideal would be a way to know
during building that you use something for compat and a possible
mechanism would be a special logging level to pass the information
out-of-band during compilation.

Jay