url-encode does not encode the '+' character


Jan Rychter

Dec 18, 2013, 5:56:06 AM
to ring-c...@googlegroups.com
I've just discovered that url-encode passes the '+' character unencoded. This caused me problems when using url-encode for generating Amazon AWS signatures.

Here's what ring does:

(defn url-encode
  "Returns the url-encoded version of the given string, using either a specified
  encoding or UTF-8 by default."
  [unencoded & [encoding]]
  (str/replace
    unencoded
    #"[^A-Za-z0-9_~.+-]+"
    #(double-escape (percent-encode % encoding))))

A quick search in ring's history turns up this change: "Fixed url-encode and url-decode to handle "+" correctly", which means '+' was added for a reason. But what is that reason?

In general, the whole area of percent- and url- encoding is murky at best. But the list of "unreserved characters" in RFC3986 (http://tools.ietf.org/html/rfc3986#section-2.3) does not include '+'. Also, '+' is on the list of "reserved characters" in section 2.2 (http://tools.ietf.org/html/rfc3986#section-2.2) as a sub-delimiter, although I fail to understand the three paragraphs that follow.

This means that ring added '+' to the list of "unreserved characters" from RFC3986. Why?
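To make the difference concrete, here is a small sketch that runs both character classes over a sample string. The first pattern is the one quoted from ring above; the second is the same pattern with '+' removed, i.e. only the RFC 3986 section 2.3 unreserved set. The helper name runs-to-encode is made up for this illustration and is not part of ring.

```clojure
;; Pattern quoted from ring's url-encode: '+' is treated as safe.
(def ring-unsafe #"[^A-Za-z0-9_~.+-]+")

;; Same pattern limited to the RFC 3986 section 2.3 unreserved characters.
(def strict-unsafe #"[^A-Za-z0-9_~.-]+")

(defn runs-to-encode
  "Hypothetical helper: returns the runs of characters that the given
  pattern would hand over to percent-encoding."
  [pattern s]
  (vec (re-seq pattern s)))

(runs-to-encode ring-unsafe "a b+c~-")   ;; => [" "]
(runs-to-encode strict-unsafe "a b+c~-") ;; => [" " "+"]
```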

The reason I hit this problem is because Amazon insists that anything that isn't "unreserved" should be encoded. I don't know if they are right, I'm not even sure if one *can* be right here — but I wanted to understand why ring passes '+' through. Perhaps this is a bug after all?

thanks,
--J.

David Powell

Dec 18, 2013, 6:18:00 AM
to ring-c...@googlegroups.com
Are you encoding a query string (or a url-encoded POST body)?

In a query-string, space should get encoded to +, and + to %2B.
So I guess ring's url-encode isn't for encoding query-strings, only for encoding url path components.
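For comparison, the form-style rule described here is what the JDK's java.net.URLEncoder implements, so it can be checked from a Clojure REPL without ring. This is only an illustration of the rule, not of how ring implements anything.

```clojure
(import '(java.net URLEncoder URLDecoder))

;; application/x-www-form-urlencoded as implemented by the JDK:
;; a space becomes "+", so a literal "+" has to become "%2B".
(URLEncoder/encode "a b+c" "UTF-8")   ;; => "a+b%2Bc"

;; Decoding reverses both steps.
(URLDecoder/decode "a+b%2Bc" "UTF-8") ;; => "a b+c"
```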

-- 
Dave


James Reeves

Dec 18, 2013, 6:47:10 AM
to ring-c...@googlegroups.com
Ring has two functions:
  • ring.util.codec/url-encode
  • ring.util.codec/form-encode
The form-encode function does what you want it to do, in that it translates "+" into "%2B". It will also translate maps of parameters.

The url-encode function, on the other hand, doesn't translate "+" because it's not an invalid character, and technically shouldn't be encoded in most circumstances. In particular, it can be used without encoding within the path of the URL (see section 3.3 of RFC 3986). For instance, you could have a URL like "http://example.com/buy+sell".
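As a quick check of the path case, the JDK's java.net.URI (used here only as a convenient RFC-style parser, nothing ring-specific) accepts the example URL and returns the '+' unchanged, whereas a form decoder reads the same character as a space:

```clojure
(import '(java.net URI URLDecoder))

(def uri (URI. "http://example.com/buy+sell"))

;; In a path, "+" is an ordinary character and comes back as-is.
(.getRawPath uri) ;; => "/buy+sell"
(.getPath uri)    ;; => "/buy+sell"

;; A form decoder, by contrast, treats "+" as an encoded space.
(URLDecoder/decode "buy+sell" "UTF-8") ;; => "buy sell"
```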

- James



Jan Rychter

Dec 18, 2013, 7:04:17 AM
to ring-c...@googlegroups.com, ja...@booleanknot.com
On Wednesday, December 18, 2013 12:47:10 PM UTC+1, James Reeves wrote:
> Ring has two functions:
>   • ring.util.codec/url-encode
>   • ring.util.codec/form-encode
> The form-encode function does what you want it to do, in that it translates "+" into "%2B". It will also translate maps of parameters.
>
> The url-encode function, on the other hand, doesn't translate "+" because it's not an invalid character, and technically shouldn't be encoded in most circumstances. In particular, it can be used without encoding within the path of the URL (see section 3.3 of RFC 3986). For instance, you could have a URL like "http://example.com/buy+sell".

Sigh. I think there is a lot of confusion around this and the RFC leaves much to be desired.

What I really expect is a strict reading of RFC3986 section 2.3:

2.3. Unreserved Characters

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

As for the question of what I am encoding, the answer isn't that obvious. I'm signing AWS requests, which requires merging the query string *and* request POST params. See for example http://docs.aws.amazon.com/AmazonSimpleDB/latest/DeveloperGuide/HMACAuth.html#REST_RESTAuth:

URL encode the parameter name and values according to the following rules:

  • Do not URL encode any of the unreserved characters that RFC 3986 defines.

    These unreserved characters are A-Z, a-z, 0-9, hyphen ( - ), underscore ( _ ), period ( . ), and tilde ( ~ ).

  • Percent encode all other characters with %XY, where X and Y are hex characters 0-9 and uppercase A-F.

  • Percent encode extended UTF-8 characters in the form %XY%ZA....

  • Percent encode the space character as %20 (and not +, as common encoding schemes do).

And unfortunately, ring's form-encode isn't doing what I need either:

(ring.util.codec/form-encode "a b+c~-")
"a+b%2Bc%7E-"
(ring.util.codec/url-encode "a b+c~-")
"a%20b+c~-"

What I really expected is:
(ring.util.codec/url-encode-strict "a b+c~-")
"a%20b%2Bc~-"

In other words, any character which is not an "unreserved character" gets encoded. Note that url-encode is nearly there, but for some reason '+' was added to the list.

--J.

James Reeves

Dec 18, 2013, 7:26:55 AM
to ring-c...@googlegroups.com
On 18 December 2013 12:04, Jan Rychter <jryc...@gmail.com> wrote:
> On Wednesday, December 18, 2013 12:47:10 PM UTC+1, James Reeves wrote:
>
>> The url-encode function, on the other hand, doesn't translate "+" because it's not an invalid character, and technically shouldn't be encoded in most circumstances. In particular, it can be used without encoding within the path of the URL (see section 3.3 of RFC 3986). For instance, you could have a URL like "http://example.com/buy+sell".
>
> Sigh. I think there is a lot of confusion around this and the RFC leaves much to be desired.
>
> What I really expect is a strict reading of RFC3986 section 2.3:

The BNFs in section 3.3 and section 2.2 make it pretty clear that "+" should not be encoded for paths:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"


sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

And from a practical perspective, the url-encode function is designed for the purpose of encoding data into URLs. The "+" character was added specifically to support a use-case where someone wanted to have a path like "/buy+sell", which is a perfectly valid URI.
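The two grammar rules quoted above can be transcribed almost literally into predicates. The sketch below does that; the names unreserved? and pchar-literal? are invented for this example, and the pct-encoded alternative of pchar is left out because it spans three characters rather than one.

```clojure
;; sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
(def sub-delims (set "!$&'()*+,;="))

;; unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
(defn unreserved? [c]
  (boolean (re-matches #"[A-Za-z0-9._~-]" (str c))))

(defn pchar-literal?
  "True if character c may appear literally in a path segment:
  pchar = unreserved / pct-encoded / sub-delims / \":\" / \"@\"
  (the pct-encoded case is not handled here)."
  [c]
  (or (unreserved? c)
      (contains? sub-delims c)
      (contains? (set ":@") c)))

(unreserved? \+)                   ;; => false
(pchar-literal? \+)                ;; => true
(pchar-literal? \space)            ;; => false
(every? pchar-literal? "buy+sell") ;; => true
```

So '+' is not unreserved, yet it is still allowed unencoded in a path, which is the distinction the two sides of this thread are talking past.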

> As for the question of what I am encoding, the answer isn't that obvious. I'm signing AWS requests, which requires merging the query string *and* request POST params. See for example http://docs.aws.amazon.com/AmazonSimpleDB/latest/DeveloperGuide/HMACAuth.html#REST_RESTAuth:
>
> URL encode the parameter name and values according to the following rules:
>
>   • Do not URL encode any of the unreserved characters that RFC 3986 defines.
>
>     These unreserved characters are A-Z, a-z, 0-9, hyphen ( - ), underscore ( _ ), period ( . ), and tilde ( ~ ).
>
>   • Percent encode all other characters with %XY, where X and Y are hex characters 0-9 and uppercase A-F.
>
>   • Percent encode extended UTF-8 characters in the form %XY%ZA....
>
>   • Percent encode the space character as %20 (and not +, as common encoding schemes do).


These rules are, to the best of my knowledge, unique to AWS, and Ring is a general-purpose HTTP library.

Might I suggest that you create a function that caters to these specific requirements? e.g.

(defn aws-url-encode [unencoded & [encoding]]
  (str/replace
    unencoded
    #"[^A-Za-z0-9_~.-]+"
    #(double-escape (percent-encode % encoding))))
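The snippet above leans on helpers from inside ring.util.codec (double-escape and percent-encode). A self-contained variant that can be pasted into a plain REPL might look like the following. It is a sketch of the same idea, not code from ring: percent-encode-run is a name made up here, and the escaping step is dropped on the assumption that the encoded output only ever contains '%' and hex digits, neither of which is special in a regex replacement string.

```clojure
(require '[clojure.string :as str])

(defn- percent-encode-run
  "Hypothetical helper: percent-encodes every byte of s in the given
  charset, using uppercase hex digits."
  [^String s ^String encoding]
  (->> (.getBytes s encoding)
       (map #(format "%%%02X" (bit-and % 0xFF)))
       (str/join)))

(defn aws-url-encode
  "Encodes everything except the RFC 3986 unreserved characters."
  ([unencoded] (aws-url-encode unencoded "UTF-8"))
  ([unencoded encoding]
   (str/replace unencoded
                #"[^A-Za-z0-9_~.-]+"
                #(percent-encode-run % encoding))))

(aws-url-encode "a b+c~-") ;; => "a%20b%2Bc~-"
(aws-url-encode "\u00e9")  ;; => "%C3%A9"
```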

- James

Jan Rychter

Dec 18, 2013, 11:01:47 AM
to ring-c...@googlegroups.com, ja...@booleanknot.com
[...]

That's exactly what I did :-) — but I wanted to double-check with you if the '+' was really intentionally left unencoded. If those rules are unique to AWS, then indeed a special function is required.

thanks!
--J.
 

Marc

Dec 27, 2014, 6:05:11 AM
to ring-c...@googlegroups.com, ja...@booleanknot.com
Hello,

I was pointed to this thread in an issue created for my uritemplate-clj library (https://github.com/mwkuster/uritemplate-clj/issues/2).

It seems that the interpretation of "+" as the only encoding for space in query parameters conflicts with the official test cases of RFC 6570 "URI Templates" that are published under https://github.com/uri-templates/uritemplate-test/blob/master/extended-tests.json. These test cases cover a number of examples where query terms such as "URI Templates" must be encoded with %20, e.g.:

/base/12345/John/pages/5/en?format=json&q=URI%20Templates

It seems that the possible encoding of a space as %20 instead of + is supported by at least one major RFC in addition to the specific practice of AWS. Would it be possible to add an optional parameter to form-encode to request this behaviour during encoding?

Best regards

Marc

James Reeves

Dec 27, 2014, 3:19:57 PM
to Marc, ring-c...@googlegroups.com
The form-encode function is designed to encode data in application/x-www-form-urlencoded format, which is pretty explicit about replacing spaces with +.

See:

If the URL template RFC prefers %20, then I'd suggest creating an encoding function specifically for that purpose.
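One possible shape for such a function, sketched here on top of the JDK's java.net.URLEncoder rather than ring, is to form-encode first and then rewrite the spaces. The name form-encode-space-as-pct20 is invented for this example. The rewrite is safe because a literal '+' in the input has already become "%2B", so every '+' left in the output stands for a space.

```clojure
(require '[clojure.string :as str])
(import '(java.net URLEncoder))

(defn form-encode-space-as-pct20
  "Hypothetical helper: form-style encoding, except that spaces are
  written as %20 instead of +."
  [s]
  (-> (URLEncoder/encode s "UTF-8")
      (str/replace "+" "%20")))

(form-encode-space-as-pct20 "URI Templates") ;; => "URI%20Templates"
(form-encode-space-as-pct20 "a b+c")         ;; => "a%20b%2Bc"
```

Note that this still follows URLEncoder's other choices (for instance "~" comes out as "%7E"), so it is not a substitute for the stricter AWS-style rules discussed earlier in the thread.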

- James

Brian Marick

Jan 1, 2015, 9:13:54 PM
to ring-c...@googlegroups.com, Marc


James Reeves wrote:
> The form-encode function is designed to encode data in
> application/x-www-form-urlencoded format, which is pretty explicit about
> replacing spaces with +.

Just a side note: Madmimi's REST interface is unhappy with spaces
encoded as "+". Took me a while to understand that.