[Proposal] Add `URI.encode_query/2` to allow RFC3986 space encoding

Floris Huetink

Jan 29, 2021, 4:59:37 AM
to elixir-lang-core
Hi,

I've been using `URI.encode_query/1` to convert a key/value map to a query string to be appended to a URL (as GET parameters).

As it turns out, `URI.encode_query/1` encodes differently than `URI.encode/2`, particularly in the case of spaces ("+" instead of "%20"). For more context, please see the comment thread here:

https://github.com/elixir-lang/elixir/pull/2392

For me, it was not obvious that I should avoid `encode_query` when encoding a query string to append to a URL. I tried using `URI.encode/2` instead, as it provides RFC3986 encoding, but it behaves differently: it takes a string as input, not an Enumerable.

Given the above, I'd like to propose the addition of `URI.encode_query/2`, where the second argument should allow the user to specify which type of encoding should be used.
Default encoding should be the existing `:www_form` encoding (I'm not sure if there is an official RFC for www-form-urlencoded? If so, we should use that)
One should be able to replace this with RFC3986, e.g. like so:

URI.encode_query(%{foo: "bar"}, :rfc3986)

This should produce a query string with spaces encoded as "%20" instead of "+" (and possibly other differences).
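A minimal sketch of what the proposed function could look like. `QuerySketch` is a hypothetical module used only for illustration, not existing Elixir API; it relies on the existing `URI.encode_www_form/1`, `URI.encode/2` and `URI.char_unreserved?/1`:

```elixir
defmodule QuerySketch do
  # Hypothetical module, not part of Elixir: it illustrates the proposed
  # encode_query/2 with an encoding selector as the second argument.

  def encode_query(enumerable, encoding \\ :www_form)
      when encoding in [:www_form, :rfc3986] do
    Enum.map_join(enumerable, "&", fn {key, value} ->
      encode_component(key, encoding) <> "=" <> encode_component(value, encoding)
    end)
  end

  # :www_form keeps the existing behaviour (a space becomes "+").
  defp encode_component(term, :www_form) do
    URI.encode_www_form(to_string(term))
  end

  # :rfc3986 percent-encodes everything except unreserved characters,
  # so a space becomes "%20".
  defp encode_component(term, :rfc3986) do
    URI.encode(to_string(term), &URI.char_unreserved?/1)
  end
end

IO.puts(QuerySketch.encode_query(%{"key" => "two words"}))
# prints: key=two+words
IO.puts(QuerySketch.encode_query(%{"key" => "two words"}, :rfc3986))
# prints: key=two%20words
```

Keeping `:www_form` as the default means existing callers would see no change in output.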

An alternative approach could be to use an `opts` Keyword list:

URI.encode_query(%{foo: "bar"}, encoding: :rfc3986)

The latter approach could have the benefit of being able to add other options later (which I currently cannot think of, but others might).

The general benefit for Elixir users would be to be able to encode Elixir data structures (typically maps) into RFC3986 compliant query strings without having to manually iterate over every item and/or patching the `URI.encode_query/1` result with quick-and-dirty solutions like `String.replace(query_string, "+", "%20")`.
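For reference, the two workarounds mentioned above could look roughly like this. It is only a sketch with made-up sample parameters; it uses nothing beyond existing `URI`, `String` and `Enum` functions:

```elixir
params = %{"key" => "two words", "q" => "1+1"}

# Workaround 1: patch the www-form output. A literal "+" in the input is
# already escaped as "%2B", so every remaining "+" stands for a space.
patched = params |> URI.encode_query() |> String.replace("+", "%20")

# Workaround 2: iterate manually, percent-encoding everything except
# unreserved characters.
encode = fn term -> URI.encode(to_string(term), &URI.char_unreserved?/1) end
manual = Enum.map_join(params, "&", fn {k, v} -> encode.(k) <> "=" <> encode.(v) end)

IO.puts(patched)
# prints: key=two%20words&q=1%2B1
IO.puts(manual)
# prints: key=two%20words&q=1%2B1
```

Both give the same result for this input, but each call site has to repeat the boilerplate.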

That's it. I'd love to hear feedback!

– Floris

José Valim

Jan 29, 2021, 6:42:09 AM
to elixir-l...@googlegroups.com
Hi Floris, thanks for the proposal!

Unfortunately this is a bit more complicated than expected.

First of all, the current implementation is not in violation of RFC3986. %20 is a valid escaping of spaces in query strings. As far as I know, RFC3986 does not explicitly mention that + is equivalent to a space. Furthermore, earlier specifications, such as RFC2396, did not allow + in query strings at all.

Only later on W3C specified that + is reserved to mean spaces to be compatible with the general usage of URLs - which browsers eventually standardized on. However, at this point, the damage was done. For example, for mailto links, your mail client may not rewrite + to spaces, while it will certainly handle percent encoded spaces.

In other words, if you want to guarantee the space will be treated as space, %20 is the best choice. So I would say String.replace/3 is the way to go unless there is a quote in RFC3986 which suggests or advocates for using + as the escaping of spaces in query strings.

Finally, it is important to remember that escaping of URI segments for paths and query strings use distinct algorithms (which is why the function is called encode_query).


--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/538f2618-dd14-4084-b6cc-dabb7a8f58e7n%40googlegroups.com.

Floris Huetink

Jan 29, 2021, 9:32:02 AM
to elixir-lang-core
Hi José,

Thanks for the quick reply! To be sure, generally speaking, I certainly see what you say about "a bit more complicated than expected".

A couple of sentences are a bit confusing to me, and make me wonder whether we've understood each other correctly. I'll explain in context of your reply:

On Friday, January 29, 2021 at 12:42:09 PM UTC+1 José Valim wrote:
> Hi Floris, thanks for the proposal!
>
> Unfortunately this is a bit more complicated than expected.
>
> First of all, the current implementation is not in violation of RFC3986. %20 is a valid escaping of spaces in query strings. As far as I know, RFC3986 does not explicitly mention that + is equivalent to a space. Furthermore, earlier specifications, such as RFC2396, did not allow + in query strings at all.

If I read this correctly, then given what you write, the current `URI.encode_query/1` implementation _is_ in violation of RFC3986. Example:

iex(1)> URI.encode_query(%{"key" => "two words"})
"key=two+words"
iex(2)> URI.encode("key=two words")
"key=two%20words"

 
As you can see, `encode_query` converts " " to "+", while `encode` converts " " to "%20".
This violation of RFC3986 is the main reason for my proposal, as I'd like Elixir to be able to at least comply with this spec (albeit as opt-in for backward compatibility).


> Only later on W3C specified that + is reserved to mean spaces to be compatible with the general usage of URLs - which browsers eventually standardized on. However, at this point, the damage was done. For example, for mailto links, your mail client may not rewrite + to spaces, while it will certainly handle percent encoded spaces.

Yes, thanks! This is yet another reason to at least _allow_ `URI.encode_query` to encode spaces to "%20".

Side note: on further research I've now found there to be a difference between a "normal" URL and a URL generated by a GET form submit. Although not authoritative, this section explains it very well and links to some original specifications:

Bottom line is that browsers are instructed to use "+" to encode spaces when converting GET form fields into a URL. In practice, this leads to the quite wonky situation that the generated URL (containing those encoded form field values as query string parameters!) will by definition _not_ comply with RFC3986, which contains the specification for... URL encoding.


> In other words, if you want to guarantee the space will be treated as space, %20 is the best choice. So I would say String.replace/3 is the way to go unless there is a quote in RFC3986 which suggests or advocates for using + as the escaping of spaces in query strings.

Yes! Again, given what you say in the first sentence here, that was the main reasoning for proposing RFC3986-compliant encode_query behaviour.


> Finally, it is important to remember that escaping of URI segments for paths and query strings use distinct algorithms (which is why the function is called encode_query).

Good to know!

Thanks again for the quick and detailed reply. I hope this message will bring us a bit closer to clarity, instead of inadvertently adding even more confusion to the in itself already quite confusing matter of URI encoding :)

– Floris

José Valim

Jan 29, 2021, 10:13:20 AM
to elixir-l...@googlegroups.com
> If I read this correctly, then given what you write, the current `URI.encode_query/1` implementation _is_ in violation of RFC3986. Example:

You can’t compare the result of URI.encode with URI.encode_query because they are meant to escape different parts of a URI, and different parts use different rules. They are both in accordance with the RFC though.

The assumption that all of a URL needs to be escaped with URI.encode is incorrect. If we encode a query parameter with URI.encode, we get the wrong result.
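A small sketch of why `URI.encode/1` is the wrong tool for a single query parameter: it leaves reserved characters such as `&` and `=` untouched, so a value containing them changes the structure of the query. The sample value and the key name `k` are made up for illustration:

```elixir
value = "a&b=c d"

# URI.encode/1 keeps reserved characters, so "&" and "=" survive.
whole_uri_style = URI.encode(value)
# URI.encode_www_form/1 escapes them, as a single parameter requires.
param_style = URI.encode_www_form(value)

IO.inspect(whole_uri_style)
# prints: "a&b=c%20d"
IO.inspect(param_style)
# prints: "a%26b%3Dc+d"

# Round trip: only the second form decodes back to a single parameter.
IO.inspect(URI.decode_query("k=" <> whole_uri_style))
# prints: %{"b" => "c d", "k" => "a"}
IO.inspect(URI.decode_query("k=" <> param_style))
# prints: %{"k" => "a&b=c d"}
```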

Plus, the Wikipedia article explicitly mentions that escaping space as + is a difference to RFC3986, which confirms my assumption that RFC3986 does not mention whitespace *in query params* should be encoded as +:

> The encoding of SPACE as '+' and the selection of "as-is" characters distinguishes this encoding from RFC 3986.

José Valim

Jan 29, 2021, 10:27:20 AM
to elixir-l...@googlegroups.com
Gah, I am so sorry. I have been working on the wrong assumption that URI.encode_query was escaping space to %20, but it is encoding it to +, which was your point all along. Yes, escaping it to + is not in accordance with RFC3986.

I will re-read your original e-mail and address it accordingly now. Once again, apologies.

Floris Huetink

Jan 29, 2021, 10:32:28 AM
to elixir-lang-core
Ah, thanks! I was already getting quite lost here, seriously doubting my own understanding of the situation :)

Looking forward to your next reply!

José Valim

Jan 29, 2021, 10:58:30 AM
to elixir-l...@googlegroups.com
Ok, sorry about the initial confusion. I thought we emitted %20 and you were proposing to emit +. So I was basically arguing in favor of your change except I didn't know it. :)

Here is my hopefully correct e-reply: encode_www_form (which is what is used by encode_query) is not specified by RFC3986. So at best, we need to clarify that these functions are not part of RFC3986.

RFC3986 does allow + in query strings but it does not say anything about encoding/decoding it. The query string is ultimately up to the interpretation of the underlying application. Even the common key=value mechanism is hinted but not asserted. For example, someone could use a query string where the meaning of & and = are replaced and that's fine.

So while encode_query may not violate RFC3986, it definitely doesn't follow RFC3986. Encoding spaces to %20 would definitely be a better take. To make matters worse, we cannot simply replace URI.encode_www_form by URI.encode because URI.encode does not handle #, which MUST be escaped in query strings.
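The `#` problem can be seen in a short sketch. The example URL and the value are made up; the only assumption is that an RFC-style query encoder would restrict the predicate to unreserved characters via `URI.char_unreserved?/1`:

```elixir
# "#" is a reserved character, so URI.encode/1 leaves it untouched.
unsafe = URI.encode("#elixir lang")
# Restricting the predicate to unreserved characters escapes it.
safe = URI.encode("#elixir lang", &URI.char_unreserved?/1)

IO.inspect(unsafe)
# prints: "#elixir%20lang"
IO.inspect(safe)
# prints: "%23elixir%20lang"

# With the unescaped value, a parser treats everything after "#"
# as the fragment rather than as part of the query value.
uri = URI.parse("http://example.com/?tag=" <> unsafe)
IO.inspect({uri.query, uri.fragment})
```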

I think the best option at the moment is to deprecate encode_query and introduce encode_www_query. We should also introduce a new function that encodes it according to the RFC interpretation of query strings but I am not sure what to call it. Suggestions?

Christopher Keele

Jan 29, 2021, 1:57:13 PM
to elixir-lang-core
> For example, someone could use a query string where the meaning of & and = are replaced and that's fine.

I've had to work with an API that used subtly different querystring encoding semantics, and had to pretty much throw out all of Tesla to get it to work, since it made these assumptions at the time. Delighted that we are smartly agnostic about this in the stdlib.

> I think the best option at the moment is to deprecate encode_query and introduce encode_www_query. We should also introduce a new function that encodes it according to the RFC interpretation of query strings but I am not sure what to call it. Suggestions?

I agree we should deprecate encode_query/1 and support the rfc. Though, I'd hate to clutter the namespace, and the function name we already have is quite solid.

I think Floris's original proposal for encode_query/2 might make a good deprecation path:

- Implement encode_query/2: say via encode_query(_, :www_form) and encode_query(_, :rfc3986); reimplement encode_query/1 to use :www_form
- Start issuing deprecation warnings for encode_query/1 usage
- Eventually remove, so we are left with just encode_query/2
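The steps above could be sketched as follows. `DeprecationSketch` is a hypothetical stand-in for the `URI` module, and the deprecation message is invented; the sketch only shows the shape of the migration, not how it would actually be implemented in core:

```elixir
defmodule DeprecationSketch do
  # Hypothetical stand-in for the URI module; function names mirror the thread.

  # Step 2: callers of the one-argument form get a compile-time warning.
  @deprecated "Use encode_query/2 with an explicit encoding instead"
  def encode_query(enumerable), do: encode_query(enumerable, :www_form)

  # Step 1: the two-argument form selects the encoding explicitly.
  def encode_query(enumerable, :www_form), do: URI.encode_query(enumerable)

  def encode_query(enumerable, :rfc3986) do
    Enum.map_join(enumerable, "&", fn {k, v} -> component(k) <> "=" <> component(v) end)
  end

  # Percent-encode everything except unreserved characters.
  defp component(term), do: URI.encode(to_string(term), &URI.char_unreserved?/1)
end

IO.puts(DeprecationSketch.encode_query(%{"a" => "b c"}, :www_form))
# prints: a=b+c
IO.puts(DeprecationSketch.encode_query(%{"a" => "b c"}, :rfc3986))
# prints: a=b%20c
```

Step 3 would then amount to deleting the `encode_query/1` clause.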

José Valim

Jan 29, 2021, 2:06:28 PM
to elixir-l...@googlegroups.com
> - Implement encode_query/2: say via encode_query(_, :www_form) and encode_query(_, :rfc3986); reimplement encode_query/1 to use :www_form

I like this. A PR is welcome. :)

David Bernheisel

Jan 29, 2021, 2:08:08 PM
to elixir-lang-core
Heads up, the ExAws library also had a problem with this, and there was a PR to adjust the behavior to match S3. I'm not suggesting S3 is doing it correctly, but we've had to deal with this in the community already.

Floris Huetink

Jan 29, 2021, 2:15:00 PM
to elixir-lang-core
On Friday, January 29, 2021 at 8:06:28 PM UTC+1 José Valim wrote:
> - Implement encode_query/2: say via encode_query(_, :www_form) and encode_query(_, :rfc3986); reimplement encode_query/1 to use :www_form

> I like this. A PR is welcome. :)


Great! I'd like to take this up myself if that's OK. (This would be my first Elixir core contribution, so feedback is very much appreciated.)

I'll draft a PR and mention this thread as reference.

Floris Huetink

Jan 29, 2021, 2:17:39 PM
to elixir-lang-core
Oh wait, now that I read this:


Should I first wait for an issue to be added to the issue tracker before opening up a PR?

Eager to get started, but let's not rush things :)

José Valim

Jan 29, 2021, 2:23:04 PM
to elixir-l...@googlegroups.com
> Great! I'd like to take this up myself if that's OK. (This would be my first Elixir core contribution, so feedback is very much appreciated.)

Yes, please go ahead with the PR, it will be very appreciated!


Floris Huetink

Jan 29, 2021, 3:40:54 PM
to elixir-lang-core
On Friday, January 29, 2021 at 8:23:04 PM UTC+1 José Valim wrote:
> Yes, please go ahead with the PR, it will be very appreciated!

Great! I've just submitted a PR here: