URL encoding, name token and parameters

Daniel Colchete

Oct 10, 2011, 5:01:17 PM

to gwt-pl...@googlegroups.com

Good day everyone!

I started showing my app to a few friends here for beta testing, and today I ran into a problem that has to do with how GWTP generates its URLs.

Common GWTP URLs look like:

http://mysite.com/project.html#nametoken;param1=value1;param2=value2

Today a mail client used by a prospective user re-encoded that URL as

http://mysite.com/project.html#nametoken%3bparam1%3dvalue1%3bparam2%3dvalue2

when the customer tried to click the email verification link. That is a valid encoding of the URL: RFC 2396 allows any char in a URL to be escaped, and says that ; and = must be escaped since they are reserved chars. The problem is that GWTP doesn't decode it, so an error page was shown.
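To make the failure concrete, here is a minimal sketch (plain Java, not GWTP code) showing that percent-decoding the mail client's version of the fragment gives back the original token. The class and method names are hypothetical, and `java.net.URLDecoder` stands in for whatever decoder the client side would use; since `URLDecoder` treats `+` as a space, the sketch protects `+` first.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FragmentDecodeDemo {
    /** Percent-decodes a fragment; '+' is protected so it is not turned into a space. */
    static String decodeFragment(String fragment) {
        try {
            return URLDecoder.decode(fragment.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // The fragment as rewritten by the mail client.
        String mailed = "nametoken%3bparam1%3dvalue1%3bparam2%3dvalue2";
        // Prints: nametoken;param1=value1;param2=value2
        System.out.println(decodeFragment(mailed));
    }
}
```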

My quick solution was to set up Apache's mod_rewrite and a PHP script, so that I can send customers a different URL that doesn't use ';' or '='. For example:

http://mysite.com/email/nametoken/param1/value1/param2/value2

will redirect the user to the right URL.
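The mod_rewrite rule and PHP script are not shown in this thread. Purely as an illustration of the mapping they perform, the sketch below rebuilds the GWTP-style URL from the email-safe path. It is written in Java rather than PHP, `EmailPathRedirect` and `toFragmentUrl` are hypothetical names, and the `/project.html` target is taken from the example URLs above.

```java
class EmailPathRedirect {
    /** Maps "/email/nametoken/p1/v1/p2/v2" to "/project.html#nametoken;p1=v1;p2=v2". */
    static String toFragmentUrl(String path) {
        String prefix = "/email/";
        if (!path.startsWith(prefix)) {
            throw new IllegalArgumentException("Not an email link: " + path);
        }
        String[] parts = path.substring(prefix.length()).split("/");
        // Expect the name token followed by complete name/value pairs.
        if (parts[0].isEmpty() || parts.length % 2 != 1) {
            throw new IllegalArgumentException("Expected a name token and name/value pairs: " + path);
        }
        StringBuilder url = new StringBuilder("/project.html#").append(parts[0]);
        for (int i = 1; i < parts.length; i += 2) {
            url.append(';').append(parts[i]).append('=').append(parts[i + 1]);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFragmentUrl("/email/nametoken/param1/value1/param2/value2"));
    }
}
```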

My suggestion is that GWTP decodes the URLs before using the # fragment.

What do you guys say?

Best,

--
Dani
Cloud3 Tech - http://cloud3.tc/
Twitter: @DaniCloud3 @Cloud3Tech

Philippe Beaudoin

Oct 10, 2011, 5:11:33 PM

to gwt-pl...@googlegroups.com

I think it makes sense, however I believe we are already encoding parameters in `param1` and `value1`, so we have to be careful not to decode the entire thing too early or we risk getting things like param1==value1 with no way to know whether it's "param1=" or "=value1". Tricky.
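A small sketch of that ambiguity, in plain Java with hypothetical names (`java.net.URLDecoder` stands in for the real decoder): two different tokens collapse into the same string once the whole thing is decoded before splitting.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class EarlyDecodeAmbiguity {
    /** Percent-decodes the whole token in one go, before any splitting. */
    static String decodeAll(String token) {
        try {
            return URLDecoder.decode(token.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String nameEndsWithEquals = "param1%3D=value1";    // name "param1=", value "value1"
        String valueStartsWithEquals = "param1=%3Dvalue1"; // name "param1", value "=value1"
        // Both lines print: param1==value1
        System.out.println(decodeAll(nameEndsWithEquals));
        System.out.println(decodeAll(valueStartsWithEquals));
    }
}
```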

Philippe

Daniel Colchete

Oct 11, 2011, 8:06:49 AM

to gwt-pl...@googlegroups.com

Hi Philippe! How are things coming along at the new job?

Yeah, I agree, and solving the "param1=" or "=value1" part would be easy if we could use a different escape char to encode either the (param, value) pairs or the =s and ;s themselves. But this does break backwards compatibility to some degree.

Another solution would be to create a second # format, like #/e/NameToken/param1/value1/param2/value2, like I did. Then the NameTokenFormatter class could have an email compatibility flag, or an asEmailString method, or whatever. Sure, you can build a case where this breaks compatibility, but you would have to try hard to achieve it.

I don't know if we should leave things as they are just to avoid introducing backwards compatibility problems, because broken emailed links are a very big usability problem.

What do you think?

Best,

Dani

Daniel Colchete

Oct 11, 2011, 8:14:16 AM

to gwt-pl...@googlegroups.com

Or maybe another solution would be to move to a new format like

#/NameToken/param1/value1/param2/value2

and make the old one deprecated but still recognized. The key to telling the two formats apart would be the leading '/' char in the fragment.
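A rough sketch of that dispatch idea, assuming nothing about GWTP's real classes: `SlashFormatSketch` and its methods are hypothetical, escaping is ignored entirely, and malformed entries are simply skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SlashFormatSketch {
    /** The new format is recognized by a leading '/' in the fragment. */
    static boolean isSlashFormat(String fragment) {
        return fragment.startsWith("/");
    }

    /** Extracts the parameters from either format; the name token itself is skipped. */
    static Map<String, String> parseParams(String fragment) {
        Map<String, String> params = new LinkedHashMap<>();
        if (isSlashFormat(fragment)) {
            // "/NameToken/param1/value1/..." : pairs follow the name token.
            String[] parts = fragment.substring(1).split("/");
            for (int i = 1; i + 1 < parts.length; i += 2) {
                params.put(parts[i], parts[i + 1]);
            }
        } else {
            // "NameToken;param1=value1;..." : the old, still recognized format.
            String[] parts = fragment.split(";");
            for (int i = 1; i < parts.length; i++) {
                int eq = parts[i].indexOf('=');
                if (eq >= 0) {
                    params.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
                }
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(parseParams("/NameToken/param1/value1/param2/value2"));
        System.out.println(parseParams("NameToken;param1=value1;param2=value2"));
    }
}
```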

Best,

Dani

Philippe Beaudoin

Nov 3, 2011, 12:19:03 PM

to gwt-pl...@googlegroups.com

I like the idea. We could probably build an alternate TokenFormatter implementation and encourage people to use it to build email-resistant tokens. Are you up for the task? :) In any case, it would be nice if you could create an issue for this.

Philippe

Daniel Colchete

Nov 21, 2011, 7:46:12 AM

to gwt-pl...@googlegroups.com

Hi Philippe!

I just got back from two trips (a business one and a vacation one) and I'm now getting back to my projects here. I'm up for the task! I opened the issue here: http://code.google.com/p/gwt-platform/issues/detail?id=381 .

As I read RFC 2396 a little more, I discovered that my mod_rewrite solution works because I used the '/' char in the path part of the URL, not in the fragment part. Unfortunately the RFC says that '/' must also be escaped inside fragments, so the format I proposed would also break the RFC, which is not a good starting point.

The only unreserved chars I could use are "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" (RFC 2396, section 2.3), and since "!" is already used semantically for search engine crawling, that one is out. In any case we will have to manually escape the chosen char inside params and values, so it is a question of which one produces better-looking URLs. What do you think?
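For reference, the "unreserved" set in RFC 2396 section 2.3 is the alphanumerics plus the nine marks listed above. A small sketch of that check, with a hypothetical class name:

```java
class Rfc2396Unreserved {
    // The "mark" characters from RFC 2396, section 2.3.
    private static final String MARKS = "-_.!~*'()";

    /** Returns true if c is unreserved (alphanum | mark) under RFC 2396. */
    static boolean isUnreserved(char c) {
        boolean alphanum = (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9');
        return alphanum || MARKS.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        // The separators discussed in this thread, plus one unreserved mark for contrast.
        for (char c : new char[] {';', '=', '/', ':', '\\', '~'}) {
            System.out.println("'" + c + "' unreserved: " + isUnreserved(c));
        }
    }
}
```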

Or we could still look for another way.

What are your thoughts on this?

Best,

Dani

Philippe Beaudoin

Nov 24, 2011, 8:54:46 AM

to gwt-pl...@googlegroups.com

I've thought a bit more about it, here is my idea...

Right now the decoding process is:

- Split the hierarchy along ;

- Split the assignations along =

- Unescape all the components obtained

I suggest the following:

- Unescape everything

- Split the hierarchy along ;

- Split the assignations along =

- Within the components, unescape "\0" to ";", "\1" to "=" and "\2" to "\"

(In both cases, encoding simply means inverting the steps and playing them in reverse.)
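The four suggested decoding steps can be sketched as follows. This is an illustration in plain Java, not the code that ended up in GWTP: `ProposedDecoder` and `Parsed` are hypothetical names, and `java.net.URLDecoder` stands in for the real URL decoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ProposedDecoder {
    static final class Parsed {
        final String nameToken;
        final Map<String, String> params = new LinkedHashMap<>();
        Parsed(String nameToken) { this.nameToken = nameToken; }
    }

    static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Step 4: "\0" becomes ";", "\1" becomes "=", "\2" becomes "\", in one pass. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') { out.append(c); continue; }
            if (++i == s.length()) throw new IllegalArgumentException("Dangling backslash in: " + s);
            switch (s.charAt(i)) {
                case '0': out.append(';'); break;
                case '1': out.append('='); break;
                case '2': out.append('\\'); break;
                default: throw new IllegalArgumentException("Unknown escape in: " + s);
            }
        }
        return out.toString();
    }

    static Parsed decode(String fragment) {
        String[] parts = urlDecode(fragment).split(";");   // steps 1 and 2
        Parsed parsed = new Parsed(unescape(parts[0]));
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');                  // step 3
            if (eq < 0) throw new IllegalArgumentException("Missing '=' in: " + parts[i]);
            parsed.params.put(unescape(parts[i].substring(0, eq)),   // step 4
                              unescape(parts[i].substring(eq + 1)));
        }
        return parsed;
    }

    public static void main(String[] args) {
        Parsed p = decode("nametoken%3bparam1%3dvalue1%3bparam2%3dvalue2");
        System.out.println(p.nameToken + " " + p.params);
    }
}
```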

I believe this works because the first thing we do is unescape. So no matter what the sender decided to escape, we'll get it. It will work with legacy URLs like:

http://mysite.com/project.html#nametoken;param1=value1;param2=value2

And with fully escaped URLs:

http://mysite.com/project.html#nametoken%3bparam1%3dvalue1%3bparam2%3dvalue2

For situations where we want to encode:

{ ('data_A', 'junk;=\'), ('data_B', 'more_junk;=\') }

then we will work with the following URL:

http://mysite.com/project.html#nametoken;data_A=junk\0\1\2;data_B=more_junk\0\1\2

This encoded URL:

http://mysite.com/project.html#nametoken;data_A=junk%5C0%5C1%5C2;data_B=more_junk%5C0%5C1%5C2

And even this one:

http://mysite.com/project.html#nametoken%3bdata_A%3djunk%5C0%5C1%5C2%3bdata_B%3dmore_junk%5C0%5C1%5C2

The only drawback I can see is that we will break legacy URLs in which a parameter name or value contained "=" or ";". I think it's a small enough subset that we can just upgrade the token formatter to this new logic and document the breaking change.
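To see the breaking change on a concrete legacy token, the sketch below compares the two orderings on `nametoken;q=a%3Bb`, where the value is "a;b". It is an illustration only: the names are hypothetical, the \N unescaping step is left out because it does not affect this example, and a part without "=" is stored with an empty value just to make the damage visible.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class LegacyBreakageDemo {
    static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Legacy order: split along ';' and '=', then percent-decode each component. */
    static Map<String, String> legacyDecode(String fragment) {
        return split(fragment, true);
    }

    /** Proposed order: percent-decode everything first, then split. */
    static Map<String, String> proposedDecode(String fragment) {
        return split(urlDecode(fragment), false);
    }

    private static Map<String, String> split(String fragment, boolean decodeComponents) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = fragment.split(";");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            String name = eq < 0 ? parts[i] : parts[i].substring(0, eq);
            String value = eq < 0 ? "" : parts[i].substring(eq + 1);
            params.put(decodeComponents ? urlDecode(name) : name,
                       decodeComponents ? urlDecode(value) : value);
        }
        return params;
    }

    public static void main(String[] args) {
        String legacyToken = "nametoken;q=a%3Bb";
        System.out.println(legacyDecode(legacyToken));   // q keeps the value "a;b"
        System.out.println(proposedDecode(legacyToken)); // q is cut to "a", and a stray "b" appears
    }
}
```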

What do you say?

Philippe Beaudoin

Nov 24, 2011, 8:57:31 AM

to gwt-pl...@googlegroups.com

Oops. Small mistake. The ";" is the parameter separator. We need to add a step to split along "/" for the hierarchy, and then we need to unescape "\3" to "/".

Christian Goudreau

Nov 24, 2011, 9:23:09 AM

to gwt-pl...@googlegroups.com

I was thinking, why don't we use the GWT Place default tokenizer? (I'm just talking about using : instead of ;.)


--
Christian Goudreau

www.arcbees.com

Philippe Beaudoin

Nov 24, 2011, 9:25:23 AM

to gwt-pl...@googlegroups.com

I'll need more info as to why this won't run into the problems mentioned above.

Christian Goudreau

Nov 24, 2011, 9:26:45 AM

to gwt-pl...@googlegroups.com

It'll probably run into the same problem... it's not an unreserved char.

--
Christian Goudreau

www.arcbees.com

Daniel Colchete

Nov 25, 2011, 6:10:53 AM

to gwt-pl...@googlegroups.com

Hi Philippe!

Great! I think your idea solves everything. I'll work on it next week.

Best,

Dani

Daniel Colchete

Nov 25, 2011, 7:28:10 AM

to gwt-pl...@googlegroups.com

Hi Christian!

About using : instead of ;, it is up to you guys. Whatever you say. I just think that keeping the current ; will make the new format almost backwards compatible; Philippe pointed out the "almost" part.

The big thing here is that we will first decode the URL using the standard URL decoder and then do our own parsing and decoding. \ is not an unreserved char either, so the RFC says it must be encoded too, but the big change here is that we will decode it first. So, even this URL:

http://mysite.com/project.html#nametoken%3bdata_A%3djunk%5C0%5C1%5C2%3bdata_B%3dmore_junk%5C0%5C1%5C2

will work too. Here is the process for the above URL:

1) Start with:

"nametoken%3bdata_A%3djunk%5C0%5C1%5C2%3bdata_B%3dmore_junk%5C0%5C1%5C2"

2) Unescape everything:

"nametoken;data_A=junk\0\1\2;data_B=more_junk\0\1\2"

3) Split the hierarchy along ;

"nametoken"

"data_A=junk\0\1\2"

"data_B=more_junk\0\1\2"

4) Split the assignations along =

"nametoken"

"data_A" = "junk\0\1\2"

"data_B" = "more_junk\0\1\2"

5) Within the components, unescape "\0" to ";", "\1" to "=" and "\2" to "\"

"nametoken"

"data_A" = "junk;=\"

"data_B" = "more_junk;=\"

Best,

Daniel

Philippe Beaudoin

Nov 25, 2011, 8:01:20 AM

to gwt-pl...@googlegroups.com

Yes, I say stick with ";" for compatibility.

Philippe

mayumi

Nov 28, 2011, 12:12:40 PM

to GWTP

Hi Daniel and Philippe,

We ran into the exact same problem over this weekend.
Daniel, if you are not going to get to it, would you like me to take a
look at it?

I am working on it today and tomorrow.

Mayumi


Daniel Colchete

Nov 30, 2011, 8:48:11 AM

to gwt-pl...@googlegroups.com

Good day everyone!

I just posted the first version of the new token formatter on the issue page [1]. I changed one part of the encoding process. Instead of URL.encode()ing everything at the end, I only encode params and values so that the new URLs are similar to the old ones. This means that \0 is escaped to %5C0, \1 is escaped to %5C1, and so on. The decoding process URL.decode()s everything twice: once before the parser jumps in, and again when decoding params and values.
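A rough sketch of that scheme, for illustration only. It is not the code attached to the issue: the class and method names are hypothetical, `java.net.URLEncoder` and `URLDecoder` stand in for GWT's `URL` class, and only the three escapes discussed so far are handled.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ComponentEncodedToken {
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Backslash goes first so the backslashes added for ';' and '=' are not escaped again.
    static String escape(String s) {
        return s.replace("\\", "\\2").replace(";", "\\0").replace("=", "\\1");
    }

    // "\2" goes last so a freshly restored backslash cannot start a new escape.
    static String unescape(String s) {
        return s.replace("\\0", ";").replace("\\1", "=").replace("\\2", "\\");
    }

    /** Only names and values are escaped and URL-encoded; ';' and '=' separators stay readable. */
    static String encode(String nameToken, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(nameToken);
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(';').append(urlEncode(escape(e.getKey())))
              .append('=').append(urlEncode(escape(e.getValue())));
        }
        return sb.toString();
    }

    /** Decodes once before splitting and once more per component, then undoes the \N escapes. */
    static Map<String, String> decodeParams(String fragment) {
        String[] parts = urlDecode(fragment).split(";");
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) throw new IllegalArgumentException("Missing '=' in: " + parts[i]);
            params.put(unescape(urlDecode(parts[i].substring(0, eq))),
                       unescape(urlDecode(parts[i].substring(eq + 1))));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("data_A", "junk;=\\");
        String token = encode("nametoken", params);
        System.out.println(token);
        System.out.println(decodeParams(token));
    }
}
```

One limitation of this sketch: because every component is percent-decoded twice, a value that itself contains a literal "%" is not handled and would need extra care.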

In the end, old URLs work, except for the corner cases where [=;/\] is used inside parameter names or values; emailed URLs work, bookmarking works, and copying and pasting URLs works. I tested with PlaceManager.revealPlace, PlaceManager.buildHistoryToken (for the email part), Ctrl+C Ctrl+V and the URL bar, and by manually encoding the fragment part of the URL again (like a mail user agent would). It seems to be working fine. If anyone finds any problem please let me know and I'll fix it.

The code is based on ParamTokenFormatter, so I kept and updated the documentation accordingly. Two new functions are introduced to do our own \N encoding. I tried to use String.replaceAll() first, but the current implementation works twice as fast. And it is really fast indeed. On my laptop, a Core 2 Duo, it can do about half a million encodings of a 40-char sentence per second, so I didn't try to get it any faster, although JSNI would be my next try.
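The two approaches can be sketched like this. It is an assumption-laden illustration rather than the posted patch: the method names are hypothetical, and the mapping of "/" to "\3" follows the correction made earlier in the thread. No timing is measured here; the sketch only shows that both variants produce the same output.

```java
class EscapeVariants {
    /** Single pass over the input, appending either the char or its \N escape. */
    static String escapeSinglePass(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case ';':  out.append("\\0"); break;
                case '=':  out.append("\\1"); break;
                case '\\': out.append("\\2"); break;
                case '/':  out.append("\\3"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Regex-based variant: four passes, and the backslash must be handled first. */
    static String escapeReplaceAll(String s) {
        return s.replaceAll("\\\\", "\\\\2")
                .replaceAll(";", "\\\\0")
                .replaceAll("=", "\\\\1")
                .replaceAll("/", "\\\\3");
    }

    public static void main(String[] args) {
        String input = "a;b=c\\d/e";
        System.out.println(escapeSinglePass(input));
        System.out.println(escapeReplaceAll(input).equals(escapeSinglePass(input)));
    }
}
```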

Of course the Class name and Package name will have to be changed. This is the version that works inside my test project.

[1] http://code.google.com/p/gwt-platform/issues/detail?id=381

Please let me know what you all think.

Best,
Dani


Philippe Beaudoin

Nov 30, 2011, 10:24:19 AM

to gwt-pl...@googlegroups.com

On Wed, Nov 30, 2011 at 8:48 AM, Daniel Colchete <da...@cloud3.tc> wrote:

> Good day everyone!
>
> I just posted the first version of the new token formatter on the issue page [1]. I changed one part of the encoding process. Instead of URL.encode()ing everything at the end, I only encode params and values so that the new URLs are similar to the old ones. This means that \0 is escaped to %5C0, \1 is escaped to %5C1, and so on. The decoding process URL.decode()s everything twice: once before the parser jumps in, and again when decoding params and values.

Excellent! That's precisely what I wanted, but I figured it would be easier to explain in the code review. Glad to see you thought about it. :)


You put so much work into this! (Performance stats? Wow!) Thanks a lot. I'll spend some time commenting/merging next week.

Cheers,

Philippe

Daniel Colchete

Dec 1, 2011, 6:57:02 AM

to gwt-pl...@googlegroups.com

Thank you Philippe! A lot! I'm really glad I can help with the project. The work you guys did on GWTP is awesome! I wouldn't be using GWT anymore if it wasn't for GWTP.

If you need anything with the code or with anything else please let me know.

Best,

Dani

Christian Goudreau

Dec 1, 2011, 8:35:04 AM

to gwt-pl...@googlegroups.com

Thanks Daniel! You have no idea what it means to us!

Cheers,

--
Christian Goudreau

www.arcbees.com

Philippe Beaudoin

Jan 31, 2012, 8:57:11 PM

to gwt-pl...@googlegroups.com

Hi Daniel!

It's been a loooong time, but I've finally merged your patch. I've made it the new ParameterTokenFormatter and have added some tests (and updated the few failing ones). I also added some documentation to the wiki, under Porting to V0.7.

Thanks again. A lot. It's a great update!

Daniel Colchete

Feb 2, 2012, 2:31:56 PM

to gwt-pl...@googlegroups.com

Philippe!

That is great to hear! Thank you! I'm really glad I can help. Please count me in for anything else!

Congratulations on version 0.7 for you and the team!

Best,

Dani

Christian Goudreau

Feb 2, 2012, 2:33:13 PM

to gwt-pl...@googlegroups.com

Are you available?

--
Christian Goudreau

www.arcbees.com

Daniel Colchete

Feb 3, 2012, 5:38:39 PM

to gwt-pl...@googlegroups.com

Sure! What is it, Christian?

Best,
Dani

Christian Goudreau

Feb 4, 2012, 12:02:35 PM

to gwt-pl...@googlegroups.com

I mean, to work with us, you said that you would do anything :D

--
Christian Goudreau

www.arcbees.com
