URL Canonicalization and host ports, and some other edge cases


nickg

May 14, 2011, 4:51:36 PM
to Google Safe Browsing API
Hello,

The GSB V2 API has a good test suite for URL canonicalization.
However it only has one example of a host-port combo:

Canonicalize("http://www.gotaport.com:1234/") = "http://www.gotaport.com:1234/";

Granted the port isn't used in computing hashes and prefixes, it still
might be good to have more tests, especially for the cases of port 80
and port 443:

Canonicalize("http://www.google.com:/") = "http://www.google.com/" // missing port
Canonicalize("http://www.google.com:80/") = "http://www.google.com/" // explicit standard port
Canonicalize("http://www.google.com:81/") = "http://www.google.com:81/" // non-standard port
Canonicalize("https://www.google.com:443/") = "https://www.google.com/" // explicit standard port
Canonicalize("https://www.google.com:444/") = "https://www.google.com:444/" // non-standard port
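
A minimal Python sketch of the port handling these proposed tests imply. The names strip_default_port and DEFAULT_PORTS are hypothetical and not part of the GSB spec; the sketch assumes a plain host[:port] authority and does not handle userinfo or IPv6 literals.

```python
# Hypothetical helper: drop an empty or scheme-default port from a URL.
# Assumes a plain "host[:port]" authority (no userinfo, no IPv6 literal).
DEFAULT_PORTS = {"http": "80", "https": "443"}


def strip_default_port(url):
    scheme, sep, rest = url.partition("://")
    if not sep:
        return url  # no scheme, leave untouched

    # Split the authority (host[:port]) from the path and everything after it.
    slash = rest.find("/")
    if slash == -1:
        authority, tail = rest, ""
    else:
        authority, tail = rest[:slash], rest[slash:]

    host, colon, port = authority.rpartition(":")
    if not colon:
        return url  # no port present

    # Drop a missing port ("host:") or the default port for this scheme.
    if port == "" or port == DEFAULT_PORTS.get(scheme.lower()):
        authority = host

    return scheme + "://" + authority + tail


for u in ("http://www.google.com:/",
          "http://www.google.com:80/",
          "http://www.google.com:81/",
          "https://www.google.com:443/",
          "https://www.google.com:444/"):
    print(u, "->", strip_default_port(u))
```

Note that the default is looked up per scheme, so "http://www.google.com:443/" keeps its port.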

Also the following edge case would be good too:
Canonicalize("http://www.google.com/..") = "http://www.google.com/.."
Canonicalize("http://www.google.com/.") = "http://www.google.com/."

since the normalization rules apply only to "/../" and "/./", not to a
trailing "/.." or "/."

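One way to read that rule, as a rough Python sketch: only the exact sequences "/./" and "/../" are rewritten, so a trailing "/." or "/.." passes through unchanged. The function name resolve_dot_segments is hypothetical, and this is an interpretation of the rule as described above, not code taken from the GSB documentation.

```python
# Hypothetical helper: resolve only the literal sequences "/./" and "/../"
# in a URL path, leaving a trailing "/." or "/.." untouched.
def resolve_dot_segments(path):
    # Replace "/./" with "/" until none remain (handles "/././").
    while "/./" in path:
        path = path.replace("/./", "/")

    # Remove each "/../" together with the path component before it.
    while True:
        i = path.find("/../")
        if i == -1:
            break
        prev = path.rfind("/", 0, i)  # start of the preceding component
        if prev == -1:
            prev = 0  # nothing before it, just drop the "/.."
        path = path[:prev] + path[i + 3:]

    return path


for p in ("/a/./b", "/a/../b", "/a/b/../../c", "/..", "/."):
    print(p, "->", resolve_dot_segments(p))
```
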
thoughts?

thanks!

nickg


Sam C

Jun 14, 2011, 11:05:06 AM
to Google Safe Browsing API
Hi Nickg,

Those are good canonicalization examples and would be useful for a web
browser, but for GSB I think they're unnecessary overhead (a client
would have to store a list of standard ports, etc.), since the port
isn't used when we do a lookup. (I think the only reason the port
example is in the test suite at all is to make sure the client doesn't
go crazy when someone adds a port to the end of their URL!)

So in my opinion: great for web browsers, probably unnecessary for GSB
clients.

--Sam


dcs dcs

Jun 17, 2011, 11:07:24 AM
to google-safe-...@googlegroups.com
Hi All:

I am interested in how I should implement Punycode. The Google
documentation for GSB has no example for a non-English URL, such as a
Chinese one; it also has a few errors and is confusing, to say the
least.


Regards,


DCS DCS
