> return url.spec().substr(0, kMaxURLLength);

Sun Yueru: I don't think we want to truncate the URL. I think what we want is to report the TLD instead. You can probably get the eTLD+1 directly from the URL:
```
net::registry_controlled_domains::GetDomainAndRegistry(
url,
net::registry_controlled_domains::INCLUDE_PRIVATE_REGISTRIES);
```
One thing that Stephan mentioned yesterday is that we probably still want to check that the eTLD+1 itself is not longer than our limit. However, this shouldn't be the case per https://datatracker.ietf.org/doc/html/rfc1035#section-2.3.4.
Maybe worth commenting on that or even DCHECKing?
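For reference, the RFC 1035 limits mentioned above (each label at most 63 octets, the whole name at most 255 octets on the wire, conventionally 253 in presentation form) are what make the eTLD+1 short by construction. A minimal standalone sketch of that check (the helper name is hypothetical, not a Chromium API):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper illustrating the RFC 1035 length limits: each
// dot-separated label is at most 63 octets, and the full presentation-form
// name is at most 253 octets. Any eTLD+1 satisfying these is far below a
// 2048-character URL limit.
bool IsWithinDnsLengthLimits(std::string_view host) {
  if (host.empty() || host.size() > 253) return false;
  size_t label_start = 0;
  for (size_t i = 0; i <= host.size(); ++i) {
    if (i == host.size() || host[i] == '.') {
      const size_t label_len = i - label_start;
      if (label_len == 0 || label_len > 63) return false;
      label_start = i + 1;
    }
  }
  return true;
}
```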
Alicja Opalinska: I actually think that truncation is a good approach. The part that makes some URLs excessively long is the query params, so truncating only affects that part, and we will still be able to derive all the other forms server-side (i.e. eTLD+1, origin, url_without_query). That is a net gain over only reporting the eTLD+1.

In this case, we probably want to make it explicit and unambiguous that a truncation happened, e.g. set an `is_truncated` bit or even just append `[...]` to the end of the URL.
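A minimal sketch of the proposal (names are illustrative, not the actual CL): keep the first `kMaxURLLength` characters and make truncation explicit by appending a marker, so a truncated URL can be told apart from one that is legitimately exactly `kMaxURLLength` characters long.

```cpp
#include <cassert>
#include <string>

// Illustrative constants and helper; not the actual CL code.
constexpr size_t kMaxURLLength = 2048;
constexpr char kTruncatedIndicator[] = "[...]";

// Returns the spec unchanged if it fits, otherwise the truncated prefix
// with an explicit marker appended.
std::string MaybeTruncateSpec(const std::string& spec) {
  if (spec.size() <= kMaxURLLength) return spec;
  return spec.substr(0, kMaxURLLength) + kTruncatedIndicator;
}
```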
Is it good now, with the `[...]` at the end? Or is it better to have a way to set `is_truncated`?
On the server side, we need to look for the same truncation indicator to distinguish URLs that were truncated from URLs that were exactly 2048 characters long.

Only in the former case, since they don't represent the original "full_url" as seen by the users, can these be replaced by "URLTooLong" as before; their other variants like `origin` can then be derived and recorded in the logs.

The server-side handling should land before this client-side change does.

I think the alternative of setting an `is_truncated` bit may be a cleaner signal, but off the top of my head I can't think of any real use cases for it. To count what proportion of full URLs were truncated, counting `URLTooLong` would yield the same answer as `is_truncated`.
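The server-side check described above could look roughly like this (a hedged sketch with hypothetical names, assuming the client appends `[...]` as discussed): a reported URL ending with the indicator cannot be the full URL the user saw, so it is replaced by the `URLTooLong` placeholder.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Illustrative server-side handling; names are hypothetical.
constexpr std::string_view kTruncatedIndicator = "[...]";

// True if the reported URL carries the client's truncation marker.
bool WasTruncated(std::string_view reported_url) {
  return reported_url.size() >= kTruncatedIndicator.size() &&
         reported_url.substr(reported_url.size() -
                             kTruncatedIndicator.size()) == kTruncatedIndicator;
}

// Truncated URLs don't represent the real "full_url", so replace them with
// the placeholder, as untruncated-but-overlong URLs were handled before.
std::string NormalizeForLogs(std::string_view reported_url) {
  return WasTruncated(reported_url) ? "URLTooLong" : std::string(reported_url);
}
```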
Ohh, I see now, thank you for explaining that, Sun!

> Only for the former case, since they don't represent the original "full_url" as seen by the users, these can be replaced by "URLTooLong" as before, but their other variants like origin can then be derived and recorded in the logs.

How will we determine whether the URL was cut off in the query and not in the path itself (i.e. the `url_without_query` length is itself bigger than 2048)?

I'm also not a huge fan of the `[...]` suffix in the URL. It seems very special-casey, and we will always wonder what happens if it accidentally appears at the end of the string. It also means scanning up to 2 kB every time we want to check whether the string was truncated (assuming we use a zero-terminated string). I think it would be significantly better to have an `is_truncated` flag.

Additionally, if we go with an extra flag, we can even solve the problem of where the URL was truncated by calling it `truncated_at` and giving it the values `{ not_truncated, at_query, at_path, at_origin }`. The logic will be slightly more complex but also slightly more complete.
WDYT?
> const std::string kTruncatedIndicator = "[...]";

Probably could be constexpr?
> How will we determine whether the URL was cut off in the query and not in the path itself (i.e. `url_without_query` length is itself bigger than 2048)?
It seems extremely unlikely that the `url_without_query` part would exceed this length. If truncation does happen there (we would see the URL missing the `?` symbol), the truncated `url_without_query` likely does not represent a real URL of the site, so it wouldn't be logged by the server either (due to the server-side check).
Checking the logs... This does happen! https://paste.googleplex.com/5202623947997184
> Additionally, if we go with an extra flag, we can even solve the problem of where the URL was truncated by calling it `truncated_at`.
While I think `[...]` is sufficiently unambiguous (if a real URL contained this substring, it would be escaped, e.g. `[` becomes `%5B` in the `full_url` field), I agree with you on using `truncated_at` for better clarity. We can keep it as simple as a `uint32` field for now, currently set to 2048. If we ever want to bump up `kMaxURLLength`, this field tracks that status. WDYT?
Hm, but a `truncated_at` that represents the number of characters will not be able to answer the question of whether the truncation happened within the query or the path.

If we went with the enum `{ not_truncated, at_query, at_path, at_origin }`, we would be able to tell where it was truncated, and we can still derive the number of characters it was truncated at from the data.
Can you take a look again, Sun? I applied afie's idea, basically with `is_truncated` being an enum, which will tell the server which URLs to replace with "URLTooLong".
> EXPECT_EQ(path_string.substr(0, 2048 - 5) + "[...]",

Hm, where are those `[...]` coming from in this version of the code?
> if (kMaxURLLength > query_start) {

Nitty nit: I would probably spell it out in reverse, `query_start < kMaxURLLength` (here and below). The reason is that, at least from my perspective, we are thinking from left to right, and we actually want to check whether `query_start` is before the limit.

I guess this is also off by one: if `query_start == kMaxURLLength`, that means we cut off exactly the whole QUERY but still didn't touch the PATH, no? So it should be `query_start <= kMaxURLLength` and `path_start <= kMaxURLLength`.
```
// Verify whether the kMaxURLLength-character cutoff occurred after the QUERY,
// after the PATH, or within the ORIGIN.
const url::Parsed& parsed = url.parsed_for_possibly_invalid_spec();
int query_start = parsed.CountCharactersBefore(url::Parsed::QUERY, true);
int path_start = parsed.CountCharactersBefore(url::Parsed::PATH, true);
UrlInfo::TruncationStatus status;
if (kMaxURLLength > query_start) {
  status = UrlInfo::AT_QUERY;
} else if (kMaxURLLength > path_start) {
  status = UrlInfo::AT_PATH;
} else {
  status = UrlInfo::AT_ORIGIN;
}
```

Nit: this block looks like a great candidate to extract into a separate function, something like `std::pair<std::string, TruncationStatus> truncatedUrlSpec(const GURL& url, int32_t limit)`. WDYT?
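The suggested extraction could look roughly like the following. Since `GURL` and `url::Parsed` are Chromium types, this standalone sketch approximates `query_start`/`path_start` with plain string scans, and uses the `<=` boundary comparisons suggested above; all names and the parsing are illustrative only.

```cpp
#include <cassert>
#include <string>
#include <utility>

enum class TruncationStatus { kNotTruncated, kAtQuery, kAtPath, kAtOrigin };

// Standalone approximation of the suggested helper. Real code would take a
// GURL and use url::Parsed::CountCharactersBefore(); here query_start and
// path_start come from simple string scans so the sketch is self-contained.
std::pair<std::string, TruncationStatus> TruncatedUrlSpec(
    const std::string& spec, size_t limit) {
  if (spec.size() <= limit) return {spec, TruncationStatus::kNotTruncated};
  const size_t scheme_end = spec.find("://");
  // First '/' after the authority, i.e. where the path starts.
  const size_t path_start = scheme_end == std::string::npos
                                ? std::string::npos
                                : spec.find('/', scheme_end + 3);
  const size_t query_start = spec.find('?');
  TruncationStatus status;
  if (query_start != std::string::npos && query_start <= limit) {
    status = TruncationStatus::kAtQuery;   // Cut fell within (or at) the query.
  } else if (path_start != std::string::npos && path_start <= limit) {
    status = TruncationStatus::kAtPath;    // Cut fell within the path.
  } else {
    status = TruncationStatus::kAtOrigin;  // Even the origin didn't fit.
  }
  return {spec.substr(0, limit), status};
}
```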
> void SetShortenedURL(const GURL& url, UrlInfo* url_info) {

Nit: this could use a new name; "truncated" would be more descriptive than "shortened" now. Maybe `SetUrlInfoWithTruncationStatus`?
> return url.spec().substr(0, kMaxURLLength);

I see, fair enough, let's go with this then.
> // This file was exported from google3, do not edit manually.

The proto change should be made in google3 first, then synced to Chromium.
> AT_QUERY = 1;

Can we align these names with those in UKM instead of the GURL library? E.g. `TRUNCATED_AT_FULL_PATH = 1`, `TRUNCATED_AT_URL_WITHOUT_QUERY = 2`, `TRUNCATED_AT_ORIGIN = 3`.
> TruncationStatus is_truncated = 3;

Let's call this `truncation_status` too; by convention, `is_*` is used for boolean fields.