ICU4X has APIs for this but I guess you don't want to have to reparse
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
static_assert(c >= 'a' && c <= 'z', "Invalid BCP47 extension identifier");Gemini says: BCP47 allows singletons to be digits (`0-9`) in addition to letters.
This `static_assert` should be updated to allow digit singletons:
```cpp
static_assert((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'), "Invalid BCP47 extension identifier");
```
Greg says: this appears to be correct as per https://www.rfc-editor.org/info/rfc5646/#section-2.1
std::string_view AsStringView() const { return value_; }how about just `value()`?
explicit Extension(base::PassKey<base::LanguageCode>, std::string value)nit: omit `explicit` for multi-arg ctors
// LanguageCodeBuilder::GetInstance().FromString("en-US-u-ca-gregory");that's me! :-)
std::string_view FindExtension(std::string_view code, char ext_id) {nit: blank line after namespace open
std::string needle = StrCat({"-", std::string_view(&ext_id, 1), "-"});Gemini says: Using the address of a local `char` to create a `std::string_view` is fragile and generally considered an anti-pattern, even if it's technically memory-safe here because it's consumed immediately.
Consider using `std::string` directly:
```cpp
std::string needle = base::StrCat({"-", std::string(1, ext_id), "-"});
```
Or simply:
```cpp
std::string needle = std::string("-") + ext_id + "-";
```
Greg says: I had a similar thought when i read this, so i think it's a good idea to change it.
return "";optional nit: it amounts to the same thing, but how about `{}` here to mean "empty string view"
for (size_t i = pos + 1; i + 2 < code.size(); i++) {nit: i think it's slightly more readable to do `code = code.substr(pos);` or `std::string_view extension = code.substr(pos);` just before this and adjust the code below accordingly.
for (size_t i = pos + 1; i + 2 < code.size(); i++) {start at index `3`? we don't need to look at any of the characters matched by `needle`.
if (code[i] == '-' && base::IsAsciiAlpha(code[i + 1]) &&Gemini says: BCP47 extension singletons can be digits (`0-9`) as per RFC 5646 section 2.1.
Using `base::IsAsciiAlpha` will fail to recognize digit singletons (e.g., `-1-`), causing it to incorrectly include the next extension as part of the current one.
You should use `base::IsAsciiAlphaNumeric` (or `absl::ascii_isalnum`) instead.
if (code[i] == '-' && base::IsAsciiAlpha(code[i + 1]) &&
code[i + 2] == '-') {nit: swap the order of these last two terms so that the most expensive check is last.
return code.substr(pos, i - pos);Gemini says (and Greg has not validated, so please verify its claim): Please fix this WARNING reported by autoreview issue finding: There's a critical issue here regarding the private use extension (`x`).
Unlike regular extensions, `x` allows subtags of length 1 (e.g., `en-US-x-a-b`).
If `FindExtension` is looking for `x`, this loop will incorrectly stop at `-a-` and truncate the private use value.
Furthermore, if you are searching for `a`, `code.find(needle)` might accidentally find `-a-` *inside* the private use extension!
To fix both issues, you can take advantage of the fact that `x` is always the last extension:
```cpp
size_t x_pos = code.find("-x-");
std::string_view search_space = code;
if (x_pos != std::string_view::npos && ext_id != 'x') {
// Ignore everything inside the private use extension.
search_space = code.substr(0, x_pos);
}
std::string needle = std::string("-") + ext_id + "-";
size_t pos = search_space.find(needle); if (pos == std::string_view::npos) {
return "";
}// 'x' is always the last extension, so it spans to the end of the string.
if (ext_id == 'x') {
return search_space.substr(pos);
}
for (size_t i = pos + 1; i + 2 < search_space.size(); i++) {
if (search_space[i] == '-' && base::IsAsciiAlphaNumeric(search_space[i + 1]) &&
search_space[i + 2] == '-') {
return search_space.substr(pos, i - pos);
}
}return search_space.substr(pos);
```
if (extension.size() < 3) {nit: "3" happens a lot. consider something like `kExtensionPrefixLength` or something in an unnamed namespace at the top of the file.
return R(base::PassKey<LanguageCode>(), std::move(*value));`*std::move(value)` as per go/totw/181#moving-from-the-value-of-an-abslstatusor
Could you add some tests for malformed inputs?
"en-US-u-ca-gregory-x-private-a-test");Gemini says (and, again, Greg has not validated): This test implies a misunderstanding of BCP47.
Because the private use extension (`x`) is always the last subtag sequence and it allows 1-character subtags, `-a-test` is actually a continuation of the `x` extension, NOT a new extension `a`.
The value of `x` here should be `private-a-test`, and `ext<'a'>()` should return `std::nullopt`.
(This test currently passes only because of the buggy parsing logic in `FindExtension` which incorrectly stops at 1-character subtags inside the `x` extension).
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
static_assert(c >= 'a' && c <= 'z', "Invalid BCP47 extension identifier");Gemini says: BCP47 allows singletons to be digits (`0-9`) in addition to letters.
This `static_assert` should be updated to allow digit singletons:
```cpp
static_assert((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'), "Invalid BCP47 extension identifier");
```Greg says: this appears to be correct as per https://www.rfc-editor.org/info/rfc5646/#section-2.1
Done
std::string_view AsStringView() const { return value_; }how about just `value()`?
dunno, for it .value() is too much associated with optionality, don't you think?
explicit Extension(base::PassKey<base::LanguageCode>, std::string value)nit: omit `explicit` for multi-arg ctors
Done
std::string_view FindExtension(std::string_view code, char ext_id) {nit: blank line after namespace open
Done
std::string needle = StrCat({"-", std::string_view(&ext_id, 1), "-"});Gemini says: Using the address of a local `char` to create a `std::string_view` is fragile and generally considered an anti-pattern, even if it's technically memory-safe here because it's consumed immediately.
Consider using `std::string` directly:
```cpp
std::string needle = base::StrCat({"-", std::string(1, ext_id), "-"});
```
Or simply:
```cpp
std::string needle = std::string("-") + ext_id + "-";
```Greg says: I had a similar thought when i read this, so i think it's a good idea to change it.
Done
optional nit: it amounts to the same thing, but how about `{}` here to mean "empty string view"
Done
nit: i think it's slightly more readable to do `code = code.substr(pos);` or `std::string_view extension = code.substr(pos);` just before this and adjust the code below accordingly.
Done
start at index `3`? we don't need to look at any of the characters matched by `needle`.
Done
if (code[i] == '-' && base::IsAsciiAlpha(code[i + 1]) &&Gemini says: BCP47 extension singletons can be digits (`0-9`) as per RFC 5646 section 2.1.
Using `base::IsAsciiAlpha` will fail to recognize digit singletons (e.g., `-1-`), causing it to incorrectly include the next extension as part of the current one.You should use `base::IsAsciiAlphaNumeric` (or `absl::ascii_isalnum`) instead.
Done
if (code[i] == '-' && base::IsAsciiAlpha(code[i + 1]) &&
code[i + 2] == '-') {nit: swap the order of these last two terms so that the most expensive check is last.
Done
Gemini says (and Greg has not validated, so please verify its claim): Please fix this WARNING reported by autoreview issue finding: There's a critical issue here regarding the private use extension (`x`).
Unlike regular extensions, `x` allows subtags of length 1 (e.g., `en-US-x-a-b`).
If `FindExtension` is looking for `x`, this loop will incorrectly stop at `-a-` and truncate the private use value.
Furthermore, if you are searching for `a`, `code.find(needle)` might accidentally find `-a-` *inside* the private use extension!To fix both issues, you can take advantage of the fact that `x` is always the last extension:
```cpp
size_t x_pos = code.find("-x-");
std::string_view search_space = code;
if (x_pos != std::string_view::npos && ext_id != 'x') {
// Ignore everything inside the private use extension.
search_space = code.substr(0, x_pos);
}std::string needle = std::string("-") + ext_id + "-";
size_t pos = search_space.find(needle);if (pos == std::string_view::npos) {
return "";
}// 'x' is always the last extension, so it spans to the end of the string.
if (ext_id == 'x') {
return search_space.substr(pos);
}for (size_t i = pos + 1; i + 2 < search_space.size(); i++) {
if (search_space[i] == '-' && base::IsAsciiAlphaNumeric(search_space[i + 1]) &&
search_space[i + 2] == '-') {
return search_space.substr(pos, i - pos);
}
}return search_space.substr(pos);
```
very nice catch. I rewrote it a little differently, let me know what you think.
nit: "3" happens a lot. consider something like `kExtensionPrefixLength` or something in an unnamed namespace at the top of the file.
Done
return R(base::PassKey<LanguageCode>(), std::move(*value));`*std::move(value)` as per go/totw/181#moving-from-the-value-of-an-abslstatusor
Done
Could you add some tests for malformed inputs?
Done
Gemini says (and, again, Greg has not validated): This test implies a misunderstanding of BCP47.
Because the private use extension (`x`) is always the last subtag sequence and it allows 1-character subtags, `-a-test` is actually a continuation of the `x` extension, NOT a new extension `a`.
The value of `x` here should be `private-a-test`, and `ext<'a'>()` should return `std::nullopt`.
(This test currently passes only because of the buggy parsing logic in `FindExtension` which incorrectly stops at 1-character subtags inside the `x` extension).
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
- New API: Adds LanguageCode::GetExtension<T>(traits) to retrieve specific extension types.nit: please wrap cl descriptions at 72 cols as per https://chromium.googlesource.com/chromium/src/+/main/docs/contributing.md#uploading-a-change-for-review
// Represents a Unicode BCP47 extension ('u-')....and this holds the subtags for the Unicode extension in a language tag.
// Represents a generic BCP47 extension (e.g., 'x-' for private use).to be precise: an instance of this holds an extension's subtags in a language tag, right?
std::string_view AsStringView() const { return value_; }Danilo Françoso Tedeschihow about just `value()`?
dunno, for it .value() is too much associated with optionality, don't you think?
let's look at bcp47 to see what it calls this... https://www.rfc-editor.org/info/rfc5646/#section-2.2.6 calls the stuff that follows the single-character extension identifier the "extension subtags" (see items 5 and 6 -- 5 says "Each extension subtag MUST be from two to eight characters long..." and 6 says "Each singleton MUST be followed by at least one extension subtag."). so how about renaming `value_` to `subtags_` and call this getter `subtags()`? this seems most consistent with the language of the spec.
// Retrieves a BCP47 extension subtag from the language code.nit: the blob after a singleton is one or more subtags, so maybe say "Retrieves the subtag(s) for an extension to a BCP47 language tag."
class BASE_I18N_EXPORT LanguageCode {now that i've been scrutinizing BCP47, i think this class is more aligned with what the spec calls a "Language Tag". the spec uses the term "code" to refer to the actual value of any subtag in a language tag (e.g., "en" is the language code for English; "US" is the region code for the US, etc). maybe too late to rename the class now, but perhaps it's worth clarifying in the doc comments.
#include "base/strings/strcat.h"This `#include` doesn't seem to be used in this file and can be safely removed.
// Finds the position of start of the first extension.nit: remove extra space
// Returns the extension keyed by the singleton <ext_id>. It only returns themaybe rephrase to "Returns the subtags for the extension identified by the singleton `ext_id`." (using backticks rather than angle brackets around the argument name as per https://google.github.io/styleguide/cppguide.html#Function_Comments.)
std::string_view GetExtensionValue(std::string_view code, char ext_id) {i wonder if a good way to structure this is to have a primitive similar to `FindFirstExtension` that operates on a string view and returns either the offset of the first singleton found (i.e., the middle character in the discovered '-S-' sequence) or npos (no singleton found). maybe call this `FindNextSingleton`. then this function can use it repeatedly until it finds the singleton that equals `ext_id`; aborting if it finds "x" or an empty view. when it finds a match, it calls `FindNextSingleton` one more time to find the start of the next singleton. if this isn't npos and is > this_ext + 2, we return the substr from this_ext + 2 to next_ext. or something like that. i don't think we need `needle` in this case -- we just walk one by one through each singleton until we find the one we want, and return everything up to the next singleton (with special handling for "x"). wdyt? would this simplify things at all?
(i don't think we'd need/want the "Skip the first two characters..." part of the fn above.)
std::string needle = ext_id + std::string("-");You can avoid the string heap allocation here by using a `std::string_view` over a local char array:
```cpp
const char needle_chars[] = {ext_id, '-'};
std::string_view needle(needle_chars, 2u);
```
CHECK_GT(next_extension_pos, 0u)crashing on invalid inputs seems problematic. should we not simply reject/ignore them?
it might be nice to reference the spec (e.g., "BCP47 2.2.6. says that each extension subtag MUST be from two to eight characters long")
std::string_view extension_value = GetExtensionValue(code_.AsString(), key);maybe `GetExtensionSubtags` is more aligned with the language of the spec.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
- New API: Adds LanguageCode::GetExtension<T>(traits) to retrieve specific extension types.nit: please wrap cl descriptions at 72 cols as per https://chromium.googlesource.com/chromium/src/+/main/docs/contributing.md#uploading-a-change-for-review
formated it.
...and this holds the subtags for the Unicode extension in a language tag.
Done
// Represents a generic BCP47 extension (e.g., 'x-' for private use).to be precise: an instance of this holds an extension's subtags in a language tag, right?
Added a method for returning the singleton identifier for the extension. This way we get a proper extension representation from the `Extension` class.
std::string_view AsStringView() const { return value_; }Danilo Françoso Tedeschihow about just `value()`?
Greg Thompsondunno, for it .value() is too much associated with optionality, don't you think?
let's look at bcp47 to see what it calls this... https://www.rfc-editor.org/info/rfc5646/#section-2.2.6 calls the stuff that follows the single-character extension identifier the "extension subtags" (see items 5 and 6 -- 5 says "Each extension subtag MUST be from two to eight characters long..." and 6 says "Each singleton MUST be followed by at least one extension subtag."). so how about renaming `value_` to `subtags_` and call this getter `subtags()`? this seems most consistent with the language of the spec.
Done
std::string_view AsStringView() const { return value_; }Danilo Françoso Tedeschihow about just `value()`?
Greg Thompsondunno, for it .value() is too much associated with optionality, don't you think?
let's look at bcp47 to see what it calls this... https://www.rfc-editor.org/info/rfc5646/#section-2.2.6 calls the stuff that follows the single-character extension identifier the "extension subtags" (see items 5 and 6 -- 5 says "Each extension subtag MUST be from two to eight characters long..." and 6 says "Each singleton MUST be followed by at least one extension subtag."). so how about renaming `value_` to `subtags_` and call this getter `subtags()`? this seems most consistent with the language of the spec.
Done
// Singletons can be digits (except 'x', 'i').
EXPECT_THAT(LanguageCodeBuilder::GetInstance().FromString("en-1-myext"),
Optional(Property(&LanguageCode::ToString, Eq("en-1-myext"))));REmoving the possibility of digits as singletons as they are not supported from ICU4x now and are not even considered by the standards.
... EXPECT_THAT(Danilo Françoso Tedeschiremove "... "
Done
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
std::string_view AsStringView() const { return value_; }Danilo Françoso Tedeschihow about just `value()`?
dunno, for it .value() is too much associated with optionality, don't you think?
Done
// Retrieves a BCP47 extension subtag from the language code.Danilo Françoso Tedeschinit: the blob after a singleton is one or more subtags, so maybe say "Retrieves the subtag(s) for an extension to a BCP47 language tag."
Done
class BASE_I18N_EXPORT LanguageCode {now that i've been scrutinizing BCP47, i think this class is more aligned with what the spec calls a "Language Tag". the spec uses the term "code" to refer to the actual value of any subtag in a language tag (e.g., "en" is the language code for English; "US" is the region code for the US, etc). maybe too late to rename the class now, but perhaps it's worth clarifying in the doc comments.
I do not think it is too late to rename it. I pick the name just because in google3 that is what they ended up choosing. I am ok with changing it.
the other option would be to call it `Locale` as well, but I also agree `LanguageTag` also sounds more aligned with BCP47.
This `#include` doesn't seem to be used in this file and can be safely removed.
Done
// Finds the position of start of the first extension.Danilo Françoso Tedeschinit: remove extra space
Done
// Returns the extension keyed by the singleton <ext_id>. It only returns themaybe rephrase to "Returns the subtags for the extension identified by the singleton `ext_id`." (using backticks rather than angle brackets around the argument name as per https://google.github.io/styleguide/cppguide.html#Function_Comments.)
Done
std::string_view GetExtensionValue(std::string_view code, char ext_id) {i wonder if a good way to structure this is to have a primitive similar to `FindFirstExtension` that operates on a string view and returns either the offset of the first singleton found (i.e., the middle character in the discovered '-S-' sequence) or npos (no singleton found). maybe call this `FindNextSingleton`. then this function can use it repeatedly until it finds the singleton that equals `ext_id`; aborting if it finds "x" or an empty view. when it finds a match, it calls `FindNextSingleton` one more time to find the start of the next singleton. if this isn't npos and is > this_ext + 2, we return the substr from this_ext + 2 to next_ext. or something like that. i don't think we need `needle` in this case -- we just walk one by one through each singleton until we find the one we want, and return everything up to the next singleton (with special handling for "x"). wdyt? would this simplify things at all?
(i don't think we'd need/want the "Skip the first two characters..." part of the fn above.)
Let me know what you think of the approach I implemented.
You can avoid the string heap allocation here by using a `std::string_view` over a local char array:
```cpp
const char needle_chars[] = {ext_id, '-'};
std::string_view needle(needle_chars, 2u);
```
Done
crashing on invalid inputs seems problematic. should we not simply reject/ignore them?
Danilo Françoso Tedeschiit might be nice to reference the spec (e.g., "BCP47 2.2.6. says that each extension subtag MUST be from two to eight characters long")
Done
std::string_view extension_value = GetExtensionValue(code_.AsString(), key);Danilo Françoso Tedeschimaybe `GetExtensionSubtags` is more aligned with the language of the spec.
I called it GetExtensionString as I changed the code to include the singleton in the output of it.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Created a bug for ICU4x not accepting digits as extension singletons: https://github.com/unicode-org/icu4x/issues/8018
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |