```
void ForEachCountryBucket(std::vector<AutofillProfile>& profiles,
                          base::FunctionRef<void(size_t, size_t)> callback) {
  using Iter = std::vector<AutofillProfile>::iterator;
  Iter it_start = profiles.begin();
  while (it_start != profiles.end()) {
    const Iter it_end = std::ranges::upper_bound(
        it_start, profiles.end(), it_start->GetAddressCountryCode(), {},
        &AutofillProfile::GetAddressCountryCode);
    const size_t start_bucket_index = std::distance(profiles.begin(), it_start);
    const size_t end_bucket_index = std::distance(profiles.begin(), it_end);
    base::UmaHistogramCounts100000(
        "Autofill.NumberOfProfilesConsideredForDedupe.ByCountry",
        end_bucket_index - start_bucket_index);
    callback(start_bucket_index, end_bucket_index);
    it_start = it_end;
  }
}
```

Is this safe? IIUC a caller may modify `profiles` in `callback` and then stuff will break here.
Yes, thanks for pointing that out! In practice, I would say profiles are not expected to be removed in the callback but I see your point.
I could pre-compute all bucket ranges first and then apply the callback for each one of them in reverse order (so that in case a profile from a bucket is removed, all the subsequent ranges remain valid). The problem here is that the caller could still (potentially) remove any other profile and break the logic anyway.
Do you have any ideas?
* Country bucketing: Profiles are initially grouped by country code
  and deduplicated as independent subsets.

Do we have any data showing that users with a lot of profiles usually have profiles from different countries?
* Deferred batching: Profile updates and removals are batched and
  deferred to the very end of deduplication, avoiding useless
  intermediate updates for profiles that are ultimately removed.

Is this worth doing? The data manager queues pending updates, since only one of them is executed at a time. I'd be surprised if this makes a measurable difference during deduplication.
```
using Iter = std::vector<AutofillProfile>::iterator;
Iter it_start = profiles.begin();
while (it_start != profiles.end()) {
  const Iter it_end = std::ranges::upper_bound(
      it_start, profiles.end(), it_start->GetAddressCountryCode(), {},
      &AutofillProfile::GetAddressCountryCode);
  const size_t start_bucket_index = std::distance(profiles.begin(), it_start);
  const size_t end_bucket_index = std::distance(profiles.begin(), it_end);
  base::UmaHistogramCounts100000(
      "Autofill.NumberOfProfilesConsideredForDedupe.ByCountry",
      end_bucket_index - start_bucket_index);
  callback(start_bucket_index, end_bucket_index);
  it_start = it_end;
}
```

I think the more canonical approach is using `std::find_if()`. Does something like this work?
```
auto bgn = profiles.begin();
while (bgn != profiles.end()) {
  auto end = std::find_if(bgn + 1, profiles.end(), [&](const AutofillProfile& p) {
    return bgn->GetAddressCountryCode() != p.GetAddressCountryCode();
  });
  // All profiles in [bgn, end[ share the same country code.
  callback(...);
  bgn = end;
}
```
This implementation is O(n) overall, whereas the repeated `upper_bound` calls make the original O(n log n) in the worst case.
// and preserves the `initial_creator_id`. The algorithm proceeds in two steps,
// such that the amount of retained information is maximized without sending the
// data to the account if it was stored locally.

nit: There is an entire paragraph between "the algorithm proceeds in two steps" and the steps now. Can we rephrase this somehow?
profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {

Do we really have to use indices everywhere? The code becomes much harder to read this way.
DeleteDisusedAddresses();

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.
// Other cleanups are performed on every browser start.

This comment was meant to apply to DeleteDisusedAddresses() as well.
* Country bucketing: Profiles are initially grouped by country code
  and deduplicated as independent subsets.

Do we have any data showing that users with a lot of profiles usually have profiles from different countries?
As far as I know, we don't. This is also the reason why we thought about introducing a new metric to count each bucket size.
* Deferred batching: Profile updates and removals are batched and
  deferred to the very end of deduplication, avoiding useless
  intermediate updates for profiles that are ultimately removed.

Is this worth doing? The data manager queues pending updates, since only one of them is executed at a time. I'd be surprised if this makes a measurable difference during deduplication.
I see your point! My idea was to minimize the number of redundant updates. For example, if profiles A and B are subsets of D, D would normally queue two updates. But if D is also a subset of E, those intermediate updates for D become unnecessary.
Do you think this change is not worth it?
Looks better! I will update it :)
// and preserves the `initial_creator_id`. The algorithm proceeds in two steps,
// such that the amount of retained information is maximized without sending the
// data to the account if it was stored locally.

nit: There is an entire paragraph between "the algorithm proceeds in two steps" and the steps now. Can we rephrase this somehow?
Sure!
profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {

Do we really have to use indices everywhere? The code becomes much harder to read this way.
I could pass a base::span<AutofillProfile>, but then I would have to use std::distance() to derive the index. Do you see that as a better option?
DeleteDisusedAddresses();

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Ah, good to know! I see that the unit test DeleteDisusedAddresses is more or less doing the same thing, expecting that profiles are removed immediately after calling DeleteDisusedAddresses(). Does that mean the test result is unreliable?
Maybe I can reuse IsAutofillEntryWithUseDateDeletable() and delete the profiles manually at the very beginning of DeduplicateProfiles().
For example, if profiles A and B are subsets of D, D would normally queue two updates. But if D is also a subset of E, those intermediate updates for D become unnecessary.
That's true, but the point is that the updates are queued. So they don't impact browser startup metrics - which I think is what we are concerned about?
profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {

Do we really have to use indices everywhere? The code becomes much harder to read this way.
I could pass a base::span<AutofillProfile>, but then I would have to use std::distance() to derive the index. Do you see that as a better option?
No, `std::distance()` seems even worse :)
Maybe one option is to define a custom struct containing the status and the profile, and to operate on those structs instead.
That said, I'm not convinced that these optimisations are worth it to begin with.
DeleteDisusedAddresses();

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Ah, good to know! I see that the unit test DeleteDisusedAddresses is more or less doing the same thing, expecting that profiles are removed immediately after calling DeleteDisusedAddresses(). Does that mean the test result is unreliable?
The test uses a TestAddressDataManager, which [is synchronous](https://source.chromium.org/chromium/chromium/src/+/main:components/autofill/core/browser/data_manager/addresses/test_address_data_manager.h;l=16-17?q=TestAddressDataManager).