Optimize profile deduplication performance [chromium/src : main]

0 views
Skip to first unread message

Karol Sygiet (Gerrit)

unread,
8:03 AM (11 hours ago) 8:03 AM
to Gianmarco Picarella, Chromium LUCI CQ, Florian Leimgruber, Julia Sobiech, Maryia Mankevich, Jihad Hanna, Chromium Metrics Reviews, chromium...@chromium.org, asvitkine...@chromium.org, browser-comp...@chromium.org
Attention needed from Florian Leimgruber and Gianmarco Picarella

Karol Sygiet added 1 comment

File components/autofill/core/browser/data_manager/addresses/address_data_cleaner.cc
Line 79, Patchset 2 (Latest):void ForEachCountryBucket(std::vector<AutofillProfile>& profiles,
base::FunctionRef<void(size_t, size_t)> callback) {
using Iter = std::vector<AutofillProfile>::iterator;
Iter it_start = profiles.begin();
while (it_start != profiles.end()) {
const Iter it_end = std::ranges::upper_bound(
it_start, profiles.end(), it_start->GetAddressCountryCode(), {},
&AutofillProfile::GetAddressCountryCode);
const size_t start_bucket_index = std::distance(profiles.begin(), it_start);
const size_t end_bucket_index = std::distance(profiles.begin(), it_end);
base::UmaHistogramCounts100000(
"Autofill.NumberOfProfilesConsideredForDedupe.ByCountry",
end_bucket_index - start_bucket_index);

callback(start_bucket_index, end_bucket_index);
it_start = it_end;
}
}
Karol Sygiet . unresolved

Is this safe? IIUC a caller may modify `profiles` in `callback` and then stuff will break here.

Open in Gerrit

Related details

Attention is currently required from:
  • Florian Leimgruber
  • Gianmarco Picarella
Submit Requirements:
  • requirement satisfiedCode-Coverage
  • requirement is not satisfiedCode-Owners
  • requirement is not satisfiedCode-Review
  • requirement is not satisfiedNo-Unresolved-Comments
  • requirement is not satisfiedReview-Enforcement
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. DiffyGerrit
Gerrit-MessageType: comment
Gerrit-Project: chromium/src
Gerrit-Branch: main
Gerrit-Change-Id: Icb69c97b214bb50888a36ec32cc110e59b09e934
Gerrit-Change-Number: 7689517
Gerrit-PatchSet: 2
Gerrit-Owner: Gianmarco Picarella <pica...@google.com>
Gerrit-Reviewer: Florian Leimgruber <fleim...@google.com>
Gerrit-Reviewer: Gianmarco Picarella <pica...@google.com>
Gerrit-Reviewer: Karol Sygiet <syg...@google.com>
Gerrit-CC: Chromium Metrics Reviews <chromium-met...@google.com>
Gerrit-CC: Jihad Hanna <jihad...@google.com>
Gerrit-CC: Julia Sobiech <jsob...@google.com>
Gerrit-CC: Maryia Mankevich <mma...@google.com>
Gerrit-Attention: Gianmarco Picarella <pica...@google.com>
Gerrit-Attention: Florian Leimgruber <fleim...@google.com>
Gerrit-Comment-Date: Mon, 23 Mar 2026 12:03:34 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
satisfied_requirement
unsatisfied_requirement
open
diffy

Gianmarco Picarella (Gerrit)

unread,
8:25 AM (11 hours ago) 8:25 AM
to Chromium LUCI CQ, Florian Leimgruber, Karol Sygiet, Julia Sobiech, Maryia Mankevich, Jihad Hanna, Chromium Metrics Reviews, chromium...@chromium.org, asvitkine...@chromium.org, browser-comp...@chromium.org
Attention needed from Florian Leimgruber and Karol Sygiet

Gianmarco Picarella added 1 comment

File components/autofill/core/browser/data_manager/addresses/address_data_cleaner.cc
Line 79, Patchset 2 (Latest):void ForEachCountryBucket(std::vector<AutofillProfile>& profiles,
base::FunctionRef<void(size_t, size_t)> callback) {
using Iter = std::vector<AutofillProfile>::iterator;
Iter it_start = profiles.begin();
while (it_start != profiles.end()) {
const Iter it_end = std::ranges::upper_bound(
it_start, profiles.end(), it_start->GetAddressCountryCode(), {},
&AutofillProfile::GetAddressCountryCode);
const size_t start_bucket_index = std::distance(profiles.begin(), it_start);
const size_t end_bucket_index = std::distance(profiles.begin(), it_end);
base::UmaHistogramCounts100000(
"Autofill.NumberOfProfilesConsideredForDedupe.ByCountry",
end_bucket_index - start_bucket_index);

callback(start_bucket_index, end_bucket_index);
it_start = it_end;
}
}
Karol Sygiet . unresolved

Is this safe? IIUC a caller may modify `profiles` in `callback` and then stuff will break here.

Gianmarco Picarella

Yes, thanks for pointing that out! In practice, I would say profiles are not expected to be removed in the callback but I see your point.

I could pre-compute all bucket ranges first and then apply the callback for each one of them in reverse order (so that in case a profile from a bucket is removed, all the subsequent ranges remain valid). The problem here is that the caller could still (potentially) remove any other profile and break the logic anyway.

Do you have any idea?

Open in Gerrit

Related details

Attention is currently required from:
  • Florian Leimgruber
  • Karol Sygiet
Submit Requirements:
  • requirement satisfiedCode-Coverage
  • requirement is not satisfiedCode-Owners
  • requirement is not satisfiedCode-Review
  • requirement is not satisfiedNo-Unresolved-Comments
  • requirement is not satisfiedReview-Enforcement
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. DiffyGerrit
Gerrit-MessageType: comment
Gerrit-Project: chromium/src
Gerrit-Branch: main
Gerrit-Change-Id: Icb69c97b214bb50888a36ec32cc110e59b09e934
Gerrit-Change-Number: 7689517
Gerrit-PatchSet: 2
Gerrit-Owner: Gianmarco Picarella <pica...@google.com>
Gerrit-Reviewer: Florian Leimgruber <fleim...@google.com>
Gerrit-Reviewer: Gianmarco Picarella <pica...@google.com>
Gerrit-Reviewer: Karol Sygiet <syg...@google.com>
Gerrit-CC: Chromium Metrics Reviews <chromium-met...@google.com>
Gerrit-CC: Jihad Hanna <jihad...@google.com>
Gerrit-CC: Julia Sobiech <jsob...@google.com>
Gerrit-CC: Maryia Mankevich <mma...@google.com>
Gerrit-Attention: Karol Sygiet <syg...@google.com>
Gerrit-Attention: Florian Leimgruber <fleim...@google.com>
Gerrit-Comment-Date: Mon, 23 Mar 2026 12:25:13 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Comment-In-Reply-To: Karol Sygiet <syg...@google.com>
satisfied_requirement
unsatisfied_requirement
open
diffy

Florian Leimgruber (Gerrit)

unread,
9:03 AM (10 hours ago) 9:03 AM
to Gianmarco Picarella, Chromium LUCI CQ, Karol Sygiet, Julia Sobiech, Maryia Mankevich, Jihad Hanna, Chromium Metrics Reviews, chromium...@chromium.org, asvitkine...@chromium.org, browser-comp...@chromium.org
Attention needed from Gianmarco Picarella and Karol Sygiet

Florian Leimgruber added 7 comments

Commit Message
Line 17, Patchset 2 (Latest):* Country bucketing: Profiles are initially grouped by country code
and deduplicated as independent subsets.
Florian Leimgruber . unresolved

Do we have any data showing that users with a lot of profiles usually have profiles from different countries?

Line 19, Patchset 2 (Latest):* Deferred batching: Profile updates and removals are batched and
deferred to the very end of deduplication, avoiding useless
intermediate updates for profiles that are ultimately removed.
Florian Leimgruber . unresolved

Is this worth doing? The data manager queues pending updates, since only one of them is executed at the same time. I'd be surprised if this makes a measurable difference during deduplication.

File components/autofill/core/browser/data_manager/addresses/address_data_cleaner.cc
Line 81, Patchset 2 (Latest): using Iter = std::vector<AutofillProfile>::iterator;

Iter it_start = profiles.begin();
while (it_start != profiles.end()) {
const Iter it_end = std::ranges::upper_bound(
it_start, profiles.end(), it_start->GetAddressCountryCode(), {},
&AutofillProfile::GetAddressCountryCode);
const size_t start_bucket_index = std::distance(profiles.begin(), it_start);
const size_t end_bucket_index = std::distance(profiles.begin(), it_end);
base::UmaHistogramCounts100000(
"Autofill.NumberOfProfilesConsideredForDedupe.ByCountry",
end_bucket_index - start_bucket_index);

callback(start_bucket_index, end_bucket_index);
it_start = it_end;
}
Florian Leimgruber . unresolved

I think the more canonical approach is using `std::find_if()`. Does something like this work?

```
auto bgn = profiles.begin();
while (bgn != profiles.end()) {
auto end = std::find_if(bgn+1, profiles.end(), [&](const AutofillProfile& p) {
return bgn->GetCountryCode() != p.GetCountryCode();
});
// All profiles in [bgn, end[ share the same country code.
callback(...);
bgn = end;
}
```

This implementation is only O(n) rather than O(nlogn) due to the repeated upper_bound usage.

Line 126, Patchset 2 (Latest):// and preserves the `initial_creator_id`. The algorithm proceeds in two steps,
// such that the amount of retained information is maximized without sending the
// data to the account if it was stored locally.
Florian Leimgruber . unresolved

nit: There is an entire paragraph between "the algorithm proceeds in two steps" and the steps now. Can we rephrase this somehow?

Line 168, Patchset 2 (Latest): profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {
Florian Leimgruber . unresolved

Do we really have to use indices everywhere? The code becomes much harder to read this way.

Line 298, Patchset 2 (Latest): DeleteDisusedAddresses();
Florian Leimgruber . unresolved

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Line 310, Patchset 2 (Latest): // Other cleanups are performed on every browser start.
Florian Leimgruber . unresolved

This comment was meant to apply to DeleteDisusedAddresses() as well.

Open in Gerrit

Related details

Attention is currently required from:
  • Gianmarco Picarella
  • Karol Sygiet
Gerrit-Attention: Gianmarco Picarella <pica...@google.com>
Gerrit-Attention: Karol Sygiet <syg...@google.com>
Gerrit-Comment-Date: Mon, 23 Mar 2026 13:03:24 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
satisfied_requirement
unsatisfied_requirement
open
diffy

Gianmarco Picarella (Gerrit)

unread,
10:07 AM (9 hours ago) 10:07 AM
to Chromium LUCI CQ, Florian Leimgruber, Karol Sygiet, Julia Sobiech, Maryia Mankevich, Jihad Hanna, Chromium Metrics Reviews, chromium...@chromium.org, asvitkine...@chromium.org, browser-comp...@chromium.org
Attention needed from Florian Leimgruber and Karol Sygiet

Gianmarco Picarella added 6 comments

Commit Message
Line 17, Patchset 2 (Latest):* Country bucketing: Profiles are initially grouped by country code
and deduplicated as independent subsets.
Florian Leimgruber . unresolved

Do we have any data showing that users with a lot of profiles usually have profiles from different countries?

Gianmarco Picarella

As far as I know, we don't. This is also the reason why we thought about introducing a new metric to count each bucket size.

Line 19, Patchset 2 (Latest):* Deferred batching: Profile updates and removals are batched and
deferred to the very end of deduplication, avoiding useless
intermediate updates for profiles that are ultimately removed.
Florian Leimgruber . unresolved

Is this worth doing? The data manager queues pending updates, since only one of them is executed at the same time. I'd be surprised if this makes a measurable difference during deduplication.

Gianmarco Picarella

I see your point! My idea was to minimize the number of redundant updates. For example, if profiles A and B are subsets of D, D would normally queue two updates. But if D is also a subset of E, those intermediate updates for D become unnecessary.

Do you think this change is not worth it?

File components/autofill/core/browser/data_manager/addresses/address_data_cleaner.cc
Gianmarco Picarella

Looks better! I will update it :)

Line 126, Patchset 2 (Latest):// and preserves the `initial_creator_id`. The algorithm proceeds in two steps,
// such that the amount of retained information is maximized without sending the
// data to the account if it was stored locally.
Florian Leimgruber . unresolved

nit: There is an entire paragraph between "the algorithm proceeds in two steps" and the steps now. Can we rephrase this somehow?

Gianmarco Picarella

Sure!

Line 168, Patchset 2 (Latest): profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {
Florian Leimgruber . unresolved

Do we really have to use indices everywhere? The code becomes much harder to read this way.

Gianmarco Picarella

I could pass a base::span<AutofillProfile> but then I would have to use std::distance() to derive the index, do you see it as a better option?

Line 298, Patchset 2 (Latest): DeleteDisusedAddresses();
Florian Leimgruber . unresolved

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Gianmarco Picarella

Ah, good to know! I see that the unit test DeleteDisusedAddresses is more or less doing the same thing, expecting profiles are removed immediately after calling DeleteDisusedAddresses(). Does that mean the test result is unrealiable?

Open in Gerrit

Related details

Attention is currently required from:
  • Florian Leimgruber
  • Karol Sygiet
Gerrit-Attention: Karol Sygiet <syg...@google.com>
Gerrit-Attention: Florian Leimgruber <fleim...@google.com>
Gerrit-Comment-Date: Mon, 23 Mar 2026 14:07:04 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Comment-In-Reply-To: Florian Leimgruber <fleim...@google.com>
satisfied_requirement
unsatisfied_requirement
open
diffy

Gianmarco Picarella (Gerrit)

unread,
10:10 AM (9 hours ago) 10:10 AM
to Chromium LUCI CQ, Florian Leimgruber, Karol Sygiet, Julia Sobiech, Maryia Mankevich, Jihad Hanna, Chromium Metrics Reviews, chromium...@chromium.org, asvitkine...@chromium.org, browser-comp...@chromium.org
Attention needed from Florian Leimgruber and Karol Sygiet

Gianmarco Picarella added 1 comment

File components/autofill/core/browser/data_manager/addresses/address_data_cleaner.cc
Line 298, Patchset 2 (Latest): DeleteDisusedAddresses();
Florian Leimgruber . unresolved

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Gianmarco Picarella

Ah, good to know! I see that the unit test DeleteDisusedAddresses is more or less doing the same thing, expecting profiles are removed immediately after calling DeleteDisusedAddresses(). Does that mean the test result is unrealiable?

Gianmarco Picarella

Maybe I can reuse IsAutofillEntryWithUseDateDeletable() and delete the profiles manually at the very beginning of DeduplicateProfiles().

Gerrit-Comment-Date: Mon, 23 Mar 2026 14:10:02 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Comment-In-Reply-To: Gianmarco Picarella <pica...@google.com>
Comment-In-Reply-To: Florian Leimgruber <fleim...@google.com>
satisfied_requirement
unsatisfied_requirement
open
diffy

Florian Leimgruber (Gerrit)

unread,
10:10 AM (9 hours ago) 10:10 AM
to Gianmarco Picarella, Chromium LUCI CQ, Karol Sygiet, Julia Sobiech, Maryia Mankevich, Jihad Hanna, Chromium Metrics Reviews, chromium...@chromium.org, asvitkine...@chromium.org, browser-comp...@chromium.org
Attention needed from Gianmarco Picarella and Karol Sygiet

Florian Leimgruber added 3 comments

Commit Message
Line 19, Patchset 2 (Latest):* Deferred batching: Profile updates and removals are batched and
deferred to the very end of deduplication, avoiding useless
intermediate updates for profiles that are ultimately removed.
Florian Leimgruber . unresolved

Is this worth doing? The data manager queues pending updates, since only one of them is executed at the same time. I'd be surprised if this makes a measurable difference during deduplication.

Gianmarco Picarella

I see your point! My idea was to minimize the number of redundant updates. For example, if profiles A and B are subsets of D, D would normally queue two updates. But if D is also a subset of E, those intermediate updates for D become unnecessary.

Do you think this change is not worth it?

Florian Leimgruber

For example, if profiles A and B are subsets of D, D would normally queue two updates. But if D is also a subset of E, those intermediate updates for D become unnecessary.

That's true, but the point is that the updates are queued. So they don't impact browser startup metrics - which I think is what we are concerned about?

File components/autofill/core/browser/data_manager/addresses/address_data_cleaner.cc
Line 168, Patchset 2 (Latest): profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {
Florian Leimgruber . unresolved

Do we really have to use indices everywhere? The code becomes much harder to read this way.

Gianmarco Picarella

I could pass a base::span<AutofillProfile> but then I would have to use std::distance() to derive the index, do you see it as a better option?

Florian Leimgruber

No, `std::distance()` seems even worse :)

Maybe one option is to define a custom struct containing the status and the profile, and operating on those structs instead.

That said, I'm not convinced that these optimisations are worth it to begin with.

Line 298, Patchset 2 (Latest): DeleteDisusedAddresses();
Florian Leimgruber . unresolved

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Gianmarco Picarella

Ah, good to know! I see that the unit test DeleteDisusedAddresses is more or less doing the same thing, expecting profiles are removed immediately after calling DeleteDisusedAddresses(). Does that mean the test result is unrealiable?

Attention is currently required from:
  • Gianmarco Picarella
  • Karol Sygiet
Gerrit-Attention: Gianmarco Picarella <pica...@google.com>
Gerrit-Attention: Karol Sygiet <syg...@google.com>
Gerrit-Comment-Date: Mon, 23 Mar 2026 14:10:35 +0000
satisfied_requirement
unsatisfied_requirement
open
diffy
Reply all
Reply to author
Forward
0 new messages