```
void ForEachCountryBucket(std::vector<AutofillProfile>& profiles,
                          base::FunctionRef<void(size_t, size_t)> callback) {
  using Iter = std::vector<AutofillProfile>::iterator;
  Iter it_start = profiles.begin();
  while (it_start != profiles.end()) {
    const Iter it_end = std::ranges::upper_bound(
        it_start, profiles.end(), it_start->GetAddressCountryCode(), {},
        &AutofillProfile::GetAddressCountryCode);
    const size_t start_bucket_index = std::distance(profiles.begin(), it_start);
    const size_t end_bucket_index = std::distance(profiles.begin(), it_end);
    base::UmaHistogramCounts100000(
        "Autofill.NumberOfProfilesConsideredForDedupe.ByCountry",
        end_bucket_index - start_bucket_index);
    callback(start_bucket_index, end_bucket_index);
    it_start = it_end;
  }
}
```

Is this safe? IIUC a caller may modify `profiles` in `callback` and then stuff will break here.
Yes, thanks for pointing that out! In practice, I would say profiles are not expected to be removed in the callback but I see your point.
I could pre-compute all bucket ranges first and then apply the callback for each one of them in reverse order (so that in case a profile from a bucket is removed, all the subsequent ranges remain valid). The problem here is that the caller could still (potentially) remove any other profile and break the logic anyway.
Do you have any ideas?
* Country bucketing: Profiles are initially grouped by country code
  and deduplicated as independent subsets.

Do we have any data showing that users with a lot of profiles usually have profiles from different countries?
* Deferred batching: Profile updates and removals are batched and
  deferred to the very end of deduplication, avoiding useless
  intermediate updates for profiles that are ultimately removed.

Is this worth doing? The data manager queues pending updates, since only one of them is executed at a time. I'd be surprised if this makes a measurable difference during deduplication.
```
using Iter = std::vector<AutofillProfile>::iterator;
Iter it_start = profiles.begin();
while (it_start != profiles.end()) {
  const Iter it_end = std::ranges::upper_bound(
      it_start, profiles.end(), it_start->GetAddressCountryCode(), {},
      &AutofillProfile::GetAddressCountryCode);
  const size_t start_bucket_index = std::distance(profiles.begin(), it_start);
  const size_t end_bucket_index = std::distance(profiles.begin(), it_end);
  base::UmaHistogramCounts100000(
      "Autofill.NumberOfProfilesConsideredForDedupe.ByCountry",
      end_bucket_index - start_bucket_index);
  callback(start_bucket_index, end_bucket_index);
  it_start = it_end;
}
```

I think the more canonical approach is using `std::find_if()`. Does something like this work?
```
auto bgn = profiles.begin();
while (bgn != profiles.end()) {
  auto end = std::find_if(bgn + 1, profiles.end(), [&](const AutofillProfile& p) {
    return bgn->GetAddressCountryCode() != p.GetAddressCountryCode();
  });
  // All profiles in [bgn, end[ share the same country code.
  callback(...);
  bgn = end;
}
```
This implementation is O(n) overall, whereas the repeated `upper_bound` calls make the original O(n log n) in the worst case.
// and preserves the `initial_creator_id`. The algorithm proceeds in two steps,
// such that the amount of retained information is maximized without sending the
// data to the account if it was stored locally.

nit: There is an entire paragraph between "the algorithm proceeds in two steps" and the steps now. Can we rephrase this somehow?
profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {

Do we really have to use indices everywhere? The code becomes much harder to read this way.
DeleteDisusedAddresses();

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.
// Other cleanups are performed on every browser start.

This comment was meant to apply to DeleteDisusedAddresses() as well.
* Country bucketing: Profiles are initially grouped by country code
  and deduplicated as independent subsets.

Do we have any data showing that users with a lot of profiles usually have profiles from different countries?
As far as I know, we don't. This is also the reason why we thought about introducing a new metric to count each bucket size.
* Deferred batching: Profile updates and removals are batched and
  deferred to the very end of deduplication, avoiding useless
  intermediate updates for profiles that are ultimately removed.

Is this worth doing? The data manager queues pending updates, since only one of them is executed at a time. I'd be surprised if this makes a measurable difference during deduplication.
I see your point! My idea was to minimize the number of redundant updates. For example, if profiles A and B are subsets of D, D would normally queue two updates. But if D is also a subset of E, those intermediate updates for D become unnecessary.
Do you think this change is not worth it?
Looks better! I will update it :)
// and preserves the `initial_creator_id`. The algorithm proceeds in two steps,
// such that the amount of retained information is maximized without sending the
// data to the account if it was stored locally.

nit: There is an entire paragraph between "the algorithm proceeds in two steps" and the steps now. Can we rephrase this somehow?
Sure!
profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {

Do we really have to use indices everywhere? The code becomes much harder to read this way.
I could pass a base::span<AutofillProfile>, but then I would have to use std::distance() to derive the index. Do you see that as a better option?
DeleteDisusedAddresses();

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Ah, good to know! I see that the unit test DeleteDisusedAddresses is more or less doing the same thing, expecting that profiles are removed immediately after calling DeleteDisusedAddresses(). Does that mean the test result is unreliable?
Maybe I can reuse IsAutofillEntryWithUseDateDeletable() and delete the profiles manually at the very beginning of DeduplicateProfiles().
For example, if profiles A and B are subsets of D, D would normally queue two updates. But if D is also a subset of E, those intermediate updates for D become unnecessary.
That's true, but the point is that the updates are queued. So they don't impact browser startup metrics - which I think is what we are concerned about?
profiles, [&](size_t start_bucket_index, size_t end_bucket_index) {

Do we really have to use indices everywhere? The code becomes much harder to read this way.
I could pass a base::span<AutofillProfile>, but then I would have to use std::distance() to derive the index. Do you see that as a better option?
No, `std::distance()` seems even worse :)
Maybe one option is to define a custom struct containing the status and the profile, and to operate on those structs instead.
That said, I'm not convinced that these optimisations are worth it to begin with.
DeleteDisusedAddresses();

Moving this up here will do very little. Removing a profile from the data manager is asynchronous. So by the time you call `ApplyDeduplicationRoutine()`, disused profiles will still be contained in the return value of `GetProfiles()`.

Ah, good to know! I see that the unit test DeleteDisusedAddresses is more or less doing the same thing, expecting that profiles are removed immediately after calling DeleteDisusedAddresses(). Does that mean the test result is unreliable?
The test uses a TestAddressDataManager, which [is synchronous](https://source.chromium.org/chromium/chromium/src/+/main:components/autofill/core/browser/data_manager/addresses/test_address_data_manager.h;l=16-17?q=TestAddressDataManager).