Samuel Huang (Gerrit)

Dec 12, 2025, 5:08:41 PM (3 days ago) Dec 12

to Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Attention needed from Brandon Wylie

Samuel Huang voted and added 1 comment

Votes added by Samuel Huang

Commit-Queue

+1

1 comment

Patchset-level comments

File-level comment, Patchset 8 (Latest):

Samuel Huang . resolved

This needed more tweaking than I thought... Might need to iterate still, and add more tests next week. PTAL. Thanks!

Open in Gerrit

Related details

Attention is currently required from:

Brandon Wylie

Submit Requirements:

Code-Coverage
Code-Owners
Code-Review
Review-Enforcement

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Brandon Wylie (Gerrit)

Dec 12, 2025, 5:29:41 PM (3 days ago) Dec 12

to Samuel Huang, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Attention needed from Samuel Huang

Brandon Wylie added 1 comment

File third_party/readability/modded_src/Readability.js

File-level comment, Patchset 9 (Latest):

Brandon Wylie . unresolved

Why emit the lead image as data? Seems like leaving it in the resulting HTML is much simpler and aligns with the current distillation infra (both readability & chrome).

Samuel Huang (Gerrit)

2:37 PM (9 hours ago) 2:37 PM

to Chromium LUCI CQ, Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Attention needed from Brandon Wylie

Samuel Huang added 1 comment

File third_party/readability/modded_src/Readability.js

File-level comment, Patchset 9:

Brandon Wylie . unresolved

Samuel Huang

Per verbose commit comment:

* Why extract the image + caption to a separate field, instead of simply
  prepending to content? The latter is much simpler, and limits the
  change to Readability.js only. However, the former was chosen because it:
   * Avoids massive changes to Readability tests once we attempt to
     upstream the test (though we might need to add new tests), and
     reduces churn if we need to iterate.
   * Gives the caller (Chrome) more freedom to post-process and show
     extracted data.

However, today I realized that Readability's sibling inclusion may also include the hero image / caption, leading to duplication! I alleviated this by removing elements if they contain the lead image / caption (to de-dup). But this also affects tests in Readability! So if conflict is inevitable, we may as well go back to the simpler approach?

Brandon Wylie (Gerrit)

2:58 PM (9 hours ago) 2:58 PM

to Samuel Huang, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Attention needed from Samuel Huang

Brandon Wylie added 1 comment

File third_party/readability/modded_src/Readability.js

File-level comment, Patchset 9:

Brandon Wylie . unresolved

Brandon Wylie

Yes, I think relying on the source document rather than emitting new forms of raw data will (1) keep us closer to Readability implementation-wise, and (2) reduce complexity and shim layers between the distillation result and its display in Chrome.

Samuel Huang (Gerrit)

4:19 PM (8 hours ago) 4:19 PM

to Chromium LUCI CQ, Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Attention needed from Brandon Wylie

Samuel Huang voted and added 2 comments

Votes added by Samuel Huang

Commit-Queue

+1

2 comments

Patchset-level comments

File-level comment, Patchset 14 (Latest):

Samuel Huang . resolved

Updated, PTAL. Thanks!

File third_party/readability/modded_src/Readability.js

File-level comment, Patchset 9:

Brandon Wylie . resolved

Samuel Huang

Done

Brandon Wylie (Gerrit)

4:37 PM (7 hours ago) 4:37 PM

to Samuel Huang, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Attention needed from Samuel Huang

Brandon Wylie voted and added 1 comment

Votes added by Brandon Wylie

Code-Review

+1

1 comment

Patchset-level comments

File-level comment, Patchset 14 (Latest):

Brandon Wylie . resolved

lgtm

Samuel Huang (Gerrit)

5:27 PM (6 hours ago) 5:27 PM

to Brandon Wylie, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Samuel Huang voted and added 1 comment

Votes added by Samuel Huang

Commit-Queue

+2

1 comment

Patchset-level comments

File-level comment, Patchset 14 (Latest):

Samuel Huang . resolved

Thanks, committing!

There will be follow-ups on this. Specifically:

* The Readability test on https://en.wikipedia.org/wiki/New_Zealand creates a lot of deltas. This should be investigated.
* The "Visualize" feature on the test extension does not recognize the extracted lead image, since we've created a new `<figure>` element. We can either leave it alone, make the test extension smarter, or change our approach (i.e., don't synthesize a `<figure>` element, and instead include the source elements directly).

Chromium LUCI CQ (Gerrit)

5:35 PM (6 hours ago) 5:35 PM

to Samuel Huang, Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

Chromium LUCI CQ submitted the change

Change information

Commit message:

[Reader Mode] Extract and use lead images and captions.

Readability.js sometimes failed to extract hero images. Debugging shows
that on some websites this happens because the hero image is:
* In a "top candidate" sibling that scores too low for inclusion.
* In a sibling of "top candidate"'s parent.

This CL addresses the above cases by updating Readability.js (local mode):
* After finding the top candidate, scan earlier elements for a "lead
  image" and an optional caption.
* If found, remove the elements containing the data (to prevent
  redundant inclusion as sibling), and synthesize a <figure> element
  that's added to the returned content.
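
A minimal sketch of the <figure> synthesis step, assuming (as the design notes in this commit message state) that the image is carried as a URL and the caption as an HTML string. The helper names `escapeAttribute`, `buildLeadFigureHtml`, and `prependLeadFigure` are hypothetical and not taken from the CL, and the sketch builds markup as strings to stay DOM-free, whereas the CL creates an element:

```javascript
// Hypothetical helpers for illustration only; not the CL's code.

// Escapes a value for use inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Builds <figure> markup from an image URL and an optional caption.
function buildLeadFigureHtml(imageUrl, captionHtml) {
  const img = `<img src="${escapeAttribute(imageUrl)}">`;
  // Assumption: the caption is already-serialized HTML taken from the
  // source page, so it is inserted as-is to keep entities and formatting.
  const caption = captionHtml ? `<figcaption>${captionHtml}</figcaption>` : "";
  return `<figure>${img}${caption}</figure>`;
}

// Adds the synthesized figure to the beginning of the extracted content.
function prependLeadFigure(contentHtml, imageUrl, captionHtml) {
  return buildLeadFigureHtml(imageUrl, captionHtml) + contentHtml;
}
```

When no caption is found, the figure contains only the image, which matches the "optional caption" wording above.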

Design considerations:
* "Lead image": The term is more precise re. location of extracted data
  (appearing before the top candidate), and is purposefully different
  from "hero image", since hero image may be inside top candidate
  already -- in this case we can skip extraction.
* To find the lead image we scan 4 "previous elements" before the top
  candidate, i.e., scan "previous siblings", and on exhaustion, visit
  the parent's previous siblings.
* An alternative is to extract the lead image and caption into separate
  fields, and let callers process them. This approach sounded attractive
  at first, but was rejected due to complexity (we'd need to update the
  Chromium C++ pipeline and test extension).
* Image scoring: Ideally we'd prefer large images. However, in
  Readability.js we might not be able to robustly get image dimensions.
  Therefore we estimate the importance of an image from other signals
  (e.g., presence of srcset or alt text).
* Extracted image: Take image URL.
* Extracted caption: Take as HTML string, expected to be injected via
  innerHTML. This preserves entities and some amount of formatting.
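
The "previous elements" walk described in the design considerations can be sketched roughly as follows. This is an illustration under assumptions, not the CL's `_getPreviousElements()`: the function signature, the choice not to collect the parent itself, and the `makeNode` stand-in for DOM elements are all hypothetical.

```javascript
// Hypothetical sketch; the real helper may differ in signature and details.
// Nodes only need `previousElementSibling` and `parentElement`, which DOM
// elements provide.
function getPreviousElements(start, maxCount) {
  const result = [];
  let node = start;
  while (node && result.length < maxCount) {
    if (node.previousElementSibling) {
      node = node.previousElementSibling;
      result.push(node);
    } else {
      // Siblings exhausted: climb to the parent and continue with its
      // previous siblings. The parent itself is not collected (assumption).
      node = node.parentElement;
    }
  }
  return result;
}

// Tiny stand-in for a DOM tree so the sketch runs without a browser.
function makeNode(name, children = []) {
  const node = { name, parentElement: null, previousElementSibling: null };
  children.forEach((child, i) => {
    child.parentElement = node;
    child.previousElementSibling = i > 0 ? children[i - 1] : null;
  });
  return node;
}

// root > [hero, header, wrapper > [byline, article-body]]
const hero = makeNode("hero");
const header = makeNode("header");
const byline = makeNode("byline");
const topCandidate = makeNode("article-body");
const wrapper = makeNode("wrapper", [byline, topCandidate]);
const root = makeNode("root", [hero, header, wrapper]);

const names = getPreviousElements(topCandidate, 4).map((n) => n.name);
console.log(names.join(", "));
```

In this toy tree the walk visits `byline` first, then moves up to `wrapper` and picks up `header` and `hero`, which covers the "sibling of the top candidate's parent" case mentioned earlier in the commit message.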

Readability.js details:
* Add _argmax(), _getPreviousElements() helpers.
* Scoring: Using scoring helpers that return "ratings", which are
  objects with numerical score (to be fed to _argmax()) and "payload"
  that can be used to get final result.
* Add _rateLeadImageIn() to score an element re. likelihood of
  containing a hero image. Criteria: An image must exist.
* Add _rateLeadCaptionIn() to score an element re. likelihood of
  containing the caption for the hero image.
* Add _getLeadImageData(): Scans 4 previous elements and takes the one
  with the best lead image score beyond a threshold. Scans elements
  between this element and the top candidate (exclusive) to extract the
  caption. Returns result (possibly null) including `affectedElements`
  (the DOM nodes from which data was extracted).
* _grabArticle(): Calls _getLeadImageData() to get lead image data. If
  found, removes `affectedElements` to avoid duplicate adds from
  sibling addition, then makes a <figure> element using extracted data
  and adds it to the beginning of content.
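
To illustrate the "ratings" idea (an object with a numerical score plus a payload, reduced with an argmax and a threshold), here is a rough sketch. The function names, the plain-data candidates, the weights, and the threshold value are all assumptions made for illustration; the CL's `_argmax()` and `_rateLeadImageIn()` operate on DOM elements and may score differently.

```javascript
// Hypothetical sketch; weights and threshold are illustrative assumptions.

// Returns the item with the highest score, or null for an empty list.
function argmax(items, scoreOf) {
  let best = null;
  let bestScore = -Infinity;
  for (const item of items) {
    const score = scoreOf(item);
    if (score > bestScore) {
      best = item;
      bestScore = score;
    }
  }
  return best;
}

// Rates a candidate described by plain data. Returns null when no image
// exists, otherwise a rating of the form { score, payload }.
function rateLeadImage(candidate) {
  if (!candidate.src) return null;
  let score = 1;
  if (candidate.srcset) score += 2; // assumed signal: srcset present
  if (candidate.alt) score += 1;    // assumed signal: alt text present
  return { score, payload: { imageUrl: candidate.src } };
}

// Picks the best-rated candidate whose score reaches the threshold.
function pickLeadImage(candidates, threshold) {
  const ratings = candidates.map(rateLeadImage).filter((r) => r !== null);
  const best = argmax(ratings, (r) => r.score);
  return best && best.score >= threshold ? best.payload : null;
}

const picked = pickLeadImage(
  [
    { src: "icon.png" },
    { src: "hero.jpg", srcset: "hero-2x.jpg 2x", alt: "Harbour at dusk" },
    { src: null },
  ],
  3
);
console.log(picked);
```

Keeping the payload next to the score means the winning rating already carries what the caller needs, so no second pass over the elements is required.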

Bug: 424854317, 450069061

Change-Id: I0cd80502e68a72563f771e1b0c1a1fa4f6be47d5

Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7248368

Commit-Queue: Samuel Huang <hua...@chromium.org>

Reviewed-by: Brandon Wylie <wyl...@google.com>

Cr-Commit-Position: refs/heads/main@{#1559003}

Files:

M third_party/readability/README.chromium
M third_party/readability/modded_src/Readability.js

Change size: M

Delta: 2 files changed, 200 insertions(+), 0 deletions(-)

Branch: refs/heads/main

Submit Requirements:

Code-Review: +1 by Brandon Wylie
