[Reader Mode] Extract and use lead images and captions. [chromium/src : main]

Samuel Huang (Gerrit)

Dec 12, 2025, 5:08:41 PM
to Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org
Attention needed from Brandon Wylie

Samuel Huang voted and added 1 comment

Votes added by Samuel Huang

Commit-Queue+1

1 comment

Patchset-level comments
File-level comment, Patchset 8 (Latest):
Samuel Huang . resolved

This needed more tweaking than I thought... Might need to iterate still, and add more tests next week. PTAL. Thanks!

Open in Gerrit

Related details

Attention is currently required from:
  • Brandon Wylie
Submit Requirements:
  • requirement satisfiedCode-Coverage
  • requirement is not satisfiedCode-Owners
  • requirement is not satisfiedCode-Review
  • requirement is not satisfiedReview-Enforcement
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. DiffyGerrit
Gerrit-MessageType: comment
Gerrit-Project: chromium/src
Gerrit-Branch: main
Gerrit-Change-Id: I0cd80502e68a72563f771e1b0c1a1fa4f6be47d5
Gerrit-Change-Number: 7248368
Gerrit-PatchSet: 8
Gerrit-Owner: Samuel Huang <hua...@chromium.org>
Gerrit-Reviewer: Brandon Wylie <wyl...@google.com>
Gerrit-Reviewer: Samuel Huang <hua...@chromium.org>
Gerrit-Attention: Brandon Wylie <wyl...@google.com>
Gerrit-Comment-Date: Fri, 12 Dec 2025 22:08:37 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: Yes

Brandon Wylie (Gerrit)

Dec 12, 2025, 5:29:41 PM
to Samuel Huang, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org
Attention needed from Samuel Huang

Brandon Wylie added 1 comment

File third_party/readability/modded_src/Readability.js
File-level comment, Patchset 9 (Latest):
Brandon Wylie . unresolved

Why emit the lead image as data? Seems like leaving it in the resulting HTML is much simpler and aligns with the current distillation infra (both readability & chrome).

Related details

Attention is currently required from:
  • Samuel Huang
Submit Requirements:
    • requirement satisfiedCode-Coverage
    • requirement is not satisfiedCode-Owners
    • requirement is not satisfiedCode-Review
    • requirement is not satisfiedNo-Unresolved-Comments
    • requirement is not satisfiedReview-Enforcement
    Gerrit-PatchSet: 9
    Gerrit-Comment-Date: Fri, 12 Dec 2025 22:29:32 +0000

    Samuel Huang (Gerrit)

    Dec 15, 2025, 2:37 PM
    to Chromium LUCI CQ, Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org
    Attention needed from Brandon Wylie

    Samuel Huang added 1 comment

    File third_party/readability/modded_src/Readability.js
    Brandon Wylie . unresolved

    Samuel Huang

    Per verbose commit comment:

    * Why extract the image + caption to a separate field, instead of simply
      prepending to content? The latter is much simpler, and limits the
      change to Readability.js only. However, the former was chosen because it:
      * Avoids massive changes to Readability tests once we attempt to
        upstream the test (though we might need to add new tests), and
        reduces churn if we need to iterate.
      * Gives the caller (Chrome) more freedom to post-process and show
        extracted data.

    However, today I realized that Readability's sibling inclusion may also include the hero image / caption, leading to duplication! I alleviated this by removing the elements that contain the lead image / caption (to de-dup). But this also affects tests in Readability! So if conflict is inevitable, may as well go back to the simpler approach?

    Related details

    Attention is currently required from:
    • Brandon Wylie
    Submit Requirements:
    • requirement satisfiedCode-Coverage
    • requirement is not satisfiedCode-Owners
    • requirement is not satisfiedCode-Review
    • requirement is not satisfiedNo-Unresolved-Comments
    • requirement is not satisfiedReview-Enforcement
    Gerrit-PatchSet: 12
    Gerrit-Comment-Date: Mon, 15 Dec 2025 19:37:47 +0000

    Brandon Wylie (Gerrit)

    Dec 15, 2025, 2:58 PM
    to Samuel Huang, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org
    Attention needed from Samuel Huang

    Brandon Wylie added 1 comment

    File third_party/readability/modded_src/Readability.js
    Brandon Wylie . unresolved

    Brandon Wylie

    Yes, I think relying on the source document rather than emitting new forms of raw data will (1) keep us closer to Readability implementation-wise, and (2) reduce complexity and shim layers between the distillation result and its display in Chrome.

    Related details

    Attention is currently required from:
    • Samuel Huang
    Submit Requirements:
    • requirement satisfiedCode-Coverage
    • requirement is not satisfiedCode-Owners
    • requirement is not satisfiedCode-Review
    • requirement is not satisfiedNo-Unresolved-Comments
    • requirement is not satisfiedReview-Enforcement
    Gerrit-PatchSet: 12
    Gerrit-Comment-Date: Mon, 15 Dec 2025 19:58:46 +0000

    Samuel Huang (Gerrit)

    Dec 15, 2025, 4:19 PM
    to Chromium LUCI CQ, Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org
    Attention needed from Brandon Wylie

    Samuel Huang voted and added 2 comments

    Votes added by Samuel Huang

    Commit-Queue+1

    2 comments

    Patchset-level comments
    File-level comment, Patchset 14 (Latest):
    Samuel Huang . resolved

    Updated, PTAL. Thanks!

    File third_party/readability/modded_src/Readability.js
    File-level comment, Patchset 9:
    Brandon Wylie . resolved

    Samuel Huang

    Done

    Related details

    Attention is currently required from:
    • Brandon Wylie
    Submit Requirements:
    • requirement satisfiedCode-Coverage
    • requirement satisfiedCode-Owners
    • requirement is not satisfiedCode-Review
    • requirement is not satisfiedReview-Enforcement
    Gerrit-PatchSet: 14
    Gerrit-Comment-Date: Mon, 15 Dec 2025 21:19:46 +0000

    Brandon Wylie (Gerrit)

    Dec 15, 2025, 4:37 PM
    to Samuel Huang, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org
    Attention needed from Samuel Huang

    Brandon Wylie voted and added 1 comment

    Votes added by Brandon Wylie

    Code-Review+1

    1 comment

    Patchset-level comments
    Brandon Wylie . resolved

    lgtm

    Related details

    Attention is currently required from:
    • Samuel Huang
    Submit Requirements:
      • requirement satisfiedCode-Coverage
      • requirement satisfiedCode-Owners
      • requirement satisfiedCode-Review
      • requirement satisfiedReview-Enforcement
      Gerrit-PatchSet: 14
      Gerrit-Comment-Date: Mon, 15 Dec 2025 21:37:07 +0000

      Samuel Huang (Gerrit)

      Dec 15, 2025, 5:27 PM
      to Brandon Wylie, Chromium LUCI CQ, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

      Samuel Huang voted and added 1 comment

      Votes added by Samuel Huang

      Commit-Queue+2

      1 comment

      Patchset-level comments
      Samuel Huang . resolved

      Thanks, committing!

      There will be follow-ups on this. Specifically:

      • The Readability test on https://en.wikipedia.org/wiki/New_Zealand creates a lot of deltas. This should be investigated.
      • The "Visualize" feature on the test extension does not recognize the extracted lead image, since we've created a new `<figure>` element. We can either leave it alone, make the test extension smarter, or change our approach (i.e., don't synthesize a `<figure>` element, and instead include the source elements directly).

      Related details

      Attention set is empty
      Submit Requirements:
      • requirement satisfiedCode-Coverage
      • requirement satisfiedCode-Owners
      • requirement satisfiedCode-Review
      • requirement satisfiedReview-Enforcement
      Gerrit-PatchSet: 14
      Gerrit-Comment-Date: Mon, 15 Dec 2025 22:27:05 +0000

      Chromium LUCI CQ (Gerrit)

      Dec 15, 2025, 5:35 PM
      to Samuel Huang, Brandon Wylie, AyeAye, chromium...@chromium.org, chromium-a...@chromium.org, extension...@chromium.org

      Chromium LUCI CQ submitted the change

      Change information

      Commit message:
      [Reader Mode] Extract and use lead images and captions.

      Readability.js sometimes failed to extract hero images. Debugging shows
      that on some websites this happens because the hero image is:
      * In a "top candidate" sibling that scores too low for inclusion.
      * In a sibling of "top candidate"'s parent.

      This CL addresses the above cases by updating Readability.js (local mode):
      * After finding the top candidate, scan earlier elements for "lead images"
        and caption (optional).
      * If found, remove the elements containing the data (to prevent
        redundant inclusion as sibling), and synthesize a <figure> element
        that's added to the returned content.
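
The <figure> synthesis step can be pictured with a minimal sketch. This is not the CL's code: `buildLeadFigure` and its parameters are hypothetical names, and the only assumption is a Document-like `doc` offering `createElement()`.

```javascript
// Hypothetical helper (not from the CL): wraps an extracted image URL and an
// optional caption HTML string in a synthesized <figure> element.
function buildLeadFigure(doc, imageUrl, captionHtml) {
  const figure = doc.createElement("figure");

  const img = doc.createElement("img");
  img.setAttribute("src", imageUrl);
  figure.appendChild(img);

  if (captionHtml) {
    const caption = doc.createElement("figcaption");
    // The caption is carried as an HTML string and injected via innerHTML,
    // which keeps entities and inline formatting.
    caption.innerHTML = captionHtml;
    figure.appendChild(caption);
  }
  return figure;
}
```

In a browser, a caller would presumably pass `document` and insert the returned node at the start of the article content.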

      Design considerations:
      * "Lead image": The term is more precise re. location of extracted data
        (appearing before the top candidate), and is purposefully different
        from "hero image", since hero image may be inside top candidate
        already -- in this case we can skip extraction.
      * To find the lead image we scan 4 "previous elements" before the top
        candidate, i.e., scan "previous siblings", and on exhaustion, visit
        parental previous siblings.
      * An alternative is to extract the lead image and caption into separate
        fields, and let callers process them. This approach sounded attractive
        at first, but was rejected due to complexity (we'd need to update the
        Chromium C++ pipeline and test extension).
      * Image scoring: Ideally we'd prefer large images. However, in
        Readability.js we might not be able to robustly get image dimensions.
        Therefore we estimate the importance of an image from other signals
        (e.g., presence of srcset or alt text).
      * Extracted image: Take image URL.
      * Extracted caption: Take as HTML string, expected to be injected via
        innerHTML. This preserves entities and some amount of formatting.
      Readability.js details:
      * Add _argmax(), _getPreviousElements() helpers.
      * Scoring: Use scoring helpers that return "ratings", which are
        objects with a numerical score (to be fed to _argmax()) and a "payload"
        that can be used to get the final result.
      * Add _rateLeadImageIn() to score an element re. likelihood to contain
        a hero image. Criteria: An image must exist.
      * Add _rateLeadCaptionIn() to score an element re. likelihood to contain
        a caption for the hero image.
      * Add _getLeadImageData(): Scans 4 previous elements and takes the one with
        the best lead image score beyond threshold. Scans elements between this
        element and the top candidate (exclusive) to extract the caption.
        Returns result (possibly null) including `affectedElements` (the DOM
        nodes from which data was extracted).
      * _grabArticle(): Calls _getLeadImageData() to get lead image data. If
        found, removes `affectedElements` to avoid duplicate adds from
        sibling addition, then makes a <figure> element using the extracted data
        and adds it to the beginning of the content.
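
As a rough illustration of the "ratings" pattern (objects with a score and a payload, reduced by an argmax and gated by a threshold), consider the sketch below. The names `argmax`, `rateImage`, and `pickLeadImage` are hypothetical, and the weights are made up for the example; the CL's actual scoring values are not given in this thread.

```javascript
// Returns the rating with the highest score, or null for an empty list.
// On ties, the earliest rating wins.
function argmax(ratings) {
  let best = null;
  for (const rating of ratings) {
    if (best === null || rating.score > best.score) {
      best = rating;
    }
  }
  return best;
}

// Hypothetical rating: dimensions may be unavailable, so importance is
// estimated from other signals (srcset, alt text). Weights are illustrative.
function rateImage(img) {
  let score = 1; // baseline: an image exists
  if (img.getAttribute("srcset")) {
    score += 2;
  }
  const alt = img.getAttribute("alt");
  if (alt && alt.trim().length > 0) {
    score += 1;
  }
  return { score, payload: img };
}

// Picks the best-rated image, but only if its score is beyond the threshold.
function pickLeadImage(images, threshold) {
  const best = argmax(images.map(rateImage));
  return best !== null && best.score > threshold ? best.payload : null;
}
```

Keeping the payload next to the score means the caller can recover the winning element after the argmax without a second lookup.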
      Bug: 424854317, 450069061
      Change-Id: I0cd80502e68a72563f771e1b0c1a1fa4f6be47d5
      Commit-Queue: Samuel Huang <hua...@chromium.org>
      Reviewed-by: Brandon Wylie <wyl...@google.com>
      Cr-Commit-Position: refs/heads/main@{#1559003}
      Files:
      • M third_party/readability/README.chromium
      • M third_party/readability/modded_src/Readability.js
      Change size: M
      Delta: 2 files changed, 200 insertions(+), 0 deletions(-)
      Branch: refs/heads/main
      Submit Requirements:
      • requirement satisfiedCode-Review: +1 by Brandon Wylie
      Gerrit-MessageType: merged
      Gerrit-PatchSet: 15