| Commit-Queue | +1 |
This needed more tweaking than I thought... Might need to iterate still, and add more tests next week. PTAL. Thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Samuel Huang: Why emit the lead image as data? Leaving it in the resulting HTML seems much simpler and aligns with the current distillation infra (both Readability and Chrome).
Brandon Wylie: Per the verbose commit comment:
* Why extract the image + caption to a separate field, instead of simply
prepending to content? The latter is much simpler, and limits the
change to Readability.js only. However, the former was chosen because it:
* Avoids massive changes to Readability tests once we attempt to
upstream the test (though we might need to add new tests), and
reduces churn if we need to iterate.
* Gives the caller (Chrome) more freedom to post-process and show
extracted data.
However, today I realized that Readability's sibling inclusion may also include the hero image / caption, leading to duplication! I alleviated this by removing elements if they contain the lead image / caption (to de-dup). But this also affects tests in Readability! So if conflict is inevitable, we may as well go back to the simpler approach?
Yes, I think relying on the source document rather than emitting new forms of raw data will (1) keep us closer to Readability implementation-wise, and (2) reduce complexity and shim layers between the distillation result and its display in Chrome.
| Commit-Queue | +1 |
| Code-Review | +1 |
| Commit-Queue | +2 |
Thanks, committing!
There will be follow-ups on this. Specifically:
[Reader Mode] Extract and use lead images and captions.
Readability.js sometimes failed to extract hero images. Debugging shows
that on some websites this happens because the hero image is:
* In a "top candidate" sibling that scores too low for inclusion.
* In a sibling of "top candidate"'s parent.
This CL addresses the above cases by updating Readability.js (local mode):
* After finding top candidate, scan earlier elements for "lead images"
and caption (optional).
* If found, remove the elements containing the data (to prevent
redundant inclusion as sibling), and synthesize a <figure> element
that's added to the returned content.
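The <figure> synthesis step can be sketched as follows. This is a minimal illustration, not the code in this CL: the function name `makeLeadFigure` and the field names `imageUrl` and `captionHtml` are assumptions made for the example. Only standard DOM calls (`createElement`, `setAttribute`, `appendChild`, `innerHTML`) are used.

```javascript
// Hypothetical sketch: build a <figure> from extracted lead image data.
// `doc` is a Document; `leadData` is an assumed shape
// { imageUrl: string, captionHtml: string | null }.
function makeLeadFigure(doc, leadData) {
  const figure = doc.createElement("figure");

  const img = doc.createElement("img");
  img.setAttribute("src", leadData.imageUrl);
  figure.appendChild(img);

  // The caption is optional. It is injected as an HTML string so that
  // entities and light formatting survive.
  if (leadData.captionHtml) {
    const caption = doc.createElement("figcaption");
    caption.innerHTML = leadData.captionHtml;
    figure.appendChild(caption);
  }
  return figure;
}

// In a page context this could be called as, for example:
//   content.insertBefore(makeLeadFigure(document, data), content.firstChild);
```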
Design considerations:
* "Lead image": The term is more precise re. location of extracted data
(appearing before the top candidate), and is purposefully different
from "hero image", since hero image may be inside top candidate
already -- in this case we can skip extraction.
* To find the lead image we scan 4 "previous elements" before the top
candidate, i.e., scan "previous siblings", and on exhaustion, visit
the parent's previous siblings.
* An alternative is to extract the lead image and caption into separate
fields, and let callers process them. This approach sounded attractive
at first, but was rejected due to complexity (we'd need to update the
Chromium C++ pipeline and test extension).
* Image scoring: Ideally we'd prefer large images. However, in
Readability.js we might not be able to robustly get image dimensions.
Therefore we estimate the importance of an image from other signals
(e.g., presence of srcset or alt text).
* Extracted image: Take image URL.
* Extracted caption: Take as HTML string, expected to be injected via
innerHTML. This preserves entities and some amount of formatting.
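The "previous elements" traversal described above can be sketched as follows. This is an illustration under assumptions, not the CL's _getPreviousElements(): the function name, the `limit` parameter, and the `mockElement` helper are hypothetical. The traversal relies only on the standard `previousElementSibling` and `parentElement` properties, so it works on real DOM elements as well as on the plain-object mocks used here.

```javascript
// Hypothetical helper: links plain objects so they mimic the two DOM
// properties the traversal reads (previousElementSibling, parentElement).
function mockElement(name, children = []) {
  const el = { name, parentElement: null, previousElementSibling: null };
  children.forEach((child, i) => {
    child.parentElement = el;
    child.previousElementSibling = i > 0 ? children[i - 1] : null;
  });
  return el;
}

// Collect up to `limit` elements that precede `start`, nearest first.
// Previous siblings are visited first; once they are exhausted, the walk
// climbs to the parent and continues with the parent's previous siblings.
// The parent itself is not collected, since it contains `start`.
function getPreviousElements(start, limit) {
  const found = [];
  let node = start;
  while (node && found.length < limit) {
    if (node.previousElementSibling) {
      node = node.previousElementSibling;
      found.push(node);
    } else {
      node = node.parentElement;
    }
  }
  return found;
}

// body > [header, hero, wrapper > [byline, article]]
const hero = mockElement("hero");
const byline = mockElement("byline");
const article = mockElement("article");
const wrapper = mockElement("wrapper", [byline, article]);
const header = mockElement("header");
const body = mockElement("body", [header, hero, wrapper]);

// Nearest first: byline, hero, header
const names = getPreviousElements(article, 4).map((el) => el.name);
console.log(names);
```

Here the hero element is a sibling of the top candidate's parent, which is the second failure case listed at the top of this commit message.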
Readability.js details:
* Add _argmax(), _getPreviousElements() helpers.
* Scoring: Use scoring helpers that return "ratings", which are
objects with a numerical score (to be fed to _argmax()) and a "payload"
that can be used to get the final result.
* Add _rateLeadImageIn() to score an element re. likelihood of containing
a hero image. Criteria: An image must exist.
* Add _rateLeadCaptionIn() to score an element re. likelihood of
containing a caption for the hero image.
* Add _getLeadImageData(): Scans 4 previous elements and takes the one
with the best lead image score above a threshold. Scans elements between
this element and the top candidate (exclusive) to extract the caption.
Returns result (possibly null) including `affectedElements` (the DOM
nodes from which data was extracted).
* _grabArticle(): Calls _getLeadImageData() to get lead image data. If
found, removes `affectedElements` to avoid duplicate adds from
sibling addition, then makes a <figure> element using extracted data
and adds it to beginning of content.
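The rating-plus-argmax pattern above can be sketched as follows. This is a minimal illustration, not the code in this CL: the function names, the candidate shape ({ name, images }), the score weights, and the threshold value are all assumptions. Real code would inspect DOM elements rather than plain objects.

```javascript
// Hypothetical rating helper. Returns { score, payload } for the best
// image in a candidate, or null when the candidate has no image.
// The weights below are illustrative assumptions only.
function rateLeadImage(candidate) {
  let best = null;
  for (const img of candidate.images) {
    if (!img.src) continue;
    let score = 1; // an image exists
    if (img.srcset) score += 2; // responsive sources hint at importance
    if (img.alt) score += 1; // alt text hints at a content image
    if (!best || score > best.score) {
      best = { score, payload: { url: img.src } };
    }
  }
  return best;
}

// Hypothetical argmax: returns { item, rating } for the item whose rating
// has the highest score strictly above `threshold`, or null if none does.
function argmax(items, rate, threshold) {
  let best = null;
  for (const item of items) {
    const rating = rate(item);
    if (!rating || rating.score <= threshold) continue;
    if (!best || rating.score > best.rating.score) {
      best = { item, rating };
    }
  }
  return best;
}

const candidates = [
  { name: "byline", images: [] },
  { name: "hero", images: [{ src: "hero.jpg", srcset: "hero-2x.jpg 2x", alt: "A bridge" }] },
  { name: "header", images: [{ src: "logo.png" }] },
];

// "hero" scores 4; "header" scores 1 and does not clear the threshold.
const best = argmax(candidates, rateLeadImage, 1);
console.log(best ? best.item.name : null, best ? best.rating.payload.url : null);
```

Keeping the payload next to the score means the winning candidate's extracted data is available without rescanning the element.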