Extracting the main text of a page using Fathom?


Don

May 1, 2017, 8:27:09 PM
to FilterBubbler User
Just saw this today -- could be useful for avoiding training on navigation, copyright notices, and other extra stuff on a page: https://github.com/mozilla/fathom/blob/master/examples/readability.js

More info: https://mozilla.github.io/fathom/intro.html

"Fathom is a JavaScript framework for extracting meaning from web pages, identifying parts like Previous/Next buttons, address forms, and the main textual content—or classifying a page as a whole. Essentially, it scores DOM nodes and extracts them based on conditions you specify"

Ean Schuessler

May 1, 2017, 8:31:59 PM
to Don, FilterBubbler User
I just gave it a skim and it looks really interesting. It needs access to the DOM, so it would have to run as a content script in a WebExtension. That raises some interesting questions about how FilterBubbler recipes should work. I was thinking that text analysis should always run as a background script so that its state isn't constantly being blown away and restored, but you do not have access to the DOM from there. We may need to have recipes inject things like Fathom into the content script environment, but maybe I'm missing something. Let me give this a read and I'll respond in more detail.
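
To make the content-script/background split concrete, here is a rough sketch of how it might look. Everything in it is an assumption for illustration: the `PageTextMessage` shape, `extractMainText`, and `BackgroundState` are hypothetical names, not part of FilterBubbler or Fathom, and the tag-skipping heuristic is only a crude stand-in for Fathom's node scoring. The idea is that the DOM-touching part runs in the content script and only plain text crosses over to the long-lived background state.

```typescript
// Minimal structural type so the sketch runs outside a browser.
// A real DOM Element also satisfies it (tagName, textContent, children).
interface NodeLike {
  tagName: string;
  textContent: string | null;
  children: ArrayLike<NodeLike>;
}

// Hypothetical message passed from the content script to the background script.
interface PageTextMessage {
  kind: "page-text";
  url: string;
  text: string;
}

// Subtrees rooted at these tags are treated as boilerplate (assumed list).
const SKIPPED_TAGS: string[] = ["NAV", "HEADER", "FOOTER", "ASIDE", "SCRIPT", "STYLE"];

// Content-script side: collect the text of <p> elements, skipping boilerplate
// subtrees. This is a stand-in heuristic, not Fathom's ruleset.
function extractMainText(root: NodeLike): string {
  const parts: string[] = [];
  const visit = (node: NodeLike): void => {
    const tag = node.tagName.toUpperCase();
    if (SKIPPED_TAGS.indexOf(tag) !== -1) return;
    if (tag === "P") {
      const text = (node.textContent || "").trim();
      if (text.length > 0) parts.push(text);
      return;
    }
    for (let i = 0; i < node.children.length; i++) {
      const child = node.children[i];
      if (child !== undefined) visit(child);
    }
  };
  visit(root);
  return parts.join("\n");
}

// In a WebExtension, the content script would hand this object to
// browser.runtime.sendMessage(...) instead of calling the background directly.
function buildMessage(url: string, root: NodeLike): PageTextMessage {
  return { kind: "page-text", url: url, text: extractMainText(root) };
}

// Background side: state that survives across pages. In a WebExtension,
// handle() would be called from a browser.runtime.onMessage listener.
class BackgroundState {
  private wordCounts: Record<string, number> = Object.create(null);
  private pages = 0;

  handle(message: PageTextMessage): void {
    this.pages += 1;
    const words = message.text.toLowerCase().split(/[^a-z0-9]+/);
    for (const word of words) {
      if (word.length === 0) continue;
      this.wordCounts[word] = (this.wordCounts[word] || 0) + 1;
    }
  }

  count(word: string): number {
    return this.wordCounts[word.toLowerCase()] || 0;
  }

  pagesSeen(): number {
    return this.pages;
  }
}

// Usage with a hand-built tree standing in for document.body.
const p = (text: string): NodeLike => ({ tagName: "P", textContent: text, children: [] });
const page: NodeLike = {
  tagName: "BODY",
  textContent: null,
  children: [
    { tagName: "NAV", textContent: null, children: [p("Home | About")] },
    {
      tagName: "ARTICLE",
      textContent: null,
      children: [p("Fathom scores DOM nodes."), p("Recipes classify text.")],
    },
    { tagName: "FOOTER", textContent: null, children: [p("Copyright 2017")] },
  ],
};

const state = new BackgroundState();
const message = buildMessage("https://example.org/post", page);
state.handle(message);
```

With this shape, a recipe would only need to decide what gets injected on the content side; the classifier state stays in the background script and never touches the DOM.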


--
Ean Schuessler, Brainfood Co-Founder