Re: [chromium-dev] headless and dom-distiller mode

Eric Seckler

Oct 19, 2016, 2:46:34 AM
to pra...@laserlike.com, headless-dev, Chromium-dev
+cc headless-dev

You could use the DevTools commands under the DOM domain (https://chromedevtools.github.io/debugger-protocol-viewer/tot/DOM/) to extract information about a page's DOM. I'm not sure whether this would return the distilled DOM (if the distiller is enabled) or the original one.
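
As a rough illustration of the DevTools approach, the sketch below builds the two DOM-domain messages one would typically send over the DevTools WebSocket: `DOM.getDocument` to obtain the root node, then `DOM.getOuterHTML` to serialize it. The helper names (`make_command`, `root_node_id`) and the sample reply are assumptions made for illustration, and the transport (a WebSocket client connected to the page's debugger URL) is left out. Whether the result reflects the distilled or the original DOM is still an open question.

```python
import itertools
import json

# Monotonic message ids; the DevTools protocol echoes the id back in each reply.
_ids = itertools.count(1)


def make_command(method, params=None):
    """Serialize one DevTools protocol command as a JSON string."""
    return json.dumps({"id": next(_ids), "method": method, "params": params or {}})


def root_node_id(get_document_reply):
    """Pull the root nodeId out of a DOM.getDocument reply (JSON string)."""
    reply = json.loads(get_document_reply)
    return reply["result"]["root"]["nodeId"]


# Step 1: ask for the document root.
get_document = make_command("DOM.getDocument")

# Step 2: given the reply (a made-up sample here), request the serialized markup.
sample_reply = '{"id": 1, "result": {"root": {"nodeId": 1, "nodeName": "#document"}}}'
get_outer_html = make_command("DOM.getOuterHTML", {"nodeId": root_node_id(sample_reply)})
```

Each string would be sent as one WebSocket text frame; the reply to `DOM.getOuterHTML` carries the markup in `result.outerHTML`.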

I'm also not sure how to go about enabling the DOM Distiller in headless_shell (I'm assuming this is what you're using?). Maybe it's already possible to use the --enable-dom-distiller flag; otherwise, you could add support for it.
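
For anyone wanting to try the flag, here is a minimal sketch of assembling a headless_shell command line with a remote debugging port, so that DevTools commands can be sent to it. The binary path, URL, port, and the `headless_shell_argv` helper are placeholders chosen for illustration; `--enable-dom-distiller` is simply the flag named above and may well be ignored by headless_shell.

```python
import shlex


def headless_shell_argv(binary, url, debug_port=9222, extra_flags=()):
    """Return the argument list for starting headless_shell on a URL."""
    return [binary, f"--remote-debugging-port={debug_port}", *extra_flags, url]


# Path and URL are placeholders; --enable-dom-distiller is the flag from this
# thread and is not confirmed to have any effect in headless_shell.
argv = headless_shell_argv(
    "out/Release/headless_shell",
    "https://example.com/",
    extra_flags=("--enable-dom-distiller",),
)

# Printable form for logging; pass argv itself to subprocess.Popen to run it.
command_line = shlex.join(argv)
```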

On Wed, Oct 19, 2016, 01:31 Praveen <pra...@laserlike.com> wrote:
Hi all,

For one of my projects, we need to extract the distilled content of a crawled webpage. We want to do this as an internal service on the server side, without the browser. I have been working on using Chromium's headless mode with the distill functionality to do this. Has anyone on this forum tried this? Even if you have not, could someone share some insights on how to potentially do it?

Thanks,
Praveen.

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev

Praveen Patnala

Oct 19, 2016, 1:28:34 PM
to Eric Seckler, headless-dev, Chromium-dev
Hi Eric,

Thanks for the email. I am effectively trying to add the "--enable-dom-distiller" functionality to headless_shell. This is the first time I am working on a Chromium development project, so the code base, build management, and architecture are of course all new to me. I was attempting to just invoke dom_distiller::DomDistillerServiceFactory::GetInstance() in headless_shell.cc and then call DistillPage(). It's not clear whether I am on the right track. I haven't been able to get the ninja build working so far. I realized that headless_shell includes a small number of files and libraries compared with the much larger Chrome browser.

At a high level, this is what I am looking to accomplish: build an API that, given a URL and/or crawled content, returns the distilled content.

I also noticed that the dom-distiller project on GitHub effectively manages domdistiller.js, which is then packaged as a third-party component in the Chromium project. I tried using the underlying Java library but unfortunately couldn't get that working either.

I am wondering if someone could suggest at a high level which approach might be better.
1) Chrome without UI mode, invoke distilled functionality and return.
2) headless_shell with "--enable-dom-distiller"
3) Use dom-distiller Java/JS files directly.
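
One way option 3 might be combined with the DevTools route is to inject the distiller script into the loaded page with the protocol's `Runtime.evaluate` command. The sketch below only builds that message. The helper `make_evaluate_command`, the stand-in script, and the `distill()` call are all hypothetical placeholders, since the actual entry point of domdistiller.js is not named in this thread.

```python
import json


def make_evaluate_command(message_id, distiller_source, entry_point_call):
    """Build a Runtime.evaluate command that loads a script and calls into it.

    distiller_source: text of the distiller JavaScript bundle (read by the caller).
    entry_point_call: JavaScript expression invoking the distiller; hypothetical,
        since the thread does not name the actual entry point.
    """
    expression = distiller_source + "\n" + entry_point_call
    return json.dumps({
        "id": message_id,
        "method": "Runtime.evaluate",
        # returnByValue asks for the result as JSON rather than an object handle.
        "params": {"expression": expression, "returnByValue": True},
    })


# Placeholder inputs for illustration only; not the real distiller script or API.
command = make_evaluate_command(
    1,
    "function distill() { return document.title; }",
    "distill()",
)
```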

Any pointers here would be super useful.

Thanks,
Praveen.




Alex Clarke

Oct 19, 2016, 1:32:59 PM
to Praveen Patnala, Eric Seckler, headless-dev, Chromium-dev
On 19 October 2016 at 18:28, Praveen Patnala <pra...@laserlike.com> wrote:
Hi Eric,

Thanks for the email. I am effectively trying to add the "--enable-dom-distiller" functionality to headless_shell. This is the first time I am working on a Chromium development project, so the code base, build management, and architecture are of course all new to me. I was attempting to just invoke dom_distiller::DomDistillerServiceFactory::GetInstance() in headless_shell.cc

Looks like that's a chrome/-level feature, and headless_shell currently does not link against anything in chrome/. IIRC it's pretty difficult for us to link that stuff in.
 
--
You received this message because you are subscribed to the Google Groups "headless-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to headless-dev+unsubscribe@chromium.org.
To post to this group, send email to headle...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/headless-dev/CALYAs1w%2BeF%3D4N%3D-GX5KCDEYf%2BAM7G_eE17feQJwTBMwiBp-6FQ%40mail.gmail.com.
