headless and dom-distiller mode


Praveen

Oct 18, 2016, 8:30:26 PM
to Chromium-dev
Hi all,

For one of my projects, we need to extract the distilled content of a crawled webpage. We want to do this as an internal service on the server side without the browser. I have been working on using the headless mode of Chromium with the distill functionality to do this. Has anyone on this forum tried this? Even if you have not, could someone share some insights on how this might be done?

Thanks,
Praveen.

Eric Seckler

Oct 19, 2016, 2:47:23 AM
to pra...@laserlike.com, headless-dev, Chromium-dev
+cc headless-dev

You could use the DevTools commands under the DOM domain (https://chromedevtools.github.io/debugger-protocol-viewer/tot/DOM/) to extract information about a page's DOM. I'm not sure whether this would return the distilled DOM (if the distiller is enabled) or the original one.
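A minimal sketch of the two DevTools protocol messages this approach would involve, assuming a client that already holds a WebSocket connection to the page's debugging endpoint (the connection is not shown, and the class and method names are made up for illustration). `DOM.getDocument` asks for the root node, and `DOM.getOuterHTML` asks for the markup of a given node id. Whether that markup is the distilled or the original content remains the open question.

```java
// Hypothetical helper: builds DevTools protocol messages as JSON text.
// Sending them over the debugging WebSocket is left to the caller.
final class DomCommands {
    private DomCommands() {}

    /** Request for the root DOM node; the reply carries result.root.nodeId. */
    static String getDocument(int id) {
        return "{\"id\":" + id + ",\"method\":\"DOM.getDocument\"}";
    }

    /** Request for the outer HTML of the node with the given nodeId. */
    static String getOuterHtml(int id, int nodeId) {
        return "{\"id\":" + id + ",\"method\":\"DOM.getOuterHTML\","
                + "\"params\":{\"nodeId\":" + nodeId + "}}";
    }
}
```

The `id` field is chosen by the client and is echoed back in the reply, so responses can be matched to requests.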

I'm also not sure how to actually go about enabling the DOM distiller in headless_shell (I'm assuming this is what you're using?). Maybe it's already possible to use the --enable-dom-distiller flag; otherwise you could add support for it.

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev

Alex Clarke

Oct 19, 2016, 5:10:50 AM
to pra...@laserlike.com, Chromium-dev
I'm hoping to land two patches today which might be of interest:

https://codereview.chromium.org/2413693002/ and https://codereview.chromium.org/2385653003/ (Patchset 3 will get it to capture computed CSS as well as the DOM tree).


Praveen Patnala

Oct 19, 2016, 1:29:15 PM
to Eric Seckler, headless-dev, Chromium-dev
Hi Eric,

Thanks for the email. I am effectively trying to add the "--enable-dom-distiller" functionality to headless_shell. This is the first time I am working on a Chromium development project, so the code base, build management, and architecture are of course all new to me. I was attempting to invoke dom_distiller::DomDistillerServiceFactory::GetInstance() in headless_shell.cc and then call DistillPage(). It's not clear whether I am on the right track. I haven't been able to get the ninja build working so far. I realized that headless_shell includes a small number of files and libraries compared with the much larger Chrome browser.

At a high level, this is what I am looking to accomplish: build an API that, given a URL and/or crawled content, returns the distilled content.

I also noticed that the dom-distiller project on GitHub effectively manages domdistiller.js, which is then packaged as third-party code in the Chromium project. I tried using the underlying Java library but unfortunately couldn't get that working either.

I am wondering if someone could suggest at a high level which approach might be better.
1) Chrome without UI mode, invoke distilled functionality and return.
2) headless_shell with "--enable-dom-distiller"
3) Use dom-distiller Java/JS files directly.

Any pointers here would be super useful.

Thanks,
Praveen.




Alex Clarke

Oct 19, 2016, 1:33:39 PM
to Praveen Patnala, Eric Seckler, headless-dev, Chromium-dev
On 19 October 2016 at 18:28, Praveen Patnala <pra...@laserlike.com> wrote:
Hi Eric,

Thanks for the email. I am effectively trying to add the "--enable-dom-distiller" functionality to headless_shell. This is the first time I am working on a chromium development project. The code base, build management, architecture is of course all new. I was attempting to just invoke dom_distiller::DomDistillerServiceFactory::GetInstance() in headless_shell.cc

Looks like that's a chrome/ level feature and currently headless_shell does not link against anything in chrome/.   IIRC it's pretty difficult for us to link in that stuff.
 

Wei-Yin Chen

Oct 19, 2016, 7:30:50 PM
to Chromium-dev, esec...@chromium.org, dom-dist...@google.com
Hi,

-cc headless-dev
+cc dom-distiller-eng

At a very high level, server-side distillation can actually do better than DOM distiller, which is a client-side implementation. On the server side, you could leverage similar pages, like pages from the same site, to detect boilerplate HTML more easily and accurately, or even use site-specific rules or templates. DOM distiller only gets a single page, and its algorithm needs to be general enough, so its accuracy is limited.

Nevertheless, it could still be a good starting point for your API if you don't have better solutions.

One possibility is to run Chrome with WebDriver inside xvfb.
You could either use the native distiller feature with "--enable-dom-distiller", or inject the distiller JS code through WebDriver and get the results inside the returned JSON.
Some references below:

If you want to customize DOM distiller, injecting the distiller JS code would be more convenient, since you don't have to recompile Chrome.
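As a rough sketch of the injection route, the distiller source and the call that returns the result could be combined into one script string and passed to WebDriver's `JavascriptExecutor.executeScript`. This assumes a single-file build of the distiller that defines `org.chromium.distiller.DomDistiller` when evaluated, which may not match how the project actually packages its output. The helper class below is hypothetical, and the entry point is the one invoked later in this thread.

```java
// Hypothetical helper: builds one script string for executeScript.
// Assumption: the distiller source is a single self-contained file.
final class DistillerScript {
    private DistillerScript() {}

    /** The call whose return value WebDriver hands back to the caller. */
    static final String APPLY_CALL =
            "return org.chromium.distiller.DomDistiller.apply();";

    static String build(String distillerSource) {
        // The newline keeps a trailing // comment in the source
        // from swallowing the call that follows it.
        return distillerSource + "\n" + APPLY_CALL;
    }
}
```

With a driver at hand, the result would then come from something like `jse.executeScript(DistillerScript.build(source))`, where `source` holds the contents of the distiller file.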

For distilling crawled content, you might need this patch to use the native distiller feature:

Let me know if this works for you.

- Wei-Yin

Praveen Patnala

Oct 20, 2016, 2:19:21 PM
to wyc...@google.com, Vishnu Natchu, Chromium-dev, Eric Seckler, dom-dist...@google.com
Hi Wei-Yin,

Thanks for the response, really appreciate it. We will try the ChromeDriver option as you suggested. It looks like that is a Selenium integration with Chrome.

Also, I wanted to ask: would you suggest, or is it even possible, to use the domdistiller.js output from your dom-distiller GitHub project directly in our application, without any other Chrome component? When I tried using the dom-distiller Java classes, I couldn't get them working. It's possible they are written (with GWT) to be used only within a browser framework.

Thanks,
Praveen.


Praveen Patnala

Oct 21, 2016, 3:56:30 PM
to Wei-Yin Chen, Vishnu Natchu, Chromium-dev, Eric Seckler, dom-dist...@google.com
Hi all,

Thanks to Wei-Yin's suggestion, we got the distiller working programmatically. We used WebDriver (ChromeDriver): after the page is loaded, we load the JS file and invoke the distiller method.

Setup: locally saved HTML files of webpages, on a local Mac with the regular Chrome browser. No Chrome development environment, etc.

injecttodom.js code: this code adds a script tag to the head section of the DOM.

var getHeadTag = document.getElementsByTagName('head')[0]; 
var newScriptTag = document.createElement('script'); 
newScriptTag.type='text/javascript'; 
newScriptTag.language='javascript'; 
newScriptTag.src='/Users/ppraveen/dev/selenium/vishnu/domdistiller/domdistillerjstest/domdistillerjstest.nocache.js';
getHeadTag.appendChild(newScriptTag);

ChromeDriver Java code:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DistillWithChromeDriver {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "/Users/ppraveen/dev/selenium/chromedriver");

        WebDriver driver = null;
        try {
            driver = new ChromeDriver();
            driver.navigate().to("file:///Users/ppraveen/dev/selenium/vishnu/domdistiller/wpost.html");
            JavascriptExecutor jse = (JavascriptExecutor) driver;

            // Read the injection script and collapse whitespace runs into single
            // spaces (safe here because the script contains no // comments).
            byte[] raw = Files.readAllBytes(Paths.get("/Users/ppraveen/dev/selenium/injecttodom.js"));
            String inject = new String(raw, StandardCharsets.UTF_8).replaceAll("\\s+", " ");
            System.out.println("Debug: Inject dom string: " + inject);
            jse.executeScript(inject);

            // Give the injected distiller script time to load before calling it.
            Thread.sleep(2000);

            String title = (String) jse.executeScript("return document.title");
            System.out.println(" Title Of site : " + title);
            Object res = jse.executeScript("return org.chromium.distiller.DomDistiller.apply();");
            System.out.println("\n\nGot distilled content ----\n\n: " + res);
            System.out.println("\n\nEnd distilled content ----\n: ");
        } catch (Exception e) {
            System.out.println("Exception: " + e);
        } finally {
            // Close the browser even if an exception was thrown.
            if (driver != null) {
                driver.quit();
            }
        }
    }
}


In the output below, key 1 of the result is the title and key 2 holds the distilled body.

Starting ChromeDriver 2.24.417412 (ac882d3ce7c0d99292439bf3405780058fcca0a6) on port 47735
Only local connections are allowed.
Oct 21, 2016 12:12:40 PM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Attempting bi-dialect session, assuming Postel's Law holds true on the remote end
Oct 21, 2016 12:12:41 PM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: OSS
Debug: Inject dom string: var getHeadTag = document.getElementsByTagName('head')[0]; var newScriptTag = document.createElement('script'); newScriptTag.type='text/javascript'; newScriptTag.language='javascript'; newScriptTag.src='/Users/ppraveen/dev/selenium/vishnu/domdistiller/domdistillerjstest/domdistillerjstest.nocache.js'; getHeadTag.appendChild(newScriptTag);
 Title Of site : GOP braces for Trump loss, roiled by refusal to accept election results - The Washington Post


Got distilled content ----

: {1=GOP braces for Trump loss, roiled by refusal to accept election results, 2={1=<p dir="ltr"> <span class="dateline" dir="ltr">LAS VEGAS —</span> A wave of apprehension and anguish swept the Republican Party on Thursday, with many GOP leaders alarmed by Donald Trump’s refusal to accept the outcome of the election and concluding that it is probably too late to salvage his flailing presidential campaign.</p><p>As the Republican nominee reeled from a turbulent performance in the final debate here in Las Vegas, his party’s embattled senators and House members scrambled to protect their seats and preserve the GOP’s congressional majorities against what Republicans privately acknowledge could be a landslide victory for Democratic nominee Hillary Clinton.</p><p>With less than three weeks until the election, the Republican Party is in a state of historic turmoil, encapsulated by Trump’s extraordinary debate declaration that he would leave the nation in “suspense” about whether he would recognize the results from an election he has claimed will be “rigged” or even “stolen.”</p><p>The immediate responses from GOP officials were divergent and vague, with no clear strategy on how to handle Trump’s threat. 
The candidate was defiant and would not back away from his position, telling a roaring crowd Thursday in Ohio that he would accept the results “if I win” — and reserving his right to legally challenge the results should he fall short.</p><p>For seasoned Republicans who have watched Trump warily as a general-election candidate, the aftermath of Wednesday’s debate brought a feeling of finality.</p><div class="wp-volt-gal-nav-number" dir="ltr"><span class="wp-volt-gal-nav-number-current" dir="ltr">1</span> of 28</div><img alt="" src="https://img.washingtonpost.com/rf/image_606w/2010-2019/WashingtonPost/2016/10/08/National-Politics/Images/Congress_Saudi_Arabia-34900-3460.jpg" width="606" height="397"><div class="wp-volt-gal-interstitial-text">Wait 1 second to continue.</div><p>“The campaign is over,” said Steve Schmidt, a Trump critic and former senior strategist on George W. Bush’s and John McCain’s presidential campaigns.</p><p>Calling a refusal to accept the election results “disqualifying,” Schmidt added: “The question is, how close will Clinton get to 400 electoral votes? She’ll be north of 350, and she’s trending towards 400 — and the trend line is taking place in very red states like Georgia, Texas and Arizona.”</p><p>Clinton and Trump appeared together and traded jabs in delivering mostly lighthearted roasts Thursday night at the Alfred E. Smith Memorial Foundation dinner, a white-tie gala benefiting Catholic charities. The candidates used searing humor to taunt each other, reflecting the personal animus on display in the debates. Trump’s routine was at times unsettling, drawing some rare boos from the audience. 
Held at the Waldorf Astoria hotel in New York, the dinner is a traditional event on the calendar for presidential nominees.</p><p>Meanwhile, top Democrats fanned out to battleground states on Thursday to hammer Trump for what they described as an unprecedented attack on the country’s political system and to attempt to yoke Trump to Republican candidates down the ballot.</p><p>Campaigning in Miami, President Obama said Trump’s doubts about the election outcome are “not a joking matter. That is dangerous.”</p><p>The president eviscerated Republicans who have stood by Trump, singling out Sen. Marco Rubio (R-Fla.), who called Trump “a dangerous con artist” and condemned his more controversial comments during the GOP primaries but now plans to vote for him.</p><div class="inline-video-caption" dir="ltr"> <span class="pb-caption" dir="ltr">On Oct. 20 in Miami Gardens, Fla., President Obama said the fact that Republican presidential nominee Donald Trump refused to say during the final debate whether he would accept the result of the election “is not a joking matter.” (The Washington Post)</span> </div><p>“Marco just seems to care about hanging on to his job,” Obama said, calling the senator’s positioning “the height of cynicism.”</p><p>And in Arizona, where polls show an unexpectedly tight presidential race, first lady Michelle Obama said Trump “is threatening the very idea of America itself” by suggesting he would not honor the election results.</p><p>“You do not keep American democracy ‘in suspense,’ ” Obama said in Phoenix.</p><p>Sen. 
Tim Kaine of Virginia, Clinton’s vice-presidential running mate, held a rally at a downtown Charlotte brewery, where he said Trump’s claims of a “rigged” election reminded him of the Third World politicking he had seen as a young missionary in Honduras.</p><p dir="ltr">“The bigger we can win by, the harder it is for him to whine and have anyone believe him,” <a href="https://www.washingtonpost.com/news/post-politics/wp/2016/10/20/kaine-on-trump-the-bigger-we-can-win-by-the-harder-it-is-for-him-to-whine/?tid=sm_tw" shape="rect" dir="ltr">Kaine said</a>, trying to galvanize supporters on the first day of early voting in North Carolina.</p><p>On the debate stage, Trump amplified what he had been saying for weeks at his rallies: that the election is “rigged.” Questioned directly as to whether he would accept the results should Clinton prevail, Trump said, “I’ll keep you in suspense.”</p><p>Clinton called Trump’s answer “horrifying,” both in the debate and to reporters overnight on her flight home to New York.</p><p>Trump’s advisers and surrogates struggled to explain the candidate’s position. Campaign manager Kellyanne Conway said it was too early to determine whether voting irregularities could make the difference between winning and losing. She and other Trump backers drew a parallel to then-Vice President Al Gore’s concession call to then-Texas Gov. George W. Bush, which he later withdrew as he awaited a recount in Florida.</p><p>“I’m going to keep reminding everybody about the 2000 election when Al Gore said he would accept the results of the election and then did not,” Conway said. “He retracted his concession.”</p><p>Reince Priebus, chairman of the Republican National Committee, contended that Trump and the party would stand by the results unless the margin is small enough to warrant a recount or legal challenges. 
Priebus said Trump is merely preserving flexibility in the event of a contested result.</p><p>“All he’s saying is, ‘Look, I’m not going to forgo my right to a recount in a close election,’ ” Priebus said. “We accept the results as long as we’re not talking about a few votes where it actually matters. I know him. I know where his head’s at. . . . I promise you, that’s all this is.”</p><p>Other Trump surrogates took a different interpretation. Keith Kellogg, a retired Army lieutenant general, accused the media of “splitting hairs” and insisted that Trump was “not threatening democratic norms,” while former New York mayor Rudolph W. Giuliani argued that any Republican would be “stupid” to accept the integrity of results before they are known.</p><p>“Suppose she wins Pennsylvania by 50 votes,” Giuliani said. He speculated, without evidence, that Democrats would “steal a lot more than 50 votes in Philadelphia. I guarantee you of that. And I’ll tell you how they will do it — they’ll bus people in who will vote dead people’s names four, five, six times . . . or have people in Philadelphia paid to vote three, four and five times.”</p><p>Democrats expressed dismay that the Republican nominee and his backers were advancing the idea of widespread voter fraud.</p><p>“He is just trying to find an excuse for the fact that he’s going to lose, and perhaps the fact that he’s going to lose to the first woman president stings a little sharper than it might otherwise,” said Jennifer Palmieri, the Clinton campaign’s communications director.</p><p>Prominent Republican senators in tough reelection bids distanced themselves from Trump’s posture. “Donald Trump needs to accept the outcome,” Sen. Kelly Ayotte (R-N.H.) said in a statement.</p><p>McCain (Ariz.), who lost to Obama eight years ago, said in a statement: “I didn’t like the outcome of the 2008 election. But I had a duty to concede. A concession isn’t just an act of graciousness. 
It is an act of respect for the will of the American people.”</p><p>As of Thursday afternoon, neither House Speaker Paul D. Ryan (R-Wis.) nor Senate Majority Leader Mitch McConnell (R-Ky.) had offered any comment, underscoring the party’s unease with its own nominee and the political dangers of tangling with him.</p><p>Benjamin L. Ginsberg, a lawyer at Jones Day who has been national counsel for several Republican presidential campaigns, said Trump’s stance puts the party in “quite a difficult position.”</p><p>“There will be Republican candidates who are winning by narrow margins and losing by narrow margins,” Ginsberg said. “The party as a whole has a collective interest in having the results upheld.”</p><p>Republican pollster Whit ­Ayres said that at the Las Vegas debate, Trump “blew his last chance to turn it around.” But, he said, “I am not convinced that the rest of the party will have as bad a night [on Election Day] as Donald Trump is going to have, because the Trump brand is so distinct from the Republican brand.”</p><p>Escalating the Republican angst was Trump’s rally Thursday in Delaware, Ohio, where he advanced conspiracies swirling around far-right websites about Clinton. He referred to reports that Democratic operatives with no direct connection to the Clinton campaign hired people to violently disrupt Trump events.</p><p>“This criminal behavior that violates centuries of tradition of peaceful democratic elections, a campaign like Clinton’s that will incite violence is truly a campaign that will do anything to win,” Trump said, going on to call Clinton “a candidate who is truly capable of anything, including voter fraud.” </p><p channel="!Daily">Trump also mentioned an email, which surfaced on WikiLeaks through an illegal hack that U.S. 
authorities blame on the Russian government, in which interim Democratic National Committee chairwoman Donna Brazile seemed to suggest to the Clinton team that she had knowledge of a question that would come up in a primary forum earlier this year. While Brazile has denied that CNN provided any questions in advance, Trump called her actions “cheating at the highest level.”</p><p>Even as his party loses faith, Trump proclaimed that he was poised for victory.</p><p>“Bottom line, we’re going to win,” he told the boisterous Ohio crowd. “We’re going to win. We’re going to win so big. We’re going to win so big.”</p><p class="trailleft">Jenna Johnson in Delaware, Ohio; David Weigel in Charlotte; Krissah Thompson in Phoenix; and Juliet Eilperin, Jose A. DelReal and Karoun Demirjian in Washington contributed to this report.</p>}, 3={}, 5={1=GOP braces for Trump loss, roiled by refusal to accept election results, 2=Article, 3=https://www.washingtonpost.com/politics/gop-braces-for-trump-loss-roiled-by-refusal-to-accept-election-results/2016/10/20/6e1de6aa-96dc-11e6-9b7c-57290af48a49_story.html, 4=Republican leaders conclude it is probably too late for Trump to save his flailing campaign., 5=Washington Post, 6=, 7=Philip Rucker, 8={1=, 2=, 3=, 4=, 5=[https://www.facebook.com/PhilipRuckerWPhttps://www.facebook.com/costareports]}, 9=[{1=https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2016/10/20/National-Politics/Images/615866756.jpg, 2=, 3=, 4=, 5=0, 6=0}]}, 6={1=3.57999999999993, 2=83.5150000000003, 3=75.915, 4=24.4799999999996, 5=282.3, 6=[{1=OpenGraphProtocolParser, 2=1.18499999999995}, {1=SchemaOrgParserAccessor, 2=0.385000000000218}, {1=IEReadingViewParser, 2=0.305000000000291}, {1=OpenGraphProtocolParser.findPrefixes, 2=4.15500000000066}, {1=OpenGraphProtocolParser.parseMetaTags, 2=10.1250000000005}, {1=OpenGraphProtocolParser.imageParser.verify, 2=0.320000000000164}, {1=OpenGraphProtocolParser.checkRequired, 2=1.11000000000013}, 
{1=OpenGraphProtocolParser.parse, 2=23.5799999999999}, {1=Pagination, 2=10.5650000000001}, {1=SchemaOrgParser.parse, 2=11.8699999999999}]}, 7={1=DomDistiller debug level: 0


End distilled content ----
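Judging from the printed output above, the value returned by `executeScript` is a nested map: key 1 is the title, and key 2 is itself a map whose key 1 is the distilled HTML. A minimal sketch of pulling those two fields out, assuming the Java bindings hand the result back as a `Map` with string keys (the helper class and method names are hypothetical):

```java
import java.util.Map;

// Hypothetical helper for reading the map that executeScript returned.
// Assumption: keys are the strings "1", "2", ... as in the printed output.
final class DistillResult {
    private DistillResult() {}

    /** Key "1" of the result: the page title, or null if absent. */
    static String title(Map<String, Object> result) {
        Object t = result.get("1");
        return t instanceof String ? (String) t : null;
    }

    /** Key "2" is a nested map whose key "1" is the distilled HTML. */
    static String bodyHtml(Map<String, Object> result) {
        Object content = result.get("2");
        if (!(content instanceof Map)) {
            return null;
        }
        Object html = ((Map<?, ?>) content).get("1");
        return html instanceof String ? (String) html : null;
    }
}
```

Both methods return null rather than throwing when a field is missing, so a page that fails to distill can be skipped by the caller.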

Thanks,
Praveen.