Can this be used to build a crawler?


vvowns

Apr 26, 2017, 9:56:15 AM
to headless-dev
Hi, I am looking for an efficient way to crawl a website with headless Chrome.

From my first experiments with chrome-remote-interface, spinning up 10 tabs and then navigating inside them in parallel (to a local website) performs poorly.

What I have been doing:
- create ten tabs
- block jpg, css, js ... (Network.setBlockedURLs)
- take a start URL and load it in the first tab
- wait for Page.domContentEventFired
- get all links from that page
- push all links as new tasks onto a queue handled by async
- repeat the same process for each task, using whichever tab is free
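
The steps above amount to a fixed pool of workers draining a shared queue. A minimal sketch of that pattern, written in Python with asyncio rather than the Node.js `async` library the post uses: `crawl`, `fetch_links`, `fake_fetch_links` and the `SITE` link graph are all hypothetical names, and `fake_fetch_links` stands in for "navigate a tab, wait for the DOM content event, read the links". It is an illustration of the queueing logic only, not the poster's code.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Set


async def crawl(
    start_url: str,
    fetch_links: Callable[[str], Awaitable[Iterable[str]]],
    num_workers: int = 10,
) -> Set[str]:
    """Visit every URL reachable from start_url with a fixed pool of workers."""
    queue: asyncio.Queue = asyncio.Queue()
    seen: Set[str] = {start_url}
    queue.put_nowait(start_url)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if link not in seen:  # enqueue each URL only once
                        seen.add(link)
                        queue.put_nowait(link)
            except Exception:
                pass  # a real crawler would log or retry the failed page
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


# Hypothetical link graph standing in for a local website.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}


async def fake_fetch_links(url: str) -> Iterable[str]:
    await asyncio.sleep(0.01)  # simulated page-load latency
    return SITE.get(url, [])


if __name__ == "__main__":
    print(sorted(asyncio.run(crawl("/", fake_fetch_links))))
```

Each worker plays the role of one tab. The `seen` set keeps a page that is linked from many places from being queued more than once, which matters as soon as the site has navigation menus.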

When I do this, I hit a ceiling on the number of tasks completed per second, mostly because
loading a page and waiting for DOMContentLoaded takes anywhere from 80 ms to 2 s as soon as I use
multiple tabs in parallel. The same goes for getting all the links, which can take up to 50 ms.
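
As a back-of-the-envelope reading of these figures: if each tab handles one page at a time, throughput is bounded by roughly (number of tabs) / (seconds per page). The sketch below assumes the tabs do not slow each other down and that the 50 ms link extraction is added to every page; both are simplifying assumptions, and `max_pages_per_second` is a hypothetical helper.

```python
def max_pages_per_second(tabs: int, seconds_per_page: float) -> float:
    """Upper bound on throughput when each tab processes one page at a time."""
    return tabs / seconds_per_page


# 10 tabs, fastest case: 80 ms load + 50 ms link extraction per page.
best = max_pages_per_second(10, 0.080 + 0.050)
# 10 tabs, slowest case: 2 s load + 50 ms link extraction per page.
worst = max_pages_per_second(10, 2.0 + 0.050)

print(f"best case:  {best:.1f} pages/s")   # about 76.9
print(f"worst case: {worst:.1f} pages/s")  # about 4.9
```

The wide gap between the two bounds suggests that per-page latency under parallel load, not the number of tabs, is what limits the crawl.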

Any comments, feedback or remarks on what I am doing are welcome. Thanks.

Isaac Dawson

Apr 26, 2017, 10:09:20 AM
to vvowns, headless-dev
I'm not sure this is the correct list for this, but I had similar challenges and spoke about them at a conference last year: https://youtu.be/aqeBM9Q3aY8 

--
You received this message because you are subscribed to the Google Groups "headless-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to headless-dev...@chromium.org.
To post to this group, send email to headle...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/headless-dev/fc32b28f-a77a-4ebf-8d1d-e780d2a25602%40chromium.org.

Sami Kyostila

Apr 26, 2017, 1:08:21 PM
to Isaac Dawson, vvowns, headless-dev
This is interesting. Could you use the --trace-startup and, say, --trace-startup-duration=30 command line flags to record some performance traces and attach them to a bug? I suspect that since all tabs are trying to render animations in parallel, they end up using a lot of CPU time in total.

- Sami

Anton Bacaj

Apr 26, 2017, 3:48:27 PM
to Sami Kyostila, headless-dev, vvowns, Isaac Dawson
You said you disable CSS and JS... Isn't running those the whole point of using headless Chrome?

To me, it sounds like you can crawl a web page quite easily without a browser: just use an HTTP library.
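
For reference, the browserless approach can be sketched with the Python standard library alone. `LinkExtractor` and `extract_links` are hypothetical names, and the HTML string and base URL in the usage example are made up for illustration. The parser sees only the markup the server sent, so links added by scripts would be missed.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="/docs">Docs</a> <a href="about.html">About</a></p>'
    print(extract_links(page, "http://localhost:8000/index.html"))
```

In a real crawler the HTML would come from an HTTP client such as `urllib.request` instead of a literal string.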


Alex Clarke

Apr 26, 2017, 4:16:38 PM
to Anton Bacaj, Sami Kyostila, headless-dev, vvowns, Isaac Dawson
Chrome currently tries to render at 60 fps even in headless mode, which can burn a lot of CPU, although a trace would be useful to confirm there's not some other problem. We are working on adding controls for rendering, which we hope will help reduce CPU usage.


Alex Clarke

Apr 27, 2017, 3:14:18 AM
to Vincent Voyer, Anton Bacaj, Sami Kyostila, headless-dev, Isaac Dawson
On 26 April 2017 at 21:29, Vincent Voyer <wou...@gmail.com> wrote:

On 26 April 2017 at 22:16, Alex Clarke <alexc...@google.com> wrote:
Chrome currently tries to render at 60 fps even in headless mode, which can burn a lot of CPU, although a trace would be useful to confirm there's not some other problem. We are working on adding controls for rendering, which we hope will help reduce CPU usage.

I will record that trace; I did not even know it was possible, nice! Ultimately I do not need any rendering, I just want to use the DOM features and be able to query the DOM.

A note of caution, depending on which pages you're crawling you'll have to do at least some rendering.  Quite a few pages won't function without animation triggers or a few requestAnimationFrames firing.

 
 

On 26 April 2017 at 20:48, Anton Bacaj <aba...@gmail.com> wrote:
You said you disable CSS and JS... Isn't running those the whole point of using headless Chrome?

I disabled the loading of CSS, JS, and images, which headless Chrome would otherwise still load. For my current project I want to use headless Chrome in two modes:
- no external dependencies, just the html of the page
- all external dependencies, with CSS, JS ... loaded

What I will be using from chrome headless: the DOM loading and DOM/Network methods.
 

To me, it sounds like you can crawl a web page quite easily without a browser: just use an HTTP library.

Yes, you can do that, but to understand the DOM you have to use libraries that try to mimic a browser without fully matching one. And then, when you want JS, CSS, and images loaded, you have to switch to another system (a real browser, such as headless Chrome).

I want a single solution, if feasible.
 

On Apr 26, 2017 1:08 PM, "Sami Kyostila" <skyo...@chromium.org> wrote:
This is interesting. Could you use the --trace-startup and, say, --trace-startup-duration=30 command line flags to record some performance traces and attach them to a bug? I suspect that since all tabs are trying to render animations in parallel, they end up using a lot of CPU time in total.

- Sami

I'll do that tomorrow, thanks for the suggestion.
 

On Wed, Apr 26, 2017 at 15:09, Isaac Dawson <isaac....@gmail.com> wrote:
I'm not sure this is the correct list for this, but I had similar challenges and spoke about them at a conference last year: https://youtu.be/aqeBM9Q3aY8 

Thanks for this, it was very interesting, and yes, I will face some of the same challenges you did.




vvowns

Apr 28, 2017, 6:06:52 AM
to headless-dev, isaac....@gmail.com, wou...@gmail.com
Hi Sami,


I have not opened a bug yet because I am not sure there is one, and I don't want to create useless work for the maintainers.

Still, I would love your feedback on this trace, as I am currently unable to understand it,
especially if you know of upcoming changes that would let me get even better performance.

After fiddling a bit more I was able to get better performance (something like 70 pages parsed per second). Nice.

Thanks a lot.

Sami Kyostila

May 3, 2017, 6:38:28 AM
to Vincent Voyer, headless-dev, isaac....@gmail.com, wou...@gmail.com
Thanks for the trace, Vincent! It looks like you're running 16 renderers in parallel and each one is getting preempted a little. How many cores does your machine have? Also, there is a GPU process rendering at about 50 fps, which is causing some extra load on the browser IO thread (which also handles network connections). I would suggest trying to disable GPU emulation (use the --disable-gpu command line flag if this is the headless shell). Also, soon it will be possible to control the rendering frame rate more explicitly, which should help further reduce the CPU load. However, each navigation seems to result in about two renders, which isn't too bad. It might be better to tell each tab that it is backgrounded to avoid any rendering at all (see Browser.setWindowBounds).
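
The command-line suggestions in this thread can be gathered into one launch command. The sketch below only builds the argument list: `headless_chrome_args` and `suggested_tab_count` are hypothetical helpers, the binary name `chrome` and port 9222 are assumptions to adjust for your install, and capping the number of tabs near the core count is an inference from the question about cores, not something stated in the thread.

```python
import os
from typing import List, Optional


def headless_chrome_args(
    binary: str = "chrome",  # assumption: replace with your Chrome path
    debugging_port: int = 9222,  # assumption: any free local port works
    trace_seconds: Optional[int] = None,
) -> List[str]:
    """Build a headless Chrome command line with the GPU process disabled."""
    args = [
        binary,
        "--headless",
        "--disable-gpu",
        f"--remote-debugging-port={debugging_port}",
    ]
    if trace_seconds is not None:
        # Record a startup trace of the given length, as suggested earlier.
        args += ["--trace-startup", f"--trace-startup-duration={trace_seconds}"]
    return args


def suggested_tab_count(reserved_cores: int = 1) -> int:
    """Heuristic (an assumption): one tab per core, leaving some for the browser."""
    cores = os.cpu_count() or 1
    return max(1, cores - reserved_cores)


if __name__ == "__main__":
    print(" ".join(headless_chrome_args(trace_seconds=30)))
    print("tabs:", suggested_tab_count())
```

The list can be passed directly to `subprocess.Popen` to start the browser before connecting a DevTools client to the debugging port.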

- Sami


daniel....@zwoop.biz

May 16, 2017, 2:41:22 AM
to headless-dev
Hi,

Are you willing to share the code you're using for crawling? 
Which programming language are you using with Chromium and the chromedriver?

Best regards