How can I filter and cancel requests?

M

May 4, 2017, 12:05:31 PM

to headless-dev

I'm trying to port my PhantomJS script over to Headless Chrome to test improvements in terms of CPU, RAM, network and speed.

My script loads a page and blocks every request (HTML, JS, CSS, GIF, MP4, etc.) except the allowed ones. These are currently regex-matched URLs, like "var match4 = ((/tags.tiqcdn.com\/utag\/electriccompany/g).test(requestUrl));", that are defined in the script and depend on the domain I want to load (example.com has different rules than houseworkstuff.es).
If the final request URL is found (this is always the same URL), I extract its parameters, convert them to JSON and log them with the status code and URL name. Then the script moves on to the next page in the list (up to 1.5 million).
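
For illustration, the parameter-extraction step described above might look roughly like this. It is a minimal sketch assuming Node.js and its built-in `URL` class; the names `paramsToJson` and `logResult` are hypothetical and not taken from the original script.

```javascript
// Hypothetical helper: turn the query string of the final request URL
// into a plain object.
function paramsToJson(requestUrl) {
  const params = {};
  // URL is available as a global in Node.js and in browsers.
  for (const [key, value] of new URL(requestUrl).searchParams) {
    params[key] = value; // a repeated key keeps its last value
  }
  return params;
}

// Hypothetical helper: log the URL, status code and parameters as one JSON line.
function logResult(requestUrl, statusCode) {
  const record = {
    url: requestUrl,
    status: statusCode,
    params: paramsToJson(requestUrl),
  };
  console.log(JSON.stringify(record));
  return record;
}
```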

I haven't found a way to do that filtering/blocking with Headless Chrome. Are the APIs not there yet, or did I just overlook them?
And is there a parameter to disable the loading of images/graphics? There was a thread from 10.03.2017 discussing that, and it seemed the APIs weren't ready then? I'm sick of the huge memory leak bug in PhantomJS that crashes my script when you set page.settings.loadImages = false.

Aaaaand ... how do I get the request URLs issued by JavaScript that was itself loaded by JavaScript? I can log request URLs as in the example here (https://github.com/cyrus-and/chrome-remote-interface), but that only shows me the requests referenced in the page source, not the requests that external JavaScript on that site makes.

Best regards
M

Simon Luetzelschwab

May 5, 2017, 8:43:26 AM

to M, headless-dev

My understanding is that for headless mode, it's currently only possible to use Network.setBlockedURLs [0] to block certain requests. While it does accept wildcards, it does not seem to support regular expressions. There's also no corresponding whitelist.

Depending on how predictable the URLs are, you may have to load the page twice: the first time to collect all the URLs and apply your regex to populate your blacklist, and the second time to reload the page with that blacklist in place.
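
A rough sketch of that two-pass idea, assuming Node.js and a CDP client such as chrome-remote-interface. The helper names `buildBlockList` and `applyBlockList` are made up for illustration; the only protocol call used is Network.setBlockedURLs with its `urls` array.

```javascript
// Pass 1 result: every URL seen while loading the page once without blocking
// (for example, collected from Network.requestWillBeSent events).
// allowPatterns: the per-domain regular expressions from the script.
function buildBlockList(observedUrls, allowPatterns) {
  const blocked = new Set();
  for (const url of observedUrls) {
    // Patterns should not carry the /g flag here: it makes
    // RegExp.prototype.test stateful when a regex object is reused.
    const allowed = allowPatterns.some((re) => re.test(url));
    if (!allowed) blocked.add(url);
  }
  return [...blocked];
}

// Pass 2: install the block list before reloading the page.
// `Network` is the Network domain object of a connected CDP client, e.g.
//   const client = await require('chrome-remote-interface')();
//   const { Network } = client;
async function applyBlockList(Network, blockedUrls) {
  // Enabling the domain is also what delivers request events.
  await Network.enable();
  await Network.setBlockedURLs({ urls: blockedUrls });
}
```

Since every URL that fails the regex check ends up in the list verbatim, this only helps when the URLs are stable between the two loads; URLs with cache-busting parameters would need a wildcard entry instead.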

Also see the somewhat related previous thread [1] (the method has been renamed to setBlockedURLs in the latest release).

If anyone knows of an alternative way, please share.

[0] https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-setBlockedURLs

[1] https://groups.google.com/a/chromium.org/d/msg/headless-dev/D3tUxpzmqw8/sV4gNeebDAAJ

Raffaele Sena

May 11, 2017, 1:45:29 AM

to Simon Luetzelschwab, M, headless-dev

For real "navigation" requests (i.e. loading of HTML pages, not sub-resources) you can also use https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-setControlNavigations

Listen to the event https://chromedevtools.github.io/devtools-protocol/tot/Page/#event-navigationRequested and accept/reject the request with https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-processNavigation
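
Wired together, those three calls might look something like the sketch below. It assumes Node.js with a CDP client such as chrome-remote-interface; `decideNavigation` and `controlNavigations` are hypothetical helper names, and the parameter names (`enabled`, `navigationId`, `response`) and the response values 'Proceed' and 'Cancel' are my reading of the experimental protocol pages linked above, so check them against the Chrome version in use.

```javascript
// Decide whether a top-level navigation may proceed.
// Returns a NavigationResponse value: 'Proceed' or 'Cancel' (assumed names).
function decideNavigation(url, allowPatterns) {
  return allowPatterns.some((re) => re.test(url)) ? 'Proceed' : 'Cancel';
}

// `Page` is the Page domain object of a connected CDP client, e.g.
//   const client = await require('chrome-remote-interface')();
//   const { Page } = client;
async function controlNavigations(Page, allowPatterns) {
  // Register the handler before turning control on so no request is missed.
  Page.navigationRequested(({ navigationId, url }) => {
    Page.processNavigation({
      navigationId,
      response: decideNavigation(url, allowPatterns),
    });
  });
  await Page.enable();
  await Page.setControlNavigations({ enabled: true });
}
```

This only covers page navigations; sub-resources such as scripts and images still need Network.setBlockedURLs.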

-- Raffaele
