News-Crawler: Is there an option to use HTTP/1.0 or 1.1 for the WARC file?


Yuxin Zhu

Sep 15, 2020, 9:08:01 PM
to DigitalPebble
Hi there,

I tried the news-crawler, and the WARC files it produces have HTTP/2 in the status line. Is there an option in the config files to use HTTP/1.0 or 1.1 instead? Thanks in advance.
Screen Shot 2020-09-15 at 10.03.20 AM.png

Sebastian Nagel

Sep 16, 2020, 6:12:18 AM
to DigitalPebble
The easiest way (without code modifications) is to use Java 8: it has no support for HTTP/2, so okhttp will fall back to HTTP/1.1.

Otherwise you need to modify the code in the method
  com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol#configure(Config conf)
and define which protocols [1] are allowed. See Nutch's protocol-okhttp as example [2].

Please open an issue on [3] to make the selection of the protocol configurable via the crawler-conf.yaml.

Of course, you could also fix the HTTP status line written to WARC files. This would happen at [4].
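
To illustrate the last option, here is a minimal sketch of rewriting the status line of a recorded header block before it is written out. It is not StormCrawler's actual code: the class and method names are hypothetical, and it assumes the recorded headers are available as a single string with CRLF line endings whose first line starts with "HTTP/2" or "HTTP/2.0".

```java
class StatusLineFix {

    /**
     * Replaces an "HTTP/2" or "HTTP/2.0" protocol token at the start of the
     * status line with "HTTP/1.1". All header lines after the first CRLF are
     * returned unchanged.
     */
    static String normalizeStatusLine(String rawHeaders) {
        int eol = rawHeaders.indexOf("\r\n");
        String statusLine = eol >= 0 ? rawHeaders.substring(0, eol) : rawHeaders;
        String rest = eol >= 0 ? rawHeaders.substring(eol) : "";
        // Only the protocol token is rewritten; status code and reason phrase stay as recorded.
        return statusLine.replaceFirst("^HTTP/2(\\.0)?(?= |$)", "HTTP/1.1") + rest;
    }

    public static void main(String[] args) {
        // Hypothetical header block, for illustration only.
        String recorded = "HTTP/2 200\r\ncontent-type: text/html\r\n\r\n";
        System.out.print(normalizeStatusLine(recorded));
    }
}
```

Note that this only changes the protocol token. Anything else in the recorded headers that is specific to HTTP/2 is left as it was.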

Best,
Sebastian

Yuxin Zhu

Sep 16, 2020, 11:32:46 AM
to DigitalPebble
Hi Sebastian,

Thanks so much for the detailed reply. I have another question (it might be a little dumb): I am using news-crawler rather than StormCrawler directly. On the news-crawler GitHub page, the prerequisites section says "Clone and compile StormCrawler". Where should I clone the StormCrawler repo to? Should it be placed as a subdirectory under the news-crawler repo? Thanks.

Yuxin

DigitalPebble

Sep 17, 2020, 2:07:19 AM
to DigitalPebble
Hi Yuxin

I don't think this is necessary anymore as the news crawler relies on a released version of SC and pulls it like any other dependency.

BTW you'd get a wider audience if you asked questions on StackOverflow.

Julien


Sebastian Nagel

Sep 17, 2020, 3:31:14 AM
to DigitalPebble
Yes, since there have been complaints about unstable versions, the master branch now always relies on a released SC package, which is pulled from a Maven repository. If you want to use a development version of SC, you'd need to compile SC and "install" it into your local Maven repository (~/.m2/repository/ on Linux). I'll update the news-crawl README and mark this option as "expert".

Sebastian Nagel

Oct 1, 2020, 7:13:50 AM
to DigitalPebble
Update and correction:
- recent Java 8 JDK packages may support HTTP/2 if they include the ALPN backport, see
- I'll update the WARC bolt soon to allow restricting the protocol version used
- but also to make the WARC record writer "fake" the recorded HTTP header

Sebastian