Running news-crawl in a docker container


Bogdan Metea

Mar 1, 2018, 3:43:16 PM
to Common Crawl
Hi guys,

I'm following the instructions at https://github.com/commoncrawl/news-crawl to run an instance of news-crawl in a Docker container.

Everything works fine until the Dockerfile reaches this instruction:

ADD target/crawler-1.8-SNAPSHOT.jar news-crawler/lib/crawler.jar

but I'm pretty sure that JAR doesn't exist at that point. There's a note in the GitHub repo that says: "Note: the uberjar is included in the Docker image and needs to be built first."

Does anyone know what that means? Do I just add another instruction to the Dockerfile? Something like mvn clean install?

Thank you!

Sebastian Nagel

Mar 1, 2018, 4:16:12 PM
to common...@googlegroups.com
Hi Bogdan,

> Something like mvn clean install?

Exactly. Please run
mvn clean package
as described a few lines above at
https://github.com/commoncrawl/news-crawl#run-the-crawl

Best,
Sebastian



Sebastian Nagel

Mar 1, 2018, 4:27:31 PM
to common...@googlegroups.com
> Do I just add another instruction to the Dockerfile ?

Just run mvn locally in the news-crawl/ directory.
Maven needs some local setup (cache, etc.), which is
why it isn't run inside a container during the
docker build. There is also no need to include
all the build tools in the container.
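
The order of steps described above can be sketched as a small shell helper. This is only an illustration, not the project's official build script: the JAR path comes from the Dockerfile line quoted earlier in the thread, the image tag `newscrawler:1.8` from the build output posted later, and the function names `require_jar` and `build_image` are made up for this example.

```shell
#!/bin/sh
# Sketch: build the uberjar on the host first, then build the Docker image.
# Assumption: this is run from the news-crawl/ checkout.
JAR="target/crawler-1.8-SNAPSHOT.jar"

# Fail early with a clear message if the jar that the Dockerfile ADDs is missing.
require_jar() {
    if [ ! -f "$1" ]; then
        echo "missing $1: run 'mvn clean package' first" >&2
        return 1
    fi
}

# Hypothetical wrapper: package with Maven, verify the jar, build the image.
build_image() {
    mvn clean package &&
    require_jar "$JAR" &&
    docker build -t newscrawler:1.8 .
}
```

Calling `build_image` from the news-crawl/ directory would run the three steps in order and stop at the first failure, so a missing JAR is reported before `docker build` reaches the ADD instruction.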

Bogdan Metea

Mar 2, 2018, 9:09:34 AM
to Common Crawl
Hi Sebastian,

Thanks for the quick reply. I had missed a step: I fixed my problem by cloning storm-crawler from the link provided and then building it locally with Maven.

I have now built the image successfully and can run it interactively. I've created two volumes and attached them for the Elasticsearch data and the WARC data.

But when I run it with:
/home/ubuntu/news-crawler/bin/run-crawler.sh

I get a bunch of errors, which I tried to figure out, but I just don't know Storm well enough to debug them. I think it's because the ports aren't exposed correctly. I am running this on macOS. This is the output when I run it inside the Docker container:

Successfully tagged newscrawler:1.8
$ ~/TTCode/news-crawl 🖖  ---> docker run --net=host \
>     -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 \
>     -p 5601:5601 -p 8080:8080 \
>     -v news-elastic:/data/elasticsearch \
>     -v news-warc:/data/warc \
>     --rm -i -t newscrawler:1.8 /bin/bash
root@linuxkit-025000000001:/home/ubuntu# /home/ubuntu/news-crawler/bin/run-crawler.sh
/usr/lib/python2.7/dist-packages/supervisor/options.py:297: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  'Supervisord is running as root and it is searching '
sudo: unable to resolve host linuxkit-025000000001
Deleted status index
Creating status index with mapping
{"acknowledged":true,"shards_acknowledged":true,"index":"status"}
Deleted metrics index
Creating metrics index with mapping
{"acknowledged":true}
Deleted docs index
Creating docs index with mapping
{"acknowledged":true,"shards_acknowledged":true,"index":"index"}Running: java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/opt/apache-storm-1.1.1 -Dstorm.log.dir=/opt/apache-storm-1.1.1/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /opt/apache-storm-1.1.1/lib/kryo-3.0.3.jar:/opt/apache-storm-1.1.1/lib/slf4j-api-1.7.21.jar:/opt/apache-storm-1.1.1/lib/clojure-1.7.0.jar:/opt/apache-storm-1.1.1/lib/log4j-slf4j-impl-2.8.2.jar:/opt/apache-storm-1.1.1/lib/asm-5.0.3.jar:/opt/apache-storm-1.1.1/lib/objenesis-2.1.jar:/opt/apache-storm-1.1.1/lib/log4j-core-2.8.2.jar:/opt/apache-storm-1.1.1/lib/storm-rename-hack-1.1.1.jar:/opt/apache-storm-1.1.1/lib/disruptor-3.3.2.jar:/opt/apache-storm-1.1.1/lib/ring-cors-0.1.5.jar:/opt/apache-storm-1.1.1/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-1.1.1/lib/reflectasm-1.10.1.jar:/opt/apache-storm-1.1.1/lib/minlog-1.3.0.jar:/opt/apache-storm-1.1.1/lib/log4j-api-2.8.2.jar:/opt/apache-storm-1.1.1/lib/servlet-api-2.5.jar:/opt/apache-storm-1.1.1/lib/storm-core-1.1.1.jar:/home/ubuntu/news-crawler/lib/crawler.jar:/opt/apache-storm-1.1.1/conf:/opt/apache-storm-1.1.1/bin -Dstorm.jar=/home/ubuntu/news-crawler/lib/crawler.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector /home/ubuntu/news-crawler/seeds * -conf /home/ubuntu/news-crawler/conf/es-conf.yaml -conf /home/ubuntu/news-crawler/conf/crawler-conf.yaml
5826 [main] INFO  c.d.s.s.FileSpout - Input : /home/ubuntu/news-crawler/seeds/feeds.txt
5879 [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.1.1 old null
5900 [main] INFO  o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -8884617673223088873:-6627026628912068279
5990 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2001ms (NOT MAX)
7992 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2003ms (NOT MAX)
9998 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2005ms (NOT MAX)
12004 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2015ms (NOT MAX)
14020 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2023ms (NOT MAX)
16045 [main] WARN  o.a.s.u.NimbusClient - Ignoring exception while trying to get leader nimbus info from localhost. will retry with a different seed host.
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:108) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.<init>(ThriftClient.java:69) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.<init>(NimbusClient.java:127) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:83) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:57) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.blobstore.NimbusBlobStore.prepare(NimbusBlobStore.java:268) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.getListOfKeysFromBlobStore(StormSubmitter.java:595) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.validateConfs(StormSubmitter.java:561) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:207) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159) [storm-core-1.1.1.jar:1.1.1]
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85) [crawler.jar:?]
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.run(ESSeedInjector.java:65) [crawler.jar:?]
at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50) [crawler.jar:?]
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.main(ESSeedInjector.java:38) [crawler.jar:?]
Caused by: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.TBackoffConnect.retryNext(TBackoffConnect.java:64) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:56) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 14 more
Caused by: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:226) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 14 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_151]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_151]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_151]
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:221) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 14 more
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [localhost]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:111)
at org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:57)
at org.apache.storm.blobstore.NimbusBlobStore.prepare(NimbusBlobStore.java:268)
at org.apache.storm.StormSubmitter.getListOfKeysFromBlobStore(StormSubmitter.java:595)
at org.apache.storm.StormSubmitter.validateConfs(StormSubmitter.java:561)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:207)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85)
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.run(ESSeedInjector.java:65)
at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50)
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.main(ESSeedInjector.java:38)
Running: java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/opt/apache-storm-1.1.1 -Dstorm.log.dir=/opt/apache-storm-1.1.1/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /opt/apache-storm-1.1.1/lib/kryo-3.0.3.jar:/opt/apache-storm-1.1.1/lib/slf4j-api-1.7.21.jar:/opt/apache-storm-1.1.1/lib/clojure-1.7.0.jar:/opt/apache-storm-1.1.1/lib/log4j-slf4j-impl-2.8.2.jar:/opt/apache-storm-1.1.1/lib/asm-5.0.3.jar:/opt/apache-storm-1.1.1/lib/objenesis-2.1.jar:/opt/apache-storm-1.1.1/lib/log4j-core-2.8.2.jar:/opt/apache-storm-1.1.1/lib/storm-rename-hack-1.1.1.jar:/opt/apache-storm-1.1.1/lib/disruptor-3.3.2.jar:/opt/apache-storm-1.1.1/lib/ring-cors-0.1.5.jar:/opt/apache-storm-1.1.1/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-1.1.1/lib/reflectasm-1.10.1.jar:/opt/apache-storm-1.1.1/lib/minlog-1.3.0.jar:/opt/apache-storm-1.1.1/lib/log4j-api-2.8.2.jar:/opt/apache-storm-1.1.1/lib/servlet-api-2.5.jar:/opt/apache-storm-1.1.1/lib/storm-core-1.1.1.jar:/home/ubuntu/news-crawler/lib/crawler.jar:/opt/apache-storm-1.1.1/conf:/opt/apache-storm-1.1.1/bin -Dstorm.jar=/home/ubuntu/news-crawler/lib/crawler.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} org.commoncrawl.stormcrawler.news.CrawlTopology -conf /home/ubuntu/news-crawler/conf/es-conf.yaml -conf /home/ubuntu/news-crawler/conf/crawler-conf.yaml
5790 [main] INFO  o.a.s.u.TupleUtils - Enabling tick tuple with interval [15]
5917 [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.1.1 old null
5985 [main] INFO  o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -5451346309966923328:-6035994271572471267
6132 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2001ms (NOT MAX)
8135 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2002ms (NOT MAX)
10138 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2006ms (NOT MAX)
12145 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2015ms (NOT MAX)
14162 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2020ms (NOT MAX)
16184 [main] WARN  o.a.s.u.NimbusClient - Ignoring exception while trying to get leader nimbus info from localhost. will retry with a different seed host.
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:108) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.<init>(ThriftClient.java:69) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.<init>(NimbusClient.java:127) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:83) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:57) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.blobstore.NimbusBlobStore.prepare(NimbusBlobStore.java:268) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.getListOfKeysFromBlobStore(StormSubmitter.java:595) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.validateConfs(StormSubmitter.java:561) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:207) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159) [storm-core-1.1.1.jar:1.1.1]
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85) [crawler.jar:?]
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:65) [crawler.jar:?]
at org.commoncrawl.stormcrawler.news.CrawlTopology.run(CrawlTopology.java:91) [crawler.jar:?]
at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50) [crawler.jar:?]
at org.commoncrawl.stormcrawler.news.CrawlTopology.main(CrawlTopology.java:47) [crawler.jar:?]
Caused by: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.TBackoffConnect.retryNext(TBackoffConnect.java:64) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:56) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 15 more
Caused by: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:226) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 15 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_151]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_151]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_151]
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:221) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 15 more
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [localhost]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:111)
at org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:57)
at org.apache.storm.blobstore.NimbusBlobStore.prepare(NimbusBlobStore.java:268)
at org.apache.storm.StormSubmitter.getListOfKeysFromBlobStore(StormSubmitter.java:595)
at org.apache.storm.StormSubmitter.validateConfs(StormSubmitter.java:561)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:207)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85)
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:65)
at org.commoncrawl.stormcrawler.news.CrawlTopology.run(CrawlTopology.java:91)
at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50)
at org.commoncrawl.stormcrawler.news.CrawlTopology.main(CrawlTopology.java:47)
Running: java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/opt/apache-storm-1.1.1 -Dstorm.log.dir=/opt/apache-storm-1.1.1/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /opt/apache-storm-1.1.1/lib/kryo-3.0.3.jar:/opt/apache-storm-1.1.1/lib/slf4j-api-1.7.21.jar:/opt/apache-storm-1.1.1/lib/clojure-1.7.0.jar:/opt/apache-storm-1.1.1/lib/log4j-slf4j-impl-2.8.2.jar:/opt/apache-storm-1.1.1/lib/asm-5.0.3.jar:/opt/apache-storm-1.1.1/lib/objenesis-2.1.jar:/opt/apache-storm-1.1.1/lib/log4j-core-2.8.2.jar:/opt/apache-storm-1.1.1/lib/storm-rename-hack-1.1.1.jar:/opt/apache-storm-1.1.1/lib/disruptor-3.3.2.jar:/opt/apache-storm-1.1.1/lib/ring-cors-0.1.5.jar:/opt/apache-storm-1.1.1/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-1.1.1/lib/reflectasm-1.10.1.jar:/opt/apache-storm-1.1.1/lib/minlog-1.3.0.jar:/opt/apache-storm-1.1.1/lib/log4j-api-2.8.2.jar:/opt/apache-storm-1.1.1/lib/servlet-api-2.5.jar:/opt/apache-storm-1.1.1/lib/storm-core-1.1.1.jar:/opt/apache-storm-1.1.1/conf:/opt/apache-storm-1.1.1/bin org.apache.storm.command.set_log_level NewsCrawl -l crawlercommons.sitemaps.SiteMapParser=ERROR
6438 [main] INFO  o.a.s.c.set-log-level - Sent log config LogConfig(named_logger_level:{crawlercommons.sitemaps.SiteMapParser=LogLevel(action:UPDATE, target_log_level:ERROR, reset_log_level_timeout_secs:0)}) for topology NewsCrawl
6585 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2001ms (NOT MAX)
8587 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2002ms (NOT MAX)
10590 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2004ms (NOT MAX)
12597 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2015ms (NOT MAX)
14615 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2016ms (NOT MAX)
16632 [main] WARN  o.a.s.u.NimbusClient - Ignoring exception while trying to get leader nimbus info from localhost. will retry with a different seed host.
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:108) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.<init>(ThriftClient.java:69) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.<init>(NimbusClient.java:127) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:83) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.command.set_log_level$_main.doInvoke(set_log_level.clj:74) [storm-core-1.1.1.jar:1.1.1]
at clojure.lang.RestFn.applyTo(RestFn.java:137) [clojure-1.7.0.jar:?]
at org.apache.storm.command.set_log_level.main(Unknown Source) [storm-core-1.1.1.jar:1.1.1]
Caused by: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.TBackoffConnect.retryNext(TBackoffConnect.java:64) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:56) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 6 more
Caused by: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:226) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_151]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_151]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_151]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_151]
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:221) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:100) ~[storm-core-1.1.1.jar:1.1.1]
... 6 more
Exception in thread "main" org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts ["localhost"]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:111)
at org.apache.storm.command.set_log_level$_main.doInvoke(set_log_level.clj:74)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at org.apache.storm.command.set_log_level.main(Unknown Source)


I've also tried to run it locally, but I get errors there too when I try to inject the URLs with Storm using the JAR. The stack trace is:

[...] com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector . seeds/feeds.txt -conf conf/es-conf.yaml -conf conf/crawler-conf.yaml
Running: java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/usr/local/Cellar/storm/1.2.1/libexec -Dstorm.log.dir=/usr/local/Cellar/storm/1.2.1/libexec/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /usr/local/Cellar/storm/1.2.1/libexec/*:/usr/local/Cellar/storm/1.2.1/libexec/lib/*:/usr/local/Cellar/storm/1.2.1/libexec/extlib/*:target/crawler-1.8-SNAPSHOT.jar:/usr/local/Cellar/storm/1.2.1/libexec/conf:/usr/local/Cellar/storm/1.2.1/libexec/bin -Dstorm.jar=target/crawler-1.8-SNAPSHOT.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector . seeds/feeds.txt -conf conf/es-conf.yaml -conf conf/crawler-conf.yaml
943  [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.2.1 old null
979  [main] INFO  o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -7811604253832361655:-7094390338979575620
1083 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2001ms (NOT MAX)
3090 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2002ms (NOT MAX)
5097 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2006ms (NOT MAX)
7108 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2013ms (NOT MAX)
9126 [main] WARN  o.a.s.u.StormBoundedExponentialBackoffRetry - WILL SLEEP FOR 2020ms (NOT MAX)
11149 [main] WARN  o.a.s.u.NimbusClient - Ignoring exception while trying to get leader nimbus info from localhost. will retry with a different seed host.
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:112) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.ThriftClient.<init>(ThriftClient.java:73) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.utils.NimbusClient.<init>(NimbusClient.java:136) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:92) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:66) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:58) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.blobstore.NimbusBlobStore.prepare(NimbusBlobStore.java:268) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.StormSubmitter.getListOfKeysFromBlobStore(StormSubmitter.java:595) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.StormSubmitter.validateConfs(StormSubmitter.java:561) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:207) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387) [storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159) [storm-core-1.2.1.jar:1.2.1]
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85) [crawler-1.8-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.run(ESSeedInjector.java:65) [crawler-1.8-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50) [crawler-1.8-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.main(ESSeedInjector.java:38) [crawler-1.8-SNAPSHOT.jar:?]
Caused by: java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.security.auth.TBackoffConnect.retryNext(TBackoffConnect.java:64) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:56) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:104) ~[storm-core-1.2.1.jar:1.2.1]
... 15 more
Caused by: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:226) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:104) ~[storm-core-1.2.1.jar:1.2.1]
... 15 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:400) ~[?:?]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:243) ~[?:?]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:225) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:402) ~[?:?]
at java.net.Socket.connect(Socket.java:591) ~[?:?]
at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:221) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:105) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.2.1.jar:1.2.1]
at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:104) ~[storm-core-1.2.1.jar:1.2.1]
... 15 more
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [localhost]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:120)
at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:66)
at org.apache.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:58)
at org.apache.storm.blobstore.NimbusBlobStore.prepare(NimbusBlobStore.java:268)
at org.apache.storm.StormSubmitter.getListOfKeysFromBlobStore(StormSubmitter.java:595)
at org.apache.storm.StormSubmitter.validateConfs(StormSubmitter.java:561)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:207)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85)
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.run(ESSeedInjector.java:65)
at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50)
at com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector.main(ESSeedInjector.java:38)

Do these errors tell you anything? Any kind of help is appreciated 😊

Thank you for your time!

Kind regards,
Bogdan

Sebastian Nagel

Mar 2, 2018, 9:36:18 AM
to common...@googlegroups.com
Hi Bogdan,

it could also be caused by an incompatible Storm dependency:

- from news-crawler 1755df6, Jan 5:
upgrade to Storm-crawler 1.8, Elasticsearch 6.0, Storm 1.1

- Storm-crawler is now based on Storm *1.2.1* and ES 6.1.1

You could try to roll Storm-crawler back to a9e4cb1
and build everything anew. Otherwise, I should be able
to have a look at it early next week. Sorry, but a
project which is based on snapshot dependencies is not
100% stable.
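
For anyone following along, the rollback and rebuild suggested above might look roughly like the sketch below. The commit a9e4cb1 comes from this message; the function name, the DRY_RUN switch, the -DskipTests flag, and the assumption that storm-crawler and news-crawl are cloned side by side under one working directory are illustrative only.

```shell
#!/usr/bin/env bash

# Print the command instead of running it when DRY_RUN is set.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Hypothetical helper: $1 is the Storm-crawler commit, $2 the directory
# that holds both checkouts (defaults to an assumed $HOME/src).
rebuild_with_pinned_storm_crawler() {
  local commit="$1" workdir="${2:-$HOME/src}"
  # 1. Check out the older Storm-crawler commit.
  run git -C "$workdir/storm-crawler" checkout "$commit" &&
  # 2. Install its artifacts into the local Maven repository.
  run mvn -f "$workdir/storm-crawler/pom.xml" clean install -DskipTests &&
  # 3. Rebuild the news-crawl uberjar that the Dockerfile ADDs.
  run mvn -f "$workdir/news-crawl/pom.xml" clean package
}
```

Calling `rebuild_with_pinned_storm_crawler a9e4cb1` runs the three steps; with `DRY_RUN=1` set it only prints them. Installing the rolled-back Storm-crawler locally first should let the news-crawl build resolve its snapshot dependency against that version.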

I hope you know that the crawled content is free to download,
see http://commoncrawl.org/2016/10/news-dataset-available/
About 5 GB of WARC files are added every day.

Best,
Sebastian

Sebastian Nagel

Mar 6, 2018, 7:45:45 AM
to Common Crawl
Hi Bogdan,

the upgrade to Storm 1.2.1 and Elasticsearch 6.1.1 is tracked here:
https://github.com/commoncrawl/news-crawl/issues/20
Please continue the discussion over there. Thanks!

I've prepared a pull request and verified that crawling in a Docker
container works with today's Storm-crawler version (1.8-SNAPSHOT).

Best,
Sebastian

Bogdan Metea

Mar 8, 2018, 11:58:59 AM
to Common Crawl
Hi Sebastian,

I got side-tracked doing some other things. Thank you for all your help.
I'm going to give it a go now and see how it goes.

Regards,
Bogdan


Bogdan Metea

Mar 8, 2018, 2:50:17 PM
to Common Crawl
Hi Sebastian,

Docker

So I tried it again and it pretty much fails in the same way. I personally think it's because of this:

root@linuxkit-025000000001:/home/ubuntu# /home/ubuntu/news-crawler/bin/run-crawler.sh
/usr/lib/python2.7/dist-packages/supervisor/options.py:297: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  'Supervisord is running as root and it is searching '
sudo: unable to resolve host linuxkit-025000000001

I think that because of the way Docker works on Mac, it can't find the leader Nimbus host and port.
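
One way to test that guess is to check from inside the container whether anything is listening where the client looks for Nimbus. The sketch below assumes Nimbus uses Storm's default Thrift port 6627 on localhost (the seed host named in the error message); the helper name port_open is made up for illustration, and it relies on bash plus `timeout` from GNU coreutils.

```shell
#!/usr/bin/env bash

# Succeeds if a TCP connection to host ($1) and port ($2) opens within 3 seconds.
# Uses bash's /dev/tcp redirection, so it needs bash rather than plain sh.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Assumption: default nimbus.thrift.port (6627) on localhost.
if port_open localhost 6627; then
  echo "Nimbus port is reachable"
else
  echo "Nimbus port is not reachable (is Nimbus running in this container?)"
fi
```

If the port is not reachable, the "Connection refused" in the stack trace more likely means Nimbus never started than that the topology submission itself is at fault.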


Local

I can submit both topologies locally and I can see them both in Storm UI. Do I have to do anything else after that? How do I check that the seeds are being crawled?
I just realised the seeds are all RSS feeds. Will something only get crawled when it is pushed over RSS? Is RSS the only thing that's supported? Will it work for sitemaps?

I've set warc.dir to a relative path and I'm going to leave it running overnight.

I will also try it from a Linux machine; maybe Docker works fine there.

Regards,
Bogdan

Sebastian Nagel

Mar 9, 2018, 5:09:24 AM
to common...@googlegroups.com
Hi Bogdan,

> sudo: unable to resolve host linuxkit-025000000001

at first glance that looks like a network configuration issue.
You could try to map linuxkit-025000000001 to 127.0.0.1 (or another
loopback address) in the /etc/hosts of your Docker container.
Storm is designed as a multi-node framework; in short, $HOSTNAME
must be resolvable to an IP address. But I've never seen this with
Docker on Linux.
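
A quick check for this inside the container could look like the sketch below. It is only an illustration: the helper name resolves is made up, and it assumes `getent` is available (it is on glibc-based images such as Ubuntu).

```shell
#!/usr/bin/env bash

# Succeeds if the given name resolves via the system resolver (/etc/hosts, DNS).
resolves() { getent hosts "$1" >/dev/null; }

# Use $HOSTNAME if set, otherwise ask the kernel for the node name.
name="${HOSTNAME:-$(uname -n)}"

if resolves "$name"; then
  echo "$name resolves"
else
  echo "$name does not resolve; consider mapping it to 127.0.0.1"
fi

# Possible fixes (not executed here):
#   inside the container, as root:  echo "127.0.0.1 $name" >> /etc/hosts
#   or when starting the container: docker run --add-host "$name:127.0.0.1" ...
```

The trailing `...` in the docker command stands for whatever image and arguments you normally pass; only the `--add-host` part is relevant here.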

Best,
Sebastian
