First time StormCrawler no output


gcr

Feb 18, 2020, 4:13:13 PM
to DigitalPebble
All,

I'm running StormCrawler for the first time. It seems to run well enough, but StdOutIndexer yields nothing, no output at all. Why?

I have StormCrawler running in a set of Docker containers. All seems to be going well. I am trying to crawl a single, minimal, local web site. My topology is almost exactly the same as the example CrawlTopology.java; I am still using all the same bolts, including StdOutIndexer. I have just one URL, which points to a locally running nginx server that serves the default nginx page.

I run the job and it appears to go well: I receive no errors, and it tells me the topology was submitted, apparently successfully, yet I see no output. Isn't it supposed to index the page and print it to the console?

Thanks so much

I am including the whole output just in case it helps.

(base) C02Y50P5JGH7:IntraCrawlService geoffryroberts$ ./run.sh

here1=/Users/geoffryroberts/cdcWk/IntraCrawlService/IntraCrawlService

Running: /usr/local/openjdk-8/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/apache-storm-2.1.0 -Dstorm.log.dir=/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib:/usr/lib64 -Dstorm.conf.file= -cp /apache-storm-2.1.0/*:/apache-storm-2.1.0/lib/*:/apache-storm-2.1.0/extlib/*:/IntraCrawlService.jar:/conf:/apache-storm-2.1.0/bin: -Dstorm.jar=/IntraCrawlService.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} gov.cdc.dcat.IntraCrawlService IntraCrawlService

21:10:30.763 [main] INFO  o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -7151646488298444073:-8851430964261411050

21:10:30.871 [main] WARN  o.a.s.v.ConfigValidation - task.heartbeat.frequency.secs is a deprecated config please see class org.apache.storm.Config.TASK_HEARTBEAT_FREQUENCY_SECS for more information.

21:10:31.051 [main] INFO  o.a.s.u.NimbusClient - Found leader nimbus : 3701762abf10:6627

21:10:31.053 [main] INFO  o.a.s.s.a.ClientAuthUtils - Got AutoCreds []

21:10:31.113 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - jars...

21:10:31.115 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - artifacts...

21:10:31.116 [main] INFO  o.a.s.StormSubmitter - Dependency Blob keys - jars : [] / artifacts : []

21:10:31.134 [main] INFO  o.a.s.StormSubmitter - Uploading topology jar /IntraCrawlService.jar to assigned location: /data/nimbus/inbox/stormjar-d0fe3da0-0562-4142-be5b-5ebcad18482a.jar

21:10:31.405 [main] INFO  o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /data/nimbus/inbox/stormjar-d0fe3da0-0562-4142-be5b-5ebcad18482a.jar

21:10:31.405 [main] INFO  o.a.s.StormSubmitter - Submitting topology crawl in distributed mode with conf {"http.content.limit":-1,"fetcher.max.crawl.delay.force":false,"sitemap.schedule.delay":-1,"fetcher.timeout.queue":-1,"http.agent.description":"built with StormCrawler 1.16","status.updater.use.cache":true,"fetcher.max.throttle.sleep":-1,"http.agent.version":"1.0","fetcher.max.crawl.delay":30,"fetcher.server.delay.force":false,"metadata.persist":["_redirTo","error.cause","error.source","isSitemap","isFeed"],"http.accept":"text\/html,application\/xhtml+xml,application\/xml;q=0.9,*\/*;q=0.8","track.anchors":true,"fetcher.threads.number":10,"sitemap.discovery":false,"protocols":"http,https,file","indexer.text.maxlength":-1,"detect.mimetype":true,"fetcher.max.queue.size":-1,"selenium.capabilities":{"takesScreenshot":false,"loadImages":false,"javascriptEnabled":true},"metadata.track.depth":true,"indexer.url.fieldname":"url","storm.zookeeper.topology.auth.payload":"-7151646488298444073:-8851430964261411050","http.accept.language":"en-us,en-gb,en;q=0.7,*;q=0.3","fetcher.queue.mode":"byHost","selenium.implicitlyWait":0,"detect.charset.maxlength":10000,"robots.cache.spec":"maximumSize=10000,expireAfterWrite=6h","indexer.canonical.name":"canonical","fetcher.metrics.time.bucket.secs":10,"fetcher.server.min.delay":0.0,"sitemap.filter.hours.since.modified":-1,"max.fetch.errors":3,"fetchInterval.error":-1,"fetcher.server.delay":1.0,"metadata.track.path":true,"http.protocol.implementation":"com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol","selenium.delegated.protocol":"com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol","parser.emitOutlinks":true,"fetchInterval.default":1440,"status.updater.unit.round.date":"SECOND","selenium.instances.num":1,"http.store.headers":false,"jsoup.treat.non.html.as.error":true,"http.timeout":10000,"http.robots.403.allow":true,"https.protocol.implementation":"com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol","indexer.text.fieldname":"content","http.agent.name":"Anonymous Coward","urlbuffer.class":"com.digitalpebble.stormcrawler.persistence.urlbuffer.SimpleURLBuffer","http.content.partial.as.trimmed":false,"indexer.md.mapping":["parse.title=title","parse.keywords=keywords","parse.description=description"],"selenium.setScriptTimeout":0,"robots.noFollow.strict":true,"file.protocol.implementation":"com.digitalpebble.stormcrawler.protocol.file.FileProtocol","http.agent.url":"http:\/\/someorganization.com\/","status.updater.cache.spec":"maximumSize=10000,expireAfterAccess=1h","scheduler.class":"com.digitalpebble.stormcrawler.persistence.DefaultScheduler","parser.emitOutlinks.max.per.page":-1,"fetcher.threads.per.queue":1,"storm.zookeeper.topology.auth.scheme":"digest","robots.error.cache.spec":"maximumSize=10000,expireAfterWrite=1h","selenium.pageLoadTimeout":-1,"fetcher.max.urls.in.queues":-1,"topology.kryo.register":["com.digitalpebble.stormcrawler.Metadata"],"http.agent.email":"som...@someorganization.com","partition.url.mode":"byHost","fetchInterval.fetch.error":120}

21:10:31.755 [main] INFO  o.a.s.StormSubmitter - Finished submitting topology: crawl

DigitalPebble

Feb 19, 2020, 4:30:01 AM
to DigitalPebble
Hi,

Looks like you submitted your topology to a Storm cluster in distributed mode, which is great, but it means that nothing will be displayed on the console. The output should be in the worker.log.out file in the directory containing the logs, e.g. /var/log/storm/workers-artifacts/crawler-11-1557498619/6700/
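If you're not sure which worker directory to look in, a small helper like the one below finds the most recently written worker log. This is a sketch, not part of StormCrawler: the default log root here is an assumption based on the `-Dstorm.log.dir=/logs` setting visible in the run output above; adjust it to your own storm.log.dir.

```shell
# latest_worker_log: print the most recently modified worker log file
# under the given Storm log root. The default root (/logs/workers-artifacts)
# is an assumption matching -Dstorm.log.dir=/logs from the log above.
latest_worker_log() {
  local root=${1:-/logs/workers-artifacts}
  # %T@ = modification time in epoch seconds, %p = path; newest first.
  find "$root" -name 'worker.log*' -printf '%T@ %p\n' 2>/dev/null \
    | sort -rn | head -n1 | cut -d' ' -f2-
}
```

You could then watch the indexer output with e.g. `tail -f "$(latest_worker_log /var/log/storm/workers-artifacts)"`.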

StdOutIndexer is mostly used in local mode (i.e. not distributed) for debugging and testing.
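To see StdOutIndexer's output directly on the console, one way is to run the topology in local mode. A sketch of the invocation, assuming the main class extends StormCrawler's ConfigurableTopology (as the example CrawlTopology does), which recognises the -local flag; the jar and class names are taken from the log above, and the crawler-conf.yaml path is an assumption:

```shell
# Local mode: the topology runs in-process and StdOutIndexer prints
# straight to this console instead of a worker log on the cluster.
storm jar IntraCrawlService.jar gov.cdc.dcat.IntraCrawlService \
  -conf crawler-conf.yaml -local
```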

Please use StackOverflow with the tag StormCrawler - you'll get a wider audience.

Have a good day

Julien
