# Custom configuration for StormCrawler
# This is used to override the default values from crawler-default.xml and provide additional ones
# for your custom components.
# Use this file with the parameter -conf when launching your extension of ConfigurableTopology.
# This file does not contain all the key values but only the most frequently used ones. See crawler-default.xml for an extensive list.

config:
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 10
  topology.debug: false

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  # - customMetadataName

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
    - _redirTo
    - error.cause
    - error.source
    - isSitemap
    - isFeed

  http.agent.name: "StormCrawlerTest4G"
  http.agent.version: "1.0"
  http.agent.description: "A Bot 4g"

  # FetcherBolt queue dump : comment out to activate
  # if a file exists on the worker machine with the corresponding port number
  # the FetcherBolt will log the content of its internal queues to the logs
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120

  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1

  # custom fetch interval to be used when a document has the key/value in its metadata
  # and has been fetched successfully (value in minutes)
  # fetchInterval.isFeed=true: 10

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
    - parse.description=description
    - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1

  # for test purposes
  fetcher.threads.per.queue: 4
  fetcher.max.crawl.delay: 1
  fetcher.server.delay: 0.1
  fetcher.server.min.delay: 0.1
  http.content.limit: -1
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.zwoop.crawler.ContentFilterOut",
      "name": "ContentFilter",
      "params": {
        "pattern": "*"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
      }
    }
  ]
}
{
  "com.digitalpebble.stormcrawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
      "name": "BasicURLFilter",
      "params": {
        "maxPathRepetition": 3,
        "maxLength": 1024
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
      "name": "MaxDepthFilter",
      "params": {
        "maxDepth": 2
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer",
      "name": "BasicURLNormalizer",
      "params": {
        "removeAnchorPart": true,
        "unmangleQueryString": true,
        "checkValidURI": true,
        "removeHashes": false
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
      "name": "HostURLFilter",
      "params": {
        "ignoreOutsideHost": true,
        "ignoreOutsideDomain": true
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer",
      "name": "RegexURLNormalizer",
      "params": {
        "regexNormalizerFile": "default-regex-normalizers.xml"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter",
      "name": "RegexURLFilter",
      "params": {
        "regexFilterFile": "default-regex-filters.txt"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter",
      "name": "SelfURLFilter"
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter",
      "name": "MetadataFilter",
      "params": {
        "isSitemap": "false"
      }
    }
  ]
}
<dependencies>
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>1.0.2</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-hdfs</artifactId>
    <version>1.0.2</version>
  </dependency>
  <dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.9</version>
  </dependency>
  <!--
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>flux-core</artifactId>
    <version>1.0.2</version>
  </dependency>
  -->
  <dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <version>1.2</version>
  </dependency>
</dependencies>
Hello,

First, a bit of context: I'm quite new to the crawling world, Storm and StormCrawler.
I'm trying to crawl a list of domains in order to get the full HTML, CSS and JS content of all pages, exported into WARC files.
For that I'm trying both Nutch and StormCrawler.
I installed StormCrawler locally on Windows 10 using the Maven archetype.
-> I ran into this issue, so I had to remove the dependency on flux-core from the pom.
I managed to fetch some URLs from a domain, using the standard StdOutStatusUpdater from the CrawlTopology. Then I included the external WARC export library from GitHub and modified the CrawlTopology.
-> For that I had to add dependencies on storm-hdfs and commons-codec.
<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-warc</artifactId>
  <version>1.2</version>
</dependency>
-> Then I got a Hadoop connection refused exception, linked to these lines:
String fsURL = "hdfs://localhost:9000";
warcbolt.withFsUrl(fsURL);

String warcFilePath = "/warc";
FileNameFormat fileNameFormat = new WARCFileNameFormat().withPath(warcFilePath);
I locally installed and ran Hadoop HDFS (which was quite a hassle on Windows...), even though I'm not quite sure it was necessary. Can you comment on that?
Anyway, even though it removed some exceptions, I also had to comment out the lines above so that the standard configuration could connect to localhost.
Now I'm still not able to get any of the content: I can see in the standard output that it's discovering and fetching some URLs from the host, but it generates an empty warc.gz file.
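For what it's worth, an empty .gz file is a classic symptom of a GZIP stream that was opened but never finished or closed: the compressed blocks stay buffered in the deflater until the stream is finalized, so only the small GZIP header reaches the target. This is just a self-contained sketch with plain java.util.zip (not the WARC bolt itself) illustrating that behaviour:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCloseDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(buffer);

        gzip.write("WARC/1.0 record payload".getBytes(StandardCharsets.UTF_8));
        // At this point the payload typically still sits in the deflater's
        // internal buffer: the target holds little more than the GZIP
        // header, so on disk the file would look "empty".
        int bytesBeforeClose = buffer.size();

        // Only close() (or finish()) emits the remaining compressed
        // blocks and the trailer that make the archive readable.
        gzip.close();
        int bytesAfterClose = buffer.size();
        System.out.println("grew on close: " + (bytesAfterClose > bytesBeforeClose));

        // The finalized stream round-trips correctly.
        GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
        ByteArrayOutputStream decoded = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            decoded.write(b);
        }
        System.out.println("payload recovered: "
                + new String(decoded.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

So if the topology is killed (or stays in local mode) before the WARC writer closes or rotates its file, the gzipped output can remain empty even though fetching clearly happened.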
Do you have any idea what steps I missed or did wrong?
Any help would be very appreciated :)
--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebble+unsubscribe@googlegroups.com.
To post to this group, send email to digita...@googlegroups.com.
Visit this group at https://groups.google.com/group/digitalpebble.
For more options, visit https://groups.google.com/d/optout.
Thanks Julien for your fast answer.
/**
 * Dummy topology to play with the spouts and bolts
 */
public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        String[] testURLs = new String[] { "http://www.lequipe.fr/" };

        builder.setSpout("spout", new MemorySpout(testURLs));

        builder.setBolt("partitioner", new URLPartitionerBolt()).shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt()).fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("sitemap", new SiteMapParserBolt()).localOrShuffleGrouping("fetch");

        builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("sitemap");

        builder.setBolt("index", new StdOutIndexer()).localOrShuffleGrouping("parse");

        Fields furl = new Fields("url");

        // can also use MemoryStatusUpdater for simple recursive crawls
        builder.setBolt("status", new StdOutStatusUpdater())
                .fieldsGrouping("fetch", Constants.StatusStreamName, furl)
                .fieldsGrouping("sitemap", Constants.StatusStreamName, furl)
                .fieldsGrouping("parse", Constants.StatusStreamName, furl)
                .fieldsGrouping("index", Constants.StatusStreamName, furl);

        // generating WARC files
        String warcFilePath = "/warc";
        FileNameFormat fileNameFormat = new WARCFileNameFormat().withPath(warcFilePath);

        Map<String, String> fields = new HashMap<>();
        fields.put("software:", "StormCrawler 1.0 http://stormcrawler.net/");
        fields.put("conformsTo:", "http://www.archive.org/documents/WarcFileFormat-1.0.html");

        WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt().withFileNameFormat(fileNameFormat);
        warcbolt.withHeader(fields);

        // can specify the filesystem - will use the local FS by default
        String fsURL = "hdfs://localhost:9000";
        // warcbolt.withFsUrl(fsURL);

        // a custom max length can be specified - 1 GB will be used as a default
        FileSizeRotationPolicy rotpol = new FileSizeRotationPolicy(5.0f, Units.MB);
        warcbolt.withRotationPolicy(rotpol);

        builder.setBolt("warc", warcbolt).localOrShuffleGrouping("fetch");

        return submit("crawl", conf, builder);
    }
}
950 [main] INFO o.a.s.u.TupleUtils - Enabling tick tuple with interval [15]
1145 [main] INFO o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -7001763315176827023:-8145610104524275269
1227 [main] INFO o.a.s.s.a.AuthUtils - Got AutoCreds []
17298 [main] WARN o.a.s.u.NimbusClient - Ignoring exception while trying to get leader nimbus info from localhost. will retry with a different seed host.
java.lang.RuntimeException: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused: connect
	at org.apache.storm.security.auth.TBackoffConnect.retryNext(TBackoffConnect.java:64) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:56) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.security.auth.ThriftClient.reconnect(ThriftClient.java:99) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.security.auth.ThriftClient.<init>(ThriftClient.java:69) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.utils.NimbusClient.<init>(NimbusClient.java:106) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:66) [storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.StormSubmitter.topologyNameExists(StormSubmitter.java:371) [storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:233) [storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:311) [storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:157) [storm-core-1.0.2.jar:1.0.2]
	at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85) [storm-crawler-core-1.2.jar:?]
	at com.zwoop.crawler.CrawlTopology.run(CrawlTopology.java:107) [classes/:?]
	at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50) [storm-crawler-core-1.2.jar:?]
	at com.zwoop.crawler.CrawlTopology.main(CrawlTopology.java:45) [classes/:?]
Caused by: org.apache.storm.thrift.transport.TTransportException: java.net.ConnectException: Connection refused: connect
	at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:226) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:103) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.0.2.jar:1.0.2]
	... 12 more
Caused by: java.net.ConnectException: Connection refused: connect
	at java.net.DualStackPlainSocketImpl.connect0(Native Method) ~[?:1.8.0_111]
	at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source) ~[?:1.8.0_111]
	at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source) ~[?:1.8.0_111]
	at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source) ~[?:1.8.0_111]
	at java.net.AbstractPlainSocketImpl.connect(Unknown Source) ~[?:1.8.0_111]
	at java.net.PlainSocketImpl.connect(Unknown Source) ~[?:1.8.0_111]
	at java.net.SocksSocketImpl.connect(Unknown Source) ~[?:1.8.0_111]
	at java.net.Socket.connect(Unknown Source) ~[?:1.8.0_111]
	at org.apache.storm.thrift.transport.TSocket.open(TSocket.java:221) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.thrift.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:103) ~[storm-core-1.0.2.jar:1.0.2]
	at org.apache.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:53) ~[storm-core-1.0.2.jar:1.0.2]
	... 12 more
org.apache.storm.utils.NimbusLeaderNotFoundException: Could not find leader nimbus from seed hosts [localhost]. Did you specify a valid list of nimbus hosts for config nimbus.seeds?
	at org.apache.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:90)
	at org.apache.storm.StormSubmitter.topologyNameExists(StormSubmitter.java:371)
	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:233)
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:311)
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:157)
	at com.digitalpebble.stormcrawler.ConfigurableTopology.submit(ConfigurableTopology.java:85)
	at com.zwoop.crawler.CrawlTopology.run(CrawlTopology.java:107)
	at com.digitalpebble.stormcrawler.ConfigurableTopology.start(ConfigurableTopology.java:50)
	at com.zwoop.crawler.CrawlTopology.main(CrawlTopology.java:45)
Hi,
I launch it with this command:

mvn clean compile exec:java -Dexec.mainClass=com.zwoop.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml local"
But then I miss the last part of the content, which is not fetched, so I need to find a way to detect that the fetching is finished and flush the data from memory to the file.
Ok, sorry, I'll try to follow Storm terminology. (I'm still running in local mode so far, but will switch to Storm very soon.)
I would like to be able to monitor the tuples waiting to be processed and see the ones already processed.
Maybe also know whether a topology is still processing a tuple or is idle, waiting for new tuples to process.
Concretely, what I would like to achieve is triggering some function (for example, flushing all the data in memory to the WARC file and compressing it) when the crawl of a whole website is finished (i.e. staying on the same host and domain, having fetched all discovered URLs).
I'm thinking of adding some kind of logic to the Spout for that, but it's probably not the best design...
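The "add logic to the Spout" idea can be sketched without any Storm classes: keep an in-flight counter that the spout increments on emit and decrements on ack/fail, and treat the crawl as finished once the counter has been zero for a quiet period. Everything below (including the class name CrawlDrainTracker) is hypothetical illustration, not a StormCrawler API:

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper sketching the "detect end of crawl" idea:
 * increment when a URL is emitted, decrement when it is acked or
 * failed, and consider the crawl drained once the in-flight count
 * has stayed at zero for a configurable quiet period.
 */
public class CrawlDrainTracker {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final long quietPeriodMs;
    private volatile long lastActivity = System.currentTimeMillis();

    public CrawlDrainTracker(long quietPeriodMs) {
        this.quietPeriodMs = quietPeriodMs;
    }

    /** Call when the spout emits a URL. */
    public void emitted() {
        inFlight.incrementAndGet();
        lastActivity = System.currentTimeMillis();
    }

    /** Call when a URL is acked or failed. */
    public void completed() {
        inFlight.decrementAndGet();
        lastActivity = System.currentTimeMillis();
    }

    /** True once nothing is in flight and the quiet period has elapsed. */
    public boolean isDrained() {
        return inFlight.get() == 0
                && System.currentTimeMillis() - lastActivity >= quietPeriodMs;
    }

    public static void main(String[] args) throws Exception {
        CrawlDrainTracker tracker = new CrawlDrainTracker(50);
        tracker.emitted();
        tracker.emitted();
        tracker.completed();
        System.out.println("drained while one URL pending: " + tracker.isDrained());
        tracker.completed();
        Thread.sleep(100); // wait out the quiet period
        System.out.println("drained after all acks: " + tracker.isDrained());
        // a real topology would flush/close the WARC writer here
    }
}
```

The quiet period matters because discovered outlinks re-enter the pipeline after the count briefly hits zero; a counter alone would fire too early.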
Regards,
Anthony

2016-11-15 16:44 GMT+08:00 DigitalPebble <jul...@digitalpebble.com>:

Do you think there is a way to programmatically monitor in Storm the ongoing and queued jobs?
What do you mean by jobs? Topologies?

On 15 November 2016 at 00:30, Anthony MICHEL <michel....@gmail.com> wrote:

Yes, I meant flushed.
The first solution is what I tried first, and it generates a corrupted compressed file.
Using some kind of timeout is a good idea; I may try it.
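The timeout idea can be sketched in plain Java with a ScheduledExecutorService, independently of Storm: run a flush task at a fixed rate so buffered records reach disk even if the crawl never "ends" cleanly. The flush itself is simulated here (the counter stands in for a hypothetical flushPendingRecords()):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicFlushDemo {
    public static void main(String[] args) throws Exception {
        AtomicInteger flushes = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(3);

        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Every 20 ms, pretend to flush buffered WARC records to disk.
        scheduler.scheduleAtFixedRate(() -> {
            flushes.incrementAndGet(); // stand-in for flushPendingRecords()
            done.countDown();
        }, 20, 20, TimeUnit.MILLISECONDS);

        done.await(2, TimeUnit.SECONDS); // wait until at least 3 flushes ran
        scheduler.shutdown();
        System.out.println("flushed at least 3 times: " + (flushes.get() >= 3));
    }
}
```

In a bolt, the idiomatic equivalent would be Storm's tick tuples rather than a private timer thread, but the flush-on-interval logic is the same.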
Do you think there is a way to programmatically monitor in Storm the ongoing and queued jobs?
Regards,
Anthony