Configuring StormCrawler for outside URLs

rolm...@gmail.com

Aug 30, 2016, 12:04:08 PM

to DigitalPebble

Hello,

I have been trying to configure StormCrawler to crawl the web using a web directory as the seed, but it always crawls the same hostname and never follows links to outside sites.

Can you guide me please?

Thanks in advance,

Rodrigo

DigitalPebble

Aug 30, 2016, 12:49:23 PM

to digita...@googlegroups.com

Hi Rodrigo

Did you create your project using the archetype? If so, the urlfilters.json file in resources should already be configured to stay within the same domain:

{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": false,
    "ignoreOutsideDomain": true
  }
},

Change ignoreOutsideHost to true to be stricter and stay within the same hostname.
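
As a rough illustration of how the two flags described above interact, here is a minimal Python sketch. It is not StormCrawler's implementation: the function names are hypothetical, and `naive_domain` simply keeps the last two labels of the hostname, whereas a real crawler would normally consult the public suffix list.

```python
from urllib.parse import urlsplit


def naive_domain(host):
    """Simplification: treat the last two labels of the hostname as the domain."""
    return ".".join(host.split(".")[-2:])


def keep_outlink(source_url, outlink_url,
                 ignore_outside_host=False, ignore_outside_domain=True):
    """Return True if the outlink would survive a host/domain filter (sketch only)."""
    src = urlsplit(source_url).hostname or ""
    dst = urlsplit(outlink_url).hostname or ""
    # Strictest setting: the outlink must be on exactly the same hostname.
    if ignore_outside_host and src != dst:
        return False
    # Looser setting: the outlink may be on another host of the same domain.
    if ignore_outside_domain and naive_domain(src) != naive_domain(dst):
        return False
    # With both flags false, every outlink passes this filter.
    return True


if __name__ == "__main__":
    src = "http://a.example.com/"
    print(keep_outlink(src, "http://b.example.com/page"))
    print(keep_outlink(src, "http://other.org/"))
    print(keep_outlink(src, "http://other.org/",
                       ignore_outside_host=False, ignore_outside_domain=False))
```

Under these assumptions, a crawl only leaves the seed's site when both flags are false.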

If you don't have such a file, create one in src/main/resources, using the one from the archetype as an example.

Julien

PS: you could also post questions about StormCrawler on Stack Overflow using the tag 'stormcrawler'.

--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebble+unsubscribe@googlegroups.com.
To post to this group, send email to digita...@googlegroups.com.
Visit this group at https://groups.google.com/group/digitalpebble.
For more options, visit https://groups.google.com/d/optout.

--

Open Source Solutions for Text Engineering

http://www.digitalpebble.com

http://digitalpebble.blogspot.com

https://twitter.com/digitalpebble

rolm...@gmail.com

Aug 30, 2016, 2:02:06 PM

to DigitalPebble, jul...@digitalpebble.com

Hi Julien,

Thanks for your quick response. I think I haven't explained myself clearly. I would like to do just the opposite: let the crawler jump from site A to site B, in order to implement a basic web finder.

This is my urlfilter.json:

{
  "com.digitalpebble.storm.crawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.storm.crawler.filtering.depth.MaxDepthFilter",
      "name": "MaxDepthFilter",
      "params": {
        "maxDepth": 100
      }
    },
    {
      "class": "com.digitalpebble.storm.crawler.filtering.basic.BasicURLNormalizer",
      "name": "BasicURLNormalizer",
      "params": {
        "removeAnchorPart": true,
        "unmangleQueryString": true,
        "checkValidURI": true
      }
    },
    {
      "class": "com.digitalpebble.storm.crawler.filtering.host.HostURLFilter",
      "name": "HostURLFilter",
      "params": {
        "ignoreOutsideHost": false,
        "ignoreOutsideDomain": false
      }
    },
    {
      "class": "com.digitalpebble.storm.crawler.filtering.regex.RegexURLNormalizer",
      "name": "RegexURLNormalizer",
      "params": {
        "regexNormalizerFile": "default-regex-normalizers.xml"
      }
    },
    {
      "class": "com.digitalpebble.storm.crawler.filtering.regex.RegexURLFilter",
      "name": "RegexURLFilter",
      "params": {
        "regexFilterFile": "default-regex-filters.txt"
      }
    },
    {
      "class": "com.digitalpebble.storm.crawler.filtering.basic.SelfURLFilter",
      "name": "SelfURLFilter"
    }
  ]
}

and the crawl.yaml file:

# Default configuration for StormCrawler
# This is used to make the default values explicit and list the most common configurations.
# Do not modify this file but instead provide a custom one with the parameter -config
# when launching your extension of ConfigurableTopology.

fetcher.server.delay: 1.0
fetcher.server.min.delay: 0.0
fetcher.queue.mode: "byHost"
fetcher.threads.per.queue: 1
fetcher.threads.number: 10

# time bucket to use for the metrics sent by the Fetcher
fetcher.metrics.time.bucket.secs: 10

partition.url.mode: "byHost"

# lists the metadata to transfer to the outlinks
# used by Fetcher for redirections, sitemapparser, etc...
# metadata.transfer:
# - key1
# - key2
# - key3

http.agent.name: "anonymous coward"
http.agent.version: "1.0"
http.agent.description: "a Storm-based crawler"
http.agent.url: "https://github.com/DigitalPebble/storm-crawler"
http.agent.email: "som...@company.com"

http.accept.language: "en-us,en-gb,en;q=0.7,*;q=0.3"
http.accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
http.content.limit: 65536
http.store.responsetime: true
http.timeout: 10000

http.robots.403.allow: true

# should the URLs be removed when a page is marked as noFollow
robots.noFollow.strict: true

protocols: "http,https"
http.protocol.implementation: "com.digitalpebble.storm.crawler.protocol.httpclient.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.storm.crawler.protocol.httpclient.HttpProtocol"

parsefilters.config.file: "parsefilters.json"
urlfilters.config.file: "urlfilters.json"

# whether the sitemap parser should try to
# determine whether a page is a sitemap based
# on its content if it is missing the K/V in the metadata
sitemap.sniffContent: false

# filters URLs in sitemaps based on their modified Date (if any)
sitemap.filter.hours.since.modified: -1

# whether to add any sitemaps found in the robots.txt to the status stream
# used by fetcher bolts. sitemap.sniffContent must be set to true if the
# discovery is enabled
sitemap.discovery: false

# Default implementation of Scheduler
scheduler.class: "com.digitalpebble.storm.crawler.persistence.DefaultScheduler"

# revisit a page daily (value in minutes)
fetchInterval.default: 1440

# revisit a page with a fetch error after 2 hours (value in minutes)
fetchInterval.fetch.error: 120

# revisit a page with an error every month (value in minutes)
fetchInterval.error: 44640

# max number of successive fetch errors before changing status to ERROR
max.fetch.errors: 3

# configuration for the classes extending AbstractIndexerBolt
# indexer.md.filter: "someKey=aValue"
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
indexer.md.mapping:
  - parse.title=page_title
  - parse.keywords=keywords
  - parse.description=description
  - html_title=html_title
  - doc_type=doc_type

metadata.track.path: true
metadata.track.depth: false

I have been crawling for 6 hours and all discovered files are under the seed host.

Is there a problem with my configuration?

Thanks in advance for your help,

Rodrigo

DigitalPebble

Sep 1, 2016, 11:01:07 AM

to Rodrigo Olmo Velasco, digita...@googlegroups.com

Hi Rodrigo

Any luck?

BTW you don't need to copy the whole content of the default config into your own config file; just specify the values you want to override. It's also good practice to set a proper value for http.agent.name ;-)

Julien
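
The "only override what you need" advice can be pictured as a small file layered on top of the defaults. The Python sketch below is only a conceptual illustration of that layering, not StormCrawler's actual config loader. The default values are copied from the config posted earlier in the thread, while `my-web-finder` and the function name are made-up placeholders.

```python
# Illustrative subset of the defaults quoted earlier in this thread.
DEFAULTS = {
    "fetcher.threads.number": 10,
    "http.agent.name": "anonymous coward",
    "urlfilters.config.file": "urlfilters.json",
}

# Hypothetical custom file: only the keys that differ from the defaults.
OVERRIDES = {
    "http.agent.name": "my-web-finder",  # placeholder agent name
}


def effective_config(defaults, overrides):
    """Return a new dict in which override values win over the defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


if __name__ == "__main__":
    for key, value in sorted(effective_config(DEFAULTS, OVERRIDES).items()):
        print(f"{key}: {value}")
```

Keeping the custom file this small also makes it easier to see which settings were changed on purpose.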

On 30 August 2016 at 21:36, DigitalPebble <jul...@digitalpebble.com> wrote:

Hi Rodrigo

So even after setting ignoreOutsideDomain to false, the crawl doesn't go beyond the original hostname? Did you recompile a new jar after modifying the file? Do you get something different when removing the entire HostURLFilter section from the filter config?

J
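
One way to check whether the rebuilt jar really contains the modified filter file is to read the resource straight out of the archive. The following Python sketch assumes the file is packaged at the jar root as `urlfilters.json` (the value of urlfilters.config.file shown earlier). The function name and the jar path are hypothetical.

```python
import json
import zipfile


def host_filter_params(jar_path, resource="urlfilters.json"):
    """Return the HostURLFilter params packaged in the jar, or None if absent."""
    with zipfile.ZipFile(jar_path) as jar:
        if resource not in jar.namelist():
            raise FileNotFoundError(f"{resource} is not packaged in {jar_path}")
        config = json.loads(jar.read(resource).decode("utf-8"))
    # The filter list sits under a single top-level key whose name depends on
    # the StormCrawler version, so scan every top-level list.
    for value in config.values():
        if isinstance(value, list):
            for entry in value:
                if isinstance(entry, dict) and entry.get("name") == "HostURLFilter":
                    return entry.get("params", {})
    return None


if __name__ == "__main__":
    import sys
    # Usage: python check_jar.py path/to/your-topology.jar
    print(host_filter_params(sys.argv[1]))
```

If the printed params still show the old values, or the file is missing altogether, the topology is not reading the configuration you edited.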

rolm...@gmail.com

Sep 1, 2016, 11:07:30 AM

to DigitalPebble, rolm...@gmail.com, jul...@digitalpebble.com

Yes, sure. It was just a problem with my configuration: I wasn't giving the right path (yes, more than 15 years working and still doing such stupid things).

Now I'm crawling the whole internet (XDDD).

If you want, as soon as we launch something usable we will contact you so you can add it to your use cases.

Thank you very much for your support.

DigitalPebble

Sep 1, 2016, 11:12:37 AM

to Rodrigo Olmo Velasco, DigitalPebble

Glad you got it to work! Yes, please! It's always good to hear how people use it and to show that they do.