Hello all,
I cannot seem to get my Frontera crawling cluster to work.
I have it set up in AWS: an EMR HBase cluster, a box running the Bitnami Kafka AMI, and a server with Python 3.5 running Frontera.
I am able to connect and get all components running without errors.
The cluster will crawl the seed URLs and output the results I expect. However, it does not crawl anything besides the seed URLs.
In the settings file I specify the crawling strategy:
CRAWLING_STRATEGY = 'frontera.strategy.basic.BasicCrawlingStrategy'
STRATEGY = 'frontera.strategy.basic.BasicCrawlingStrategy'
It is not clear to me whether the setting should be named "CRAWLING_STRATEGY" or "STRATEGY", so I am specifying the BasicCrawlingStrategy under both names.
I think something may not be working with either Kafka or HBase, but that is an assumption; I do not know. I looked at the Thrift and HBase logs and do not see a single [ERROR] or [WARNING] message. Furthermore, the tables in the crawler namespace get created in HBase when I start Frontera, so it clearly is able to access HBase and run commands.
Nevertheless ... when I try to count rows in HBase, every single table gives me zero results (see below):
In this case, I put a single domain in the seed file. The spider successfully crawled that single page, but HBase showed nothing (the terminal window shown above was captured *after* the spider ran successfully on the seed URL). As stated above, the spider did not continue crawling after the seed URL.
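To rule out a mistake in how the rows are being counted in the HBase shell, the same check could be repeated over Thrift from Python. This is only a rough sketch under several assumptions: it uses the third-party happybase package, the host name is a placeholder, and the table names "crawler:metadata", "crawler:queue" and "crawler:states" are guesses that should be replaced with whatever `list` prints in the HBase shell.

```python
def count_rows(connection, table_names):
    """Return {table_name: row_count} by scanning each table once.

    `connection` is expected to behave like happybase.Connection:
    connection.table(name).scan() must yield one item per row.
    """
    counts = {}
    for name in table_names:
        table = connection.table(name)
        # A full scan is fine for small or empty tables; it is slow on big ones.
        counts[name] = sum(1 for _ in table.scan())
    return counts


def check_live_cluster(host, table_names, port=9090):
    """Connect to the HBase Thrift server and print per-table row counts."""
    import happybase  # third-party package, assumed to be installed

    connection = happybase.Connection(host, port=port)
    try:
        counts = count_rows(connection, table_names)
    finally:
        connection.close()
    for name, rows in sorted(counts.items()):
        print("{}: {} rows".format(name, rows))
    return counts


# Example call (placeholder host, assumed table names):
# check_live_cluster(
#     "hbase-thrift.example.internal",
#     ["crawler:metadata", "crawler:queue", "crawler:states"],
# )
```

If this also reports zero rows everywhere, the shell count is probably correct and nothing is being written to the tables.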
In addition, something weird might be going on with Kafka.
The topic frontier-todo has the domain name information in it -- but it does not contain the additional URLs from the site that the BasicCrawlingStrategy should have extracted. Somewhere, somehow ... either the crawling strategy is not working (it is not extracting links because of something I am doing wrong), or it is working but fails to put the additional URLs into frontier-todo (perhaps because of a Kafka problem?).
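One test that might narrow this down is to count how many messages each topic actually holds, reading from the earliest offset. The sketch below assumes the third-party kafka-python package and a placeholder broker address. Only frontier-todo is a topic name taken from this setup; "frontier-done" and "frontier-score" are assumed names and should be swapped for whatever the settings file defines. It does not try to decode the payloads, since the message encoding depends on the configured codec; it only counts messages and keeps a short raw preview.

```python
def summarize_messages(messages, preview_bytes=80, max_previews=5):
    """Count messages per topic and keep raw previews of the first few.

    `messages` is any iterable of records with `.topic` and `.value`
    attributes, such as a kafka-python KafkaConsumer.
    """
    counts = {}
    previews = []
    for msg in messages:
        counts[msg.topic] = counts.get(msg.topic, 0) + 1
        if len(previews) < max_previews:
            previews.append((msg.topic, msg.value[:preview_bytes]))
    return counts, previews


def read_topics(bootstrap_servers, topics, timeout_ms=5000):
    """Read the given topics from the beginning and summarize them."""
    from kafka import KafkaConsumer  # kafka-python, assumed to be installed

    consumer = KafkaConsumer(
        *topics,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",    # start from the oldest message
        enable_auto_commit=False,        # do not store any offsets
        group_id=None,                   # stay out of the workers' consumer groups
        consumer_timeout_ms=timeout_ms,  # stop iterating once the topic goes quiet
    )
    try:
        return summarize_messages(consumer)
    finally:
        consumer.close()


# Example call (placeholder broker, partly assumed topic names):
# counts, previews = read_topics(
#     "kafka.example.internal:9092",
#     ["frontier-todo", "frontier-done", "frontier-score"],
# )
# print(counts)
```

If the topic the spiders write their results to stays empty, the problem would be between the spider and Kafka; if it fills up while frontier-todo never grows beyond the seed, the strategy worker side looks like the place to dig.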
Given that I am not getting errors in any of the consoles for the DBW, SW, spider, etc., nor in the HBase or Thrift log files ... I am truly not sure where to go from here. Tips or direction (or even tests to run/try) would be much appreciated!
Kind regards,
-James