Hello,

I'm writing a distributed crawler in Clojure using Storm, which will be released under the Apache License. I have a very simple prototype of the Storm-related functionality that you can find at https://gist.github.com/fmw/7ba875e5e9a7e0172561 . I'm throttling requests in the request-page bolt using a Thread/sleep call. This bolt is supposed to have a fields grouping based on the "host" value of the output tuple of the seed-uri-sprout and the extract-links bolt. I understand this to mean that different hosts should run in parallel (which is my intention), while each individual request-page thread has a built-in delay to throttle requests to its host.

My problem is that all requests have the delay and seem to run in a single thread, regardless of their "host" value. This is despite the fact that I have set the :p value for the request-page bolt to 10, while there are only 2 seed URIs. Here is a sample request log: https://gist.github.com/fmw/7ba875e5e9a7e0172561 (the test data is a collection of dummy HTML documents: every page links to a.html through z.html, and each letter has ten numbered pages, e.g. 0.html to 9.html for a.html).

How do I parallelize requests to the different hosts while still throttling requests on a per-host basis? I'd also love to hear any general suggestions about implementing a crawler using Storm (I have quite a bit of experience with crawling and search, but close to zero experience with Storm).

I'd love to hear from you!

Cheers,
Filip de Waard
--
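As a side note on the Thread/sleep approach: a sketch of one alternative, not taken from the original gist, is to track the last request time per host and only sleep for the remaining interval, so a bolt task handling several hosts isn't blocked by a fixed delay on every tuple. The names `wait-ms`, `throttled-request!`, and `fetch-fn` are illustrative, not from the prototype.

```clojure
;; Hypothetical helper, not from the original gist: compute how long a
;; request to a host must wait to respect a per-host minimum interval.
;; Keeping this a pure function makes the throttling decision testable.
(defn wait-ms
  "Given the last request time for a host (ms), the current time (ms),
  and the minimum interval between requests (ms), return how many ms
  the next request should be delayed. 0 means it may go immediately."
  [last-request-ms now-ms interval-ms]
  (max 0 (- (+ last-request-ms interval-ms) now-ms)))

;; Shared state inside one bolt task: map of host -> last request time.
(def last-request-times (atom {}))

(defn throttled-request!
  "Sleep only as long as needed for this particular host, record the
  request time, then call `fetch-fn` (a stand-in for the HTTP fetch)."
  [host interval-ms fetch-fn]
  (let [now     (System/currentTimeMillis)
        last    (get @last-request-times host 0)
        delay-ms (wait-ms last now interval-ms)]
    (when (pos? delay-ms)
      (Thread/sleep delay-ms))
    (swap! last-request-times assoc host (System/currentTimeMillis))
    (fetch-fn host)))
```

With this, requests to host A never delay requests to host B even when both land on the same task, while each host still sees at most one request per interval.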
You received this message because you are subscribed to the Google Groups "storm-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to storm-user+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
It would help to see a screenshot of the Storm UI for that topology when it's running.
It looks a lot to me like they're ending up at the same task. In a fields grouping, the task is picked by hashing the grouped field modulo the number of tasks, so with only two hosts it may just so happen they're both being sent to the same task. Try more seed URIs.
-- Sent from mobile
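To illustrate the point above: this is a minimal sketch of hash-modulo routing, not Storm's actual internals, but it shows why two distinct "host" values can deterministically land on the same task even with :p set to 10. The names `pick-task` and `task-assignment` are hypothetical.

```clojure
;; Illustrative sketch (not Storm's actual implementation): a fields
;; grouping routes each tuple by hashing the grouped field and taking
;; the result modulo the number of tasks. With only a couple of
;; distinct keys, both can easily map to the same task.
(defn pick-task
  "Deterministically map a host string onto one of num-tasks tasks."
  [host num-tasks]
  (mod (Math/abs (long (hash host))) num-tasks))

(defn task-assignment
  "Return a map of host -> task id for a sequence of hosts."
  [hosts num-tasks]
  (into {} (map (fn [h] [h (pick-task h num-tasks)]) hosts)))
```

Because the mapping is a deterministic hash, adding more distinct hosts (more seed URIs) is what actually spreads the load across tasks, not raising the parallelism hint alone.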
Hi Enno, is the crawler you wrote, or parts of it, open source? Would you care to share your design decisions and the advantages/disadvantages of the choices you made (for example, async IO, or fields grouping on hostname)?