Re: [storm-user] bolt parallelism while implementing request throttling in a crawler


Nathan Marz

Mar 23, 2013, 8:21:02 PM
to storm...@googlegroups.com
It would help to see a screenshot of the Storm UI for that topology when it's running.

On Thu, Mar 21, 2013 at 10:07 AM, Filip de Waard <f...@vix.io> wrote:
Hello,

I'm writing a distributed crawler in Clojure using Storm, which will be released under the Apache License. I have a very simple prototype of the Storm-related functionality that you can find at https://gist.github.com/fmw/7ba875e5e9a7e0172561 . I'm throttling requests in the request-page bolt using a Thread/sleep call. This bolt is supposed to have a field grouping based on the "host" value of the output tuples of the seed-uri-sprout and the extract-links bolt. I understand this to mean that different hosts should run in parallel (which is my intention), while each individual request-page thread has a built-in delay to throttle requests to its host.

My problem is that all requests have the delay and seem to run in a single thread, regardless of their "host" value, despite the fact that I have set the :p value for the request-page bolt to 10, while there are only 2 seed URIs. Here is a sample request log: https://gist.github.com/fmw/7ba875e5e9a7e0172561 (the test data is a collection of dummy HTML documents that link to a.html through z.html on every page, plus ten numbered pages for each letter, e.g. 0.html to 9.html for a.html). How do I parallelize requests to the different hosts, while still throttling requests on a per-host basis? I'd also love to hear any general suggestions about implementing a crawler using Storm (I have quite a bit of experience with crawling and search, but close to zero experience with Storm).
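For concreteness, the wiring described above would look roughly like this in Storm's Clojure DSL. This is only a sketch: the component names follow this message, the namespace and bolt bodies are illustrative stubs, and the actual gist may differ.

(ns crawler.sketch
  (:use [backtype.storm clojure]))

;; Bolt bodies are stubs; only the grouping declarations matter here.
(defbolt request-page ["host" "uri" "body"] [tuple collector]
  ;; fetch (.getStringByField tuple "uri") here, with per-host throttling
  (ack! collector tuple))

(defbolt extract-links ["host" "uri"] [tuple collector]
  ;; parse the fetched body and emit [host uri] pairs for newly found links
  (ack! collector tuple))

(defn crawler-topology [seed-uri-sprout]
  (topology
   {"seed-uri-sprout" (spout-spec seed-uri-sprout)}
   ;; request-page reads from both the spout and extract-links,
   ;; fields-grouped on "host" so a given host always goes to the same task
   {"request-page"  (bolt-spec {"seed-uri-sprout" ["host"]
                                "extract-links"   ["host"]}
                               request-page
                               :p 10)
    "extract-links" (bolt-spec {"request-page" :shuffle}
                               extract-links)}))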

I'd love to hear from you!

Cheers,

Filip de Waard

--
Twitter: @nathanmarz
http://nathanmarz.com

Filip de Waard

Mar 25, 2013, 1:21:02 PM
to storm...@googlegroups.com, nat...@nathanmarz.com
Hello, Nathan,

Thanks for your response!


On Sunday, March 24, 2013 1:21:02 AM UTC+1, Nathan Marz wrote:
It would help to see a screenshot of the Storm UI for that topology when it's running.

I've uploaded the HTML versions here:


I was running on a LocalCluster earlier, but now I have created a remote cluster on EC2 using storm-deploy. I have updated the code a little (https://gist.github.com/fmw/aa8a40a70094736c2197) to facilitate running on the remote cluster, but the result is still the same. Only one of the worker nodes has the /tmp/crawl.log file (see https://gist.github.com/fmw/80ddf188db9f2b5be4ef).

-Filip

Michael Rose

Mar 25, 2013, 1:48:22 PM
to storm...@googlegroups.com, Nathan Marz

It looks a lot to me like they're ending up at the same task. In a shuffle grouping, it'll be a modulus over the number of tasks, so it may just happen that they're being sent to the same task. Try more seed URIs.

-- Sent from mobile

Filip de Waard

Mar 25, 2013, 2:23:49 PM
to storm...@googlegroups.com, Nathan Marz
Hello, Michael,

Thanks for your response.


On Monday, March 25, 2013 6:48:22 PM UTC+1, Michael Rose wrote:

It looks a lot to me like they're ending up at the same task. In a shuffle grouping, it'll be a modulus over the number of tasks, so it may just happen that they're being sent to the same task. Try more seed URIs.


The request-page bolt isn't just reading off the spout, but also off the output from the extract-links bolt. This means there are more than two input values. I'm using a fields grouping (not a shuffle grouping), based on the host field emitted by the spout or the extract-links bolt (they emit tuples like ["abc.com" "http://abc.com/a.html"], with the first value being the host and the second the URI).
 
Maybe I should describe what I want to achieve. I want a separate thread for every host, with a built-in delay of e.g. 10 seconds between requests, and I want to run many of these threads in parallel. This way I can crawl pages on multiple domains at the same time, while still throttling the requests to each individual domain so as not to DDoS it. I know how to implement this in other ways (and have done so in the past), but I can't get it to work using Storm. I assume this is because I either made a mistake in the code I posted or misunderstood the mechanics of Storm (e.g. how a field grouping works).

-Filip

Filip de Waard

Mar 28, 2013, 9:14:24 AM
to storm...@googlegroups.com, Nathan Marz
Could someone please provide some input here?

Filip de Waard

Apr 2, 2013, 12:34:29 PM
to storm...@googlegroups.com, Nathan Marz
Hello,

I figured out the problem. I'm using the test values "abc.com" and "cba.com" for the "host" field. Apparently these values are considered equal by the field grouping, so they are sent to the same thread. Now I'm hashing the host field to spread the values out, which is an acceptable workaround for me, but I wonder if this is a bug in Storm? Otherwise it would be worth documenting. I'd love to help out by contributing a patch or a documentation update. Please let me know what you think.
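Roughly, the workaround amounts to emitting a hash of the host as the grouping value instead of the raw host string (an illustrative sketch, not the exact code from the gist):

;; Emit a hash of the host as the fields-grouping value instead of the raw
;; host string (illustrative only).
(defn host-grouping-key [host]
  (str (hash host)))

;; a bolt would then emit e.g. [(host-grouping-key "abc.com") uri]
;; and request-page would keep its fields grouping on that first value.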

Cheers,

Filip de Waard

Enno Shioji

Apr 3, 2013, 2:27:27 AM
to storm...@googlegroups.com
Hi, not sure if I'm helping here, but when you group by field, the only guarantee you get is that tuples with the same key will end up in the same bolt instance. So it's quite possible that "abc.com" and "cba.com" end up in the same bolt.
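As an illustration only (not Storm's actual internals): a fields grouping effectively hashes the grouping values and takes a modulus over the number of target tasks, so two different hosts can still map to the same task.

;; Illustration of hash-then-modulus task selection; not Storm's real code.
(defn pick-task [grouping-values num-tasks]
  (mod (hash grouping-values) num-tasks))

;; With only a couple of distinct hosts and a handful of tasks,
;; nothing prevents these two calls from returning the same task index:
(pick-task ["abc.com"] 10)
(pick-task ["cba.com"] 10)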

We have implemented a crawler in Storm and have it running in production (we did it in Java, though). We used asynchronous IO, but if I were to use synchronous IO, I wouldn't use Thread.sleep for rate limiting. If you have green threads this is fine, but if your threads are native threads, they are a very expensive resource and you wouldn't want to underutilize the limited number of them you can have. This code example might be of interest: https://github.com/overtone/at-at/blob/master/src/overtone/at_at.clj (it just uses ScheduledExecutorService under the hood). In essence, ScheduledExecutorService has a delay queue inside that orders tasks based on when they can be executed. This way URLs that don't have to wait can be fetched immediately, while URLs that have to wait are kept in the queue.
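A minimal sketch of that idea in Clojure, using ScheduledExecutorService directly; fetch-page here is an assumed function that does the actual HTTP request:

(import '[java.util.concurrent Executors ScheduledExecutorService TimeUnit])

;; Tasks that have to wait sit in the scheduler's internal delay queue
;; instead of blocking a thread in Thread/sleep.
(def ^ScheduledExecutorService scheduler (Executors/newScheduledThreadPool 4))

(defn schedule-fetch [fetch-page uri delay-ms]
  (.schedule scheduler
             ^Runnable (fn [] (fetch-page uri))
             (long delay-ms)
             TimeUnit/MILLISECONDS))

;; (schedule-fetch fetch-page "http://abc.com/a.html" 10000) fetches after 10s,
;; while (schedule-fetch fetch-page "http://cba.com/b.html" 0) runs right away.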

Enno Shioji

Apr 3, 2013, 4:59:37 AM
to storm...@googlegroups.com, an...@nahapetian.com
Hi Andy, yes we are planning to open source this (hopefully) soon. 

Re: design decisions, 
 - NIO vs OIO: we opted for NIO as we were mostly interested in text content (as opposed to large files) and most servers on the intertubes have high latency. We haven't benchmarked a synchronous IO version, so I can't give you numbers, though.

 - Field grouping on domain & state as plain hash maps: this was done for per-domain rate limiting and also to cache robots.txt and content. In our case we had an almost uniform distribution of domains, so the rate-limiting state could be distributed almost uniformly. This allowed us to keep the robots.txt and content caches in our bolts as plain hash maps. This kind of state doesn't need high reliability (if you lose the cache/state and make a redundant request, it's not a big deal). This was simple and performed well; there is a rough sketch of the idea after these points. If you have a very skewed distribution, this approach could be a problem.

 - Using Thread.sleep is really not viable; I've seen such code and it performs really badly. We used a delay queue and a thread pool that pulls off this queue inside the bolt, to keep URLs with crawl delays etc. from blocking the process. I've seen implementations that use a regular queue and just add the URL back to the queue if it's not time to fetch it yet, but this can cause busy waiting and wasted CPU in certain situations.
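To make the plain-hash-map state from the second point concrete, here is a rough Clojure sketch (the real crawler is in Java; fetch-robots is an assumed helper):

;; Because the fields grouping on domain sends every tuple for a given domain
;; to the same bolt task, an unreplicated local cache per task is enough.
;; Losing it just means a redundant robots.txt fetch.
(def robots-cache (atom {}))

(defn robots-txt [fetch-robots domain]
  (or (get @robots-cache domain)
      (let [robots (fetch-robots domain)]
        (swap! robots-cache assoc domain robots)
        robots)))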

What else... we use S3 to store the fetched content. This was a bit tricky because you want to "stream" into S3 if you want to use a decently large file size (which is a good idea, because small files are a pain to use later with EMR etc.) and still be memory efficient. This can be done using the multipart upload feature of S3, but it's a bit tricky to use. This code will be included in the crawler code we are planning to open source.


Regards,
Enno






On Wed, Apr 3, 2013 at 7:39 AM, anahap <> wrote:
Hi Enno, is the crawler you wrote, or parts of it, open source?
Would you care to share your design decisions and the advantages/disadvantages of the choices you made (for example async IO, field grouping on hostname)?

Andy Nahapetian

Apr 3, 2013, 5:34:25 AM
to storm...@googlegroups.com
Hi Enno, thanks for your elaborate reply!
Does this mean you have a crawler bolt with one executor on each worker, and this bolt maintains a queue that asynchronously downloads content?
How do you make sure you don't make more than x parallel requests to the same domain?


TIA
Andy


Enno Shioji

Apr 3, 2013, 4:24:23 PM
to storm...@googlegroups.com
Hi Andy,

> Does this mean you have a crawler bolt with one executor on each worker, and this bolt maintains a queue that asynchronously downloads content?
Yep, that's pretty much it.

> how do you make sure you don't make more than x parallel requests to the same domain?
It boils down to a hash map (to be precise, a Guava Cache) per bolt instance which remembers when each domain was last contacted. When a URL is pulled off the queue, this cache is queried to decide whether it's OK to fire the request. If it is, the request is fired and the cache is updated. If not, the URL is put back on the queue with some delay to prevent busy waiting.
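A rough Clojure sketch of that check (the real version is Java with a Guava Cache; fire-request and requeue-with-delay are assumed callbacks):

;; Remember when each host was last contacted; an atom-backed map stands in
;; for the Guava Cache here.
(def last-contact (atom {}))

(defn maybe-fetch [host uri min-interval-ms fire-request requeue-with-delay]
  (let [now    (System/currentTimeMillis)
        waited (- now (get @last-contact host 0))]
    (if (>= waited min-interval-ms)
      (do (swap! last-contact assoc host now)
          (fire-request uri))
      ;; not yet time: put the URL back with a delay instead of busy-waiting
      (requeue-with-delay uri (- min-interval-ms waited)))))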

Andy Nahapetian

Apr 4, 2013, 1:54:58 AM
to storm...@googlegroups.com
Hi Enno 

Thanks a lot and have a nice day!


Kind Regards,
Andy