Making a web crawler in OTP


Benjamin Tan

Dec 30, 2013, 12:14:25 AM
to elixir-l...@googlegroups.com
Hi All! 

Been trying to wrap my head around OTP by making myself a web-crawler. 


The layout is pretty simple:

Workers are :temporary, and are essentially the ones doing the crawling. The WorkerSupervisor uses a :simple_one_for_one strategy.
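
Roughly, the supervisor part looks something like this (a simplified sketch; module names as in my project):

defmodule Exrachnid.WorkerSupervisor do
  use Supervisor.Behaviour

  def start_link do
    :supervisor.start_link({ :local, __MODULE__ }, __MODULE__, [])
  end

  def init([]) do
    # a new :temporary worker is started per url;
    # :temporary workers are not restarted when they exit
    children = [
      worker(Exrachnid.Worker, [], restart: :temporary)
    ]
    supervise(children, strategy: :simple_one_for_one)
  end
end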

I'm using HTTPotion to retrieve the pages, and that's the only external dependency I have so far.

DbServer currently just helps keep track of fetched urls and newly discovered urls.
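
Something along these lines (a simplified sketch; the real module also has to remember which urls were already fetched):

defmodule Exrachnid.DbServer do
  use GenServer.Behaviour

  def start_link do
    :gen_server.start_link({ :local, __MODULE__ }, __MODULE__, [], [])
  end

  # takes a list of urls and returns only the ones we haven't seen before
  def add_new_urls(urls) do
    :gen_server.call(__MODULE__, { :add_new_urls, urls })
  end

  def init(_) do
    { :ok, [] }
  end

  def handle_call({ :add_new_urls, urls }, _from, seen) do
    new_urls = Enum.reject(urls, fn(url) -> Enum.member?(seen, url) end)
    { :reply, new_urls, new_urls ++ seen }
  end
end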

The main entry point of the program is Exrachnid.crawl/1.

When this executes, I get errors that look like: 

=ERROR REPORT==== 30-Dec-2013::13:04:20 ===
** Generic server <0.94.0> terminating
** Last message in was {tcp_closed,#Port<0.4182>}
** When Server state == {state,"www.zalora.sg",80,5000,#Ref<0.0.0.586>,false,
                               undefined,[],false,#Port<0.4182>,false,[],
                               {[],[]},
                               undefined,idle,undefined,<<>>,0,0,[],undefined,
                               undefined,true,undefined,false,undefined,
                               undefined,<<>>,0,false,147471,1,undefined}
** Reason for termination ==
** connection_closed

=ERROR REPORT==== 30-Dec-2013::13:04:21 ===
** Generic server <0.218.0> terminating
** Last message in was {'$gen_cast',
                           {crawl,
                               <<"www.zalora.sg/men/sports/sports-skate/">>}}
** When Server state == []
** Reason for termination ==
** {{'Elixir.HTTPotion.HTTPError','__exception__',<<"retry_later">>},
    [{'Elixir.HTTPotion',request,5,
         [{file,
              "/Users/rambo/code/elixir-code/exrachnid/deps/httpotion/lib/httpotion.ex"},
          {line,134}]},
     {'Elixir.Exrachnid.Worker',handle_cast,2,
         [{file,
              "/Users/rambo/code/elixir-code/exrachnid/lib/exrachnid/worker.ex"},
          {line,38}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

=ERROR REPORT==== 30-Dec-2013::13:04:21 ===
** Generic server <0.221.0> terminating
** Last message in was {'$gen_cast',
                           {crawl,
                               <<"www.zalora.sg/men/sports/sports-others/">>}}


The offending piece of code apparently comes from the Worker:

  def handle_cast({ :crawl, url }, _state) do
    case HTTPotion.get(url, @user_agent, []) do
      Response[body: body, status_code: status, headers: _headers] when status in 200..299 ->
        Exrachnid.add_fetched_url(url)

        host = URI.parse(url).host

        # Add extracted links
        body
        |> extract_links(host)
        |> Exrachnid.add_new_urls

      _ ->
        # TODO: Do nothing yet.
        nil
    end
    { :stop, :normal, [] }
  end

Not really sure what is going on here. I have other questions:

Is handle_cast alright to use as above (as opposed to handle_call)? In this case, once crawling is done, I want the worker to terminate, hence the { :stop, :normal, [] }.

In order to keep on crawling, I'm doing something like: 

defmodule Exrachnid do
  use Application.Behaviour
  # other code omitted.

  def crawl(url) do
    Exrachnid.Worker.crawl(url)
  end

  def add_new_urls(urls) do
    urls
    |> Exrachnid.DbServer.add_new_urls
    |> Enum.each(fn(url) -> crawl(url) end)
  end
end

What this does:

a) urls get added to the DbServer, which returns only the urls it hasn't seen before.
b) each of those new urls is then passed to the crawl function.
c) the crawl function starts a child worker and attaches it to the supervision tree.
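
Exrachnid.Worker.crawl/1 itself does roughly this (a simplified sketch):

defmodule Exrachnid.Worker do
  use GenServer.Behaviour

  def start_link do
    :gen_server.start_link(__MODULE__, [], [])
  end

  def crawl(url) do
    # start a fresh :temporary worker under the :simple_one_for_one
    # supervisor and hand it the url asynchronously
    { :ok, pid } = :supervisor.start_child(Exrachnid.WorkerSupervisor, [])
    :gen_server.cast(pid, { :crawl, url })
  end

  def init(_) do
    { :ok, [] }
  end

  # handle_cast({ :crawl, url }, _state) is the clause shown earlier
end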

Is there anything wrong with this approach? 

Would appreciate any input. Thanks! 

Ben

José Valim

Dec 30, 2013, 2:51:09 AM
to elixir-l...@googlegroups.com

** Reason for termination ==
** {{'Elixir.HTTPotion.HTTPError','__exception__',<<"retry_later">>},
    [{'Elixir.HTTPotion',request,5,
         [{file,
              "/Users/rambo/code/elixir-code/exrachnid/deps/httpotion/lib/httpotion.ex"},
          {line,134}]},
     {'Elixir.Exrachnid.Worker',handle_cast,2,
         [{file,
              "/Users/rambo/code/elixir-code/exrachnid/lib/exrachnid/worker.ex"},
          {line,38}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}


It is on our TODO list to improve the server error messages, but this is basically saying that you got an HTTPotion.HTTPError with the message "retry_later". I would have to look into HTTPotion to know what this is really about.
 
Is handle_cast alright to use as in above (as opposed to handle_call)? In this case, once crawling is done, I want the worker to terminate, hence the { :stop, :normal, [] }.

Yes. If all you want is to start the worker, send it a message and then kill it, cast is fine. :)  The supervisor strategy is also correct and it should restart the worker in case something goes wrong.


Benjamin Tan

Dec 30, 2013, 4:15:48 AM
to elixir-l...@googlegroups.com, jose....@plataformatec.com.br
Thanks for the reply José.

I am wondering if it is reasonable to immediately create a worker for each url that comes in. 

Somehow in my mind it seems like this would overload the VM, choke the network, etc. Yet from what I've read, this seems like a normal thing to do in Erlang, and it does seem to fit the 'let it crash' philosophy.

Any thoughts?

José Valim

Dec 30, 2013, 4:20:20 AM
to elixir-l...@googlegroups.com
Well, you still want to dimension the software around your resources. If you have a database, you don't want to open 100 connections and let it crash; use a reasonable amount instead. If the retry_later is due to a lack of resources, then using some kind of pool would be more appropriate (you can take a look at poolboy). 'Let it crash' will still have your back in case you estimate it wrong or the resources are unavailable for another reason.



José Valim
Skype: jv.ptec
Founder and Lead Developer




Benjamin Tan

Dec 30, 2013, 4:25:41 AM
to elixir-l...@googlegroups.com
Ah. That makes sense. Looks like I have to rethink how I feed new urls to the crawlers then.

Thank you! 



Saša Jurić

Dec 30, 2013, 6:13:00 AM
to elixir-l...@googlegroups.com
Hi Benjamin,

My guess is that you're making too many requests to a single server. This is just a guess, based on looking at the stack trace and the source code on GitHub. I've never used HTTPotion or the underlying ibrowse library, and I don't have a running environment here.

However, I'd like to use this opportunity to describe my approach to analyzing error output. I see that people occasionally complain about Erlang error output. While it is far from ideal, one can still draw some conclusions from it. In fact, the output you got is not as creepy as it can get in production.

When some kind of monstrous output appears in the log, I try to find the first error that occurred. Usually, resolving this will take care of the rest. In OTP-based applications there will be multiple errors when some server crashes. What we need to find is the first one that prints a stack trace, because it is usually the source of all the trouble.

In your case, it's the second error report. There, I can see the stack trace. The first thing I look for is the topmost line in my own code that caused the exception. In your case, that is Exrachnid.Worker.handle_cast/2 (worker.ex, line 38).
This tells me that the HTTP request failed, but I don't know why.

The error ({'Elixir.HTTPotion.HTTPError','__exception__',<<"retry_later">>}) is semi-descriptive. From the look of it, it is generated in HTTPotion. Trying to look further into the stack trace doesn't reveal anything, as I get stuck in the use directive. I'm a bit surprised, and I consider this to be an Elixir problem. The stack trace shouldn't be mangled like this, because now I don't know where the exception was generated. I can report this, but it won't solve my problem, so let's see what else we can do.

The term "retry_later" is obviously a custom-made term, so somebody had to create it. Searching in HTTPotion doesn't find it, so it has to come from the underlying ibrowse library. Indeed, looking at its code, I am able to find the place where it is generated. The code seems a bit cryptic, but I can tell that it depends on a "Max_Pipe" limit. Poking a bit further, I find the code that handles pipelining. It's not very well documented, but my guess is that requests to the same host/port are pipelined. I assume that once the pipeline size exceeds the maximum, further requests are rejected. Finally, with a bit more searching, I see that the default pipeline size is 10.

So my assumption is that you are making too many concurrent requests to the same server/port. I don't have a running environment, so this is just a guess. You could easily prove or disprove it by printing which requests you make and when they finish.
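
If my reading of ibrowse is right, it also seems to expose those limits per host/port, so another (untested) check would be to raise them and see whether the retry_later errors disappear:

# untested guess, only to confirm the hypothesis: raise ibrowse's
# per-host limits (both appear to default to 10)
:ibrowse.set_max_sessions('www.zalora.sg', 80, 50)
:ibrowse.set_max_pipeline_size('www.zalora.sg', 80, 50)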

In any case, I tried to demonstrate that the stack trace can still be used to find something out. Sure, it's horrible, but it's not useless. My hypothesis may be wrong, but at least it gives me some kind of direction as to where the source of the problem might be.

As a final tip, if you're developing an OTP application, you might want to consider using lager (or its Elixir wrapper). It cleans up the error output, which might make it easier to analyze errors. I haven't used it in production, so I can't really comment on it, but I've seen many people recommend it.

Benjamin Tan

Dec 30, 2013, 10:44:36 AM
to elixir-l...@googlegroups.com
Thanks for the detailed writeup Sasa! 

Exlager makes the output slightly more palatable. 

Indeed, I have traced it to use HTTPotion.Base, but didn't go any further. I guess it's pretty obvious by now that spawning an uncontrolled number of workers to download pages is the wrong way to go.

I experimented briefly with poolboy but haven't had much luck yet.

I'll probably try a message queue next. Need to find a way to limit the number of workers.

Has anyone implemented something similar? Would love to have a chat with you! Or if someone can play with the program and nudge me in the right direction that would be even better! 

Sasa Juric

Dec 30, 2013, 11:19:32 AM
to elixir-l...@googlegroups.com
I think poolboy should serve your needs.
There was an article a couple of months ago about how to use it in Elixir: http://akash.im/2013/10/03/managing-processes-with-poolboy-in-elixir.html
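
The basic shape is something like this (just a sketch; module names and the pool size are only for illustration):

defmodule CrawlerPool do
  @pool_name :crawler_pool

  def start do
    pool_args = [
      name: { :local, @pool_name },
      worker_module: CrawlerPool.Worker,
      size: 5,           # at most 5 workers checked out at any time
      max_overflow: 0
    ]
    :poolboy.start_link(pool_args, [])
  end

  # blocks until a worker is free (subject to poolboy's checkout timeout),
  # so the pool size caps how many pages are fetched at once
  def crawl(url) do
    :poolboy.transaction(@pool_name, fn(worker) ->
      :gen_server.call(worker, { :crawl, url }, :infinity)
    end)
  end
end

defmodule CrawlerPool.Worker do
  use GenServer.Behaviour

  def start_link(_args) do
    :gen_server.start_link(__MODULE__, [], [])
  end

  def init(_) do
    { :ok, [] }
  end

  def handle_call({ :crawl, url }, _from, state) do
    # do the actual fetch here, e.g. with HTTPotion.get(url)
    { :reply, { :ok, url }, state }
  end
end

The article walks through the details; the important part is that callers wait for a free worker instead of spawning an unbounded number of processes.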


Benjamin Tan

Dec 30, 2013, 11:33:41 AM
to elixir-l...@googlegroups.com
Yeah, that was the one I was reading. I will try again and report back. Thanks! 

Sasa Juric

Dec 30, 2013, 12:21:29 PM
to elixir-l...@googlegroups.com
Here's a basic demo:


It should be a part of a mix project that lists poolboy as a dependency.
Locally, I start it with 
mix run -e PoolboyDemo.run --no-halt

Now, this is far from finished. It's just a simple sketch to get you going. 
The thing I didn't worry about is returning the result from the worker. When a worker finishes, it must somehow notify some "main controller", or whatever you call it, which will decide what to do with the result (presumably, you want to crawl recursively).
There is also a possible memory leak, since the queue of each worker may grow indefinitely. I don't know anything about crawling, but I presume you have to somehow control the input urls, because the list of pending urls will constantly grow.

Disclaimer: I'm no expert in poolboy. This example works, but I don't know if I'm misusing something.