[erlang-questions] gen_tcp very slow to fetch data


zabrane Mikael

Nov 16, 2009, 11:17:32 AM
to erlang-q...@erlang.org
Hi List !

New to Erlang, I'm trying to implement a simple URL fetcher.
Here's my code (please, feel free to correct it if you find any bug or know
a better approach):

8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8----
-module(fetch).

-export([url/1]).

-define(TIMEOUT, 7000).
-define(TCP_OPTS, [binary, {packet, raw}, {nodelay, true},
                   {active, true}]).

url(Url) ->
    {ok, _Tag, Host, Port} = split_url(Url),

    Hdrs = [],
    Request = ["GET ", Url, " HTTP/1.1\r\n", Hdrs, "\r\n\r\n"],

    case catch gen_tcp:connect(Host, Port, ?TCP_OPTS) of
        {'EXIT', Why} ->
            {error, {socket_exit, Why}};
        {error, Why} ->
            {error, {socket_error, Why}};
        {ok, Socket} ->
            gen_tcp:send(Socket, list_to_binary(Request)),
            recv(Socket, list_to_binary([]))
    end.

recv(Socket, Bin) ->
    receive
        {tcp, Socket, B} ->
            io:format(".", []),
            recv(Socket, <<Bin/binary, B/binary>>);
        {tcp_closed, Socket} ->
            {ok, Bin};
        Other ->
            {error, {socket, Other}}
    after
        ?TIMEOUT ->
            {error, {socket, timeout}}
    end.


split_url([$h,$t,$t,$p,$:,$/,$/|T]) -> split_url(http, T);
split_url(_X) -> {error, split_url}.

split_url(Tag, X) ->
    case string:chr(X, $:) of
        0 ->
            Port = 80,
            case string:chr(X, $/) of
                0 ->
                    {ok, Tag, X, Port};
                N ->
                    Site = string:substr(X, 1, N-1),
                    {ok, Tag, Site, Port}
            end;
        N1 ->
            case string:chr(X, $/) of
                0 ->
                    error;
                N2 ->
                    PortStr = string:substr(X, N1+1, N2-N1-1),
                    case catch list_to_integer(PortStr) of
                        {'EXIT', _} ->
                            {error, port_number};
                        Port ->
                            Site = string:substr(X, 1, N1-1),
                            {ok, Tag, Site, Port}
                    end
            end
    end.

8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8------

When testing it, the receiving socket gets very very slow:
$ erl
1> c(fetch).
2> Bin = fetch:url("http://www.google.com").
......{error,{socket,timeout}}

Am I missing something?
What I'd like to end up with is a very fast fetcher. Any hints?

Regards
Zabrane

Chandru

Nov 16, 2009, 12:53:00 PM
to zabrane Mikael, erlang-q...@erlang.org
You are expecting the server to indicate end of response by closing the
connection, but because you specify HTTP/1.1 in the request, the server is
holding up your connection, and you are timing out. Try replacing HTTP/1.1
with HTTP/1.0 in your request, or parse the response to detect end of
response.
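
A minimal sketch of the first suggestion, assuming the rest of fetch.erl stays as posted. The module name fetch_req and the function request/2 are hypothetical, and the Host header is an addition that is not in the original code (many servers expect it). With HTTP/1.0 the server normally closes the connection after the response, so the {tcp_closed, Socket} clause in recv/2 marks the end of the data.

```erlang
-module(fetch_req).
-export([request/2]).

%% Build a GET request as an iolist (hypothetical helper, not part of
%% the original fetch.erl).
%% Host: host name as a string, e.g. "www.google.com"
%% Path: request path as a string, e.g. "/"
request(Host, Path) ->
    ["GET ", Path, " HTTP/1.0\r\n",
     "Host: ", Host, "\r\n",
     "\r\n"].
```

gen_tcp:send/2 accepts an iolist, so the result can be sent directly without list_to_binary/1.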

cheers
Chandru


ERLANG

Nov 16, 2009, 12:59:39 PM
to Chandru, zabrane Mikael, erlang-q...@erlang.org
Hi Chandru !

That fixed my problem. Thanks.
While googling a bit, I found two ways to read from the Socket:

recv(Socket, Bin) ->
    receive
        {tcp, Socket, B} ->
            io:format(".", []),
            recv(Socket, <<Bin/binary, B/binary>>);
        {tcp_closed, Socket} ->
            {ok, Bin};
        Other ->
            {error, {socket, Other}}
    after
        ?TIMEOUT ->
            {error, {socket, timeout}}
    end.

% version 2 with "gen_tcp:recv" (the socket must be opened with {active, false})
recv2(Socket, Bin) ->
    case gen_tcp:recv(Socket, 0, ?TIMEOUT) of
        {ok, B} ->
            io:format(".", []),
            recv2(Socket, <<Bin/binary, B/binary>>);
        {error, closed} ->
            {ok, Bin};
        {error, timeout} ->
            {error, {socket, timeout}};
        Other ->
            {error, {socket, Other}}
    end.


Which one is best in my case (see fetch.erl above)?

Regards
Zabrane



________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Ngoc Dao

Nov 16, 2009, 7:51:23 PM
to ERLANG, Chandru, zabrane Mikael, erlang-q...@erlang.org
From inet's doc:
http://www1.erlang.org/documentation/doc-4.9.1/lib/kernel-2.4.1/doc/html/inet.html

If the active option is true, which is the default, everything
received from the socket will be sent as messages to the receiving
process. If the active option is set to false (passive mode), the
process must explicitly receive incoming data by calling
gen_tcp:recv/N or gen_udp:recv/N (depending on the type of socket).
Note: Passive mode provides flow control; the other side will not be
able to send faster than the receiver can read. Active mode provides no
flow control; a fast sender could easily overflow the receiver with
incoming messages. Use active mode only if your high-level protocol
provides its own flow control (for instance, acknowledging received
messages) or the amount of data exchanged is small.

Joe Armstrong

Nov 17, 2009, 3:52:33 AM
to Chandru, zabrane Mikael, erlang-q...@erlang.org
On Mon, Nov 16, 2009 at 6:53 PM, Chandru
<chandrashekha...@gmail.com> wrote:
> You are expecting the server to indicate end of response by closing the
> connection, but because you specify HTTP/1.1 in the request, the server is
> holding up your connection, and you are timing out. Try replacing HTTP/1.1
> with HTTP/1.0 in your request, or parse the response to detect end of
> response.

This will get you into murkier waters - you'll have to check if a
content length is defined and then read exactly this number of bytes.

I wrote a tutorial about this a while back:

http://www.sics.se/~joe/tutorials/web_server/web_server.html

You'll find a module called http_driver here that does the parsing and
collects the appropriate number of bytes.

/Joe


Tony Rogvall

Nov 17, 2009, 10:03:43 AM
to Ngoc Dao, ERLANG, Chandru, zabrane Mikael, erlang-q...@erlang.org
Do not forget about {active, once} mode.
With {active, once} the socket delivers one message (how much data it
holds depends on buffer size etc.) and then switches to passive mode. To
get the next message you call inet:setopts(Socket, [{active, once}]) to
activate it again. This mode enables a selective receive at the same time
as it enables flow control.
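
A rough sketch of such a loop, assuming the socket was opened with the binary option and {active, false}, and that the calling process owns the socket. The module name fetch_once and the function recv_all/2 are hypothetical names chosen for illustration.

```erlang
-module(fetch_once).
-export([recv_all/2]).

%% Collect data until the peer closes the socket, asking for one
%% message at a time so the sender cannot flood the mailbox.
recv_all(Socket, Timeout) ->
    recv_all(Socket, Timeout, <<>>).

recv_all(Socket, Timeout, Acc) ->
    %% Re-arm the socket: at most one {tcp, ...} message is delivered,
    %% then the socket falls back to passive mode.
    case inet:setopts(Socket, [{active, once}]) of
        ok ->
            receive
                {tcp, Socket, Data} ->
                    recv_all(Socket, Timeout, <<Acc/binary, Data/binary>>);
                {tcp_closed, Socket} ->
                    {ok, Acc};
                {tcp_error, Socket, Reason} ->
                    {error, Reason}
            after Timeout ->
                {error, timeout}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```

Like the recv/2 loop earlier in the thread, this only terminates cleanly when the server closes the connection, so it still relies on HTTP/1.0 or on parsing the response.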

/Tony

zabrane Mikael

Nov 20, 2009, 7:02:04 PM
to Tony Rogvall, Ngoc Dao, ERLANG, Chandru, erlang-q...@erlang.org
Hi List !

While trying to learn how to write a simple TCP web server in Erlang that
only dumps what it gets to stdout, I noticed that from time to time the HTTP
requests get truncated when reaching the server. My main socket loop looks
like:

-------------------------------------------
-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false},
                      {reuseaddr, true}]).
...
loop_recv(Socket) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, BinData} ->
            %% here, I'm assuming that all the HTTP request (Headers +
            %% Body) is in "BinData". Hope I'm right.
            io:format("BinData: ~p~n", []),
            ok;
        NotOK ->
            error_logger:info_report([{"gen_tcp:recv/2", NotOK}]),
            error
    end.
-------------------------------------------

For some requests, "io:format" prints truncated data (in BinData):

<<"
http://www.foo.tv/images/v30/LoaderV3.swf?loop=false&quality=high&request=357&HTTP/1.0\r\nHost:
www.foo.tv\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5;
fr; rv:1.9.0.2) Gecko/2008090512 Firefox/3.0.2\r\nAccept: text/h">>

As you can see, the request isn't complete: "... Accept: text/h".

Am I doing something wrong? How can I fix it, please?

Regards
Zabrane


Robert Virding

Nov 20, 2009, 7:10:07 PM
to zabrane Mikael, Tony Rogvall, Ngoc Dao, ERLANG, Chandru, erlang-q...@erlang.org
The call to io:format is wrong as it is missing an argument to print with
~p. So it will print the "BinData: " leader and then crash. I am guessing
you meant it to be:

io:format("BinData: ~p~n", [BinData]),

You should have got a warning from the compiler.

Robert


zabrane Mikael

Nov 20, 2009, 7:15:03 PM
to Robert Virding, Tony Rogvall, Ngoc Dao, ERLANG, Chandru, erlang-q...@erlang.org
Thanks Robert. I just made a quick copy/paste for the post and I forgot the
argument for "~p" ;-)
So my problem is still there if you can help !!


Colm Dougan

Nov 22, 2009, 6:03:58 PM
to zabrane Mikael, erlang-q...@erlang.org
On Sat, Nov 21, 2009 at 12:15 AM, zabrane Mikael <zabr...@gmail.com> wrote:
> Thanks Robert. I just made a quick copy/paste for the post and I forgot the
> argument for "~p" ;-)
> So my problem is still there if you can help !!

Your assumption that the initial gen_tcp:recv will give you a binary
which always includes the full request headers is not correct. If you
are going down that road you would need to write code to check for
the header terminator and, if it isn't present, buffer and read
more until you have the terminator. It isn't hard but it can be
messy.
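
A minimal sketch of that buffering approach, assuming a passive socket opened with the binary option and {packet, 0}. The module name hdr_buf and both function names are hypothetical, and there is no limit on header size here, which a real server would want.

```erlang
-module(hdr_buf).
-export([split_headers/1, read_headers/2]).

%% Look for the blank line ("\r\n\r\n") that ends the header block.
%% Returns {ok, Headers, Rest} or the atom 'more' if it is not there yet.
split_headers(Buffer) when is_binary(Buffer) ->
    case binary:match(Buffer, <<"\r\n\r\n">>) of
        {Pos, 4} ->
            <<Headers:Pos/binary, _:4/binary, Rest/binary>> = Buffer,
            {ok, Headers, Rest};
        nomatch ->
            more
    end.

%% Keep reading from the socket until the terminator shows up.
%% Buffer holds whatever has been received so far (start with <<>>).
read_headers(Socket, Buffer) ->
    case split_headers(Buffer) of
        {ok, Headers, Rest} ->
            {ok, Headers, Rest};
        more ->
            case gen_tcp:recv(Socket, 0) of
                {ok, Data} ->
                    read_headers(Socket, <<Buffer/binary, Data/binary>>);
                {error, Reason} ->
                    {error, Reason}
            end
    end.
```

Rest contains any bytes that arrived after the headers, for example the start of a POST body, so it has to be kept for the next parsing step.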

Most erlang http servers these days use the {packet, http} socket
option which is HTTP aware and which will read/parse headers one by
one for you and will also indicate the end of the headers.
mochiweb_http has a good example of how to use {packet, http} to read
request headers :

http://code.google.com/p/mochiweb/source/browse/trunk/src/mochiweb_http.erl
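
For illustration only, here is a small sketch of reading a request with {packet, http}. It is not mochiweb's code. It assumes a passive (accepted) socket whose listen socket was opened with {packet, http} and {active, false}; the module name http_hdrs and the function read_request/1 are hypothetical.

```erlang
-module(http_hdrs).
-export([read_request/1]).

%% Read the request line, then the headers one by one.
read_request(Socket) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, {http_request, Method, Uri, Version}} ->
            read_headers(Socket, {Method, Uri, Version}, []);
        {ok, Other} ->
            {error, {unexpected, Other}};
        {error, Reason} ->
            {error, Reason}
    end.

read_headers(Socket, Request, Acc) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, {http_header, _, Name, _, Value}} ->
            read_headers(Socket, Request, [{Name, Value} | Acc]);
        {ok, http_eoh} ->
            %% End of headers: switch to raw mode before reading a body.
            ok = inet:setopts(Socket, [{packet, raw}]),
            {ok, Request, lists:reverse(Acc)};
        {ok, {http_error, Line}} ->
            {error, {bad_header, Line}};
        {error, Reason} ->
            {error, Reason}
    end.
```

The body, if any, still has to be read separately once the socket is back in raw mode.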

Colm


zabrane Mikael

Nov 23, 2009, 6:58:04 AM
to Colm Dougan, erlang-q...@erlang.org
Hi Colm,

> Your assumption that the initial gen_tcp:recv will give you a binary
> which always includes the full request headers is not correct. If you
> are going down that road you would need to write code to check for
> the header terminator and if it isn't present then buffer and read
> more until you have the terminator.


Ok. Finding the headers terminator (the double CRLF) isn't difficult.
What's difficult for me is finding the end of the request.
How can I know that I've read all the data in my request (GET or POST)?


> It isn't hard but it can be messy.
>

Joe's code helped me a lot with his reentrant parser. Thanks by the way.

> Most erlang http servers these days use the {packet, http} socket
> option which is HTTP aware and which will read/parse headers one by
> one for you and will also indicate the end of the headers.
>

Lev Walkin introduced an example of his web server called yucan which uses
{packet, http_bin}:
http://lionet.livejournal.com/42016.html?thread=774432

Could someone explain the difference between {packet, http} and
{packet, http_bin}?


> mochiweb_http has a good example of how to use {packet, http} to read
> request headers :
>
>
> http://code.google.com/p/mochiweb/source/browse/trunk/src/mochiweb_http.erl


Thanks for the pointer

Regards
Zabrane

Chandru

Nov 23, 2009, 7:25:04 AM
to zabrane Mikael, Colm Dougan, erlang-q...@erlang.org
2009/11/23 zabrane Mikael <zabr...@gmail.com>

> Hi Colm,
>
> > Your assumption that the initial gen_tcp:recv will give you a binary
> > which always includes the full request headers is not correct. If you
> > are going down that road you would need to write code to check for
> > the header terminator and if it isn't present then buffer and read
> > more until you have the terminator.
>
>
> Ok. Finding the headers terminator (the double CRLF) isn't difficult.
> What's difficult for me is finding the end of the request.
> How can I know that I've read all the data in my request (GET or POST)?
>
>

There is no easy answer I'm afraid! You just have to read RFC2616 :)

cheers
Chandru

Colm Dougan

Nov 23, 2009, 7:27:04 AM
to zabrane Mikael, erlang-q...@erlang.org
On Mon, Nov 23, 2009 at 11:58 AM, zabrane Mikael <zabr...@gmail.com> wrote:
>>
>> Your assumption that the initial gen_tcp:recv will give you a binary
>> which always includes the full request headers is not correct. If you
>> are going down that road you would need to write code to check for
>> the header terminator and if it isn't present then buffer and read
>> more until you have the terminator.
>
> Ok. Finding the headers terminator (the double CRLF) isn't difficult.
> What's difficult for me is finding the end of the request.
> How can I know that I've read all the data in my request (GET or POST)?

Erlang doesn't provide any magic way to read the request body (in the
way it does for headers). You have to write that code yourself. You
can look at either the yaws or mochiweb sources for straightforward
Erlang implementations of this.

>> It isn't hard but it can be messy.
>
> Joe's code helped me a lot with his reentrant parser. Thanks by the way.
>>
>> Most erlang http servers these days use the {packet, http} socket
>> option which is HTTP aware and which will read/parse headers one by
>> one for you and will also indicate the end of the headers.
>
> Lev Walkin introduced an example of his web server called yucan which uses
> {packet, http_bin}:
> http://lionet.livejournal.com/42016.html?thread=774432
> Could someone explain the difference?

With the "http" option you get the header key and value returned to
you as strings, apart from well known header names which are returned
as atoms. With "http_bin" the same applies, but where there were
strings you get erlang binaries instead.
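
The difference can be seen without a socket. The sketch below uses erlang:decode_packet/3, which applies the same decoding to a binary, instead of a {packet, http} socket; that substitution is made only to keep the example small. The module name http_modes and the function parse/2 are hypothetical.

```erlang
-module(http_modes).
-export([parse/2]).

%% Parse the request line and the first header of Bin.
%% Mode is http (strings) or http_bin (binaries).
parse(Mode, Bin) ->
    {ok, {http_request, Method, Uri, _Vsn}, Rest} =
        erlang:decode_packet(Mode, Bin, []),
    {ok, {http_header, _, Name, _, Value}, _} =
        erlang:decode_packet(header_mode(Mode), Rest, []),
    {Method, Uri, Name, Value}.

%% With decode_packet/3 the header lines are decoded with the httph
%% variants; a socket makes this switch by itself after the first line.
header_mode(http) -> httph;
header_mode(http_bin) -> httph_bin.
```

For a request containing "Host: foo", the http mode returns the value as the string "foo" and the http_bin mode returns it as the binary <<"foo">>; the well-known header name is the atom 'Host' in both cases.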

Colm

Joe Armstrong

Nov 23, 2009, 10:39:18 AM
to zabrane Mikael, Tony Rogvall, Ngoc Dao, ERLANG, Chandru, erlang-q...@erlang.org
On Sat, Nov 21, 2009 at 1:02 AM, zabrane Mikael <zabr...@gmail.com> wrote:
> Hi List !
>
> While trying to learn how to write a simple TCP Web Server in Erlang which
> only dump what it gets to stdout, I realize that time to time, the HTTP
> requests get truncated when reaching the server.

Whenever I read something like that I think - "fragmentation".

If you write N bytes to a TCP socket, you will eventually be able to
read N bytes from the socket, but the bytes may or may not be delivered
"all in one go". Since you have said {packet, 0} you'll just get
whatever happened to be read. This is why you *must* write a
re-entrant parser.

First you collect data until you see "\r\n\r\n" - only then can you
parse the header. Then you check for a content length header. If you
find a content length header it will contain the content length (N).
Then you collect *exactly* N bytes following the "\r\n\r\n". Otherwise
you collect until the socket closes (there is also a chunked
alternative which I will ignore).

The code in http://www.sics.se/~joe/tutorials/web_server/http_driver.erl
does this.

If you don't do this your program will work sometimes - in the case
where the incoming packets were not fragmented - but it will fail
mysteriously if the packets are fragmented.
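
The content-length step can be sketched as follows. This is not the code from http_driver.erl; it is a simplified illustration that assumes the header block has already been split off at "\r\n\r\n", ignores chunked encoding, and needs OTP 20 or later for string:lowercase/1 and string:trim/1. The module name body_len and its functions are hypothetical.

```erlang
-module(body_len).
-export([body/2]).

%% Headers: the raw header block (everything before "\r\n\r\n").
%% Rest:    bytes already received after the terminator.
%% Returns {ok, Body, Leftover} when the whole body is present,
%% {more, BytesMissing} when more data must be read, or
%% until_close when there is no Content-Length header.
body(Headers, Rest) ->
    case content_length(Headers) of
        undefined ->
            until_close;
        N when byte_size(Rest) >= N ->
            <<Body:N/binary, Leftover/binary>> = Rest,
            {ok, Body, Leftover};
        N ->
            {more, N - byte_size(Rest)}
    end.

content_length(Headers) ->
    find_length(binary:split(Headers, <<"\r\n">>, [global])).

find_length([]) ->
    undefined;
find_length([Line | T]) ->
    case binary:split(Line, <<":">>) of
        [Name, Value] ->
            %% Header names are case-insensitive.
            case string:lowercase(Name) of
                <<"content-length">> ->
                    binary_to_integer(string:trim(Value));
                _ ->
                    find_length(T)
            end;
        _ ->
            find_length(T)
    end.
```

When body/2 returns {more, K}, the caller reads more data, appends it to Rest and calls body/2 again, which is what makes the parser re-entrant.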

Forgetting about fragmentation is the "first basic" mistake that
*everybody* makes when writing networking code.

This mistake often shows up when you deploy something. You test it
locally on localhost - it works. You test it live on the Internet - it
fails.

Why? Packets are rarely fragmented on localhost. The chance of
fragmentation on the Internet is very high - even if you have a good
connection.

Aside: this is why one of the tcp options is {packet, N} - if you
write a client AND a server in Erlang and BOTH use (say) {packet, 4}
then gen_tcp will silently reassemble fragmented packets behind the
scenes before delivering them to the application program.

This together with term_to_binary (and its inverse) and the bit syntax
will save you many sleepless nights.
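
A small loopback sketch of that combination, assuming both ends run in the same Erlang node for the sake of the demo. The module name packet4 and the function demo/0 are hypothetical.

```erlang
-module(packet4).
-export([demo/0]).

%% Send one Erlang term over a local TCP connection using {packet, 4}.
demo() ->
    Opts = [binary, {packet, 4}, {active, false}],
    {ok, Listen} = gen_tcp:listen(0, [{reuseaddr, true} | Opts]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port, Opts),
    {ok, Server} = gen_tcp:accept(Listen),
    Term = {hello, [1, 2, 3], <<"world">>},
    %% The sender prepends a 4-byte length header automatically.
    ok = gen_tcp:send(Client, term_to_binary(Term)),
    %% recv returns one whole message, however TCP split it up.
    {ok, Bin} = gen_tcp:recv(Server, 0, 5000),
    Received = binary_to_term(Bin),
    gen_tcp:close(Client),
    gen_tcp:close(Server),
    gen_tcp:close(Listen),
    Received.
```

This only helps when both ends agree on the framing, so it does not apply to HTTP clients; binary_to_term/1 should also not be used as-is on data from untrusted peers.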


> My main socket loop looks
> like:
>
> -------------------------------------------
> -define(TCP_OPTIONS,[binary, {packet, 0}, {active, false}, {reuseaddr,
> true}]).
> ...
> loop_recv(Socket)
>   case gen_tcp:recv(Socket, 0) of
>        {ok, BinData} ->
>             %% here, I'm assuming that all the HTTP request (Headers +
> Body) is in "BinData". Hope I'm right.

No No No - programs should not depend upon Hope. This assumption is wrong.

Hint - print out the packet lengths, so you can see how the data arrives in pieces.

/Joe

zabrane Mikael

Nov 23, 2009, 11:55:56 AM
to Joe Armstrong, Tony Rogvall, Ngoc Dao, ERLANG, Chandru, erlang-q...@erlang.org
Thanks for the lesson guys !

Regards
Zabrane

