New to Erlang, I'm trying to implement a simple URL fetcher.
Here's my code (please, feel free to correct it if you find any bug or know
a better approach):
8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8----
-module(fetch).
-export([url/1]).

-define(TIMEOUT, 7000).
-define(TCP_OPTS, [binary, {packet, raw}, {nodelay, true},
                   {active, true}]).

url(Url) ->
    {ok, _Tag, Host, Port} = split_url(Url),
    Hdrs = [],
    Request = ["GET ", Url, " HTTP/1.1\r\n", Hdrs, "\r\n\r\n"],
    case catch gen_tcp:connect(Host, Port, ?TCP_OPTS) of
        {'EXIT', Why} ->
            {error, {socket_exit, Why}};
        {error, Why} ->
            {error, {socket_error, Why}};
        {ok, Socket} ->
            gen_tcp:send(Socket, list_to_binary(Request)),
            recv(Socket, list_to_binary([]))
    end.

recv(Socket, Bin) ->
    receive
        {tcp, Socket, B} ->
            io:format(".", []),
            recv(Socket, concat_binary([Bin, B]));
        {tcp_closed, Socket} ->
            {ok, Bin};
        Other ->
            {error, {socket, Other}}
    after
        ?TIMEOUT ->
            {error, {socket, timeout}}
    end.

split_url([$h,$t,$t,$p,$:,$/,$/|T]) -> split_url(http, T);
split_url(_X) -> {error, split_url}.

split_url(Tag, X) ->
    case string:chr(X, $:) of
        0 ->
            Port = 80,
            case string:chr(X, $/) of
                0 ->
                    {ok, Tag, X, Port};
                N ->
                    Site = string:substr(X, 1, N-1),
                    {ok, Tag, Site, Port}
            end;
        N1 ->
            case string:chr(X, $/) of
                0 ->
                    error;
                N2 ->
                    PortStr = string:substr(X, N1+1, N2-N1-1),
                    case catch list_to_integer(PortStr) of
                        {'EXIT', _} ->
                            {error, port_number};
                        Port ->
                            Site = string:substr(X, 1, N1-1),
                            {ok, Tag, Site, Port}
                    end
            end
    end.
8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8-----8------
When I test it, receiving from the socket is very slow and ends in a timeout:
$ erl
1> c(fetch).
2> Bin = fetch:url("http://www.google.com").
......{error,{socket,timeout}}
Am I missing something?
What I'd like to end up with is a very fast fetcher. Any hints?
Regards
Zabrane
cheers
Chandru
2009/11/16 zabrane Mikael <zabr...@gmail.com>
That fixed my problem. Thanks.
While googling a bit, I found two ways to read from the Socket:
% version 1 with "receive" (socket opened with {active, true})
recv(Socket, Bin) ->
    receive
        {tcp, Socket, B} ->
            io:format(".", []),
            recv(Socket, concat_binary([Bin, B]));
        {tcp_closed, Socket} ->
            {ok, Bin};
        Other ->
            {error, {socket, Other}}
    after
        ?TIMEOUT ->
            {error, {socket, timeout}}
    end.

% version 2 with "gen_tcp:recv" (socket opened with {active, false})
recv2(Socket, Bin) ->
    case gen_tcp:recv(Socket, 0, ?TIMEOUT) of
        {ok, B} ->
            io:format(".", []),
            recv2(Socket, concat_binary([Bin, B]));
        {error, closed} ->
            {ok, Bin};
        {error, timeout} ->
            {error, {socket, timeout}};
        Other ->
            {error, {socket, Other}}
    end.
Which one is best in my case (see fetch.erl below)?
Regards
Zabrane
On 16 Nov 09 at 18:53, Chandru wrote:
________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org
If the active option is true, which is the default, everything
received from the socket will be sent as messages to the receiving
process. If the active option is set to false (passive mode), the
process must explicitly receive incoming data by calling
gen_tcp:recv/N or gen_udp:recv/N (depending on the type of socket).
Note: Passive mode provides flow control; the other side will not be
able to send faster than the receiver can read. Active mode provides no
flow control; a fast sender could easily overflow the receiver with
incoming messages. Use active mode only if your high-level protocol
provides its own flow control (for instance, acknowledging received
messages) or the amount of data exchanged is small.
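
A common middle ground between the two modes described above is {active, once}: the socket delivers one message and then falls back to passive until it is re-armed. The sketch below is only an illustration and is not taken from this thread; the module name flow and the function recv_all/2 are made up, and it assumes the calling process owns the socket.

```erlang
-module(flow).
-export([recv_all/2]).

%% Collect everything the peer sends until it closes the connection.
%% The socket is re-armed with {active, once} before each receive, so at
%% most one {tcp, ...} message is delivered per iteration.
recv_all(Socket, Timeout) ->
    recv_all(Socket, Timeout, []).

recv_all(Socket, Timeout, Acc) ->
    case inet:setopts(Socket, [{active, once}]) of
        ok ->
            receive
                {tcp, Socket, Data} ->
                    recv_all(Socket, Timeout, [Data | Acc]);
                {tcp_closed, Socket} ->
                    {ok, iolist_to_binary(lists:reverse(Acc))};
                {tcp_error, Socket, Reason} ->
                    {error, Reason}
            after Timeout ->
                {error, timeout}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```

Accumulating the chunks in a list and joining them once at the end avoids copying the growing binary on every message.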
This will get you into murkier waters - you'll have to check if a
content length is defined and then read exactly this number of bytes.
I wrote a tutorial about this a while back
http://www.sics.se/~joe/tutorials/web_server/web_server.html
You'll find a module called http_driver here that does the parsing and
collects the appropriate number of bytes.
/Joe
/Tony
While trying to learn how to write a simple TCP web server in Erlang that
only dumps what it receives to stdout, I noticed that from time to time the
HTTP requests are truncated when they reach the server. My main socket loop
looks like:
-------------------------------------------
-define(TCP_OPTIONS, [binary, {packet, 0}, {active, false},
                      {reuseaddr, true}]).
...
loop_recv(Socket) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, BinData} ->
            %% here, I'm assuming that all the HTTP request (Headers +
            %% Body) is in "BinData". Hope I'm right.
            io:format("BinData: ~p~n", []),
            ok;
        NotOK ->
            error_logger:info_report([{"gen_tcp:recv/2", NotOK}]),
            error
    end.
-------------------------------------------
For some requests, the "io:format" prints truncated data (in BinData):
<<"
http://www.foo.tv/images/v30/LoaderV3.swf?loop=false&quality=high&request=357&HTTP/1.0\r\nHost:
www.foo.tv\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5;
fr; rv:1.9.0.2) Gecko/2008090512 Firefox/3.0.2\r\nAccept: text/h">>
As you can see, the request isn't complete: it stops at "... Accept: text/h".
Am I doing something wrong? How can I fix it?
Regards
Zabrane
2009/11/17 Tony Rogvall <to...@rogvall.se>
io:format("BinData: ~p~n", [BinData]),
You should have got a warning from the compiler.
Robert
2009/11/21 zabrane Mikael <zabr...@gmail.com>
> Hi List !
>
2009/11/21 Robert Virding <rvir...@gmail.com>
Your assumption that the initial gen_tcp:recv will give you a binary
which always includes the full request headers is not correct. If you
are going down that road you would need to write code to check for the
header terminator and if it isn't present then buffer and read
more until you have the terminator. It isn't hard but it can be
messy.
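
The buffering described above might look roughly like this. It is a minimal sketch, not code from any of the servers mentioned in this thread: the module name hdr_recv and the function recv_headers/2 are hypothetical, and the socket is assumed to be in passive mode with {packet, raw} (or {packet, 0}) and the binary option.

```erlang
-module(hdr_recv).
-export([recv_headers/2]).

%% Keep reading until the blank line that ends the HTTP headers has arrived.
%% Returns the header block and whatever bytes followed it (the start of the
%% body, if any).
recv_headers(Socket, Timeout) ->
    recv_headers(Socket, Timeout, <<>>).

recv_headers(Socket, Timeout, Buf) ->
    %% Re-scan the whole buffer each time, so a terminator that was split
    %% across two reads is still found.
    case binary:split(Buf, <<"\r\n\r\n">>) of
        [Headers, Rest] ->
            {ok, Headers, Rest};
        [_Incomplete] ->
            case gen_tcp:recv(Socket, 0, Timeout) of
                {ok, Data} ->
                    recv_headers(Socket, Timeout, <<Buf/binary, Data/binary>>);
                {error, Reason} ->
                    {error, Reason}
            end
    end.
```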
Most erlang http servers these days use the {packet, http} socket
option which is HTTP aware and which will read/parse headers one by
one for you and will also indicate the end of the headers.
mochiweb_http has a good example of how to use {packet, http} to read
request headers :
http://code.google.com/p/mochiweb/source/browse/trunk/src/mochiweb_http.erl
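
For illustration, reading a request with the HTTP-aware packet mode could be sketched as follows. This is not the mochiweb code: the module name http_hdrs and the function read_request/2 are hypothetical, the socket is assumed to be in passive mode, and the sketch uses the binary variant {packet, http_bin}.

```erlang
-module(http_hdrs).
-export([read_request/2]).

%% Switch the socket to {packet, http_bin}, read the request line and then
%% the headers one packet at a time, and switch back to {packet, raw} so the
%% caller can read the body itself.
read_request(Socket, Timeout) ->
    ok = inet:setopts(Socket, [{packet, http_bin}]),
    case gen_tcp:recv(Socket, 0, Timeout) of
        {ok, {http_request, Method, Uri, Version}} ->
            read_headers(Socket, Timeout, {Method, Uri, Version}, []);
        {ok, Other} ->
            {error, {unexpected, Other}};
        {error, Reason} ->
            {error, Reason}
    end.

read_headers(Socket, Timeout, Req, Acc) ->
    case gen_tcp:recv(Socket, 0, Timeout) of
        {ok, {http_header, _, Name, _, Value}} ->
            read_headers(Socket, Timeout, Req, [{Name, Value} | Acc]);
        {ok, http_eoh} ->
            %% End of headers: anything after this is the body.
            ok = inet:setopts(Socket, [{packet, raw}]),
            {ok, Req, lists:reverse(Acc)};
        {ok, {http_error, Line}} ->
            {error, {bad_header, Line}};
        {error, Reason} ->
            {error, Reason}
    end.
```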
Colm
> Your assumption that the initial gen_tcp:recv will give you a binary
> which always includes the full request headers is not correct. If you
> are going down that road you would need to write code to check for the
> the header terminator and if it isn't present then buffer and read
> more until you have the terminator.
OK. Finding the header terminator (the double CRLF) isn't difficult.
What's difficult for me is finding the end of the request.
How can I know that I've read all the data in my request (GET or POST)?
> It isn't hard but it can be messy.
>
Joe's code helped me a lot with his reentrant parser. Thanks by the way.
> Most erlang http servers these days use the {packet, http} socket
> option which is HTTP aware and which will read/parse headers one by
> one for you and will also indicate the end of the headers.
>
Lev Walkin introduced an example of his web server called yucan which uses
{packet, http_bin}:
http://lionet.livejournal.com/42016.html?thread=774432
Could someone explain the difference?
> mochiweb_http has a good example of how to use {packet, http} to read
> request headers :
>
>
> http://code.google.com/p/mochiweb/source/browse/trunk/src/mochiweb_http.erl
Thanks for the pointer
Regards
Zabrane
> Hi Colm,
>
> Your assumption that the initial gen_tcp:recv will give you a binary
> > which always includes the full request headers is not correct. If you
> > are going down that road you would need to write code to check for the
> > the header terminator and if it isn't present then buffer and read
> > more until you have the terminator.
>
>
> Ok. Finding headers terminator (teh double CRLF) isn't difficult.
> What's difficult for me is to find the request end?
> How can I know that I've read all data in my request (GET or POST)?
>
>
There is no easy answer I'm afraid! You just have to read RFC2616 :)
cheers
Chandru
Erlang doesn't provide any magic way to read the request body (in the
way it does for headers). You have to write that code yourself. You
can look at either the yaws or mochiweb sources for straightforward
Erlang implementations of this.
>> It isn't hard but it can be messy.
>
> Joe's code helped me a lot with his reentrant parser. Thanks by the way.
>>
>> Most erlang http servers these days use the {packet, http} socket
>> option which is HTTP aware and which will read/parse headers one by
>> one for you and will also indicate the end of the headers.
>
> Lev Walkin introduced an example of his web server called yucan which uses
> {packer, http_bin}:
> http://lionet.livejournal.com/42016.html?thread=774432
> Could someone explain the difference?
With the "http" option you get the header key and value returned to
you as strings, apart from well known header names which are returned
as atoms. With "http_bin" the same applies, but where there were
strings you get erlang binaries instead.
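
The difference can be seen without a socket by running the same request through erlang:decode_packet/3, which applies the same parsing. This is only a sketch: the module name http_vs_bin, the function parse/1 and the sample request are made up for the illustration.

```erlang
-module(http_vs_bin).
-export([parse/1]).

sample() ->
    <<"GET /index.html HTTP/1.1\r\nHost: www.foo.tv\r\nX-Custom: 1\r\n\r\n">>.

%% Type is http or http_bin; the input is the same either way.
parse(Type) ->
    {ok, {http_request, Method, Uri, Version}, Rest} =
        erlang:decode_packet(Type, sample(), []),
    {Method, Uri, Version, headers(header_type(Type), Rest)}.

%% Header lines are decoded with the matching "httph" packet type.
header_type(http) -> httph;
header_type(http_bin) -> httph_bin.

headers(HType, Bin) ->
    case erlang:decode_packet(HType, Bin, []) of
        {ok, {http_header, _, Name, _, Value}, Rest} ->
            [{Name, Value} | headers(HType, Rest)];
        {ok, http_eoh, _Rest} ->
            []
    end.
```

With http, the path and header values come back as strings ("/index.html", "www.foo.tv") and the unknown header name as the string "X-Custom". With http_bin they come back as binaries. In both cases the well-known header name is the atom 'Host'.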
Colm
Whenever I read something like that I think - "fragmentation".

If you write N bytes to a TCP socket, you will eventually be able to read N
bytes from the socket, but the bytes may or may not be delivered "all in one
go". Since you have said {packet, 0} you'll just get whatever happened to be
read. This is why you *must* write a re-entrant parser.

First you collect data until you see "\r\n\r\n" - only then can you parse
the header.
Then you check for a content length header. If you find a content length
header it will contain the content length (N). Then you collect *exactly* N
bytes following the "\r\n\r\n". Otherwise you collect until the socket closes
(there is also a chunked alternative which I will ignore).
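
The steps above can be sketched as a small re-entrant collector that is fed one fragment at a time. This is a simplified illustration, not the http_driver code: the module name collect and the functions new/0 and feed/2 are hypothetical, a missing content length is treated as "no body" instead of reading until the socket closes, and chunked encoding is ignored. It uses string:lowercase/1 and string:trim/1, so it assumes OTP 20 or later.

```erlang
-module(collect).
-export([new/0, feed/2]).

%% State is {headers, Buf} until the blank line has been seen, then
%% {body, Headers, N, Acc} while waiting for N body bytes.
new() -> {headers, <<>>}.

%% Feed one fragment. Returns {more, State} or {done, Headers, Body, Rest}.
feed(Data, {headers, Buf0}) ->
    Buf = <<Buf0/binary, Data/binary>>,
    case binary:split(Buf, <<"\r\n\r\n">>) of
        [Headers, Rest] ->
            feed(Rest, {body, Headers, content_length(Headers), <<>>});
        [_] ->
            {more, {headers, Buf}}
    end;
feed(Data, {body, Headers, N, Acc0}) ->
    Acc = <<Acc0/binary, Data/binary>>,
    case Acc of
        <<Body:N/binary, Rest/binary>> -> {done, Headers, Body, Rest};
        _ -> {more, {body, Headers, N, Acc}}
    end.

%% Simplification: no Content-Length header means a body of 0 bytes.
content_length(Headers) ->
    find_length(binary:split(Headers, <<"\r\n">>, [global])).

find_length([]) -> 0;
find_length([Line | Rest]) ->
    case binary:split(Line, <<":">>) of
        [Name, Value] ->
            case string:lowercase(Name) of
                <<"content-length">> -> binary_to_integer(string:trim(Value));
                _ -> find_length(Rest)
            end;
        _ ->
            find_length(Rest)
    end.
```

Because the state is carried between calls, it does not matter where the fragments are split: the caller just keeps calling feed/2 with whatever gen_tcp:recv returns until it gets {done, ...}.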

The code in http://www.sics.se/~joe/tutorials/web_server/http_driver.erl
does this.

If you don't do this your program will work sometimes - in the case where the
incoming packets were not fragmented - but it will fail mysteriously if the
packets are fragmented.

Forgetting about fragmentation is the "first basic" mistake that *everybody*
makes when writing networking code.

This mistake often shows up when you deploy something. You test it locally on
localhost: it works. You test it live on the Internet: it fails.

Why? Packets are rarely fragmented on localhost. The chance of fragmentation
on the Internet is very high - even if you have a good connection.

Aside: this is why one of the tcp options is {packet, N} - if you write a
client AND a server in Erlang and BOTH use (say) {packet,4} then gen_tcp will
silently reassemble fragmented packets behind the scenes before delivering
them to the application program.
This together with term_to_binary (and its inverse) and the bit syntax
will save you many sleepless nights.
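
As a small illustration of that aside, the sketch below sends one Erlang term over a loopback connection where both ends use {packet, 4}. The module name framed and the function roundtrip/1 are made up for the example. It opens its own listening socket on an ephemeral port, so it is only a demonstration, not a server.

```erlang
-module(framed).
-export([roundtrip/1]).

%% Send Term from one end of a loopback connection to the other and return
%% what the receiving end decodes.
roundtrip(Term) ->
    Opts = [binary, {packet, 4}, {active, false}],
    {ok, Listen} = gen_tcp:listen(0, Opts),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port, Opts),
    {ok, Server} = gen_tcp:accept(Listen),
    %% gen_tcp prepends a 4-byte length header on send ...
    ok = gen_tcp:send(Client, term_to_binary(Term)),
    %% ... and one recv returns one whole message, however TCP split it up.
    {ok, Bin} = gen_tcp:recv(Server, 0, 5000),
    lists:foreach(fun gen_tcp:close/1, [Client, Server, Listen]),
    binary_to_term(Bin).
```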
> My main socket loop looks
> like:
>
> -------------------------------------------
> -define(TCP_OPTIONS,[binary, {packet, 0}, {active, false}, {reuseaddr,
> true}]).
> ...
> loop_recv(Socket)
> case gen_tcp:recv(Socket, 0) of
> {ok, BinData} ->
> %% here, I'm assuming that all the HTTP request (Headers +
> Body) is in "BinData". Hope I'm right.
No No No - programs should not depend upon Hope. This assumption is wrong.
Hint - print out the packet lengths, so you can see how the data arrives.
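
One way to follow that hint is a loop like the one below, which prints the size of every chunk that gen_tcp:recv hands back. The module name lens and the function recv_lengths/2 are hypothetical; the socket is assumed to be in passive mode with {packet, 0} and the binary option.

```erlang
-module(lens).
-export([recv_lengths/2]).

%% Read until the peer closes, or until Timeout ms pass with no data,
%% printing and collecting the size of each chunk received.
recv_lengths(Socket, Timeout) ->
    case gen_tcp:recv(Socket, 0, Timeout) of
        {ok, Bin} ->
            io:format("got ~p bytes~n", [byte_size(Bin)]),
            [byte_size(Bin) | recv_lengths(Socket, Timeout)];
        {error, _ClosedOrTimeout} ->
            []
    end.
```

If a single request shows up as more than one line of output, it arrived in several pieces.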
/Joe