Concurrent HttpClient usage

Erdem Agaoglu

Jan 6, 2012, 1:16:07 PM
to spray...@googlegroups.com
Hello all,

A part of my application reads some resources on the web and stores them in CouchDB (a database with a pure HTTP API). For this purpose I use two actors, named Web and DB. You can guess what they do. The Web actor

1. creates HttpConduits (spray-client) for every website to request a resource
2. may follow redirects, opening new HttpConduits but eventually reaching a response
3. upon receiving a response, it close()s the HttpConduit
4. and gives the response to the DB

DB actor on the other hand

1. has one HttpConduit, pointing to couch
2. upon receiving a response, does some conversion stuff
3. PUTs the resulting JSON to couch,
4. does not close() anything

The scenario is dead simple. Getting it to work wasn't:

Issue1: When handling a single URL everything works perfectly. But when I want the Web actor to process a dozen or two URLs simultaneously (using "web ! URL"), things go haywire. Different HttpConduits in different actors get mixed up, and the application usually ends up sending PUT requests all over the internet instead of to the local couch. The internet of course tries to tell me I'm doing something wrong, but I cannot process those responses because the HttpConduits were already closed. When the dispatch method tries to process them with "self ! Respond(...)" (as in HttpConduit@108), it gets an NPE because there is no self.

Solution1.1: I found this to be caused by the mutable httpConnection in Conn (at HttpConduit@92):
    class Conn extends HttpConn {
      implicit val timeout = Actor.Timeout(Long.MaxValue) // in scope for '? Connect(...)' call below
      var pendingResponses: Int = -1
      var httpConnection: Option[Either[Future[HttpConnection], HttpConnection]] = None // <---------------- HERE
      ...
I tried to make it immutable, but the solution I came up with, given my 2-week Scala background, wasn't very bright. I fixed the problem of requests going in the wrong direction, but requests weren't always getting connections. I was somehow closing conduits prematurely, and NPEs were appearing while trying to establish a connection at "self ! ConnectionResult(..)" (at HttpConduit@100). I tried to fix that too, but couldn't get very far.

Solution1.2: I concluded that HttpConduit was not designed to be used this way, since the couch connection had no problems. So I started writing my own HttpConduit, named HttpGateway: able to handle lots of short-lived connections, follow redirects, and maybe with some sort of throttling mechanism. It works, up to a point, but

Issue 2: It is excruciatingly slow, and does not work after 50 or so connections. I used HttpClient directly. As instructed in the documentation, I configured a single actor in the bootstrap, which was fetched from the registry and used for all those 50 connections. All of them time out while trying to connect. 10-20 connections are slow but OK; beyond that I start having problems.

Solution2.1: I don't use a single HttpClient. I create an HttpClient for each URL (host to be precise), which is blazing fast. No connection problems, no timeouts (unless server does not answer).

Now my QUESTION: is it OK to use more than one HttpClient in a single application? The docs say "it runs on a single, private thread". Does that mean I'd overload my machine after a while? Or am I an idiot to even think that?

Thanks in advance for any ideas.

PS1: I'll gladly share HttpGateway with anyone interested
PS2: Sorry for any mistakes in the post, I'm kinda in a hurry.

--
erdem

Chris Carrier

Jan 6, 2012, 2:12:49 PM
to spray...@googlegroups.com
I've not used Spray client much but it's appealing to me especially in
high concurrency situations being built on actors. Are you sharing
HttpConduit instances? In your use case are you calling many pages
from a single host or will the host be more or less random?

I would say your solution of reusing HttpClient is probably not good
considering this snippet from the Spray docs:

"As you can see from this API the spray-can HttpClient works on the
basis of individual connections. There is no higher-level support for
automatic connection pooling and such, since this is considered the
responsibility of the next-higher application layer."

I looked at the portion of the code where you had issues and it seems
strange it would be causing problems. That HttpConnection needs to be
mutable since Conn instances are reused. But I don't see how two
conns would interfere with each other (though I've only briefly
checked out the code). Would you be able to post a snippet of code
where you are actually sending the request and handling the response?

It also sounds like you are prematurely closing your HttpConduit. If
you are getting a NPE because there is no self then that sounds like
you are sending multiple requests through a single conduit but closing
it once you get the first response. Seems like what you really want
to do is something like:

1. Get a new host from wherever they're coming from
2. Declare a new HttpConduit
3. For each relative URI for that host call sendReceive and collect
all results as a list of Futures (with onComplete callbacks to send
results where needed)
4. Use Akka's sequence or traverse to turn your List[Future[Result]]
into a Future[List[Result]] and attach a callback to that to clean up
your conduit since that will guarantee all responses have come back
and you won't orphan any requests
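
The steps above can be sketched roughly as follows. This is only an illustration using the standard library's scala.concurrent.Future (not the Akka 1.x futures spray-client used at the time), and FakeConduit is a hypothetical stand-in, not the real HttpConduit API:

```scala
import scala.concurrent.{ExecutionContext, Future}

object ConduitCleanupSketch {
  // Hypothetical stand-in for a per-host conduit; NOT the spray-client class.
  final class FakeConduit(val host: String) {
    @volatile var closed = false

    // Pretends to send a request and returns a canned response.
    def sendReceive(path: String)(implicit ec: ExecutionContext): Future[String] =
      Future {
        require(!closed, s"conduit for $host already closed")
        s"200 $host$path"
      }

    def close(): Unit = closed = true
  }

  def fetchAll(conduit: FakeConduit, paths: List[String])
              (implicit ec: ExecutionContext): Future[List[String]] = {
    // Step 3: one Future per relative URI on this host.
    val responses: List[Future[String]] = paths.map(p => conduit.sendReceive(p))
    // Step 4: List[Future[String]] => Future[List[String]].
    val all: Future[List[String]] = Future.sequence(responses)
    // Close the conduit once the combined future completes. On success this
    // means every response has arrived; on failure it may run while other
    // requests are still in flight.
    all.andThen { case _ => conduit.close() }
  }
}
```

The point of the sketch is only the ordering: close() is attached to the combined future rather than to the first individual response.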

You can skip #3 if you are OK with processing all the Futures only
after they're all complete but I'm not sure that's ideal for your
situation. If this is a long running process that will be calling the
same host over and over then you may not even need to worry about
cleaning up your conduits until the container shuts down. Like I said
I've not used Spray client much so I've used some assumptions. If you
have any more code you can share that may help diagnose the problem.

Chris

Erdem Agaoglu

Jan 7, 2012, 5:48:31 PM
to spray...@googlegroups.com
Hi again, 

I've not used spray-client much either; I'm still learning. In the process I've found some minor bugs and proposed some improvements, but this one (if confirmed) is a little bigger. I'll first try to clarify the situation further. I'm not sharing anything (aside from sending messages between actors), at least not on purpose. If spray/akka/JVM does this behind the scenes there's no way I'll notice (well, I noticed HttpConduits sharing HttpConnections... I'll get to that). I _am_ calling multiple hosts from a single actor, but if things get big I might request from multiple actors sitting on different machines too.

I noticed reusing HttpClient is a bad idea, but the guide [1] shows putting one in a Supervisor and reusing it. HttpConduit reuses it, for example. That is indeed what _I'm_ trying to find out: currently "... I don't use a single HttpClient. I create an HttpClient for each URL ..." and I don't know if this is a good idea, because it contradicts [1]. (Please, a simple yes or no will suffice on this one :) )

It is really strange that the code piece from HttpConduit [2] would be causing problems. I know it shouldn't, but it does. My first attempt to fix it was just a lucky guess, and it worked up to a point. I know there has to be some mutable variable around somewhere, but maybe it is somewhere else, in a SynchronizedList maybe; I don't know.

I probably _am_ closing my HttpConduit prematurely. I couldn't find a good example of closing conduits. I'm closing in onComplete, but I don't know if this is right [5]. As I said, one actor of the application continuously creates conduits, makes a single request with each, and closes them. The other actor has one HttpConduit and never closes it.

It seems I couldn't make the purpose of the application clear, because your proposed scenario doesn't fit. I don't have any relative URIs; the URLs are generally for different hosts. HttpConduit takes a single host:port as constructor argument, so I _have to_ create more than one HttpConduit. A list of futures from different resources wouldn't make much difference.



I guess I should've given more info about the problem, but I had already solved parts of it when I first wrote the message, so it wasn't easy to get 2-day-old logs (especially ones sent to stdout). I've created a sample application [3] to reproduce the first issue I encountered (not because I want to hide the original one, but because I left it on my computer at work). It is a spray application running on spray-can, using spray-json to [un]marshal things and spray-client to reach the web and CouchDB. To run it:

1. Install couchdb using your package manager which will sit on http://localhost:5984/ by default.
2. Go to http://localhost:5984/_utils and create a database named "db"
3. Clone the application and run re:start in the sbt shell

To make it collect and save things, send some URLs to its "/load" path. To do that:

1. Create a file with urls in lines (f.e. urls.txt)
2. Send this to application with 
    $ curl --data-binary @urls.txt http://localhost:8080/load
*. (--data-binary is required because curl seems to strip newlines with -d)

The service will send each URL to the actor "Web", which will create an HttpConduit for the host, request the path and, on completion, send the content to the actor "Db", finally closing the conduit. "Db" will ask couch if it already has a record with the given URL. Depending on the response, "Db" will update or create the record. After that, it will split the content into lines and send messages to itself with each line as new content, which in turn makes it ask couch about those lines and create/update accordingly. The pseudo HTTP conversation for a single URL should go like this:

GET      http://google.com/
...      (after response from google, close HttpConduit)
...      (after response from couch, don't close anything)
PUT      http://localhost:5984/db/id_for_google.com (with content) <html> \n google.com \n </html>
PUT      http://localhost:5984/db/id_for_google.com@1 (with content) <html>
PUT      http://localhost:5984/db/id_for_google.com@3 (with content) </html>
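
As a rough illustration of how the PUT targets in that conversation could be derived: the sketch below assumes the record id is the source URL, that per-line records append "@" plus a 1-based line number, and that the id is percent-encoded into the path. That naming is inferred from the pseudo conversation above and from the request paths logged further down; the example application may build its ids differently, and CouchPathSketch, docPath and puts are hypothetical names.

```scala
import java.net.URLEncoder

object CouchPathSketch {
  // Encode a document id so it fits into a single path segment.
  // URLEncoder does form encoding (a space becomes '+'), which is good
  // enough for the URL-shaped ids used here.
  def docPath(db: String, id: String): String =
    "/" + db + "/" + URLEncoder.encode(id, "UTF-8")

  // One (path, body) pair for the whole content, followed by one pair per
  // line, numbered from 1 (assumed scheme, see the lead-in).
  def puts(db: String, url: String, content: String): List[(String, String)] = {
    val whole = docPath(db, url) -> content
    val perLine = content.split("\n").toList.zipWithIndex.map {
      case (line, i) => docPath(db, url + "@" + (i + 1)) -> line
    }
    whole :: perLine
  }
}
```

Every path produced this way is relative, so which host it ends up on depends entirely on the conduit it is sent through.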

Since this is a concurrent application, errors are not deterministic: sometimes they show up, sometimes they don't. The thing that got me into all of this is:

app: 01/07 13:10:28 ERROR[akka:event-driven:dispatcher:event:handler-19] a.e.s.Slf4jEventHandler - 
app: [akka.dispatch.DefaultCompletableFuture]
app: []
app: [java.lang.NullPointerException
app: at cc.spray.client.HttpConduit$MainActor$Conn$$anonfun$dispatchTo$1$2.apply(HttpConduit.scala:108)
app: at cc.spray.client.HttpConduit$MainActor$Conn$$anonfun$dispatchTo$1$2.apply(HttpConduit.scala:107)
app: at akka.dispatch.DefaultCompletableFuture.akka$dispatch$DefaultCompletableFuture$$notifyCompleted(Future.scala:927)
...

Line 108 of HttpConduit is an onComplete call, which makes the stack trace rather pointless. I put tracing statements around the line and found that "self" is the thing that is null. Since that is nonsense, I traced further (by changing the debug line at HttpConduit:94 to also log the request with host:port) and found out the situation is even more serious:

...
app: Opening new connection to www.infoq.com:80 to request PUT request to http://www.infoq.com:80/db/http%3A%2F%2Ftwitter.com%2F%4015
app: Opening new connection to www.infoq.com:80 to request PUT request to http://www.infoq.com:80/db/http%3A%2F%2Ftwitter.com%2F%4016
app: Opening new connection to blogs.reuters.com:80 to request PUT request to http://blogs.reuters.com:80/db/http%3A%2F%2Ftwitter.com%2F%4029
app: Opening new connection to gapingvoid.com:80 to request PUT request to http://gapingvoid.com:80/db/http%3A%2F%2Ftwitter.com%2F%4032
app: Opening new connection to www.infoq.com:80 to request PUT request to http://www.infoq.com:80/db/http%3A%2F%2Ftwitter.com%2F%4048
app: Opening new connection to www.infoq.com:80 to request PUT request to http://www.infoq.com:80/db/http%3A%2F%2Ftwitter.com%2F%4060
app: Opening new connection to www.infoq.com:80 to request PUT request to http://www.infoq.com:80/db/http%3A%2F%2Ftwitter.com%2F%4063
...

You can see the inconsistency here. The "Db" actor tries to write lines 15, 16, 29, 32, 48, 60 and 63 of the http://twitter.com/ response to couch. But most of them go to infoq.com, others to reuters and such. After those hosts tell me to 400 or 404 away, HttpConduit tries to handle that response; but sometimes self is null, because I have probably already closed it after getting the actual resource (at infoq, reuters and gapingvoid). If I haven't, the 400 response might be handled as the original response for the host, which will silently create even more problems. The problem is that sometimes my HttpConduit in "Db" gets the connection for another host (probably one in progress at actor "Web") instead of localhost:5984 when reaching for one.

I hope the problem is clearer this time. Writing long posts is getting to me :). Thanks for your continued interest!

PS 1: Another problem is that when the list of URLs grows beyond 50, all connections time out, but first things first.

PS 2: Depending on the urls given, you'll see exceptions like 

app: [cc.spray.http.HttpException: Illegal HTTP header 'Cache-Control':
app: Invalid input ';', expected OptWS, '=', ListSep or EOI (line 1, pos 8):
app: private; max-age: 31536000
app:        ^
app: 
app: at cc.spray.http.HttpHeader$.apply(HttpHeader.scala:35)
...

OR

app: 01/07 13:10:30 ERROR[akka:event-driven:dispatcher:event:handler-21] a.e.s.Slf4jEventHandler - 
app: [akka.dispatch.DefaultCompletableFuture]
app: []
app: [cc.spray.http.HttpException: Illegal HTTP header 'Last-Modified':
app: Invalid input '+', expected "GMT" (line 1, pos 27):
app: Sat, 07 Jan 2012 11:05:31 +0000
app:                           ^
app: 
app: at cc.spray.http.HttpHeader$.apply(HttpHeader.scala:35)
...

Those are unrelated to the current problem; Mathias has already created an improvement [4] for the HttpHeader parser being too strict.

[5]: Something unrelated here: closing an HttpConduit does not close its HttpConnections; it simply stops its MainActor. That makes the connection pool unreachable, but the connections stay alive. I think this will create a connection leak, but I'm not sure.

Mathias

Jan 10, 2012, 11:19:26 AM
to spray...@googlegroups.com
Erdem,

sorry for not getting back to you earlier on this but I still haven't completely resurfaced after the holidays (however, reaction times should be back to normal from now on).
Also: thanks for digging into spray-client as deeply as you did and reporting the issues you have encountered. I'm sure we'll be able to fix any problems with your spray-client use case.

As there appeared to be some confusion about what the exact responsibilities of the spray-can HttpClient and the spray-client HttpConduit are, here is some more detail on this:

The spray-can HttpClient is responsible for the low-level reception and rendering of HTTP messages. Its interface allows you to open a connection to a host, use it for one or more request/response cycles and close it eventually. It does not provide any logic on a "super-connection" level, i.e. it does not distribute requests across a bunch of connections or somehow group connections by host. However, a single HttpClient should be able to handle thousands of concurrent connections, which is why there is no point in creating more than one per JVM instance.

The spray-client HttpConduit provides a layer on top of the HttpClient. It manages a bunch of connections for 1 host. So here, you create one HttpConduit per remote host and work with it until you no longer need to talk to that host.
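
To illustrate the "one conduit per remote host" idea, here is a minimal sketch of a registry that hands out a single shared instance per host:port. FakeConduit and Registry are hypothetical stand-ins written for this example; they are not part of spray-client, and the real HttpConduit constructor is not shown here.

```scala
import scala.collection.concurrent.TrieMap

object ConduitRegistrySketch {
  // Hypothetical stand-in for a per-host conduit; NOT the spray-client class.
  final class FakeConduit(val host: String, val port: Int)

  final class Registry {
    // Keyed by (lower-cased host, port) so "WWW.infoq.com" and
    // "www.infoq.com" share one entry.
    private val conduits = TrieMap.empty[(String, Int), FakeConduit]

    // Returns the existing conduit for this host:port or creates one.
    // Under contention a spare instance may be constructed and discarded,
    // but callers should always get the instance that was stored.
    def conduitFor(host: String, port: Int = 80): FakeConduit =
      conduits.getOrElseUpdate((host.toLowerCase, port), new FakeConduit(host, port))

    def size: Int = conduits.size
  }
}
```

With such a registry, a crawler-style actor would look up the conduit for each URL's host instead of creating and closing one per request.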

The problem of several HttpConduits getting in each other's way, as you are describing, is certainly a big issue that needs to be addressed.
Would you be able to distill a small test example exhibiting the behavior you were seeing?

Cheers,
Mathias

---
mat...@spray.cc
http://www.spray.cc

Erdem Agaoglu

Jan 10, 2012, 1:18:49 PM
to spray...@googlegroups.com
Hello Mathias,

Compared with the reaction times I've seen over the years, I'd say you're awesome! Having worked with spray for weeks now, all I can say is you deserve great holidays; I hope you had one.

I kinda understood that HttpClient cannot and should not provide any such logic. I tried to use HttpConduit (with connection pooling and message pipelining and all) for both sides (Web and Db) of my application, but it didn't work. So I thought HttpConduits are not supposed to be used in a short-lived fashion. The Db actor (reads and writes data) of my application still uses a single never-closing HttpConduit for the database connection and I have no problems with it. But I had to change the Web actor (reads web pages on the internet) to use something else (an HttpConduit-like structure with host-based connection grouping, redirect following and such).

During the process I encountered two strange behaviours. The first one is HttpConduits stepping on each other's toes. To demonstrate this, I pushed https://github.com/agaoglu/spray-client-concurrency-example. I can run this on some machine too, but it'll be easier for you to run it and see the logs for yourself.

The second one is a single HttpClient being unable to handle more than ~50 concurrent connections. My (httpClient ? Connect(host, port)) futures time out. This is easier to demonstrate. Think of an actor like https://gist.github.com/1590275#file_status_reporter.scala. It simply makes a request and writes the response status. Run it with something like https://gist.github.com/1590275#file_run.scala. Whether the application works depends on the number of URLs. If the list contains 5-10 URLs, the application works. As the list grows, timeouts happen. When the list grows to 50 items, every connection times out. If you change the actor to create a new HttpClient for each message instead of reusing the one in the registry, the number of URLs can grow without a problem.

Thanks for your interest.

--
erdem

Erdem Agaoglu

Jan 11, 2012, 9:44:06 AM
to spray...@googlegroups.com
Hello again,

I got some more info, and a correction, on the HttpClient connection handling issue. It's not the number of URLs affecting the behaviour of HttpClient, but whether connections succeed. If one connection somehow times out, HttpClient times out all the others waiting in line. The number-of-URLs illusion occurs because of something else, I guess. Maybe something like "the machine tries to open ~50 connections simultaneously, some switch, my NIC or another component causes one of them to time out, and HttpClient reflects this to the others". I'll update here when I have something concrete.
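
One way this kind of "one slow item times out everything behind it" behaviour can arise is when all work is processed on a single thread, as the docs quoted at the top of the thread say HttpClient does. The sketch below is not spray-can code; it uses a plain single-threaded executor and hypothetical names to show a fast task timing out only because it is queued behind a slow one:

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object HeadOfLineSketch {
  // Returns true when the fast task timed out while queued behind the slow one.
  def fastTaskStarves(slowMillis: Long, waitMillis: Long): Boolean = {
    val pool = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Submitted first, so it occupies the only worker thread.
      Future { Thread.sleep(slowMillis); "slow" }
      // Trivial on its own, but it cannot start until the slow task is done.
      val fast = Future { "fast" }
      try { Await.result(fast, waitMillis.millis); false }
      catch { case _: TimeoutException => true }
    } finally pool.shutdown() // lets queued tasks finish, then stops the thread
  }
}
```

For example, fastTaskStarves(500, 50) gives up on the fast task after 50 ms even though that task itself does no waiting.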

--
erdem

Erdem Agaoglu

Jan 11, 2012, 10:20:30 AM
to spray...@googlegroups.com
Found it,

initiateConnection in HttpClient contains a blocking call (https://github.com/spray/spray-can/blob/develop/spray-can/src/main/scala/cc/spray/can/HttpClient.scala#L112). InetSocketAddress creation blocks the thread while it tries to resolve the host. If I call

    val future = (httpClient ? Connect(host, port))

with a host that is taking too long to resolve, that future times out. It can be avoided by increasing the timeout duration, but that won't change the fact that other Connect() calls will wait for one host to get resolved. There should be a way to do this asynchronously (with a DNSResolver actor perhaps? :) ).
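
A minimal sketch of the asynchronous idea, assuming the blocking InetSocketAddress construction is simply moved onto its own small thread pool. This uses the standard library's scala.concurrent.Future rather than an actor, it is not how spray-can does or did it, and AsyncResolveSketch, resolve and the pool size of 4 are choices made up for the example:

```scala
import java.net.{InetSocketAddress, UnknownHostException}
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.{ExecutionContext, Future}

object AsyncResolveSketch {
  // Dedicated pool so slow DNS lookups only block resolver threads.
  // Daemon threads, so the pool does not keep the JVM alive.
  private val resolverPool = Executors.newFixedThreadPool(4, new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "dns-resolver")
      t.setDaemon(true)
      t
    }
  })
  private val resolverEc: ExecutionContext = ExecutionContext.fromExecutor(resolverPool)

  // Resolves host:port off the caller's thread; fails the future when the
  // host cannot be resolved.
  def resolve(host: String, port: Int): Future[InetSocketAddress] =
    Future {
      val address = new InetSocketAddress(host, port) // may block on DNS
      if (address.isUnresolved) throw new UnknownHostException(host)
      address
    }(resolverEc)
}
```

A connection would then be initiated from the completed future, so one slow lookup delays only its own Connect and not the ones queued after it.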

cheers
--
erdem


Mathias

Jan 12, 2012, 9:39:55 AM
to spray...@googlegroups.com
Erdem,

very good!
Thanks for the in-depth analysis.
I have created two new issues for spray-client to track progress on the problems you found:
https://github.com/spray/spray/issues/72 and https://github.com/spray/spray/issues/73

Cheers,
Mathias

---
mat...@spray.cc
http://www.spray.cc
