futures + refs

Sam Raker

Dec 11, 2014, 11:18:04 AM
to clo...@googlegroups.com
I've got some code that uses Twitter's HoseBirdClient (HBC) to pull tweets from the public stream, which I then preprocess and store in CouchDB. Right now my HBC client is being forced to reconnect more often than I'd like, which occasionally causes my app to hang, for reasons I'm not entirely clear on. Some preliminary research on HBC suggests that the reconnections happen when my code fails to keep up with the endpoint, which in turn suggests that my processing and uploading is taking too long.

I tried wrapping the processing and uploading in futures, which definitely sped things up but caused 409 errors when uploading to CouchDB. Briefly, Couch requires any update operation to include a git-style "rev" string, and if the rev you provide isn't the most recent one, it throws a 409 at you. I'm organizing things by hashtag, so the culprits are tweets with multiple copies of the same hashtag, or series of tweets with the same hashtag: future A gets the current doc from Couch, processes it, and uses the rev it got from that doc, while future B does the same thing but finishes first, so future A now has an outdated rev, and that causes the 409.

The rough solution I've come up with involves using a map to store the rev values, where the last step of the processing/uploading function stores the rev that Clutch helpfully returns after a successful update. From what I can tell, refs are the way to go, since each future is effectively a separate thread. My questions are as follows:
1) Would I have to store the map-of-refs in a ref?
2) Is this even feasible? Would the timing work out?
3) With the addition of all this dereferencing and `dosync`+`alter`-ing, would this actually end up speeding things up all that much?
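
On question 1, one possible shape is a single ref holding a plain map from hashtag to the last rev seen, rather than a map of refs. The sketch below is only an illustration under that assumption: `revs`, `record-rev!`, and `known-rev` are hypothetical names, the rev strings are made up, and no actual Clutch or CouchDB call is shown.

```clojure
;; Assumption: one ref wrapping a plain map of hashtag -> latest known rev.
(def revs (ref {}))

(defn record-rev!
  "Store the rev returned by a successful update for `hashtag`."
  [hashtag rev]
  (dosync (alter revs assoc hashtag rev)))

(defn known-rev
  "Return the last rev recorded for `hashtag`, or nil if none is known."
  [hashtag]
  (get @revs hashtag))

;; Simulate several futures recording (made-up) revs concurrently,
;; then wait for all of them to finish.
(let [fs (doall (for [i (range 10)]
                  (future (record-rev! (str "tag" i) (str i "-abc")))))]
  (doseq [f fs] @f))
```

This only keeps the map itself consistent. The fetch and update against Couch would still happen outside the transaction, so a stale rev (and a 409) remains possible whenever two futures work on the same hashtag at once.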

László Török

Dec 11, 2014, 11:43:18 AM
to clo...@googlegroups.com
Hi Sam,

Have you tried putting the incoming (hashtag, tweet) tuples into a queue and having another thread pull them out and upload them to CouchDB?

I'm unfamiliar with HBC, but I assume it has a callback-based API, so you should be able to have multiple callbacks/connections/streams feed the same queue and have a single thread do the upload (and maybe batch if necessary).

I don't see refs being a particularly good fit for this problem, but I could be wrong.
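
A minimal sketch of the queue-plus-single-uploader idea, using `java.util.concurrent.LinkedBlockingQueue`. Everything named here is an assumption for illustration: `upload!` is a hypothetical stand-in that just records what it was given (a real version would call CouchDB), and `::done` is a made-up sentinel used to stop the loop.

```clojure
(import '(java.util.concurrent LinkedBlockingQueue))

;; Hypothetical stand-in for the real CouchDB upload.
(def uploaded (atom []))
(defn upload! [[hashtag tweet]]
  (swap! uploaded conj [hashtag tweet]))

(def tuple-queue (LinkedBlockingQueue.))

;; Single consumer: only this thread uploads, so its own updates
;; are applied one after another instead of racing each other.
(def consumer
  (future
    (loop []
      (let [item (.take tuple-queue)]   ; blocks until something is available
        (when-not (= item ::done)
          (upload! item)
          (recur))))))

;; Any number of callbacks/threads could feed the same queue.
(doseq [i (range 5)]
  (.put tuple-queue ["#clojure" (str "tweet " i)]))
(.put tuple-queue ::done)

@consumer   ; wait for the consumer to drain the queue
```

Assuming nothing else writes to the same documents, a single uploader avoids the outdated-rev situation by construction, at the cost of doing the uploads serially.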


--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Sam Raker

Dec 11, 2014, 11:53:27 AM
to clo...@googlegroups.com
So HBC actually does that already--it dumps tweets into a LinkedBlockingQueue. Right now, I'm doing `(loop [tweet (.take queue)]...`, which, I think, essentially amounts to what you're suggesting, but I could be misunderstanding you.

There's a distinct possibility that all the reconnections are caused by my home's internet connection--my local ssh connections get dropped constantly, which suggests there might be a problem somewhere. I figured trying to optimize my code couldn't hurt, though.


László Török

Dec 11, 2014, 12:06:42 PM
to clo...@googlegroups.com
Excellent. 

If the problem is that your write rate to CouchDB can't keep up with the incoming tweet stream (which may be the case), try batching the writes to Couch: instead of firing an update on every single tweet, build up a larger batch in that loop, and once you hit a threshold of your choice, send it to Couch.

As you said, it could be something totally different. :)
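
The batching suggestion above could be sketched roughly as follows. This is an illustration only: `upload-batch!` is a hypothetical placeholder that records each batch rather than talking to CouchDB, `::done` is a made-up sentinel for ending the loop, and the threshold of 3 is arbitrary.

```clojure
(import '(java.util.concurrent LinkedBlockingQueue))

;; Hypothetical placeholder for a bulk write to CouchDB.
(def batches (atom []))
(defn upload-batch! [tweets]
  (swap! batches conj (vec tweets)))

(defn consume-in-batches!
  "Take items from `queue` until ::done arrives, uploading every
  `threshold` items and flushing any leftover partial batch at the end."
  [^LinkedBlockingQueue queue threshold]
  (loop [batch []]
    (let [item (.take queue)]
      (cond
        ;; Sentinel: flush whatever is left and stop.
        (= item ::done)
        (when (seq batch) (upload-batch! batch))

        ;; This item fills the batch: send it and start a new one.
        (= (inc (count batch)) threshold)
        (do (upload-batch! (conj batch item))
            (recur []))

        :else
        (recur (conj batch item))))))

(def tweet-queue (LinkedBlockingQueue.))
(doseq [i (range 7)] (.put tweet-queue (str "tweet " i)))
(.put tweet-queue ::done)
(consume-in-batches! tweet-queue 3)
;; @batches => [["tweet 0" "tweet 1" "tweet 2"]
;;              ["tweet 3" "tweet 4" "tweet 5"]
;;              ["tweet 6"]]
```

A real stream has no natural end, so a time-based flush (for example, polling the queue with a timeout) would probably be needed alongside the size threshold. CouchDB does expose a bulk endpoint (`_bulk_docs`); whether it suits the per-hashtag documents here is a separate question.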