Additional Delimiter

thruflo

Mar 14, 2010, 6:43:04 PM
to Twitter Development Talk
I'm consuming the Streaming API using the filter method (tracking some
user ids). I've noticed that I'm getting an extra, undocumented, line
before each length delimiter.

I connect and get the following coming down the pipe:

{{{

HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.17)

5DE
1496
{"coordinates":null, ... snip ..., "id":10487365330}

A52
2636
{"coordinates":null, ...snip ..., "id":10487377907}

592
1420
{"coordinates":null, ... snip ..., "id":10487298462}


}}}

Now, the Streaming API docs say, "Statuses are represented by a
length, in bytes, a newline, and the status text that is exactly
length bytes. Note that "keep-alive" newlines may be inserted before
each length."

This suggests the following read loop code (based on and equivalent to
the way tweepy's consumer is implemented):

{{{

# s is a connected socket
length = ''
while True:
    c = s.recv(1)
    if c == '\n':
        break
    length += c
length = length.strip()
if length.isdigit():
    length = int(length)
    status_data = s.recv(length)
    # do something with the data

}}}

However, if you look at the third status above, you can see that the
extra line can sometimes consist entirely of digits, in that case
``592``, which fairly effectively borks the consumer.

Now, I can hack that read loop in quite a few ways to accommodate this
extra data coming down the pipe. The question is, what's the best way
to do so? Is this something I can rely on, i.e. can I always look for a
line above the length delimiter? Will it always have three chars? Do
statuses always have > 1000 bytes?

Plus I'm wondering whether this has always been the case, or if there
are broken consumers missing tweets out there?

Thanks,

James.

Ed Costello

Mar 14, 2010, 10:42:58 PM
to twitter-deve...@googlegroups.com
On Mar 14, 2010, at 5:43 PM, thruflo wrote:
[…]

> However, if you look at the third status data from above, you see that
> the extra line can sometimes be a digit, in that case ``592``. Which
> fairly effectively borkes the consumer.

From that list you posted:
0x5DE is 1496 + 6 bytes (4 bytes for “1496” plus 2 LFs)
0xA52 is 2636 + 6 bytes
0x592 is 1420 + 6 bytes

Now, I don’t know whether it’s correct that it’s returning a length in hex followed by a length in decimal, but the lengths do appear to be correct if you interpret the first number as hex.
Twitter will have to respond whether this is the correct behavior or not.

-ed costello
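
The arithmetic above can be checked with a short Python sketch. The three pairs are copied from the sample output in the first message; the helper name `chunk_overhead` is made up for illustration. It also shows why the third sample trips the read loop: ``592`` is valid hex and also passes `isdigit()`.

```python
# (hex line, decimal line) pairs copied from the sample output in the thread.
pairs = [("5DE", 1496), ("A52", 2636), ("592", 1420)]


def chunk_overhead(hex_size, decimal_length):
    """Return how many bytes the hex value exceeds the decimal length by."""
    return int(hex_size, 16) - decimal_length


for hex_size, decimal_length in pairs:
    # Print the hex line, its decimal value, the overhead, and whether a
    # naive isdigit() check would mistake the hex line for a decimal length.
    print(hex_size, int(hex_size, 16),
          chunk_overhead(hex_size, decimal_length), hex_size.isdigit())
```

Each pair differs by the same 6 bytes, which fits the reading that the first number is a hex size wrapping the decimal length line and the status.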

Ed Costello

Mar 14, 2010, 10:56:49 PM
to twitter-deve...@googlegroups.com
On Sun, Mar 14, 2010 at 5:43 PM, thruflo <thr...@googlemail.com> wrote:
[..] I've noticed that I'm getting an extra, undocumented, line
before each length delimiter.

What command are you sending to Twitter, and which URL are you using? I can't replicate this (I'm just getting the decimal length in the responses).

--
-ed costello

John Kalucki

Mar 14, 2010, 10:58:24 PM
to twitter-deve...@googlegroups.com
You appear to be looking at the raw HTTP stream with chunked transfer encoding still applied. The documentation assumes that you are using an HTTP client, not the raw TCP stream. If you are using the raw TCP stream, you can try to play games and use the chunk encoding, but there are no guarantees that the chunks will always align with the payload.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.
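
For anyone who does stay on a raw socket, here is a minimal Python 3 sketch of the two layers kept separate: first strip the chunked transfer encoding, then parse the length-delimited statuses from the resulting byte stream without assuming that chunks and statuses line up. The function names `iter_dechunked` and `iter_statuses` are hypothetical, the sketch ignores chunk trailers, and it is not taken from tweepy or from the gists in this thread.

```python
import io


def iter_dechunked(fp):
    """Yield the payload of each HTTP/1.1 chunk from a binary file-like object."""
    while True:
        size_line = fp.readline()
        if not size_line:
            return  # connection closed
        # The chunk size is hexadecimal; drop any extension after ';'.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            return  # zero-length chunk terminates the body
        data = fp.read(size)
        if len(data) < size:
            return  # stream ended in the middle of a chunk
        fp.readline()  # consume the CRLF that follows the chunk data
        yield data


def iter_statuses(payloads):
    """Yield each status from de-chunked payloads, whatever the chunk boundaries."""
    buf = b""
    need = None  # byte length of the status currently being read
    for data in payloads:
        buf += data
        while True:
            if need is None:
                buf = buf.lstrip(b"\r\n")  # skip keep-alive newlines
                line, sep, rest = buf.partition(b"\n")
                if not sep:
                    break  # length line incomplete, wait for more data
                need, buf = int(line.strip()), rest
            if len(buf) < need:
                break  # status incomplete, wait for more data
            yield buf[:need]
            buf, need = buf[need:], None


if __name__ == "__main__":
    # Made-up stream: two statuses, split across chunks at an arbitrary point.
    body = b'10\r\n{"id":1}\r\n\r\n10\r\n{"id":2}\r\n'
    raw = b"".join(b"%X\r\n" % len(part) + part + b"\r\n"
                   for part in (body[:7], body[7:])) + b"0\r\n\r\n"
    for status in iter_statuses(iter_dechunked(io.BytesIO(raw))):
        print(status)
```

With a real connection, the file-like object could come from `sock.makefile("rb")` after the response headers have been read, since its `read(n)` keeps reading until `n` bytes or end of stream, unlike a single `recv(n)` call.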

thruflo

Mar 15, 2010, 7:47:48 AM
to Twitter Development Talk
Hi John,

Ah yes. You've exposed my lack of experience with the lower level
socket library. Sorry, I was making a fuss about nothing ;)

I moved to using a socket connection directly because I found that the
httplib based client tweepy was using tended to hang occasionally when
doing a low latency restart. I debugged the httplib internals and found
that it was hanging (maxing my CPU and making the process unresponsive
on an Ubuntu Linode and on OSX 10.4) at line 391 of python2.6's
httplib.py, in ``HTTPResponse.begin``::

    version, status, reason = self._read_status()

This was calling ``self.fp.readline()`` on the underlying socket. I
wasn't sure if this was a Twitter issue or an httplib issue, so I re-
implemented a raw socket consumer so I could see exactly what was
coming down the pipe.

I've found the socket approach more reliable - I can't replicate the
same error in a fair amount of trying (although I don't have a
scientific way of triggering it with httplib, so this is a pretty
loose test).

The code I was using for a few weeks in production which tended to
hang, maxing CPU every so often when doing a low latency restart:

http://gist.github.com/332769

The raw socket version which seems to work, pending dealing with the
chunked encoding:

http://gist.github.com/332758
http://gist.github.com/332759

Thanks,

James.

On Mar 15, 2:58 am, John Kalucki <j...@twitter.com> wrote:
> You appear to be looking at the raw HTTP chunk transfer encoded stream. The
> documentation assumes that you are using a HTTP client, not the raw TCP
> stream. If you are using the raw TCP stream, you can try to play games and
> use the chunk encoding, but there are no guarantees that the chunks will
> always align with the payload.
>

> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
