[Boost-users] [asio] socket::read_some() splits data into two parts.

Slav

May 11, 2011, 8:28:47 AM

to boost...@lists.boost.org

Hello.
I sent 8004 bytes to a server over a TCP connection and successfully read them like this:

    int readBytes;
    const int BUFFER_SIZE = 128;
    char charBuf[BUFFER_SIZE];
    do
    {
        readBytes = socket.read_some(boost::asio::buffer(charBuf, BUFFER_SIZE));
    }
    while(readBytes >= BUFFER_SIZE);

But sometimes this code reads just 3752 bytes (precisely) and returns. After that it handles another async_read and reads 4252 bytes (which in sum gives 8004 bytes).
Am I using the "read_some()" function in the right way?

Andrew Holden

May 11, 2011, 9:42:59 AM

to boost...@lists.boost.org

Try changing the last line to:

while(readBytes != 0);
_______________________________________________
Boost-users mailing list
Boost...@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Slav

May 11, 2011, 9:57:07 AM

to boost...@lists.boost.org

Then messages whose length is not a multiple of 128 (BUFFER_SIZE) will not be read. I tested it anyway: after the last socket::read_some() (with readBytes > 0 && < BUFFER_SIZE), the next socket::read_some() never returns.

Andrew Holden

May 11, 2011, 10:36:11 AM

to boost...@lists.boost.org

It would probably help to understand that TCP has no concept of a
"message". Anything you write to a socket is appended to a stream of
*bytes*. Several subsystems on both the sending and receiving computers
will have the option of splitting and combining adjacent buffers with no
consideration to how big each individual write was. Read_some has the
option of reading any size up to and including the buffer size, but a
read of less than that size does not mean "end of message". It could
also mean "network congestion", "cable unplugged", or "Windows just felt
lazy".

On further thought, I think I see the problem (and I apologize for the
bad recommendation in my last email). Your sender somehow needs to
communicate the message size or flag the end of the message. A partial
list of options includes:

* Begin the message with a field specifying its total length. The
receiving loop must read this length and then count bytes until it has
the whole message, keeping in mind that each read_some can read any
number of bytes.

* Begin each message with a message id, where each message id has a
known length. Once you calculate the length, count bytes as I described
above.

* End the message with a terminator. You could set up a line-oriented
protocol where a newline terminates the read. With some thought, you
might think of some other terminating byte or string appropriate to your
protocol.

In all cases, unless your final read_some has a carefully controlled
size, remember that your buffer may contain the beginning of the next
message, or even multiple complete messages. If so, it is your
responsibility to retain this until you are ready to process the next
message.
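
As a rough illustration of the terminator option, and of retaining leftover bytes for the next message, here is a minimal C++ sketch. LineSplitter is a hypothetical helper, not part of Boost.Asio. It assumes that '\n' never appears inside a message, and it accepts whatever chunk sizes the socket happens to deliver.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost.Asio class): accumulates raw bytes and
// hands back complete newline-terminated lines, keeping any partial line
// for the next call.
class LineSplitter
{
public:
    // Append one chunk, exactly as returned by a single read, and return
    // every complete line it finishes (without the trailing '\n').
    std::vector<std::string> feed(const char* data, std::size_t size)
    {
        pending_.append(data, size);

        std::vector<std::string> lines;
        std::size_t start = 0;
        for (;;)
        {
            std::size_t end = pending_.find('\n', start);
            if (end == std::string::npos)
                break;                      // the rest is an incomplete line
            lines.push_back(pending_.substr(start, end - start));
            start = end + 1;
        }
        pending_.erase(0, start);           // retain only the unfinished tail
        return lines;
    }

    // Bytes received so far that do not yet form a complete line.
    const std::string& pending() const { return pending_; }

private:
    std::string pending_;
};
```

Each chunk returned by read_some() would be passed to feed(). Whatever follows the last newline stays inside the splitter until a later chunk completes it.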

Igor R

May 11, 2011, 10:39:17 AM

to boost...@lists.boost.org

This is exactly what "read_some" does -- it reads SOME data. It may
read even 1 byte.
If you know exactly how many bytes you expect to get, use read() free
function with the appropriate completion condition:
http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio/reference/read.html
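
To make the difference concrete, the sketch below shows the kind of loop that a "read exactly N bytes" operation has to perform on top of read_some(). It is only an illustration: ChunkedSource is a made-up stand-in for a socket so the example runs without a network, and read_exactly is a hypothetical name, not the Boost.Asio implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a socket, for illustration only: it hands out the queued
// bytes in pieces of at most max_chunk bytes, the way read_some() may.
struct ChunkedSource
{
    std::string data;
    std::size_t max_chunk;
    std::size_t pos;

    // Copies at most max_chunk bytes; returns 0 once the data is exhausted.
    std::size_t read_some(char* out, std::size_t capacity)
    {
        std::size_t n = std::min(std::min(capacity, max_chunk), data.size() - pos);
        std::memcpy(out, data.data() + pos, n);
        pos += n;
        return n;
    }
};

// Keep calling read_some() until exactly `wanted` bytes have arrived.
// Returns the number of bytes read, which is smaller than `wanted`
// only if the source ran dry first.
template <typename Source>
std::size_t read_exactly(Source& source, char* out, std::size_t wanted)
{
    std::size_t total = 0;
    while (total < wanted)
    {
        std::size_t n = source.read_some(out + total, wanted - total);
        if (n == 0)
            break;              // end of stream before the message completed
        total += n;
    }
    return total;
}
```

With a source that delivers at most 3752 bytes per call, a single read_some() on an 8004-byte message returns 3752, while read_exactly() keeps going until all 8004 bytes are in the buffer.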

Slav

May 11, 2011, 10:49:26 AM

to boost...@lists.boost.org

Now it's clear to me.
Previously I thought that socket::async_read() (inside which I call read_some() multiple times) would be invoked only once all of the data had been received, and that read_some() would return 0 (or a value less than the buffer size) between separately received batches of data. That is not how it works.
Thank you.

Slav

May 11, 2011, 10:51:39 AM

to boost...@lists.boost.org

It should be reflected in the documentation, shouldn't it?

Igor R

May 11, 2011, 10:57:44 AM

to boost...@lists.boost.org

> It should be reflected in the documentation, shouldn't it?

"The function call will block until one or more bytes of data has been
read successfully, or until an error occurs."

"Remarks:
The read_some operation may not read all of the requested number of
bytes. Consider using the read function if you need to ensure that the
requested amount of data is read before the blocking operation
completes."

http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio/reference/basic_stream_socket/read_some/overload1.html

Slav

May 11, 2011, 11:11:35 AM

to boost...@lists.boost.org

"The read_some operation may not read all of the requested number of bytes."

is pretty clear, because an incoming message isn't bound to BUFFER_SIZE, which is just the place to store intermediate data.

"Consider using the read function if you need to ensure that the requested amount of data is read before the blocking operation completes."

sounds like "read until my buffer is full regardless of the incoming message length", not like "read until the logical message has been fully read, using my buffer as the place to put the data regardless of its size".

And, by the way, empty messages could arrive, with no data at all.

TCP contains info about message length - it is duplication to prefix all messages with its length.

Brad Howes

May 11, 2011, 8:33:39 AM

to boost...@lists.boost.org

On May 11, 2011, at 2:28 PM, Slav wrote:

> I sent 8004 bytes to server through TCP connection and successfully read it like this:

> But sometimes this code reads just 3752 bytes (precisely) and returns. After that it handles another async_read and reads 4252 bytes (which in sum gives 8004 bytes).

Well since your buffer is only 128 bytes long, I hope that you are getting that amount or less from each read_some call. But in general, what you are seeing is normal TCP behavior.

Brad

--
Brad Howes
Calling Team - Skype Prague
Skype: br.howes

Slav

May 11, 2011, 11:25:04 AM

to boost...@lists.boost.org

I meant that I read the data in multiple calls (accumulating it with std::string::append(charBuffer)) for as long as readBytes >= BUFFER_SIZE, and that loop sometimes ends early (read_some() returns a value less than BUFFER_SIZE before all 8004 bytes have been read).
Now everything is clear to me.
Thanks everyone!

Eric J. Holtman

May 11, 2011, 11:51:53 AM

to boost...@lists.boost.org

On 5/11/2011 10:11 AM, Slav wrote:
>
>
> TCP contains info about message length - it is duplication to prefix all
> messages with its length.
>

No, it really doesn't. All you get is a stream of bytes,
reliably delivered, in sequence.

So if you write 3,732 bytes onto a socket, there is *NO WAY*
for the reader to tell that you did that. The reader might
get 1 read of 3732 bytes, or 3732 reads of 1 byte, or anything
in between.

The reader could even read *more* than 3732 bytes in one read,
if you wrote more than once.

If you think you're going to get repeatable, identical matching
pairs of reads and writes out of just a TCP socket (that is,
without imposing your own protocol on top of TCP), you are
in for endless hours/days/months of frustration.

Andrew Holden

May 12, 2011, 10:26:53 AM

to boost...@lists.boost.org

On Wednesday, May 11, 2011 11:12 AM, Slav wrote:
>
> And, by the way, there could come empty messages - just
> without any data.

TCP does not have a concept of an empty write.

> TCP contains info about message length - it is duplication
> to prefix all messages with its length.

Where did you read that? TCP has NO concept of "messages". As such, it
has no concept of "message length". It is only a stream of bytes. The
sending computer can easily combine the buffers from two consecutive
write calls into a single packet, or split the buffer from a single
write call into multiple packets, or both. In either case, ALL
information about the size of the original write call(s), the number of
write calls, and anything else that you hope will provide a clue about
"messages" will be lost. Likewise, the receiving computer can and will
freely combine and split packets into whatever buffers it sees fit, with
similar effects on any "message boundaries". The only thing that will
remain is the sequence of bytes.

Do not try to search for TCP options to change this behavior. The
closest you can come is options that will *usually* keep the message
boundaries. This means that your program will *usually* not crash.

If you wish to preserve message boundaries, then you MUST provide your
own message framing, just as you would if writing to a file. If you
prefix each message with its length, then you can use the read function
to ensure you get the whole message in one call, as you will know the
message length. This will also be effective at ensuring you don't have
the beginning of the next message at the end of your buffer.
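
A minimal sketch of that two-step read follows, assuming a framing of a 4-byte big-endian length followed by the payload (the thread does not fix a particular header format). read_message is a hypothetical name, and std::istream::read() stands in for the read function here because it likewise delivers the requested byte count or reports failure.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Read one message framed as a 4-byte big-endian length followed by that
// many payload bytes. Returns false if the stream ends before the message
// is complete. The framing format is an assumption made for this sketch.
bool read_message(std::istream& in, std::string& payload)
{
    // Step 1: read exactly the fixed-size header.
    unsigned char header[4];
    if (!in.read(reinterpret_cast<char*>(header), sizeof header))
        return false;

    std::uint32_t length = (std::uint32_t(header[0]) << 24) |
                           (std::uint32_t(header[1]) << 16) |
                           (std::uint32_t(header[2]) << 8)  |
                            std::uint32_t(header[3]);

    // Step 2: read exactly `length` bytes, so nothing belonging to the
    // next message is consumed. A real receiver would cap `length` before
    // resizing, since the value comes from the peer.
    payload.resize(length);
    if (length != 0 && !in.read(&payload[0], length))
        return false;
    return true;
}
```

Because the header is always present, a zero-length message is still visible to the receiver as four zero bytes. A std::istringstream holding a few framed messages is enough to exercise the function.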

Slav

May 12, 2011, 10:36:20 AM

to boost...@lists.boost.org

Yeah - I was really mistaken. Thanks for correcting me.
I reimplemented the reading using message length prefix and now everything works fine.
Just one question is left: the socket has a "receive_buffer_size" option which defaults to 8192. Does that mean a message longer than that (if it arrives all at once) will be truncated? Or can I still read it with multiple "async_read" calls (collecting it into a buffer using the data length prefix)?

Marsh Ray

May 12, 2011, 1:10:50 PM

to bo...@lists.boost.org, boost...@lists.boost.org

On 05/11/2011 09:36 AM, Andrew Holden wrote:
> On Wednesday, May 11, 2011 9:57 AM, Slav wrote:
>> Then messages of length not multiple of 128 (BUFFER_SIZE)
>> will not be read - tested it, anyway, after
>> last socket::read_some() (with readBytes > 0 && <
>> BUFFER_SIZE) next socket::read_some() never ends.
>
> It would probably help to understand that TCP has no concept of a
> "message". Anything you write to a socket is appended to a stream of
> *bytes*.

Alternatively, we could say that TCP, in fact, does have a well-defined
concept of messages: they are all exactly one byte long.

> Several subsystems on both the sending and receiving computers

... and sometimes boxes in the middle ...

> will have the option of splitting and combining adjacent buffers with no
> consideration to how big each individual write was.

I've done a little protocol stuff with ASIO now and I must say it's a
lot of fun and I can't go back to doing it any other way.

> On further thought, I think I see the problem (and I apologize for the
> bad recommendation in my last email). Your sender somehow needs to
> communicate the message size or flag the end of the message. A partial
> list of options includes:

The pattern I encounter over and over again (often at multiple levels in
a protocol) is:

class protocol_layer_context
{
    vector<uint8> buffer;

    void on_received_data(vector<uint8> & rx_bytes)
    {
        // accumulate the newly received bytes
        buffer.insert(buffer.end(), rx_bytes.begin(), rx_bytes.end());

        // perhaps a virtual override.
        size_t msg_len = this->parse_len_from_start_of_buffer();

        // a complete message is available once we hold at least msg_len bytes
        if (buffer.size() >= msg_len)
        {
            vector<uint8> msg_buf =
                consume_data_from_front_of_buffer(buffer, msg_len);

            // perhaps a virtual override
            this->process_complete_message(msg_buf);
        }

        // post another ASIO read request
        this->request_more_data();
    }
    ...

But there are some important issues with this naive pseudocode:

1. It can result in recopying the data a bunch of times for every
protocol layer, killing performance.

2. It's susceptible to a denial-of-service (DoS). A bad guy can trick
you into allocating all your memory.

3. Sometimes the length of a message is stated at the beginning of the
message, sometimes it isn't known until the end.

4. No processing happens on the message until it's completely read, but
some protocols really need the receiving endpoint to process it
incrementally.

5. Error handling

6. Optimal threading

7. Etc.

We find bugs in exactly this logic all the darn time. Often the data
being received is untrusted and possibly malicious. Real-world protocol
implementations will commonly crash under fragmentation fuzzing,
sometimes resulting in exploitable security holes.
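
For illustration only, here is one way the buffering pattern above might be hardened against the memory-exhaustion issue. FrameDecoder, the 4-byte big-endian length header, and the size limit are all assumptions made for this sketch and are not taken from any library. It handles several messages per chunk and rejects an oversized length before buffering the body. It does not address incremental processing, threading, or the copy overhead listed above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical incremental decoder for messages framed as a 4-byte
// big-endian length followed by the payload.
class FrameDecoder
{
public:
    explicit FrameDecoder(std::size_t max_message_size)
        : max_message_size_(max_message_size) {}

    // Feed one chunk of received bytes; returns every message completed by it.
    // Throws std::length_error if the peer announces an oversized message.
    std::vector<std::string> feed(const char* data, std::size_t size)
    {
        buffer_.append(data, size);

        std::vector<std::string> messages;
        std::size_t offset = 0;
        while (buffer_.size() - offset >= 4)
        {
            std::size_t length = 0;
            for (int i = 0; i < 4; ++i)
                length = (length << 8) |
                         static_cast<unsigned char>(buffer_[offset + i]);

            // The length field is untrusted input: refuse it before
            // waiting for (and storing) that many bytes.
            if (length > max_message_size_)
                throw std::length_error("announced message length exceeds limit");

            if (buffer_.size() - offset - 4 < length)
                break;                          // body not complete yet

            messages.push_back(buffer_.substr(offset + 4, length));
            offset += 4 + length;
        }
        buffer_.erase(0, offset);               // drop the consumed messages
        return messages;
    }

    // Number of bytes held that do not yet form a complete message.
    std::size_t buffered() const { return buffer_.size(); }

private:
    std::size_t max_message_size_;
    std::string buffer_;
};
```

After a std::length_error the decoder's state is no longer meaningful, so the caller would be expected to close the connection and not feed it again.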

In a sense, this is the general refactoring problem of
'incrementalizing' a parsing function by moving all its state from stack
variables into a longer-lived context object.

We've seen it done successfully with coroutines, but that's not a
commonly accepted solution because, frankly, the native C/C++ runtimes
have not yet given coroutines the love (i.e., portability and
performance guarantees) they really deserved.

If someone figured out how to leverage generic techniques to handle just
the unidirectional message delimiting problem in a bulletproof way I
think it would make a really great boost library.

- Marsh

Ted Byers

May 12, 2011, 2:17:40 PM

to boost...@lists.boost.org, bo...@lists.boost.org

> -----Original Message-----
> From: boost-use...@lists.boost.org [mailto:boost-users-
> bou...@lists.boost.org] On Behalf Of Marsh Ray
> Sent: May-12-11 1:11 PM
> To: bo...@lists.boost.org; boost...@lists.boost.org
> Subject: [Boost-users] Delimiting protocol messages (was [asio]
> read_some() splits data)
>
> On 05/11/2011 09:36 AM, Andrew Holden wrote:
> > On Wednesday, May 11, 2011 9:57 AM, Slav wrote:
> >> Then messages of length not multiple of 128 (BUFFER_SIZE) will not be
> >> read - tested it, anyway, after last socket::read_some() (with
> >> readBytes > 0 && <
> >> BUFFER_SIZE) next socket::read_some() never ends.
> >
> > It would probably help to understand that TCP has no concept of a
> > "message". Anything you write to a socket is appended to a stream of
> > *bytes*.
>
> Alternatively, we could say that TCP, in fact, does have a well-defined
> concept of messages: they are all exactly one byte long.
>

Related to this, I wonder if there are any class libraries that facilitate
processing these byte streams.

I read, in the rationale part of the documentation for ASIO, the following:

"Basis for further abstraction. The library should permit the development of
other libraries that provide higher levels of abstraction. For example,
implementations of commonly used protocols such as HTTP."

It seems like such an obvious thing to do: to write a class library that
contains classes that use the TCP capabilities of boost::asio to
automagically take data read from the socket and do whatever is needed. For
example, one might want to construct a series of http requests from the data
coming in on port 443, and be able to relate the addressing data in the
application layer to that in the TCP layer, and use that comparison to
determine whether to forward the request to server A or server B. One
reason for doing so would be, for example, my own edification (and that of
anyone else interested in learning) about how the different OSI layers work.
Another would be for security purposes (e.g. to know whether or not an
authorized user's session has been hijacked).

It seems to me to be an obvious thing to do, but my question to you is "Do
you know of anyone who has done it?" (in some kind of open source project)
If not, do you know of resources available online where I could learn how to
do it? I am finding it hard to find resources that are useful: I have well
developed C++ skills, e.g. to write custom IO stream classes, but need some
guidance on how to proceed with the 'further abstraction' the docs mention,
and what the recommended best practices are specific to (high performance,
secure) networking program development.

You said, " I've done a little protocol stuff with ASIO now and I must say
it's a lot of fun and I can't go back to doing it any other way." How did
you get started on it? Did you use any documentation other than the asio
docs? Do you know of any documents (ideally online) that show how you could
use this stuff to thwart the major kinds of attacks that can be made on a
web server?

Thanks

Ted

Andrew Holden

May 12, 2011, 2:38:51 PM

to boost...@lists.boost.org

TCP will not alter the byte stream, also meaning it will not drop bytes.
If the receive buffer fills, then it will tell the other machine it is
sending data too fast and needs to slow down. It will also have the
other machine resend the data that couldn't fit in the receive buffer.
Your programs (on both ends) will not need to address this issue; the
operating system will handle it for you.

That said, experimenting with the receive buffer size may improve
performance, but will have no effect on correctness. Don't assume more
is better. If you make the buffers too big, you'll just increase
overhead and latency.

Slav

May 12, 2011, 4:29:43 PM

to boost...@lists.boost.org

When I raised this question I thought that TCP itself was able to determine the message size, but it is not (I previously made wide use of ENet, which provides reliable data transmission over UDP; maybe that is why I was misled). An important use of an "application protocol" built upon the "message size prefix" protocol (if such a small thing can be called a "protocol") could be the serialization of C++ objects using boost::serialization, so the whole protocol stack would look like this:
Ethernet -> IP -> TCP -> "message size prefix" -> "boost::serialization"
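
The "message size prefix" layer on the sending side can be very small. The sketch below assumes the serialized object is already available as a std::string (for example, the contents of a string stream that an archive wrote to) and that the prefix is a 4-byte big-endian length. frame_message is a hypothetical name, and the header format is an assumption, not something the thread specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Prefix an already serialized payload with its length as a 4-byte
// big-endian integer, producing the bytes to write to the socket.
std::string frame_message(const std::string& payload)
{
    if (payload.size() > 0xFFFFFFFFu)
        throw std::length_error("payload too large for a 32-bit length prefix");

    std::uint32_t length = static_cast<std::uint32_t>(payload.size());

    std::string framed;
    framed.reserve(4 + payload.size());
    framed.push_back(static_cast<char>((length >> 24) & 0xFF));
    framed.push_back(static_cast<char>((length >> 16) & 0xFF));
    framed.push_back(static_cast<char>((length >> 8) & 0xFF));
    framed.push_back(static_cast<char>(length & 0xFF));
    framed += payload;                  // payload bytes follow the header
    return framed;
}
```

Writing the byte order out explicitly keeps the wire format independent of the endianness of either machine.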

Jerry

May 12, 2011, 5:28:12 PM

to boost...@lists.boost.org

Ted Byers <r.ted.byers <at> gmail.com> writes:

>
> It seems to me to be an obvious thing to do, but my question to you is "Do
> you know of anyone who has done it?" (in some kind of open source project)

What about: http://cpp-netlib.github.com/

Apparently it will be submitted for review someday:
http://comments.gmane.org/gmane.comp.lib.boost.user/67431

>
> Thanks
>
> Ted
>

Jerry

Marsh Ray

May 12, 2011, 5:45:09 PM

to boost...@lists.boost.org, Jerry

On 05/12/2011 04:28 PM, Jerry wrote:
> Ted Byers<r.ted.byers<at> gmail.com> writes:
>>
>> It seems to me to be an obvious thing to do, but my question to you is "Do
>> you know of anyone who has done it?" (in some kind of open source project)
>
> What about: http://cpp-netlib.github.com/

Looks interesting. This page in particular looks like it's getting close
to what I was talking about:

http://cpp-netlib.github.com/0.9.0/message.html

I realize the project is new and the docs may not be complete, but every
other page seems to be about its HTTP implementation. Even the generic
basic_message class presumes a headers/body structure.

HTTP is often thought of as a half-duplex message/response protocol
because it is (mostly) stateless and originally closed the connection after
every response.

I was interested more in a general facility for a common low-level
protocol buffering pattern.

> Apparently it will be submitted for review someday:
> http://comments.gmane.org/gmane.comp.lib.boost.user/67431

Cool.

- Marsh

Ted Byers

May 12, 2011, 5:59:30 PM

to boost...@lists.boost.org

> -----Original Message-----
> From: boost-use...@lists.boost.org [mailto:boost-users-
> bou...@lists.boost.org] On Behalf Of Marsh Ray
> Sent: May-12-11 5:45 PM
> To: boost...@lists.boost.org
> Cc: Jerry
> Subject: Re: [Boost-users] Delimiting protocol messages (was [asio]
> read_some() splits data)
>

> On 05/12/2011 04:28 PM, Jerry wrote:
> > Ted Byers<r.ted.byers<at> gmail.com> writes:
> >>
> >> It seems to me to be an obvious thing to do, but my question to you
> >> is "Do you know of anyone who has done it?" (in some kind of open
> >> source project)
> >
> > What about: http://cpp-netlib.github.com/
>
> Looks interesting. This page in particular looks like it's getting close
> to what I was talking about:
>

Yes, it is interesting.

> http://cpp-netlib.github.com/0.9.0/message.html
>
> I realize the project is new and the docs may not be complete, but every
> other page seems to be about its HTTP implementation. Even the generic
> basic_message class presumes a headers/body structure.
>

What I haven't found yet is a way to compare the IP info in the TCP
packets with the IP info in the HTTP headers. That is in particular. More
generally, I am looking for an online resource for learning network
programming in general and security-related network programming in
particular.

Cheers

Ted

Brad Howes

May 12, 2011, 6:23:48 PM

to boost...@lists.boost.org

On May 12, 2011, at 8:17 PM, Ted Byers wrote:

> Related to this, I wonder if there are any class libraries that facilitate
> processing these byte streams.

Have you looked at boost::serialization? There is an example in boost::asio on how to use them together.

Brad

--
Brad Howes
Calling Team - Skype Prague
Skype: br.howes


Marsh Ray

May 12, 2011, 7:18:57 PM

to boost...@lists.boost.org

On 05/12/2011 04:59 PM, Ted Byers wrote:
>
> What I haven't found yet is a way to compare the IP info in the TCP
> packets with the IP info in the HTTP headers.

Sometimes a proxy will add something, but usually there aren't any IP
addresses in HTTP headers.

> That is in particular. More
> generally, I am looking for an online resource for learning network
> programming in general and security-related network programming in
> particular.

That's interesting. There are resources about secure programming, and
securing networks, but I don't see much new stuff about basic network
programming. They are probably casualties of the trend to make all
communications run over HTTP(s).

I don't recall ever seeing a book or online resource saying "here's how
to accept data from the network and process it in the most scalable and
secure way using C or C++".

On the crypto side of things I recommend:
> http://www.amazon.com/Cryptography-Engineering-Principles-Practical-Applications/dp/0470474246

I tweeted your question:
https://twitter.com/marshray/status/68810041234432000

Got this recommendation, doesn't seem to be too related to network
programming though. Perhaps we'll get more.
> http://www.amazon.com/Memory-Programming-Concept-Frantisek-Franek/dp/0521520436

- Marsh
