writev returns less than expected

Jim Marshall

Oct 18, 2007, 3:30:24 PM

Hope this is the right group to post to, if not please direct me as needed.

I have an application which is using NON-blocking sockets. My
application attempts to send a response using the writev function (I am
testing on Fedora Core 6, "Linux starfury 2.6.22.1-32.fc6 #1 SMP Wed Aug
1 14:10:08 EDT 2007 i686 i686 i386 GNU/Linux"). In some circumstances
writev does not write all of the data to the socket. When this happens
the client only gets part of the response and then sits there waiting for
more data.
It is not clear to me from the docs why this might happen.

This seems to be an issue depending on the amount of data being sent
back. For example if I am sending 3000 bytes it works fine, but when I
try to send 30K bytes I run into the problem.

This is the code I am using:

ssize_t senddata(int socket, const char *buffer, int length)
{
    ssize_t ret = 0;
    CCIMBool shouldRetry = CCIMFalse;
    int retVal = -1;
    struct iovec iov[] = { {NULL, 0} };
    iov[0].iov_base = (char*)buffer;
    iov[0].iov_len = length;
    do {
        ret = writev(socket, iov, 1);
        if (ret == -1) {
            ... handle error or need to retry
        }
    } while (shouldRetry == CCIMTrue);

In the case where I run into the problem, the buffer length is 36333
(according to strlen) and writev only returns 11340. Using Wireshark I
can see that only part of the message is being sent to the client.

errno is not set - but then again writev is not returning '-1'

Any help appreciated.

Thanks
-Jim

moi

Oct 18, 2007, 4:03:11 PM

On Thu, 18 Oct 2007 15:30:24 -0400, Jim Marshall wrote:

> Hope this is the right group to post to, if not please direct me as
> needed.
>
> I have an application which is using NON-blocking sockets. My
> application attempts to send a response using the writev function (I am
> testing on Fedora Core 6, "Linux starfury 2.6.22.1-32.fc6 #1 SMP Wed Aug
> 1 14:10:08 EDT 2007 i686 i686 i386 GNU/Linux"), in some circumstances
> writev does not write all of the data to the socket. When this happens
> the client only gets part of the request and then sits there waiting for
> more data.
> It is not clear to me why this might happen from the docs.

man writev:

RETURN VALUE
On success, the readv() function returns the number of bytes read; the
writev() function returns the number of bytes written. On error, -1 is
returned, and errno is set appropriately.

which MEANS: writev() (just like write()) MAY write fewer bytes than your
'length' argument. You'll have to take care of sending the rest of the
data in your buffers in subsequent calls.
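
For a single buffer, that retry loop might look roughly like the sketch below. It is only an illustration under assumptions: write_all is a made-up name, it uses plain write() rather than writev(), and waiting with poll() is just one way to handle a full socket buffer on a non-blocking descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from the original post): write all `len` bytes
 * of `buf` to `fd`. Returns 0 on success, -1 on a real error. */
int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0) {
            buf += n;              /* skip the bytes that were accepted */
            len -= (size_t)n;
            continue;
        }
        if (errno == EINTR)
            continue;              /* interrupted before writing: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Non-blocking fd with no buffer space: wait until writable. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;                 /* anything else is a real error */
    }
    return 0;
}
```

In an event-driven program you would more likely remember the unsent remainder and return to your select()/poll() loop instead of waiting inside the helper.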

>
> This seems to be an issue depending on the amount of data being sent
> back. For example if I am sending 3000 bytes it works fine, but when I
> try to send 30K bytes I run into the problem.
>
> This is the code I am using:
>
> ssize_t senddata(int socket, const char *buffer, int length) {
> ssize_t ret = 0;
> CCIMBool shouldRetry = CCIMFalse;
> int retVal = -1;
> struct iovec iov[] = { {NULL, 0} };
> iov[0].iov_base = (char*)buffer;
> iov[0].iov_len = length;
> do {
> ret = writev(socket, iov, 1);
> if (ret == -1) {
> ... handle error or need to retry
> }
> } while (shouldRetry == CCIMTrue);
>
> In the case where I run into the problem, the buffer length is 36333
> (according to strlen) and writev only returns 11340. Using WireShark I

Forget strlen(). It only counts the bytes before the first nul in a string.
Do you WANT to send nuls?

> can see that only part of the message is being sent to the client.
>
> errno is not set - but then again writev is not returning '-1'

errno is only set on errors (the -1 return). It is meaningless otherwise.

> Any help appreciated.
>

HTH,
AvK

Jim Marshall

Oct 18, 2007, 4:31:09 PM

moi wrote:
> On Thu, 18 Oct 2007 15:30:24 -0400, Jim Marshall wrote:
>
>> Hope this is the right group to post to, if not please direct me as
>> needed.
>>
>> I have an application which is using NON-blocking sockets. My
>> application attempts to send a response using the writev function (I am
>> testing on Fedora Core 6, "Linux starfury 2.6.22.1-32.fc6 #1 SMP Wed Aug
>> 1 14:10:08 EDT 2007 i686 i686 i386 GNU/Linux"), in some circumstances
>> writev does not write all of the data to the socket. When this happens
>> the client only gets part of the request and then sits there waiting for
>> more data.
>> It is not clear to me why this might happen from the docs.
>
> man writev:
>
> RETURN VALUE
> On success, the readv() function returns the number of bytes read; the
> writev() function returns the number of bytes written. On error, -1 is
> returned, and errno is set appropriately.
>
> , which MEANS: writev() (,just like write()) MAY write less than your
> 'length' argument. You'll have to take care of sending the rest of the
> data in your buffers in subsequent calls.

I was under the impression that writev was atomic, my bad. However, this
confuses me now: if you had more than one item in the vector, it would be
very difficult to figure out which part of the vector wasn't sent. I'm
not sure I follow the point of this function.

Guess I'll have to play with it some more.

Thanks

Rick Jones

Oct 18, 2007, 5:32:58 PM

Jim Marshall <jim.ma...@wbemsolutions.com> wrote:
> Hope this is the right group to post to, if not please direct me as needed.

> I have an application which is using NON-blocking sockets. My
> application attempts to send a response using the writev function (I
> am testing on Fedora Core 6, "Linux starfury 2.6.22.1-32.fc6 #1 SMP
> Wed Aug 1 14:10:08 EDT 2007 i686 i686 i386 GNU/Linux"), in some
> circumstances writev does not write all of the data to the
> socket. When this happens the client only gets part of the request
> and then sits there waiting for more data. It is not clear to me
> why this might happen from the docs.

> This seems to be an issue depending on the amount of data being sent
> back. For example if I am sending 3000 bytes it works fine, but when I
> try to send 30K bytes I run into the problem.

Since the socket is non-blocking, your writev will do a partial write
whenever there is less free space in the socket's send buffer than you
are trying to write with writev.

If you *know* you will "never" try to write more than N bytes into a
socket at one time, and can "know" when that has been emptied (say, by
the receipt of data from the remote indicating it has all you sent
previously), you could use setsockopt(SO_SNDBUF) to set the socket
buffer to be larger than your largest writev() call. Then, even if the
socket is non-blocking, writev() should always write everything,
assuming it can allocate space: socket buffer sizes are limits, not
preallocations. After your setsockopt() call, you should make a
getsockopt() call to make sure you got at least the size you wanted.
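
A rough sketch of that set-then-check sequence, in case it helps. The helper name request_sndbuf is invented for illustration, and the size you pass is whatever your own largest write happens to be.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask for a send buffer of `wanted` bytes and return
 * the size the kernel reports afterwards, or -1 on error. */
int request_sndbuf(int sock, int wanted)
{
    assert(wanted > 0);
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &wanted, sizeof wanted) == -1)
        return -1;

    int actual = 0;
    socklen_t optlen = sizeof actual;
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &actual, &optlen) == -1)
        return -1;
    return actual;   /* caller compares this against `wanted` */
}
```

Compare the result with >= rather than ==. On Linux, socket(7) documents that the kernel doubles the requested value and getsockopt() reports the doubled figure, and the request may also be capped by a system-wide limit.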

One or more of the texts of Stevens, Fenner and Rudoff, or Stallings
might be good to add to your bookshelf - they cover all sorts of stuff
like this.

I trust you already know that even if you did put all N KB into the
socket at once, your receiver may well still get it in smaller chunks,
right? I.e., TCP is a byte-stream service and message boundaries are
not preserved.

rick jones
--
a wide gulf separates "what if" from "if only"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Bit Banger

Oct 18, 2007, 8:01:07 PM

Jim Marshall <jim.ma...@wbemsolutions.com> wrote:

>> , which MEANS: writev() (,just like write()) MAY write less than your
>> 'length' argument. You'll have to take care of sending the rest of the
>> data in your buffers in subsequent calls.
>I was under the impression that writev was atomic, my bad. However this
>confuses me now as if you had more then 1 item in the vector it would be
>very difficult to figure out which part of the vector wasn't sent. I'm
>not sure I follow the point of this function.

My guess is you're running up against the amount of TCP socket buffer
space in whichever OS you're using. Since your socket is set to
non-blocking, when those buffers fill up, your write() or writev()
returns immediately. Otherwise, your call would block until the client
ACKs enough data to free up the buffer space.

The only point of writev() versus write() is that you don't have to
marshal your data into one buffer yourself. You are correct, though,
that in your application, it may be a bit of work to calculate where
the writev() left off in order to send the rest of your message.

There is probably some tuning you can do to increase the TCP buffer
sizes. What OS is this on?

fjb...@yahoo.com

Oct 18, 2007, 9:15:49 PM

On Oct 18, 1:31 pm, Jim Marshall <jim.marsh...@wbemsolutions.com> wrote:

> moi wrote:
> > On Thu, 18 Oct 2007 15:30:24 -0400, Jim Marshall wrote:
>
> >> Hope this is the right group to post to, if not please direct me as
> >> needed.
>
> >> I have an application which is using NON-blocking sockets. My
> >> application attempts to send a response using the writev function (I am
> >> testing on Fedora Core 6, "Linux starfury 2.6.22.1-32.fc6 #1 SMP Wed Aug
> >> 1 14:10:08 EDT 2007 i686 i686 i386 GNU/Linux"), in some circumstances
> >> writev does not write all of the data to the socket. When this happens
> >> the client only gets part of the request and then sits there waiting for
> >> more data.
> >> It is not clear to me why this might happen from the docs.
>
> > man writev:
>
> > RETURN VALUE
> > On success, the readv() function returns the number of bytes read; the
> > writev() function returns the number of bytes written. On error, -1 is
> > returned, and errno is set appropriately.
>
> > , which MEANS: writev() (,just like write()) MAY write less than your
> > 'length' argument. You'll have to take care of sending the rest of the
> > data in your buffers in subsequent calls.
>
> I was under the impression that writev was atomic, my bad.

No. In fact, such a thing is impossible under normal circumstances.
What if it sends one packet and then the link fails?

My man page for writev even says:

When using non-blocking I/O on objects such as sockets that are subject
to flow control, write() and writev() may write fewer bytes than
requested; the return value must be noted, and the remainder of the
operation should be retried when possible.

If you think about it, it's entirely possible that you write more data
than there is buffer space for. You've asked for non-blocking I/O, so
writev isn't allowed to wait until the first chunk is sent and then
send some more. All it can do is return, and tell you how much it did
write.

Even when blocking I/O is being used, write/writev can still write
less than requested if some error is encountered, or a signal
arrives. You always have to be prepared to deal with that
possibility.

> However this
> confuses me now as if you had more then 1 item in the vector it would be
> very difficult to figure out which part of the vector wasn't sent.

It isn't that hard, because they're sent in order. For instance, if
you had chunks of size 30, 17, and 24, and writev returned 36, you
know that it wrote all of the first chunk and the first 6 bytes of the
second, so that is where you should start from.
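
That arithmetic can be sketched as a small helper. The name find_resume_point and its out-parameters are invented for illustration; it only locates the restart position and leaves the iovec array untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: given that `written` bytes of iov[0..count) went out,
 * store the index of the first unfinished element in *idx and the number of
 * bytes of that element already sent in *off. If everything was written,
 * *idx == count and *off == 0. */
void find_resume_point(const struct iovec *iov, int count, size_t written,
                       int *idx, size_t *off)
{
    int i = 0;
    assert(iov != NULL && idx != NULL && off != NULL);
    while (i < count && written >= iov[i].iov_len) {
        written -= iov[i].iov_len;   /* this element went out completely */
        i++;
    }
    *idx = i;
    *off = (i < count) ? written : 0;
}
```

With chunks of 30, 17, and 24 bytes and a return value of 36, this yields index 1 and offset 6, matching the example above.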

> I'm
> not sure I follow the point of this function.

Sometimes it can be more efficient. For example, imagine an HTTP
server. You've constructed the header of the response in one buffer,
and the body is somewhere else. Without writev(), you would either
have to copy them both into a single buffer, or make two calls to
write() which involves a little more overhead.

But unless you know that the overhead is a serious problem for you,
it's usually more convenient just to use write(). For instance, it's
easier to compute where you should restart a partial write.
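
To make the HTTP case concrete, here is a minimal sketch. The function name send_response is made up, the header and body are assumed to be NUL-terminated strings purely to keep the example short, and the caller still has to deal with a short return value.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: hand a header and a body to the kernel in one call,
 * without copying them into a single buffer first. Returns whatever
 * writev() returns, which may be less than the combined length. */
ssize_t send_response(int fd, const char *header, const char *body)
{
    struct iovec iov[2];
    assert(header != NULL && body != NULL);
    iov[0].iov_base = (void *)header;   /* writev() only reads the data */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);
    return writev(fd, iov, 2);
}
```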

Logan Shaw

Oct 18, 2007, 9:54:23 PM

Jim Marshall wrote:
> I was under the impression that writev was atomic, my bad. However this
> confuses me now as if you had more then 1 item in the vector it would be
> very difficult to figure out which part of the vector wasn't sent.

I think "very difficult" might be a little bit of an overstatement. As far
as I can tell, you'd just loop through the vector and skip items as long as
the lengths haven't added up to what writev() returned. It's non-trivial,
but it's also not super hard.

In particular, if it's OK to stomp on your iovec structs as you go, I think
this should do the trick of sending everything in a list of iovecs:

struct iovec *first_remaining_iovec = original_iovec_array;
int remaining_count = original_count;

while (remaining_count > 0) {
    /* fd is the descriptor being written to */
    ssize_t bytes_written = writev(fd, first_remaining_iovec, remaining_count);
    if (bytes_written == -1) { /* .... handle error .... */ }

    size_t bytes_to_consume = bytes_written;
    while (bytes_to_consume > 0) {
        if (bytes_to_consume >= first_remaining_iovec->iov_len) {
            /* consume entire vector element */
            bytes_to_consume -= first_remaining_iovec->iov_len;
            remaining_count--;
            first_remaining_iovec++;
        } else {
            /* consume partial vector element */
            first_remaining_iovec->iov_len -= bytes_to_consume;
            first_remaining_iovec->iov_base += bytes_to_consume;
            bytes_to_consume = 0;
        }
    }
}

> I'm
> not sure I follow the point of this function.

Efficiency. Specifically, avoiding unnecessary copies whilst at the same time
avoiding unnecessary system calls (and unnecessary extra packets or other
physical I/O, i.e. following the principle of "never delay in queuing up
everything you already have ready to send").

Note that for maximum efficiency, you could potentially want to extend the
vector if only a partial vector has already been sent. That is, if I call
writev() with 3 items and it returns a count that tells me it wrote the
first 2 and part of the 3rd, I might want to pass a new vector with the
remainder of the 3rd plus a 4th and 5th with some new data that has become
ready in the intervening time.

So I guess writev(), in essence, allows you to maintain a queue of chunks
and ask the system to process as much from the head of that queue of chunks
as it can conveniently do right now, then return.

- Logan

Jim Marshall

Oct 19, 2007, 1:34:14 AM

to Rick Jones

Rick Jones wrote:
> Jim Marshall <jim.ma...@wbemsolutions.com> wrote:
>> Hope this is the right group to post to, if not please direct me as needed.
>
>> I have an application which is using NON-blocking sockets. My
>> application attempts to send a response using the writev function (I
>> am testing on Fedora Core 6, "Linux starfury 2.6.22.1-32.fc6 #1 SMP
>> Wed Aug 1 14:10:08 EDT 2007 i686 i686 i386 GNU/Linux"), in some
>> circumstances writev does not write all of the data to the
>> socket. When this happens the client only gets part of the request
>> and then sits there waiting for more data. It is not clear to me
>> why this might happen from the docs.
>
>> This seems to be an issue depending on the amount of data being sent
>> back. For example if I am sending 3000 bytes it works fine, but when I
>> try to send 30K bytes I run into the problem.
>
> Since the socket is non-blocking, your writev will do a partial write
> whenever there is less space in the socket than you are trying to
> write with writev.

I obviously misread something somewhere (I can't find it now, isn't
that always the case...). It makes sense that this would happen with
non-blocking sockets, but given my misunderstanding it was confusing to me.

Thanks to everyone for setting me on course.

>
> If you *know* you will "never" try to write more than N bytes into a
> socket at one time and can "know" when that has been emptied - say by
> the receipt of data from the remote indicating it has all you sent
> previously - you could use setsockopt(SO_SNDBUF) to set the socket
> buffer to be larger than your largest writev() call, and even if the
> socket is non-blocking writev() should always write everything - well
> assuming it can allocate space - socket buffer sizes are limits not
> preallocations.... after your setsockopt() call, you should make a
> getsockopt() call to make sure you got at least the size you wanted...

In my application the amount of data we send back can vary from 1 KB to
100 MB, so I will just code the function to work as it should.

>
> One or more of the texts of Stevens Fenner and Rudoff, or Stallings
> might be good to add to your bookshelf - they cover all sorts of stuff
> like this.

Thanks, I will look these up.

>
> I trust you already know that even if you did put all N KB into the
> socket at once that your reciever will still get it in smaller chunks
> right? Ie that TCP is a byte-stream service and that message
> boundaries are not preserved...

Yes. My (mis)understanding was that writev was "atomic" in that it
wouldn't return until all the data is written - probably applies to
blocking sockets but not non-blocking ones.

Again thanks to everyone who replied.
>
> rick jones

Rainer Weikusat

Oct 19, 2007, 5:03:46 AM

The condition tested inside the loop is always true except for the
last iteration. This means a better way to write this would be

/* the n_iovs check avoids reading past the array when all was written */
while (n_iovs > 0 && nr >= iovs->iov_len) {
    nr -= iovs->iov_len;
    --n_iovs;
    ++iovs;
}

if (nr) {
    iovs->iov_len -= nr;
    iovs->iov_base = (char *)iovs->iov_base + nr;
}

Additionally, the iov_base member is a void *, which means that arithmetic
on it is not defined by ISO C. Treating void * like char * in this
respect is a gcc extension. But IMO it is better not to get into the
habit of using it, because there is an easy way to get bitten by it:

void *p;
struct something *ps0, *ps1;

p = allocate_two_somethings();
if (!p) ...

ps0 = p;
ps1 = p + 1; /* Ouch. Should have been ps0 + 1 */

With default settings, gcc will not catch this, but there is an
optional warning (-Wpointer-arith), which can be used to detect
places where this error may lurk (I have made this mistake often enough
myself that I prefer to enable the warning).
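
Putting the pieces of this thread together, a complete "send everything" helper might look something like the sketch below. It is an illustration under assumptions rather than a definitive version: writev_all is an invented name, it overwrites the caller's iovec array as it goes, it waits with poll() when a non-blocking descriptor has no room, and it adds a bounds check to the skipping loop.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write every byte described by iov[0..count) to fd.
 * Modifies the iovec array in place. Returns 0 on success, -1 on error. */
int writev_all(int fd, struct iovec *iov, int count)
{
    assert(iov != NULL || count == 0);
    while (count > 0) {
        ssize_t n = writev(fd, iov, count);
        if (n == -1) {
            if (errno == EINTR)
                continue;                  /* interrupted: just retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                /* no buffer space right now: wait until writable */
                struct pollfd pfd = { .fd = fd, .events = POLLOUT };
                if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
                    return -1;
                continue;
            }
            return -1;                     /* real error */
        }

        size_t nr = (size_t)n;
        /* skip the elements that were written completely */
        while (count > 0 && nr >= iov->iov_len) {
            nr -= iov->iov_len;
            --count;
            ++iov;
        }
        /* adjust a partially written element; cast avoids void * math */
        if (nr > 0) {
            iov->iov_base = (char *)iov->iov_base + nr;
            iov->iov_len -= nr;
        }
    }
    return 0;
}
```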

Rick Jones

Oct 19, 2007, 1:48:20 PM

Jim Marshall <jim.ma...@wbemsolutions.com> wrote:

> Yes. My (mis)understanding was that writev was "atomic" in that it
> wouldn't return until all the data is written - probably applies to
> blocking sockets but not non-blocking ones.

I suspect there is some verbiage about atomicity of write/writev
somewhere - usually that is in the context of multiple writers to the
same _file_ and the granularity at which those writes will be
interleaved. When you switch from calling write/writev against a file
to write/writev against a socket you have switched contexts :)

If you are only ever writing against a socket, you might want to
consider sendmsg and friends - not a big deal, but it may avoid a tiny
bit of path mapping from writev to the socket code under the covers.
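
For what it's worth, a sendmsg() call carrying the same kind of iovec array could be sketched like this. The name send_iov is made up, flags are left at 0, and the return value can be a partial count exactly as with writev().

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send an iovec array on a connected socket with
 * sendmsg(). Returns the number of bytes sent, or -1 on error. */
ssize_t send_iov(int sock, struct iovec *iov, int count)
{
    struct msghdr msg;
    assert(iov != NULL && count > 0);
    memset(&msg, 0, sizeof msg);   /* no address, no control data */
    msg.msg_iov = iov;
    msg.msg_iovlen = count;
    return sendmsg(sock, &msg, 0);
}
```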

rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?

these opinions are mine, all mine; HP might not want them anyway... :)

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Frank Cusack

Oct 19, 2007, 2:31:59 PM

On Fri, 19 Oct 2007 01:34:14 -0400 Jim Marshall <jim.ma...@wbemsolutions.com> wrote:
> Yes. My (mis)understanding was that writev was "atomic" in that it
> wouldn't return until all the data is written - probably applies to
> blocking sockets but not non-blocking ones.

writev() is atomic, in the sense that data from multiple iov's
will not be interleaved with data from other write()/writev()'s to
the same file descriptor. That is, if a writev() returns N (bytes
written), it is guaranteed that those N bytes are not interleaved
with any bytes from another write()/writev(). This is notable
because of the multiple iov's that can be passed ... the atomicity
guarantee dictates that writev() cannot allow other data to be
interleaved as it handles each subsequent iov.

-frank

Frank Cusack

Oct 19, 2007, 5:53:31 PM

On Fri, 19 Oct 2007 11:31:59 -0700 Frank Cusack <fcu...@fcusack.com> wrote:
> On Fri, 19 Oct 2007 01:34:14 -0400 Jim Marshall <jim.ma...@wbemsolutions.com> wrote:
>> Yes. My (mis)understanding was that writev was "atomic" in that it
>> wouldn't return until all the data is written - probably applies to
>> blocking sockets but not non-blocking ones.
>
> writev() is atomic, in the sense that that data from multiple iov's
> will not be interleaved with data from other write()/writev()'s to
> the same file descriptor.

Sorry, I meant to the same file.

-frank