> > I don't think its a bug if everything occurs behind a > > firewall. The simpler approach makes sense in some > > cases.
> What do you mean by firewall? A firewall that simply locks out untrusted > data? Or a firewall that actually inspects the sent data, knows that it is > serialized data and checks it for sanity?
The first kind.
> > short a = 1; > > short b = 2; > > short c = 3;
> > oa << a << b << c;
> > Will that lead to 3 packets -- each with a header and > > a body being sent? If so it is a lot of overhead.
> No. Why should it? It's sent as one packet announcing that it is 6 bytes > large.
I'm not sure how the software determines when a message is complete. If the next lines are
int d = 4; oa << d;
how does the software know whether to include d with a, b, and c or should there be a separate message? Could it distinguish between that and
> On Jun 27, 12:41 pm, Sebastian Redl <e0226...@stud3.tuwien.ac.at> > wrote: > > On Tue, 26 Jun 2007 c...@mailvault.com wrote:
> > > I don't think its a bug if everything occurs behind a > > > firewall. The simpler approach makes sense in some > > > cases.
> > What do you mean by firewall? A firewall that simply locks out untrusted > > data? Or a firewall that actually inspects the sent data, knows that it is > > serialized data and checks it for sanity?
> The first kind.
Yes, if the data is trusted, obviously there can be no DoS. At least not an intentional one. (There's always a chance of bugs.)
So it might be nice if the serialization framework supported marking a source as trusted and would use the non-paranoid version then. It's not essential, though.
> > > short a = 1; > > > short b = 2; > > > short c = 3;
> > > oa << a << b << c;
> > > Will that lead to 3 packets -- each with a header and > > > a body being sent? If so it is a lot of overhead.
> > No. Why should it? It's sent as one packet announcing that it is 6 bytes > > large.
> I'm not sure how the software determines when a message > is complete. If the next lines are
> int d = 4; > oa << d;
> how does the software know whether to include d with > a, b, and c or should there be a separate message? > Could it distinguish between that and
> oa << a << b << c << d; ?
The same way any other communication delimits messages: 1) By flushing the stream, so that data is really delivered instead of cached. 2) In addition, by conventions of the protocol that specify the content of a message, so that the receiver knows the message is complete.
> On Jun 25, 1:24 pm, Le Chaud Lapin <jaibudu...@gmail.com> wrote:
> > Naturally, it is optimal in many cases that an object be > > serialized from an archive by construction only, not by assign-after- > > construct. Some objects have heavy-weight default-construction, and > > if one uses this scheme to deserialize say, a 1-million-element > > list<Heavyweight_Class_With_Massive_Constructor>, the performance > > penalty will be interesting indeed.
> I've thought about this some also and like the term > stream constructor here. Recently I've thought that > if a derived object is being received and an error > occurs late in the process, it makes sense to attempt > to salvage what you can.
> class B {...};
> class I : public B {...};
> class D : public I {...};
> B* b = new D(stream_identifier_here);
> If D's constructor releases an exception, the > standard says the sub-objects should be destructed. > Since that is how things have been set up over the > years, it can't easily be changed, but it might be > helpful if there was a way to indicate to the > compiler that a constructor is a stream constructor > and then instead of giving up, it could return an I. > The main reason I think this way is the sender, > network and receiver have put in a lot of work to > get to where it fails.
Hmm...yes, a lot of work, but don't you think it might be better to just let the entire object go? After all, there is intuitive merit in keeping with the spirit of automatic unwinding when full construction failed. Also, the partially-received object will not have affected the state at the sender in any way, so no harm would be lost. Finally, the context in which the serialization occurs is indeterminate when the serialization code is written. What happens the partial object, I, is accepted? Then what? I think it would be a bit like buying an automobile might or might not come with the wheels and windshield, and you must agree to purchase such an automobile before you know whether it will be complete.
> LOL. You criticize me for requiring a single estimate to be made, yet > here you are, happily estimating away, not just a single value, but a > value for every single type of object you ever intend to serialize...
That is better than your alternative, allocating a 1MB buffer.
> You haven't written a serialization framework. You've written a > serialize-send-receive-deserialize chunk of code, whose source code > must be modified to fit in with any particular application. > Congratulations.
My method does not require source code modification. It is far more efficient and less arbitrary than your method, and as another poster pointed out (I forget who), my method will detect breaches much quicker than your method, and more likely at the point where over- consumption is attempting to occur.
> The rest of us will stick to general purpose serialization frameworks.
You must mean you and the Infidels-of-Infinite Memory. :)
> > On Jun 27, 12:41 pm, Sebastian Redl <e0226...@stud3.tuwien.ac.at> > > wrote: > > > On Tue, 26 Jun 2007 c...@mailvault.com wrote:
> > > > I don't think its a bug if everything occurs behind a > > > > firewall. The simpler approach makes sense in some > > > > cases.
> > > What do you mean by firewall? A firewall that simply locks out untrusted > > > data? Or a firewall that actually inspects the sent data, knows that it > is > > > serialized data and checks it for sanity?
> > The first kind.
> Yes, if the data is trusted, obviously there can be no DoS. At least not > an intentional one. (There's always a chance of bugs.)
Agreed.
> So it might be nice if the serialization framework supported marking a > source as trusted and would use the non-paranoid version then. It's not > essential, though.
I guess it is not strictly essential, but not wearing comfortable clothes is not essential either. Nobody wears a suit and tie to bed at least not that I know.
> > > > short a = 1; > > > > short b = 2; > > > > short c = 3;
> > > > oa << a << b << c;
> > > > Will that lead to 3 packets -- each with a header and > > > > a body being sent? If so it is a lot of overhead.
> > > No. Why should it? It's sent as one packet announcing that it is 6 bytes > > > large.
> > I'm not sure how the software determines when a message > > is complete. If the next lines are
> > int d = 4; > > oa << d;
> > how does the software know whether to include d with > > a, b, and c or should there be a separate message? > > Could it distinguish between that and
> > oa << a << b << c << d; ?
> The same way any other communication delimits messages: > 1) By flushing the stream, so that data is really delivered instead of > cached. > 2) In addition, by conventions of the protocol that specify the content of > a message, so that the receiver knows the message is complete.
OK, but that approach has some implementation weaknesses. You can't know the total message size until you get the flush. In the approach I advocate you know that up front and can start sending the data right after you've determined the message size isn't too big. If you use the flush approach you probably have to process the data from each call and put it in a cache/buffer and all that would be wasted if eventually the size of the message grew to be more than the limit. B.Ser is already kind of slow in my opinion, if the flush approach is used the performance difference between B.Ser and our approach would deepen. Also it would mean a significant overhaul interface- wise for B.Ser applications once variadic template support is widely available.
> On Jun 28, 7:27 am, c...@mailvault.com wrote: > > I've thought about this some also and like the term > > stream constructor here. Recently I've thought that > > if a derived object is being received and an error > > occurs late in the process, it makes sense to attempt > > to salvage what you can.
> > class B {...};
> > class I : public B {...};
> > class D : public I {...};
> > B* b = new D(stream_identifier_here);
> > If D's constructor releases an exception, the > > standard says the sub-objects should be destructed. > > Since that is how things have been set up over the > > years, it can't easily be changed, but it might be > > helpful if there was a way to indicate to the > > compiler that a constructor is a stream constructor > > and then instead of giving up, it could return an I. > > The main reason I think this way is the sender, > > network and receiver have put in a lot of work to > > get to where it fails.
> Hmm...yes, a lot of work, but don't you think it might be better to > just let the entire object go? After all, there is intuitive merit in > keeping with the spirit of automatic unwinding when full construction > failed.
I guess both could be accomodated.
B* b = new D(...);
could behave like usual and
B* b = new preserve D(...);
could return an I object if a D is not possible. That seems better than what I wrote above as it could be applied to more than stream constructors.
> Also, the partially-received object will not have affected > the state at the sender in any way, so no harm would be lost.
The headache is you're more likely to have to request a resend and a full resend if you throw it all away.
> Finally, the context in which the serialization occurs is > indeterminate when the serialization code is written. What happens > the partial object, I, is accepted? Then what?
It does what it was designed to do. You shouldn't expect it to do more.
>I think it would be a > bit like buying an automobile might or might not come with the wheels > and windshield, and you must agree to purchase such an automobile > before you know whether it will be complete.
The I instance is useful. It has wheels and a windshield, but no cruise control.
> The asio example deserializes from a std::string, not a socket. The > string is defined on line 132 of connection.hpp .
> You're beating a dead horse.
> Jarl.
Okay, I understand what is happening here... Most of which is that the code goes against my own philosophies...
There are two things:
#1) Even though it deserializes from a string, this specific example does have multiple DoS bugs in it. handle_read_header() blindly resizes inbound_data_ to whatever the headers tells it to. Yes, inbound_data_ is then used to create a string from the data, but the damage is done. Another DoS bug is in line 25 and 26 of the file "stock.hpp" - It calls the default deserialization function for std::string, which itself has the same problem - in boost/ serialization/collections_load_imp.hpp in the function rebuild_collection().
#2) In my opinion, this specific example is not a very good example of receiving data over a socket.
I say #2 because although one can put an arbitrary limit on inbound_data_ which would solve the bug/DoS, this limit really would be arbitrary.
Others in this discussion mentioned "1 megabyte". What specific value to use is not really known.
Yes, you can say that the 'application writer' can decide this. But the real problem is that handle_read_header() is itself a template. This method gets called regardless of whatever type is specified in the original call to async_read(), therefore the same limit must be used for all types T. Is the same limit reasonable regardless of whatever object class is being deserialized?
I feel that this design of the serialization library is ultimately flawed.
The flaw exists even if you are reading from a file or a fixed size std::string instead of a socket.
The flaw forbids the application from controlling the limits for individual fields in an object which are reasonably implemented as stl containers or strings.
I guess it could be said that the concept of the flaw comes from the fact that instantiated STL containers in general can not be given a maximum size limit.. static STL allocators don't cut it for this purpose - perhaps stateful allocators could help out if they became more mainstream, see http://www.lancediduck.com/papers/Cpp/StatefulSTL.pdf
In the boost library, rebuild_collection() (in collections_load_imp.hpp ) could be modified to take an actual object that provides the function of the "class R" template parameter which is the 'reserve_imp' functor. This would require changes to 'load()' and the fancy >> and & operators...
However, this would allow the application code to specify limits on a per-item basis. Right now, rebuild_collection() and therefore most deserializations, provide DoS bugs for your application code....
Here is where my personal philosphy comes into play:
Yes, I know it is safe if you 'trust' your data... However 'gets()' is safe if you trust your data too, yet gets() is deprecated and everybody knows to not use it!
> The asio example deserializes from a std::string, not a socket. The > string is defined on line 132 of connection.hpp .
> You're beating a dead horse.
Jeff is not beating a dead horse.
[Note to mods: I am resending this message because the original message was apparently lost.]
I am going to show in the Boost serialization framework where the author of that framework is doing almost exactly what I pointed out as a DoS vulnerability in my original post. But first, to restate the most serious problem briefly, with current serialization frameworks, including MFC, Boost, and my own, and probably many others, without a fundamental change in the model, the sender of a serialized object can artificially induce the receiver to allocate a large amount of memory, creating a denial-of-service attack. It should be noted that there are other problems beyond the one I am about to demonstrate that would arise without a fundamental change in these serialization frameworks, but I choose this problem because it illustrates the general principle quickly and does not require much insight into serialization to recognize as faulty.
At the sending end of a connection, a programmer would write something like:
int main () { Socket s; std::vector<int> v; s << v; return 0;
}
At the receiving end of a connection, a programmer would write something like:
int main () { Socket s; std::vector<int> v; s >> v; return 0;
}
In both cases, a Socket is a type of "Archive".
This second main() function, in the line s >> v, is where the problem occurs. The essence of the problem can be seen in pieces of the serialization framework, ignoring non-essential elements. We will dissect the framework, starting from the bottom up, with a simple piece.
template<class Archive, class Container, class InputFunction, class R> inline void load_collection(Archive & ar, Container &s) { s.clear(); // retrieve number of elements unsigned int count; unsigned int item_version(0); ar >> BOOST_SERIALIZATION_NVP(count); if(3 < ar.get_library_version()){ ar >> BOOST_SERIALIZATION_NVP(item_version); } R rx; rx(s, count); // <----- THIS IS THE PROBLEM. -Le Chaud Lapin- InputFunction ifunc; while(count-- > 0){ ifunc(ar, s, item_version); }
}
As can be seen from the code above, load_collection is the main function for serializing into several STL collections, including vectors. The "R" parameter to this template function is a "reserve_imp" class just mentioned. load_collection, for each STL collection, including vector<>, does the following:
1. empties the container with s.clear() 2. declares an unsigned int "count" 3. declares a version number 4. Reads "count", taking care that the "stuff" read in includes not only "count", but the literal word 'count' 5. If library version is less than 3, then read in the item version 6. Instantiates a "reserve_imp" object, "rx". 7. Invokes operator () against rx, supplying the container and count as arguments. << ***THIS IS THE PROBLEM*** 8. Declares input function object. 9. Reads in each element of container "count" times using input function.
It can be seen that the size of a std::vector<> will be sized by rx arbitrarily to what the sender says the vector size should be (count), even 50MB. It does not matter whether the count is stored in a tiny 32-byte buffer or a huge 1MB buffer. The vector will still be unwittingly resize to whatever the sender says the size should be, which could happen if the sender is malicious.
> > The asio example deserializes from a std::string, not a socket. The > > string is defined on line 132 of connection.hpp .
> > You're beating a dead horse.
> Jeff is not beating a dead horse.
[snip] >It can be seen that the size of a std::vector<> will be sized by rx >arbitrarily to what the sender says the vector size should be (count), >even 50MB. It does not matter whether the count is stored in a tiny >32-byte buffer or a huge 1MB buffer. The vector will still be >unwittingly resize to whatever the sender says the size should be, >which could happen if the sender is malicious.
Jarl has answered what you are saying numerous times. He includes a message size/header that applications can check to limit their vulnerability to such an attack. It isn't difficult to keep track of how many bytes remain in a message and pass that to a load_collection/Receive function. The function uses that to check the sanity of the count value that it gets.
> Jarl has answered what you are saying numerous times. He includes > a message size/header that applications can check to limit their > vulnerability to such an attack. It isn't difficult to keep track > of how many bytes remain in a message and pass that to a > load_collection/Receive function. The function uses that to > check the sanity of the count value that it gets.
Are you saying that I should modify the function that calls load_collection? That function is deep inside the Boost Serialization framework.
> Jarl has answered what you are saying numerous times. He includes > a message size/header that applications can check to limit their > vulnerability to such an attack. It isn't difficult to keep track > of how many bytes remain in a message and pass that to a > load_collection/Receive function. The function uses that to > check the sanity of the count value that it gets.
Jarl also claimed that Jeff was "beating a dead horse", implying that there is no problem.
Solutions aside, do you or do you not agree that, today, in 2007, 4 of July, Boost Serialization is doing what I said we should avoid in my OP
Do you agree that Boost Serialization is using an implementation that is subject to DoS as I wrote in my original post?
There are many people who are reading these posts would like to know a simple yes/now answer to this question, because they might alter their inclination to use such a framework in the nude if the answer is "yes".
> Jarl has answered what you are saying numerous times. He includes > a message size/header that applications can check to limit their > vulnerability to such an attack. It isn't difficult to keep track > of how many bytes remain in a message and pass that to a > load_collection/Receive function. The function uses that to > check the sanity of the count value that it gets.
Wow, sometimes communication seems so hard even without serialization...
To summarize:
* Boost serialization includes problem code which is the foundation of a DoS.
* This problem happens with all containers including std::list, std::vector, std::string.
* It does not matter if you are deserializing from a string or a socket, remote or local, behind a firewal or not, the problem still exists in Boost.ser.
* It is impossible for any application code to 'work around' this problem without modifying Boost.ser.
* The boost.asio examples using boost.ser contain the DoS problem and can not be fixed without modifying Boost.ser.
* If this problem were fixed in Boost.ser then there would be no problem with doing IPC with Boost.ser.
These above facts do not mean that doing IPC with serialization is bad.
The above facts only mean that doing IPC with the unmodified boost.ser library is bad, and that the boost.asio serialization example is not a good example since it contains hidden DoS problems.
> > 7. Invokes operator () against rx, supplying the container and count > > as arguments. << ***THIS IS THE PROBLEM***
> I guess you didn't bother reading Sebastian Redl's post on the 26th of > June, or any of the followups.
I read them.
1. I pointed out that there was a flaw in my serialization framework, and probably others. 2. Jeff pointed out that the flaw exits in Boost. 3. You attempted to refute that the problem exists in Boost. 4. I showed prima facie evidence that it does exist in Boost.
So, after so many posts, we at least arrive at a simple conclusion. The problem does exist, and no dead horses are being beaten. Now, the next step, is to find a solution, and not to kick at the one you proposed, I have one gets more regular each day I think about it.
In any case, you should not claim that a problem does not exist when it does. There are people who read these posts who do not understand/ care about all the details. They only want to know the end result: does a problem exist or not. With your posts, it could have easily been determined by a casual reader that there was no problem, and such a reader might have used Boost Serialization in the nude, leaving ample opportunity to crash their computers at will.
Do you at least admit that the problem exists in Boost (and MFC for that matter)?
> > Jarl has answered what you are saying numerous times. He includes > > a message size/header that applications can check to limit their > > vulnerability to such an attack. It isn't difficult to keep track > > of how many bytes remain in a message and pass that to a > > load_collection/Receive function. The function uses that to > > check the sanity of the count value that it gets.
> Jarl also claimed that Jeff was "beating a dead horse", implying that > there is no problem.
> Solutions aside, do you or do you not agree that, today, in 2007, 4 of > July, Boost Serialization is doing what I said we should avoid in my > OP
> Do you agree that Boost Serialization is using an implementation that > is subject to DoS as I wrote in my original post?
Yes, and it should be changed to better defend against the possibility. Jarl is the only one I'm aware of who has a solution in place already and he deserves credit for that.
> Do you at least admit that the problem exists in Boost (and MFC for > that matter)?
It appears that you didn't notice that this discovery of yours was pointed out by Sebastian Redl on June 26, and the eminently simple fix was then posted, again by him, on June 27.
It is hard for me to understand why you want to continue this discussion. Or why the moderators allow you to rehash the same content, over and over again. The premise in your OP, of the inutility of C++ serialization frameworks in IPC applications, was as sweeping as it was incorrect, and stemmed from your own use of "socket archives". And now you are implying that your original post only concerned an implementation detail of Boost.Serialization.
If this whole thread isn't a dead horse, I don't know what is.
> > > Jarl has answered what you are saying numerous times. He includes > > > a message size/header that applications can check to limit their > > > vulnerability to such an attack. It isn't difficult to keep track > > > of how many bytes remain in a message and pass that to a > > > load_collection/Receive function. The function uses that to > > > check the sanity of the count value that it gets.
> > Jarl also claimed that Jeff was "beating a dead horse", implying that > > there is no problem.
> > Solutions aside, do you or do you not agree that, today, in 2007, 4 of > > July, Boost Serialization is doing what I said we should avoid in my > > OP
> > Do you agree that Boost Serialization is using an implementation that > > is subject to DoS as I wrote in my original post?
> Yes, and it should be changed to better defend against > the possibility. Jarl is the only one I'm aware of > who has a solution in place already and he deserves > credit for that.
I am glad that we can now all agree that Boost Serialization has a DoS attack that is readily exploitable, along with any other serialization framework that uses the same model outlined in this thread.
As far as giving credit to Jarl...I am the last person to want to deny credit to someone for having an idea, even one that (IMO) is only good, but not great. I was inclined last night to give Jarl credit for his buffer technique...but after thinking more, I do not regard this as a solution. The truth is that I feel that the minimum standard that I would hold for a serialization framework is violated by Jarl's method. In particular, the pre-allocation buffer is, IMO, intolerable, not to mention the multiple-allocation methods that he and others described. His method would fail on a PDA with only 64MB of RAM, and I do run my serialization code on PDA's, and so would a few others.
I would also like to assert for the recorded that I am all but convinced that the only survivable (not perfect, but workable) _requires_ participation by the objects themselves to limit resources. Before I was only vaguely suspicious. Now I am certain. Jarl's method ignores participation by the objects.
Finally, I am _not_ proposing my method as a solution. As I have said many times before, my method, like Jarl's involves arbitrariness, and arbitrariness (to this extent) is usually bad in engineering.
However, my method does not require pre-allocation of memory, nor does it require superfluous reallocations why an object is being serialized into. Any arbitrariness that my method has, Jarl's method will certainly have.
That is why I would find it hard to give him credit for the buffer method. I believe that my method, as weak as it is, is better than what he proposed.
> It appears that you didn't notice that this discovery of yours was > pointed out by Sebastian Redl on June 26, and the eminently simple fix > was then posted, again by him, on June 27.
Jarl, what was just proven is that this "Simple Fix" that was done is not sufficient. Read the boost.ser code where it calls resize() with no limit. What I have been pointing out is that even with Sebastian's "Simple Fix", it doesn't fix the real problem which is unverified forced memory allocations.
> > Do you at least admit that the problem exists in Boost (and MFC for > > that matter)?
> It appears that you didn't notice that this discovery of yours was > pointed out by Sebastian Redl on June 26, and the eminently simple fix > was then posted, again by him, on June 27.
Perhaps you did not notice your own post refuting Jeff Koftinoff's suggestion that Boost Serialization was doing exactly what I had said it was doing.
> It is hard for me to understand why you want to continue this > discussion. Or why the moderators allow you to rehash the same > content, over and over again. The premise in your OP, of the inutility > of C++ serialization frameworks in IPC applications, was as sweeping > as it was incorrect, and stemmed from your own use of "socket > archives". And now you are implying that your original post only > concerned an implementation detail of Boost.Serialization.
I continue this discussion for the same reason that the moderators allow it to continue: I search for truth, and I, like you, implicit in our privilege as posters to this group, have a responsibility to seek it.
The fact is, you tried to refute Jeff's assertion that Boost was doing exactly what I wrote about in the OP, and when I demonstrated indisputable evidence two threads up, you want to discontinue the discussion.
If I had been a random casual reader of this thread who, contrary what you wrote to your previous post, had not had any experience with serialization, and I had decided that Le Chaud Lapin and Jeff Koftinoff were wrong, and Jarl Linrud was right, I might have continued using Boost under the impression that there was no problem, which would have been a problem indeed.
At least Brian Wood had the sensibility that Boost Serialization has a flaw in I pointed out, and because of that flaw, it is possible to to crash machines all over the Internet that use Boost Serialization in the nude.
When you present your arguments, the they are not just for our benefit. They are for the benefit of everyone who reads this group and wants to gain insight to truth. That is why it is so important to see truth.
> I continue this discussion for the same reason that the moderators > allow it to continue: I search for truth, and I, like you, implicit in > our privilege as posters to this group, have a responsibility to seek > it.
> When you present your arguments, the they are not just for our > benefit. They are for the benefit of everyone who reads this group > and wants to gain insight to truth. That is why it is so important to > see truth.
Let's dispense with the sermonizing and try to get our facts straight.
1) The flaw in B.Ser was pointed out by Sebastian Redl, not you. 2) The simple fix for it was also pointed out by Sebastian Redl. 3) That simple fix is valid, as long as one only passes in bounded amounts of data to the deserialization framework.
I think we can agree on those 3 points.
If you still think, as you did in your OP, that there is a general problem with the use of C++ serialization frameworks in IPC applications, then please specify exactly what that problem is.
jaibudu...@gmail.com (Le Chaud Lapin) wrote (abridged):
> I would also like to assert for the recorded that I am all but > convinced that the only survivable (not perfect, but workable) > _requires_ participation by the objects themselves to limit > resources. Before I was only vaguely suspicious. Now I am certain. > Jarl's method ignores participation by the objects.
Here's what I got out of the discussion. The core idea was that the amount of memory allocated should be roughly proportional to the amount of data sent over the link.
Given that, we can control the denial of service attacks by placing a limit on the amount of data the link accepts.
One way to do this is by using packets and buffers. It seems to me that it could also be done by keeping a count of the number of bytes received and checking that it never gets too big. This way would avoid the overhead of extra data copies and memory allocations of a packet-oriented approach. However the limit is enforced, it is a property or policy of the socket, not of the generic serialisation framework.
The responsibility of the serialisation framework is to ensure that the core idea is true. It needs to avoid allocating more memory than the socket is allowed to receive, and preferably not more memory than the socket actually has received. The socket needs to make that information available. So I'd expect code like:
size_t size; ar >> size; size_t limit = ar.bytes_allowed() / sizeof(T); vec.reserve( min( size, limit ) ); for (size_t i = 0; i < size; ++i) { vec.resize( i+1 ); ar >> vec.back(); }
We need a reasonable policy for bytes_allowed(). I think it should probably return at least 1024 even if not that many bytes are available in buffers yet. The bigger it is, the less reallocation and copying but the more vulnerability to denial of service.
Given a reasonably large bytes_allowed(), many vectors will not need to reallocate and copy their memory at all, because their final size will be small enough. Bigger ones will have some overhead, but it will be limited by their logarithmic growth policy and because they are given a reasonably large capacity to start with.
This approach is orthogonal to your push/pop_limit scheme. Both can co-exist. One problem I have with your scheme is that it says, in effect, that one data structure is allowed to allocate more memory than another. This is wrong because it should not be a property of the data structure, but of the socket. A denial-of-service attacker will seek out the trusted data structures to exploit.
It also quite low-level. It requires us to count up the number of bytes each structure might need. But we don't actually care about this. We want to limit the /total/ amount of memory allocated across /all/ the data structures, not the amount used by any specific one of them. It's at the wrong level of abstraction.
If it has to be done at that level, I'd suggest having push_limit() affect the result of bytes_allowed(). Pushing a big number would temporarily increase efficiency (by reducing vector reallocations) at the cost of increased vulnerability. That trade-off seems unavoidable to me. To be honest, it seems like a hack; like saying we care more about efficiency than security.
If you use different limits for different structures the hackers will seek out the structures with the biggest limits and focus their attacks there, ignoring the rest. So you might as well give the rest the same limit; you don't increase vulnerability by doing so, as long as the socket itself limits the total number of bytes received.
My impression is that you have picked up on inefficiencies of specific implementations. I have tried above to emphasis the core ideas, and show how different implementations can reduce or avoid inefficiencies.
Another variation, if you are still bothered by the cost of growing vectors incrementally, is to replace the vector code with something like:
size_t size; ar >> size; if (size > ar.bytes_allowed() / sizeof(T)) throw "too big"; vec.resize( size ); for (size_t i = 0; i < size; ++i) ar >> vec[i];
This treats bytes_allowed() as a hard limit. It seems to me it is less flexible without being significantly more efficient or more secure, for a given result from bytes_allowed(). (In practice we will probably need to compensate for the loss of flexibility by increasing bytes_allowed(), so it will end up being less secure.)
> Jarl, what was just proven is that this "Simple Fix" that was done is > not sufficient. Read the boost.ser code where it calls resize() with > no limit. What I have been pointing out is that even with Sebastian's > "Simple Fix", it doesn't fix the real problem which is unverified > forced memory allocations.
No Jeff, this is a fundamental misunderstanding you share with Le Chaud Lapin, and it's the reason this thread just keeps going on and on.
With that simple fix, there is no longer any reserve() call, and B.Ser. is safe, as long as you pass in *bounded* amounts of data. The amount of memory allocated is *directly proportional* to the amount of raw data passed in, and hence it is implicitly *limited and controlled* by the application.
On Jul 7, 4:48 pm, brang...@ntlworld.com (Dave Harris) wrote: Hi,
> Here's what I got out of the discussion. The core idea was that the > amount of memory allocated should be roughly proportional to the amount > of data sent over the link.
> Given that, we can control the denial of service attacks by placing a > limit on the amount of data the link accepts.
This is true...but you do not want to do it the way you specify below....
> One way to do this is by using packets and buffers. It seems to me that > it could also be done by keeping a count of the number of bytes received > and checking that it never gets too big. This way would avoid the > overhead of extra data copies and memory allocations of a packet-oriented > approach. However the limit is enforced, it is a property or policy of > the socket, not of the generic serialisation framework.
Not true.
It will be eventually seen that the serialization framework must be a first-class participant in limiting the amount of data received by the socket. My gut feeling is that pure virtual functions implemented in the Archive class are called for.
It will also be seen that, in the majority of cases, it is ideal to let the object itself specify to the socket the limit on amount of data received.
> The responsibility of the serialisation framework is to ensure that the > core idea is true. It needs to avoid allocating more memory than the > socket is allowed to receive, and preferably not more memory than the > socket actually has received. The socket needs to make that information > available. So I'd expect code like:
> size_t size; > ar >> size; > size_t limit = ar.bytes_allowed() / sizeof(T); > vec.reserve( min( size, limit ) ); for (size_t i = 0; i < size; > ++i) { > vec.resize( i+1 ); > ar >> vec.back(); > }
My solution does not require any reallocation whatsoever.
> We need a reasonable policy for bytes_allowed(). I think it should > probably return at least 1024 even if not that many bytes are available > in buffers yet. The bigger it is, the less reallocation and copying but > the more vulnerability to denial of service.
I understand what you are saying, but this is not really the issue. You're talking about picking a buffer that is limited in size but big enough to avoid reallocation....
Think about this for a second. This is not the issue. The problem does not go away. The problem was that, if you have a generalized C++ std::string object, that is being serialized into, you do not want the source of the serialization to induce the target to allocate a huge amount of memory for that string. What goes on in between the time that the source exports the string with >> and the target imports the string with >> is almost irrelevant. In the end, if the target has indicated that, "I shall now send to you a string that is 16MB long,", the source will eventually have in its RAM a 16MB string. That is the issue.
It should be intuitively obvious that the socket, while *EXTREMELY* capable of limiting how many bytes is received by it, will not have a clue what that limit should be. Only the context of the application, and specially, the point of serialization of objects, will know what that limit should be. That's why I proposed my stack solution.
> Given a reasonably large bytes_allowed(), many vectors will not need to > reallocate and copy their memory at all, because their final size will be > small enough. Bigger ones will have some overhead, but it will be limited > by their logarithmic growth policy and because they are given a > reasonably large capacity to start with.
Yes, and one 16MB string will DoS my PDA.
> This approach is orthogonal to your push/pop_limit scheme. Both can > co-exist. One problem I have with your scheme is that it says, in effect, > that one data structure is allowed to allocate more memory than another. > This is wrong because it should not be a property of the data structure, > but of the socket. A denial-of-service attacker will seek out the trusted > data structures to exploit.
I have, as of July 1, 2007, decided that all of my data structures will indicate to the "Archive" from which they are about to construct themselves as a matter of serialization will indicate, if appropriate, limits on how much data the Archive is allowed to generate. I have not written any code, but in my mind, at least this part of my solution is highly regular.
> It also quite low-level. It requires us to count up the number of bytes > each structure might need. But we don't actually care about this. We want > to limit the /total/ amount of memory allocated across /all/ the data > structures, not the amount used by any specific one of them. It's at the > wrong level of abstraction.
My stack-based scheme takes the total amount of memory into account to. Sorry, don't have the link handy, but if you read it again, you will see that, when pop_limit() is invoked against the Archive, it decimates the limit that is to be next top-of-stack (TOS). In this case, complex data structures with arbitrary nesting can be safely serialized. The only problem that I see with my method, which I am purposely avoiding discussion until we are all in agreement, is that *sometimes* it will be necessary for the programmer using my library of classes to explicitly invoke push_limit(), pop_limit() against the Archive, say for List<Foo>.
This "drawback" will be true, in general, of container classes, because such classes have no way of knowing, a priori, how big they should be. And it would be imprudent from an engineering point of view to pre-specify a limit, since 100 is no more valid than 100,000,000, generally speaking.
But, the push/pop nature of my solution will maintain regularity, which is probably the most important objective.
> If it has to be done at that level, I'd suggest having push_limit() > affect the result of bytes_allowed(). Pushing a big number would > temporarily increase efficiency (by reducing vector reallocations) at the > cost of increased vulnerability. That trade-off seems unavoidable to me. > To be honest, it seems like a hack; like saying we care more about > efficiency than security.
In my serialization framework, there is only one buffer associated with a socket. Data comes into the buffer, and I serialize into objects from that buffer with no problem. The limit on the size of that buffer _never_ changes. Right now, it is 8192 bytes. As far as efficiency goes, that number is quite sufficient. Note again, tweaking the size of this buffer does not solve the problem I mentioned in my OP.
> If you use different limits for different structures the hackers will > seek out the structures with the biggest limits and focus their attacks > there, ignoring the rest. So you might as well give the rest the same > limit; you don't increase vulnerability by doing so, as long as the > socket itself limits the total number of bytes received.
Again, these buffers are not really related to the problem.
> My impression is that you have picked up on inefficiencies of specific > implementations. I have tried above to emphasis the core ideas, and show > how different implementations can reduce or avoid inefficiencies.
No, its not a problem for inefficiency. Those frameworks, including Boost, are fundamentally defective. I want to emphasize that my own framework was just as defective until I changed it recently (in last 4 days). Fiddling with socket buffers or pre-allocated buffer to "serialize from" piece by piece does not solve the problem.
My stack solution, so far, is the only real solution to the problem.
> Another variation, if you are still bothered by the cost of growing > vectors incrementally, is to replace the vector code with something like:
> size_t size; > ar >> size; > if (size > ar.bytes_allowed() / sizeof(T)) > throw "too big"; > vec.resize( size ); > for (size_t i = 0; i < size; ++i) > ar >> vec[i];
Now the $1,000,000USD question: "What is the value of size?"
> This treats bytes_allowed() as a hard limit. It seems to me it is less > flexible without being significantly more efficient or more secure, for a > given result from bytes_allowed(). (In practice we will probably need to > compensate for the loss of flexibility by increasing bytes_allowed(), so > it will end up being less secure.)
Right. This code in a library would be intolerable (this should be obvious). This code in the nude would be tedious, and the problem would still exist for each T element of the vector.
> > That's not part of the serialization framework! It is part of the > > application code. As it stands, it is a DOS vulnerability. That > > vulnerability can be eliminated without touching even a *single* line > > of the serialization framework, simply by limiting the value of > > inbound_data_size .
> What value should be chosen as a limit on inbound_data_size?
That is completely platform-specific, but it should be large enough for the problem domain, yet just below the amount that would cripple or otherwise negatively impact the system.
For example, if your platform is a "real-mode" OS without virtual memory, the limit may ultimately be determined by the amount of physical RAM available to your C-runtime heap.
Setting a limit on inbound_data_size doesn't solve the problem of an attacker sending you bogus object deserialization requests, though.