See also - OpenWatcom's "Safer C Library" (https://web.archive.org/web/20150503192244/http://openwatcom.org/index.php/Safer_C_Library) and Cyclone (though it's more of a whole new language -- http://cyclone.thelanguage.org/).
I don't know much about the internals of GCC and/or Clang, but what kind of a problem is this? "Fork and patch Clang" problem? "Fork and patch all of LLVM" problem? Or "nuke everything and start over" problem?
-waddlesplash
Why not just write security-critical code in a more "well-defined" language than C, which is at the same time easy to interface with more frivolous C code? For instance, Standard ML.
It has no usable codegen yet, but it's getting there.
Since accessing an array out of bounds is undefined behavior in C, wouldn't boringC need to keep track of array length at runtime? Doesn't this imply that it must pass array length information when calling a function that takes an array parameter?
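Purely as an illustration of what that implies (the struct and helper below are hypothetical, not something proposed in this thread), the usual answer is a bounds-carrying "fat pointer" that the compiler would pass in place of a bare pointer:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical "fat pointer": the pointer travels together with its
       length, so the callee can check every index at run time. */
    struct int_slice {
        int    *ptr;
        size_t  len;
    };

    /* The kind of check a bounds-checking compiler would insert around
       every subscript, written out by hand here. */
    static int slice_get(struct int_slice a, size_t i) {
        if (i >= a.len)
            abort();   /* trap instead of undefined behavior */
        return a.ptr[i];
    }

    int sum(struct int_slice a) {
        int total = 0;
        for (size_t i = 0; i < a.len; i++)
            total += slice_get(a, i);
        return total;
    }

So yes: either the calling convention changes to carry the length (breaking the ABI), or the length has to be recoverable some other way, e.g. from the allocator's metadata.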
It's been around for a little while now, and is widely regarded as the gold standard for correct-by-construction compilers. To pick one quote from a study on compiler bugs performed in 2011 (https://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf):
"The striking thing about our CompCert results is that the middle-end
bugs we found in all other compilers are absent. As of early 2011,
the under-development version of CompCert is the only compiler we
have tested for which Csmith cannot find wrong-code errors. This is
not for lack of trying: we have devoted about six CPU-years to the
task. The apparent unbreakability of CompCert supports a strong
argument that developing compiler optimizations within a proof
framework, where safety checks are explicit and machine-checked,
has tangible benefits for compiler users."
While I believe that you have a better grasp and deeper understanding of the nuances of C than most (I hacked on your qmail project for a while, but that was years ago), I'm not sure there is a way to achieve what you want without in fact creating or switching to a new language. boringC would invariably bless some method of writing code that would invalidate someone else's choices. This would mean that some large portion of existing C would need to be ported to boringC to be successful. Is this less painful than building in a new language? Wouldn't it also need to address memory layout choices?
Isn't it possible that we're at a point in the lifecycle of C where it has outlived its usefulness? There are other, better languages out there, and one in particular which can match C in performance -and- make the guarantees you're looking for from boringC, while still allowing fairly easy back-and-forth between the two languages through FFI. Why isn't switching to a different language just as viable a choice as picking some standard that will have a similar impact on the community in terms of porting?
It would be nice if it were possible to include separation of the control flow stack from the local data stack, as featured in SafeStack. Segmenting control flow information from local data just makes sense.
http://clang.llvm.org/docs/SafeStack.html
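For what it's worth, Clang already ships this as -fsanitize=safe-stack; here is a minimal sketch of the kind of function it affects (the protection is described in the comments, nothing here is boringcc-specific):

    #include <stdio.h>

    /* Compile with: clang -fsanitize=safe-stack example.c
       SafeStack moves address-taken locals such as 'buf' to a separate
       "unsafe stack", while the return address and register spills stay
       on the regular stack. A linear overflow of 'buf' can then corrupt
       other unsafe-stack data, but not the return address. */
    void read_line(void) {
        char buf[64];
        if (fgets(buf, sizeof buf, stdin))
            printf("got: %s", buf);
    }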
Even if you discard Clang entirely and develop your own front-end compiler, I would suggest that you target LLVM IR for the backend at least initially. It would get you off the ground faster, and then you could consider writing your own backend if necessary or desirable.
What is "well-defined"? SML and its execution behavior are well-defined, safe, etc. only as long as you consider the language in the abstract, apart from its actual implementation and the hardware that implementation runs on.
SML has array-out-of-bounds exceptions, as e.g. Java does, which could lead to the same kind of vulnerabilities as in 1).
SML relies on garbage collection, which could lead to timing side channels.
SML makes no guarantees about the memory layout of structures/types (alignment), which is another possible (timing) side channel.
SML signatures can result in different runtime behaviors, e.g. signature INTEGER with one implementation using fixed precision and another using arbitrary precision.
1) http://armoredbarista.blogspot.de/2014/04/easter-hack-even-more-critical-bugs-in.html
This would allow people writing crypto libraries to turn on the switch, and then adjust their source code until it compiles cleanly, and then it would work--hopefully in a consistent, defined way--on any other compiler.
Certainly ending up with portable, boring crypto code is better than not, right?
Imagine, for example, making gcc more boring by picking some version of
gcc from a few years ago and specifying that code compiled with boringcc
will behave in the same way as code compiled with that version of gcc.
The first is changing the compiler not to make /optimizations/ that take advantage of undefined behavior (so the assembly is "what you'd expect": no null checks being eliminated, no signed overflow generating value-range assumptions that turn out to be false and seemingly make the logic fall apart, etc.), and maybe to emit some extra sequences to paper over CPU inconsistencies like shift-count overflow.
This is attractive to talk about, because it's easy to blame the compiler for breaking one's code; because we know that it can be done without too much performance impact, as old C compilers essentially behaved this way; because the set of undefined behaviors is somewhat arbitrary (e.g. most unsigned integers in typical programs are not expected to overflow, but compilers can only optimize for the signed case); and because in principle we shouldn't need such optimizations, as the performance improvements resulting from them (things like removing redundant loads, redundant code in inlined functions, etc.) can generally be replicated in the source code. In other words, avoiding dangerous optimizations by just not doing them is boring.
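To make the first idea concrete, here are two textbook cases (hand-written illustrations, not taken from any particular codebase) where an optimizer may delete a check precisely because the C standard makes the triggering condition undefined:

    /* Signed overflow is undefined, so the compiler may assume x + 1
       cannot wrap and fold this whole test to 0. */
    int will_wrap(int x) {
        return x + 1 < x;
    }

    /* Dereferencing p first lets the compiler assume p != NULL, so the
       later null check may be removed entirely. */
    int first_byte(const char *p) {
        char c = *p;
        if (p == NULL)
            return -1;
        return (unsigned char)c;
    }

A boringcc in this first sense would simply compile both checks as written.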
But most real-world security vulnerabilities in C and C++ code today are not caused or affected by compiler optimizations. The ones that are, such as those cited in the first post, are especially insidious, but they are few. (Fewer if you stick in -fwrapv -fno-delete-null-pointer-checks.) It's possible that the proportion could increase in the future, as compilers continue to get smarter, but I don't think it will increase much. Currently it is so low that if the goal is security, even though avoiding unsafe optimizations must be part of a complete memory safety solution, I'd say focusing on it now is basically a red herring.
[Brief digression: a reasonably large fraction of vulnerabilities are caused by integer wrapping, such as malloc(count * sizeof(type)). So even if trapping on signed overflow creates breakage in some existing programs, this must be balanced against the possibility of it mitigating actual vulnerabilities. But of course, size_t is unsigned, so it isn't likely to make that much difference, especially in modern programs. See also grsecurity's Size Overflow compiler plugin.]
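Here is a sketch of that pattern and one way to guard it; note that __builtin_mul_overflow is a GCC/Clang extension (GCC 5+, recent Clang), so the exact spelling is an assumption here rather than part of anyone's proposal:

    #include <stdlib.h>
    #include <stdint.h>

    struct record { char name[64]; uint32_t id; };

    /* If count is attacker-influenced, count * sizeof(struct record) can
       wrap, yielding a short allocation that later writes run off the
       end of. */
    struct record *alloc_records_unsafe(size_t count) {
        return malloc(count * sizeof(struct record));
    }

    /* Guarded version: refuse to allocate if the multiplication overflows. */
    struct record *alloc_records_checked(size_t count) {
        size_t bytes;
        if (__builtin_mul_overflow(count, sizeof(struct record), &bytes))
            return NULL;
        return malloc(bytes);
    }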
The other idea is to make the language actually fully memory safe by having the compiler generate extra checking code, removing undefined behaviors such as the behavior of use after free, misindexing arrays, etc. (I'm not sure I'd call that "boring", since it's an area of active research and a practical implementation would probably require a decent amount of innovation, but that's just semantics.)
This is, of course, possible - some existing programs will break, but not too badly - but whereas the previous idea had a small performance cost, this has an enormous one. There's no getting around it. From the links in the original post:
Fail-Safe C: "Some benchmark results show that the execution time are around 3 to 5 times of the original, natively-compiled programs, in avarage"
MIRO: the report includes various benchmark results, which look generally similar or worse - in a worst case, gzip took 3881 times(!) as long to compress a 26 MB file. Also not fully memory safe.
Here is another one with "108% performance overhead on average": https://www.cis.upenn.edu/~milom/papers/santosh_nagarakatte_phd.pdf
It is likely possible to improve this by doing more extensive static analysis of compiled code, but I don't think there will be any drastic improvements in the near future. There are people for whom 3x overhead is acceptable, but I suspect most users would consider it impractical.
I think it's also worth pointing out that this cannot be done /fully/ without recompiling the world - the perfect should not be the enemy of the good, but still, it's an argument against placing too much importance on avoiding ABI breakage. As an obvious example, memcpy is an external library function, and the compiler has no way to know out of the box that the third argument must be no greater than the minimum size of the referents of the first two pointers. So you special case memcpy, or add some __attribute__ flag so it can be done generically, but that still leaves you with many other library functions that take pointer, size argument pairs. There are other examples. And at the very least you must modify the implementation of malloc and free.
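This is roughly the special-casing that glibc's _FORTIFY_SOURCE already performs for memcpy, by asking the compiler for the destination's size via __builtin_object_size; a simplified sketch (the wrapper name is made up):

    #include <string.h>
    #include <stdlib.h>

    /* dst_size comes from __builtin_object_size(dst, 0), which evaluates
       to (size_t)-1 when the compiler cannot determine the size. */
    static void *checked_memcpy(void *dst, const void *src, size_t n,
                                size_t dst_size) {
        if (dst_size != (size_t)-1 && n > dst_size)
            abort();   /* provable overflow of the destination */
        return memcpy(dst, src, n);
    }

    void copy_name(const char *src, size_t n) {
        char buf[16];
        /* What a fortifying compiler might emit for memcpy(buf, src, n): */
        checked_memcpy(buf, src, n, __builtin_object_size(buf, 0));
    }

That handles the calls the compiler can see through; every other pointer-plus-size library interface still needs its own annotation or wrapper, which is the point above.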
FWIW, the news is not all bad. Especially in modern C++ programs - which significantly mitigate array bounds checking issues compared to typical C ones by using containers like std::vector and std::map rather than directly touching pointers, although of course not completely, and you can do similar things in C (but few do) - a large fraction of vulnerabilities seems to be caused by use after free. This can be mostly eliminated just by using a conservative garbage collector, with much lower overhead. But of course this does not provide complete safety.
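For completeness, here is what that swap looks like with the Boehm-Demers-Weiser conservative collector (assuming libgc is installed; link with -lgc). Allocation goes through GC_MALLOC and explicit free simply disappears, so a dangling pointer keeps its object alive rather than aliasing recycled memory:

    #include <gc.h>      /* Boehm-Demers-Weiser conservative collector */
    #include <stdio.h>

    struct session { int id; };

    int main(void) {
        GC_INIT();

        struct session *s = GC_MALLOC(sizeof *s);
        s->id = 42;

        /* With malloc/free, an early free followed by this use would be a
           classic use-after-free. With the collector there is no explicit
           free: the object stays live while any reachable pointer refers
           to it, and memory is reclaimed only when nothing points at it. */
        printf("%d\n", s->id);
        return 0;
    }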
Fantastic and amazing idea.
I've been thinking about this for more than 10 years, every time I set up the wrong version of Visual Studio or fight with circular library dependencies in MinGW on Windows.
Most gcc versions are not so bad, but you can't tell which version is installed on a given platform, and anyone who uses autotools knows it is just _magic_ bash code.
Hands up, everyone who develops totally cross-platform software! Another painful platform is Mac OS X, whose super-verbose compiler floods my stderr with notifications.
Also, think about replacing autotools with a more human-readable solution like SCons -- it is very cool, but it requires Python!
This opens an opportunity to deprecate behavior that was left undefined in order to support architectures that are no longer available.
This won't solve everything by any means, but will allow some of the icky bits to become defined.
Hello Mr. Bernstein! I think I speak for many if I say: after qmail and
the other .tgz's we like very much, it's "natural" for us to just wait and then
be surprised by a very well-written CC we can download from cr.yp.to :)
</joking>. No one has the right to tell you what you should be doing or researching. But
since the idea has been raised, a design and a feature list (or blacklist) would be something to start
things up. Or starting a community to "just" discuss things? I'm sure many were (I was)
convinced, even if the thread shows some doubts. And many would love to help.
Good luck!
--
Sylwester Lunski
Don't you think the C language is in a hopeless position, where any attempt to define its undefined behavior will break a lot of existing C code?
Personally I think we did ourselves a lot of harm by abandoning Pascal in favor of C. I don't recall any undefined behavior in Pascal, and I was a big fan of its features: modularity, compile-time and runtime checks, the ability to directly inline assembly code, fast single-pass compilation, and good quality of generated code. Had we put as much investment into the development of the Pascal language, we would now be in a much better position with respect to security, performance, and the ability to audit software. But fashion and popularity won, as in many other areas of Computer Science, so now we have suboptimal, but widespread, C and C++.
I think not much can be done about the current state of C. Efforts to change C will take not years but decades. I think a faster way is to slowly move to languages without undefined behavior. That will also be a long process, but at least the time will be spent creating better software, not on the Sisyphean task of fixing the unfixable.
Recompilation with a compiler that detects undefined behavior just results in a compilation error - BOOM! - you get a "so far unknown error found!" message
for "free", and the error needs to be fixed. Because _using_ undefined behavior is usually an omission, or an optimization that is bound to break a bit later.
So it's more a new test case than a breakage.
But in any of those temporarily broken programs, how many lines need to be fixed? 50%? 10%? 0.001% of LoC? IMO that's not a "lot". But the topic is probably
about creating secure software.
--
SL
One nice thing is that my backend is a 1:1 dumb translation of code to assembly. The performance is terrible, but at least testing and auditing the whole thing is far more feasible than testing every code path in something like gcc/clang.
To expand on that, it is about 5k-6k LOC and self-hosting.
Then you have a language that is as fast, portable, low-level and mature as C, without the undefined behavior, but with bounds checks and range validation checks.
It also has nice features such as modules, generics, ranges (which is ideal for limiting values), sub-typing, contracts, parallel processing, a nice bitchy compiler that doesn't allow you to write nasty code (unsafe code is clearly marked), and a large existing Ada code-base.
And that all could be ready within a reasonable time-frame! This CAN be done!
Why not write a user-friendly wrapper around Ada with the look and feel of Go?
Why isn't Ada being used? Lack of familiarity and clutter. Both of these problems can be solved.