| boringcc | D. J. Bernstein | 21/12/15 06:40 | As a boring platform for the portable parts of boring crypto software, I'd like to see a free C compiler that clearly defines, and permanently commits to, carefully designed semantics for everything that's labeled "undefined" or "unspecified" or "implementation-defined" in the C "standard". This compiler will provide a comprehensible foundation for people writing C code, for people auditing C code, and for people formally verifying C code.

For comparison, gcc and clang both feel entitled to arbitrarily change the behavior of "undefined" programs. Pretty much every real-world C program is "undefined" according to the C "standard", and new compiler "optimizations" often produce new security holes in the resulting object code, as illustrated by https://lwn.net/Articles/342330/ https://kb.isc.org/article/AA-01167 and many other examples. Crypto code isn't magically immune to this---one can easily see how today's crypto code audits will be compromised by tomorrow's compiler optimizations, even if the code is slightly too complicated for today's compilers to screw up. A boring C compiler will eliminate these nasty surprises; it will prioritize predictability.

I should note that this plan, throwing away gcc and clang in favor of a boring C compiler, isn't the only possible response to these types of security holes. Here are several other responses that I've seen:

* Attack the messenger. "This code that you've written is undefined, so you're not allowed to comment on compiler behavior!" The most recent time I saw this, another language lawyer then jumped in to argue that the code in question _wasn't_ undefined---as if this side discussion had any relevance to the real issue.

* Say that the C "standard" allows gcc and clang to do whatever they want to all of these "undefined" programs. Yes, we know that it's a stupid "standard"; that isn't the question. What do we do about the resulting security problems?

* Blame the security holes on the C programmers who wrote "undefined" programs. This _might_ be reasonable if there were a plausible plan to reduce the fraction of "undefined" programs from ~100% to ~0%, but there isn't. Even if there were, how would this massive code revision be better than keeping the code and switching to a boring C compiler?

* Claim that earthquakes in the behavior of "undefined" programs will teach C programmers to stop writing such programs. "That'll show 'em!" But the reality is that this hasn't worked and won't work.

* Claim that all we need is for some particular "undefined"-catching tool to be widely used. In fact, these tools are full of false positives and false negatives; at best they catch a few limited types of "undefined" behavior, not changing the big picture.

* Claim that a boring C compiler can't possibly support the desired system _performance_. Even if this were true (which I very much doubt), why would it be more important than system _correctness_?

Overall I think that these people simply don't understand what most C programmers want. A boring C compiler will very quickly gain users---not merely for security but also for predictability in general; people will appreciate, e.g., having variables automatically initialized to 0. Of course, the compiler has to support the usual C ABI, so that programs compiled with this compiler can be linked to libraries compiled with other compilers (including other languages), and vice versa, allowing an incremental upgrade process.

I'm told that the Plan 9 compiler tries to be somewhat predictable---http://article.gmane.org/gmane.os.plan9.general/76989---but it still leaves many things unspecified, and I'm under the impression that it has its own ABI. There's a "Fail-Safe C" compiler that sounds like it defines much more, including safe semantics for buffer overflows---https://staff.aist.go.jp/y.oiwa/FailSafeC/index-en.html---but it's clear that this has its own ABI. There are various projects such as http://www.doc.ic.ac.uk/~awl03/projects/miro/ that add more useful semantics to some parts of C _without_ changing the ABI, but the same compilers will happily screw up other parts of C. There's a "Friendly C" proposal that targets other parts of C more relevant to core crypto---http://blog.regehr.org/archives/1180---but the big picture is still tons of misguided flexibility for the compiler writer.

---Dan |
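The first LWN link above describes the 2009 Linux tun driver incident. A minimal sketch of that bug pattern (types and names simplified for illustration, not the actual kernel code): the dereference happens before the NULL check, so an optimizing compiler may assume the pointer is non-NULL and silently delete the check, while a boring compiler would emit both exactly as written.

```c
#include <stddef.h>

struct sock { int type; };
struct tun_struct { struct sock *sk; };

/* The dereference on the first line is undefined if tun == NULL, so an
 * optimizing compiler is entitled to assume tun != NULL and delete the
 * "redundant" check below, turning a clean error return into an
 * exploitable NULL dereference. */
int tun_chr_poll(struct tun_struct *tun)
{
    struct sock *sk = tun->sk;   /* undefined behavior if tun == NULL */
    if (!tun)                    /* "provably dead" per the C standard */
        return -1;
    return sk->type;
}
```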
| boringcc | drc...@google.com | 21/12/15 12:27 | I agree. How do you feel about defining signed integer overflow as trapping, with unsigned retaining the ring semantics? Or is that too far from C-classic? W/o optimization, it might tend to slow your code down because of all the extra hopefully-never-taken branches, but it would tend to guard against overflow mistakes. |
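A sketch of the trapping semantics proposed here, written with the GCC/Clang `__builtin_add_overflow` builtin; this is roughly the check-plus-branch that a trapping compiler would have to emit for every signed `+`, which is where the slowdown drc@ mentions comes from.

```c
#include <stdlib.h>

/* Every signed addition becomes an overflow test plus a hopefully-
 * never-taken branch: fully defined behavior, at some runtime cost. */
int trapping_add(int a, int b)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        abort();   /* trap instead of leaving the overflow undefined */
    return sum;
}
```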
| boringcc | ol...@oleks.info | 21/12/15 12:53 | Why not just write security-critical code in a more "well-defined" language than C, which is at the same time easy to interface with more frivolous C code? For instance, Standard ML. |
| Re: boringcc | waddlesplash | 21/12/15 12:56 | On Monday, December 21, 2015 at 9:40:44 AM UTC-5, D. J. Bernstein wrote: See also - OpenWatcom's "Safer C Library" (https://web.archive.org/web/20150503192244/http://openwatcom.org/index.php/Safer_C_Library) and Cyclone (though it's more of a whole new language -- http://cyclone.thelanguage.org/). I don't know much about the internals of GCC and/or Clang, but what kind of a problem is this? "Fork and patch Clang" problem? "Fork and patch all of LLVM" problem? Or "nuke everything and start over" problem? -waddlesplash |
| Re: boringcc | Kurt Roeckx | 21/12/15 13:24 | On Mon, Dec 21, 2015 at 12:56:26PM -0800, waddlesplash wrote: The undefined behaviour of C is deliberate, so that compilers can make optimizations. They assume you write code that only has defined meaning and generate code for that defined meaning. That means, for instance, that if you add 2 signed integers they're going to assume the addition doesn't overflow, and then make assumptions based on that about whether other code is ever going to be executed or not. If they can't make such assumptions, they need to generate a lot more code to do the same thing. Both have options to generate code that detects (some of) the undefined behaviour at run time. But that's far from generating code where you turn that undefined behaviour into defined behaviour. I have never looked at either code base, but I assume it's non-obvious to get them to do that. Kurt |
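Kurt's point about no-overflow assumptions, in a minimal example: because signed overflow is undefined, gcc and clang at -O2 typically fold this entire function to `return 1;` and delete the addition and comparison, whereas a compiler with wrapping semantics must keep them (the comparison is false when x == INT_MAX).

```c
/* Under the no-overflow assumption, x + 1 > x is always true, so the
 * whole body can be folded to "return 1". Under wrapping semantics,
 * x + 1 wraps to INT_MIN when x == INT_MAX and the result is 0. */
int plus_one_is_greater(int x)
{
    return x + 1 > x;
}
```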
| Re: boringcc | Tony Arcieri | 21/12/15 13:44 | On Mon, Dec 21, 2015 at 12:53 PM, <ol...@oleks.info> wrote: Why not just write security-critical code in a more "well-defined" language than C, which is at the same time easy to interface with more frivolous C code? For instance, Standard ML. +1. See also: Rust ("ML in C++ clothing", or thereabouts) Tony Arcieri |
| Re: boringcc | brennan...@openmailbox.org | 21/12/15 13:59 | A "boring" programming language and compiler is a much needed tool. The stuff we have today is far too complicated. For example: "Lord knows how GCC does register allocation right now." <https://gcc.gnu.org/wiki/RegisterAllocation>. Must it be C? Is the goal to use it to compile the existing code? I think that some modern ideas could help design a better and simpler language. For example a type system that separates 'public' from 'secret' data: if the variable 'x' has a secret type then "if(x) {" .. could be rejected as a type error. Some of the links were very interesting - they claim to insert runtime tests to ensure type safety. We want type safety but there is a danger that runtime checks could stop things being constant-time. Ideas from Pfenning and Xi (or more modern: F*) might prove useful in statically ensuring that array bounds are respected. Another project in this area was Cyclone <http://www.eecs.harvard.edu/~greg/cyclone/old_cyclone.html> which sadly isn't seeing much use but all the research and code is still available to build on top of. Anyway I really support this idea because the only way we are going to reduce the amount of vulnerabilities in code is by improving the languages and tools we use. Programmers have been using C for long enough to have a good understanding and classification of the kinds of weaknesses that come up - putting that information back into the toolchain is the next important step in improving the quality and security of programs we write. |
| Re: boringcc | kaven...@gmail.com | 21/12/15 14:27 | C/C++ is too old and complicated for no good reason; it wasn't designed well for the modern age, and it shows. There is a lack of a clean design à la C#. It needs a big corporation behind it to clean it up and give it a blueprint, a design plan to straighten things out and steer them in the right direction. There are too many idiosyncrasies in the language that don't stand up well today anymore; that's why C#/Java took over the business world (outside of gaming and some niche markets). I am not saying it's not popular or not used anymore, it is, but I don't see a bright future for it, with computers and datacenters getting faster and faster, and with all the advantages that C# has over C++, or even JavaScript or Java for that matter. I know there are a lot of old programmers here who learned C/C++ in the 80s-90s, didn't change, and have a hard time agreeing with me, but it's the truth when you take an honest look at things. It's not normal for a language to be so "fragile", with a ton of different ways of doing one thing, and all the compilation and linking machinery on top. Things can improve. |
| Re: boringcc | Kenneth Seefried | 21/12/15 14:55 | Not to be "that guy", but as cool as it is to ponder doing this in some bright, shiny language like Rust or ML, Ada exists for a reason. There's even an open source compiler (GNAT) that can already generate code that links with C/C++/etc. code bases. Work would need to go into streamlining the runtime overhead for this use case, but how much work, compared to maintaining some C compiler fork? KJ |
| Re: boringcc | marc.c...@gmail.com | 21/12/15 14:56 | Check out the "simple C compiler" (scc). It has no usable codegen yet, but it's getting there. |
| Re: boringcc | sc0...@gmail.com | 21/12/15 17:02 | Maybe we need some switch to GCC that will make everything defined? |
| Re: boringcc | xua...@gmail.com | 21/12/15 17:06 | > Of course, the compiler has to support the usual C ABI Since accessing an array out of bounds is undefined behavior in C, wouldn't boringC need to keep track of array length at runtime? Doesn't this imply that it must pass array length information when calling a function that takes an array parameter? |
| Re: boringcc | D. J. Bernstein | 21/12/15 21:52 | xua...@gmail.com writes:Here's a definition that would be moderately helpful for security: any read out of bounds produces 0; any write out of bounds is ignored. In a new language this would be a poor choice compared to auto-resizing arrays, for the same reasons that 32-bit integers are a poor choice compared to bigints. However, in a language without a clean exception mechanism, it's not clear how to handle auto-resizing failures. There are other ways to define everything without the "out of bounds" concept: for example, one can define where variables are stored in a flat address space, and then treat all pointer operations the same way the hardware does. But the definition above is simpler and more useful. No. Conceptually, what one wants is a "fat pointer" that knows the start and end of the array that it came from, but this can be physically stored as * a thin pointer p, compatible with the traditional ABI, and * a separate associative array mapping &p to (start,end). This is implemented in, e.g., the MIRO project that I linked to before. ---Dan |
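A minimal sketch of the read-zero/ignore-write definition Dan gives, with the bounds carried explicitly rather than recovered from a side table as in MIRO (the `fat` struct and function names here are hypothetical illustrations, not boringcc's actual design):

```c
#include <stddef.h>

/* The conceptual fat pointer: base address plus length. In the scheme
 * described above, the compiler would maintain this pairing itself,
 * e.g. via an associative array mapping &p to (start,end), so the thin
 * pointer keeps its traditional ABI. */
struct fat { int *base; size_t len; };

int boring_load(struct fat p, size_t i)
{
    return i < p.len ? p.base[i] : 0;   /* any out-of-bounds read produces 0 */
}

void boring_store(struct fat p, size_t i, int v)
{
    if (i < p.len)                      /* any out-of-bounds write is ignored */
        p.base[i] = v;
}
```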
| Re: boringcc | D. J. Bernstein | 21/12/15 22:02 | drc...@google.com writes:This is clearly too far from what programmers expect and rely on: it would break tons of existing C code. If you're designing a new language for general-purpose usage, you should provide auto-resizing bigints, not fake "integers" with sharp edges. But my goal here isn't to design a new language; it's to provide predictable behavior for the existing C code base. ---Dan |
| Re: boringcc | mil...@cs.cornell.edu | 21/12/15 22:56 | You may be interested in investigating CompCert, a verified C compiler with a clear, formal semantics. It's been around for a little while now, and is widely regarded as the gold standard for correct-by-construction compilers. To pick one quote from a study on compiler bugs performed in 2011 (https://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf) "The striking thing about our CompCert results is that the middleend |
| Re: boringcc | Benjamin Fry | 21/12/15 23:00 | > But my goal here isn't to design a new language; it's to provide predictable behavior for the existing C code base. While I believe that you have a better grasp and deeper understanding of the nuances of C than most (I hacked on your qmail project for a while, but that was years ago), I'm not sure there is a way to achieve what you want without in fact creating or switching to a new language. boringC would invariably bless some method of writing code that would invalidate someone else's choices. This would mean that some large portion of existing C would need to be ported to boringC to be successful. Is this less painful than building in a new language? Wouldn't it also need to address memory layout choices? Isn't it possible that we're at a point in lifecycle of C that it's outlived it's usefulness? There are other better languages out there, and one in particular which can match C in performance -and- make the guarantees you're looking for from boringC, while still allowing for somewhat easy back and forth through FFI between the two languages. Why isn't switching to a different language just as viable a choice as picking some standard which will have a similar impact on the community in terms of porting? |
| Re: boringcc | D. J. Bernstein | 22/12/15 06:29 | bluej...@gmail.com writes:No. Imagine, for example, making gcc more boring by picking some version of gcc from a few years ago and specifying that code compiled with boringcc will behave in the same way as code compiled with that version of gcc. You might think (and I do think) that this is a mistake for various reasons, but you can't seriously believe that this will "invalidate" large fractions of the code base. Linux programmers have already tested pretty much every real-world C program they care about with that version of gcc; the code doesn't "need to be ported". There's actually a rather large cloud of compiler versions and compiler options that work for pretty much all of this code. It's crystal clear that, for example, almost the entire code base is compatible with signed addition defining overflow the same way that every modern CPU handles it. Until quite recently, gcc didn't know any other way to compile code! Within the cloud of compiler behaviors that are compatible with the existing code base, boringcc can make choices that are simpler and safer than what gcc has ever done. For example, it's obviously a win for boringcc to initialize everything to 0, rather than defining initial values in terms of previous stack frames etc. Sure, the lack of previous testing of this initialization means that this initialization _might_ break some code, but I bet you'll have to look through quite a few deployed programs before you find one that would be broken by this. There's tons of effort to build and use better languages, often with the long-term goal of replacing today's massive C code base. However, the short-term situation is that misguided gcc "optimizations" are a huge security threat, and boringcc will eliminate that threat much more quickly than new languages will, precisely because it _will_ work with typical C code. 
Building a new compiler is orders of magnitude less work than rewriting the C code base in new languages. ---Dan |
| boringcc | Jeremiah Gowdy | 22/12/15 06:51 | I agree that there is a place for different semantics than what standard C provides. However, I would suggest that throwing away gcc and Clang altogether doesn't have to be the answer to that. Why couldn't you treat it as a front-end issue? Define a subset of C ("BoringC") that is interoperable with C from an ABI perspective, but which provides the semantics you desire. Then look to write a "BoringC" front-end into Clang. From there you should be able to control most of the optimizations that would interfere with your desired behavior, and generate LLVM IR that satisfies your requirements. You still get all the free labor from the LLVM backends for supported platforms. There seems to be a lot of effort focused on LLVM right now. It would be nice if it were possible to include separation of the control flow stack from the local data stack, as featured in SafeStack. Segmenting control flow information from local data just makes sense. http://clang.llvm.org/docs/SafeStack.html Even if you discard Clang entirely and develop your own front-end compiler, I would suggest that you target LLVM IR for the backend at least initially. It would get you off the ground faster, and then you could consider writing your own backend if necessary or desirable. |
| Re: boringcc | funny....@gmail.com | 22/12/15 06:52 | Go(lang) ? |
| Re: boringcc | D. J. Bernstein | 22/12/15 07:02 | mil...@cs.cornell.edu writes: The clarity, formality, and verification of CompCert are nice features, but I don't see how they address the dangers that I'm talking about. Specifically, these features don't even try to stop the compiler from adding an "optimization" that's guaranteed safe for programs obeying the C "standard" but that breaks the other 100% of the deployed C code base.

CompCert also isn't free software---the authors say that you have to pay them if you want to use CompCert for something other than research. They've put the software online, which makes this hard to enforce, but most people will do what they're told. Putting the software online also does a masterful job of interfering with funding for any academics interested in writing a free version.

Another project that isn't even trying to do what I want is gcc -fsanitize=undefined. To see that this option isn't actually _defining_ behavior, try compiling

   main() { int x = 0x7ffffffe; return x < x + 4; }

with gcc-4.9.2 -fsanitize=undefined, and you'll see

   x.c:4:16: runtime error: signed integer overflow: 2147483646 + 4 cannot be represented in type 'int'

with return value 0, while gcc-4.9.2 -fsanitize=undefined -O3 produces no output and return value 1.

Even worse, -fsanitize=undefined breaks so much code that it has no hope of being deployed for serious programs. -fsanitize=undefined is the C version of the Westboro Baptist Church shouting "SINNERS! YOU'RE ALL GOING TO HELL!"---remarkably ineffective at stopping the "sinners" and remarkably effective at getting people to stop listening.

---Dan |
| Re: boringcc | ndt...@gmail.com | 22/12/15 07:11 | On Monday, December 21, 2015 at 21:53:44 UTC+1, ol...@oleks.info wrote: What is "well-defined"? SML and its execution behavior are well-defined, safe, etc. only as long as you consider the language apart from its actual implementation and the hardware that implementation runs on. SML has array-out-of-bounds exceptions, as does e.g. Java, which can lead to the same kinds of vulnerabilities as in 1) 1) http://armoredbarista.blogspot.de/2014/04/easter-hack-even-more-critical-bugs-in.html |
| boringcc | andrew0...@gmail.com | 22/12/15 10:43 | Wouldn't it be better/simpler to add a switch to clang or GCC that emits warnings (or errors) for any "undefined" or "unspecified" or "implementation-defined" construct in the compiled code? This assumes that someone knows what all these things are, which I'm not sure anyone does. This would allow people writing crypto libraries to turn on the switch, and then adjust their source code until it compiles cleanly, and then it would work--hopefully in a consistent, defined way--on any other compiler. Certainly ending up with portable, boring crypto code is better than not, right? |
| Re: boringcc | Benjamin Fry | 22/12/15 12:17 | > Imagine, for example, making gcc more boring by picking some version of Ok, this makes much more sense to me, and clarifies what you want to do. Others have mentioned LLVM; it would be interesting to see which of these issues would land in the LLVM backend, with the front-end becoming a better-defined Clang. Then whatever features are built into LLVM would improve all LLVM-based languages, like Rust (my personal choice for all new systems-level development). Thanks for clarifying. -ben |
| Re: boringcc | com...@gmail.com | 22/12/15 14:05 | It seems like there are two very different ideas being grouped together here. The first is changing the compiler to not make /optimizations/ that take advantage of undefined behavior (so the assembly is "what you'd expect", nothing like null checks being eliminated or signed overflow generating value range assumptions that turn out to be false, seemingly causing logic to fall apart, etc.), and maybe emit some sequences to fix CPU inconsistencies like shift count overflow. This is attractive to talk about, because it's easy to blame the compiler for breaking one's code; because we know that it can be done without too much performance impact, as old C compilers essentially behaved this way; because the set of undefined behaviors is somewhat arbitrary (e.g. most unsigned integers in typical programs are not expected to overflow, but compilers can only optimize for the signed case), and because in principle we shouldn't need such optimizations, as the performance improvements resulting from them (things like removing redundant loads, redundant code in inlined functions, etc.) can generally be replicated in the source code. In other words, avoiding dangerous optimizations by just not doing them is boring. But most real-world security vulnerabilities in C and C++ code today are not caused or affected by compiler optimizations. The ones that are, such as those cited in the first post, are especially insidious, but they are few. (Fewer if you stick in -fwrapv -fno-delete-null-pointer-checks.) It's possible that the proportion could increase in the future, as compilers continue to get smarter, but I don't think it will increase much. Currently it is so low that if the goal is security, even though avoiding unsafe optimizations must be part of a complete memory safety solution, I'd say focusing on it now is basically a red herring. 
[Brief digression: a reasonably large fraction of vulnerabilities are caused by integer wrapping, such as malloc(count * sizeof(type)). So even if trapping on signed overflow creates breakage in some existing programs, this must be balanced against the possibility of it mitigating actual vulnerabilities. But of course, size_t is unsigned, so it isn't likely to make that much difference, especially in modern programs. See also grsecurity's Size Overflow compiler plugin.] The other idea is to make the language actually fully memory safe by having the compiler generate extra checking code, removing undefined behaviors such as the behavior of use after free, misindexing arrays, etc. (I'm not sure I'd call that "boring", since it's an area of active research and a practical implementation would probably require a decent amount of innovation, but that's just semantics.) This is, of course, possible - some existing programs will break, but not too badly - but whereas the previous idea had a small performance cost, this has an enormous one. There's no getting around it. From the links in the original post: Fail-Safe C: "Some benchmark results show that the execution time are around 3 to 5 times of the original, natively-compiled programs, in avarage" MIRO: the report includes various benchmark results, which look generally similar or worse - in a worst case, gzip took 3881 times(!) as long to compress a 26 MB file. Also not fully memory safe. Here is another one with "108% performance overhead on average": https://www.cis.upenn.edu/~milom/papers/santosh_nagarakatte_phd.pdf It is likely possible to improve this by doing more extensive static analysis of compiled code, but I don't think there will be any drastic improvements in the near future. There are people for whom 3x overhead is acceptable, but I suspect most users would consider it impractical. 
I think it's also worth pointing out that this cannot be done /fully/ without recompiling the world - the perfect should not be the enemy of the good, but still, it's an argument against placing too much importance on avoiding ABI breakage. As an obvious example, memcpy is an external library function, and the compiler has no way to know out of the box that the third argument must be no greater than the minimum size of the referents of the first two pointers. So you special case memcpy, or add some __attribute__ flag so it can be done generically, but that still leaves you with many other library functions that take pointer, size argument pairs. There are other examples. And at the very least you must modify the implementation of malloc and free. FWIW, the news is not all bad. Especially in modern C++ programs - which significantly mitigate array bounds checking issues compared to typical C ones by using containers like std::vector and std::map rather than directly touching pointers, although of course not completely, and you can do similar things in C (but few do) - a large fraction of vulnerabilities seems to be caused by use after free. This can be mostly eliminated just by using a conservative garbage collector, with much lower overhead. But of course this does not provide complete safety. |
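The `malloc(count * sizeof(type))` wrap mentioned in the digression above, and the standard guard against it, checking the multiplication before allocating (essentially what calloc is required to do internally). `malloc_array` is a hypothetical helper for illustration, not a libc function.

```c
#include <stdint.h>
#include <stdlib.h>

/* If count * size wraps modulo SIZE_MAX + 1, malloc happily returns a
 * too-small block and the subsequent stores overflow it. Testing the
 * multiplication first turns the wrap into a clean allocation failure. */
void *malloc_array(size_t count, size_t size)
{
    if (size != 0 && count > SIZE_MAX / size)
        return NULL;              /* product would wrap: refuse */
    return malloc(count * size);
}
```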
| Re: boringcc | johnn...@gmail.com | 23/12/15 09:57 | someone reposted it here http://bbs.progrider.org/prog/read/1450729336 |
| Re: boringcc | D. J. Bernstein | 23/12/15 17:44 | I agree that implementing a simple definition of the behavior of
out-of-bounds array access, especially with ABI compatibility, is slightly more difficult than implementing a simple definition of the behavior of "int" arithmetic. It's nevertheless clear that putting this burden on the compiler is much, much, much better than forcing the programmer to constantly worry about it. The desired situation, the boring situation, is the compiler implementing simple, stable definitions of _all_ language constructs. The historical reason for keeping out-of-bounds array access undefined is the same as the historical reason for keeping int overflow undefined: a misguided pursuit of performance. So I don't agree that "optimization" is the right word for separating these two types of undefined behavior. The NULL screwups in gcc, for example, were certainly intended to be "optimizations", and relied on quite a bit of gcc implementation work. com...@gmail.com writes: [ referring specifically to optimizations _added_ to gcc ] > But most real-world security vulnerabilities in C and C++ code today > are not caused or affected by compiler optimizations. The ones that > are, such as those cited in the first post, are especially insidious, > but they are few. Let's focus on crypto, the topic of this mailing list. There's a continuing trend towards simpler and simpler crypto code, with more and more testing, auditing, and verification. I'm sympathetic to the "Obviously none of this should be written in C" view, but getting people to switch to a new language is _much_ tougher than getting them to switch to a new compiler, so it's not a surprise that tons of critical crypto code is written in C. In this environment, the ill-defined behavior of C compilers is one of the largest remaining threats to security, and it's hard to accept the argument that this threat should be ignored because there are also other threats. 
It's clear that gcc -fwrapv is an improvement, but -fwrapv is only one out of hundreds of issues; asking programmers to * wait in fear for the gcc authors to screw up something new, * wait even longer for gcc to add a "fix this" option, and * manually deploy that option means preserving the threat for many years. This is also why it's a mistake to try to identify some top issues (as in "Friendly-C") and hope that those are the only issues. > Brief digression: a reasonably large fraction of vulnerabilities are > caused by integer wrapping, such as malloc(count * sizeof(type)). The reliable way to stop buffer overflows is to have bounds-checking enforced by the compiler. Bounds-checking is compatible with typical C code; it doesn't need a massive porting effort. You're focusing on the details of a particular type of miscalculation of the buffer size. Obviously a new language with automatic bigints would get this particular calculation right. Obviously a new language with overflow exceptions, while less satisfactory in general, would have the same effect for this particular pattern---it doesn't matter here whether the exception is in the calculation or in the memory allocation a moment later. But what I can't figure out is what behavior you're suggesting _for a C compiler_. A trap-on-signed-overflow option * isn't what gcc -fsanitize=undefined does, * would be even less deployable than gcc -fsanitize=undefined, and * would, as you note, fail to fix size_t overflows. So you're talking about something that is unimplemented, undeployable, and unrelated to the security problem you're talking about. Today boringcc is also unimplemented, but it would be deployable and would solve the security problem. > Fail-Safe C: "Some benchmark results show that the execution time are > around 3 to 5 times of the original, natively-compiled programs, in > avarage" First of all, _if_ there's a conflict between security and performance, then security has to win. 
Second, there are many deployed C programs where nobody will even notice a 3x slowdown. I see tons of code that fritters away incredible amounts of CPU time---but in most cases this slowdown simply doesn't matter for the end user. Many programs can be (and sometimes are being) rewritten in Python without anyone noticing the performance difference. Third, even within the programs that run hot, the CPU time is spent in a tiny fraction of the code. The main reason for boringcc to work with the existing ABI is for interoperability with existing libraries (and vice versa), but a side benefit is that tons of code compiled with boringcc can be linked with a few hot spots that haven't upgraded to boringcc. Maintaining ABI compatibility isn't rocket science. Fourth, the modern trend is to rewrite hot spots in domain-specific languages; see http://cr.yp.to/talks.html#2015.04.16 for an analysis of how and why this is happening. This makes the performance of C code even less relevant to the user---while we still need a compiler that can handle all the remaining C code. These effects are particularly obvious in crypto. Fifth, the slowdown that you're talking about is from a compiler without any serious bounds-checking optimizations. This is like benchmarking an ancient C compiler many years ago (think gcc -O0) and concluding that nobody will ever switch from asm to C. ---Dan |
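The fifth point can be made concrete with a sketch of one bounds-checking optimization (the function here is illustrative, not from any real compiler): instead of testing every access, clamp the trip count once before the loop. Under the "reads out of bounds produce 0" definition given earlier in the thread, the clamped loop computes the same sum, since each skipped iteration would have added 0, so the checked loop runs at nearly the unchecked speed.

```c
#include <stddef.h>

/* One hoisted check replaces n per-access checks. Clamping n to len is
 * observably equivalent to checking each access and substituting 0 for
 * every out-of-bounds read. */
long sum_first_n(const int *a, size_t len, size_t n)
{
    long s = 0;
    if (n > len)
        n = len;                  /* single check for the whole loop */
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```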
| Re: boringcc | vit...@gmail.com | 25/12/15 17:13 | > As a boring platform for the portable parts of boring crypto software, > I'd like to see a free C compiler that clearly defines, and permanently > commits to, carefully designed semantics for everything that's labeled > "undefined" or "unspecified" or "implementation-defined" in the C > "standard". This compiler will provide a comprehensible foundation for > people writing C code, for people auditing C code, and for people > formally verifying C code. Fantastic and amazing idea. I have been thinking about this for more than 10 years, every time I set up the wrong version of Visual Studio or fight with circular library dependencies in MinGW on Windows. Most gcc issues are not so bad, but you can't tell what version is installed on a given platform, and anyone who uses autotools knows it is just _magic_ bash code. Hands up, everyone who develops totally cross-platform software! Another painful platform is Mac OS X, whose super-verbose compiler floods my stderr with notifications. Also think about replacing autotools with a more human-readable solution like SCons; it is very cool but requires Python! |
| unk...@googlegroups.com | 26/12/15 06:43 | <This message has been deleted.> | |
| Re: boringcc | vom...@google.com | 26/12/15 11:32 | It's worth mentioning that there were originally two reasons for leaving things undefined in the language definition. The first was performance, as many have mentioned; dropping checks for array bounds falls into this category. The second was that different processors do different things. This, for example, is why overflow of unsigned values is defined but overflow of signed values is not. Here's a critical point: many of the systems that behaved differently just aren't around any more. For example, I don't know of any 1's-complement systems outside of museums. This opens an opportunity to define behavior that was left undefined only to accommodate architectures that are no longer available. That won't solve everything by any means, but it will allow some of the icky bits to become defined. |
| Re: boringcc | comex | 26/12/15 23:04 | On Wed, Dec 23, 2015 at 8:44 PM, D. J. Bernstein <d...@cr.yp.to> wrote:

I'm responding to this first, with some random thoughts, because it's probably the most interesting/relevant thing to talk about. I'm actually pretty curious how much of the overhead in these experiments is essential. Perhaps my previous post was a bit pessimistic, but when multiple independent academic attempts at the same goal have resulted in relatively similar overheads, I suspect that doing much better is genuinely difficult. Yet it is odd that they do (arguably) considerably worse than not only languages that require a lot of manual annotation to ensure safety (Rust), but also those that merely rely on garbage collection and restricted manual pointer operations, at a performance cost (Go, Java). AFAICT this is partly because there is a lot to keep track of: - To avoid use after free, the papers I've looked at use either a "lock and key" scheme (metadata maps each allocation to a unique nonce, and each pointer to its referent's nonce at the time the pointer was created; each malloc changes the nonce; each dereference compares the two) or a conservative GC. The former requires extra loads and tests on a large fraction of dereferences; in the latter case, any GC adds some timing unpredictability, and conservative GC itself has some overhead relative to precise GC, though not an unacceptable amount. Meanwhile, both Go and Java use precise GC: the former mark-and-sweep, the latter copying. Copying is *definitely* out for C, as it would silently break things like pointer comparisons and hash tables, but precise mark-and-sweep might not totally be... if we could accurately predict what (struct) types allocations were being used for.
At least two recent papers opt for "lock and key" and justify it in part as being more useful for debugging than GC, since mistakes in the original use of malloc/free can be detected rather than merely worked around; but if the goal is to always use the protection in production, helpfulness for debugging is not terribly valuable. That said, predictability is always nice. But GC is the only approach that need not add any overhead to dereferences (at least reads, depending on concurrency), and therefore the only one capable of recovering something close to full original performance if conditions are optimal. It is also probably more robust in the presence of threading, most notably because a thread could be in the middle of accessing a pointer when another frees it (and could be paused indefinitely in that state by the kernel). For loads this could be mostly rectified (at the cost of confusing compiler optimizations) by loading the pointer before the lock value, and inserting appropriate barriers on architectures where this is necessary for ordering; but that doesn't work for stores, so I think this would require some sort of RCU-like thread progress checking, or else inspection of thread PCs before allocations could actually be reused, though I might be missing something simpler. Also, since the key and the pointer value cannot be updated together atomically, if improperly synchronized accesses (including code that assumes individual stores to pointer-typed variables are atomic) are as common in legacy C code as I think they are, a thread that observed only one of them being changed could crash spuriously. (It might be possible to check for this in the crash handler and fix up execution, but at the cost of significantly complicating the CFG. Ew.)
A downside to GC is that it is somewhat more dangerous if applications obfuscate pointers to save memory: GC missing a pointer may lead to use after free, while a lock-and-key approach is more likely to just crash on a pointer whose metadata has gotten lost in transit. I suppose a primarily GC-based approach could do some sort of type check upon casting from integer to pointer, if a type could be identified with some dumb heuristics, and complain at compile time otherwise. Both cases require modifying the original source to rectify properly, so this complaint does not represent a reduced "just works" factor. - And then to avoid out-of-bounds indexing, every pointer has bounds stored for it in a table. The other languages simply don't allow pointer arithmetic outside of special array structures; but most pointers never have arithmetic done on them in C either [citation needed]. It's apparently necessary to store bounds information with pointers rather than with their referents, because temporarily out-of-bounds pointers are common (including but not limited to the standards-legal one-past-the-end, of course). Aside from representing overhead in all cases, this too is tricky to make safe in the presence of threading: I think you would need to load the base and size in the metadata table atomically, and since x86 can't even do 16-byte atomic loads properly (unless you use MPX?), that means stuffing them into a single int64 somehow. I guess you could have a 32-bit address space (ew). And that doesn't even solve the spurious-panic issue. But if we could guarantee that some pointer variables/fields were never used for indexing, there would at least be no need to manage this information for them. In this and the previous case, a little more performance could be eked out if we could essentially identify the subset of C code that acts like Go or Java code - no pointer arithmetic and no casts.
Identifying that Go/Java-like subset can be done statically - I think with reasonable recall using flow-insensitive analyses (more robust with some sensitivity), but only with whole-program optimization, which generally has a lot of arguments against it. And it would be fairly fragile/unpredictable, as a single unanalyzable cast (or memcpy, etc.) involving a popular type could deoptimize all the code that uses that type. Or maybe you would rather eat the performance cost for the sake of predictability. Though even then, I think it will be necessary to do some ugly-feeling things to avoid unsafety in edge cases, such as the issue I mentioned above with pointers stored as integers. Please let me know if I'm completely barking up the wrong tree in some areas. For the sake of completeness, on to some less technical points:

> [..]
> In this environment, the ill-defined behavior of C compilers is one of

It's true that I'm looking at this from my typical viewpoint, and my specialty is not cryptography. But core crypto code is typically linked into codebases multiple orders of magnitude larger, which (especially on the client side, but also on the server) almost invariably suffer a constant hemorrhage of real, exploitable (often actually exploited) vulnerabilities. If compiler optimizations are one of the largest remaining threats to such core code, despite historically having been known to cause an actual vulnerability in /any/ code only on rare occasions, then even though they are laudable to address, as long as that hemorrhage continues, putting any significant focus on them seems misdirected. But there is little reason to debate the point; since you are not interested in compromising for performance on the more fundamental (non-optimization-related) undefined behaviors, but really want to go for airtight memory safety, every base must be covered, and avoiding potentially dangerous compiler optimizations is of course one of them. It is just a question of emphasis.
(Also, there are exceptions to my claim of innocuousness: in particular, I think theoretical vulnerabilities caused by buffer-zeroing code being optimized away are quite common, although real-world exploits are not.) This part of my post was more of a quick thought, not fully thought out, but basically this is another thing that only really matters if you're making compromises. As you say, it's not necessary if the undersized buffer is bounds-checked anyway. (However, I'm not sure it's as undeployable as you say. Also, while size_t is indeed more important, signed overflows are not "unrelated", because they do sometimes cause size-overflow vulnerabilities in practice - int is sometimes used for sizes, especially in old code, but also in new code following misguided coding standards like Google's.) (It's off topic, but having previously seen that, I'm more swayed by the conventional wisdom, which is pretty much the exact opposite. If even in 2015 most programs are pretty awful at taking advantage of SMP and parallelizing across 4 or even 2 cores, there is little hope of meshes of hundreds becoming the norm in the near future, even if /some/ applications use them... meanwhile, single-core speed is reaching an asymptote, and has indeed regressed in recent years after factoring in the importance of underutilizing it to save power, making optimizing compilers more important than ever. I guess you've heard all this before.) True. (And yet Python is losing mindshare these days in part due to its slowness, and the reverse for Go. Supposedly.) |
| boringcc | slu...@gmail.com | 05/01/16 04:16 | > As a boring platform for the portable parts of boring crypto software, > I'd like to see a free C compiler... Hallo Mr. Bernstein! I think I speak for many if I say: after qmail and </joking>. No one has the right to tell you what you should be doing/researching. But good luck! -- |
| Re: boringcc | choj...@gmail.com | 07/01/16 03:29 | > This is clearly too far from what programmers expect and rely on: it > would break tons of existing C code. Don't you think the C language is in a hopeless position, where any attempt to define its undefined behavior will break a lot of existing C code? Personally I think we did ourselves a lot of harm by abandoning Pascal in favor of C. I don't recall any undefined behavior in Pascal, and I was a big fan of its features: modularity, compile-time and runtime checks, the ability to directly inline assembly code, fast single-pass compilation, and the good quality of generated code. Had we put as much investment into the development of Pascal, we would now be in a much better position with respect to security, performance, and the ability to audit software. But fashion and popularity won, as in many other areas of Computer Science, so now we have suboptimal, but widespread, C and C++. I think not much can be done about the current state of C. Efforts to change C will take not years but decades. I think a faster way is to slowly move to languages without undefined behavior. This will also be a long process, but at least the time will be spent creating better software, not on the Sisyphean task of fixing the unfixable. |
| Re: boringcc | slu...@gmail.com | 11/01/16 01:16 | > Don't you think the C language is in a hopeless position, where any attempt to define its undefined behavior will break a lot of existing C code? Recompiling with a compiler that detects undefined behavior just results in a compilation error - BUM! - you get a "So far unknown error found!" message. But in any of those temporarily broken programs, how many lines need to be fixed? 50%? 10%? 0.001% of LoC? IMO that's not a "lot". But the topic is probably -- |
| Re: boringcc | andrewc...@gmail.com | 14/01/16 19:46 | I've written an incredibly stupid C compiler: https://github.com/andrewchambers/c. It is far from complete and full of bugs. One nice thing is that my backend is a 1:1 dumb translation of code to assembly; the performance is terrible, but at least testing and auditing the whole thing is far more feasible than testing every code path in something like gcc/clang. |
| Re: boringcc | andrewc...@gmail.com | 14/01/16 19:49 | On Friday, January 15, 2016 at 4:46:11 PM UTC+13, andrewc...@gmail.com wrote:

To expand on that, it is about 5k-6k LOC and self-hosting. |
| Re: boringcc | vuurspugen...@gmail.com | 02/02/16 22:43 | Why not write a user-friendly wrapper around Ada with the look and feel of Go? That would probably only cost a spec doc, 10,000 LOC, and a couple of new file extensions. Then you would have a language that is as fast, portable, low-level, and mature as C, without the undefined behavior, but with bounds checks and range validation checks. It also has nice features such as modules, generics, ranges (which are ideal for limiting values), sub-typing, contracts, parallel processing, a nice bitchy compiler that doesn't allow you to write nasty code (unsafe code is clearly marked), and a large existing Ada code-base. And all of that could be ready within a reasonable time-frame! This CAN be done! |
| Re: boringcc | Kenneth Seefried | 03/02/16 05:28 | On Wed, Feb 3, 2016 at 1:43 AM, <vuurspugen...@gmail.com> wrote:

> Why not write a user friendly wrapper around Ada with the look and feel of Go?

I continue to be mystified by this attitude of "well yes, Ada solves all these problems, but I'm not willing to actually *use* it...". |
| Re: boringcc | vuurspugen...@gmail.com | 03/02/16 06:34 | Op woensdag 3 februari 2016 14:28:20 UTC+1 schreef Kenneth Seefried:
Why isn't Ada being used? Lack of familiarity and clutter. Both of these problems can be solved. |
| Re: boringcc | Daira Hopwood | 03/02/16 17:40 | On 07/01/16 11:29, choj...@gmail.com wrote:

I don't see any reason to believe that those bugs will be more serious than the bugs C programmers are used to dealing with when porting to a new compiler or platform. Memory-safe languages are great (I'm designing one myself), and they're certainly the present and future of language design. That's no excuse not to fix C. -- Daira Hopwood ⚧Ⓐ |
| Re: boringcc | true.c...@gmail.com | 22/07/16 08:03 | Late to the discussion, but there is the RV-Match [1] tool by the startup Runtime Verification, which crashes cleanly on all undefined behavior.
I think it matches the description of "boring compiler" fairly well. 1. https://runtimeverification.com/match/1.0-SNAPSHOT/docs/ On Thursday, February 4, 2016 at 2:40:37 AM UTC+1, Daira Hopwood wrote: |
| Re: boringcc | rigger...@gmail.com | 28/08/17 05:46 | Even later to the discussion, but we have come up with a "Lenient C" dialect and implemented it in Safe Sulong, a system for executing LLVM-based languages on the JVM: http://ssw.jku.at/General/Staff/ManuelRigger/ManLang17.pdf. |