What steps will reproduce the problem? 1. InTextSegment makes the assumption that everything between _start and data_start is read-only. This is not always the case - I'm running into a situation where mmap() is returning memory located between the .text and .data segments. This leads to a failed assertion, after memory is overwritten that ctemplate thought was immutable. (I don't have a good way to isolate and reproduce this, unfortunately.)
2. As a related but less important issue: Shared libraries get relocated past data_start, so their string literals will always be treated as "mutable", and won't benefit from the optimization. StaticTemplateString seems to be a decent alternative, though.
What is the expected output? What do you see instead?
ugh, that bit of code was always a bit of a hack and now it's coming back to bite us. :-( What I really want to capture is the addresses of the beginning and end of .text, but that's sadly not exposed via gcc. Though honestly I'm surprised that in your system, .data isn't mapped in right after .text.
I'm wondering if it makes sense to figure out where text_end is by analyzing the binary itself somehow. It should be possible, and I see it done in places like this: http://geco.mines.edu/papi/html/papif_get_exe_info.html But it may be a lot of work. I'm not sure what the best path is here. Making it a compile-time option, like you suggest, would work, but I doubt anyone would turn it on (especially since it's hard to guarantee it will be safe).
I think your suggestion is probably the best one, but I don't know under what situations to recommend that people turn it on. (Making it a runtime option could help, but that will kill any of the speed advantages, probably.) I'm not
Thinking a bit more about what I said about shared library text segments, I realized those can't be safely considered immutable after all, particularly in programs that use dlopen/dlclose. The only "safe" segment does seem to be the text of the main executable.
Looking at the papi sources, they actually do use a /proc/%d/maps approach for Linux targets. However, rather than look for the _start symbol, they're looking for a read-only, executable map with a filename that matches readlink("/proc/%d/exe").
Thanks for doing this research! It's good to know the options.
I'm not sure how I feel about doing a file I/O operation to implement what may well be a very minor optimization. We should maybe just encourage folks to use STS_INIT more for creating template strings, rather than trying to guess whether someone is using a static string (that the compiler put in the text segment).
I'm writing to some local linker experts to see if they have any ideas.
An interesting development. Here's what the local expert had to say: --- The situation described in that bug report can occur when using an old version of glibc which had a bug which permitted this case to happen. For OS they list RHEL 5 / Ubuntu 10.4. I'm not sure what that means. I would be horrified if the glibc bug existed on Ubuntu 10.4. I would not be at all surprised if it existed on RHEL 5.
Were you only seeing the weird behavior on rhel 5?
He goes on to say: --- The fixed glibc should be mprotecting the pages between the end of the read segment and the start of the write segment such that they are not allocated by future calls to mmap. I don't know how that will show up in /proc/self/maps.
All that said, I now wonder whether glibc only fills in holes in shared libraries, and whether perhaps nothing fills in holes in the executable itself. I don't know. --- so that leaves me not certain of what's going on. But if you are only seeing the problem on RHEL 5, then we can try to protect this code with some libc detection, perhaps.
After looking at my Ubuntu 10.04 machine again, it looks like I had misread the process map originally. RHEL5 was the only place where the issue actually occurred.
By the way, I -finally- managed to reproduce the allocation issue just now on the x86_64 RHEL5 machine. The two keys to reproducing it were: 1) Position-independent executable (gcc -pie -fPIC) 2) The process must fork. The issue affects the child, but not the parent.
I wonder which of these issues are necessary to reproduce the problem in theory, and which just happened to be necessary to trigger it in your case. We could perhaps just turn off this optimization if people compiled with -fPIE, for instance, if that was a necessary condition. But I worry the problem could trigger without -PIE, maybe with more difficulty.
If the problem really is with RHEL5 only, maybe we could just turn it off for older libc's (it looks like RHEL5 is 4.5 years old now?)
It looks like the behavior still exists in current kernels (tested with 3.0.0), but only commonly occurs under specific conditions. The kernel has two different strategies for finding unused memory for mmap: arch_get_unmapped_area (the legacy "bottomup" layout), and arch_get_unmapped_area_topdown. I have only been able to reproduce this in the "bottomup" memory layout.
There are four cases where recent kernels will use bottomup instead of topdown: 1) The topdown mmap fails, and tries bottomup as a fallback (Kernel comment: /* A failed mmap() very likely causes application failure, so fall back to the bottom-up function here. This scenario can happen with large stack limits and large mmap() allocations. */) 2) The ADDR_COMPAT_LAYOUT personality flag is set: setarch x86_64 --addr-compat-layout ./a.out 3) The stack size is unlimited: ulimit -s unlimited ; ./a.out 4) The "vm.legacy_va_layout" sysctl is turned on: sudo sysctl -w vm.legacy_va_layout=1 ; ./a.out
Thanks for all the research! It makes sense that topdown vs bottomup would affect how likely this is to occur, since the address space in question is pretty low in the virtual address space.
It still leaves open the question of whether any part of libc is supposed to protect the page having .text on it, to make it not writable even after the end of the text segment.
Still not sure what the best course forward might be. Any suggestions?
While tricks such as /proc/$$/maps (checked once at startup) might work, the cross-platform support seems like more hassle than it's worth.
One possibility is for the user to supply the end-of-text pointer to ctemplate. The "etext" symbol appears to be commonly available. Although ctemplate wouldn't be able to use it directly (doing so would likely bind to the "etext" of libctemplate.so), the main() program could bind etext and supply a pointer to ctemplate.
extern char etext;
int main() { ctemplate::set_etext_pointer(&etext); // ...
}
By defaulting the value to NULL, the optimization effectively defaults to disabled. Those users who (a) need it, and (b) know what they're doing, have the option of enabling it.
To further refine the idea, the main program might also supply its own "start" pointer as well. This would be useful for applications where the user of ctemplate is not the main executable, but rather another shared library.
Looking at this, in preparation for a new release (hopefully next week!)
I'm wondering if it's not a simpler fix to just replace data_start with _etext. It looks like the gnu binutils supports _etext, and I think it's the one that supports data_start as well, so there's no compatibility issues. And since the reference to _etext happens in a .h file, _etext will be evaluated in the context of the main executable, not libctemplate, so the concerns you have there wouldn't apply either.
Assuming there are no pitfalls -- and this is all poorly documented, so I have no idea if _etext is as portable as data_start or not -- I'll aim for that solution.
That's good to hear. When I tested it, __etext did what we want it to do (especially considering that the function is inlined in a header file.) However, what about the "_start" symbol, which would only be valid in the context of the main executable? Does the linker provide an alternate way to reliably find the start of the current text segment?
Both __etext and _start "should" point to the main executable. I think that is what we want in this case. My expectation is that most uses of TemplateString will be in the main executable. If not, it's just a little bit of optimization lost.
I attached some test cases involving various combinations of _start/__etext referenced by the main executable, a shared library, and the same shared library loaded by dlopen.
The case of dlopen(... RTLD_DEEPBIND) will become an issue if/when the php-ctemplate (http://code.google.com/p/php-ctemplate/) project revives itself; PHP extensions are loaded with dlopen(..., RTLD_LAZY | RTLD_GLOBAL | RTLD_DEEPBIND).
Here's the output of the etext-test "make test" on my Ubuntu machine:
Test 1 - _start/__etext referenced by main executable, no libraries main: _start=0x400490 __etext=0x400696
Test 2 - _start/__etext referenced by library, but not by main executable shlib: _start=0x400550 __etext=0x400746
Test 3 - _start/__etext referenced by dlopen() library, but not by main executable dl: _start=(nil) __etext=0x2b44414456a6
Test 4 - _start/__etext referenced both by dlopen() library and by main executable main: _start=0x400690 __etext=0x400956 dl: _start=0x400690 __etext=0x400956
Test 5 - _start/__etext referenced both by dlopen() library with RTLD_DEEPBIND, and by main executable main: _start=0x400690 __etext=0x400956 dl: _start=0x400690 __etext=0x2ad51f4fe6a6
Test 6 - _start/__etext referenced dlopen() library with RTLD_DEEPBIND, but not by main executable dl: _start=(nil) __etext=0x2b2b4d7126a6
This is good to know -- thanks for doing the tests.
DEEPBIND indeed can cause trouble -- I've had problems with it before. Though given test 3, it looks like dlopen() is a problem even without DEEPBIND? It looks like as long as code doesn't use dlopen(), _start and __etext behave sanely, but once dlopen() enters the picture, all bets are off in the dlopened-library. Is that your reading as well?
I ran your tests replacing __etext with data_start and data_start doesn't have the problems that __etext does: --- Test 1 - _start/__etext referenced by main executable, no libraries main: _start=0x4004c0 data_start=0x601010
Test 2 - _start/__etext referenced by library, but not by main executable shlib: _start=0x4005a0 data_start=0x601010
Test 3 - _start/__etext referenced by dlopen() library, but not by main executable dl: _start=(nil) data_start=(nil)
Test 4 - _start/__etext referenced both by dlopen() library and by main executable main: _start=0x4006d0 data_start=0x601038 dl: _start=0x4006d0 data_start=0x601038
Test 5 - _start/__etext referenced both by dlopen() library with RTLD_DEEPBIND, and by main executable main: _start=0x4006d0 data_start=0x601038 dl: _start=0x4006d0 data_start=0x601038
Test 6 - _start/__etext referenced dlopen() library with RTLD_DEEPBIND, but not by main executable dl: _start=(nil) data_start=(nil) ---
I was going to release a new ctemplate version tonight, that replaced data_start with __etext. I'm wondering if it wouldn't be better to stick with data_start. What is your feeling?
In my testing, we saw some problems when compiling with -O2, with using __etext. Based on that and the results you showed with dlopen(), I think it's premature to use __etext at this point. I've reverted the change, and will look into this again after the new ctemplate release is out.
Yeah, I think etext might end up being a dead-end. In testing on my Ubuntu 11.10 machine, literal strings aren't actually being put into the .text section at all. They're being put into the separate .rodata (read-only, non-executable data) section. Testing against the range [_start, __etext) would end up excluding the very strings we're trying to match.
I tried various tactics such as defining a static "landmark" variable in the header file (with the hope of using its address as a reference location), but without consistent success.
The documentation says this: --- You may use this built-in function in either a macro or an inline function. However, if you use it in an inlined function and pass an argument of the function as the argument to the built-in, GCC will never return 1 when you call the inline function with a string constant --- which is exactly the position we're in, so that all matches up. It's a good idea, though!
Does the current data_start correctly capture stuff in .rodata? Maybe the current test isn't working either, and we can just safely remove it.
Probably the best thing would be to actually get perf numbers to see if it makes a difference. It may be that the string-copying here is so fast, the optimization is totally unnecessary.
The three sections are ordered: .text [gap] .rodata .data
On the older RHEL5 system, there is no .rodata, so it's simply: .text [gap] .data
So yes, the existing code is working as intended (at least for strings located in the main executable's .rodata or .text section, as opposed to strings belonging to a library), except when mmap() decides to place an anonymous writable map inside that gap.