Stack Overflow on windows...

Kris Malfettone

Feb 14, 2014, 8:43:53 AM
to ninja...@googlegroups.com
Hi all, I am trying to debug a problem with one of our builds and I was hoping someone on here could lend some insight.  I have a rather large build that I create using CMake + Ninja on Windows.  During one of our nightly builds we received a "Stack overflow" error that I believe came from ninja.  I am not 100% sure, but it looks like ninja is the tool that experienced the stack overflow.

I am running on a box with 4 cores using the default -j behavior of 6 jobs.  I see 6 failure reports from ninja in the log, followed by a "Stack overflow" message that I believe comes from the OS.  Normally I wouldn't suspect ninja, because it has been running this project for some time and has proved quite capable of handling large projects.  However, it is the only command running, and I believe there should have been one more line of output:
"ninja: build stopped: subcommand failed."
or something to that effect which is not present in the log.

Since I see all 6 FAILURE lines reported by ninja but not ninja's final message, I am led to believe the error occurred in ninja itself.  Still, I am very surprised that anything could cause this during what I assume is the shutdown/reporting path of ninja.
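
The reasoning above can be sketched as a small log check. This is only an illustration, assuming the log has the "FAILED:" and "ninja: build stopped:" shape quoted in this thread; `diagnose_log` and the sample log text are hypothetical and not part of ninja.

```python
def diagnose_log(log_text: str) -> str:
    """Classify a build log by whether ninja printed its final status line."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    failed = sum(1 for line in lines if line.startswith("FAILED:"))
    # A failed build is expected to end with a "ninja: build stopped: ..." line.
    has_summary = any(line.startswith("ninja: build stopped:") for line in lines)
    if failed and not has_summary:
        return "suspect ninja"      # failures reported, but ninja never signed off
    if failed:
        return "subcommand failed"  # the usual case: a build command failed
    return "no failure reported"


# Hypothetical log fragment shaped like the one described above.
sample = """FAILED: cl /c a.cpp
a.cpp(1): fatal error C1083: Cannot open include file
Stack overflow
"""
print(diagnose_log(sample))
```

A missing final line is only a hint: truncated or interleaved log capture could produce the same picture.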

My question is: does this line of thinking seem valid?  If so, do you have any suggestions for what I can do to alleviate the problem?  I believe on Windows I can update the ninja binary to set a larger stack size, but that may or may not be the appropriate solution.  It may just hide the problem until our project increases in size.

Again, in closing, I would like to mention that ninja having this problem seems very surprising, and I am even skeptical that this is indeed the case.  Perhaps it's exhaustion from all the shoveling here on the east coast, but I am having trouble coming up with an alternative hypothesis.

Any feedback would be appreciated.

-Kris

Kevin Ingwersen

Feb 14, 2014, 11:13:58 AM
to Kris Malfettone, ninja-build
Hey.

"ninja: subcommand failed" means that the command you ran failed. Here is example output from a failing command:

[1/1] CXX: main.cpp
FAILED: g++ main.cpp -o main
main.cpp:1: Fatal error: File "sys/nonsense.h" not found

Whenever one of your build commands returns a nonzero exit code, the command is treated as having failed. The stack overflow might actually come from the program you are trying to run.

To test for the issue, try to build without -j, as ninja does parallel building on its own; there is no need for a -j switch.
Alternatively, generate a different kind of project, such as NMake makefiles. Run that build and see if the error still appears.

If both of the above still show the error, it is definitely the compiler or some other command being invoked. Otherwise, it should be ninja. To track it down, clone the ninja source, navigate to src/, and use Windows Explorer to search for exactly that "Stack overflow" message. It should find a file that contains that string.

Keep in mind that any exit code other than 0 is a "FAILURE".
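
As a rough illustration of the exit-code rule, here is a minimal Python sketch. It is not ninja's implementation; `run_step` is a hypothetical helper, and the two stand-in "build commands" are just Python one-liners that exit with a chosen code.

```python
import subprocess
import sys


def run_step(argv):
    """Run one command; a nonzero exit code is reported as a failure."""
    result = subprocess.run(argv, capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else "FAILED"
    return status, result.returncode


# Stand-in commands: one exits with 0, the other with 3.
ok_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(run_step(ok_cmd))
print(run_step(bad_cmd))
```

The point is only that the exit code, not the text a command prints, decides whether a step counts as failed.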

Kind regards, ingwie


Malfettone, Kris

Feb 14, 2014, 11:37:17 AM
to Kevin Ingwersen, Kris Malfettone, ninja-build

I am slightly confused by your example.  Are you saying that I should or shouldn’t see an additional printout after the listing of what command failed?  For instance, with your example of a bad include I see the following:
FAILED: … compiler cmd …
…: fatal error C1083: Cannot open include file: 'sys/nonsense.h': No such file or directory…
ninja: build stopped: subcommand failed.

Meaning I see 3 distinct pieces of info:

1) A FAILED message showing me what command failed.
2) The output of the failed command.
3) A final message from ninja indicating the overall status of the build, which in this case is: build stopped: subcommand failed.

If g++ had a stack overflow, what I would expect to see is as follows:
FAILED: g++ main.cpp
Stack overflow
ninja: build stopped: subcommand failed.

Am I wrong in this assumption?  Would it not print out the 3rd line as I expect?

Also, I do not specify -j; I was just trying to describe that I am using the default behavior.

-Kris





Evan Martin

Feb 14, 2014, 12:44:55 PM
to Kris Malfettone, ninja-build
Hey Kris,

I haven't heard about this problem before, but from your analysis it does seem plausible that it's Ninja crashing.

I'm pretty bad at Windows in particular, but as far as Scott re-explained the code to me just now, Ninja attempts to catch crashes and write out a minidump, so maybe if there's a stack overflow we'll attempt to write a ninja_crash_dump_XX.dmp file in the temp dir.

However, Scott suggested that it's also possible that if we're out of stack, the minidump itself may fail, which may mean there is no dump file either.

So perhaps look around for such a file.  From there we can maybe talk about your options.  If you're willing to attempt to debug it further we'd appreciate your help!
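
One way to look for such a file is sketched below. It assumes the "XX" in the name above can be treated as a wildcard and that the temp directory Python reports is the one the build account uses; `find_ninja_dumps` is a hypothetical helper, and both assumptions may not hold on a given build machine.

```python
import glob
import os
import tempfile


def find_ninja_dumps(directory=None):
    """Return paths matching ninja_crash_dump_*.dmp, newest first."""
    # Assumption: the dump lands in the same temp dir Python reports.
    directory = directory or tempfile.gettempdir()
    pattern = os.path.join(directory, "ninja_crash_dump_*.dmp")
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


for path in find_ninja_dumps():
    print(path)
```

If the build runs under a service account, pass that account's temp directory explicitly instead of relying on the default.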



Bill Hoffman

Feb 14, 2014, 1:20:46 PM
to ninja...@googlegroups.com
On 2/14/2014 12:44 PM, Evan Martin wrote:
> Hey Kris,
>
> I haven't heard about this problem before, but from your analysis it
> does seem plausible that it's Ninja crashing.
>
Building a debug ninja might help as Windows will insert a bunch of
checks and asserts.

-Bill


Kris Malfettone

Feb 14, 2014, 1:50:55 PM
to ninja...@googlegroups.com
Unfortunately the build server clears its temporary directory at the start of each run so I no longer have access to where the file would have been located if it was created.

I am willing to debug it in any way you think would be useful.  Unfortunately the problem doesn't appear to be regular and it has only occurred once, so I don't know how useful that would be.  I am also very inexperienced on the Windows side of things, so if anyone has any tools they recommend, I would be happy to try anything.

My guess is that finding a high water mark for the stack would be the most useful, even if the problem doesn't reproduce itself.  It would at least give us an idea of whether we are indeed nearing the limit.  This could either rule out ninja or confirm it is plausible.  I do not believe there are any tools to do that; only instrumentation of the code would be able to.

The debug binary sounds like a good approach if I can get the problem to repeat itself.  I will see how much impact to the speed of the build a debug binary will have.  If it isn't too great I can enable it for all our builds and if the problem resurfaces we might get some information.

Thanks for all the feedback and taking the time to look into this, even if we don't figure it out at least it serves as another data point if anyone else runs into a similar problem in the future.

Matthew Woehlke

Feb 14, 2014, 2:01:03 PM
to ninja...@googlegroups.com
On 2014-02-14 13:50, Kris Malfettone wrote:
> My guess would be finding a way to find a high water mark for the stack
> would be the most useful even if the problem doesn't reproduce itself. It
> would at least give us an idea if we are indeed nearing the limit. This
> could either rule out ninja or confirm it is plausible. I do not believe
> there are any tools to do that only instrumentation of the code would be
> able to.

If you have built (or are willing to build, which I guess you are if you
are thinking about doing a debug build) ninja yourself, I believe you
can change the stack size when building it. This isn't quite the same as
a HWM, but the default is typically more than adequate for normal use,
so if you are sometimes getting close to the default limit but not
always exceeding it, lowering it might at least cause a crash to happen
more consistently.

(And no, I don't know offhand *how* to do so... GIYF.)

--
Matthew

Scott Graham

Feb 14, 2014, 2:05:02 PM
to Kris Malfettone, ninja-build
A debug build would be a good start.

You could also try removing the __try from here: https://github.com/martine/ninja/blob/master/src/ninja.cc#L1098 (i.e. just do the non _MSC_VER path). If there's a crash then, you'd get the regular windows crash dialog which would allow you to confirm that it's ninja, and perhaps use a debugger to get more information. On the downside, that would likely mean that the build machine would hang at the dialog so you mightn't want to do that.

FWIW, the default is a 1MB stack, and I don't see the ninja build files overriding that with the linker's /STACK option. Maybe someone else can guess at an O() estimate for how much space ninja might need based on the build graph.


Neil Mitchell

Feb 17, 2014, 4:27:08 PM
to Scott Graham, Kris Malfettone, ninja-build
I would recommend against trying a debug build at first. With MSVC the
stack usage in debug build is often significantly higher (x10 for
certain constructs), so what can be a different issue in release can
become a stack issue in debug, further muddying the waters.

> Unfortunately the problem doesn't appear to be regular and its only occurred once.

The easiest way to check for a stack issue is to run:
editbin ninja.exe /stack:10000000
which will give you a 10 million byte stack. Similarly, decrease that
number to make the issue more likely to reproduce. Most programs can
get away with a 100 KB stack, and if Ninja is exceeding 500 KB with
your build, it's highly likely that some other build system or
parallel interleaving would exceed the default of 1 MB.
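
The "decrease that number" idea can be turned into a bisection over stack sizes. The sketch below is only an outline under stated assumptions: `editbin_command` just formats the command quoted above, `smallest_working_stack` is a hypothetical helper, and `works(size)` stands for whatever check you supply (for example, patching a copy of ninja.exe and running the build). It also assumes the failure is deterministic for a given size, which an intermittent crash may not be.

```python
def editbin_command(stack_bytes, exe="ninja.exe"):
    """Format the editbin invocation for a given stack reserve size."""
    return ["editbin", exe, f"/stack:{stack_bytes}"]


def smallest_working_stack(works, low, high, step=16 * 1024):
    """Bisect between a failing size `low` and a working size `high`.

    `works(size)` is a caller-supplied check that returns True when the
    build succeeds with that stack size.
    """
    while high - low > step:
        mid = (low + high) // 2
        if works(mid):
            high = mid   # mid is enough, so try smaller sizes
        else:
            low = mid    # mid is too small, so try larger sizes
    return high


# Stand-in predicate: pretend the build needs 300 KiB of stack.
needed = 300 * 1024
estimate = smallest_working_stack(lambda size: size >= needed,
                                  100 * 1024, 1024 * 1024)
print(editbin_command(estimate))
```

The result is an upper bound on the required stack, accurate to within `step` bytes, and only as trustworthy as the `works` check behind it.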

Thanks, Neil

Malfettone, Kris

Feb 19, 2014, 8:20:57 AM
to Neil Mitchell, Scott Graham, Kris Malfettone, ninja-build
I am planning to try the "reduce the stack size till we find a high water mark" approach. However, I haven't had time to do it yet.

I did want to report back that I received three more strange build errors on the same machine since the stack overflow that are not stack overflows. In particular, one that reports "In-page error". From everything I read this seems to either indicate a problem with one of our drives or networking issues running some of our NFS stored code. I wanted to mention this because from what I can gather it appears this can also lead to unexplained behavior in the code which might have been the real cause of the stack overflow. I think as soon as I can get a reasonable estimate of the stack usage we can rule out the stack overflow.

Thanks again for all the suggestions / advice. I will send any information as I get it.

-Kris


Malfettone, Kris

Mar 5, 2014, 1:10:28 PM
to Malfettone, Kris, Neil Mitchell, Scott Graham, ninja-build
All,
I wanted to report back now that I have had a chance to try reducing the stack size of the binary and rerunning our builds. I didn't try to drill in too far, but in short, for our build ninja can't start up with a 250 K stack size but with a 500 K stack it starts and completes fine. The default stack size on Windows is 1 M. This is an 8 core box using the default parallel behavior. Also, our build consists of 18201 targets.

After seeing the errors mentioned in my last email, and given that this only occurred once, I believe this wasn't a problem with ninja but rather the result of some other problem. I appreciate everyone's input, and if this ever occurs again I'll be sure to share it with the list. Also, if you want me to run any other tests I would be happy to.

Thanks again for making ninja available to all, I can't express how much better our development stack is as a result.

-Kris