Error: "PnaclCoordinator: Compile stream chunk failed. The PNaCl translator has probably crashed"

AlainC

Dec 18, 2014, 11:37:07 AM
to native-cli...@googlegroups.com
We are seeing the error in the title on one customer's machine (and only that one).

The machine is running Chrome 39 on 64-bit Windows 7 (Celeron) with 2 GB of memory, which does not seem to be exhausted.

If we look at Chrome's task manager, the memory for "NativeClient" grows to approximately 200 MB and then we get this error.
Of course, we cannot reproduce this problem on any other machine, even with a very similar configuration (in that case the process reaches 800 MB for an 11 MB pexe)!
The pexe is built using pepper_37 and passes pnacl-abicheck and pnacl-translate for both 32 and 64 bits with the pepper_39 tools.

The problem is that we cannot install the NaCl SDK on the customer's machine.

What I wonder is if we can get support here by providing some trace from Chrome.

Here is what I use to produce log files on my machine:

set NACL_DEBUG_ENABLE=1
set PPAPI_BROWSER_DEBUG=1
set NACL_PLUGIN_DEBUG=1
set NACL_PPAPI_PROXY_DEBUG=1
set NACL_SRPC_DEBUG=5
set NACLVERBOSITY=5

set NACL_EXE_STDOUT=c:\temp\nacl\nacl_stdout.log
set NACL_EXE_STDERR=c:\temp\nacl\nacl_stderr.log
set NACLLOG=c:\temp\nacl\nacl.log
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --no-sandbox

Depending on the values, 5 or 30, I get 4 MB or 200 MB of output in nacl.log, and it is completely obscure to me...

My question is: which values of NACL_SRPC_DEBUG/NACLVERBOSITY would be useful for an investigation by the NaCl team?

Any suggestion is welcome. Thanks in advance.

PNaclCoordinator.jpg

JF Bastien

Dec 29, 2014, 11:18:22 AM
to native-cli...@googlegroups.com
Hi Alain,

This does seem odd, especially since the client's machine always fails yet isn't hitting a memory limit. Are you able to produce logs from that machine (not yours) and share their content with us? I think the logging level that produces 5MB may be sufficient to figure out what's wrong.

Thanks,

JF

AlainC

Dec 30, 2014, 10:33:46 AM
to native-cli...@googlegroups.com
Thanks JF. The customer is not available before next week. I will try to reach him on Monday.

AlainC

Jan 6, 2015, 9:31:12 AM
to native-cli...@googlegroups.com
Hello,

I was finally able to access the customer's machine this morning.

First I ran the following commands:

set NACL_DEBUG_ENABLE=1
set PPAPI_BROWSER_DEBUG=1
set NACL_PLUGIN_DEBUG=1
set NACL_PPAPI_PROXY_DEBUG=1
set NACL_SRPC_DEBUG=5
set NACLVERBOSITY=5

set NACL_EXE_STDOUT=%TEMP%\nacl_stdout.log
set NACL_EXE_STDERR=%TEMP%\nacl_stderr.log

set NACLLOG=%TEMP%\nacl.log

"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --no-sandbox

Unexpectedly, and for the first time on this machine, the PNaCl compilation and the execution of the nexe were successful!
This success produced a nacl_stderr.log of 14 MB (attached as nacl_stderr-ok.zip) and a nacl.log of 1.4 GB!
We tried a few more launches while the nexe was in the cache and they were also successful, of course.

Trying to reproduce the problem after emptying the cache, we got a few "Aw, Snap!" ("Aïe, aïe, aïe") messages from Chrome...
Then I decided to unset the NACLLOG variable, and this time the compiler crashed with the usual "Compile stream chunk failed".
The nacl_stderr.log is attached as nacl_stderr-ko.zip.

Hoping this can help, best regards
nacl_stderr-ok.zip
nacl_stderr-ko.zip

JF Bastien

Jan 20, 2015, 1:08:34 PM
to native-cli...@googlegroups.com
Hi Alain,

Sorry for missing this email.

I looked at the logs and they look pretty similar. Does the KO log just end like that? If I remove the timers in square brackets, the only differences between the logs are the size of what's being sent (the KO one is pretty consistent at sending 32812 bytes, the OK one varies a lot more) and the content itself. Overall, the OK log processes 12161077 bytes whereas the KO one processes 2527237 (see note below).
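
For anyone wanting to repeat the comparison, here is a rough TypeScript sketch of the normalization step: drop the bracketed timers, then find the first line where the two logs diverge. The names `stripTimers` and `firstDifference` are made up for illustration, and the assumption that every `[...]` group on a line is a timer may not hold for all log lines.

```typescript
// Remove every "[...]" group, assumed here to be the timer prefix on each log line.
function stripTimers(line: string): string {
  return line.replace(/\[[^\]]*\]/g, "").trim();
}

// Return the 0-based index of the first line that differs once timers are removed,
// or -1 if both logs are identical after normalization.
function firstDifference(okLog: string, koLog: string): number {
  const ok = okLog.split(/\r?\n/).map(stripTimers);
  const ko = koLog.split(/\r?\n/).map(stripTimers);
  const shared = Math.min(ok.length, ko.length);
  for (let i = 0; i < shared; i++) {
    if (ok[i] !== ko[i]) return i;
  }
  // One log is a strict prefix of the other: they diverge where the shorter one ends.
  return ok.length === ko.length ? -1 : shared;
}
```

If the KO log really "just ends", the returned index should equal the number of lines in the KO log.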

The KO log does contain the string N6P5Base19CStrongAssertFailedE, which I can't trace to anything in our code (or any code on the internet really). Do you know what that is?


JF


Note: I did the sum of bytes with:
grep "Receive: new message" nacl_stderr.ko.log | cut -d, -f2 | sed 's/ bytes //g' | awk '{sum+=$1} END {print sum}'
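
On a machine without grep/cut/sed/awk (a Windows box, for example), roughly the same sum could be computed with a small script. This is a TypeScript sketch that mirrors the pipeline above step by step; `sumReceivedBytes` is a made-up name, and the log layout is only inferred from the pipeline itself (the second comma-separated field holding "<n> bytes").

```typescript
// Sum the byte counts of all "Receive: new message" lines in a log.
function sumReceivedBytes(logText: string): number {
  let sum = 0;
  for (const line of logText.split(/\r?\n/)) {
    // grep "Receive: new message"
    if (!line.includes("Receive: new message")) continue;
    // cut -d, -f2  (lines without a comma are skipped here; they would add 0 anyway)
    const field = line.split(",")[1];
    if (field === undefined) continue;
    // sed 's/ bytes //g', then numeric conversion as awk would do on the first field
    const count = parseInt(field.replace(/ bytes /g, ""), 10);
    if (!Number.isNaN(count)) sum += count;
  }
  return sum;
}
```

Running it on both the OK and KO logs should reproduce the two totals quoted above if the assumed layout is right.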

AlainC

Jan 21, 2015, 5:14:51 AM
to native-cli...@googlegroups.com
Hi JF,

It seems normal to have a smaller "Ko" file as the compiler crashes in this case...

N6P5Base19CStrongAssertFailedE is the name of one of our functions. It is the mangled name for CStrongAssertFailed in the P5Base namespace.
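
For reference, the string has the shape of an Itanium C++ ABI nested name (N, then length-prefixed components, then E), so it can be split by hand. A rough TypeScript sketch follows; `decodeNestedName` is a hypothetical helper, and it only handles this simple form (no templates, substitutions, or qualifiers).

```typescript
// Split a simple Itanium-style nested name "N<len><id><len><id>...E" into its components.
// Returns null if the input does not have that exact shape.
function decodeNestedName(mangled: string): string[] | null {
  if (!mangled.startsWith("N") || !mangled.endsWith("E")) return null;
  const body = mangled.slice(1, -1);
  const parts: string[] = [];
  let pos = 0;
  while (pos < body.length) {
    // Each component starts with its length in decimal digits.
    const digits = /^\d+/.exec(body.slice(pos));
    if (!digits) return null;
    const length = parseInt(digits[0], 10);
    pos += digits[0].length;
    if (length === 0 || pos + length > body.length) return null;
    parts.push(body.slice(pos, pos + length));
    pos += length;
  }
  return parts.length > 0 ? parts : null;
}

// decodeNestedName("N6P5Base19CStrongAssertFailedE") yields ["P5Base", "CStrongAssertFailed"],
// which reads as P5Base::CStrongAssertFailed.
```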


JF Bastien

Jan 21, 2015, 10:51:24 AM
to native-cli...@googlegroups.com
It seems normal to have a smaller "Ko" file as the compiler crashes in this case...

Agreed, what I mean is that the logs look similar, and then the KO one just ends without anything special.


N6P5Base19CStrongAssertFailedE is the name of one of our functions. It is the mangled name for CStrongAssertFailed in the P5Base namespace.

 I can't see anything else in the log that looks wrong. I'm not sure I can figure out what the issue is without a repro or some other information on the crash.

Just to recap: the pexe translates properly everywhere except on a single machine where it fails most (but not all) of the time? There has to be something that's different about this one machine?

AlainC

Jan 21, 2015, 11:54:23 AM
to native-cli...@googlegroups.com


On Wednesday, January 21, 2015 at 4:51:24 PM UTC+1, JF Bastien wrote:

 I can't see anything else in the log that looks wrong. I'm not sure I can figure out what the issue is without a repro or some other information on the crash.

That was supposed to be the purpose of the log... Should we try to get a more detailed report? Or can we get a status from the translator?
 

Just to recap: the pexe translates properly everywhere except on a single machine where it fails most (but not all) of the time? There has to be something that's different about this one machine?

Recap is correct. The only time the compile+execution was Ok on the customer's machine was the first time I tried with log traces (nacl_stderr-ok)...

We didn't find any way to reproduce it on another machine, and the customer is too far away from us to investigate easily. However, we are lucky: he allows us to take control of his machine from time to time to try to solve this...


 

JF Bastien

Jan 22, 2015, 4:08:40 PM
to native-cli...@googlegroups.com
On Wed, Jan 21, 2015 at 8:54 AM, AlainC <acor...@tftlabs.com> wrote:


That was supposed to be the purpose of the log... Should we try to get a more detailed report? Or can we get a status from the translator?

I'm not sure there's much we can do here. It sounds like the machine is the problem, especially considering that translation works sometimes on that machine and never fails anywhere else.

Is it out of disk space? You said it wasn't low on memory. We've seen antivirus software cause issues with Chrome before: does it have any installed? Broken or malicious extensions sometimes also cause issues, though rarely with NaCl: are there any installed? Or maybe it has faulty memory, or a bad Chrome install. It may be worth re-installing Chrome.

Soeren Balko

Feb 26, 2015, 8:33:40 PM
to native-cli...@googlegroups.com
We're actually seeing a similar issue on one of our own devices (an older Windows 7 laptop), and we have also had customer complaints that we were not able to reproduce on our end. In particular, low-end Chromebooks seem to be affected regularly (but not always). We even got ourselves a last-generation 2 GB ARM-based Chromebook to provoke these problems; needless to say, to no avail.

What happens is that sometimes the PNaCl loading/interpreter process just gets stuck after signalling a first "progress" event. No "crash", "error", or "loadend" event is ever emitted, let alone another "progress" or "load" event. That's a shame, as we have now had to introduce timeouts for certain critical phases of the PNaCl module loading process, and offer a fallback to an asm.js module if PNaCl stubbornly fails to load.
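
A minimal TypeScript sketch of that timeout-plus-fallback idea, not our production code. Everything named here is an assumption for illustration: `watchModuleLoad`, the `Outcome` type, the single overall deadline (rather than per-phase timeouts), and the injectable `schedule` parameter, which is there only so the logic can be exercised without a browser. The listeners are registered with capture on the wrapper element because, as far as I know, these events do not bubble, and they should be attached before the <embed> is inserted so that no early event is missed.

```typescript
type Outcome = "load" | "error" | "crash" | "timeout";

// Minimal shape of the wrapper element; a DOM element satisfies it.
interface Listenable {
  addEventListener(type: string, listener: () => void, useCapture?: boolean): void;
}

// Report exactly one outcome: the first of load/error/crash, or "timeout"
// if none of them arrives within timeoutMs.
function watchModuleLoad(
  wrapper: Listenable,
  timeoutMs: number,
  onOutcome: (outcome: Outcome) => void,
  schedule: (fn: () => void, ms: number) => unknown = (fn, ms) => setTimeout(fn, ms),
): void {
  let settled = false;
  const settle = (outcome: Outcome): void => {
    if (settled) return; // ignore anything after the first outcome
    settled = true;
    onOutcome(outcome);
  };
  for (const type of ["load", "error", "crash"] as const) {
    wrapper.addEventListener(type, () => settle(type), true); // capture phase
  }
  schedule(() => settle("timeout"), timeoutMs);
}

// In a page (element id and fallback function are hypothetical):
// watchModuleLoad(document.getElementById("listener")!, 30000, (outcome) => {
//   if (outcome !== "load") startAsmJsFallback();
// });
```

On "timeout", "error", or "crash" the caller would remove the <embed> and start the asm.js build instead.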

In one case we figured out that a VPN Chrome extension was the culprit, and after removing it, loading the PNaCl module started working again. In other cases we have no clue what caused the breakage.

Personally, I think the PNaCl embedding procedure leaves much to be desired. I dislike the fact that I need to inject an <embed> element into the DOM, ideally wrapped by another element that we register the event listeners with. That alone is a potential source of race conditions, where an event may fire before the callback is registered, or where the event bubbling does not work for some reason (not sure if that ever happens). I don't really see any upside to the PNaCl module being part of the DOM. Couldn't there be a plain JavaScript API to load a PNaCl module and to interface with it, maybe along the lines of the Workers API? In the short term, we should always be sent a "crash" or "error" event if the PNaCl module loading toolchain throws up.

Soeren