
"Bad version or endian-key"?


Eric Zhang

Jul 21, 2014, 7:43:11 PM
Hi guys,

I am working on parallelizing MATLAB computations. Strangely enough, my code sometimes works and sometimes fails with the following error, even though nothing in it has changed.

Error using parallel.internal.pool.deserialize (line 9)
Bad version or endian-key

Error in distcomp.remoteparfor/getCompleteIntervals (line 38)
origErr = parallel.internal.pool.deserialize(intervalError);

Error in ga (line 68)
parfor idx = 1:numel(new_indi_idices) % only assess the necessary

Caused by:
Error using distcompdeserialize
Bad version or endian-key

I am inclined to believe this is a MATLAB bug, because it pops up very irregularly. Sometimes it fails after running for more than a day, whereas sometimes it fails after only a few hours.

How can I solve this? It is really driving me crazy; imagine how frustrating it is to wake up every morning and find that the simulation stopped at 2 am.

P.S.: Prompted by the words "bad version", I tried both R2014a and R2013b; both versions have the problem.

Thanks a lot in advance!
Eric

Edric M Ellis

Jul 22, 2014, 2:48:49 AM
"Eric Zhang" <ericzhan...@gmail.com> writes:

> I am working on parallelizing MATLAB computations. Strangely enough, my
> code sometimes works and sometimes fails with the following error, even
> though nothing in it has changed.
>
> Error using parallel.internal.pool.deserialize (line 9)
> Bad version or endian-key
>
> Error in distcomp.remoteparfor/getCompleteIntervals (line 38)
> origErr = parallel.internal.pool.deserialize(intervalError);
>
> Error in ga (line 68)
> parfor idx = 1:numel(new_indi_idices) % only assess the necessary
>
> Caused by:
> Error using distcompdeserialize
> Bad version or endian-key
>
> I am inclined to believe this is a MATLAB bug, because it pops up very
> irregularly. Sometimes it fails after running for more than a day,
> whereas sometimes it fails after only a few hours.

Hm, this is definitely not expected. Usually errors like this occur when
the data transfer between the workers and the MATLAB client is truncated
or corrupted in some way.

Do you have any simple self-contained example code that reproduces the
problem? What cluster type are you using? What OS are you running on?


> P.S.: Prompted by the words "bad version", I tried both R2014a and
> R2013b; both versions have the problem.

Just to let you know - the "version" mentioned in the error is talking
about the serialization version of the data, not the version of MATLAB.

Cheers,

Edric.

Eric Zhang

Jul 26, 2014, 6:24:09 PM
Edric M Ellis <eel...@mathworks.com> wrote in message <ytwoawh...@uk-eellis-deb7-64.dhcp.mathworks.com>...
Hey Edric, thanks a lot for the reply. I tried my best to create self-contained code that reproduces this error, but failed, because the code involves calling the external software COMSOL.

Although COMSOL is involved, I still believe this error comes from MATLAB's parallel machinery, because once I change parfor to a normal for loop, the code runs for days without any errors.

By the way, I am on my school's HPC cluster, which means the workers may span several nodes. Does that matter? After all, the code works for hours before this error pops up.

Best regards,
Eric

Edric M Ellis

Jul 28, 2014, 3:44:15 AM
"Eric Zhang" <ericzhan...@gmail.com> writes:

> Edric M Ellis <eel...@mathworks.com> wrote in message <ytwoawh...@uk-eellis-deb7-64.dhcp.mathworks.com>...
>> "Eric Zhang" <ericzhan...@gmail.com> writes:
>>
>> > I am working on parallelizing MATLAB computations. Strangely enough,
>> > my code sometimes works and sometimes fails with the following error,
>> > even though nothing in it has changed.
>> >
>> > Error using parallel.internal.pool.deserialize (line 9)
>> > Bad version or endian-key
>> > [...]
>>
>> Hm, this is definitely not expected. Usually errors like this occur when
>> the data transfer between the workers and the MATLAB client is truncated
>> or corrupted in some way.
>>
> Hey Edric, thanks a lot for the reply. I tried my best to create
> self-contained code that reproduces this error, but failed, because the
> code involves calling the external software COMSOL.
>
> Although COMSOL is involved, I still believe this error comes from
> MATLAB's parallel machinery, because once I change parfor to a normal
> for loop, the code runs for days without any errors.
>
> By the way, I am on my school's HPC cluster, which means the workers may
> span several nodes. Does that matter? After all, the code works for
> hours before this error pops up.

Are you using an interactive parallel pool to do this, or is everything
running on the cluster inside e.g. a 'batch' job? If you are using an
interactive pool, it might be worth trying a 'batch' job instead as then
there will be no communication from your host to the remote cluster.

If you haven't used it before, the batch reference page is here:

<http://www.mathworks.com/help/distcomp/batch.html>

and you'll want to do something like

c = parcluster(...); % get a handle to your HPC cluster profile
j = batch(c, @myFunction, 2, {args}, 'Pool', 15); % 2 outputs, {args} inputs, pool of 15 extra workers

where 'myFunction' contains your PARFOR loops etc.
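Once the job has been submitted, collecting the results might look roughly like this (the output names below are just placeholders, and 'j' is the job object returned by batch above):

wait(j);               % block until the batch job has finished
out = fetchOutputs(j); % cell array holding the 2 requested outputs
result = out{1};       % first output of myFunction
delete(j);             % remove the job from the cluster when done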

Cheers,

Edric.

Emmanuel Kalunga

Jun 16, 2015, 2:11:03 PM
I had the same error. If it only happens sporadically, I would suggest wrapping the call in a try/catch block and retrying.
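A rough sketch of what that could look like is below; the retry count, pause length, and the function name runMyParfor are just placeholders, and restarting the pool inside the catch block is only one possible recovery strategy:

maxRetries = 5;
for attempt = 1:maxRetries
    try
        results = runMyParfor();    % function containing the parfor loop
        break;                      % success, stop retrying
    catch err
        warning('Attempt %d failed: %s', attempt, err.message);
        if attempt == maxRetries
            rethrow(err);           % give up after the last attempt
        end
        delete(gcp('nocreate'));    % shut down the (possibly broken) pool
        pause(60);                  % wait a little before restarting
        parpool;                    % open a fresh pool and try again
    end
end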

Chad

Jun 19, 2015, 7:45:14 PM
Eric, I ran into this recently and was able to reproduce the issue quite reliably, but I can't share the exact code. The issue as I see it is as follows:
1) Very large data sets that are poorly sliced by the parfor code. In my case, I had masked the ability to slice entirely.
2) Debugging is done with reduced data sets. In my case, I reduced 8 GB to 80 MB for testing, and 80 MB is handled perfectly well as a broadcast variable. When I scaled back up to 8 GB, the broadcasting failed to work properly.

Combining these two things from my repeatable issue, I think this has to do with the amount of free memory in relation to the size of the broadcast variable you're working with. There is probably a way to build a table ahead of time that is single-dimensional and set up nicely for slicing. There may also be an issue on the return side of the parfor, but I can't comment on that part. To me, this looks like a memory leak when parfor creates the workspaces for each node of your cluster. The more nodes you have and the bigger the broadcast variable, the worse the issue is.
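For what it's worth, here is a rough sketch of the difference (the variable names are made up): when the large array is indexed with something other than the loop variable, parfor has to broadcast the whole array to every worker; indexing it directly with the loop variable lets parfor slice it instead.

bigData = rand(1e5, 200);   % large array; scale up to taste
lookup  = randperm(200);
out1 = zeros(1, 200);
out2 = zeros(1, 200);

% Broadcast: bigData is indexed by 'col', so every worker gets a full copy
parfor idx = 1:200
    col = lookup(idx);
    out1(idx) = sum(bigData(:, col));
end

% Sliced: bigData is indexed by the loop variable, so each worker only
% receives the columns it needs
parfor idx = 1:200
    out2(idx) = sum(bigData(:, idx));
end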

-Chad
"Eric Zhang" wrote in message <lqk8ie$aus$1...@newscl01ah.mathworks.com>...

Olivier

Aug 1, 2017, 6:01:24 PM
"Eric Zhang" wrote in message <lqk8ie$aus$1...@newscl01ah.mathworks.com>...
I just experienced the same error with code that does not use parallel processing but manages large data files. Clearing variables did not help, but closing the MATLAB session and opening a new one solved the problem: the same code then ran without issue and handled the same files.

Olivier