Floating point exception while training

cmog...@gmail.com

Apr 25, 2013, 1:24:27 AM
to ebl...@googlegroups.com
Hello,
Sometimes, when I train my net, training stops with "Floating point exception" or "Killed". I don't see any pattern; it seems random. I think it occurs during the derivative computation stage. If I restart training from the last saved mat file, it continues without problems. But if it happens at night, a lot of time is wasted.

Is it possible to avoid this exception, or to resume training automatically?

the_minion

Apr 25, 2013, 1:27:56 AM
to ebl...@googlegroups.com, cmog...@gmail.com
Is that Windows-specific? I've been running training for a few weeks now on trunk on Linux and haven't seen any problems.
If it is Linux or OS X, you can use metarun to auto-resume. Unfortunately, Windows doesn't have the metarun utility (it relies on some Linux shell-fu, so it won't build on Windows).

Alternatively, you can probably write a batch script to automate this.
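
For anyone who wants a starting point, here is a minimal sketch of such a restart wrapper in Python. It is not part of eblearn and not how metarun works; it simply reruns a command until it exits with status 0. It assumes the command you pass in already resumes from the last saved weights on its own (that part is up to your setup). The names describe_exit and run_until_success are made up for this example. On POSIX, subprocess reports a child killed by a signal as a negative return code, so the wrapper can also tell you whether the run died from SIGFPE ("Floating point exception") or SIGKILL ("Killed").

```python
import signal
import subprocess
import sys
import time


def describe_exit(returncode):
    """Turn a subprocess return code into a readable string."""
    if returncode < 0:
        # On POSIX, a negative code means the child died from signal -returncode.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def run_until_success(command, max_restarts=20, delay_seconds=5.0):
    """Rerun `command` until it exits with status 0 or restarts run out.

    `command` is an argument list, e.g. a hypothetical ["./resume_training.sh"].
    Returns the number of restarts that were needed.
    Raises RuntimeError if the command never succeeds.
    """
    for restarts in range(max_restarts + 1):
        returncode = subprocess.call(command)
        if returncode == 0:
            return restarts
        print("training %s" % describe_exit(returncode), file=sys.stderr)
        if restarts < max_restarts:
            time.sleep(delay_seconds)
    raise RuntimeError("gave up after %d restarts" % max_restarts)
```

The restart cap is there on purpose: if the process dies at the same point every time, restarting forever only hides the problem.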

soumith

Apr 25, 2013, 1:30:37 AM
to ebl...@googlegroups.com, Александр Могилко
Also,

Make sure you are training in double precision, not float. Maybe the second derivative is becoming really tiny and something weird is happening because of the limits of float precision.



cmog...@gmail.com

Apr 25, 2013, 6:49:43 AM
to ebl...@googlegroups.com, Александр Могилко
Yes, the training precision is double.
I run eblearn on Linux, built from the latest trunk two days ago. My last attempts managed to train for 13-15 epochs in a row.
OK, I'll try some scripts.

On Thursday, April 25, 2013, 9:30:37 UTC+4, the_minion wrote:

cmog...@gmail.com

Apr 27, 2013, 4:26:16 AM
to ebl...@googlegroups.com, Александр Могилко
Hi again.
I used to work with an old version, r2522, where the only problem was a random "Floating point exception" somewhere around epochs 30-50.
A week ago I installed the latest trunk revision to use dropout, and now the process is stopped with a "Killed" message every time, at exactly epoch 13. Today I watched it happen. The system became very laggy (even moving the cursor was slow), and once the process was killed at epoch 13, the system became responsive again. I think training consumes more and more resources over time, which matches this description:
If the user or sysadmin did not kill the program the kernel may have. The kernel would only kill a process under exceptional circumstances such as extreme resource starvation (think mem+swap exhaustion).
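
One way to check the "more and more resources" suspicion is to log the resident memory of the training process over time. Below is a minimal Python sketch, assuming Linux, where /proc/<pid>/status has a VmRSS field. The function names rss_kib and watch_memory are made up for this example, and nothing here is specific to eblearn.

```python
import time


def rss_kib(pid):
    """Return the resident set size of `pid` in KiB, or None if unavailable.

    Reads the VmRSS field of /proc/<pid>/status, which exists on Linux only.
    """
    try:
        with open("/proc/%d/status" % pid) as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # the kernel reports this in kB
    except OSError:
        pass
    return None


def watch_memory(pid, interval_seconds=60.0, samples=None):
    """Print the RSS of `pid` at a fixed interval until the process disappears.

    `samples` limits the number of readings (None means no limit).
    Returns the list of readings in KiB.
    """
    readings = []
    while samples is None or len(readings) < samples:
        rss = rss_kib(pid)
        if rss is None:
            break  # process is gone or /proc is not available
        readings.append(rss)
        print("%s pid=%d rss=%d KiB" % (time.strftime("%H:%M:%S"), pid, rss))
        if samples is None or len(readings) < samples:
            time.sleep(interval_seconds)
    return readings
```

If the readings climb steadily from epoch to epoch until the process is killed, that points to memory exhaustion rather than a numerical problem.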


On Thursday, April 25, 2013, 14:49:43 UTC+4, cmog...@gmail.com wrote: