How to interrupt a long-running model.fit()?

Bernd F.

May 19, 2018, 7:23:48 PM
to TensorFlow.js Discussion
I am writing a webpage to experiment with models written in TensorFlow.js. The user should be able to stop, and ideally also continue, a long-running learning process, i.e. a run of model.fit().

How could I realize this?

I am already using a callback to show intermediate results:

model.fit(xxx, yyy, {
    batchSize: batchsize,
    epochs: epochs,
    callbacks: {
        onEpochEnd: async (epoch, logs) => {
            mycb(epoch, logs);
        },
    },
})

How could I use this (or something else) to stop the learning?

Once stopped, is there a way to continue or do I have to start all over?

Shanqing Cai

May 19, 2018, 9:22:18 PM
to TensorFlow.js Discussion
Hi Bernd,

Good question. I just filed this issue: https://github.com/tensorflow/tfjs/issues/312

The short answer is there is no clean way to do it currently. We will add a property `stopTraining` to the `tf.Model` class. This property can be set to true by callbacks to force the training to stop. This is how early stopping is done in Keras.

Until that happens, you can try throwing an Error from within the callback to stop the training.

Best,
Shanqing

Bernd F.

May 20, 2018, 8:53:33 AM
to TensorFlow.js Discussion
Hi Shanqing,

Thanks for filing this issue.

Your proposal to throw an error works nicely. My code now looks like this:

    try {
        model.fit(xxx, yyy, {
            batchSize: batchsize,
            epochs: epochs,
            callbacks: {
                onEpochEnd: async (epoch, logs) => {
                    if (should_stop) {
                        should_stop = false;
                        $("#gobutton").html("Train");
                        throw "that is it";
                    }

                    mycb(epoch, logs);
                    // Await web page DOM to refresh for the most recently plotted values.
                    await tf.nextFrame();
                },
            },
        }).then(() => {
            $("#gobutton").html("Train");
            console.log("done ....");
        });
    } catch (err) {
        console.log("Error caught:", err.message) // this is not executed when the exception is thrown
    }


I can now interrupt model.fit() by externally setting the variable "should_stop". A minor remaining point is that the catch does not work. Instead I get the message:

uncaught exception: that is it

How could I handle the exception in a controlled way?

Manraj Singh

May 20, 2018, 9:25:28 AM
to TensorFlow.js Discussion
Hi Bernd,

Since `model.fit` returns a Promise, you should catch the error with chainable `.catch((e) => console.log(e))` (without try/catch) or resolve the promise with `await` while keeping the try/catch. Adding a reference link: https://stackoverflow.com/a/40886720/2692667
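
Both options can be sketched roughly as follows. To keep the snippet runnable without TensorFlow.js, `fakeFit` is a hypothetical stand-in for `model.fit` (it is not a library function): it returns a Promise and awaits an `onEpochEnd` callback once per epoch, which is the only behavior that matters here. The `stopAt`, `trainWithCatch` and `trainWithAwait` names are likewise made up for illustration.

```javascript
// Hypothetical stand-in for model.fit(): returns a Promise and awaits
// the onEpochEnd callback once per epoch. Not part of TensorFlow.js.
async function fakeFit({ epochs, callbacks }) {
  for (let epoch = 0; epoch < epochs; epoch++) {
    await callbacks.onEpochEnd(epoch, { loss: 1 / (epoch + 1) });
  }
  return 'finished';
}

// Builds a callback object that aborts training at epoch n by throwing.
const stopAt = (n) => ({
  onEpochEnd: async (epoch) => {
    if (epoch === n) throw new Error('that is it');
  },
});

// Option 1: chain .catch() on the returned Promise (no try/catch).
function trainWithCatch() {
  return fakeFit({ epochs: 10, callbacks: stopAt(3) })
    .then(() => 'done')
    .catch((err) => 'stopped: ' + err.message);
}

// Option 2: await the Promise so the surrounding try/catch sees the rejection.
async function trainWithAwait() {
  try {
    await fakeFit({ epochs: 10, callbacks: stopAt(3) });
    return 'done';
  } catch (err) {
    return 'stopped: ' + err.message;
  }
}

trainWithCatch().then(console.log);
trainWithAwait().then(console.log);
```

Throwing an `Error` object rather than a bare string keeps `err.message` defined in the handler.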

Jeremy Ellis

Jul 28, 2018, 7:32:17 PM
to TensorFlow.js Discussion
Just an update here:

Manraj's method of catching the error works, but Shanqing's method is smoother: he creates a callback class to exit model.fit, as shown in this codePen

I posted a feature request here, as I think a third, easier method should be available.

Youth Overturn

Aug 9, 2018, 11:05:47 PM
to TensorFlow.js Discussion
It doesn't work on 0.12.4, no matter which of these I use:
this.model.model.stopTraining = true;
this.model.stopTraining = true;

Shanqing Cai

Aug 9, 2018, 11:18:09 PM
to qcgm1...@gmail.com, TensorFlow.js Discussion
Hi Youth,

How did you do it exactly? Can you show your code? There is currently a known issue that you'd have to set `this.model.stopTraining` from a callback class object in order for this to work. Setting it from an anonymous callback function doesn't work. 

See the codepen at https://codepen.io/caisq/pen/xzMYZx. I tried it with latest version (0.12.4) and it works fine.

Shanqing


--
---
Shanqing Cai
Software Engineer
Google

Jeremy Ellis

Aug 14, 2018, 2:18:46 AM
to TensorFlow.js Discussion
It looks like a pull request has been made to address this issue. Hoping for a codePen from Shanqing Cai to show how to use it.

Jeremy Ellis

Aug 21, 2018, 12:41:07 PM
to TensorFlow.js Discussion

Shanqing Cai has got graceful exiting working in model.fit without making a special callback class. I think it was implemented in TensorFlow.js version 0.12.5.

Example codepen at (it also includes one way to get initialEpoch working, so that when you continue training, the epoch count starts from where you left off)
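
The pause-and-continue pattern can be sketched roughly like this. This is not the codepen's code: `FakeModel` is a hypothetical stand-in that mimics only the contract discussed in this thread (the training loop checks `stopTraining` after each `onEpochEnd`, and `initialEpoch` sets where epoch counting starts, with `epochs` acting as the final epoch index). The `trainFrom` and `demo` helpers are also made up for illustration. With the real library, the same `epochs`, `initialEpoch` and `callbacks` options would go into the config passed to `model.fit`.

```javascript
// Hypothetical stand-in for a model, NOT TensorFlow.js itself. It mimics
// only the contract discussed here: fit() checks `stopTraining` after
// each onEpochEnd, and `initialEpoch` sets where epoch counting starts.
class FakeModel {
  constructor() {
    this.stopTraining = false;
  }
  async fit({ epochs, initialEpoch = 0, callbacks }) {
    this.stopTraining = false;
    for (let epoch = initialEpoch; epoch < epochs; epoch++) {
      await callbacks.onEpochEnd(epoch, {});
      if (this.stopTraining) break;
    }
  }
}

// Trains up to `epochs`, pausing when shouldPause(epoch) returns true.
// Returns the epoch index to pass as initialEpoch when continuing.
async function trainFrom(model, initialEpoch, epochs, shouldPause) {
  let nextEpoch = initialEpoch;
  await model.fit({
    epochs,
    initialEpoch,
    callbacks: {
      onEpochEnd: async (epoch) => {
        nextEpoch = epoch + 1; // remember where to resume
        if (shouldPause(epoch)) model.stopTraining = true;
      },
    },
  });
  return nextEpoch;
}

async function demo() {
  const model = new FakeModel();
  // First run: pause once epoch 3 has finished.
  const pausedAt = await trainFrom(model, 0, 10, (e) => e === 3);
  // Second run: continue from the stored epoch and never pause.
  const finishedAt = await trainFrom(model, pausedAt, 10, () => false);
  return [pausedAt, finishedAt];
}

demo().then(([p, f]) => console.log('resume from epoch', p, '- ended at', f));
```

In a page, `shouldPause` would typically read a flag set by a button handler, such as the `should_stop` variable used earlier in this thread.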


Note 1: Model.fit now automatically calls 

await tf.nextFrame(); 

Note 2: When pausing the model, be careful not to call model.fit again, as that causes a non-crashing error.