"ConvergenceWarning:" on logistic regression in Doc2Vec evaluation step

Fionn Delahunty

Jul 9, 2018, 7:22:45 AM

to gensim

This is a slightly odd question. I'm experimenting with the PV-DBOW model using the IMDB notebook.

I've done the unpythonic and unholy thing and just taken that notebook and tried to write in my own dataset. Please excuse my clear laziness!

The dataset is in a slightly different format (CSV), but I believe I've correctly translated it into the tagged document format.

When training the model, however, I get the following warning at the evaluation stage:

Anaconda3\lib\site-packages\statsmodels\base\model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

I've tried to debug the statsmodels logistic regression, but I suspect the problem is more likely in the input dataset.

You can find a link to the notebook here; you'll need to download it locally since GitHub isn't playing ball today. If anyone has any suggestions, or has experienced this issue before, I'd really appreciate the support!

Gordon Mohr

Jul 9, 2018, 11:04:21 AM

to gensim

Does everything seem to work before getting to the statsmodels/classifier part? Does logging indicate training has occurred? Do the doc-vectors most-similar to others seem somewhat plausible?

If so, the problem is outside gensim, and a statsmodels-specific forum might be a better place to ask. For example, some of the suggestions at <https://stats.stackexchange.com/questions/313426/mle-convergence-errors-with-statespace-sarimax> might be applicable, like tinkering with `maxiter`, even though you're using a different model.

It looks like your dataset is rather small - tens-of-thousands of documents are more common. A small dataset with a hard-to-distinguish boundary might be helped with other choices of solver/solver-parameters, as described at <http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit.html#statsmodels.discrete.discrete_model.Logit.fit>. Small datasets might also require smaller Doc2Vec models (such as smaller vector-sizes) to avoid overfitting.

You could also try switching to another kind of classifier, or a classifier from scikit-learn.
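As a sketch of the scikit-learn route, again on synthetic stand-in data (the array sizes, the 150/50 split, and `max_iter=1000` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for doc-vectors and binary labels.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# X must be 2-D with shape (n_samples, n_features); y must be 1-D with n_samples entries.
clf = LogisticRegression(max_iter=1000)
clf.fit(X[:150], y[:150])

# Mean accuracy on the held-out 50 rows.
accuracy = clf.score(X[150:], y[150:])
print(accuracy)
```

scikit-learn's `LogisticRegression` applies L2 regularization by default, which tends to keep the fit stable on small datasets.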

If you do forward your working notebook elsewhere for help, it'd be good to (1) enable logging, to the INFO level at least; (2) show outputs for steps through to the actual error you're receiving (which I don't see in the current gist).
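Enabling INFO-level logging takes a few lines at the top of the notebook. This is a minimal sketch; the format string is one common choice, not a requirement:

```python
import logging

LOG_FORMAT = "%(asctime)s : %(levelname)s : %(message)s"
logging.basicConfig(format=LOG_FORMAT, level=logging.INFO)

# basicConfig does nothing if the root logger already has handlers,
# so the level is also set explicitly.
logging.getLogger().setLevel(logging.INFO)
```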

- Gordon

Fionn Delahunty

Jul 9, 2018, 12:07:02 PM

to gensim

> Does everything seem to work before getting to the statsmodels/classifier part? Does logging indicate training has occurred? Do the doc-vectors most-similar to others seem somewhat plausible?

Yes, I believe so. I'm getting the expected output at both the build_vocab and training stages.

> It looks like your dataset is rather small - tens-of-thousands of documents are more common. A small dataset with a hard-to-distinguish boundary might be helped with other choices of solver/solver-parameters, as described at <http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit.html#statsmodels.discrete.discrete_model.Logit.fit>. Small datasets might also require smaller Doc2Vec models (such as smaller vector-sizes) to avoid overfitting.

I've reduced the model to a simpler one (10 vectors, etc.) and the error still occurs.

> You could also try switching to another kind of classifier, or a classifier from scikit-learn.

I've also tried the LR model from scikit-learn and came across a different error, an array size issue if I remember correctly. That issue occurred with the IMDB dataset, however.

Thanks for your advice anyway!

Fionn Delahunty

Jul 9, 2018, 3:28:12 PM

to gensim

I activated logging as you suggested :+1:

2018-07-09 20:23:06,047 : INFO : EPOCH - 20 : training on 137692 raw words (133545 effective words) took 0.2s, 756451 effective words/s
2018-07-09 20:23:06,048 : INFO : training on a 2753840 raw words (2670900 effective words) took 3.6s, 738132 effective words/s
C:\Users\Fionn Delahunty\Anaconda3\lib\site-packages\statsmodels\base\model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
2018-07-09 20:23:06,115 : WARNING : Effective 'alpha' higher than previous training cycles
2018-07-09 20:23:06,116 : INFO : training model with 8 workers on 6505 vocabulary and 1100 features, using sg=0 hs=0 sample=0 negative=5 window=5

I reduced the epochs and vector sizes but still get the error. Might you have any suggestions?

Gordon Mohr

Jul 9, 2018, 5:10:23 PM

to gensim

As mentioned, this is a statsmodels issue, since Doc2Vec training completes and gives expected output. So I wouldn't expect it to be resolved by tinkering with Doc2Vec parameters (though that might incrementally improve the quality of results in other ways).

Did you try the options mentioned in the statsmodels-specific sources I linked, like a different (larger) `maxiter` or other changes to the solver/solver-parameters? You'd have to go to a statsmodels forum for more guidance than that. This error gets a bunch of hits, so it's not unheard of. Or, switch to scikit-learn and fix the separate error there.

- Gordon
