Note that the chart you've excerpted reports plain softmax as being 25x to 100x slower than either HS or negative-sampling. It'd have to offer some massive benefits elsewhere to justify that slowness – and if those faster modes buy enough of a speedup to adjust *other* parameters in the direction of more-time/more-quality (such as 'size', 'window', 'epochs', etc), any supposed advantage of full softmax (on a naive ceteris-paribus or runtime-oblivious basis) could evaporate.
Neither the original Google-released `word2vec.c` code nor Facebook's original FastText code implemented a plain softmax mode, probably because of its non-competitiveness in runtime performance. Gensim's implementations are modeled after those, and behave the same way with `hs=0, negative=0`.
Also, that table's row on hierarchical softmax doesn't match my understanding, at least in comparison to negative-sampling. HS is most likely to be competitive (against the default of negative-sampling) on smaller vocabularies. On larger vocabularies, its training time grows with the size of the vocabulary (and thus with the average encoding-length of individual words). For negative-sampling, training time is sensitive only to the size of the corpus (& other parameters), and is unaffected by the size of the vocabulary (because the same single positive example and N negative examples are trained for each target word, no matter how large the vocabulary-size V).
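That scaling difference can be sketched with a rough back-of-the-envelope count of output-layer updates per target word. (This is illustrative only: I'm approximating the HS code-length as log2(V) for a balanced tree, while real HS uses Huffman codes, so frequent words get shorter codes; the function name and numbers are my own, not from any library.)

```python
import math

def updates_per_target(vocab_size, mode, negative=5):
    """Rough count of output-layer updates per target word.

    Illustrative approximation only: HS actually uses Huffman codes,
    so frequent words have shorter-than-log2(V) paths.
    """
    if mode == "softmax":    # score every word in the vocabulary
        return vocab_size
    if mode == "hs":         # one update per node on the code path
        return math.ceil(math.log2(vocab_size))  # ~log2(V), balanced-tree approx.
    if mode == "negative":   # one positive + `negative` sampled words
        return 1 + negative
    raise ValueError(mode)

for v in (10_000, 1_000_000):
    print(v, updates_per_target(v, "softmax"),
          updates_per_target(v, "hs"),
          updates_per_target(v, "negative"))
# softmax grows linearly with V, HS logarithmically,
# negative-sampling not at all
```

Going from a 10K-word to a 1M-word vocabulary, the softmax cost grows 100x, the HS cost only ~1.4x, and the negative-sampling cost stays constant.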
Published work uses a wide range of training-epochs, very often 5-20. But theoretically, a large-enough corpus where the same words appear in equally-good and equally-varied usage contexts throughout could generate fine vectors in a single epoch. (I believe it's been reported that the `GoogleNews` vectors were the result of 3 passes over 100 billion words of text.) And a small corpus might benefit from more epochs.
Unfortunately the 'loss' values which could offer an objective indicator of model convergence aren't well-reported by the `gensim` code – see pending issue <https://github.com/RaRe-Technologies/gensim/issues/2617> for the necessary cleanup/completion. Until then, I think the existing `loss` values can be re-interpreted to help determine what's happening in `Word2Vec`, but maybe not `FastText`, so evaluating the end results of different trial values is the only sure tactic.
- Gordon