Regarding your subject line question, "hierarchical softmax" describes one way of composing the desired *output* of the neural network, and acquiring the prediction errors to backpropagate. It doesn't describe how to generate the training examples (or input to the network).
Your list of contexts is roughly correct as to the 5 training examples that your sentence would provide – specifically, using the "CBOW" model with a 'window' of 2. (2 words before, 2 words after.) The candidate vectors for those (up to) 4 context words would be summed (or averaged) to get the input to the neural network. The network forward-calculations are run, and then a small relevant subset of the network's output is checked for accuracy. (In "hierarchical softmax", a subset of nodes that binary-encode the desired target word are checked, so some of these nodes have a target value of 1.0 and others 0.0. In "negative sampling", a node representing the target word is checked for closeness to 1.0 and N nodes representing other random words are checked for closeness to 0.0.) The errors are backpropagated, nudging the network hidden weights and candidate vector representations to do a little better on the current training example.
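If it helps to see the shape of that per-example update, here's a rough numpy sketch of the CBOW + negative-sampling case. (Names like `in_vecs`/`out_vecs` are just my placeholders, not the actual word2vec internals, and real implementations add many optimizations.)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['the', 'cat', 'ate', 'mouse', '.']
dim = 8
in_vecs = rng.normal(scale=0.1, size=(len(vocab), dim))   # candidate word-vectors (input side)
out_vecs = np.zeros((len(vocab), dim))                     # output-layer weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_neg_sampling_step(context_ids, target_id, k=2, lr=0.025):
    # average the context words' candidate vectors to form the network input
    h = in_vecs[context_ids].mean(axis=0)
    # check the target node against 1.0, and k random-word nodes against 0.0
    samples = [(target_id, 1.0)] + [(int(rng.integers(len(vocab))), 0.0) for _ in range(k)]
    grad_h = np.zeros(dim)
    for node_id, label in samples:
        pred = sigmoid(h @ out_vecs[node_id])
        err = label - pred                      # prediction error to backpropagate
        grad_h += err * out_vecs[node_id]
        out_vecs[node_id] += lr * err * h       # nudge output weights
    in_vecs[context_ids] += lr * grad_h         # nudge each context word's vector
```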
I said 'roughly correct' above because in the CBOW code I've seen, there's no padding-word added/averaged in. The context just shrinks to the actual boundaries of the source text. (So your first training-context would be: [The: cat, ate].)
However, it seems typical to preserve punctuation – the '.' would be a context token with its own vector – and to create context windows from multi-sentence text. To the extent a <pad> token might help, punctuation tokens may too. (So, your last two contexts might be instead: [mouse: ate, the, '.'] and ['.': the, mouse].)
Also, the actual window used around any target word is randomly chosen from 1 up to the 'window' parameter, serving to give the closer words more weight. (So of your listed contexts, in an actual run about half would be smaller than you've shown, using an effective window of ±1 instead of ±2.)
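Putting those last few points together, the CBOW context generation looks roughly like this (a toy sketch, not the actual word2vec/gensim code) – the window shrinks at the text boundaries and is also randomly reduced per target word:

```python
import random

def cbow_contexts(tokens, window=2):
    # yields (context_words, target_word) pairs
    for i, target in enumerate(tokens):
        reduced = random.randint(1, window)              # effective window: 1..window
        left = max(0, i - reduced)
        context = tokens[left:i] + tokens[i + 1:i + 1 + reduced]
        if context:
            yield context, target

for ctx, tgt in cbow_contexts(['The', 'cat', 'ate', 'the', 'mouse', '.']):
    print(f"predict {tgt!r} from average({ctx})")
```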
In the "skip-gram" mode alternative to "CBOW", rather than averaging the context words, each is used as a pairwise training example. That is, in place of one CBOW example such as [predict 'ate' from average('The', 'cat', 'the', 'mouse')], the network is presented with four skip-gram examples [predict 'ate' from 'The'], [predict 'ate' from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse']. (The same random window-reduction occurs, so half the time that would just be two examples, of the nearest words.)
Subsampling is a choice independent of any of the above: it will randomly drop some of the most-frequent words from the source text, before preparing any contexts. The training can then be a bit faster, since a lot of (plausibly-redundant) contexts will be skipped. Also, since the dropped words disappear entirely, it effectively makes the context windows of remaining nearby words a little larger. (In your sentence, if the 2nd 'the' was randomly dropped, then 'cat' and 'mouse' would be in each other's context windows.)
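Roughly, using the discard probability from the original word2vec paper (1 - sqrt(t/frequency)); gensim's `sample` parameter works similarly, though its exact formula differs a little:

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-3):
    # drop very-frequent words before any contexts are formed
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        freq = counts[w] / total
        p_keep = min(1.0, math.sqrt(t / freq))   # rare words are always kept
        if random.random() < p_keep:
            kept.append(w)
    return kept
```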
All of these choices also apply to paragraph-vectors. There, the "DBOW" method is most analogous to skip-grams: its training examples are pairs like [predict 'cat' from 'DOC_1']. (In this mode, you don't necessarily even have to do any word-to-word prediction, so you might not use or create any word-vectors.)
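So DBOW training-example generation is about as simple as it gets (again just a toy sketch):

```python
def dbow_pairs(tagged_docs):
    # tagged_docs: iterable of (doc_tag, list_of_tokens);
    # each example predicts a word from the document's own tag/vector
    for tag, tokens in tagged_docs:
        for word in tokens:
            yield tag, word

for doc_input, target in dbow_pairs([('DOC_1', ['The', 'cat', 'ate', 'the', 'mouse', '.'])]):
    print(f"predict {target!r} from {doc_input!r}")
```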
The "DM" method is more analogous to CBOW: one of its training examples might be [predict 'cat' from average('DOC_1', 'The', 'ate, 'the')]. This will inherently train word and document vectors simultaneously. The paragraph-vectors paper also suggests concatenation as another option (rather than sum or average) for combining the context vectors for presentation to the prediction neural network. (In such a case, you'd need <pad> words and might not do any dynamic window-reduction.)