Superficially, gensim does it this way because that's what was done in the original word2vec.c package. Specifically, the `distance` executable averaged all the words input on a line, then returned the vocabulary words 'closest' to that combination, and the `word-analogy` executable averaged together two positive words, with one word negated, to set the target vector from which similarities were calculated.
More intuitively, the idea of 'words as directions in a high-dimensional space' requires an ability to conserve and combine those directions. Summing/averaging the full n-dimensional vectors plausibly does that. Averaging just the distances quickly loses the 'directionality', collapsing each comparison to a single value.
So for example perhaps the vectors for 'male' and 'monarch', when averaged in n-dimensional space, point in a very similar direction as 'king'. You could test if this is the case, or perhaps find the word 'king' if you didn't already expect it, by finding [pseudocode] `minimum(all_distances(mean('male', 'monarch')))`.
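A minimal sketch of that pseudocode, using a made-up 3-dimensional toy vocabulary (real word2vec vectors are learned; these values are invented purely to illustrate the mechanics of average-then-rank):

```python
import numpy as np

# Hypothetical toy vocabulary; values chosen so 'king' sits near the
# male+monarch midpoint, as the surrounding text imagines.
vocab = {
    "king":    np.array([0.9, 0.4, 0.1]),
    "male":    np.array([1.0, 0.0, 0.0]),
    "monarch": np.array([0.0, 1.0, 0.2]),
    "apple":   np.array([0.1, 0.1, 1.0]),
}
# Unit-normalize, as gensim does before similarity lookups.
vocab = {w: v / np.linalg.norm(v) for w, v in vocab.items()}

def nearest(target, exclude=()):
    # Candidate vectors are unit length, and the query's own norm is
    # the same for every candidate, so ranking by plain dot product
    # is equivalent to ranking by cosine distance.
    sims = {w: float(np.dot(target, v))
            for w, v in vocab.items() if w not in exclude}
    return max(sims, key=sims.get)

# Average the seed vectors first, then measure all distances once:
mixed = (vocab["male"] + vocab["monarch"]) / 2
print(nearest(mixed, exclude=("male", "monarch")))  # -> king
```

The `exclude` step mirrors gensim's `most_similar`, which also omits the input words themselves from the results.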
If you instead do `minimum(means(zip(all_distances('male'), all_distances('monarch'))))`, you won't be focusing in on the same mixed-meaning 'midpoint' direction between the two seed words; you'll also be finding words very close to one seed but far from the other (whose simple distance-average is also small). These might still be interesting words! (You'd have to try it.) But those words aren't usually what tasks like analogy-solving are looking for, in the high-dimensional space.
Another factor: if you have a large number m of seed words, averaging them once, then computing the v distances to the entire v-sized vocabulary, requires less calculation than computing the m*v distances, then performing the v averages.
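A back-of-envelope count makes the gap concrete, assuming each n-dimensional distance costs roughly n multiply-adds (the sizes below are hypothetical, chosen only to show the scaling):

```python
# m seed words, v vocabulary words, n dimensions (all made-up sizes).
m, v, n = 10, 100_000, 300

# Average first (m*n ops for the mean), then v distances of n ops each:
avg_then_distance = m * n + v * n

# m*v distances of n ops each, then v averages over m values each:
distance_then_avg = m * v * n + v * m

print(avg_then_distance)                      # 30_003_000
print(distance_then_avg)                      # 301_000_000
print(distance_then_avg / avg_then_distance)  # roughly m times more work
```

The ratio grows linearly with the number of seed words m, so the average-first order is also simply the cheaper one.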
- Gordon