I'm not familiar with the spaCy docs or recommended practices about what's "more than enough", but regarding those corpus-size numbers:
(3000 docs * 150 words/doc =) 450,000 words is fairly small for `Word2Vec` training. Also, a vocabulary of only 1000 unique words is very small; I wouldn't expect to get strong 300d, or even 100d, word-vectors from such a tiny vocabulary. (Maybe 20-32d vectors?)
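As a rough sketch of what I mean (the corpus below is a toy stand-in, and the parameter names are gensim-3.x era – in gensim 4.x, `size` became `vector_size` and `iter` became `epochs`):

```python
from gensim.models import Word2Vec

# toy stand-in corpus: a list of token-lists; yours would be ~3000 docs
corpus = [
    ["this", "is", "one", "tokenized", "doc"],
    ["and", "this", "is", "another", "tokenized", "doc"],
]

# with only ~1000 unique words, keep the dimensionality small
model = Word2Vec(
    corpus,
    size=24,      # ~20-32 dims for a tiny vocab (gensim 4.x: vector_size)
    window=5,
    min_count=1,  # raise this on a real corpus, to drop rare/noisy words
    iter=20,      # extra passes can help small corpora (gensim 4.x: epochs)
)
print(model.wv["doc"])  # the trained 24d vector for "doc"
```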
Also, 3000 docs is very small for `Doc2Vec`: published work often uses tens-of-thousands to millions of training documents.
You are correct that gensim's summarization functions offer no way to supply your own tokenization or comparison logic. The `summarize()` function takes just a raw string, and you're stuck with its fixed internal sentence- and word-tokenization, and its other processing – which doesn't appear to make use of things like word-vectors anyway, only simple exact word co-occurrences.
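For reference, that whole interface is roughly this (a minimal sketch; the text is just a placeholder, and `ratio`/`word_count` are essentially the only knobs):

```python
from gensim.summarization import summarize

raw_text = """Gensim's summarizer works on one raw string at a time.
It applies its own internal sentence splitting and word tokenization.
There is no hook for custom tokenization, lemmatization, or word-vectors.
Its ranking is based on exact word overlap between sentences.
Short inputs like this one only barely give it anything to rank."""

summary = summarize(raw_text, ratio=0.4)        # keep ~40% of sentences
# summary = summarize(raw_text, word_count=30)  # or cap by total words
print(summary)
```

Note it logs a warning on inputs this short; it expects texts of at least ~10 sentences.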
So if your cleaning process can output an improved plain-string version – especially with regard to typos – that is still readable text, it might help the summarizer a little. But vector-modeling can't help at all.
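Something like this hypothetical pre-step is what I mean (the fix-up table and function name are made up for illustration; a real pipeline might use a spell-checker instead):

```python
import re
from gensim.summarization import summarize

# hypothetical typo-fix table, standing in for real cleaning logic
TYPO_FIXES = {"teh": "the", "recieve": "receive"}

def clean_text(text):
    """Return a still-readable plain string with light-touch fixes applied."""
    text = re.sub(r"\s+", " ", text).strip()   # collapse stray whitespace
    for bad, good in TYPO_FIXES.items():
        text = re.sub(r"\b{}\b".format(bad), good, text)
    return text

raw_document = ("The first sentence has teh typo in it. The second does not. "
                "A third sentence helps teh sentence splitter. "
                "A fourth sentence closes the example out.")
print(summarize(clean_text(raw_document), ratio=0.5))
```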
And, while lemmatization might help the algorithm notice sentence interrelationships a little, it'd also result in "summaries" made up of lemmatized words, which may not be what's wanted.
So I'm not sure I'd expect much from `gensim.summarization`. It's pretty simple and inflexible code, without strong links to other gensim algorithms & practices – and even the tutorial examples are unimpressive.
- Gordon