Thanks for the extra details, but that's quite a hard format in which to try to understand & review changes - hand-assembled, informal, with ad hoc comments (& I think a fatal missing-space on 1st line of `infer_gradients()`?).
As I don't work in Cython often, my review & incremental changes rely a lot on seeing code in context, and following the exact practices of surrounding code – which is hidden in this presentation. Best for review would be a PR on Github showing your minimal error-triggering changes – not to ever really integrate the code, but to see the exact code diffs highlighted in context, in a rich interface.
That said:
- If you've changed the signature of `init_d2v_config()`, won't many many other changes in other functions (not shown here) have also been necessary? (Seeing everything in context, without any assumptions of what's not relevant, may be required to see the problem.)
- If it's really only your allocations & method-signature changes that are sufficient to cause the error, I'd look especially at whether you've broken any prior assumptions around nearby-allocated objects/arrays – especially but not only the `c.work` and `c.layer1_size` arrays implicated by your location of a line that can be commented-out to dodge the blatant error.
- That line you've noted as sufficient to comment-out to prevent the crash is an essential bit of initialization; even if its removal stops the error, I wouldn't count on correct operation without it. (That the line merely zeros-out some memory, successfully and without erroring, suggests that secondary effects of that initialization cause problems down the road: perhaps clobbering other necessary data, or enabling real calculations that, had the initialization not happened, would've failed in some other more subtle but non-crashing way.)
Also, I see reference to just the `dm_concat` mode in your code fragments. This is the least-used (& least-tested) mode, and it results in models that are much larger & require far more data & time to train. I've *never* seen an example where this mode offered any clear benefit, and I haven't been able to reproduce the claimed results in the original 'Paragraph Vectors' paper reliant on this mode. Others have reported those results as irreproducible as well.
So I personally wouldn't use `dm_concat` mode unless I was fairly certain I needed to, with sufficient data & time to pay its extra costs, & I'd constantly check its results against the more common & straightforward modes. Its different NN structure from the other modes also means it's kind of shoehorned-in alongside them; the deep internal code may be more fragile and difficult to further adapt. (It's more likely your changes could be interacting with latent-but-thus-far-unstressed errors in the preexisting code.)
And, note that the `dm_concat` concatenation step creates a virtual NN "input layer" that is (2 * window) word-vectors in width. And to the extent a context word might appear more than once in the window, in 2 or more positions both before and after the central target word, there's not, trivially & necessarily, a *single* set-of-update-gradients for one "word" with respect to one training-center-word prediction: there could be anywhere from 1 to 2*window (if the same word appeared in all context-window positions) gradients, which in the existing code would be applied by training in series to the single source word, never becoming a single total summary update to one word.
Given that, it's hard to see how your allocation of a `(window, vector_size)` array for `word_gradients` could be adequate to your stated task. Its size matches neither the `2 * window` NN input layer, nor the count of unique words in the context (the count of actual updates to source words). Further, given that all word-representations are *frozen* during inference, it's unclear that the vestigial per-word updates from the training-backpropagation are really values of interest to you – as opposed to the updates to the doc-vectors (tag-vectors) exactly when a word-of-interest is either (a) the center target predicted word; or (b) one of the window-context words.
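To make the size mismatch concrete, here's a back-of-the-envelope sketch (plain Python, illustrative names only – not gensim internals):

```python
# Hedged sketch: why a (window, vector_size) buffer can't line up with
# dm_concat's virtual input layer. All names here are illustrative.
window = 5
vector_size = 100
dm_tag_count = 1  # one doc-tag vector prepended, assuming the common setup

# dm_concat concatenates 2*window context word-vectors (before & after
# the center target word), plus the doc-tag vector(s):
input_layer_words = 2 * window                                  # 10 word slots
input_layer_width = (dm_tag_count + 2 * window) * vector_size   # 1100 floats

# A context word repeated in multiple window positions gets a separate
# gradient per position, applied serially -- so per-prediction word
# updates number anywhere from 1 to 2*window, never exactly `window`:
proposed_buffer_rows = window  # 5 -- matches neither quantity above
assert proposed_buffer_rows != input_layer_words

print(input_layer_words, input_layer_width)
```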
So unless you have strong unique reasons for using `dm_concat` mode, I'd avoid it as a basis for experiments, & especially for advanced extensions. Testing/extending things in the other modes should be far easier.
Taking a broader view, given your goal to "identify which words of a document induce the biggest change in the document vector", there *may* be useful ways to do that without reaching deep into the training/inference code. This is a bit speculative, with perhaps less theoretical grounding than a gradients-based measure could offer, but it could be worth trying things like:
- examining the doc-vector inferred from single-word documents
- comparing the multiple doc vectors that can be inferred from an original document, the same document perturbed to be *without* the word-of-interest, and the same document perturbed to include *extra* occurrences of the word-of-interest - as an informal indicator of *which way* & *how much* that particular word's presence or absence changes a document's vector-representation
Keep in mind that inference's parameters themselves can benefit a lot from tuning - especially using more than the default number of `epochs` left over from training, and more `epochs` may be especially beneficial on shorter documents.
- Gordon