Thanks for the extra details, but that's quite a hard format in which to try to understand & review changes - hand-assembled, informal, with ad hoc comments (& I think a fatal missing-space on 1st line of `infer_gradients()`?).
As I don't work in Cython often, my review & incremental changes rely a lot on seeing code in context, and following the exact practices of surrounding code – which is hidden in this presentation. Best for review would be a PR on Github showing your minimal error-triggering changes – not to ever really integrate the code, but to see the exact code diffs highlighted in context, in a rich interface.
That said:
- If you've changed the signature of `init_d2v_config()`, won't many many other changes in other functions (not shown here) have also been necessary? (Seeing everything in context, without any assumptions of what's not relevant, may be required to see the problem.)
- If it's really only your allocations & method-signature changes that are sufficient to cause the error, I'd look especially at whether you've broken any prior assumptions around nearby-allocated objects/arrays – especially but not only the `c.work` and `c.layer1_size` arrays implicated by your location of a line that can be commented-out to dodge the blatant error.
- That line you've noted as sufficient to comment-out to prevent the crash is an essential bit of initialization; even if its removal stops the error, I wouldn't count on correct operation without it. (That the line merely zeros-out some memory, successfully and without erroring, suggests that secondary effects of that initialization cause problems down the road: perhaps clobbering other necessary data, or enabling real calculations that, had the initialization not happened, would've failed in some other more subtle but non-crashing way.)
Also, I see reference to just the `dm_concat` mode in your code fragments. This is the least-used (& least-tested) mode, and it results in models that are much larger & require far more data & time to train. I've *never* seen an example where this mode offered any clear benefit, and I haven't been able to reproduce the claimed results in the original 'Paragraph Vectors' paper reliant on this mode. Others have reported those results as irreproducible as well.
So I personally wouldn't use `dm_concat` mode unless I was fairly certain I needed to, with sufficient data & time to pay its extra costs, & I'd constantly check its results against the more common & straightforward modes. Its different NN structure from the other modes also means it's kind of shoehorned-in alongside them; the deep internal code may be more fragile and difficult to further adapt. (It's more likely your changes could be interacting with latent-but-thus-far-unstressed errors in the preexisting code.)
And, note that the `dm_concat` concatenation step creates a virtual NN "input layer" that is (2 * window) word-vectors in width. And to the extent a context word might appear more than once in the window, in 2 or more positions both before and after the central target word, there's not, trivially & necessarily, a *single* set-of-update-gradients for one "word" with respect to one training-center-word prediction: there could be anywhere from 1 to 2*window (if the same word appeared in all context-window positions) gradients, which in the existing code would be applied by training in series to the single source word, never becoming a single total summary update to one word.
Given that, it's hard to see how your allocation of a `(window, vector_size)` array for `word_gradients` could be adequate to your stated task. Its size matches neither the `2 * window` NN input layer, nor the count of unique words in the context (the count of actual updates to source words). Further, given that all word-representations are *frozen* during inference, it's unclear that the vestigial per-word updates from the training-backpropagation are really values of interest to you – as opposed to the updates to the doc-vectors (tag-vectors) exactly when a word-of-interest is either (a) the center target predicted word; or (b) one of the window-context words.
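To make the size mismatch concrete, here's a back-of-the-envelope sketch (plain Python, illustrative names only – not gensim internals):

```python
# Hedged sketch: why a (window, vector_size) buffer can't line up with
# dm_concat's virtual input layer. All names here are illustrative.
window = 5
vector_size = 100
dm_tag_count = 1  # one doc-tag vector prepended, assuming the common setup

# dm_concat concatenates 2*window context word-vectors (before & after
# the center target word), plus the doc-tag vector(s):
input_layer_words = 2 * window                                  # 10 word slots
input_layer_width = (dm_tag_count + 2 * window) * vector_size   # 1100 floats

# A context word repeated in multiple window positions gets a separate
# gradient per position, applied serially -- so per-prediction word
# updates number anywhere from 1 to 2*window, never exactly `window`:
proposed_buffer_rows = window  # 5 -- matches neither quantity above
assert proposed_buffer_rows != input_layer_words

print(input_layer_words, input_layer_width)
```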
So unless you have strong unique reasons for using `dm_concat` mode, I'd avoid it as a basis for experiments, & especially for advanced extensions. Testing/extending things in the other modes should be far easier.
Taking a broader view, given your goal to "identify which words of a document induce the biggest change in the document vector", there *may* be useful ways to do that without reaching deep into the training/inference code. This is a bit speculative, with perhaps less theoretical grounding than a gradients-based measure could offer, but it could be worth trying things like:
- examining the doc-vector inferred from single-word documents
- comparing the multiple doc vectors that can be inferred from an original document, the same document perturbed to be *without* the word-of-interest, and the same document perturbed to include *extra* occurrences of the word-of-interest - as an informal indicator of *which way* & *how much* that particular word's presence or absence changes a document's vector-representation
Keep in mind that inference's parameters themselves can benefit a lot from tuning - especially using more than the default number of `epochs` left over from training, and more `epochs` may be especially beneficial on shorter documents.
- Gordon