Gensim's .model file format/encoding and privacy issues


Edgar Steiger

Nov 22, 2022, 10:39:26 AM
to Gensim
Thanks for maintaining this amazing software, which we used with great success in a non-human-language task (healthcare research).
Currently we're investigating how to make our derived Doc2Vec embedding model available to other researchers. We have a few concerns and problems where I hope this group can give answers or suggestions:
  1. I tried to open the doc2vec ".model" file (the one you get from the save command in Python) directly in an editor. I couldn't figure out the correct encoding, if there is one? ANSI and UTF-8 both showed weird characters (I'm on a Windows machine).
  2. I tried to open this file to see what information from the corpus is actually persisted in the .model file. I figure it will be the whole vocabulary as well as all document tags and the model weights. Are vectors part of the .model file? If yes, how can I save a doc2vec model without the document tags and document vectors, but still allow inference for new documents?
So what I'm looking for is a way to save and safely share a doc2vec model file without giving away information on any original individual documents that are part of our training corpus. Is this possible?

Gordon Mohr

Nov 22, 2022, 11:56:52 AM
to Gensim
The Gensim `.model` file is essentially a Python 'pickle' object-serialization of the main `Doc2Vec` object, but with any oversized internal arrays saved as separate files alongside it.

This multi-file approach: (1) offers some efficiencies in saving/loading; (2) allows use of operating-system-level memory-mapping of those large arrays on load, which can be useful in some scenarios; and (3) works around some implementation limits with regard to large arrays (as are common in these models) in older Pythons.

But it does require all the files be kept together for the main `.model` file to remain loadable. And newer Pythons no longer have that implementation limit – so at the cost of some efficiency/optional features, Gensim models could now be just Python-pickled, into one single file, for simplicity. (That is, skipping the native object `.save()`/`.load()` methods and just using Python's usual pickling utilities/idioms.)
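For illustration, a single-file round-trip with Python's standard `pickle` module might look like the following sketch. The `model` dict here is just a stand-in for a real loaded `Doc2Vec` object, so this runs without Gensim installed; on a newer Python, the same idiom works on the actual model instance:

```python
import pickle

# Stand-in for a trained model object; on newer Pythons the same
# idiom works on an actual Doc2Vec instance, large arrays and all.
model = {"vector_size": 100, "epochs": 20}

# Serialize everything into one single file, skipping Gensim's .save().
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Restore it later in one step.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True: the round-trip preserves the object
```

The trade-off, as noted above, is giving up the memory-mapping option and separate-array efficiencies of Gensim's native `.save()`.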

The Python pickle format isn't optimized for legibility in other tools. Any textual data in it will be mixed with other binary data & indicators of Python-specific structuring. Essentially, pickled objects are only meant to be read back in by other Python code.

Even further, since the act of unpickling inherently lets the loaded file run arbitrary Python code, **loading a pickled file is equivalent to letting those who gave you the file run code on your machine**. That's an important consideration for anyone using, or distributing, files that use Python pickling. It may look like inert data; it's potentially arbitrary code.
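That danger is easy to demonstrate with the standard library alone. The toy `Payload` class below is entirely hypothetical (nothing to do with Gensim's own files); it shows how `__reduce__` lets a crafted pickle execute a callable of the author's choosing the moment it is loaded:

```python
import pickle

class Payload:
    # __reduce__ tells pickle how to "rebuild" the object on load;
    # pickle will call whatever callable it names, with these args.
    def __reduce__(self):
        # A harmless print here, but it could just as easily be
        # os.system("...") or any other destructive call.
        return (print, ("arbitrary code ran during unpickling!",))

data = pickle.dumps(Payload())

# Merely loading the bytes runs the embedded call.
pickle.loads(data)  # prints: arbitrary code ran during unpickling!
```

This is why pickled model files should only ever be loaded from trusted sources.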

None of the training documents are retained as part of the `Doc2Vec` model. But it does, inherently & vividly, retain & share a list of all learned tokens from the training corpus, with their relative frequencies. (That is, all tokens appearing at least `min_count` times.) If these are generic natural-language words, probably no problem. If instead these are things like unique customer IDs, distinguishable names of customers/clients, subjects of database rows, tokens that only appear in particular proprietary corpora, etc, then the trained model will inherently leak that those tokens were in the training data.

You could consider removing such sensitive tokens before training the model – that's the cleanest approach, as preprocessing entirely separate from the Gensim code – but then those tokens wouldn't receive vectors, & overall modeling quality may go down.
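As a rough sketch of that pre-training scrubbing step (the blocklist and sample documents below are invented for illustration; real sensitive-token detection would need more care):

```python
# Hypothetical blocklist of sensitive tokens, e.g. unique customer IDs.
SENSITIVE = {"CUST-0001", "CUST-0002"}

def scrub(tokens):
    """Drop any token found in the blocklist before training."""
    return [t for t in tokens if t not in SENSITIVE]

# Invented tokenized documents, purely for demonstration.
docs = [["invoice", "CUST-0001", "paid"], ["refund", "CUST-0002"]]
clean = [scrub(d) for d in docs]
print(clean)  # [['invoice', 'paid'], ['refund']]
```

The scrubbed corpus would then be fed to `Doc2Vec` training as usual; the removed tokens simply never enter the vocabulary.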

You could also consider training first, then removing or obscuring some tokens later. But there's no built-in support for doing this – it'd require some custom code to mutate the model – and once you start doing such arbitrary surgery on a post-training model, the effects on its utility for any intended purpose become murky. For example, if the inference capabilities were created expecting many tens of thousands of tokens to be part of relevant documents, and those tokens are now being erased from the model's weights, will inference on future docs be as useful? It might – ideally the elided tokens would be rare-ish unique things that didn't have giant effects on the model's main learnings. But if they were there during training, maybe the model over-relied on them. You'd want to run some experiments testing the effects of different strategies.

Separate from mere "in-or-out" token information, the fact that vector similarities can be suggestive of co-occurrences in training might also imply, but not guarantee, that certain things were in the corpus. But this is harder to reason about. When these algorithms are used properly, where they are strongest, they reflect broad generalities in the corpus, by design. But if applied to insufficient or peculiar data, or with extremely oversized/misparameterized models prone to overfitting, there might be an increased chance that some token-to-token weight correlations give inadvertently strong hints of certain features in the corpus. Given the range of ways I've seen things clumsily misused, it's hard to rule out that some extreme/corner-case usages leak more suggestive, but rarely dispositive, info.

(One example that jumps to mind: natural language text tends to have tokens in a smooth Zipfian distribution. But certain structured databases might instead have clusters of related tokens that all have identical or highly-related frequencies – for example, peculiar field names that are repeated per row or JSON object. Seeing that atypical distribution of frequencies in a model's vocabulary data could then be highly suggestive that the training data came from a particular kind of source.)
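One could probe for that kind of fingerprint with a quick frequency tally. A rough sketch, with an invented toy corpus standing in for real training data:

```python
from collections import Counter

# Toy corpus: 50 "rows" that each repeat the same field-name tokens,
# plus one document of ordinary value tokens. Invented for illustration.
corpus = [["name", "age", "city"]] * 50 + [["alice", "berlin"]]

freqs = Counter(token for doc in corpus for token in doc)

# Field-name tokens all share an identical count - the telltale cluster
# that natural language's smoothly decaying frequencies wouldn't show.
print(freqs.most_common(3))  # [('name', 50), ('age', 50), ('city', 50)]
```

The same tally could be run against a model's retained vocabulary frequencies, rather than a raw corpus, to see what the shared file itself reveals.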

- Gordon

Edgar Steiger

Nov 22, 2022, 2:52:49 PM
to gen...@googlegroups.com
Thanks Gordon for the detailed answer!

I'm actually not concerned about sharing the tokens ("words") or their frequencies; even the "word" vectors would be fine to share. I set a reasonable min_count parameter so that rare "words", which could give away too much individual information, are not used at all.

I'm concerned about sharing the "document" vectors from the training data, or the original "documents" themselves, though (since these describe human beings). But I want to enable other researchers to infer vectors for new "documents" using my trained model.

Can I safely share a usable .model file of doc2vec without the accompanying vector file? 

I guess the vocabulary information and the trained weights should be enough to infer new vectors of new documents.


Gordon Mohr

Nov 23, 2022, 9:36:23 AM
to Gensim
Yes, a model doesn't have to retain the bulk-trained doc-vectors in order to do inference on new data.

If I recall correctly (but please test on your loaded model & let me know if there are issues), the best way to delete just the doc-vectors, leaving the model functional for inference, is something like the following, replacing them with a stub empty set of vectors:

    from gensim.models import KeyedVectors

    d2v_model.dv = KeyedVectors(d2v_model.dv.vector_size, 0)

Thereafter, you won't be able to look up trained doc-vectors in that model, but can infer new ones from new texts.

Downstream users of shared models should be informed that:

* For best results, they should preprocess their texts-for-inference in the same way (cleanup, tokenization, etc) as the training data was preprocessed. (You may want to provide a recommended function.) And, unrecognized tokens are essentially ignored. 
* Per my advisory in the prior message, `.model` files can essentially trigger arbitrary code, so should only be loaded from trusted/verified sources. (True `.save()` files from Gensim only re-form the saved object, but a maliciously-crafted replacement file could do other things, and that capability is inherent to the convenience of this file-format.)
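On the first point, a shared preprocessing function helps keep inference inputs consistent. The function below is a hypothetical minimal example, not Gensim's own; the real one should mirror whatever cleanup and tokenization the training corpus actually received:

```python
import re

def preprocess(text):
    """Hypothetical minimal cleanup: lowercase, then keep runs of
    word characters. Mirror the training pipeline, whatever it was."""
    return re.findall(r"\w+", text.lower())

tokens = preprocess("New Document, for Inference!")
print(tokens)  # ['new', 'document', 'for', 'inference']
# These tokens would then be passed to model.infer_vector(tokens).
```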

- Gordon

Edgar Steiger

Nov 23, 2022, 10:30:07 AM
to Gensim
Thank you Gordon, this was exactly what I needed:

    mymodel.dv = KeyedVectors(mymodel.dv.vector_size, 0)  # replaces the document vectors and tags completely
    mymodel.dv.vectors  # just checking there's nothing left
    mymodel.save("mymodel.model")  # saves a much smaller model file (I have many more "documents" than "words"), with no additional .npy file for the document vectors
    mymodel.infer_vector(["word", "anotherword"])  # still gives a reasonable document vector