The Gensim `.model` file is essentially a Python 'pickle' object-serialization of the main `Doc2Vec` object, but with any oversized internal arrays saved as separate files alongside it.
This multi-file approach: (1) offers some efficiencies in saving/loading; (2) allows use of operating-system-level memory-mapping of those large arrays on load, which can be useful in some scenarios; and (3) works around some implementation limits on pickling large arrays (as are common in these models) in older Pythons.
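For concreteness, a minimal sketch of the native round-trip – the toy corpus & filenames here are just hypothetical placeholders, & whether extra sibling files actually appear depends on model size & Gensim version:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny toy corpus, just so there's a model to save (hypothetical data)
corpus = [TaggedDocument(words=['hello', 'world'], tags=[0]),
          TaggedDocument(words=['goodbye', 'world'], tags=[1])]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=5)

# Native save: oversized internal arrays may land in separate sibling
# files (e.g. 'my.model.dv.vectors.npy') next to the main file
model.save('my.model')

# Native load: expects any sibling files alongside 'my.model'; the
# optional mmap='r' memory-maps the big arrays read-only from disk
# rather than copying them into RAM
model = Doc2Vec.load('my.model', mmap='r')
```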
But it does require all the files be kept together for the main `.model` file to remain loadable. And, newer Pythons no longer have that implementation limit – so at the cost of some efficiency/optional-features, Gensim models could now be just Python-pickled, into one single file, for simplicity. (That is, skipping the native object `.save()`/`.load()` methods and just using Python's usual pickling utilities/idioms.)
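Continuing the sketch above, single-file pickling is just the usual idiom:

```python
import pickle

# One self-contained file, at the cost of the separate-array
# efficiencies & the mmap option of the native .save()/.load()
with open('my_model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('my_model.pkl', 'rb') as f:
    model = pickle.load(f)
```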
The Python pickle-format isn't optimized for legibility in other tools. Any textual data in it will be mixed with other binary data & indicators of Python-specific structuring. Essentially, pickled objects are only meant to be read-in by other Python code.
Even further, since the act of unpickling inherently lets the loaded file run arbitrary Python code, **loading a pickled file is equivalent to letting those who gave you the file run code on your machine**. That's an important consideration for anyone using, or distributing, files that use Python pickling. It may look like inert data; it's potentially arbitrary code.
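A tiny (harmless) demonstration of why – any class can smuggle a callable-of-the-attacker's-choice into its pickled form:

```python
import os
import pickle


class LooksInert:
    # Unpickling calls whatever __reduce__ returns: here just a shell
    # echo, but a hostile payload could do anything your account can
    def __reduce__(self):
        return (os.system, ('echo "arbitrary code ran at load-time"',))


payload = pickle.dumps(LooksInert())
pickle.loads(payload)  # the shell command runs during this 'load'
```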
None of the training documents are retained as part of the `Doc2Vec` model. But it does inherently retain, & plainly expose, a list of all learned-tokens from the training corpus, with their relative frequencies. (That is, all tokens appearing at least `min_count` times.) If these are generic natural-language words, probably no problem. If instead these are things like unique customer IDs, distinguishable names of customers/clients, subjects-of-database-rows, tokens that only appear in particular proprietary corpora, etc, then the trained model will inherently leak that those tokens were in the training data.
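That is, anyone with the file can trivially enumerate the surviving vocabulary & its raw counts – e.g., in the Gensim 4.x API:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load('my.model')

# Every token that survived min_count, with its raw corpus frequency,
# is plainly readable from the loaded model
for token in model.wv.index_to_key[:20]:
    print(token, model.wv.get_vecattr(token, 'count'))
```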
You could consider removing such sensitive tokens before training the model – that's the cleanest approach, & keeps that preprocessing totally separate from Gensim code – but then those tokens wouldn't receive vectors, & overall modeling-quality may go down.
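For example, a minimal pre-training scrub, assuming a hypothetical regex that matches your sensitive tokens:

```python
import re

# Hypothetical pattern for sensitive tokens, e.g. unique customer IDs
SENSITIVE = re.compile(r'^cust\d{6}$')

def scrub(tokens):
    # Drop matching tokens so Gensim never sees them
    return [t for t in tokens if not SENSITIVE.match(t)]

raw_docs = [['order', 'for', 'cust004213', 'shipped'],
            ['cust991302', 'filed', 'a', 'complaint']]
cleaned_docs = [scrub(doc) for doc in raw_docs]
# -> [['order', 'for', 'shipped'], ['filed', 'a', 'complaint']]
```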
You could also consider training first, then removing or obscuring some tokens later. But there's no built-in support for doing this – it'd require some custom code to mutate the model – and once you start doing such arbitrary surgery on a post-training model, the effects on its utility for any intended purpose become murky. For example, if the inference-capabilities were created expecting many tens-of-thousands of tokens to be part of relevant documents, & those tokens are now erased from the model's weights, will inference on future docs be as useful? It might – ideally the elided tokens would be rare-ish unique things, that didn't have giant effects on the model's main learnings. But if they were there during training, maybe the model over-relied on them. You'd want to run some experiments testing the effects of different strategies.
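As a rough illustration of the kind of surgery involved – not a supported operation, & with no guarantee every internal structure stays consistent – one could rename a token's readable string to an opaque alias in the model's lookup tables (Gensim 4.x attribute names assumed):

```python
import hashlib

def obscure_token(kv, token):
    # Swap the readable string for an opaque alias, in-place, in both
    # lookup directions; the vector & count stay, the string is gone.
    # Note: a deterministic hash is still guessable by brute force on
    # candidate tokens, so a random alias would leak less.
    if token not in kv.key_to_index:
        return
    idx = kv.key_to_index.pop(token)
    alias = 'OBSCURED_' + hashlib.sha256(token.encode()).hexdigest()[:12]
    kv.key_to_index[alias] = idx
    kv.index_to_key[idx] = alias

obscure_token(model.wv, 'cust004213')
```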
Separate from mere "in-or-out" token information, the fact that vector-similarities can be suggestive of co-occurrences in training might also imply, but not guarantee, that certain things were in the corpus. But this is harder to reason about. When these algorithms are used properly, where they are strongest, they reflect broad generalities in the corpus, by design. But if applied to insufficient or peculiar data, or with extremely oversized/misparameterized models prone to overfitting, there might be increased chance that some token-to-token weight-correlations give inadvertently-strong hints of certain features in the corpus. Given the range of ways I've seen things clumsily misused, it's hard to rule out that some extreme/corner-case usages leak more suggestive, but rarely dispositive, info.
(One example that jumps to mind: natural language text tends to have tokens in a smooth Zipfian distribution. But certain structured databases might instead have clusters of related tokens that all have identical or highly-related frequencies – for example, peculiar field names that are repeated per row or JSON object. Seeing that atypical distribution of frequencies in a model's vocabulary data could then be highly suggestive that the training data came from a particular kind of source.)
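Spotting that tell in a loaded model would itself be a few lines – e.g., counting how many distinct tokens share each exact frequency (again assuming the Gensim 4.x API):

```python
from collections import Counter

# How many distinct tokens share each exact frequency? Natural language
# yields few large clusters; per-row/per-record field names can yield
# huge clusters of tokens with identical counts.
counts = [model.wv.get_vecattr(t, 'count') for t in model.wv.index_to_key]
print(Counter(counts).most_common(10))
```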