Train model to associate one word to a set of data.


Max Briones

Nov 9, 2021, 11:20:40 AM
to Gensim
I have a capstone project where I need to ingest student resumes using Word2Vec, and then use the model to find the best fit for project positions using vector math. My problem is that I'm having trouble associating a student ID with their resumes. The image below shows how I'm sorting my information. The only way I'm able to somewhat relate the Candidate ID to the resume is by adding the Candidate ID to the start of the resume, but it doesn't give great results, and it also forces me to keep `min_count = 1`, which creates a lot of noise in the model. Any help would be appreciated.


Gordon Mohr

Nov 15, 2021, 9:51:58 AM
to Gensim
Word2vec itself doesn't particularly associate words to unique record IDs, so this seems a strained application of the algorithm. 

Is this an academic exercise, in which the technology is fixed even if it's not the best approach, or is the genuine goal assigning real candidates to well-fit positions?

If the latter, other techniques may be better. Especially if the target positions can be modeled as a small number of classes, defined by a 'golden training set' of good matches rather than textual descriptions, exploring the full range of potential text-classification approaches may make more sense. (Some of those approaches might be enriched by word-vectors, but after a full exploration word2vec might only wind up a minor contributor to the final approach.)

Attempting to shoehorn per-record identifiers into the word2vec model as a single prepended token will only cause them to be influenced by other tokens within `window` words, and still leaves them with only one occurrence. As you've noted, that forces `min_count=1` to retain them, and such a low `min_count` value is almost always a bad idea in word2vec-like algorithms. But along that same improvised-trickery direction, you could (1) re-insert the identifier every few words to put it near all the text's words; or (2) use a giant `window` that ensures that the identifier (& every other word in each text) appears in *all* other words' training contexts.
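
As a rough illustration of hack (1), here is a minimal sketch in plain Python. The helper name `interleave_id`, the `every` parameter, and the `CAND_0042` identifier are made up for this example; they are not part of Gensim.

```python
def interleave_id(tokens, candidate_id, every=5):
    """Return a copy of `tokens` with `candidate_id` inserted before
    every `every`-th word, so the ID lands near all of the text's words."""
    if every < 1:
        raise ValueError("every must be >= 1")
    out = []
    for i, tok in enumerate(tokens):
        if i % every == 0:
            out.append(candidate_id)  # repeat the (hypothetical) record ID
        out.append(tok)
    return out


# Hypothetical resume text, already tokenized by whitespace.
resume = "python developer with experience in machine learning and data pipelines".split()
tagged = interleave_id(resume, "CAND_0042", every=4)
print(tagged)
```

The resulting token lists would then be passed to `Word2Vec` as its training sentences. Because each ID now occurs several times in any reasonably long resume, it should no longer depend on `min_count=1` to survive vocabulary pruning, though how well the resulting ID vectors work is something you'd have to test on your own data.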

However, those hacks start to make plain word2vec behave more like its sibling algorithm `Doc2Vec` (aka the 'Paragraph Vector' algorithm), which by design trains up whole-text vectors, that sort-of 'float' into all contexts. So you may also want to try `Doc2Vec`, with the record-IDs as the paragraph-vector-keys (aka 'tags' in Gensim's implementation).

- Gordon
