My understanding of their paper is that the categories are practically 'synthetic words', and thus are trained identically to the words. They really just drop into the skip-gram pairs, as if they were themselves words.
So looking into the details of what happens in `train_sg_pair()` may not be necessary – except to satisfy curiosity. You just want to make sure pairs of [category_tag, target_word] are mixed in with all the other usual pairs of [anchor_word, target_word]. And the easiest way to do that would be preprocessing the corpus to expand it with more synthetic text that represents the categories, rather than editing the training code.
By way of example, consider a longer sentence like the following, and assume a 'window' of 2 for simplicity:
['many', 'projects', 'are', 'preferring', 'python', 'for', 'machine', 'learning', '.']
In plain skip-gram training, the pairs anchored around 'python' would be:
['python', 'are']
['python', 'preferring']
['python', 'for']
['python', 'machine']
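A minimal sketch of that pair generation (not gensim's actual implementation, just the idea) might look like:

```python
def skipgram_pairs(tokens, anchor_index, window=2):
    """Collect [anchor, target] pairs within `window` positions of the anchor."""
    pairs = []
    for offset in range(-window, window + 1):
        target_index = anchor_index + offset
        # skip the anchor itself, and positions that fall off either end
        if offset != 0 and 0 <= target_index < len(tokens):
            pairs.append([tokens[anchor_index], tokens[target_index]])
    return pairs

sentence = ['many', 'projects', 'are', 'preferring', 'python',
            'for', 'machine', 'learning', '.']
print(skipgram_pairs(sentence, sentence.index('python')))
# [['python', 'are'], ['python', 'preferring'], ['python', 'for'], ['python', 'machine']]
```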
Now let's consider 'python' to be a word in the category 'CAT_PROGRAMMING_LANGUAGE'. I believe (but am not sure) that under the Yahoo paper's training method, the category training is performed, at this anchor-point while iterating over the text, by also training the pairs:
['CAT_PROGRAMMING_LANGUAGE', 'are']
['CAT_PROGRAMMING_LANGUAGE', 'preferring']
['CAT_PROGRAMMING_LANGUAGE', 'for']
['CAT_PROGRAMMING_LANGUAGE', 'machine']
Presumably, this substitution would also affect other contexts that overlap that location. That is, 'CAT_PROGRAMMING_LANGUAGE' would also sometimes appear as the 2nd (target) word of the pair. (But this might not be strictly necessary.)
So really, the known categories are best thought of (and probably implemented) as special words, sometimes dropped in. The most quick-and-dirty way to simulate this would be, when you have the one sentence above, to expand your corpus to also include the sentence:
['many', 'projects', 'are', 'preferring', 'CAT_PROGRAMMING_LANGUAGE', 'for', 'machine', 'learning', '.']
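That duplication step could be a simple preprocessing pass over the corpus. Here's a rough sketch, where the `word2cat` mapping from words to category tags is an assumption of mine (you'd supply it from whatever category data you have):

```python
# Hypothetical word-to-category mapping; not from the paper or gensim.
word2cat = {'python': 'CAT_PROGRAMMING_LANGUAGE'}

def expand_corpus(sentences, word2cat):
    """Yield each sentence, plus a copy with category words replaced by tags."""
    for sentence in sentences:
        yield sentence  # the original sentence, unchanged
        if any(word in word2cat for word in sentence):
            yield [word2cat.get(word, word) for word in sentence]

corpus = [['many', 'projects', 'are', 'preferring', 'python',
           'for', 'machine', 'learning', '.']]
for s in expand_corpus(corpus, word2cat):
    print(s)
```

You'd then feed the expanded stream to Word2Vec as usual, with no changes to the training code.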
That might not be ideal – other words in the sentence, outside the category-influenced window (like 'many', 'projects', etc.), now also artificially get more training examples. (Maybe in practice that's not a problem.) Excerpting just the window around the word with category-alternates would minimize that effect – though there'd still be some distortion around the excerpt edges. But this preprocessing-only approach might be an easy initial way to test most of the technique without modifying library code. If that works, you could then attempt a more precise modification of the training – editing the `train_*` methods, in both the Python and Cython paths, to create only the minimal number of new training pairs.
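The excerpt-only variant could be sketched like this – again with a hypothetical `word2cat` mapping, emitting only a short window-sized snippet per category word instead of a whole duplicated sentence:

```python
def category_excerpts(sentence, word2cat, window=2):
    """Yield a window-sized excerpt around each category word,
    with the word itself replaced by its category tag."""
    for i, word in enumerate(sentence):
        if word in word2cat:
            start = max(0, i - window)
            excerpt = sentence[start:i + window + 1]
            excerpt[i - start] = word2cat[word]  # swap in the tag
            yield excerpt

sentence = ['many', 'projects', 'are', 'preferring', 'python',
            'for', 'machine', 'learning', '.']
for ex in category_excerpts(sentence, {'python': 'CAT_PROGRAMMING_LANGUAGE'}):
    print(ex)
# ['are', 'preferring', 'CAT_PROGRAMMING_LANGUAGE', 'for', 'machine']
```

Only the words already inside the category word's window receive extra examples, at the cost of slightly distorted contexts at the excerpt edges.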
- Gordon