Text - Tag association


Dave Waterworth

Oct 6, 2021, 6:57:12 PM
to Gensim
I have a corpus of approx 400k (text, tags) pairs, where text is a relatively short description and tags is a set of 3-5 tags from a vocab of around 250 tags.

I'm interested in training a relatively simple model that can learn which words are most likely associated with each tag. I'd like to eventually extend this to the association between n-grams of (consecutive) words and n-grams of tags.

Is this something I could do easily with gensim? It seems similar to most language models, except that instead of n-grams from a single text string I need to consider bigrams consisting of a pair of tokens, one from each input string. Or is it just as easy to enumerate these myself and use a counter?

Gordon Mohr

Oct 11, 2021, 8:12:49 PM
to Gensim
To do exactly what you describe - "learn which words are most likely associated with each tag" - you wouldn't particularly need Gensim. You could just use a counting-based approach: for each word-token, tally each tag seen alongside that word-token, or for each tag, tally the word-tokens seen alongside that tag.
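
A minimal sketch of that counting approach, assuming simple lowercased whitespace tokenization and that each word is counted once per text. The toy `pairs` data and the `count_cooccurrences` name are illustrative only, not anything from Gensim:

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for the ~400k (text, tags) pairs.
pairs = [
    ("red cotton shirt", {"clothing", "red"}),
    ("blue cotton dress", {"clothing", "blue"}),
    ("red ceramic mug", {"kitchen", "red"}),
]

def count_cooccurrences(pairs):
    """Return a mapping tag -> Counter of word-tokens seen in texts with that tag."""
    words_by_tag = defaultdict(Counter)
    for text, tags in pairs:
        # Use a set so a word repeated within one text is counted only once.
        tokens = set(text.lower().split())
        for tag in tags:
            words_by_tag[tag].update(tokens)
    return words_by_tag

words_by_tag = count_cooccurrences(pairs)
```

With this structure, `words_by_tag[tag].most_common(10)` would list the words seen most often alongside a given tag. Swapping the roles of words and tags gives the per-word tally of tags instead.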

Doing the same with bigrams (etc) is just a matter of preprocessing the texts to turn the bigrams into wordlike tokens - either instead of, or in addition to the original unigrams. 
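
As a rough illustration of that preprocessing step, assuming already-tokenized text and an underscore as the joining character (the `with_bigrams` helper is a hypothetical name, not a Gensim function):

```python
def with_bigrams(tokens, sep="_"):
    """Return the original unigrams followed by bigrams of consecutive tokens.

    Each bigram is joined with `sep` so it behaves like a single word-token
    in any later counting step.
    """
    bigrams = [a + sep + b for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams

tokens = with_bigrams("red cotton shirt".split())
# tokens is now ['red', 'cotton', 'shirt', 'red_cotton', 'cotton_shirt']
```

Returning only `bigrams` instead would give the "instead of" variant rather than the "in addition to" one.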

(The resulting growth in unique n-grams *might* require some sort of overflow-to-disk or multi-pass approach to do a precise count, or recourse to some sort of approximate counting... but there's not much in Gensim to help there. Some of the Word2Vec/Phrases code has a *really* crude, and in my opinion almost always inadvisable, mechanism for discarding less-frequent tallies mid-counting controlled by a `max_vocab_size` parameter.) 

If your real goal goes beyond just word-tag association summary numbers, you might use those counts to calculate PMI (Pointwise Mutual Information) or TF-IDF values as another indicator of a term's informativeness. Or you might proceed to training classifiers that predict the labels from the texts (in whatever representation). Some classifiers can be interrogated, either explicitly or via the submission of synthetic texts, to get indications of which words most often signify certain labels. And some Gensim models, like LDA or Doc2Vec, can convert docs into formats that may then be analyzed to find various influential or 'close-in-meaning' words.
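
For the PMI option, one possible sketch works at the document level: PMI(w, t) = log2( P(w, t) / (P(w) P(t)) ), where each probability is the fraction of texts containing the word, the tag, or both. The tokenization, the toy data, and the `pmi_scores` name are all assumptions for illustration:

```python
import math
from collections import Counter

# Hypothetical toy data standing in for the real (text, tags) pairs.
pairs = [
    ("red cotton shirt", {"clothing", "red"}),
    ("blue cotton dress", {"clothing", "blue"}),
    ("red ceramic mug", {"kitchen", "red"}),
]

def pmi_scores(pairs):
    """Return {(word, tag): PMI in bits} for every word-tag pair seen together."""
    n_docs = len(pairs)
    word_df, tag_df, joint_df = Counter(), Counter(), Counter()
    for text, tags in pairs:
        tokens = set(text.lower().split())  # presence per text, not raw frequency
        word_df.update(tokens)
        tag_df.update(tags)
        joint_df.update((w, t) for w in tokens for t in tags)
    # P(w,t) / (P(w) P(t)) simplifies to joint * n_docs / (word_df * tag_df).
    return {
        (w, t): math.log2(joint * n_docs / (word_df[w] * tag_df[t]))
        for (w, t), joint in joint_df.items()
    }

scores = pmi_scores(pairs)
```

Positive scores mark word-tag pairs that co-occur more often than chance would suggest, and negative scores mark pairs that co-occur less often. PMI tends to overrate very rare pairs, so a minimum-count cutoff on `joint` may be worth adding on real data.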

Still, with what you've specifically requested, a simple counting-correlation may be plenty. 

- Gordon

Dave Waterworth

Oct 14, 2021, 10:44:56 PM
to Gensim
Thanks Gordon!

Yes, you're right, I was probably overthinking it. Correlation between tags and words/word n-grams is probably all I need.

Regards
David