Semi-supervised model: How to add a priori information about the class of some words in the dictionary


Maria Milkova
Aug 6, 2019, 10:05:31 AM
to bigartm-users

Dear BigARTM team!


Thanks for your product! 


I am trying to build a model to mine some economic content, similar to the work of Murat Apishev and colleagues (Mining Ethnic Content Online with Additively Regularized Topic Models). I have about 150,000 documents, and their dictionary contains about 710,000 terms (words and the most common bigrams). I also have a dictionary of economic words that I want to mine (about 3,000 terms: words and bigrams). For this economic dictionary I know which topic each word belongs to; there are 22 topics.


I've built an ARTM model (modifying the dictionary values for economic and non-economic words as Murat suggested, and using the most common regularizers). What I want to do now is add a priori information about each economic word's class. The classes do not intersect and are unbalanced (some classes contain about 10-20 economic terms, while others contain about 600).
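For reference, the value-modification step usually amounts to exporting the BigARTM dictionary to a text file, editing its value column, and loading it back. A minimal sketch of that workflow (the batch folder and file names here are placeholders):

```python
import artm

# Build the dictionary from previously generated batches
# ('my_batches' is a placeholder path).
dictionary = artm.Dictionary()
dictionary.gather(data_path='my_batches')

# Dump the dictionary to an editable text file.
dictionary.save_text('dictionary.txt')

# ... edit the 'value' column in dictionary.txt (e.g. raise values
# for economic terms and lower them for the rest), then reload ...
dictionary.load_text('dictionary.txt')
```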


Is LabelRegularizationPhiRegularizer what could help here? I have seen examples of how it works for document classification tasks, but that does not fit my case: the economic content I want to mine is only a small part of my collection, and I am not interested in the rest of it.


Thanks in advance for any suggestions!


Very sincerely,

Maria

Maria Milkova
Aug 7, 2019, 6:06:44 PM
to bigartm-users

Dear Alex, thanks for your reply!


The custom dictionary part is clear now, thanks for the detailed documentation! But my question was how to add a priori information about each word's class to this ‘white list’ of words. I applied DecorrelatorPhiRegularizer; it works, but not in the way I need. The idea is to tell the model that the terms w_{1,1}, …, w_{1,s_1} always belong to class 1, the terms w_{2,1}, …, w_{2,s_2} always belong to class 2, …, and the terms w_{22,1}, …, w_{22,s_22} always belong to class 22.

It is something like what is described in http://www.machinelearning.ru/wiki/images/2/22/Voron-2013-ptm.pdf, p. 35, Section 5.2; see the paragraph “Topic relevance data” on p. 36.

Thanks for your time!


Very sincerely,

Maria


On Wed, Aug 7, 2019 at 2:53 PM Oleksandr Frei <oleksan...@gmail.com> wrote:
Hi, 
In this case you may use SmoothSparsePhi with a custom dictionary, as described here: http://docs.bigartm.org/en/stable/tutorials/python_userguide/regularizers_and_scores.html (look at the end of the tutorial, starting from “Let’s return to the dictionaries”).
Kind regards
Alex



--
Best regards,
Maria Milkova

Oleksandr Frei
Aug 8, 2019, 3:58:55 AM
to Maria Milkova, bigartm-users
> The idea is to tell the model that the terms w_{1,1}, …, w_{1,s_1} always belong to class 1, the terms w_{2,1}, …, w_{2,s_2} always belong to class 2, …, and the terms w_{22,1}, …, w_{22,s_22} always belong to class 22.
Sorry, that's confusing. Do you perhaps mean topics here, not classes? I.e. "w_{1,1}, …, w_{1,s_1} always belong to topic 1", etc.?
The trick then is to create, in your case, 22 dictionaries and 22 SmoothSparsePhi regularizers, one for every topic. Each regularizer should do "smoothing" of a specific topic, i.e. it adds a very large regularization constant for a given topic, for the subset of words that you define in the dictionary (again, a specific dictionary for each topic). Alternatively, you may do "sparsing" of all topics except the one you are willing to keep.
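A minimal sketch of that per-topic smoothing, assuming the standard BigARTM Python API (the seed-dictionary file names below are placeholders):

```python
import artm

topic_names = ['topic_{:02d}'.format(i) for i in range(22)]
model = artm.ARTM(topic_names=topic_names)

for i, topic in enumerate(topic_names):
    # Each text file lists only the seed words of one topic
    # (placeholder file names).
    seed_dict = artm.Dictionary(name='seed_{:02d}'.format(i))
    seed_dict.load_text('seed_topic_{:02d}.txt'.format(i))

    # A large positive tau smooths the seed words into this one topic.
    # A negative tau applied to all *other* topics would implement the
    # alternative "sparsing" approach instead.
    model.regularizers.add(
        artm.SmoothSparsePhiRegularizer(
            name='smooth_{}'.format(topic),
            tau=1e5,
            dictionary=seed_dict,
            topic_names=[topic]))
```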
Is this what you are looking for?
