Co-occurrence dictionary gathering problems

ellum...@gmail.com

Dec 10, 2017, 2:35:36 PM
to bigartm-users
Hi there.

BigARTM is really cool, glad to use this.

I'm having problems gathering a co-occurrence dictionary. I tried to follow the note in the BigARTM CLI Reference, so I entered this in the terminal:

bigartm -c jobs_corpus.txt -v jobs_cooc_vocab.txt --cooc-window 10 --cooc-min-tf 200 --write-cooc-tf cooc_tf_ --cooc-min-df 200 --write-cooc-df cooc_df_ --write-ppmi-tf ppmi_tf_ --write-ppmi-df ppmi_df_

where jobs_corpus.txt is my collection in VW format and jobs_cooc_vocab.txt is the dictionary that I saved with the following snippet:

import artm
dictionary = artm.Dictionary(data_path='jobs-simple-corpus')  # load the data into the dictionary
dictionary.save_text(dictionary_path='jobs_cooc_vocab.txt')

I suspect that jobs_cooc_vocab.txt is not the file I should really pass as the -v argument, and that this is the cause of my problem. Anyway, after running the command in the terminal I see this output:

items per batch = 2365
Parsing text collection... OK.
32 batches created with total of 15739 items, and 36993 words in the dictionary; NNZ = 665556, average token weight is 1.22822


And nothing changes after that: no new files, no changes to existing files.

FYI, I'm using BigARTM 0.9.0 on Ubuntu.

I hope somebody can help me.

Regards, Nikolay 

Михаил Солоткий

Dec 10, 2017, 3:29:47 PM
to ellum...@gmail.com, bigartm-users
Hi. You are right, that's not the file you should pass as the -v argument. In fact it's almost the right file, but it contains some additional information. Try stripping the modality labels, tf, df and the other per-token information from it. You need a file that contains only tokens, one token per line. :)
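
A rough sketch of that clean-up in Python, in case it helps. It assumes the file written by save_text starts with two header lines and then has one comma-separated row per token, with the token in the first column; please check your own file before relying on this. The helper names extract_tokens and write_vocab are made up for illustration and are not part of BigARTM.

```python
def extract_tokens(lines, header_lines=2):
    """Return the first comma-separated field of every non-header row.

    Assumption: the saved dictionary has `header_lines` header rows,
    followed by rows such as "token, class_id, value, tf, df".
    Tokens that themselves contain commas are not handled here.
    """
    tokens = []
    for line in lines[header_lines:]:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        tokens.append(line.split(",")[0].strip())
    return tokens


def write_vocab(src_path, dst_path, header_lines=2):
    """Read a saved dictionary file and write one token per line."""
    with open(src_path, encoding="utf-8") as src:
        tokens = extract_tokens(src.readlines(), header_lines)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(tokens) + "\n")
    return len(tokens)
```

For example, write_vocab('jobs_cooc_vocab.txt', 'jobs_tokens_only.txt') would produce a file (the second name is hypothetical) that can then be tried as the -v argument.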


-- 
Regards,
Михаил Солоткий

ellum...@gmail.com

Dec 10, 2017, 3:46:18 PM
to bigartm-users
Understood, thank you!

Some additional questions then:

1) What about modalities? Should the dictionary consist of tokens from all modalities?
2) Should I also add the --save-dictionary argument to the command that generates the cooc dictionary? If not, how can I access the generated dictionary?

Regards, Nikolay

ellum...@gmail.com

Dec 10, 2017, 3:48:24 PM
to bigartm-users
I thought a bit more, LOL, and I think that I should create a separate dictionary for each modality. Please correct me if I'm wrong.

Михаил Солоткий

Dec 10, 2017, 4:07:16 PM
to ellum...@gmail.com, bigartm-users
1) The tool that gathers co-occurrence dictionaries doesn't yet consider any modalities other than the basic one (which is labeled in the text by '|@default_class'). Also, if bigartm finishes and doesn't add any files to your current folder, it means that either the tool couldn't find any pair of tokens from the vocab file within a window in the collection, or the tokens from your vocab don't belong to the basic modality.
2) No, you can just run that line in the terminal again with no additional flags. The generated dictionary is in text format and will appear in your current folder after the run.
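
If no files appear, one way to check the second cause from point 1 is to count how many vocab tokens actually occur in the basic modality of the VW file. The sketch below is only an illustration, not part of bigartm. It assumes that the first field of each VW line is the document title, that a field starting with '|' switches the modality, that tokens before any '|' marker belong to the basic modality, and that 'token:count' carries a count. The function names are hypothetical.

```python
def default_modality_tokens(vw_line, default_label="@default_class"):
    """Return the tokens of one VW line that belong to the basic modality."""
    fields = vw_line.split()
    tokens = []
    current = default_label  # assumption: tokens before any '|' marker are basic
    for field in fields[1:]:  # fields[0] is assumed to be the document title
        if field.startswith("|"):
            current = field[1:] or default_label
            continue
        if current == default_label:
            tokens.append(field.split(":")[0])  # drop an optional ':count'
    return tokens


def vocab_coverage(vw_lines, vocab):
    """Return the vocab tokens seen at least once in the basic modality."""
    vocab = set(vocab)
    seen = set()
    for line in vw_lines:
        seen.update(t for t in default_modality_tokens(line) if t in vocab)
    return seen
```

If vocab_coverage returns an empty set for your collection, the vocab and the basic modality don't overlap, which would explain why nothing is written.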
 
ellum...@gmail.com

Dec 10, 2017, 4:11:52 PM
to bigartm-users
Understood.

So I should create VW files for each modality (with a single default modality in each file), a UCI vocab for each modality, and cooc vocabularies for each modality. Then I should use these cooc dictionaries in a score (TopTokens, for example) that is also added for each modality. Right?

Regards, Nikolay

Михаил Солоткий

Dec 10, 2017, 4:24:20 PM
to ellum...@gmail.com, bigartm-users
Hm. Co-occurrences of tokens from modalities other than the basic one are an exotic thing (and I don't know of cases where they could be used). :) That's why it isn't implemented. But you are right: if you want to calculate co-occurrences of tokens from other modalities, you can create several vw files and vocab files for them (actually you can use the same vocab; if some token is in the vocab but isn't in the vw file, it's just ignored). You can pass that cooc dictionary file to the TopTokens score.
 
ellum...@gmail.com

Dec 10, 2017, 4:31:10 PM
to bigartm-users
Cool. Now everything has fallen into place. Thanks a lot!

Regards, Nikolay

ellum...@gmail.com

Dec 12, 2017, 12:52:14 PM
to bigartm-users
Hi there again!

I've prepared separate UCI vocabularies and VW files and run the CLI command again. Now new files were generated: cooc_df_, cooc_tf_, ppmi_df_, ppmi_tf_. But that's all. Where is the cooc vocabulary itself? I need to pass something as the vocab_file_path argument to create an artm dictionary from it, but I don't have one. What am I doing wrong?

Regards, Nikolay

Михаил Солоткий

Dec 12, 2017, 2:08:46 PM
to ellum...@gmail.com, bigartm-users
Hi. You need to create an artm.Dictionary object and gather it using one of your cooc dictionaries. I'd suggest using one of the ppmi_tf_ or ppmi_df_ files, because some experiments showed that coherence with the ppmi function as value(w_i, w_j) correlates with the interpretability of topics. But if you want, you can use any of the 4 created files. You can gather an artm dictionary like this:

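import artm
# Gather a dictionary from the batches together with the co-occurrence values
cooc_dictionary = artm.Dictionary()
cooc_dictionary.gather(data_path='my_collection_batches', vocab_file_path='vocab.txt', cooc_file_path='ppmi_tf_')

Then you can pass this dictionary to TopTokensScore as the dictionary argument.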
 
 
Михаил Солоткий

Dec 12, 2017, 2:11:04 PM
to ellum...@gmail.com, bigartm-users
So vocab_file_path is not the cooc dictionary argument; it is the vocab with the unique tokens of your collection (or a subset or a superset of it).
 
ellum...@gmail.com

Dec 12, 2017, 4:42:56 PM
to bigartm-users
Roger that, thank you! It works OK now.

About using non-default modalities when calculating coherence: I have a relevant case. I have 3 modalities in my model, something like title, full description and keywords. I'd like to calculate coherence for each modality and use the average value as a quality measure for a topic. But it looks like I can't do this directly: I would have to create three models, the first where title is the default modality, the second where description is the default, and the third where keywords is the default. Maybe it makes sense to implement support for several modalities in this feature. To my mind it would be cool, but maybe I'm misunderstanding something, I'm new to ARTM :)

Regards, Nikolay

Михаил Солоткий

Dec 13, 2017, 5:27:02 AM
to ellum...@gmail.com, bigartm-users
Hm. Co-occurrences are calculated for tokens within a window (for example, of size 10). Titles of different documents are located far from each other, so in order to count co-occurrences between different titles you would need to make the window size large enough (it takes longer to compute, but it can still be done). Here is another thing: the whole point of calculating co-occurrences between tokens is to find the pairs of tokens that occur close to each other many times and not by chance (ppmi measures how non-random the co-occurrence of tokens is). If some tokens are located close to each other and co-occur many times not by chance, maybe they should be in the same topic. And if you set the window size to a large value, you also consider tokens that aren't close to each other in the text, which breaks the whole idea.
But you are right - it would be cool to support several modalities, both because it's more convenient for the user and because somebody may come up with a relevant case and then run some experiments with the help of our tool. :)
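
To make the window and ppmi points a bit more concrete, here is a small pure-Python sketch. It is not how bigartm computes things internally, and the exact probability estimates the tool uses may differ; the function names cooc_counts and ppmi are only illustrative. Pairs are counted within one token sequence only, and each unordered pair is treated as two ordered pairs when estimating probabilities.

```python
import math
from collections import Counter


def cooc_counts(documents, window=10):
    """Count unordered token pairs at most `window` positions apart in a document."""
    pair_counts = Counter()
    for tokens in documents:
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1
    return pair_counts


def ppmi(pair_counts):
    """Positive PMI: max(0, log(p(u, v) / (p(u) * p(v)))) estimated from pair counts."""
    total = 2 * sum(pair_counts.values())  # number of ordered pairs
    token_counts = Counter()
    for (u, v), n in pair_counts.items():
        token_counts[u] += n
        token_counts[v] += n
    scores = {}
    for (u, v), n in pair_counts.items():
        p_uv = n / total
        p_u = token_counts[u] / total
        p_v = token_counts[v] / total
        scores[(u, v)] = max(0.0, math.log(p_uv / (p_u * p_v)))
    return scores
```

With a small window only neighbouring tokens are paired, and the larger the window, the more distant (and less meaningful) pairs are counted.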


 
ellum...@gmail.com

Dec 13, 2017, 9:48:38 AM
to bigartm-users
Could you please clarify what you mean when you say "titles for documents are located far enough from each other"?

Regards, Nikolay