Cannot get document names when using transform() or get_theta()

Святослав Игуана

Sep 18, 2017, 12:54:10 PM
to bigartm-users
Hello

I prepare my Vowpal Wabbit file like this: "docname |default token token token ..."
Each document is named prefix-number, for example "train-10" or "test-31"
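
For illustration, here is a minimal sketch of how the document names could be read back from such a file in file order, for example to compare them later with the theta column labels. It assumes the name is the first whitespace-separated token of each line and that a line starting directly with "|" has no name; the helper names vw_title and vw_titles are hypothetical and not part of BigARTM.

```python
from typing import Iterable, List, Optional


def vw_title(line: str) -> Optional[str]:
    """Return the leading document name of one Vowpal Wabbit line, if any."""
    parts = line.split(maxsplit=1)
    # An empty line, or a line that starts directly with "|modality", has no name.
    if not parts or parts[0].startswith("|"):
        return None
    return parts[0]


def vw_titles(lines: Iterable[str]) -> List[str]:
    """Collect document names in file order, skipping lines without one."""
    titles = []
    for line in lines:
        title = vw_title(line)
        if title is not None:
            titles.append(title)
    return titles


sample = [
    "train-10 |default cat dog cat",
    "test-31 |default bird fish",
]
print(vw_titles(sample))  # ['train-10', 'test-31']
```

With a real file, the same function can be fed an open file handle, since iterating over it yields lines.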

I init my model like this:
model_artm = artm.ARTM(num_topics=T, topic_names=["topic_"+str(i) for i in range(T)],
                       num_document_passes=5, reuse_theta=False, class_ids={'default':1.},
                       cache_theta=True, seed=11, show_progress_bars=True)

When I use model_artm.get_theta().to_csv('name.csv'), I get numerical column names: [0, 1, 2, ... N], where N is the number of documents. The most confusing thing is that the numbers are not sorted. It seems that names are sorted within batches and then the batches are shuffled: for example, column names go from 3000 to 5999, then from 0 to 2999, then from 6000 to 8999. So I'm not sure that the columns are in the same order as the documents in my Vowpal Wabbit file. The same thing happens when I use transform() on a saved model. When I try to use transform() right after fitting with theta_matrix_type='cache', I get None despite cache_theta=True in init.

Best Regards,
Sviatoslav

Oleksandr Frei

Sep 19, 2017, 4:00:17 AM
to Святослав Игуана, bigartm-users
Hi Sviatoslav,

Try adding
         theta_columns_naming='title' 
to the model initialization (artm.ARTM). This tells BigARTM to use the docname from your VW file as document labels instead of the dummy IDs that you get now.
Does this fix your issue?
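
As a rough sketch, the initialization from the question with this one parameter added could look like the following. The helper names build_artm_kwargs and make_model are hypothetical; the keyword arguments are the ones from the original post plus theta_columns_naming. The artm import is kept inside make_model so the kwargs helper can be inspected without BigARTM installed.

```python
def build_artm_kwargs(num_topics):
    """Keyword arguments for artm.ARTM, mirroring the settings in the question."""
    return {
        "num_topics": num_topics,
        "topic_names": ["topic_" + str(i) for i in range(num_topics)],
        "num_document_passes": 5,
        "reuse_theta": False,
        "class_ids": {"default": 1.0},
        "cache_theta": True,
        # The suggested addition: label theta columns with document titles
        # instead of numeric ids.
        "theta_columns_naming": "title",
        "seed": 11,
        "show_progress_bars": True,
    }


def make_model(num_topics):
    """Create the model; requires the BigARTM Python package."""
    import artm  # imported lazily so build_artm_kwargs works without it
    return artm.ARTM(**build_artm_kwargs(num_topics))
```

Calling make_model(T) should then give a model whose theta columns carry names such as "train-10", assuming the batches were built from a VW file that contains those names.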

> transform right after fitting with theta_matrix_type='cache', I get None despite cache_theta=True in init
This is expected. Once you set theta_matrix_type='cache' in the transform() call, BigARTM expects you to retrieve the matrix from the cache using the get_theta() method. Users may want to retrieve theta in two steps (transform() and then get_theta()) because get_theta() has somewhat more options, for example retrieving specific topics. Otherwise, if you need the theta matrix right away, just omit the theta_matrix_type parameter in the transform() call and it will return the matrix.
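
The two ways of getting theta described here can be sketched roughly as follows. The function names theta_via_cache and theta_direct are hypothetical wrappers; model is assumed to be an artm.ARTM instance created with cache_theta=True, and batch_vectorizer an artm.BatchVectorizer over the documents to transform.

```python
def theta_via_cache(model, batch_vectorizer, topic_names=None):
    """Two-step retrieval: fill the cache with transform(), then read it with get_theta()."""
    # With theta_matrix_type='cache', transform() stores the result and returns None.
    model.transform(batch_vectorizer=batch_vectorizer, theta_matrix_type="cache")
    # get_theta() can restrict the result to specific topics.
    return model.get_theta(topic_names=topic_names)


def theta_direct(model, batch_vectorizer):
    """One-step retrieval: with theta_matrix_type left at its default, transform() returns the matrix."""
    return model.transform(batch_vectorizer=batch_vectorizer)
```

For example, theta_via_cache(model_artm, batch_vectorizer, topic_names=["topic_0", "topic_1"]) would return only those two topic rows, while theta_direct(model_artm, batch_vectorizer) returns the full matrix in one call.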

Kind regards,
Alex 

--
You received this message because you are subscribed to the Google Groups "bigartm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigartm-users+unsubscribe@googlegroups.com.
To post to this group, send email to bigart...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigartm-users/0e5c8aa8-4871-4672-a5d3-3e25ac1ff3ec%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Святослав Игуана

Sep 19, 2017, 4:53:38 AM
to bigartm-users
Thank you, Alex!
I did not read the docs carefully. Now everything works as expected.

Kind regards,
Sviatoslav.

Oleksandr Frei

Sep 19, 2017, 4:59:03 AM
to Святослав Игуана, bigartm-users
No problem! Just ask if anything isn't clear enough - quite often BigARTM's documentation is vague and does not give good examples.
