How to compute/extract n_d for huge collection? (Input format - VW)


Rose Aysina

Jul 23, 2019, 5:29:05 AM
to bigartm-users
Hello! 

I need to compute the p(d|t) matrix:
p(d|t) = p(t|d) * p(d) / p(t) = theta_td * n_d / sum_d (theta_td * n_d)

I hope this formula is correct.
The problem is that I need to somehow extract the length of each document, n_d.

How can I do that if the input is a VW file and it's impossible to compute the n_dw matrix in memory?
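[Editor's note: for reference, once n_d is available the whole computation is a broadcast and a row-normalization. A numpy sketch with made-up numbers; theta is topics x documents, as BigARTM returns it:]

```python
import numpy as np

theta = np.array([[0.5, 0.2],    # p(t|d), topics x documents
                  [0.5, 0.8]])
n_d = np.array([10.0, 30.0])     # document lengths

n_td = theta * n_d                                    # theta_td * n_d
p_d_given_t = n_td / n_td.sum(axis=1, keepdims=True)  # normalize over documents
```

Each row of `p_d_given_t` is the distribution p(d|t) for one topic and sums to 1.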

Thanks! 

Anna Potapenko

Jul 23, 2019, 11:54:56 AM
to Rose Aysina, bigartm-users
Hi Rose,

Couldn't you just go through each document independently and sum the word counts from the VW file? There's no need to compute the whole n_dw matrix at once to get the n_d counts.
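[Editor's note: a minimal streaming sketch of this, assuming BigARTM's flavour of VW input, where a line is `doc_id token[:count] ... |@modality token[:count] ...`, a bare token counts as 1, and tokens before any `|` marker belong to `@default_class`:]

```python
from collections import OrderedDict

def document_lengths(vw_path, class_ids=None):
    """Stream a VW file once and return {doc_id: n_d}.

    If class_ids is given (e.g. {'@default_class', '@labels'}), only
    tokens of those modalities contribute to n_d.
    """
    n_d = OrderedDict()
    with open(vw_path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            doc_id = parts[0]
            total = 0.0
            current_class = '@default_class'
            for token in parts[1:]:
                if token.startswith('|'):      # modality switch, e.g. |@labels
                    current_class = token[1:]
                    continue
                if class_ids is not None and current_class not in class_ids:
                    continue
                word, sep, count = token.rpartition(':')
                total += float(count) if sep else 1.0
            n_d[doc_id] = total
    return n_d
```

This never holds more than one line in memory, so collection size is not an issue.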

Cheers,
Anna

--
You received this message because you are subscribed to the Google Groups "bigartm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigartm-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigartm-users/29ee8e66-e88e-4d51-8f23-a53d811a602b%40googlegroups.com.

Rose Aysina

Jul 24, 2019, 3:13:35 AM
to bigartm-users
Hi Anna :) 

I hoped there was another way than sequentially going through the VW file and parsing it, because I have several VW files for the same collection.
As far as I can see from the formulas, BigARTM must compute n_d internally. I hoped that maybe there is some hidden API that would allow extracting it (like the n_t values for p(t|w) inside the TopicMass score).

Am I right that n_d over several class_ids is the sum of the values in the VW file across all class_ids (modalities)? Not the number of class_id tokens, but the exact value of the token that I put into the VW file (after the ':')?

Rose.


Anna Potapenko

Aug 7, 2019, 4:21:52 AM
to Rose Aysina, bigartm-users
Hi Rose,

Sorry for the delay!

Yes, n_d is a sum of all the counts, not just indicators.
I'm not sure, though, whether there is an existing API to retrieve it - Murat might know better :)

Cheers,
Anna


Мурат Апишев

Aug 7, 2019, 6:47:43 AM
to Anna Potapenko, Rose Aysina, bigartm-users
Hi!
 
Yes, the n_d values are counters, but they are computed on the fly, so they are not accessible through the API :(


---
Regards, Murat Apishev.




Oleksandr Frei

Aug 7, 2019, 7:21:28 AM
to Мурат Апишев, Anna Potapenko, Rose Aysina, bigartm-users
Hi,
Here is one hack / workaround that would require a minimal (and quite logical) change to the library.

Let's recall that currently the user can't extract the raw (non-normalized) n_td values. That's a problem. We have nice mechanisms, "cache_theta" and "theta_name", but those give access only to the normalized theta_td probability distributions, i.e. p(t|d). I suggest we implement a new feature (or hack) that lets the user access raw n_td before it gets normalized. The simplest way would be to hack these two places:
https://github.com/bigartm/bigartm/blob/master/src/artm/core/processor_helpers.cc#L381 CreateThetaCacheEntry(new_cache_entry_ptr, theta_matrix, batch, p_wt, args);
https://github.com/bigartm/bigartm/blob/master/src/artm/core/processor_helpers.cc#L506 CreateThetaCacheEntry(new_cache_entry_ptr, theta_matrix, batch, p_wt, args);

We may add a new boolean flag, ProcessBatchesArgs.cache_ntd_instead_of_theta (is that an OK name?), with a default value of "false". When the user sets it to true, processor_helpers.cc starts saving non-normalized theta cache entries.
In this case we may want to disallow "reuse_theta" when "cache_ntd_instead_of_theta" is set to true, so that reuse_theta doesn't pick up non-normalized values.

Once this feature is in place, you can easily calculate n_d as follows:
(1) create a dummy model with just one topic (here you may also want to specify a subset of class_ids to take into account)
(2) set num_document_passes = 1
(3) set cache_ntd_instead_of_theta = True
(4) call model.transform() or model.master.process_batches() - whatever your favorite way of getting the theta matrix for a given batch is
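[Editor's note: a sketch of what these steps might look like in the Python API. This is hypothetical throughout - `cache_ntd_instead_of_theta` is the proposed flag, not a real one, and `'my_batches'` and the modality name are placeholders:]

```python
import artm

bv = artm.BatchVectorizer(data_path='my_batches', data_format='batches')

model = artm.ARTM(num_topics=1,                      # (1) dummy one-topic model
                  class_ids={'@default_class': 1.0}, #     restrict modalities here
                  num_document_passes=1)             # (2)
model.initialize(dictionary=bv.dictionary)

# (3) the proposed flag -- does NOT exist in the current API:
# cache_ntd_instead_of_theta = True

# (4) with the flag active, transform() would return raw n_td instead of p(t|d)
n_td = model.transform(batch_vectorizer=bv)
n_d = n_td.sum(axis=0)  # sum over topics; trivial with a single topic
```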

Does that make sense?

Kind regards,
Oleksandr




Rose Aysina

Aug 12, 2019, 5:05:35 AM
to bigartm-users
Hi all! 

Thank you very much for the replies.

Oleksandr, if this capability makes it into the API, that would be great!
Calculating n_d is a very frequent operation, and as I understand it, everybody creates their own hack to solve this.
My problem is that I create several input datasets based on the collection (with various values inside the n_dw array, since it has mixed types of class_ids),
and performing a sequential pass through the VW files (or something like that) is very painful from the perspective of repeated experiments.

Am I right that with this flag in the API, step (5) would be to just sum n_td over topics (n_d = sum_t n_td)?
Should I create an issue on GitHub?
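[Editor's note: that final summation is indeed a one-liner once the raw matrix is available. A numpy sketch with made-up n_td values, topics as rows and documents as columns, matching BigARTM's theta layout:]

```python
import numpy as np

# Hypothetical raw (non-normalized) n_td: 2 topics x 3 documents
n_td = np.array([[2.0, 0.5, 3.0],
                 [1.0, 1.5, 1.0]])

n_d = n_td.sum(axis=0)  # n_d = sum_t n_td, one length per document
```

With Oleksandr's one-topic dummy model, n_td has a single row, so the sum is the row itself.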

Thanks.
Rose. 

