How to compute/extract n_d for huge collection? (Input format - VW)


Rose Aysina

Jul 23, 2019, 5:29:05 AM
to bigartm-users
Hello! 

I need to compute the p(d|t) matrix:
p(d|t) = p(t|d) * p(d) / p(t) = theta_td * n_d / sum_d (theta_td * n_d)

I hope this formula is correct.
The problem is that I need to somehow extract the length of each document, n_d.

How can I do that if the input is a VW file and it's impossible to compute the n_dw matrix in memory?
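[Editor's note: for reference, once n_d is available the whole computation is a broadcast and a row-normalization. A numpy sketch with made-up numbers; theta is topics x documents, as BigARTM returns it:]

```python
import numpy as np

theta = np.array([[0.5, 0.2],    # p(t|d), topics x documents
                  [0.5, 0.8]])
n_d = np.array([10.0, 30.0])     # document lengths

n_td = theta * n_d                                    # theta_td * n_d
p_d_given_t = n_td / n_td.sum(axis=1, keepdims=True)  # normalize over documents
```

Each row of `p_d_given_t` is the distribution p(d|t) for one topic and sums to 1.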

Thanks! 

Anna Potapenko

Jul 23, 2019, 11:54:56 AM
to Rose Aysina, bigartm-users
Hi Rose,

Couldn't you just go through each document independently and sum the word counts from the VW file? There's no need to compute the whole n_dw matrix at once to get the n_d counts.
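[Editor's note: a minimal streaming sketch of this, assuming BigARTM's flavour of VW input, where a line is `doc_id token[:count] ... |@modality token[:count] ...`, a bare token counts as 1, and tokens before any `|` marker belong to `@default_class`:]

```python
from collections import OrderedDict

def document_lengths(vw_path, class_ids=None):
    """Stream a VW file once and return {doc_id: n_d}.

    If class_ids is given (e.g. {'@default_class', '@labels'}), only
    tokens of those modalities contribute to n_d.
    """
    n_d = OrderedDict()
    with open(vw_path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            doc_id = parts[0]
            total = 0.0
            current_class = '@default_class'
            for token in parts[1:]:
                if token.startswith('|'):      # modality switch, e.g. |@labels
                    current_class = token[1:]
                    continue
                if class_ids is not None and current_class not in class_ids:
                    continue
                word, sep, count = token.rpartition(':')
                total += float(count) if sep else 1.0
            n_d[doc_id] = total
    return n_d
```

This never holds more than one line in memory, so collection size is not an issue.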

Cheers,
Anna

--
You received this message because you are subscribed to the Google Groups "bigartm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigartm-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigartm-users/29ee8e66-e88e-4d51-8f23-a53d811a602b%40googlegroups.com.

Rose Aysina

Jul 24, 2019, 3:13:35 AM
to bigartm-users
Hi Anna :) 

I hoped there was another way than sequentially going through the VW file and parsing it, because I have several VW files for the same collection.
As far as I can see from the formulas, BigARTM must compute n_d internally. I hoped that maybe there is some hidden API that would allow extracting it (like the n_t values for p(t|w) inside the TopicMass score).

Am I right that n_d over several class_ids is the sum of the values in the VW file across all class_ids (modalities)? Not the number of class_id tokens, but the exact value of the token that I put into the VW file (after the ':')?

Rose.


Anna Potapenko

Aug 7, 2019, 4:21:52 AM
to Rose Aysina, bigartm-users
Hi Rose,

Sorry for the delay!

Yes, n_d is a sum of all the counts, not just indicators.
I'm not sure, though, whether there is an existing API to retrieve it - Murat might know better :)

Cheers,
Anna


Мурат Апишев

Aug 7, 2019, 6:47:43 AM
to Anna Potapenko, Rose Aysina, bigartm-users
Hi!
 
Yes, the n_d values are counters, but they are computed on the fly, so they are not accessible through the API :(


---
Regards, Murat Apishev.




Oleksandr Frei

Aug 7, 2019, 7:21:28 AM
to Мурат Апишев, Anna Potapenko, Rose Aysina, bigartm-users
Hi,
Here is one hack / workaround that would require a minimal (and quite logical) change to the library.

Let's recall that currently the user can't extract the raw (non-normalized) n_td values. That's a problem. We have nice mechanisms, "cache_theta" and "theta_name", but those give access only to the normalized theta_td probability distributions, i.e. p(t|d). I suggest we implement a new feature (or hack) that lets the user access raw n_td before it gets normalized. The simplest way would be to hack these two places:
https://github.com/bigartm/bigartm/blob/master/src/artm/core/processor_helpers.cc#L381 CreateThetaCacheEntry(new_cache_entry_ptr, theta_matrix, batch, p_wt, args);
https://github.com/bigartm/bigartm/blob/master/src/artm/core/processor_helpers.cc#L506 CreateThetaCacheEntry(new_cache_entry_ptr, theta_matrix, batch, p_wt, args);

We may add a new boolean flag, ProcessBatchesArgs.cache_ntd_instead_of_theta (is that an OK name?), with a default value of "false". When the user sets it to true, processor_helpers.cc starts saving non-normalized theta cache entries.
In this case we may want to disallow "reuse_theta" when "cache_ntd_instead_of_theta" is set to true, so that reuse_theta doesn't pick up non-normalized values.

Once this feature is in place, you can easily calculate n_d as follows:
(1) create a dummy model with just one topic (here you may also want to specify a subset of class_ids to take into account)
(2) set num_document_passes = 1
(3) set cache_ntd_instead_of_theta = True
(4) call model.transform() or model.master.process_batches() - whatever your favorite way of getting the theta matrix for a given batch is
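[Editor's note: a sketch of what these steps might look like in the Python API. This is hypothetical throughout - `cache_ntd_instead_of_theta` is the proposed flag, not a real one, and `'my_batches'` and the modality name are placeholders:]

```python
import artm

bv = artm.BatchVectorizer(data_path='my_batches', data_format='batches')

model = artm.ARTM(num_topics=1,                      # (1) dummy one-topic model
                  class_ids={'@default_class': 1.0}, #     restrict modalities here
                  num_document_passes=1)             # (2)
model.initialize(dictionary=bv.dictionary)

# (3) the proposed flag -- does NOT exist in the current API:
# cache_ntd_instead_of_theta = True

# (4) with the flag active, transform() would return raw n_td instead of p(t|d)
n_td = model.transform(batch_vectorizer=bv)
n_d = n_td.sum(axis=0)  # sum over topics; trivial with a single topic
```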

Does that make sense?

Kind regards,
Oleksandr




Rose Aysina

Aug 12, 2019, 5:05:35 AM
to bigartm-users
Hi all! 

Thank you very much for the replies.

Oleksandr, if this capability makes it into the API, that would be great!
Calculating n_d is a very frequent operation, and as I understand it, everybody creates their own hack to solve this.
My problem is that I create several input datasets based on the collection (with various values inside the n_dw array, since it has mixed types of class_ids),
and performing a sequential pass through the VW files (or something like that) is very painful from the perspective of repeated experiments.

Am I right that with this flag in the API, step (5) would be to just sum n_td over topics (n_d = sum_t n_td)?
Should I create an issue on GitHub?
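[Editor's note: that final summation is indeed a one-liner once the raw matrix is available. A numpy sketch with made-up n_td values, topics as rows and documents as columns, matching BigARTM's theta layout:]

```python
import numpy as np

# Hypothetical raw (non-normalized) n_td: 2 topics x 3 documents
n_td = np.array([[2.0, 0.5, 3.0],
                 [1.0, 1.5, 1.0]])

n_d = n_td.sum(axis=0)  # n_d = sum_t n_td, one length per document
```

With Oleksandr's one-topic dummy model, n_td has a single row, so the sum is the row itself.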

Thanks.
Rose. 

