--
You received this message because you are subscribed to the Google Groups "bigartm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigartm-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigartm-users/29ee8e66-e88e-4d51-8f23-a53d811a602b%40googlegroups.com.
Hi Rose,

Couldn't you just go through each document independently and sum the word counts from the VW file? There is no need to compute the whole n_dw matrix at once to get the n_d counts.

Cheers,
Anna
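For what it's worth, here is a minimal Python sketch of that idea. It assumes BigARTM's flavor of the VW format (the first token on each line is the document title, a token starting with "|" switches the modality, and a token may carry an explicit count as token:count, defaulting to 1); adjust as needed for your data:

```python
def document_lengths(vw_path):
    """Stream a VW file and sum token counts per document, i.e. compute n_d."""
    n_d = {}
    with open(vw_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            doc_id, tokens = parts[0], parts[1:]
            total = 0.0
            for tok in tokens:
                if tok.startswith("|"):  # modality marker, e.g. |author
                    continue
                # a token may carry an explicit count as "token:count"
                total += float(tok.rsplit(":", 1)[1]) if ":" in tok else 1.0
            n_d[doc_id] = total
    return n_d
```

This streams the file line by line, so memory usage stays constant regardless of collection size.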
On Tue, Jul 23, 2019, 10:29 AM Rose Aysina <rose....@gmail.com> wrote:
Hello!

I need to compute the p(d|t) matrix:

p(d|t) = p(t|d) * p(d) / p(t) = theta_td * n_d / sum_d (theta_td * n_d)

I hope this formula is correct. The problem is that I need to somehow extract the length of each document, n_d. How can I do that when the input format is a VW file and it is impossible to compute the n_dw matrix in memory?

Thanks!
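To make the formula concrete, here is a small NumPy illustration with hypothetical toy values for theta (p(t|d), topics in rows, documents in columns) and the document lengths n_d:

```python
import numpy as np

theta = np.array([[0.6, 0.2],
                  [0.4, 0.8]])   # p(t|d), shape (T, D); columns sum to 1
n_d = np.array([10.0, 30.0])     # document lengths

joint = theta * n_d              # proportional to p(t, d): theta_td * n_d
# normalize over documents (axis=1) so that each row is p(d|t)
p_d_given_t = joint / joint.sum(axis=1, keepdims=True)
```

Each row of p_d_given_t then sums to 1, as a distribution over documents for a fixed topic should.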
Hi,

Here is one hack / workaround that would require a minimal (and quite logical) change to the library.

Recall that currently the user can't extract the raw (non-normalized) n_td values. That's a problem. We have the nice mechanisms "cache_theta" and "theta_name", but those give access only to the normalized theta_td probability distributions, i.e. p(t|d). I suggest we implement a new feature (or hack) that lets the user access the raw n_td values before they get normalized. The simplest way would be to hack these two places:

https://github.com/bigartm/bigartm/blob/master/src/artm/core/processor_helpers.cc#L381
CreateThetaCacheEntry(new_cache_entry_ptr, theta_matrix, batch, p_wt, args);

https://github.com/bigartm/bigartm/blob/master/src/artm/core/processor_helpers.cc#L506
CreateThetaCacheEntry(new_cache_entry_ptr, theta_matrix, batch, p_wt, args);

We could add a new boolean flag, ProcessBatchesArgs.cache_ntd_instead_of_theta (is that an OK name?), with a default value of "false". When the user sets it to true, processor_helpers.cc starts saving non-normalized theta cache entries. In that case we may also want to disallow "reuse_theta" when "cache_ntd_instead_of_theta" is set to true, so that reuse_theta doesn't pick up non-normalized values.

Once this feature is in place, you can easily calculate n_d as follows:
(1) create a dummy model with just one topic (here you may also want to specify the subset of class_ids to take into account);
(2) set num_document_passes = 1;
(3) set cache_ntd_instead_of_theta = True;
(4) call model.transform() or model.master.process_batches(), whatever is your favorite way of getting the theta matrix for a given batch.

Does it make sense?

Kind regards,
Oleksandr
On Wed, Aug 7, 2019 at 12:47 PM Мурат Апишев <grea...@yandex.ru> wrote:
Hi!

Yes, n_d are counters, but they are computed on the fly, so they are not accessible through the API :(

--
Regards, Murat Apishev.
07.08.2019, 11:21, "Anna Potapenko" <anna.a....@gmail.com>: