RAM consumption pt.2


a.shironosov

May 7, 2015, 4:24:39 AM
to bigart...@googlegroups.com
Hello.
Tonight I ran into a new challenge :) I used BigARTM on a collection of 10,000 documents with a huge number of terms (more than 2 million) and only 400 topics, and it consumed a huge amount of memory (30 GB RAM + 50 GB swap) after all iterations (near SynchronizeModel). I'm retrieving the ThetaMatrix batch by batch, and the number of topics and documents is quite small, so I guess the problem is somewhere in getting tokens/top tokens. Any ideas?

Sincerely, Alex.

Oleksandr Frei

May 7, 2015, 3:51:45 PM
to a.shironosov, bigart...@googlegroups.com
Hi Alex,

First, do you really need 2m tokens? :) Note that we recently added an easy way to filter out too rare and too frequent tokens. Check it out in this example. The trick is that you don't need to regenerate your batches: just point InitializeTopicModel to an existing folder with batches, and tweak the min_items and max_percentage thresholds.

But back to your question: you are right, and BigARTM should handle 2m tokens. For your case (i.e. |W| >> |D|) I expect BigARTM to use around |T|*|W|*sizeof(float)*K bytes, where K=3 without regularizers and K=5 with regularizers. For your parameters this gives 400*2e6*4*5 = 1.6e10 bytes, or about 15 GiB. But in your case it grows further, so there must be some issue causing this... Are you using the latest BigARTM? I made some big changes to improve memory usage around 3 months ago, and some final tweaks recently.
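To make the arithmetic above easy to repeat for other collection sizes, here is a minimal sketch of the estimate. The function names are made up for illustration and are not part of BigARTM; the formula is just the rule of thumb from this message, so treat the result as a rough lower bound rather than a measurement.

```cpp
#include <cstdint>

// Rough size in bytes of a dense topic model: |T| * |W| * sizeof(float) * K.
// `copies` is K from the message above: 3 without regularizers, 5 with them.
// Hypothetical helper, not a BigARTM API.
std::uint64_t EstimateModelBytes(std::uint64_t num_topics,
                                 std::uint64_t num_tokens,
                                 std::uint64_t copies) {
  return num_topics * num_tokens * sizeof(float) * copies;
}

// Converts a byte count to GiB (2^30 bytes) for easier reading.
double BytesToGiB(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

For 400 topics, 2,000,000 tokens and K=5, EstimateModelBytes returns 16,000,000,000 bytes, which BytesToGiB reports as roughly 14.9 GiB. With K=3 the same collection needs 9,600,000,000 bytes.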

Could you also send me your BigARTM log? I would like to check whether it contains "Merger queue is full, waiting..." or "including X ms in waiting for merger queue" messages. If you have a lot of those, memory usage can also grow.

Thanks,
Alex

P.S. In the long term the plan for BigARTM is to represent p(w|t) distributions in a sparse way, so that memory usage does not grow as O(|W|*|T|).

--
You received this message because you are subscribed to the Google Groups "bigartm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigartm-user...@googlegroups.com.
To post to this group, send email to bigart...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigartm-users/4110a59f-2500-4fe9-9252-0ef6297225ca%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

a.shironosov

May 8, 2015, 1:49:12 AM
to bigart...@googlegroups.com, a.shir...@gmail.com
Thank you for the response.
Unfortunately, I've already deleted all the corresponding logs, but I'll repeat the experiment ASAP. I will also try the newest version of BigARTM, because my version is about a couple of months old.

Sincerely, Alex.

a.shironosov

May 13, 2015, 5:08:43 AM
to bigart...@googlegroups.com
Hello!

It seems that the newer version of the library eliminated the excessive memory consumption, but I've run into something new. I got an 'Exception  : InvalidOperation :  Length mismatch in fields Batch.class_id and Batch.token, batch.id = 3067fe9b-ff7f-0000-7905-7355f17f0000' while reading the ThetaMatrix batch by batch. The thing is, I don't have a batch with that name in my batch folder.

Code for reading ThetaMatrix:
fs::recursive_directory_iterator it(options.batch_folder);
fs::recursive_directory_iterator endit;
int csvn = 0;
while (it != endit)
{
    // Load only regular files that have the .batch extension.
    if (fs::is_regular_file(*it) && it->path().extension() == ".batch")
    {
        std::shared_ptr<Batch> batch = artm::LoadBatch(it->path().string());
        artm::GetThetaMatrixArgs args;
        args.set_model_name(model_config.name());
        args.mutable_batch()->CopyFrom(*batch);
        std::shared_ptr< ::artm::ThetaMatrix> theta = master_component->GetThetaMatrix(args);

        // One CSV file per batch.
        csvn++;
        std::string csvname = "/home/snapper/topic_modelling2/topics/theta_" + std::to_string(csvn) + ".csv";
        std::cout << "writing to:" << csvname << std::endl;
        std::ofstream of(csvname.c_str());
        artm::ThetaMatrix* theta_matrix = theta.get();
        std::cout << "items in theta:" << theta_matrix->item_id_size() << std::endl;
        for (int i = 0; i < theta_matrix->item_id_size(); i++)
        {
            // Copy the topic weights of item i.
            std::vector<double> probs;
            for (int j = 0; j < theta_matrix->item_weights(i).value_size(); j++)
            {
                probs.push_back(theta_matrix->item_weights(i).value(j));
            }
            std::vector<size_t> sorted = sort_indexes(probs);

            // Write at most 10 topics, fewer if the model has fewer (needs <algorithm>).
            const int num_themes = std::min<int>(10, static_cast<int>(sorted.size()));

            // First row: item id followed by the topic indices.
            of << theta_matrix->item_id(i) << ",";
            for (int j = 0; j < num_themes; j++)
            {
                if (j > 0) of << ",";
                of << sorted[j];
            }
            // Second row: item id followed by the matching probabilities.
            of << std::endl << theta_matrix->item_id(i) << ",";
            for (int j = 0; j < num_themes; j++)
            {
                if (j > 0) of << ",";
                of << probs[sorted[j]];
            }
            of << std::endl;
        }
        of.close();
    }
    ++it;
}
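
The snippet above calls sort_indexes, which is not shown in the post. A minimal sketch of such a helper follows, assuming it returns the indices of the input ordered by descending value (so that the first entries are the most probable topics). The real helper in srcmain.cc may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the sort_indexes helper used above.
// Returns the indices of `values`, ordered so the largest value comes first.
std::vector<std::size_t> sort_indexes(const std::vector<double>& values) {
  std::vector<std::size_t> idx(values.size());
  std::iota(idx.begin(), idx.end(), 0);  // fill with 0, 1, 2, ...
  // stable_sort keeps the original order among equal values.
  std::stable_sort(idx.begin(), idx.end(),
                   [&values](std::size_t a, std::size_t b) {
                     return values[a] > values[b];
                   });
  return idx;
}
```

With this definition, probs[sorted[0]] is the largest weight for the item, which matches how the CSV rows are written.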

Log file is included.

Sincerely, Alex.
..dominik.digipro.ru.snapper.log.INFO.20150513-103239.5013
srcmain.cc

Oleksandr Frei

May 25, 2015, 5:20:16 AM
to a.shironosov, bigart...@googlegroups.com
Hi,

Sorry, I didn't realize I had replied just to Alex instead of using "Reply to All". This was solved some time ago :)

================================================
Hi Alex,

I agree that "3067fe9b-ff7f-0000-7905-7355f17f0000" is confusing. The problem is that internally BigARTM identifies all batches by GUIDs, so if your filename can't be parsed as a GUID, we assign a new one and the information about the filename is lost (so I can't log it at the moment). But I'll fix that.

(1) Does it fail during LoadBatch() or during GetThetaMatrix()? It might be that LoadBatch() rejects the batch because it has inconsistent fields (batch.class_id and batch.token). Could you find out which batch it is? Then load it manually:

#include <fstream>

Batch batch;
std::ifstream fin(full_filename.c_str(), std::ifstream::binary);
batch.ParseFromIstream(&fin);  // returns false if parsing fails
fin.close();

Then check batch.class_id_size() and batch.token_size().
It is OK if class_id_size() == 0; in that case BigARTM interprets all tokens as tokens of the default modality ("@default_class").
Otherwise, if class_id_size() != 0, class_id_size() is expected to equal batch.token_size().
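
The rule above can be written as a one-line check. This is only a sketch: the function name is made up for illustration, and it takes the two sizes as plain integers so that it does not depend on BigARTM headers.

```cpp
// Returns true when the class_id and token fields of a batch are consistent:
// either no class ids at all (every token falls into "@default_class"),
// or exactly one class id per token.
// Hypothetical helper, not a BigARTM API.
bool ClassIdsConsistent(int class_id_size, int token_size) {
  return class_id_size == 0 || class_id_size == token_size;
}
```

After loading a batch manually you could call ClassIdsConsistent(batch.class_id_size(), batch.token_size()) on each file to find the one that triggers the exception.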

(2) However, if LoadBatch() succeeds and GetThetaMatrix() fails, then it's a clear bug, and I'll look into it.

Thanks,
Alex

P.S. Thanks for the code that exports to CSV! I'll include it in cpp_client, because exporting Phi and Theta matrices in plain text seems to be a very common task.
================================================
