transform method error for new documents


lospeq

unread,
Feb 14, 2018, 3:07:44 AM2/14/18
to bigartm-users
I set separate folders for the training and test data batches. Training data looks like this (MMRO collection; ≤300 docs per class):
2007-MMRO13/dedovets |@labels_class class1 |@default_class [tokens...]
2007-MMRO13/protasov_AS |@labels_class class1 |@default_class [tokens...]
...
2011-MMRO15/kelmanov_4 |@labels_class class4 |@default_class [tokens...]
2011-MMRO15/ezhova |@labels_class class4 |@default_class [tokens...]

I made a simple artm model and now want to get transform predictions for a new document (which is actually just a document from the training collection). So I make a new batch just for this document and run transform, but always get the same error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "C:\Program Files\Python36\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\BigARTM\Python\artm\master_component.py", line 908, in transform
    theta_matrix_info = self._lib.ArtmRequestTransformMasterModelExternal(self.master_id, args)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 161, in artm_api_call
    self._check_error(result)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 97, in _check_error
    raise exception_class(error_message)
artm.wrapper.exceptions.InvalidOperationException: Transform: no tokens in effect --- either tokens not present in the model, or tokens were ignored due to class_id

Traceback (most recent call last):
  File "<pyshell#280>", line 1, in <module>
    fail = model.transform(batch_vectorizer=tbv)
  File "C:\BigARTM\Python\artm\artm_model.py", line 939, in transform
    batch_vectorizer.num_batches)
  File "C:\BigARTM\Python\artm\artm_model.py", line 482, in _wait_for_batches_processed
    async_result.wait(1)
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 635, in wait
    self._event.wait(timeout)

I've tried a number of variations of both training and test VW formatting – changing the size of the collection, using different tokens and altering their number for @labels_class modality, leaving it blank, or getting rid of modalities altogether; I still get the error.
The only time I don't get it is when I explicitly state the class of the test document as well (it has to be one of the classes present in the training collection, or the error appears again) and make a batch out of it (which, according to the docs, I guess I'm not supposed to do?), so the test VW would look like:
2007-MMRO13/dedovets |@labels_class class1 |@default_class [tokens...] 
But then in both Θ and the class predictions (with the predict_class_id='@labels_class' parameter) I either get a matrix full of zeros or numbers far from what that document originally had in the model's Θ.
I also tried the code snippet from one of the previous threads; the traceback got shorter, but it's still the same error:

Traceback (most recent call last):
  File "C:\Users\so\Desktop\gomi no gomi\nande\sukuwarerarerutoomouka\nandatteshitemomuri.py", line 74, in <module>
    theta = model._lib.ArtmRequestTransformMasterModel(model.master.master_id, args)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 161, in artm_api_call
    self._check_error(result)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 97, in _check_error
    raise exception_class(error_message)
artm.wrapper.exceptions.InvalidOperationException: Transform: no tokens in effect --- either tokens not present in the model, or tokens were ignored due to class_id

I can't seem to pinpoint what I'm doing wrong. Any help would be much appreciated.

Oleksandr Frei

unread,
Feb 14, 2018, 3:59:07 AM2/14/18
to lospeq, bigartm-users
Hi,
This is strange... Could you share with me your data files and the Python script to reproduce this weird behavior?
Kind regards,
Alex

--
You received this message because you are subscribed to the Google Groups "bigartm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigartm-users+unsubscribe@googlegroups.com.
To post to this group, send email to bigart...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigartm-users/e8dfd653-cfdd-41b4-98c0-73962c8f07f5%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

lospeq

unread,
Feb 14, 2018, 4:32:13 AM2/14/18
to bigartm-users
Sure, if google drive is okay:
https://drive.google.com/open?id=17bcBmGEyRXEAn1VOWXx4uiWilp5HkJ1Q


On Wednesday, February 14, 2018 at 13:59:07 UTC+5, Oleksandr Frei wrote:

Oleksandr Frei

unread,
Feb 14, 2018, 6:44:57 AM2/14/18
to lospeq, bigartm-users
Thanks! I'll take a look.

--

lospeq

unread,
Feb 14, 2018, 1:42:29 PM2/14/18
to bigartm-users
I tried to run it on a different PC, and transform did a great job at predicting Θ. I wonder if system locale has anything to do with it since I have it set to Japanese. 
Also, if possible, I'd like to know how transform manages to fit new vectors to the existing matrices, from a mathematical viewpoint. I've tried to solve it myself with some matrix juggling and cosine similarity, but it didn't go too well. Are there perhaps some papers to read up on?

Still though, I get a matrix of zeros for all classes when trying to classify documents with predict_class_id='@labels_class'. Not sure how the data should look to get it right.

lospeq

unread,
Feb 14, 2018, 2:42:26 PM2/14/18
to bigartm-users
Actually, I honestly have no clue what in the world just happened, but I rebooted my main PC and everything now suddenly works perfectly.
And as for the classification, I kept getting zeros because of the model parameters, not because of the data.

I'm so sorry for wasting your time. 
But I still hope for some transform tips!

Oleksandr Frei

unread,
Feb 14, 2018, 3:14:31 PM2/14/18
to lospeq, bigartm-users
Hi,
Don't worry, I'm glad you solved it!
For transform, here is a code snippet in Python that does the job:
- iterationForDocument corresponds to num_document_passes;
- the terminology (item / field / token / token_weight) might be confusing; it corresponds to our internal representation of the data: item = document, token = word in that document, token_weight = term frequency (how many times the word occurred in the document);
- the snippet doesn't take care of modalities (aka class_ids), but basically you need to multiply token_weight by the weight of the modality that corresponds to that token;
- alpha is your regularization constant from the "sparse theta" regularizer (put alpha=0 if you don't use it).

The mathematics is described here:
See algorithm 2.1. It describes fitting for both Phi and Theta, but if you only need Theta, you still repeat the same iterative procedure until Theta converges (without updating Phi).
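To make that iterative procedure concrete, here is a minimal theta-only sketch in pure NumPy. It is not the actual snippet from this thread: the matrices and counts are made-up toy numbers, and modalities are ignored, but the update rule matches the description above (alpha is the "sparse theta" constant).

```python
import numpy as np

# phi: |W| x |T| matrix p(w|t), assumed already trained and held fixed.
# n_dw: term frequencies of the new document over the same vocabulary.
phi = np.array([[0.7, 0.1],
                [0.2, 0.1],
                [0.1, 0.8]])        # 3 tokens, 2 topics (toy values)
n_dw = np.array([5.0, 2.0, 1.0])    # token counts in the new document
alpha = 0.0                          # "sparse theta" regularizer constant
num_document_passes = 10             # aka iterationForDocument

theta = np.full(phi.shape[1], 1.0 / phi.shape[1])  # uniform start
for _ in range(num_document_passes):
    # p(t|d,w) is proportional to phi_wt * theta_t, normalized over topics
    p_tdw = phi * theta
    p_tdw /= p_tdw.sum(axis=1, keepdims=True)
    # n_td = sum_w n_dw * p(t|d,w), regularized by alpha and clipped at zero
    n_td = np.maximum(n_dw @ p_tdw - alpha, 0.0)
    theta = n_td / n_td.sum()

print(theta)  # converged topic distribution p(t|d) for the new document
```

With modalities in play, each entry of n_dw would additionally be multiplied by the weight of the modality its token belongs to, as noted above.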

Hope this helps! Please let me know if you have further questions.

Kind regards,
Alex


lospeq

unread,
Feb 14, 2018, 6:32:47 PM2/14/18
to bigartm-users
Oh, I tried to run that snippet before but didn't go through with it because batch.item didn't seem to have a 'field' attribute, so it threw errors.
But now that I look at one of the snippets above, it's enough to switch 'field' to 'item' (e.g. item.token_id, item.token_weight) to get it to work.
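For illustration, the iteration over batch items can be sketched as below. The SimpleNamespace objects are only a mock standing in for the real protobuf messages (a real script would load an artm batch instead), using just the fields mentioned in this thread: token_id and token_weight indexing into the batch-level token list.

```python
from types import SimpleNamespace

# Mock batch mimicking the layout discussed above (NOT the real protobuf):
# batch.token is the batch vocabulary; each item lists (token_id, token_weight).
batch = SimpleNamespace(
    token=["alpha", "beta", "gamma"],
    item=[SimpleNamespace(title="doc1",
                          token_id=[0, 2],       # indices into batch.token
                          token_weight=[3.0, 1.0])],
)

# Collect term frequencies n_dw for each item (document) in the batch.
n_dw = {}
for item in batch.item:
    counts = {}
    for tid, weight in zip(item.token_id, item.token_weight):
        counts[batch.token[tid]] = counts.get(batch.token[tid], 0.0) + weight
    n_dw[item.title] = counts

print(n_dw)  # {'doc1': {'alpha': 3.0, 'gamma': 1.0}}
```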

I suppose classification task should look about the same but with target modality tokens in place of topics in Theta?

Either way, hugely glad to have found this community. Thank you!

Oleksandr Frei

unread,
Feb 14, 2018, 6:43:20 PM2/14/18
to lospeq, bigartm-users
> switch 'field' to 'item' (e.g. item.token_id, item.token_weight) to get it work.
yes, that's correct! 

> classification task should look about the same but with target modality tokens in place of topics in Theta?
I'm not sure I understood the way you want to make it... Classification is defined here:
http://www.machinelearning.ru/wiki/images/d/da/Voron17fast.pdf (section IX.B - Multimodal ARTM / The class modality)
First you need to infer topics for a given document. Then you use
p(c|d') = Σ_t p(c|t) p(t|d')
to find the distribution over classification labels. Then take the argmax, i.e. the class with the highest p(c|d').
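As a minimal NumPy sketch of that last step (toy numbers and hypothetical label names, not the thread's actual data): the label-modality Phi gives p(c|t) per topic, which is combined with the inferred theta column p(t|d') and reduced with argmax.

```python
import numpy as np

# p(c|d') = sum_t p(c|t) * p(t|d')
phi_labels = np.array([[0.9, 0.2],   # p(class1|t) for each of 2 topics
                       [0.1, 0.8]])  # p(class2|t); each column sums to 1
theta_d = np.array([0.3, 0.7])       # p(t|d') inferred by transform

p_cd = phi_labels @ theta_d          # distribution over class labels
predicted = ["class1", "class2"][int(np.argmax(p_cd))]
print(p_cd, predicted)
```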

Kind regards,
Alex



lospeq

unread,
Feb 15, 2018, 12:51:47 AM2/15/18
to bigartm-users
Ah, I see, I was wrong then.

Again, thank you very much!