transform method error for new documents


lospeq

unread,
Feb 14, 2018, 3:07:44 AM2/14/18
to bigartm-users
I set separate folders for the training and test data batches. Training data looks like this (MMRO collection; ≤300 docs per class):
2007-MMRO13/dedovets |@labels_class class1 |@default_class [tokens...]
2007-MMRO13/protasov_AS |@labels_class class1 |@default_class [tokens...]
...
2011-MMRO15/kelmanov_4 |@labels_class class4 |@default_class [tokens...]
2011-MMRO15/ezhova |@labels_class class4 |@default_class [tokens...]

I made a simple artm model and now want to get transform predictions for a new document (which is actually just a document from the training collection). So I make a new batch just for this document and run transform, but always get the same error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "C:\Program Files\Python36\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\BigARTM\Python\artm\master_component.py", line 908, in transform
    theta_matrix_info = self._lib.ArtmRequestTransformMasterModelExternal(self.master_id, args)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 161, in artm_api_call
    self._check_error(result)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 97, in _check_error
    raise exception_class(error_message)
artm.wrapper.exceptions.InvalidOperationException: Transform: no tokens in effect --- either tokens not present in the model, or tokens were ignored due to class_id

Traceback (most recent call last):
  File "<pyshell#280>", line 1, in <module>
    fail = model.transform(batch_vectorizer=tbv)
  File "C:\BigARTM\Python\artm\artm_model.py", line 939, in transform
    batch_vectorizer.num_batches)
  File "C:\BigARTM\Python\artm\artm_model.py", line 482, in _wait_for_batches_processed
    async_result.wait(1)
  File "C:\Program Files\Python36\lib\multiprocessing\pool.py", line 635, in wait
    self._event.wait(timeout)

I've tried a number of variations of both training and test VW formatting – changing the size of the collection, using different tokens and altering their number for @labels_class modality, leaving it blank, or getting rid of modalities altogether; I still get the error.
The only time I don't get it is when I explicitly state the class of the test document as well (it has to be one of the classes present in the training collection, or the error appears again) and make a batch out of it (which, according to the docs, I guess I'm not supposed to do?), so the test VW would look like:
2007-MMRO13/dedovets |@labels_class class1 |@default_class [tokens...] 
But then in both Θ and the class predictions (with the predict_class_id='@labels_class' parameter) I either get a matrix full of zeros or numbers far from what that document originally had in the model's Θ.
I also tried the code snippet from one of the previous threads; the traceback got shorter, but it's still the same error:

Traceback (most recent call last):
  File "C:\Users\so\Desktop\gomi no gomi\nande\sukuwarerarerutoomouka\nandatteshitemomuri.py", line 74, in <module>
    theta = model._lib.ArtmRequestTransformMasterModel(model.master.master_id, args)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 161, in artm_api_call
    self._check_error(result)
  File "C:\BigARTM\Python\artm\wrapper\api.py", line 97, in _check_error
    raise exception_class(error_message)
artm.wrapper.exceptions.InvalidOperationException: Transform: no tokens in effect --- either tokens not present in the model, or tokens were ignored due to class_id

I can't seem to pinpoint what I'm doing wrong. Any help would be much appreciated.

Oleksandr Frei

unread,
Feb 14, 2018, 3:59:07 AM2/14/18
to lospeq, bigartm-users
Hi,
This is strange... Could you share with me your data files and the Python script to reproduce this weird behavior?
Kind regards,
Alex

--
You received this message because you are subscribed to the Google Groups "bigartm-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigartm-users+unsubscribe@googlegroups.com.
To post to this group, send email to bigart...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigartm-users/e8dfd653-cfdd-41b4-98c0-73962c8f07f5%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

lospeq

unread,
Feb 14, 2018, 4:32:13 AM2/14/18
to bigartm-users
Sure, if google drive is okay:
https://drive.google.com/open?id=17bcBmGEyRXEAn1VOWXx4uiWilp5HkJ1Q


On Wednesday, February 14, 2018 at 13:59:07 UTC+5, Oleksandr Frei wrote:

Oleksandr Frei

unread,
Feb 14, 2018, 6:44:57 AM2/14/18
to lospeq, bigartm-users
Thanks! I'll take a look.

--

lospeq

unread,
Feb 14, 2018, 1:42:29 PM2/14/18
to bigartm-users
I tried to run it on a different PC, and transform did a great job at predicting Θ. I wonder if system locale has anything to do with it since I have it set to Japanese. 
Also, if possible, I'd like to know how transform manages to fit new vectors to the existing matrices, from a mathematical viewpoint. I've tried to solve it myself with some matrix juggling and cosine similarity, but it didn't go too well. Are there perhaps some papers to read up on?

Still though, I get a matrix of zeros for all classes when trying to classify documents with predict_class_id='@labels_class'. Not sure how the data should look to get it right.

lospeq

unread,
Feb 14, 2018, 2:42:26 PM2/14/18
to bigartm-users
Actually, I honestly have no clue what in the world just happened, but I rebooted my main PC and everything now suddenly works perfectly.
And as for the classification, I kept getting zeros because of the model parameters, not because of the data.

I'm so sorry for wasting your time. 
But I still hope for some transform tips!

Oleksandr Frei

unread,
Feb 14, 2018, 3:14:31 PM2/14/18
to lospeq, bigartm-users
Hi,
Don't worry, I'm glad you solved it!
For transform, here is a code snippet in Python that does the job:
- iterationForDocument corresponds to num_document_passes;
- the terminology (item / field / token / token_weight) might be confusing; it corresponds to our internal representation of the data: item = document, token = word in that document, token_weight = term frequency (how many times the word occurred in the document);
- the snippet doesn't take care of modalities (aka class_ids), but basically you need to multiply token_weight by the weight of the modality that corresponds to that token;
- alpha is your regularization constant from the "sparse theta" regularizer (put alpha=0 if you don't use it).

The mathematics is described here:
See algorithm 2.1. It describes fitting for both Phi and Theta, but if you only need Theta, you still repeat the same iterative procedure until Theta converges (without updating Phi).
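To make that iterative procedure concrete, here is a minimal theta-only sketch in pure NumPy. It is not the actual snippet from this thread: the matrices and counts are made-up toy numbers, and modalities are ignored, but the update rule matches the description above (alpha is the "sparse theta" constant).

```python
import numpy as np

# phi: |W| x |T| matrix p(w|t), assumed already trained and held fixed.
# n_dw: term frequencies of the new document over the same vocabulary.
phi = np.array([[0.7, 0.1],
                [0.2, 0.1],
                [0.1, 0.8]])        # 3 tokens, 2 topics (toy values)
n_dw = np.array([5.0, 2.0, 1.0])    # token counts in the new document
alpha = 0.0                          # "sparse theta" regularizer constant
num_document_passes = 10             # aka iterationForDocument

theta = np.full(phi.shape[1], 1.0 / phi.shape[1])  # uniform start
for _ in range(num_document_passes):
    # p(t|d,w) is proportional to phi_wt * theta_t, normalized over topics
    p_tdw = phi * theta
    p_tdw /= p_tdw.sum(axis=1, keepdims=True)
    # n_td = sum_w n_dw * p(t|d,w), regularized by alpha and clipped at zero
    n_td = np.maximum(n_dw @ p_tdw - alpha, 0.0)
    theta = n_td / n_td.sum()

print(theta)  # converged topic distribution p(t|d) for the new document
```

With modalities in play, each entry of n_dw would additionally be multiplied by the weight of the modality its token belongs to, as noted above.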

Hope this helps! Please let me know if you have further questions.

Kind regards,
Alex


lospeq

unread,
Feb 14, 2018, 6:32:47 PM2/14/18
to bigartm-users
Oh, I tried to run that snippet before but didn't go through with it because batch.item didn't seem to have a 'field' attribute, so it threw errors.
But now that I look at one of the snippets above, it's enough to switch 'field' to 'item' (e.g. item.token_id, item.token_weight) to get it to work.
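For illustration, the iteration over batch items can be sketched as below. The SimpleNamespace objects are only a mock standing in for the real protobuf messages (a real script would load an artm batch instead), using just the fields mentioned in this thread: token_id and token_weight indexing into the batch-level token list.

```python
from types import SimpleNamespace

# Mock batch mimicking the layout discussed above (NOT the real protobuf):
# batch.token is the batch vocabulary; each item lists (token_id, token_weight).
batch = SimpleNamespace(
    token=["alpha", "beta", "gamma"],
    item=[SimpleNamespace(title="doc1",
                          token_id=[0, 2],       # indices into batch.token
                          token_weight=[3.0, 1.0])],
)

# Collect term frequencies n_dw for each item (document) in the batch.
n_dw = {}
for item in batch.item:
    counts = {}
    for tid, weight in zip(item.token_id, item.token_weight):
        counts[batch.token[tid]] = counts.get(batch.token[tid], 0.0) + weight
    n_dw[item.title] = counts

print(n_dw)  # {'doc1': {'alpha': 3.0, 'gamma': 1.0}}
```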

I suppose classification task should look about the same but with target modality tokens in place of topics in Theta?

Either way, hugely glad to have found this community. Thank you!

Oleksandr Frei

unread,
Feb 14, 2018, 6:43:20 PM2/14/18
to lospeq, bigartm-users
> switch 'field' to 'item' (e.g. item.token_id, item.token_weight) to get it work.
yes, that's correct! 

> classification task should look about the same but with target modality tokens in place of topics in Theta?
I'm not sure I understood the way you want to make it... Classification is defined here:
http://www.machinelearning.ru/wiki/images/d/da/Voron17fast.pdf (section IX.B - Multimodal ARTM / The class modality)
First you need to infer topics for a given document. Then you use
p(c|d') = Σ_t p(c|t) p(t|d')
to find the distribution over classification labels. Then take the argmax, i.e. the class with the highest p(c|d').
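As a minimal NumPy sketch of that last step (toy numbers and hypothetical label names, not the thread's actual data): the label-modality Phi gives p(c|t) per topic, which is combined with the inferred theta column p(t|d') and reduced with argmax.

```python
import numpy as np

# p(c|d') = sum_t p(c|t) * p(t|d')
phi_labels = np.array([[0.9, 0.2],   # p(class1|t) for each of 2 topics
                       [0.1, 0.8]])  # p(class2|t); each column sums to 1
theta_d = np.array([0.3, 0.7])       # p(t|d') inferred by transform

p_cd = phi_labels @ theta_d          # distribution over class labels
predicted = ["class1", "class2"][int(np.argmax(p_cd))]
print(p_cd, predicted)
```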

Kind regards,
Alex



lospeq

unread,
Feb 15, 2018, 12:51:47 AM2/15/18
to bigartm-users
Ah, I see, I was wrong then.

Again, thank you very much!