Data representativeness and "none of the above" label


j...@blyng.io

Jan 18, 2021, 4:15:57 AM
to cloud-nl-discuss
Dear AutoML team,

Here is some context for my questions, which you will find at the end of this email:
  • Our NLP application: 
    • Text classification, Multi-label 
    • The goal of our model is to detect the presence (or absence) of specific types of content in text (at the sentence level).
  • File type: 
    • 30-page text PDFs
  • Number of files used so far for training: 
    • about 100 (=3,000 pages)
  • Number of labels: 6
    • 5 defined labels for which we need to run predictions; and
    • 1 label called "None_of_the_above", as suggested HERE, for content that doesn't match any of our defined labels. We understand that having this "None_of_the_above" label can improve the accuracy of our model.
  • Number of example per label: 
    • 100 to 700 examples (sentences) per defined label; and 
    • 65,000 examples (sentences, individual words, numbers) for the "none_of_the_above" label. 
We understand that it is recommended to keep a maximum ratio of 10x between the label with the smallest number of examples (in our case 100) and the label with the largest number of examples (in our case 65,000).

Reducing the number of examples for the current biggest label from 65,000 to 1,000 in our CSV file is easy enough. However, we are afraid of introducing a data representativeness issue if we do so.

Indeed, on average 1 file contains 30 pages, which comes to about 700 examples (i.e. sentences/words/numbers). Out of these 700 examples, about 50 examples (sentences) are used for training our 5 defined labels; the remaining 650 examples (sentences/words/numbers) per file will automatically fall into the "none of the above" label.

If we randomly select only 1,000 of the 65,000 examples for the "none of the above" label, it feels like we are only using the equivalent of about 1.5 files to train that label.

As mentioned above, 1 file contains on average 650 "none of the above" examples, and we have used about 100 files so far to gather the training phrases we need for the 5 defined labels.

Questions for you:
  • Is having the "None_of_the_above" label compulsory, strongly recommended, or optional?
  • Would you recommend running 2 models: one without the "None_of_the_above" label and one with it (containing a maximum of 1,000 examples)? Or would you recommend something else?
  • Would you also see a potential data representativeness issue if we only use 1,000 examples out of the 65,000 available?
  • Any other suggestions or things we should keep in mind?
Thanks.
Julien

rodv...@google.com

Jan 18, 2021, 6:50:28 AM
to cloud-nl-discuss

Hello Julien,

Even though I do not belong to the AutoML Engineering Team, I would like to share my point of view on your questions. I believe the documentation link you provided for the `None_of_the_above` label is the wrong one: you linked [1], which is for AutoML Vision, while I understand you meant to cite [2], for AutoML Natural Language.

According to [2], users should consider using the label `None_of_the_above` "for documents that don't match any of your defined labels". Including this label may improve the accuracy of your model; however, its use is not mandatory. It is strongly recommended that, whenever your model seems to have lower accuracy than expected, you try adding this label and check whether the accuracy improves.

Take into account that [2] states: "The model works best when there are at most 100 times more documents for the most common label than for the least common label". Consequently, given that you have at least 100 examples for each label, the maximum recommended number of examples for the `None_of_the_above` label would be 10,000 documents. The risk of having some labels appear much more frequently than others is that machine learning models might learn to assign the most frequent label to every example, as this can be an easy way to obtain a lower overall classification error rate.
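If it helps, a quick way to sanity-check this ratio is to count the examples per label directly in your training CSV. The following is only a minimal sketch, assuming a simple two-column `text,label` layout with one label per row; adjust the column index if your CSV also carries a dataset-split column or several label columns:

```python
import csv
from collections import Counter

def label_ratio(csv_path, label_col=1):
    """Count examples per label and report the most/least common ratio."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) > label_col and row[label_col]:
                counts[row[label_col]] += 1
    least, most = min(counts.values()), max(counts.values())
    print("examples per label:", dict(counts))
    print(f"most/least common ratio: {most / least:.1f}x (guideline: at most 100x)")

# label_ratio("training_data.csv")  # hypothetical file name
```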

Instead of deleting examples from the `None_of_the_above` label, another approach would be to include more examples for your other 5 labels, in order to strike a better balance with the most frequent label.

Finally, I believe it might be worth comparing several models and checking how the accuracy fluctuates with the frequency ratio between the `None_of_the_above` label and the other five labels. You might, for example, create one model without this special label, one with 1,000 examples, one with 10,000 examples, and one with the full 65,000 examples, then monitor the metrics obtained by each model on the test set in order to keep the best-performing one.

Additionally, whenever you reduce the number of examples for the `None_of_the_above` label, make sure you do so at random. This way, most of your files will still be represented in the dataset, even if fewer sentences are taken from each of them.
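To illustrate both of the previous points (building the different model variants and sampling the `None_of_the_above` examples at random across files), here is a rough, non-authoritative sketch. It assumes a hypothetical three-column working CSV (text, label, source_file) in which source_file records which PDF each sentence came from; the file names, column layout, and caps are placeholders to adapt to your own setup:

```python
import csv
import random
from collections import defaultdict

NONE_LABEL = "None_of_the_above"

def downsample_none(rows, cap, seed=42):
    """Keep all rows for the defined labels; randomly sample NONE_LABEL rows,
    spreading the sample across source files so every file stays represented."""
    keep = [r for r in rows if r[1] != NONE_LABEL]
    none_by_file = defaultdict(list)
    for r in rows:
        if r[1] == NONE_LABEL:
            none_by_file[r[2]].append(r)
    total_none = sum(len(group) for group in none_by_file.values())
    if cap is None or cap >= total_none:
        return keep + [r for group in none_by_file.values() for r in group]
    rng = random.Random(seed)
    per_file = max(1, cap // max(1, len(none_by_file)))  # even budget per source file
    sampled = []
    for group in none_by_file.values():
        sampled.extend(rng.sample(group, min(per_file, len(group))))
    return keep + sampled[:cap]

def build_variants(src_csv):
    """Write one training CSV per experiment (no / 1k / 10k / full NONE_LABEL examples)."""
    with open(src_csv, newline="", encoding="utf-8") as f:
        rows = [r for r in csv.reader(f) if len(r) >= 3]
    for cap, name in [(0, "no_none"), (1000, "none_1k"), (10000, "none_10k"), (None, "none_full")]:
        variant = downsample_none(rows, cap)
        with open(f"training_{name}.csv", "w", newline="", encoding="utf-8") as out:
            # Only the text,label columns go into the CSV you actually import.
            csv.writer(out).writerows(r[:2] for r in variant)

# build_variants("all_examples_with_source_file.csv")  # hypothetical file name
```

Each of the generated CSVs could then be imported into its own dataset and model, so you can compare the evaluation metrics side by side as suggested above.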

I hope this helped clarify your questions.

[1]: https://cloud.google.com/vision/automl/docs/prepare

[2]: https://cloud.google.com/natural-language/automl/docs/prepare#expandable-1
