Hello Julien,
Even though I do not belong to the AutoML Engineering Team, I would like to share my point of view on these questions. I believe you provided the wrong link for the documentation about the `None_of_the_above` label: you linked [1], which is for AutoML Vision, while I understand you meant to cite [2], which is for AutoML Natural Language.
According to [2], users are advised to consider using the label `None_of_the_above` "for documents that don't match any of your defined labels". Including this label may improve the accuracy of your model; however, its use is not mandatory. Whenever your model seems to have lower accuracy than expected, it is strongly recommended that you also try adding this label and check whether the accuracy improves.
Take into account that [2] states: "The model works best when there are at most 100 times more documents for the most common label than for the least common label". Consequently, given that you have at least 100 examples for each of your other labels, the maximum recommended number of examples for `None_of_the_above` would be 10,000 documents. The risk of having some labels appear much more frequently than others is that a machine learning model may learn to assign the most frequent label to every example, since that can be an easy way to lower the overall classification error rate.
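To make the 100:1 guideline concrete, here is a minimal sketch of how you could check your own label counts before uploading the dataset. The function name and data layout are purely illustrative, not part of any AutoML API:

```python
from collections import Counter

def check_label_balance(labels, max_ratio=100):
    """Check the 100:1 guideline: the most common label should have
    at most `max_ratio` times as many documents as the least common.
    `labels` is a list with one label string per document."""
    counts = Counter(labels)
    most = max(counts.values())
    least = min(counts.values())
    return most <= max_ratio * least, counts

# Example: five labels with 100 documents each, plus an oversized catch-all.
labels = (["label_a"] * 100 + ["label_b"] * 100 + ["label_c"] * 100
          + ["label_d"] * 100 + ["label_e"] * 100
          + ["None_of_the_above"] * 65000)
ok, counts = check_label_balance(labels)
# ok is False here, because 65000 > 100 * 100
```

With 100 examples for the least common label, the check fails until `None_of_the_above` is reduced to 10,000 documents or fewer.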
Instead of deleting examples from the `None_of_the_above` label, another approach would be to include more examples for each of your other five labels, so that they are better balanced against the most frequent label.
Finally, I believe it might be worth comparing several models to see how accuracy varies with the frequency ratio between the `None_of_the_above` label and the other five labels. For example, you could create one model without this special label, one with 1,000 examples, one with 10,000 examples, and one with the full 65,000 examples, then monitor each model's metrics on the test set and keep the best-performing one.
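The comparison above boils down to building several dataset variants that differ only in how many `None_of_the_above` documents they keep. A small sketch, with made-up names and a toy dataset (nothing here is an AutoML API):

```python
import random

def make_variants(examples, sizes, seed=0):
    """Build dataset variants that keep all regular-label examples and
    a randomly sampled subset of `None_of_the_above` examples of each
    requested size. `examples` is a list of (text, label) pairs."""
    rng = random.Random(seed)
    keep = [ex for ex in examples if ex[1] != "None_of_the_above"]
    nota = [ex for ex in examples if ex[1] == "None_of_the_above"]
    return {size: keep + rng.sample(nota, min(size, len(nota)))
            for size in sizes}

# Toy data: 100 documents for one regular label plus the full catch-all.
data = ([(f"doc_{i}", "label_a") for i in range(100)]
        + [(f"nota_{i}", "None_of_the_above") for i in range(65000)])
variants = make_variants(data, sizes=[0, 1000, 10000, 65000])
# variants[0] drops the special label entirely; variants[65000] keeps it all.
```

Each variant can then be exported and trained as a separate AutoML model, with the test-set metrics deciding which ratio to keep.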
Additionally, whenever you reduce the number of examples for the `None_of_the_above` label, make sure you do so at random. This way, most of your source files will still be represented within the dataset, even if fewer sentences are taken from each of them.
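As a quick illustration of why uniform random sampling preserves file coverage, here is a sketch with invented field names (a `(file_id, sentence)` pair per row; not an AutoML API). Sampling uniformly across all sentences means every source file contributes roughly in proportion to its size:

```python
import random

def subsample_rows(rows, target, seed=0):
    """Uniformly subsample rows at random, without replacement.
    `rows` is a list of (file_id, sentence) pairs."""
    rng = random.Random(seed)
    return rng.sample(rows, min(target, len(rows)))

# 650 files with 100 sentences each = 65,000 rows; keep 10,000 of them.
rows = [(f, s) for f in range(650) for s in range(100)]
kept = subsample_rows(rows, 10000)
files_kept = {f for f, _ in kept}
# With uniform sampling, essentially all 650 files still appear,
# each now contributing roughly 15 sentences instead of 100.
```

Sampling by hand-picking whole files instead would risk dropping entire topics from the `None_of_the_above` pool.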
I hope this helps clarify your questions.
[1]:
https://cloud.google.com/vision/automl/docs/prepare
[2]:
https://cloud.google.com/natural-language/automl/docs/prepare#expandable-1