Hi,
You need to pre-process your collection and convert it into VW format. The VW format looks as follows:
document_number_one |@labels_class peninsular_malaysia_@LABEL monitoring_of_exports_@LABEL export_of_waste_@LABEL |@default_class posit pursuant amend:5 relev
document_number_two |@labels_class agricultural_statistics_@LABEL livestock_@LABEL animal_production_@LABEL statistical_method_@LABEL cattle_@LABEL swine_@LABEL |@default_class council:5 made commun:2 treati
Each line starts with a document identifier (which must not contain spaces), followed by tokens.
In the example above both documents are represented as bag-of-words, i.e. I've counted how many times each token occurs in the document and written that count as "token:count". That's optional, i.e. it's ok to put just text; just make sure it doesn't contain a colon (":") or a pipe ("|"). A pipe introduces a class label: for example, |@labels_class means that all subsequent tokens belong to the class @labels_class, up to the next pipe (|), which redefines the class.
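To make the rules above concrete, here is a minimal Python sketch that builds one such line from already-cleaned tokens. The function name to_vw_line and the dict-of-lists input are my own assumptions for illustration, not part of any library; it writes "token:count" only when a token occurs more than once.

```python
from collections import Counter


def to_vw_line(doc_id, tokens_by_class):
    """Build one VW-format line (hypothetical helper, not a library API).

    doc_id: document identifier, must not contain whitespace.
    tokens_by_class: dict mapping a class name (e.g. "@default_class")
        to a list of already-cleaned tokens of that class.
    """
    if not doc_id or any(ch.isspace() for ch in doc_id):
        raise ValueError("document identifier must be non-empty and without spaces")

    parts = [doc_id]
    for class_name, tokens in tokens_by_class.items():
        # A pipe switches the class for all tokens that follow it.
        parts.append("|" + class_name)
        # Counter keeps tokens in first-seen order (Python 3.7+).
        for token, count in Counter(tokens).items():
            if ":" in token or "|" in token or any(ch.isspace() for ch in token):
                raise ValueError("bad token: %r" % token)
            # The count is optional, so it is written only when above 1.
            parts.append(token if count == 1 else "%s:%d" % (token, count))
    return " ".join(parts)


line = to_vw_line(
    "document_number_one",
    {
        "@labels_class": ["export_of_waste_@LABEL"],
        "@default_class": ["posit", "amend", "amend", "relev"],
    },
)
print(line)
```

Write one such line per document into a text file and you have the collection in VW format.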
The simplest VW file, where all tokens belong to the default class, looks as follows:
doc1 content of my document number
doc2 the quick brown fox jumps over the lazy dog
But remember that tokens must be cleaned, stemmed or lemmatized, converted to a standard case, etc. "DOG", "dog" and "Dog" would be three different tokens, and "jumps" and "jump" would be different too.
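As a rough illustration of the cleaning step, the sketch below lower-cases the text and keeps only runs of letters and digits, which also removes ":" and "|". The name normalize_tokens is hypothetical, and note that it does NOT stem or lemmatize: "jumps" and "jump" still stay different, so you would need to add a stemmer or lemmatizer of your choice on top of it.

```python
import re


def normalize_tokens(text):
    """Lower-case text and split it into alphanumeric tokens.

    Hypothetical helper: it only handles case and punctuation.
    Stemming/lemmatization is intentionally left out.
    """
    # [^\W_]+ matches runs of letters and digits (no underscore, no punctuation),
    # so ":" and "|" can never end up inside a token.
    return re.findall(r"[^\W_]+", text.lower())


tokens = normalize_tokens("The DOG: jumps | over the Dog")
# Simplest VW line: identifier followed by tokens of the default class.
vw_line = "doc2 " + " ".join(tokens)
print(vw_line)
```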
Kind regards,
Alex