I'm not sure if this is the best forum for my question (please redirect me if not), but I'm wondering what a good general strategy for cleaning the data is.
For example, I realized that in both the training and validation sets, there are news articles with extremely long titles and a lot of junk words.
For instance, running:
grep "This Just In" articles-validation-bypublisher.xml
shows many such instances.
Within the articles, there are also phrases such as "Follow us on Twitter", advertisements, and other web boilerplate.
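To make the question concrete, the kind of cleaning I have in mind is roughly the sketch below. The phrase list and the function name `strip_boilerplate` are only my assumptions for illustration, not anything provided with the dataset, and a real list would need to be much longer.

```python
import re

# Hypothetical boilerplate phrases; chosen only for illustration.
BOILERPLATE_PATTERNS = [
    r"follow us on twitter",
    r"this just in",
]

# One case-insensitive pattern that matches any of the phrases above.
_BOILERPLATE_RE = re.compile("|".join(BOILERPLATE_PATTERNS), flags=re.IGNORECASE)


def strip_boilerplate(text: str) -> str:
    """Remove the listed phrases and collapse the leftover whitespace."""
    cleaned = _BOILERPLATE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```

Something like this removes only the exact phrases, so the surrounding sentence fragments stay in the text, which is part of why I'm unsure how aggressive to be.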
Would you clean these out before feeding the text to the model, or should the model be capable of finding the relevant information on its own?
It's often difficult for me to decide how much I should clean the data and whether I'm throwing away useful information.
I'm wondering if anyone can share some insights or experience on how important data cleaning is for their models.