Hyperpartisan News Detection Data Cleaning

chialu...@gmail.com

Jan 31, 2019, 10:38:07 AM

to PAN Workshop Series on Digital Text Forensics

Hi,

I'm not sure if this is the best forum for my question (please redirect me if not), but I'm wondering what a good general strategy for cleaning the data is.

For example, I noticed that both the training and validation sets contain news articles with extremely long titles and a lot of junk words.

Running the following command, for instance:

grep "This Just In" articles-validation-bypublisher.xml

shows many such instances.
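To put a number on this rather than eyeballing grep output, something like the sketch below could count how many articles contain a given phrase. It assumes the file is a flat sequence of `<article>` elements that each carry a `title` attribute; I have not checked this against the schema, so treat the tag and attribute names as assumptions. The function name `count_matching_articles` is just something I made up for illustration.

```python
import xml.etree.ElementTree as ET


def count_matching_articles(source, phrase):
    """Return (matches, total) for articles whose title or body contains phrase.

    `source` is a file path or a binary file object. The <article> tag and
    the `title` attribute are assumptions about the XML layout.
    """
    matches = 0
    total = 0
    # iterparse streams the file, so a large XML dump is not loaded at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        total += 1
        title = elem.get("title", "")
        body = "".join(elem.itertext())
        if phrase in title or phrase in body:
            matches += 1
        # Drop the article's children and text once it has been counted.
        elem.clear()
    return matches, total
```

Calling `count_matching_articles("articles-validation-bypublisher.xml", "This Just In")` would then give the fraction of affected articles, which seems more useful for deciding whether cleaning is worth it than a raw list of matching lines.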

The article bodies also contain phrases such as "Follow us on Twitter", advertisements, and other web boilerplate.

Would you clean these out before feeding the text to the model, or should the model be capable of finding the relevant information on its own?
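To make the first option concrete, this is roughly the kind of cleaning step I have in mind. It is only a minimal sketch: the phrase list is hypothetical (only "Follow us on Twitter" comes from what I actually saw in the data), and `strip_boilerplate` is a name I made up.

```python
import re

# Hypothetical phrase list; in practice it would be built by inspecting the data.
BOILERPLATE_PHRASES = ["Follow us on Twitter"]


def strip_boilerplate(text, phrases=BOILERPLATE_PHRASES):
    """Remove each phrase (case-insensitively) and collapse leftover whitespace."""
    if not phrases:
        return text
    # re.escape keeps punctuation in the phrases from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    cleaned = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```

This only deletes the exact phrases, so the rest of the sentence around them stays in the text. Whether that is too aggressive or not aggressive enough is exactly what I am unsure about.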

I often find it difficult to decide how much to clean the data and whether I'm throwing away useful information.

I'm wondering if anyone can share some insights or experience on how important data cleaning is for their models.
