to Common Crawl
Hello,
I am a social science researcher at NYU. I'm doing a project on open-source training datasets, many of which are cleaned subsets of the Common Crawl. One thing my team has noticed is that many Common Crawl snapshots are composites of content rather than a single clean page of text. Has there been any research on why this is the case? Any information would be much appreciated.