to common...@googlegroups.com
Hi,
the files are plain text (UTF-8 encoded), compressed with xz [1,2].
On Linux you can view the content with, e.g.,
xzcat fa.2013_1.raw.xz | less
and you'll get
to Common Crawl
There are about 12 warc.gz files listed in warc.paths.gz. Download these files and check whether they are enough for your needs (whatever those might be). If not, then Common Crawl is not the dataset you're looking for. Some further reading might help you out, such as Schäfer, R., & Bildhauer, F. (2013). Web Corpus Construction. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. http://doi.org/10.2200/S00508ED1V01Y201305HLT022
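To see which files warc.paths.gz actually lists before downloading anything, a short Python sketch along these lines may help. It assumes you have already saved warc.paths.gz locally and that it holds one relative path per line. The function names and the base URL are assumptions for illustration, not something stated in this thread, so check the prefix against the current Common Crawl download instructions.

```python
import gzip
import sys

# Assumption: the public download prefix; verify against the Common Crawl site.
BASE_URL = "https://data.commoncrawl.org/"


def read_warc_paths(listing):
    """Return the non-empty lines of a gzip-compressed path listing."""
    with gzip.open(listing, "rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def to_urls(paths, base_url=BASE_URL):
    """Prefix each relative path with the (assumed) download base URL."""
    return [base_url + path for path in paths]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_warc.py warc.paths.gz
    for url in to_urls(read_warc_paths(sys.argv[1])):
        print(url)
```

The printed URLs can then be passed to whatever download tool you prefer, one file at a time, so you can inspect a single warc.gz before fetching the rest.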