how to know the file format


fatemek...@gmail.com

May 24, 2017, 6:10:44 AM
to Common Crawl
hello,
I have some crawl files from commoncrawl.com, but I don't know the file format.
I've attached a picture of them. Would you please help me extract them? I only need the webpages.

thanks,
Attachment: pic.jpg

Sebastian Nagel

May 24, 2017, 6:24:49 AM
to common...@googlegroups.com
Hi,

the files are plain text (UTF-8 encoded), compressed with xz [1,2].
On Linux you can view the content with, e.g.,
xzcat fa.2013_1.raw.xz | less
and you'll get

df6fa1abb58549287111ba8d776733e9 24.000000
http://barnamenevis.org/showthread.php?213820-%D8%B4%D8%A8%DA%A9%D9%87-%DA%A9%D8%B1%D8%AF%D9%86-%DA%86%D9%86%D8%AF-%DA%A9%D8%A7%D9%85%D9%BE%DB%8C%D9%88%D8%AA%D8%B1-%D8%A8%D8%A7-%D9%85%D9%88%D8%AF%D9%85-adsl-%D9%88%D8%A7%DB%8C%D8%B1%D9%84%D8%B3
شروع: 24 خرداد)
دوره عملی برنامه نویسی
df6fa1abb58549287111ba8d776733e9 26.000000
http://barnamenevis.org/showthread.php?213820-%D8%B4%D8%A8%DA%A9%D9%87-%DA%A9%D8%B1%D8%AF%D9%86-%DA%86%D9%86%D8%AF-%DA%A9%D8%A7%D9%85%D9%BE%DB%8C%D9%88%D8%AA%D8%B1-%D8%A8%D8%A7-%D9%85%D9%88%D8%AF%D9%85-adsl-%D9%88%D8%A7%DB%8C%D8%B1%D9%84%D8%B3
پیشرفته تحت ویندوز (اصفهان) (شروع: 27 خرداد)
دوره آموزشی برنامه نویسی


That answer is based on the assumption that the files were taken from
http://data.statmt.org/ngrams/raw/
If this is correct, I would also recommend checking the documentation provided at
http://statmt.org/ngrams/index.html

Best,
Sebastian

[1] https://tukaani.org/xz/
[2] https://en.wikipedia.org/wiki/Xz

fatemek...@gmail.com

May 25, 2017, 2:43:57 AM
to Common Crawl
thanks for your reply,
so the datasets contain the n-grams of a page, not the exact page?

fatemek...@gmail.com

May 26, 2017, 2:42:30 AM
to Common Crawl
If so, is it possible to download only Persian websites from Common Crawl?
I only need the webpages, so I think I only need the .wet format files?

Ivan Habernal

May 26, 2017, 2:59:09 AM
to Common Crawl
Dear fatemekhazaeee,

If you only need the plain text of pages written in Farsi, you might want to get it out of the C4Corpus: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_how_to_access_c4corpus_using_http

First, download the list of all pre-processed files ( https://commoncrawl.s3.amazonaws.com/contrib/c4corpus/CC-MAIN-2016-07/warc.paths.gz ) and then download only those with "Lang_fa" in the file name, as in the sketch below.
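
A minimal Python sketch of those two steps; it assumes the listed paths are relative to https://commoncrawl.s3.amazonaws.com/, which is the usual convention for Common Crawl path listings:

import gzip
import os
import urllib.request

BASE = "https://commoncrawl.s3.amazonaws.com/"

# 1. fetch the list of pre-processed C4Corpus files
urllib.request.urlretrieve(
    BASE + "contrib/c4corpus/CC-MAIN-2016-07/warc.paths.gz", "warc.paths.gz")

# 2. download only the Farsi partitions
with gzip.open("warc.paths.gz", "rt") as paths:
    for path in (line.strip() for line in paths):
        if "Lang_fa" in path:
            urllib.request.urlretrieve(BASE + path, os.path.basename(path))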

Best,

Ivan

fatemek...@gmail.com

May 26, 2017, 4:40:58 AM
to Common Crawl
thanks a lot,
sorry for my tedious questions,
I need about 2 or 3 GB of pages, but this file ( https://commoncrawl.s3.amazonaws.com/contrib/c4corpus/CC-MAIN-2016-07/warc.paths.gz ) lists only a few Farsi files.
How can I access more pages?

Ivan Habernal

May 26, 2017, 4:48:38 AM
to Common Crawl
There are about 12 warc.gz files listed in warc.paths.gz. Download these files and see whether this is enough for your needs (whatever they might be); a rough sketch for skimming their contents follows below. If not, then Common Crawl is not the dataset you're looking for. Some further reading might help you out, such as Schäfer, R., & Bildhauer, F. (2013). Web Corpus Construction. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. http://doi.org/10.2200/S00508ED1V01Y201305HLT022
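
For skimming the plain text inside the downloaded files, here is a rough Python sketch. The glob pattern is a guess based on the "Lang_fa" naming, the record payloads are assumed to be UTF-8 plain text (which is what C4Corpus produces), and it uses the third-party warcio package (pip install warcio):

import glob

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Print the first few hundred characters of every record payload, so you
# can eyeball the text before committing to the full dataset.
for name in glob.glob("Lang_fa*.warc.gz"):
    with open(name, "rb") as stream:  # warcio detects the gzip layer itself
        for record in ArchiveIterator(stream):
            payload = record.content_stream().read()
            if payload:  # skip records without a payload
                print(payload.decode("utf-8", "replace")[:200])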

Best,

Ivan