New to Crawler

Maytham Fahmi

Aug 31, 2016, 6:27:36 PM
to Common Crawl
Hi guys,
I am studying at the IT University of Copenhagen and need to generate a dummy text file.

To start with, I downloaded the example-warc-java-master package and worked with it. It works fine, and I went on to build a parser to extract the content from a warc.gz file.

Later on I wanted to work with a warc.wet.gz file, but I got lost. I tried cc-warc-examples-master, but I got lost there too.

My question: are there any better packages for getting the data out of a warc.wet.gz file in Java?

If yes, please advise; if not, what is my way forward to get the data out of warc.wet.gz?

Thanks indeed.
Maytham

Ivan Habernal

Sep 1, 2016, 2:31:30 AM
to Common Crawl
Hi Maytham,

I hope I understood it correctly - you want to generate texts (in a particular language, say English) using some kind of generative statistical model that is trained on a large English corpus. Is that right? If so, you basically need a monolingual corpus, and you decided to use CommonCrawl (although there are many other existing corpora). Correct?

In this case, our C4Corpus should fit your needs perfectly; it's a "cleaned" version of CommonCrawl (only extracted plain text) which also includes language information and some other metadata. Have a look here: https://dkpro.github.io/dkpro-c4corpus/ (and also at the LREC paper). The documentation also contains examples of how to get the plain text easily for any further processing.
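
Separately from C4Corpus, if the goal is only to pull the text out of a warc.wet.gz file in Java, a rough sketch using nothing but the JDK might look like the one below. It is not taken from any of the packages mentioned in this thread. The class name `WetReader` and the method `forEachRecord` are made up for illustration, and the code assumes the usual WARC layout: a `WARC/1.0` line, `Name: value` headers, a blank line, then exactly `Content-Length` bytes of UTF-8 text. It also assumes header names are spelled the way Common Crawl writes them (`WARC-Type`, `WARC-Target-URI`, `Content-Length`).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;

public class WetReader {

    /** One WARC record: its named headers plus the payload decoded as UTF-8. */
    public static final class Record {
        public final Map<String, String> headers;
        public final String text;

        public Record(Map<String, String> headers, String text) {
            this.headers = headers;
            this.text = text;
        }
    }

    /** Reads one line terminated by LF (CR is dropped); returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') break;
            if (b != '\r') buf.write(b);
        }
        return sawAny ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    /** Streams every record in a gzipped WARC/WET stream to the given callback. */
    public static void forEachRecord(InputStream gzipped, Consumer<Record> callback)
            throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new GZIPInputStream(gzipped)))) {
            String line;
            while ((line = readLine(in)) != null) {
                // Skip the blank separator lines between records.
                if (!line.startsWith("WARC/")) continue;

                // Header block: "Name: value" lines up to the first empty line.
                Map<String, String> headers = new LinkedHashMap<>();
                while ((line = readLine(in)) != null && !line.isEmpty()) {
                    int colon = line.indexOf(':');
                    if (colon > 0) {
                        headers.put(line.substring(0, colon).trim(),
                                    line.substring(colon + 1).trim());
                    }
                }

                // Content-Length counts bytes, not characters, so read raw bytes.
                int length = Integer.parseInt(headers.getOrDefault("Content-Length", "0"));
                byte[] payload = new byte[length];
                in.readFully(payload);
                callback.accept(new Record(headers, new String(payload, StandardCharsets.UTF_8)));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a local .warc.wet.gz file.
        try (InputStream file = new FileInputStream(args[0])) {
            forEachRecord(file, r -> {
                // In WET files the extracted page text lives in "conversion" records.
                if ("conversion".equals(r.headers.get("WARC-Type"))) {
                    System.out.println(r.headers.get("WARC-Target-URI")
                            + " : " + r.text.length() + " chars");
                }
            });
        }
    }
}
```

The callback keeps memory use flat, since a whole WET file does not have to be held in a list. For anything beyond a quick experiment, a maintained WARC library is probably the safer choice.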

Hope it helps!

Best,

Ivan

Maytham Fahmi

Sep 28, 2016, 10:39:41 AM
to Common Crawl
Thx Ivan, I will give it a look, but meanwhile I have developed fully functional code.