Hi Maytham,
I hope I understood you correctly: you want to generate text (in a particular language, say English) using some kind of generative statistical model trained on a large English corpus. Is that right? If so, you basically need a monolingual corpus, and you have decided to use CommonCrawl (although there are many other existing corpora). Correct?
In this case, our C4Corpus should fit your needs perfectly. It is a "cleaned" version of CommonCrawl (only the extracted plain text) which also includes language information and some other metadata. Have a look here:
https://dkpro.github.io/dkpro-c4corpus/ (and also at the LREC paper). The documentation also contains examples of how to get the plain text easily for any further processing.
Hope it helps!
Best,
Ivan