This is an interesting question, and one that has come up before. We'd love to provide a solution, but we've had trouble in the past defining the requirements of such a dataset.
The main issue is deciding what constitutes a random sample of the web. I'd imagine a good approximation would be to take a single page from each domain. There are hundreds of millions of domains, however, so a sample of 100k or even 1 million web pages couldn't include even one page per domain. Another question is which page we'd want from each domain. Home pages are unlikely to make a good sample, as they're quite formulaic in structure and often contain little information beyond snippets linking to other, more detailed pages. If not home pages, how do we decide which page we take from a domain? Deciding on the size of the sample is yet another open question.
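For illustration, here's a minimal sketch of that one-page-per-domain idea, assuming you already have a stream of candidate URLs. The function name is my own, and using the hostname as a stand-in for "domain" is a simplification; a real pipeline would want proper registered-domain extraction (e.g. via a public suffix list).

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

def sample_one_page_per_domain(urls, seed=None):
    """Pick one URL uniformly at random per domain in a single pass,
    using a per-domain reservoir of size 1."""
    rng = random.Random(seed)
    seen = defaultdict(int)   # domain -> number of URLs seen so far
    chosen = {}               # domain -> currently selected URL
    for url in urls:
        # Hostname as a crude proxy for "domain"; sub.example.com
        # and example.com are treated as distinct here.
        domain = urlparse(url).netloc
        seen[domain] += 1
        # Replace the current pick with probability 1/n, so every
        # URL seen for a domain is equally likely to end up chosen.
        if rng.randrange(seen[domain]) == 0:
            chosen[domain] = url
    return chosen
```

Note that this keeps state proportional to the number of distinct domains seen, which at the scale of hundreds of millions of domains would itself need a more scalable approach.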
The core issue is making the random sample support the widest possible set of use cases. If anyone has any insights on that, I'd be interested to hear them! :)
As a first approximation, a single segment from a crawl archive generally contains a few million web pages. Processing one to create a sample of documents should be relatively straightforward and may work well depending on the task.
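If it helps, here's a rough sketch of that kind of processing, assuming the warcio library and a WARC file downloaded locally from a segment (the path and sample size below are placeholders). It draws a uniform sample of response records via standard reservoir sampling.

```python
import random
from warcio.archiveiterator import ArchiveIterator

def reservoir_sample_warc(warc_path, k=1000, seed=None):
    """Uniformly sample up to k response records from one WARC file
    without knowing the total record count up front (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    n = 0  # number of response records seen so far
    with open(warc_path, "rb") as stream:
        # ArchiveIterator handles both .warc and .warc.gz streams.
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            n += 1
            if len(sample) < k:
                sample.append((url, body))
            else:
                # Keep each new record with probability k/n.
                j = rng.randrange(n)
                if j < k:
                    sample[j] = (url, body)
    return sample

pages = reservoir_sample_warc("segment-file.warc.gz", k=1000, seed=42)
```

Sampling across a whole segment would just mean feeding each of its WARC files through the same reservoir in turn.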