That sounds about right; by eyeball measure, it looks like this should come out to about 30-40k pages.
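If it helps to double-check that figure, a quick count via the Wayback Machine CDX API would look roughly like the sketch below; the query parameters (prefix match, collapse on urlkey) are my guesses at a reasonable query, not something I've verified against this listing:

import requests

# Minimal sketch: count unique captured URLs under list.seqfan.eu via the
# Wayback Machine CDX API instead of eyeballing the wildcard listing.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "list.seqfan.eu/*",   # trailing * asks for a prefix match
        "output": "json",
        "collapse": "urlkey",        # one row per unique URL rather than per snapshot
        "fl": "original,timestamp",
    },
    timeout=120,
)
rows = resp.json()
print(f"{len(rows) - 1} unique captured URLs")  # first row is the column header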
https://web.archive.org/web/*;type=text/list.seqfan.eu/*

What about the Yahoo Groups data? Does that still need to be recovered, or, if it's already covered, is there some way I can assist?
From what I can tell, the Yahoo Groups data appears to be well captured in Common Crawl's WARC files across a few years, in a format like this:
{"numRecords": 1, "recFirstNextTopic": 0, "recFirstLastPosted": 0, "digestNum": 0, "recFirstTopicStatus": 0, "subject": "Hypergeometric 2F1", "yahooAlias": "grafixpl", "author": "Artur", "topicLastRecord": 162, "topicInfoStatus": 0, "recFirstTopicFirstRecord": 0, "topicStatus": 0, "email": "grafix@...", "firstRecInfoStatus": 2, "parent": 0, "recFirstTopicNextRecord": 0, "prevTopic": 154, "recFirstDigestNum": 0, "nextTopic": 0, "lastPosted": 1225116634, "date": 1225116634, "recFirstPrevTopic": 0, "topicNextRecord": 0, "recFirstTopicLastRecord": 0, "hasAttachments": 0, "threadLevel": 0, "topicPrevRecord": 0, "recFirstTopicPrevRecord": 0, "summary": "Dear Richard, Thank you for this formula!!!! That mean that roots of my quintic polynomial have also geometric interpretation! Root[4 k - k2 + 5 k2 x + (20 k -", "length": 2219, "messageId": 162, "recFirstNumRecords": 0, "topicFirstRecord": 0}
I run a computer science organization, and I know that parsing Common Crawl and extracting years' worth of WARCs is a compute-intensive and extremely frustrating task if you don't already have the infrastructure built to do so, so I imagine I may be able to save a lot of effort on this front.
Otherwise, I'm very capable when it comes to large-scale data, so please do use me as a resource if that's at all helpful.
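For concreteness, roughly how I'd scope the extraction is sketched below: query the Common Crawl CDX index per crawl to get WARC filenames and byte ranges, then range-fetch only those records instead of whole crawls. The crawl IDs and URL pattern here are placeholders, not a tested pipeline:

import json

import requests

CRAWLS = ["CC-MAIN-2017-04", "CC-MAIN-2017-09"]  # placeholder crawl IDs, not a checked list

def cc_index_hits(crawl, url_pattern):
    # Query the Common Crawl CDX index for one crawl; each line of the response is
    # a JSON object naming the WARC file plus byte offset/length of one capture.
    resp = requests.get(
        f"https://index.commoncrawl.org/{crawl}-index",
        params={"url": url_pattern, "output": "json"},
        timeout=120,
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():
        yield json.loads(line)

for crawl in CRAWLS:
    for hit in cc_index_hits(crawl, "groups.yahoo.com/*"):
        # With offset/length in hand, only the matching records need to be
        # range-fetched from data.commoncrawl.org rather than full WARC files.
        print(hit["filename"], hit["offset"], hit["length"])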
Appreciated,