Is there a way to obtain only the URLs that have been crawled? For example, I'd like to use such a list to find all the URLs belonging to a given domain that match a certain regex.
Alternatively, because the files are very large, is there a web API that I can query directly, rather than downloading the files in full?
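
To make the first request concrete, here is a rough sketch of the kind of filtering I have in mind. It assumes the crawled URLs were already available as a plain iterable of strings; the function name `filter_urls` and the sample URLs are made up for illustration and are not part of any existing tool.

```python
import re
from urllib.parse import urlsplit


def filter_urls(urls, domain, pattern):
    """Yield URLs hosted on `domain` (or a subdomain of it) that match `pattern`.

    `urls` is assumed to be any iterable of absolute URL strings.
    """
    regex = re.compile(pattern)
    domain = domain.lower()
    for url in urls:
        # hostname is None for strings without a scheme/host part
        host = (urlsplit(url).hostname or "").lower()
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and regex.search(url):
            yield url


# Hypothetical sample data standing in for a list of crawled URLs.
sample = [
    "https://example.com/blog/2020/post-1",
    "https://shop.example.com/item/42",
    "https://example.org/blog/2020/post-2",
    "https://notexample.com/blog/2020/post-3",
]

matches = list(filter_urls(sample, "example.com", r"/blog/\d{4}/"))
print(matches)
```

Ideally I would run this kind of filter over a URL-only listing, or have the service apply it for me, instead of extracting the URLs from the full crawl files myself.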