> I'd prefer to not use Athena as this is a personal project
> and I'm somewhat sure that the Athena route will bankrupt me.
If done right, it should cost you no more than $0.50 to intersect
your list of URLs with one monthly crawl using Athena.
The idea is to create a second table which holds your URL list
in the form of SURT URLs
and then do a join on the SURT columns of the two tables.
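To make the join concrete, here is a rough sketch of the query, built as a Python string so it can be pasted into the Athena console or passed to whatever client you use. It assumes the columnar index is registered as "ccindex"."ccindex" and that your list lives in a table with a `url_surtkey` column; the table name `my_urls`, the function name, and the crawl label in the usage line are placeholders of mine, so adjust them to your setup. Since Athena bills by the data scanned, restricting the query to one `crawl` and `subset` partition and selecting only a few columns is presumably what keeps the cost down.

```python
import re


def build_join_query(crawl: str, url_table: str = "my_urls") -> str:
    """Return an Athena SQL string joining the CC index with a URL list table.

    `url_table` is a hypothetical table holding your URLs as SURT keys
    in a column named `url_surtkey`.
    """
    # Only accept labels shaped like CC-MAIN-YYYY-WW and plain table names,
    # since both are interpolated into the SQL text.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", url_table):
        raise ValueError(f"unexpected table name: {url_table!r}")

    return f"""
SELECT cc.url,
       cc.url_surtkey,
       cc.warc_filename,
       cc.warc_record_offset,
       cc.warc_record_length
FROM "ccindex"."ccindex" AS cc
JOIN "ccindex"."{url_table}" AS mine
  ON cc.url_surtkey = mine.url_surtkey
WHERE cc.crawl = '{crawl}'   -- one monthly crawl only
  AND cc.subset = 'warc'     -- successful fetches only
""".strip()


if __name__ == "__main__":
    # Example crawl label; replace with the crawl you want to compare against.
    print(build_join_query("CC-MAIN-2024-10"))
```

The WARC filename, offset, and length columns are only needed if you also want to fetch the records afterwards; dropping them reduces the scanned data further.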
SURT URLs are used to avoid mismatches caused by common URL variations:
https vs. http, a trailing slash, reordered URL query parameters.
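To illustrate why these variations collapse to the same key, here is a simplified sketch of a SURT-like normalization using only the Python standard library. It is not the canonicalizer used to build the index (it ignores ports, userinfo, session-ID stripping, and other rules), and the function name `simple_surt` is made up for this example. For real work you would want the same canonicalization the index uses; the `surt` Python package from the Internet Archive is one option.

```python
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Return a simplified SURT-like key for `url`.

    Approximation only: drops the scheme and a leading "www.",
    reverses the host labels, lowercases everything, strips a
    trailing slash, and sorts the query parameters.
    """
    parts = urlsplit(url.strip())

    # Scheme is dropped entirely, so http and https map to the same key.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    reversed_host = ",".join(reversed(host.split(".")))

    # A trailing slash is removed, except for the root path.
    path = parts.path.lower().rstrip("/") or "/"

    # Query parameters are sorted so their order does not matter.
    query = "&".join(sorted(parts.query.lower().split("&"))) if parts.query else ""

    key = f"{reversed_host}){path}"
    return f"{key}?{query}" if query else key


if __name__ == "__main__":
    a = simple_surt("https://www.example.com/path/?b=2&a=1")
    b = simple_surt("http://example.com/path?a=1&b=2")
    print(a)
    print(a == b)
```

Both example URLs differ in scheme, `www.` prefix, trailing slash, and parameter order, yet produce the same key, which is what makes an equality join on the SURT column work.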
I'm currently doing similar work: comparing the URL database of a
focused crawl with the CC index to evaluate
the coverage. I'll share how to do this, but I'd need a couple of days
to write it down.
A table join on the domain column is described here:
Let me know whether this draft solution sounds good to you, and ping me if
you need a quick response.