All,
I finally got around to running Tika against the dataset, along with tika-eval's language id (not perfect, but decent) and, more importantly, its out-of-vocabulary (OOV) statistic, which can indicate that the electronic text as stored is likely junk.
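For anyone who has not worked with an OOV statistic before, here is a rough sketch of the general idea in Python. It is not tika-eval's implementation: the word list, the tokenizer, and the function name oov_ratio are all made up for illustration, and a real common-words list would have many thousands of entries per language.

```python
import re

# Hypothetical, tiny "common words" list used only for this illustration.
COMMON_WORDS = {"the", "of", "and", "to", "in", "is", "that", "text", "this"}


def oov_ratio(text, vocabulary=COMMON_WORDS):
    """Return the fraction of alphabetic tokens that are not in `vocabulary`.

    Returns None when the text has no alphabetic tokens at all.
    """
    # Keep runs of letters only; digits, underscores and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return None
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


clean = "This is the text that is in the PDF"
junk = "Tqxs zs fhe wrxt kkjq zz vbn"

print(oov_ratio(clean))  # low: most tokens are in the word list
print(oov_ratio(junk))   # 1.0: no token is in the word list
```

A ratio near 1.0 suggests the stored text does not look like natural language, which is the kind of signal that points at junk electronic text.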
I focused heavily on PDF-specific metadata items.
There are two main tables:
a) the "container" files, the main PDFs -- one row per URL
b) the "container" files and their embedded files -- at least one row per URL, and possibly many, with one row for each embedded file.
Let me know if you have any questions.
Best,
Tim