So I was noodling around a bit and here's what I did:
1. split all of the PDFs out into per-page files.
2. converted all the per-page PDFs to text files.
3. wrote a Python script to do some relatively naive word indexing.
4. made it a bit less naive by skipping really common words and pretty
much anything that appears more than 500 times.
5. dumped all of this to a series of text files.
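
For anyone curious what steps 3 and 4 might look like, here is a rough
sketch, not the actual script. The stop list, the function name and
the page-label format are assumptions made for illustration; only the
500-occurrence cutoff comes from the description above.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list; the real script's list is not shown in this thread.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "is", "it", "rivendell", "bike"}


def build_index(pages, stopwords=STOPWORDS, max_count=500):
    """Build a naive word index.

    pages maps a page label (e.g. 'RR18 - pg 0011') to that page's text.
    Returns a dict mapping each kept word to a sorted list of page labels.
    """
    counts = Counter()             # total occurrences of each word
    locations = defaultdict(list)  # word -> page labels where it appears
    for label, text in pages.items():
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(words)
        for word in set(words):    # record each page once per word
            locations[word].append(label)
    # Drop stop words and anything that appears more than max_count times.
    return {
        word: sorted(labels)
        for word, labels in locations.items()
        if word not in stopwords and counts[word] <= max_count
    }
```

Feeding it a dict built from the per-page text files gives one entry
per word, which can then be written out however is convenient.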
Limits of its use:
a. it indexes single words, not phrases, so 'sam' is indexed
separately from 'hillborne'
b. the first 10-20 RRs on PDF appear to have been OCR'd, so the text
is occasionally garbled, which results in 'odd' entries.
c. a lot of 'grantisms' in use - so when he says 'pillar' and means
'hunqapillar', well - that's under 'p', not under 'h'
d. if you look for 'rivendell' or 'bike' you're not going to find
them b/c, well, those seemed silly to include for fairly obvious
reasons, I hope. :)
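
A partial workaround for limit (a), sketched under the assumption that
the index is available as a mapping from word to a list of page
labels: intersect the page lists of the individual words. The function
name is made up for illustration, and this only finds pages where both
words occur, not pages where they occur side by side.

```python
def pages_with_all(index, *words):
    """Return the page labels on which every given word appears.

    index maps a lowercase word to a list of page labels. The words
    need not be adjacent on the page, so this is weaker than a true
    phrase search.
    """
    page_sets = [set(index.get(word.lower(), ())) for word in words]
    if not page_sets:
        return []
    return sorted(set.intersection(*page_sets))
```

Calling pages_with_all(index, 'sam', 'hillborne') would then narrow
things down to the pages that mention both.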
If anyone has issues 36-40 as PDFs, I can run this across them too.
It's not a proper index, of course, but it is a heck of a start for
anyone who wants to refine it down.
Neat facts:
the word 'atlantis' first appears in RR18 - pg 0011.
'romulus' appears 25 times in total.
something like 'rambouillet' appears in a variety of interesting
spellings throughout.
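
If someone wants to fold those spelling variants together while
refining the index, one possible approach (an assumption on my part,
not something the script does) is fuzzy matching with difflib from the
Python standard library. The function name and the 0.8 cutoff are
illustrative and would need tuning against the real word list.

```python
import difflib


def spelling_variants(target, words, cutoff=0.8):
    """Return indexed words that look like misspellings of target.

    words is any iterable of indexed words; cutoff is the minimum
    similarity ratio (between 0 and 1) for a word to count as a variant.
    """
    candidates = sorted(set(words) - {target})
    if not candidates:
        return []
    matches = difflib.get_close_matches(
        target, candidates, n=len(candidates), cutoff=cutoff
    )
    return sorted(matches)
```

Running spelling_variants('rambouillet', index) over the index keys
would list the near-miss spellings so their page lists can be merged.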
-sv
Ideally I'd like to get the PDFs from Riv that have each page as text
and images, not each page as one big image.
-sv
stephen,
I know what you mean, for the first ones especially. But since about
issue 23 or so, they've been in a better PDF format, with text stored
as text.
-sv
If I can find them, I will.
-sv