Rivendell Reader Index

Seth Vidal

May 2, 2011, 2:22:58 AM
to rbw-owne...@googlegroups.com
I was reading the knothole today and Grant was talking about an index
to the Rivendell Readers. I got most of the readers on pdf as a
solstice present a couple of years back.

So I was noodling around a bit and here's what I did:

1. split all of the pdfs out into per-page output
2. converted all the per-page pdfs to text files.
3. wrote a python script to do some relatively naive word indexing
4. enhanced the naivete a bit to avoid really common words and pretty
much anything that appears more than 500 times.
5. dumped all of this to a series of text files.
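
A rough sketch of what steps 3-5 could look like, not the actual script.
Assumptions: the per-page text is already loaded as (issue, page, text)
tuples, and the names build_index, format_index and STOP_WORDS are made up
for illustration. The real skip list isn't given above beyond 'rivendell'
and 'bike'.

```python
import re
from collections import defaultdict

# Hypothetical skip list; only 'rivendell' and 'bike' are named in the post.
STOP_WORDS = {"the", "and", "a", "of", "to", "rivendell", "bike"}
MAX_OCCURRENCES = 500  # drop anything that appears more than 500 times


def build_index(pages, stop_words=STOP_WORDS, max_occurrences=MAX_OCCURRENCES):
    """Build a naive word index.

    pages: iterable of (issue, page_number, text) tuples.
    Returns {word: [(issue, page_number), ...]}, one entry per occurrence.
    """
    index = defaultdict(list)
    for issue, page_number, text in pages:
        # Word-separated, not phrase-aware: 'sam' and 'hillborne' are separate.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stop_words:
                index[word].append((issue, page_number))
    # Second pass: throw away words that are too common to be useful.
    return {w: locs for w, locs in index.items() if len(locs) <= max_occurrences}


def format_index(index):
    """Render the index as sorted text, one word per line."""
    lines = []
    for word in sorted(index):
        refs = sorted(set(index[word]))  # list each page once per word
        pages = ", ".join(f"RR{issue} pg {page:04d}" for issue, page in refs)
        lines.append(f"{word}: {pages}")
    return "\n".join(lines)
```

Writing the format_index output to one file per starting letter would give
the "series of text files" from step 5.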


Limits of its use:
a. it's word-separated not 'phrase' so 'sam' is separate from 'hillborne'
b. the first 10-20 RR on pdf appear to be ocr'd in. So the text is
occasionally garbled which results in 'odd' things.
c. a lot of 'grantisms' in use - so when he says 'pillar' and means
'hunqapillar', well - that's under 'p' not under 'h'
d. if you look for 'rivendell' or 'bike' you're not going to find it
b/c, well, that seemed silly to include for fairly obvious reasons, I
hope. :)


If anyone has 36-40 in a pdf I can run this across them too.

It's not a proper index, of course, but it is a heck of a start for
anyone who wants to refine it down.

Neat facts:
the first time the word 'atlantis' appears is RR18, pg 0011.

romulus appears 25 times in total.

something like 'rambouillet' appears in a variety of interesting
spellings throughout.
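
For anyone who wants to pull facts like these out of an index, here is a
small sketch. It assumes the index is a dict mapping each word to a list of
(issue, page) tuples, one per occurrence; that layout and the function names
are assumptions, not how the posted files are structured. The spelling
lookup uses difflib from the Python standard library, and the 0.8 cutoff is
an arbitrary starting point.

```python
import difflib


def first_appearance(index, word):
    """Earliest (issue, page) where the word occurs, or None if absent."""
    locations = index.get(word.lower(), [])
    return min(locations) if locations else None


def total_count(index, word):
    """Total number of occurrences recorded for the word."""
    return len(index.get(word.lower(), []))


def spelling_variants(index, word, cutoff=0.8):
    """Indexed words that look like near-misses of the given spelling."""
    candidates = [w for w in index if w != word.lower()]
    return difflib.get_close_matches(word.lower(), candidates, n=10, cutoff=cutoff)


# Made-up sample data, just to show the calls.
sample = {
    "atlantis": [(20, 3), (18, 11)],
    "rambouillet": [(25, 4)],
    "rambouilet": [(12, 7)],
    "romulus": [(26, 9), (27, 1)],
}
print(first_appearance(sample, "atlantis"))
print(total_count(sample, "romulus"))
print(spelling_variants(sample, "rambouillet"))
```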


-sv

Seth Vidal

May 2, 2011, 2:23:54 AM
to rbw-owne...@googlegroups.com

might be nice if I sent a link to the results, huh?

http://sethdot.org/~skvidal/misc/RR-index/

-sv

Justin August

May 2, 2011, 6:49:54 AM
to RBW Owners Bunch
Nrrrrrrrrrrrrdz. Was going to do the same but had no PDFs.

Stephen S

May 2, 2011, 9:44:26 AM
to RBW Owners Bunch

I scanned 36-40. How can I help?

Seth Vidal

May 2, 2011, 9:46:21 AM
to rbw-owne...@googlegroups.com
On Mon, May 2, 2011 at 9:44 AM, Stephen S <elph...@gmail.com> wrote:
>
> I scanned 36-40. How can I help?
>

Ideally I'd like to get the pdfs from riv that have each page as text
and images, not each page as one big image.

-sv

Stephen S

May 2, 2011, 10:04:24 AM
to RBW Owners Bunch
Ok, that's fine. I have them in PDF if you decide you want them. They
aren't OCR'd at all; they are just big images, but to be honest it
was a lot easier to read those than the ones from Riv. I don't have to
try and refigure out words that the OCR got wrong.

Stephen

Seth Vidal

May 2, 2011, 11:07:35 AM
to rbw-owne...@googlegroups.com
On Mon, May 2, 2011 at 10:04 AM, Stephen S <elph...@gmail.com> wrote:
> Ok thats fine. I have them in PDF if you decide you want them. They
> aren't OCR'd at all they are just as big images but to be honest it
> was a lot easier to read those than the ones from Riv. I don't have to
> try and refigure out words that the OCR got wrong.
>

stephen,
I know what you mean for the first ones especially. But since about
issue 23 or so - they've been in a better format pdf which has text as
text.

-sv

islaysteve

May 2, 2011, 7:17:11 AM
to RBW Owners Bunch
Nice! I hope you are able to complete the rest of the issues.

Seth Vidal

May 2, 2011, 12:45:20 PM
to rbw-owne...@googlegroups.com
On Mon, May 2, 2011 at 7:17 AM, islaysteve <alki...@verizon.net> wrote:
> Nice!  I hope you are able to complete the rest of the issues.
>

if I can find them - I will.

-sv
