word index


Johannes Wilm

Jun 8, 2013, 2:14:48 PM
to booktype-dev, Booktype, Internal Aloha Editor Dev, Gideon Lehmann
Hey again,
on my todo list for BookJS there are two remaining items:

- cross referencing
- word index

By default, cross references would just take internal links and convert their text to a page number rather than the link text.

A word index is what many books, especially scientific ones, have at the back. It lets you see that "George Washington" is mentioned on pages 1, 2, 56-62, 73 and 159.
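
To make the entry format concrete, here is a minimal sketch of how a list of page numbers could be collapsed into a locator string like the one in the example. The function name `format_locators` and the rule that only runs of three or more consecutive pages become a range are assumptions made for this illustration, not BookJS behaviour.

```python
def format_locators(pages, min_run=3):
    """Collapse page numbers into a locator string such as '1, 2, 56-62, 73'.

    Hypothetical helper: runs of at least `min_run` consecutive pages are
    written as 'first-last'; shorter runs are listed page by page.
    """
    ordered = sorted(set(pages))
    parts = []
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the end of the current run of consecutive pages.
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[j] + 1:
            j += 1
        if j - i + 1 >= min_run:
            parts.append(f"{ordered[i]}-{ordered[j]}")
        else:
            parts.extend(str(p) for p in ordered[i:j + 1])
        i = j + 1
    return ", ".join(parts)


pages = [1, 2, *range(56, 63), 73, 159]
print(format_locators(pages))  # 1, 2, 56-62, 73, 159
```

Whether two adjacent pages should be shown as "1, 2" or "1-2" is a house-style question; `min_run` is there only so the sketch can reproduce the example above.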

I put that feature on the list. I used that for my Nicaragua book. Now I wonder though: do they really still make sense? If I needed to search the text for a word, wouldn't I instead find the electronic version of the book and do a full-text search? It's generally quite a lot of work for an indexer to index a book, and once done, the indexed words are in the way if you do a normal select-copy operation.

What are your opinions? Have you used a word index lately? Do we know of clients who use them? I myself am just not sure.

--
Johannes Wilm
BookJS Developer


skype: johanneswilm

Johannes Wilm

Jun 8, 2013, 7:52:37 PM
to booktype-dev, Booktype, Internal Aloha Editor Dev, Gideon Lehmann
Hey once more,

I asked both on Facebook and on some internal publishers' lists, and the answer seems to be that word indexes continue to be extremely important. So I will go ahead as planned.

Daniel James

Jun 10, 2013, 6:29:34 AM
to bookty...@googlegroups.com, Johannes Wilm, Booktype, Internal Aloha Editor Dev, Gideon Lehmann
Hi Johannes,

> I asked both on Facebook and some internal publishers lists and the
> answer seems to be that word indexes continue to be extremely important.

I would agree, because if we require the reader to have the electronic
version to look up the index, we make the paper version redundant.

Also, to take a long view, can we guarantee that the electronic index
will be available in 20 or 30 years when the paperback is still in some
library somewhere? Back in my college days of the early 90's, research
abstracts were on CD-ROM and today most new computers don't have a CD
drive.

Publishers in the UK at that time used Syquest or Iomega drives, and you
never see those any more. You would have trouble finding a computer with
the correct SCSI connector, even if you still had a working drive.

Great to hear these features are coming along :-)

Cheers!

Daniel

Johannes Wilm

Jun 10, 2013, 7:05:57 AM
to Daniel James, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev, Gideon Lehmann
On Mon, Jun 10, 2013 at 6:29 AM, Daniel James <daniel...@sourcefabric.org> wrote:
> Hi Johannes,
>
> > I asked both on Facebook and some internal publishers lists and the
> > answer seems to be that word indexes continue to be extremely important.
>
> I would agree, because if we require the reader to have the electronic
> version to look up the index, we make the paper version redundant.

Not necessarily. I would, for example, read the paper version of a book. When I then need to find something later on that I knew was in the book, I would look for the electronic version and search that instead. Then I could reread the passage there and possibly cite it.

> Also, to take a long view, can we guarantee that the electronic index
> will be available in 20 or 30 years when the paperback is still in some
> library somewhere? Back in my college days of the early 90's, research
> abstracts were on CD-ROM and today most new computers don't have a CD
> drive.

Yes, that is a valid argument.

> Publishers in the UK at that time used Syquest or Iomega drives, and you
> never see those any more. You would have trouble finding a computer with
> the correct SCSI connector, even if you still had a working drive.

Few people are able to find their emails from around 2000-2002, even though they were committed at the time to keeping them.

So, yes. Good point.

> Great to hear these features are coming along :-)
>
> Cheers!
>
> Daniel

Daniel James

Jun 10, 2013, 7:24:28 AM
to Johannes Wilm, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev, Gideon Lehmann
Hi Johannes,

> I would for example read the paper version of a book. When I then need
> to find something later on that I knew was in the book, I would instead
> look for the electronic version and search for that instead.

I would have to do the same, if there was no index in the paper version :-)

Given the extra cost of printing and shipping print versions, it would
be good if print versions retained as many features as possible.
Booktype print output should be at least as good as traditional books,
otherwise people may drift to electronic versions, stop producing print,
and only realise the limitations of ebooks after it's too late.

Cheers!

Daniel

Johannes Wilm

Jun 10, 2013, 7:29:20 AM
to Daniel James, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev, Gideon Lehmann
So we have to protect readers from themselves? :)

The interesting thing is that there also seems to be a lot of interest in word indexes in ebooks. I would never have guessed, but people seem to use them to browse through the contents before actually reading the book.

There is actually a draft spec for this type of index: http://www.idpf.org/epub/idx/epub-idx-20130307.html#h.6g4efj7njror
 

But notice what it says about locators in epubs: http://www.idpf.org/epub/idx/epub-idx-20130307.html#h.6g4efj7njror

"Paper books have commonly used page, section or paragraph numbers as locators.  An ebook may choose to use legacy page numbers, paragraph numbers, section numbers, simple sequential numbers, terms, icons, or anything else desired as the rendered part of the locator."


I read that as: we don't really have a clue of how to do this right. :)


> Cheers!
>
> Daniel

Daniel James

Jun 10, 2013, 7:45:01 AM
to Johannes Wilm, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev, Gideon Lehmann
Hi Johannes,

> An ebook may choose to use legacy page numbers, paragraph numbers,
> section numbers, simple sequential numbers, terms, icons, or anything
> else desired as the rendered part of the locator."
>
> I read that as: we don't really have a clue of how to do this right. :)

Legacy page numbers work if you want to look up the matching print
edition, but are a bit silly for Booktype when you can output any paper size.

The Bible already has this figured out, using chapter and verse numbers
to compare across editions. Perhaps we can reinvent the wheel :-)

Cheers!

Daniel

Johannes Wilm

Jun 10, 2013, 9:27:57 AM
to Gideon Lehmann, Daniel James, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
I agree... but in the end it will be Aco and Borko who implement this, and they should also be the ones who get to have the fun of figuring out exactly how to do it.

On Mon, Jun 10, 2013 at 8:48 AM, Gideon Lehmann <gideon....@sourcefabric.org> wrote:
> Hey Johannes and Daniel,
>
> addressing paragraphs actually sounds like a way which could be worth
> investigating.
>
> All the best Gideon
>
> -----Original Message-----
> From: Daniel James [mailto:daniel...@sourcefabric.org]
> Sent: Monday, 10 June 2013 13:45
> To: Johannes Wilm
> Cc: bookty...@googlegroups.com; Booktype; Internal Aloha Editor Dev;
> Gideon Lehmann
> Subject: Re: [booktype-dev] Re: word index

Daniel James

Jun 10, 2013, 1:32:31 PM
to Gideon Lehmann, Johannes Wilm, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
Hi Gideon,

> addressing paragraphs actually sounds like a way which could be worth
> investigating.

If we use paragraph numbers, does that mean we need to generate a new
index each time someone adds or removes a paragraph break? We might need
to regenerate the index on chapter save only, rather than with each
edit, to limit the computational expense.

Very cool to be able to have unique URLs for each paragraph
automatically, though (for uses such as academic citations). Each <p>
tag in the body text could have a randomly generated id.

It would be easier to list just the chapter name in the index, but that
wouldn't suit books with long chapters.

Also there would need to be an edit feature for the index chapter, so we
would need to figure out how to do the locking (e.g. someone is editing
the index while another person saves a chapter).

Cheers!

Daniel

Johannes Wilm

Jun 10, 2013, 2:05:02 PM
to Daniel James, Gideon Lehmann, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
On Mon, Jun 10, 2013 at 1:32 PM, Daniel James <daniel...@sourcefabric.org> wrote:
> Hi Gideon,
>
> > addressing paragraphs actually sounds like a way which could be worth
> > investigating.
>
> If we use paragraph numbers, does that mean we need to generate a new
> index each time someone adds or removes a paragraph break? We might need
> to regenerate the index on chapter save only, rather than with each
> edit, to limit the computational expense.

In a live WYSIWYG book editing environment, something that Booktype currently doesn't have and won't have according to current plans, at least not in 2013, every keystroke would have to check whether all references are still where they used to be. That also applies to the total number of pages and the placement of footnotes, margin notes and top floats.

BookJS allows that for all these other elements and will also allow it for the remaining two features. 

However, the way we use it in Booktype, it is only used for final rendering. At that stage nothing moves. The same also applies to epub creation.

 

> Very cool to be able to have unique URLs for each paragraph
> automatically, though (for uses such as academic citations). Each <p>
> tag in the body text could have a randomly generated id.

I think you would just have a mechanism, invoked at epub-compile time, that counts how many <p> elements there are before the one with the reference in it. No need to give all paragraphs IDs.
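
A minimal sketch of that counting approach, written in Python with the standard-library `html.parser` rather than the JavaScript that BookJS runs in the browser. The class name `ParagraphLocator`, the helper `paragraph_number` and the `ref-1` id are hypothetical, and the sketch assumes the referenced element always sits inside a `<p>`.

```python
from html.parser import HTMLParser


class ParagraphLocator(HTMLParser):
    """Find the 1-based number of the <p> that contains a given element id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.paragraphs_seen = 0      # <p> start tags encountered so far
        self.paragraph_number = None  # set once the target element is found

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs_seen += 1
        if self.paragraph_number is None and dict(attrs).get("id") == self.target_id:
            # Assumption: the reference sits in the most recently opened <p>.
            self.paragraph_number = self.paragraphs_seen


def paragraph_number(html, target_id):
    locator = ParagraphLocator(target_id)
    locator.feed(html)
    return locator.paragraph_number


chapter = (
    "<p>First paragraph.</p>"
    "<p>Second paragraph.</p>"
    '<p>Third, with a reference<span id="ref-1"></span>.</p>'
)
print(paragraph_number(chapter, "ref-1"))  # 3
```

Because the number is computed at compile time, nothing has to be stored in the chapter markup, but the number changes whenever a paragraph is added or removed before the reference.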
 

> It would be easier to list just the chapter name in the index, but that
> wouldn't suit books with long chapters.
>
> Also there would need to be an edit feature for the index chapter, so we
> would need to figure out how to do the locking (e.g. someone is editing
> the index while another person saves a chapter).

The index function, at least for now, will be autogeneration only. What kinds of things would you want to edit in it?
 

> Cheers!
>
> Daniel

Daniel James

Jun 11, 2013, 5:16:54 AM
to Johannes Wilm, Gideon Lehmann, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
Hi Johannes,
> In a live WYSIWYG book editing environment, something that Booktype
> currently doesn't have and won't have according to current plans, at
> least not in 2013, every key stroke would have to check whether all
> references are still where they used to be.

Can we get away with only checking references on chapter save?

> I think you would just have a mechanism that counts how many <p>
> elements there are before the one with the reference in it which would
> be invoked at epub-compile time. No need to give all paragraphs IDs.

That is simpler, but if you want web accessible references (URLs) you
have the problem of indexes getting broken between versions, whenever
paragraphs are added or removed. How about making the paragraph id a
hash of the content of that element? Then you would have a way to check
for changes, only re-indexing when it was actually needed (i.e. scan for
paragraphs which have changed).
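
A minimal sketch of the content-hash idea, assuming SHA-1 over whitespace-normalised paragraph text and a 12-character prefix. Both choices, and the function names `paragraph_id` and `changed_paragraphs`, are illustrative assumptions rather than anything Booktype does.

```python
import hashlib


def paragraph_id(text):
    """Derive a short, stable id from the paragraph's text (hypothetical scheme)."""
    normalized = " ".join(text.split())  # ignore whitespace-only edits
    return "p-" + hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


def changed_paragraphs(old_ids, paragraphs):
    """Return the paragraphs whose hash id is not in the previous id set."""
    return [p for p in paragraphs if paragraph_id(p) not in old_ids]


before = ["The poor gained buying power.", "The rich stayed the same."]
old_ids = {paragraph_id(p) for p in before}

after = ["The poor gained buying power.", "The rich stayed roughly the same."]
print(changed_paragraphs(old_ids, after))  # ['The rich stayed roughly the same.']
```

One limitation of this scheme: two paragraphs with identical text get the same id, so a real implementation would need some tie-breaker.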

If you had a chapter URL ending in a hash that no longer existed, you
could use a rewrite rule to display the chapter rather than a 404, or
even a message such as 'That reference no longer exists'. Ideally we
would combine Booktype book version URLs with these references so that
you could be sure to find a reference, as long as you were looking at
the right version of the book.
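
The fallback could look roughly like the sketch below. The function `resolve_reference`, the version labels and the paragraph ids are all made up for illustration; a real deployment would do this in the web server or application routing rather than in a dictionary lookup.

```python
def resolve_reference(versions, version, fragment):
    """Look up a paragraph by book version and paragraph id.

    Hypothetical helper: returns (text, notice). An unknown fragment yields
    no text and a notice, so the caller can show the whole chapter instead
    of a 404 page.
    """
    paragraphs = versions.get(version, {})
    if fragment in paragraphs:
        return paragraphs[fragment], None
    return None, "That reference no longer exists"


# Made-up versions and ids, purely for illustration.
versions = {
    "1.0": {"p-aaa": "First paragraph.", "p-bbb": "Second paragraph."},
    "1.1": {"p-aaa": "First paragraph.", "p-ccc": "Second paragraph, revised."},
}
print(resolve_reference(versions, "1.0", "p-bbb"))  # found in the old version
print(resolve_reference(versions, "1.1", "p-bbb"))  # gone in the new version
```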

> The index function, at least for now, will be autogeneration only. What
> kinds of things would you want to edit in it?

Editors might want to remove ordinary words which have slipped into the
index. Presumably there's a common word exclusion list, but it might not
be available for every language. Perhaps the exclusion list should be
editable from the book settings page.
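
As a rough sketch of what an editable exclusion list would do in an automatic index: the function `candidate_terms` and the tiny word list below are assumptions for illustration only, and real automatic indexers do considerably more than filter single words.

```python
import re


def candidate_terms(text, exclusion_list):
    """Return the distinct words in `text` not on the exclusion list, sorted."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    excluded = {w.lower() for w in exclusion_list}
    return sorted({w for w in words if w not in excluded})


# Hypothetical per-book setting; a real list would be far longer and per language.
exclusion_list = ["the", "in", "and", "to", "more"]

text = "In Nicaragua, the poor gained access to more buying power."
print(candidate_terms(text, exclusion_list))
# ['access', 'buying', 'gained', 'nicaragua', 'poor', 'power']
```

Editing the list from the book settings page would then amount to changing `exclusion_list` and regenerating the candidates.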

Cheers!

Daniel

Johannes Wilm

Jun 11, 2013, 6:42:40 AM
to Daniel James, Gideon Lehmann, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
On Tue, Jun 11, 2013 at 5:16 AM, Daniel James <daniel...@sourcefabric.org> wrote:
> Hi Johannes,
> > In a live WYSIWYG book editing environment, something that Booktype
> > currently doesn't have and won't have according to current plans, at
> > least not in 2013, every key stroke would have to check whether all
> > references are still where they used to be.
>
> Can we get away with only checking references on chapter save?

Well, it's done by JavaScript, and browsers are quick enough these days.

> > I think you would just have a mechanism that counts how many <p>
> > elements there are before the one with the reference in it which would
> > be invoked at epub-compile time. No need to give all paragraphs IDs.
>
> That is simpler, but if you want web accessible references (URLs) you
> have the problem of indexes getting broken between versions, whenever
> paragraphs are added or removed. How about making the paragraph id a
> hash of the content of that element? Then you would have a way to check
> for changes, only re-indexing when it was actually needed (i.e. scan for
> paragraphs which have changed).

I think we may be talking about two different things here. The indexing function as I will implement it will be similar to that of LaTeX: within the normal text, you can press a button that says something like "add index term". The user then gets to enter an index term. Within the editor it will then be represented by a green dot or something similar. See my presentations here: https://wiki.sourcefabric.org/display/Booktype/BookJS (search for "Simple word index").
 



> If you had a chapter URL ending in a hash that no longer existed, you
> could use a rewrite rule to display the chapter rather than a 404, or
> even a message such as 'That reference no longer exists'. Ideally we
> would combine Booktype book version URLs with these references so that
> you could be sure to find a reference, as long as you were looking at
> the right version of the book.

I think the New York Times has something like that, but it's something quite different and has nothing to do with what is going to be implemented with this indexing function. The hyperlinks in the ebook will likely go directly to the index element and not to the paragraph it is contained in. If a chapter or a paragraph has no index words within it, it won't be possible to link to it either.
 

> > The index function, at least for now, will be autogeneration only. What
> > kinds of things would you want to edit in it?
>
> Editors might want to remove ordinary words which have slipped into the
> index. Presumably there's a common word exclusion list, but it might not
> be available for every language. Perhaps the exclusion list should be
> editable from the book settings page.

The system doesn't invent the index based on some algorithm. It simply collects all instances of <span class="pagination-index-term" data-term="Brown, George" id="..."></span> and creates the index of that.
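
A minimal sketch of that collection step, in Python with the standard-library `html.parser` instead of the browser-side JavaScript BookJS actually uses. The class and attribute names come from the markup above; `IndexTermCollector`, `build_index` and the `idx-…` id values are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class IndexTermCollector(HTMLParser):
    """Collect data-term / id pairs from pagination-index-term spans."""

    def __init__(self):
        super().__init__()
        self.entries = defaultdict(list)  # term -> list of element ids

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "pagination-index-term" in classes:
            term = attributes.get("data-term")
            if term:
                self.entries[term].append(attributes.get("id"))


def build_index(html):
    collector = IndexTermCollector()
    collector.feed(html)
    # Sort alphabetically, ignoring case, as a printed index would.
    return sorted(collector.entries.items(), key=lambda item: item[0].lower())


chapter = (
    '<p>George Brown<span class="pagination-index-term" '
    'data-term="Brown, George" id="idx-1"></span> arrived in 1850.</p>'
    '<p>Adams<span class="pagination-index-term" '
    'data-term="Adams, John" id="idx-2"></span> met Brown'
    '<span class="pagination-index-term" data-term="Brown, George" '
    'id="idx-3"></span> later.</p>'
)
for term, ids in build_index(chapter):
    print(term, ids)
```

In the real renderer each collected id would then be turned into a page number (print) or a hyperlink target (ebook); the sketch stops at grouping ids by term.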

If the user then changes his mind and doesn't want George Brown mentioned in the index after all, he needs to go back into the chapter where he added the index term and remove it again.

> Cheers!
>
> Daniel

Daniel James

Jun 11, 2013, 6:57:03 AM
to Johannes Wilm, Gideon Lehmann, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
Hi Johannes,

> Within the normal text, you can press a button that says something like
> "add index term". The user then gets to enter an index term. Within the
> editor it will then be represented by a green dot or something similar.

Ah, I see. High quality index, but time consuming.

I was thinking it would be an automatic index of all non-common words
and short phrases, using something like
http://code.google.com/p/maui-indexer/

Cheers!

Daniel

Johannes Wilm

Jun 11, 2013, 7:08:53 AM
to Daniel James, Gideon Lehmann, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
On Tue, Jun 11, 2013 at 6:57 AM, Daniel James <daniel...@sourcefabric.org> wrote:
> Hi Johannes,
>
> > Within the normal text, you can press a button that says something like
> > "add index term". The user then gets to enter an index term. Within the
> > editor it will then be represented by a green dot or something similar.
>
> Ah, I see. High quality index, but time consuming.

Very time consuming. But that's why people have it as their job. It can take weeks or months to index a book completely. 

> I was thinking it would be an automatic index of all non-common words
> and short phrases, using something like
> http://code.google.com/p/maui-indexer/

I see. Something like that could possibly be run over the texts by the click of a button and then add the <span class="pagination-index-term"> instances. But a good index does not just link terms in the text. It may say "In Nicaragua, the poor gained access to more buying power and the rich stayed the same in the early 2000s" and link that to Nicaragua->economy->income gap->2000s . The terms "economy" and "income gap" are not mentioned within the text.
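
To illustrate the nested structure such an entry implies, here is a small sketch that stores term paths written with the "->" notation used above in a tree and prints them the way a printed index indents subentries. The functions `add_entry` and `render` and the page numbers 73 and 12 are made up for the example.

```python
def add_entry(index, path, locator):
    """Insert `locator` under a term path like 'Nicaragua->economy->2000s'."""
    levels = [part.strip() for part in path.split("->")]
    subterms = index
    entry = None
    for level in levels:
        entry = subterms.setdefault(level, {"locators": [], "subterms": {}})
        subterms = entry["subterms"]
    entry["locators"].append(locator)


def render(index, depth=0):
    """Return the index as indented lines, one per term, sorted per level."""
    lines = []
    for term in sorted(index, key=str.lower):
        entry = index[term]
        locators = ", ".join(str(loc) for loc in entry["locators"])
        lines.append("  " * depth + term + (f" {locators}" if locators else ""))
        lines.extend(render(entry["subterms"], depth + 1))
    return lines


index = {}
add_entry(index, "Nicaragua->economy->income gap->2000s", 73)
add_entry(index, "Nicaragua->economy", 12)
print("\n".join(render(index)))
```

The point of the sketch is that the path is supplied by the indexer, not extracted from the sentence, which is exactly what an automatic tool cannot know.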

In non-fiction books, the author is responsible for the index. Many "cheat" by hiring a professional to do it for them, but personally I am not sure I would trust anyone other than myself to make a good index of something I write.

> Cheers!
>
> Daniel

Daniel James

Jun 11, 2013, 7:46:05 AM
to Johannes Wilm, Gideon Lehmann, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
Hi Johannes,

> a good index does not just link terms in the text. It may say "In
> Nicaragua, the poor gained access to more buying power and the rich
> stayed the same in the early 2000s" and link that to
> Nicaragua->economy->income gap->2000s . The terms "economy" and "income
> gap" are not mentioned within the text.

I suppose you could auto-generate the index, then manually edit it to
group terms like these. The manual indexing could then be crowd-sourced
from readers etc.

As much as I appreciate a high-quality, manually curated index, there
has got to be a way of making it quicker and easier. Also, traditional
indexers did not have permalinks available to them...

> In non-fiction books, the author is responsible for the index.

I think that depends on the publisher. I wrote a book for Apress and was
never asked to create the index myself.

Cheers!

Daniel

Johannes Wilm

Jun 11, 2013, 8:04:24 AM
to Daniel James, Gideon Lehmann, bookty...@googlegroups.com, Booktype, Internal Aloha Editor Dev
On Tue, Jun 11, 2013 at 7:46 AM, Daniel James <daniel...@sourcefabric.org> wrote:
> Hi Johannes,
>
> > a good index does not just link terms in the text. It may say "In
> > Nicaragua, the poor gained access to more buying power and the rich
> > stayed the same in the early 2000s" and link that to
> > Nicaragua->economy->income gap->2000s . The terms "economy" and "income
> > gap" are not mentioned within the text.
>
> I suppose you could auto-generate the index, then manually edit it to
> group terms like these. The manual indexing could then be crowd-sourced
> from readers etc.

Possibly. I was talking to a bunch of publishing people about this, and they all seemed to agree that the word index is among the most valuable things in a book. Sales depend on whether there is an index and how good it is.

 

> As much as I appreciate a high-quality, manually curated index, there
> has got to be a way of making it quicker and easier. Also, traditional
> indexers did not have permalinks available to them...

Well, indexers still exist, and if they do their book with LaTeX, they have pretty much the same tools available that I will add to HTML-based books. It would be interesting, I think, to meet with a real indexer, or a group of them, and listen to their arguments.
 

> > In non-fiction books, the author is responsible for the index.
>
> I think that depends on the publisher. I wrote a book for Apress and was
> never asked to create the index myself.

No, no, you aren't asked to do it, but you are responsible in the sense that it's considered part of the written content and part of what your name is attached to; you are liable for its content, and so on.
 

> Cheers!
>
> Daniel