<html>
<body>
<ol>
<script>
// define the test text
var mytext="this is 'MyText', an awful piece of $%&%$§ is mytext, that's what it is.";
// show it as "1"
document.writeln("<li>", mytext, "</li>");
// change to lower case
var mylower= mytext.toLowerCase();
// show it as "2"
document.writeln("<li>", mylower, "</li>");
// get all "words"
var myarray= mylower.match(/(\w+)/g);
// show them as "3"
document.writeln("<li>",myarray,"</li>");
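// count word occurrences
// (Object.create(null) avoids clashes with inherited keys like "constructor")
var mylist= Object.create(null);
for (var i= 0; i < myarray.length; i++) {
// start at 0 for a new word; ++ on an undefined entry would give NaN
mylist[myarray[i]]= (mylist[myarray[i]] || 0) + 1;
}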
// show just the words as "4"
document.writeln("<li>",Object.keys(mylist),"</li>");
</script>
</ol>
</body>
</html>
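The page above only lists the words. As a rough sketch, the same steps can be wrapped in a function that also returns the counts and runs outside the browser. The names `countWords` and `sortedByCount` are made up for this example and are not part of TiddlyWiki.

```javascript
// Hypothetical helper: lower-case the text, split it into "words" (\w+),
// and count how often each word occurs.
function countWords(text) {
  var words = text.toLowerCase().match(/\w+/g) || []; // match() returns null if nothing matches
  var counts = Object.create(null);                   // no inherited keys to collide with
  words.forEach(function (w) {
    counts[w] = (counts[w] || 0) + 1;
  });
  return counts;
}

// Hypothetical helper: "word: n" strings, most frequent first, ties alphabetical.
function sortedByCount(counts) {
  return Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a] || (a < b ? -1 : a > b ? 1 : 0);
  }).map(function (w) {
    return w + ": " + counts[w];
  });
}

var sample = "this is 'MyText', an awful piece of $%&%$§ is mytext, that's what it is.";
var counts = countWords(sample);
console.log(sortedByCount(counts).join(", "));
```

Note that `\w` only covers ASCII letters, digits and the underscore, so "that's" is split into "that" and "s".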
So with a couple hundred PDF files, each holding 1 to 40 pages of text, I'm pushing 4 MB including TiddlyWiki, and I want to get back down to a speedy 2 MB. Did that answer your question?
You can have a look at the blog post, but the interesting stuff is in the comment section :) e.g. discussion about memory usage, lookup speed, index creation speed, search algorithms ...
Near the end of the second blog post's comment section, Mike Koss came up with a working installation (http://lookups.pageforest.com/) that works with tries (no typo).
Some links about the theoretical background (for those who are interested :)
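For those who haven't met a trie before: it stores words character by character, so words with a common prefix share the same nodes. Here is a minimal sketch of the idea. It is not the code behind lookups.pageforest.com, and all function names (`createTrie`, `insertWord`, `findNode`, `hasWord`, `wordsWithPrefix`) are made up for illustration.

```javascript
// Each node has a map of child nodes (one per next character)
// and a flag that says whether a word ends here.
function createTrie() {
  return { children: Object.create(null), isWord: false };
}

// Walk down the trie, creating nodes as needed, and mark the last one as a word end.
function insertWord(root, word) {
  var node = root;
  for (var i = 0; i < word.length; i++) {
    var ch = word.charAt(i);
    if (!node.children[ch]) {
      node.children[ch] = createTrie();
    }
    node = node.children[ch];
  }
  node.isWord = true;
}

// Return the node reached by following prefix, or null if the path does not exist.
function findNode(root, prefix) {
  var node = root;
  for (var i = 0; i < prefix.length; i++) {
    node = node.children[prefix.charAt(i)];
    if (!node) { return null; }
  }
  return node;
}

// A word is in the trie only if its path exists AND ends on a word marker.
function hasWord(root, word) {
  var node = findNode(root, word);
  return node !== null && node.isWord;
}

// Collect every stored word that starts with prefix, in alphabetical order.
function wordsWithPrefix(root, prefix) {
  var results = [];
  function walk(node, soFar) {
    if (node.isWord) { results.push(soFar); }
    Object.keys(node.children).sort().forEach(function (ch) {
      walk(node.children[ch], soFar + ch);
    });
  }
  var start = findNode(root, prefix);
  if (start) { walk(start, prefix); }
  return results;
}

var trie = createTrie();
["this", "that", "text", "mytext", "is"].forEach(function (w) {
  insertWord(trie, w);
});
console.log(hasWord(trie, "that"));                 // true
console.log(hasWord(trie, "th"));                   // false, "th" is only a prefix
console.log(wordsWithPrefix(trie, "th").join(",")); // that,this
```

The appeal for a search index is that "this" and "that" store their common "th" only once, and a prefix search just walks down a few nodes instead of scanning the whole word list.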
have fun!
mario