Need help building keyword generator


andrew.j.harrison84

May 14, 2013, 8:39:55 PM
to tiddly...@googlegroups.com
I've been using TiddlyWiki to view PDF files. I embed each PDF in an iframe (if anyone knows a better way, let me know) and then copy its text into a hidden area (i.e. /%hidden content%/) of the tiddler so that it is searchable. Because I have a lot of content, my TiddlyWiki grew drastically, which slowed it down. To reduce the searchable content, I created a textbox that removes the spaces from the hidden content. I then copy the result and paste it as one long word into the hidden content area of my PDF tiddler. Here is what I have so far:
[[textboxtoTextarea]]
/%
!show
<html><nowiki><form style="display:inline">
<input type="text" name="toggleT" value="$1" onkeypress="
return !(window.event && window.event.keyCode == 13);" onchange="
var newtext=toggleT.value.replace(/ /g,'');
toggleT.value=newtext;
document.myform.outputtext.value += newtext;
return false;
">
</form></html>
!end
%/<<tiddler {{var src='textboxtoTextarea'; src+(tiddler&&tiddler.title==src?'##info':'##show')}}
with: {{'$1'!='$'+'1'?'$1':'textboxtoTextarea'}}
{{'$2'!='$'+'2'?'$2':'clicking this link will copy this text into the textarea'}}
>>

[[My form]]
<<tiddler textboxtoTextarea with: "this is a text">>
<html>
<form name="myform">
<table border="0" cellspacing="0" cellpadding="5"><tr>
<td><textarea name="outputtext" cols="40" rows="10"></textarea></td>
</tr></table>
</form>
</html>

Then I thought I could compress it even further, so I started working out what that would take. The steps I think I need are:
1. Remove the characters ) and ( altogether. Not sure why, but it sounded good.
2. Replace all non-alphanumeric characters with a space. Maybe this could be done with the regular expression classes \W and \s.
3. Change everything to lower case, since my search box is not case sensitive. Maybe using .toLowerCase() or :%s/.*/\L&/g, but I'm not sure how that works.
4. The next step is so complex I don't even know where to start. I want to remove all duplicates and partial duplicates. I think I have to start by sorting the words by length, smallest to largest. After searching all over, the only place I found that can do this is a web page called sortmylist.com, and I'm still trying to figure out how they do it. I'm lost. Then I think I would take each word and search the entire content for it, and if it already appears anywhere, not add it to the results.
5. Finally, remove all spaces.
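
The five steps above can be sketched in plain JavaScript roughly as follows. This is only a sketch under some assumptions, not a tested TiddlyWiki plugin: the function name buildKeywordString is made up for illustration, the character class [^a-z0-9] treats everything outside ASCII letters and digits (including ö, ä, ü) as a separator, and the words are sorted longest first rather than smallest to largest, so that each shorter word can be checked against the text already kept.

```javascript
// Hypothetical helper: turns raw text into one long searchable "word".
function buildKeywordString(text) {
    var words = text
        .replace(/[()]/g, "")            // step 1: drop ( and ) altogether
        .toLowerCase()                   // step 3: lower case everything
        .replace(/[^a-z0-9]+/g, " ")     // step 2: other non-alphanumerics become a space
        .split(" ")
        .filter(function (w) { return w.length > 0; });

    // step 4: longest words first; ties are ordered alphabetically
    words.sort(function (a, b) {
        return b.length - a.length || (a < b ? -1 : a > b ? 1 : 0);
    });

    // steps 4 and 5: append a word only if the result does not already
    // contain it, and join everything without spaces
    var result = "";
    words.forEach(function (w) {
        if (result.indexOf(w) === -1) {
            result += w;
        }
    });
    return result;
}

// Example: "cat" and the repeated "the" are dropped as (partial) duplicates.
buildKeywordString("The cat (and the Cats) sat; the category!");
// returns "categorycatsandsatthe"
```

Because the words are joined without spaces, the result can also match strings that span two neighbouring words ("sand" in "catsand" above), so a search may return a few false hits.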

Can anyone offer any words of wisdom, help, or advice? Or does someone know whether this won't work in future versions of TiddlyWiki? Sometimes when you work on something too long, you find out in the end that you've either reinvented the wheel or created a giant monstrosity.

Stephan Hradek

May 15, 2013, 3:36:48 AM
to tiddly...@googlegroups.com
Sorry, but I don't completely understand what you're doing there.

But this is what I'd try (example html)
<html>
<body>
<ol>
<script>
// define the test text
var mytext="this is 'MyText', an awful piece of $%&%$§ is mytext, that's what it is.";
// show it as "1"
document.writeln("<li>", mytext, "</li>");
// change to lower case
var mylower= mytext.toLowerCase();
// show it as "2"
document.writeln("<li>", mylower, "</li>");
// get all "words"
var myarray= mylower.match(/(\w+)/g);
// show them as "3"
document.writeln("<li>",myarray,"</li>");
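// count word occurrences (start each counter at 0 so ++ does not produce NaN)
var mylist = {};
for (var i = 0; i < myarray.length; i++) {
    mylist[myarray[i]] = (mylist[myarray[i]] || 0) + 1;
}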
// show just the words as "4"
document.writeln("<li>",Object.keys(mylist),"</li>");
</script>
</ol>
</body>
</html>

Arc Acorn

May 15, 2013, 3:59:21 AM
to tiddly...@googlegroups.com

PMario

May 16, 2013, 1:59:09 PM
to tiddly...@googlegroups.com
How many PDFs do you index that way?
-m

PMario

May 16, 2013, 2:01:45 PM
to tiddly...@googlegroups.com
Oops, I posted too fast.

How much text (in pages of text) do you have per PDF?

-m

andrew.j.harrison84

May 16, 2013, 7:46:35 PM
to tiddly...@googlegroups.com

With a couple hundred PDF files, each holding 1 to 40 pages of text, I'm pushing 4 MB including TiddlyWiki, and I want to get back down to a speedy 2 MB. Did that answer your question?

PMario

May 17, 2013, 5:31:46 PM
to tiddly...@googlegroups.com, andrew.j.harrison84
On Friday, May 17, 2013 1:46:35 AM UTC+2, infernoape wrote:

With a couple hundred PDF files, each holding 1 to 40 pages of text, I'm pushing 4 MB including TiddlyWiki, and I want to get back down to a speedy 2 MB. Did that answer your question?

I waited a bit to see if someone would come up with a solution, because I only have an idea of how it could work.

About two years ago I took a closer look at text indexing / full-text searching. Since it wasn't practical for my TW use case, I stopped investigating. Reading the description of your workflow, I think it makes sense to take a closer look again. If you read on, you'll see why it wouldn't make sense for 10+ PDFs. 100++ is a different matter :)

-----

Have a look at: http://lookups.pageforest.com/
On the left side you can copy paste some text. (Text from one of your PDFs)

At the top right there is a "Build Tree" button. It creates a searchable index in which text search is astonishingly fast.
"Build Tree" also prints a compression result. For small texts (~2k) the compression rate is about 50%; for big texts (~600 kByte) it is much better.

The text search input is at the bottom left!

Click top right button "Load Dictionary" and "Build Tree"  then enter any word in the search input (bottom left) and see the magic.
eg: "wor .. ld " As you can see results are pretty fast and updated as you type.

This isn't exactly what you need, since it is a dictionary lookup, but it could be adjusted to work as a "full text search". A workflow that works for you would need a new "TrieSearchPlugin" that can handle the hidden index tiddlers. Every PDF would get its own index. That is similar to your existing workflow, but the indexes would be much smaller than the plain text and searching should be quite fast.
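
To illustrate the data structure behind that idea, here is a minimal trie sketch in JavaScript. It is not the code used by lookups.pageforest.com and not an existing TiddlyWiki plugin; the names makeTrie, insert and withPrefix are invented for this example. It shows only how words sharing a prefix share nodes and how a prefix lookup walks the tree, and it does not attempt the compact encoding that produces the compression reported by the demo.

```javascript
// Hypothetical minimal trie: each node maps one character to a child node.
function makeTrie() {
    return { children: {}, isWord: false };
}

// Add a word, creating nodes only for characters not yet in the tree.
function insert(root, word) {
    var node = root;
    for (var i = 0; i < word.length; i++) {
        var ch = word.charAt(i);
        if (!node.children[ch]) {
            node.children[ch] = makeTrie();
        }
        node = node.children[ch];
    }
    node.isWord = true;
}

// Return all stored words that start with the given prefix, alphabetically.
function withPrefix(root, prefix) {
    var node = root;
    for (var i = 0; i < prefix.length; i++) {
        node = node.children[prefix.charAt(i)];
        if (!node) {
            return [];
        }
    }
    var found = [];
    (function collect(n, soFar) {
        if (n.isWord) {
            found.push(soFar);
        }
        Object.keys(n.children).sort().forEach(function (ch) {
            collect(n.children[ch], soFar + ch);
        });
    })(node, prefix);
    return found;
}

var index = makeTrie();
["world", "word", "work", "wiki"].forEach(function (w) { insert(index, w); });
withPrefix(index, "wor"); // returns ["word", "work", "world"]
```

A prefix lookup only visits the nodes along the typed characters plus the matching subtree, which is why results can be updated as you type.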

All components are open source, so it would be possible to integrate them into a TW. IMO the problem with the library is that it isn't ready to be used as a TW plugin. Some heavy refactoring and some adjustments would be necessary. E.g. it can only handle English text well, because öäü and the like are ignored, and IMO it can't handle numbers, e.g. 2013.

It can't handle typos, so some type of "fuzzy search" would be cool, which would need more pre-processing ....

----------- tl;dr
The background

John Resig (inventor of jQuery) blogged about a dictionary lookup algorithm in 2011.

You can have a look at the blog post, but the interesting stuff is in the comment section :) e.g. discussion about memory usage, lookup speed, index creation speed, search algorithms ....

Near the end of the second blog post's comment section, Mike Koss came up with a working installation (http://lookups.pageforest.com/) that works with tries (no typo).

Some links about the theoretical background (for those who are interested :)

have fun!

mario









