YES - For a very long time I've wanted an assistant that watches what I do and helps me - this is my
ultimate goal.
I want to reduce entropy - to discover similar tiddlers and merge them.
I have been thinking about how to do this for thirty-odd years (not for tiddlers, but for text representing ideas).
When I make a new tiddler I ask myself "Have I done this, or something like this, before?" - or "has anybody else done this?"
I know of one good algorithm for this.
I make a new tiddler T and want to find the most similar tiddler to T in a collection of tiddlers.
This is very simple: A is similar to B if
size(compress(A ++ B)) is only a little larger than size(compress(A)).
If size(compress(A ++ B)) is close to size(compress(A)) + size(compress(B)) then A and B are very dissimilar.
(++ is concatenation.)
Why is this? Compression algorithms look for similarities between different parts of a text -
they work well when they find them, so a similar B adds very few bytes once A has been seen.
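The idea above can be sketched in a few lines of Python using zlib (zlib is just a stand-in here; any compressor with a reasonable window would do, and the function names are my own):

```python
import zlib

def csize(text: str) -> int:
    """Compressed size of a text in bytes."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def similarity_cost(a: str, b: str) -> int:
    """Extra bytes B costs once A has been seen.
    A small extra cost means B is similar to A."""
    return csize(a + b) - csize(a)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumped over a lazy dog"
c = "completely unrelated text about gardening in winter"

# b costs fewer extra bytes than c when appended to a
print(similarity_cost(a, b) < similarity_cost(a, c))
```

Ranking a collection is then just sorting it by similarity_cost against the new text.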
This is a crazy good algorithm - I gave it all the paragraphs in a book I'd written, typed a new paragraph, and it ranked
any similar paragraphs it could find.
The problem is that it's very inefficient - it compresses the new tiddler against every existing one, so with a few million tiddlers it would be way too slow.
My next idea would be to use the rsync algorithm for plagiarism detection. This
is linear in the number of characters in the new tiddler - it finds short fragments of identical text
very, very quickly - this is why universities etc. use it to detect cheating students.
So I'd propose using rsync to make a set of candidate tiddlers, then least compression difference
to rank the candidates.
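The two-stage pipeline could look roughly like this: fixed-length fragment hashes (a simplification of rsync's rolling checksums) nominate candidates, then the compression test ranks them. The fragment length K and all names here are my own choices, not a spec:

```python
import zlib
from collections import defaultdict

K = 16  # fragment length in characters (an assumption; tune to taste)

def fragments(text):
    """All K-character fragments of a text."""
    return {text[i:i + K] for i in range(len(text) - K + 1)}

def build_index(tiddlers):
    """Map each fragment hash to the set of tiddlers containing it."""
    index = defaultdict(set)
    for name, text in tiddlers.items():
        for frag in fragments(text):
            index[hash(frag)].add(name)
    return index

def csize(text):
    return len(zlib.compress(text.encode("utf-8"), 9))

def most_similar(new_text, tiddlers, index):
    """Stage 1: nominate candidates sharing a fragment.
    Stage 2: rank by extra compressed bytes each candidate costs."""
    candidates = set()
    for frag in fragments(new_text):
        candidates |= index.get(hash(frag), set())
    base = csize(new_text)
    return sorted(candidates,
                  key=lambda n: csize(new_text + tiddlers[n]) - base)
```

The index is built once; each new tiddler only touches the buckets its own fragments hash into, so the expensive compression step runs on a handful of candidates instead of millions.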
Could also use TF-IDF similarity to make candidates.
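For comparison, here's a bare-bones TF-IDF cosine similarity in pure Python (a real system would use an inverted index or a library such as scikit-learn; this is just to show the shape of the idea):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a {word: tf * idf} vector."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(word for doc in tokenised for word in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenised:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse word vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

TF-IDF only sees shared vocabulary, not shared phrasing, so it makes a decent coarse filter but a poor final ranking - which is why the compression test stays as stage two.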
Ultimately I'd like to find all the paragraphs on the planet, name them by their SHA checksums,
then put them into an entropy reduction machine that finds similarities and throws out
duplicates and near-matching data.
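Naming paragraphs by checksum already gives the exact-duplicate half for free - content-addressed storage collapses identical texts automatically (near-duplicates would still need a similarity pass). A toy version, with names of my own choosing:

```python
import hashlib

def name(paragraph):
    """Content-address a paragraph by the SHA-256 of its text."""
    return hashlib.sha256(paragraph.strip().encode("utf-8")).hexdigest()

def deduplicate(paragraphs):
    """Store paragraphs under their checksum names; exact duplicates vanish."""
    return {name(p): p.strip() for p in paragraphs}

store = deduplicate([
    "Compression finds similarities.",
    "Compression finds similarities.",   # exact duplicate, same checksum
    "Transclusion combines small ideas.",
])
print(len(store))  # two distinct paragraphs remain
```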
It seems to me that intelligence is partly the ability to recognise similarities between things
so I think an entropy reduction machine would be great.
What attracted me to TW in the first place was the granularity of the tiddlers.
They must be not too small and not too big, and capture a single idea - there's
a Swedish word for this, "lagom". That, and transclusion to combine ideas, are
fundamental to building large structures by combining smaller things.
The problem with search engines is that you have to think up a query.
In similarity detection, what you have written becomes the query - and you ask
"what is the most similar thing you can find to <this>?"
This topic has fascinated me for years - I view entropy reduction as one of the key
unsolved problems in computer science.
Cheers
/Joe