Intertwingling the TiddlyWiki - TF-IDF and tag inference


Rob Hoelz

Jan 21, 2019, 12:04:10 PM
to TiddlyWiki
Hi everyone (especially Jeremy and Joe) -

I finally got around to watching this talk, and I was enraptured the whole time, especially by the part about inferring tags and using TF-IDF to come up with more accurate suggestions.  Is the source code for your work freely available?  I tried my hand at tag inference using forests of decision trees a few months back, and I'd like to study alternative approaches!

Thanks,
Rob

Joe Armstrong

Jan 21, 2019, 12:33:31 PM
to TiddlyWiki
The code I wrote was a bit messy and just an experiment.
Good enough for a proof of concept, but not for production - it was just written to test a few ideas.

I don't mind sending you a private copy - but explaining how it works would be low priority.

A better idea would be for me to put it up on GitHub together with my library of Erlang code that
parses and mucks with tiddlers - I'm trying to programmatically create TWs from other data sources.

If you saw the talk you'll know that we're interested in "Communicating TWs". I can imagine TWs sending messages
to each other - but this is a long way off ...

I did make a little writeup that explains the method (enclosed). The code was just a prototype, written in Erlang - the problem at the moment is that it's not integrated in any way with a live TW. Our idea was to integrate it through a socket interface.

At the moment I'm learning the TW, so hopefully when I understand more I'll figure out how to
connect the TW to Erlang through a socket, and fun and games will follow :-)

The TF*IDF algorithm is very simple (see the writeup); most of the work is in tokenising the input
into words - from then on it's easy (in pure JS). Integrating this with the TW would then be,
as they say, "an exercise for the reader" (that's what I say when I don't know how to do something :-)
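
Something like this would do as a starting point - an untested sketch in TypeScript rather than my Erlang, with a naive tokeniser standing in for the hard part:

    // Sketch of plain TF*IDF over a collection of tiddlers (title -> text).
    // tf(t, d) = occurrences of term t in d / total terms in d
    // idf(t)   = log(N / number of tiddlers containing t)

    // Naive tokeniser - the real work is making this good.
    function tokenise(text: string): string[] {
      return text.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 1);
    }

    // Returns, per tiddler, a map of term -> TF*IDF score; the highest
    // scoring terms are the "significant" ones (tag candidates).
    function tfidf(tiddlers: Map<string, string>): Map<string, Map<string, number>> {
      const docFreq = new Map<string, number>();
      const tokens = new Map<string, string[]>();
      for (const [title, text] of tiddlers) {
        const words = tokenise(text);
        tokens.set(title, words);
        for (const w of new Set(words)) docFreq.set(w, (docFreq.get(w) ?? 0) + 1);
      }
      const n = tiddlers.size;
      const scores = new Map<string, Map<string, number>>();
      for (const [title, words] of tokens) {
        const counts = new Map<string, number>();
        for (const w of words) counts.set(w, (counts.get(w) ?? 0) + 1);
        const perTerm = new Map<string, number>();
        for (const [w, c] of counts)
          perTerm.set(w, (c / words.length) * Math.log(n / docFreq.get(w)!));
        scores.set(title, perTerm);
      }
      return scores;
    }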

Cheers

/Joe
tag_predictor.pdf

Rob Hoelz

Jan 21, 2019, 1:03:09 PM
to TiddlyWiki
Thanks, Joe!  I'll read over the PDF you sent; as far as the code goes, I think the PDF's description of the methodology should suffice.

-Rob

Rob Hoelz

Jan 21, 2019, 11:27:56 PM
to TiddlyWiki
Again, thanks for sharing, Joe!  I looked through the PDF and had a few thoughts:

  * Did you do any additional processing of the tiddler bodies, e.g. stemming, chunking into bigrams/trigrams, or stripping out various wikitext elements like URLs?  If you did, I'd be curious to hear how that affected your results!
  * During the talk, you mention the idea of an "assistant" that sits off to the side and helps you work on tiddlers as you type.  I often think it would be helpful if TiddlyWiki offered me suggestions for tiddlers that might be related to what I'm currently writing, and I think your TF-IDF "significant term" detection approach might be a step in the right direction.  Perhaps the top N TF-IDF terms for each tiddler could be encoded as a vector, and tiddlers whose vectors have the highest cosine similarity could be offered as matches (see the sketch below) - what do you think?
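
To make the second idea concrete, here's a rough sketch of the comparison step (my own illustration - it assumes each tiddler's top-N terms and scores have already been computed as a sparse term -> score map):

    // Cosine similarity between two sparse term -> TF-IDF score vectors.
    // 1 = pointing the same way (very similar), 0 = no terms in common.
    function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
      let dot = 0;
      for (const [term, wa] of a) {
        const wb = b.get(term);
        if (wb !== undefined) dot += wa * wb;
      }
      const norm = (v: Map<string, number>) =>
        Math.sqrt([...v.values()].reduce((s, x) => s + x * x, 0));
      const denom = norm(a) * norm(b);
      return denom === 0 ? 0 : dot / denom;
    }

Ranking all tiddlers by this score against the one being edited would give the "related tiddlers" list.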

-Rob

Joe Armstrong

Jan 22, 2019, 6:46:07 AM
to TiddlyWiki
YES - For a very long time I've wanted an assistant that watches what I do and helps me - this is my
ultimate goal.

I want to reduce entropy - I want to discover similar tiddlers and merge them.

I have been thinking about how to do this for 30-odd years (not for tiddlers, but for text representing ideas).

When I make a new tiddler I ask myself "Have I done this, or something like this, before?" - or "Has anybody else done this?"

I know of one good algorithm for this.

I make a new tiddler T and want to find the most similar tiddler to T in a collection of tiddlers.


This is very simple: A is similar to B if
size(compress(A ++ B)) is just a bit larger than size(compress(A)).

If size(compress(A ++ B)) is close to size(compress(A)) + size(compress(B)), then A and B are very dissimilar.

++ is concatenation.

Why is this? Compression algorithms look for similarities between different parts of a text -
they work well when they find similarities.
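
In code it might look like this - a sketch in Node-flavoured TypeScript, using zlib's deflate as the compressor (any decent compressor should behave similarly):

    import { deflateSync } from "zlib";

    // Compressed size of a text, in bytes.
    const csize = (s: string): number => deflateSync(Buffer.from(s)).length;

    // Crude normalised compression distance: near 0 = very similar,
    // near 1 = very dissimilar. ++ is just string concatenation here.
    function dissimilarity(a: string, b: string): number {
      const ca = csize(a), cb = csize(b), cab = csize(a + b);
      return (cab - Math.min(ca, cb)) / Math.max(ca, cb);
    }

To find the most similar tiddler to T you'd rank the whole collection by dissimilarity(T, candidate) - which is exactly why it's slow: every comparison is a fresh compression.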

This is a crazy good algorithm - I gave it all the paragraphs in a book I'd written, typed a new paragraph, and it ranked
any similar paragraphs it could find.

The problem is that it's very inefficient - if I had a few million tiddlers it would be way too slow.

My next idea would be to use the rsync algorithm for plagiarism detection. This
is linear in the number of characters of the new tiddler - it finds short fragments of identical text
very, very quickly - this is why universities etc. use it to detect cheating students.

So I'd propose using rsync to make a set of candidate tiddlers, then least compression difference
to rank the candidates.
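
The candidate step might look something like this - a rolling-checksum fingerprint in the spirit of rsync, not the actual rsync algorithm: hash every overlapping k-character window in one linear pass, then count shared hashes between fingerprints:

    // Rabin-Karp style rolling hash over every k-character window.
    // One linear pass; each window's hash is derived from the previous
    // one in O(1), which is the rsync trick.
    function fingerprints(text: string, k = 20): Set<number> {
      const out = new Set<number>();
      if (text.length < k) return out;
      const B = 31;
      let pow = 1; // B^(k-1) mod 2^32, for removing the outgoing character
      for (let i = 0; i < k - 1; i++) pow = (pow * B) | 0;
      let h = 0;
      for (let i = 0; i < k; i++) h = (h * B + text.charCodeAt(i)) | 0;
      out.add(h);
      for (let i = k; i < text.length; i++) {
        h = ((h - pow * text.charCodeAt(i - k)) * B + text.charCodeAt(i)) | 0;
        out.add(h);
      }
      return out;
    }

    // Number of shared window hashes - a cheap screen for candidates.
    function sharedFragments(a: Set<number>, b: Set<number>): number {
      let n = 0;
      for (const h of a) if (b.has(h)) n++;
      return n;
    }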

Could also use TF*IDF similarity to make candidates.

Ultimately I'd like to find all the paragraphs on the planet, name them by their SHA checksums,
then put them into an entropy reduction machine that finds similarities and throws out
duplicates and near-matching data.

It seems to me that intelligence is partly the ability to recognise similarities between things,
so I think an entropy reduction machine would be great.

What attracted me to the TW in the first place was the granularity of the tiddlers.

They must be not too small and not too big, and capture a single idea - there's
a Swedish word for this, "lagom". That, and transclusion to combine ideas, are
fundamental to building large structures by combining smaller things.

The problem with search engines is that you have to think up a query.

In similarity detection, what you have written becomes the query - and you ask
"what is the most similar thing you can find to <this>?"

This topic has fascinated me for years - I view entropy reduction as one of the key
unsolved problems in computer science.

Cheers

/Joe

Dave

Jan 22, 2019, 11:15:41 AM
to TiddlyWiki
Instead of comparing paragraphs or tiddlers, how hard would it be to detect unique sentences and compare them? I guess you'd almost need an AI to do that, hey?

TonyM

Jan 22, 2019, 7:19:55 PM
to TiddlyWiki
Joe,

We share the same aims, and as an experienced information/knowledge management professional I look forward to us finding effective tools to make these "inferences" using software; I am keen to contribute to this as well.

That said, given my own experience in this, I also find that every piece of data or information can be found in some context, and if we preserve this context when saving it, we capture a great deal of "metadata". In effect, this is like saying "if we know exactly how to store something, then we will most likely know how to retrieve it". This allows for algorithms that mere humans can use, but also sets us up with a richer data source that our automation can query.

One idea to keep in mind is that TiddlyWiki can readily export and import tiddlers, so it would be trivial to hand off a large number of tiddlers to a powerful knowledge processor and either alter and re-import them and/or return new tiddlers that capture the relationships found (perhaps after determining hidden relationships).

TiddlyWiki really is a platform, and it is very easy to see boxes that circumscribe its function, when in fact there are a multitude of dimensions in TiddlyWiki through which it can "look outside" its apparent boxes.

When running on top of Node.js, tiddlers are files, and I believe that if you found the right tools in JavaScript you could adapt them to operate on tiddlers even without the export/import steps. TiddlyWiki would be the portal through which you look at your data, but your data could have a life independent of how you look at it.

Keep having such conversations please.

Regards
Tony

Rob Hoelz

Jan 22, 2019, 9:55:47 PM
to TiddlyWiki
That's a neat trick involving compression, Joe - I wonder if you could adapt a locality-sensitive hash like simhash to create a specialized index for quick comparisons?

Another thing your mention of rsync reminded me of is word embeddings - in particular the latest and greatest in that field, such as fasttext and ELMo.  Both of those algorithms use a large window of character n-grams (rather than individual words), which is kind of like rsync's rolling checksum, and their resulting models would allow a statement like "the dog jumped over the fox" to score as similar to "the collie leapt over the vixen".
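
To make the simhash idea concrete, here's a back-of-the-envelope sketch over character trigrams (32 bits for brevity - a real index would use 64 bits and a better hash):

    // Simhash: hash each character trigram, sum signed bit votes, keep
    // the sign of each column. Similar texts get fingerprints that are
    // close in Hamming distance, so they can be indexed and compared fast.
    function simhash(text: string): number {
      const votes = new Array<number>(32).fill(0);
      for (let i = 0; i + 3 <= text.length; i++) {
        let h = 2166136261; // FNV-1a over the trigram's code units
        for (let j = i; j < i + 3; j++) {
          h = Math.imul(h ^ text.charCodeAt(j), 16777619);
        }
        for (let bit = 0; bit < 32; bit++) {
          votes[bit] += ((h >>> bit) & 1) ? 1 : -1;
        }
      }
      let sig = 0;
      for (let bit = 0; bit < 32; bit++) {
        if (votes[bit] > 0) sig |= 1 << bit;
      }
      return sig >>> 0;
    }

    // Small Hamming distance between fingerprints = likely similar texts.
    function hamming(a: number, b: number): number {
      let x = (a ^ b) >>> 0, n = 0;
      while (x) { n += x & 1; x >>>= 1; }
      return n;
    }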

I'm really excited to watch your approach to entropy reduction evolve over time!

-Rob

Kalmir

Jan 23, 2019, 1:28:57 AM
to TiddlyWiki
The DevonThink app should be really good at this:

"One of the key features of DevonThink Pro Office is its smart searching algorithms, its ability to suggest similar texts based on the contents of what you are looking at, etc. It does this by means of a proprietary algorithm, so I can't really tell you how it works, but just know that it does. It works best on smaller chunks of text. In this way, I was reading through a particular source from the 3 million-word-strong Taliban Sources Project database and then I clicked the "See also" button and it had found a source I would never otherwise have read on the same topic, even though it didn't even use one of the keywords I would have used to search for it. It uses semantic webs of words to figure this stuff out." - Alex Strick (https://www.alexstrick.com/blog/PhD-tools-DevonThink)
