Hey Everyone,
So I've finally been working on this; I thought I'd share what I've done so far and get some feedback on what I'm planning.
Attached is a screenshot of the output of a small deep learning system I've put together (in Python/TensorFlow); it learns a word embedding and uses that to decide with what probability it should scite a paper, based on a user's previous scites.
In particular, for each word it looks at, it emits a value that corresponds to how "good" that word is; if lots of the words are good, it scites the paper, otherwise it ignores it. In some sense it's learning a list of keywords, but it uses a bit of context, so that, for example, "quantum group" is treated differently from "quantum <most-other-things>". In the picture, blue means "this word is interesting", and red means "this word is not interesting to you".
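To make the word-scoring idea concrete, here's a rough sketch of the kind of model I mean, in Keras/TensorFlow. The vocabulary size, embedding dimension, and the use of a small convolution window for the "bit of context" are illustrative assumptions, not my exact architecture:

    import tensorflow as tf

    VOCAB_SIZE = 20000   # assumed vocabulary size
    EMBED_DIM = 64       # assumed embedding dimension
    MAX_WORDS = 200      # assumed max words per title/abstract

    tokens = tf.keras.Input(shape=(MAX_WORDS,), dtype="int32")
    embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)

    # Each word gets a score computed from a small window around it, so that
    # e.g. "quantum" inside "quantum group" can score differently from
    # "quantum" elsewhere.
    word_scores = tf.keras.layers.Conv1D(1, kernel_size=3, padding="same")(embedded)

    # If lots of the words are "good", the paper gets a high scite probability.
    mean_score = tf.keras.layers.GlobalAveragePooling1D()(word_scores)
    scite_prob = tf.keras.layers.Activation("sigmoid")(mean_score)

    model = tf.keras.Model(tokens, scite_prob)
    model.compile(optimizer="adam", loss="binary_crossentropy")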
The screenshot shows this model being applied independently to titles (on the left) and abstracts (on the right).
The training data is just the entire SciRate database that I've "seen" over its lifetime. Because I use SciRate in a fairly dogmatic way (looking at everything), this actually works quite well.
The main technical step left is to build a joint model that looks at both titles and abstracts, like sciteProb = a * titleSciteProb + (1 - a) * abstractSciteProb, and make "a" learnable.
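Roughly something like this (a sketch only; the layer name and the sigmoid parameterisation of "a" are just one way of keeping the weight in [0, 1]):

    import tensorflow as tf

    class MixScite(tf.keras.layers.Layer):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # Unconstrained logit; sigmoid(0.0) = 0.5, so we start with an even mix.
            self.a_logit = self.add_weight(name="a_logit", shape=(), initializer="zeros")

        def call(self, inputs):
            title_prob, abstract_prob = inputs
            a = tf.sigmoid(self.a_logit)  # learnable mixing weight in [0, 1]
            return a * title_prob + (1.0 - a) * abstract_prob

    # Usage: mixed = MixScite()([title_prob_tensor, abstract_prob_tensor])

The whole thing could then be trained end-to-end with the same binary cross-entropy loss, or the two per-channel models could be frozen and only "a" fitted.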
Then we need to think about how/if we want to include this on SciRate itself. One somewhat-involved way to implement it would be this:
- Have an interface where each user can configure their own model to be learned. Things going into this are:
* Which users' data to use (I might like to learn from my own data and Aram's)
* Which arxiv categories to learn within (I may only be interested in Aram's interests inside quant-ph)
* Timescale to learn over
- Support the training of these models in a scalable way on the server; on my laptop (16 GB RAM Lenovo X1, 4th gen) training only takes 2-3 minutes, and we could probably speed this up.
- Figure out how to make the recommendations in the UI; maybe an email would be the easiest way until we figure out some spot to show the suggestions on the interface.
At this point I'm interested in thoughts on all aspects. I'm even quite happy to not deploy this to SciRate at all for the time being. I still need to compare this system with "typical" recommendation approaches (e.g. simple pre-defined keyword matching, and a bag-of-words classifier).
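For reference, the bag-of-words baseline I have in mind is roughly this (sketched with scikit-learn; the variable names are placeholders, and the actual comparison may be set up differently):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # texts: title+abstract strings; labels: 0/1 scite decisions from my history
    baseline = make_pipeline(CountVectorizer(min_df=5),
                             LogisticRegression(max_iter=1000))
    # baseline.fit(train_texts, train_labels)
    # print(baseline.score(test_texts, test_labels))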
It'll be a few weeks at least before I'm even ready to think about deploying it properly.
Let me know your thoughts. If you want more information, just let me know; at some point soon I'll write up a blog post with a lot more detail and the comparisons, so we can see if it's even worthwhile.
At the moment both models independently reproduce my own Scite-NoScite choices ~81-84% of the time; I think combining them will do a few percent better.
--
Noon