Hey Everyone,
So I've finally been working on this; I thought I'd share what I've done so far and get some feedback on what I'm planning.
Attached is a screenshot of the output of a small deep learning system I've put together (in Python/TensorFlow); it learns a word embedding and uses that to decide with what probability it should scite a paper, based on a user's previous scites.
In particular, for each word it looks at, it emits a value that corresponds to how "good" that word is; if lots of the words are good, it scites the paper, otherwise it ignores it. In some sense it's learning a list of keywords, but it uses a bit of context, so that, for example, "quantum group" is treated differently from "quantum <most-other-things>". In the picture, blue means "this word is interesting", and red means "this word is not interesting to you".
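To make the word-scoring idea concrete, here's a rough sketch of the kind of model I mean, in Keras/TensorFlow. The vocabulary size, embedding dimension, and the use of a small convolution window for the "bit of context" are illustrative assumptions, not my exact architecture:

    import tensorflow as tf

    VOCAB_SIZE = 20000   # assumed vocabulary size
    EMBED_DIM = 64       # assumed embedding dimension
    MAX_WORDS = 200      # assumed max words per title/abstract

    tokens = tf.keras.Input(shape=(MAX_WORDS,), dtype="int32")
    embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)

    # Each word gets a score computed from a small window around it, so that
    # e.g. "quantum" inside "quantum group" can score differently from
    # "quantum" elsewhere.
    word_scores = tf.keras.layers.Conv1D(1, kernel_size=3, padding="same")(embedded)

    # If lots of the words are "good", the paper gets a high scite probability.
    mean_score = tf.keras.layers.GlobalAveragePooling1D()(word_scores)
    scite_prob = tf.keras.layers.Activation("sigmoid")(mean_score)

    model = tf.keras.Model(tokens, scite_prob)
    model.compile(optimizer="adam", loss="binary_crossentropy")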
The screenshot shows this model being applied independently to titles (on the left) and abstracts (on the right).
The training data is just the entire SciRate database that I've "seen" over its lifetime. Because I use SciRate in a fairly dogmatic way (looking at everything), this actually works quite well.
The main technical step left is to build a joint model that looks at both titles and abstracts, like sciteProb = a * titleSciteProb + (1 - a) * abstractSciteProb, and make "a" learnable.
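Roughly something like this (a sketch only; the layer name and the sigmoid parameterisation of "a" are just one way of keeping the weight in [0, 1]):

    import tensorflow as tf

    class MixScite(tf.keras.layers.Layer):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # Unconstrained logit; sigmoid(0.0) = 0.5, so we start with an even mix.
            self.a_logit = self.add_weight(name="a_logit", shape=(), initializer="zeros")

        def call(self, inputs):
            title_prob, abstract_prob = inputs
            a = tf.sigmoid(self.a_logit)  # learnable mixing weight in [0, 1]
            return a * title_prob + (1.0 - a) * abstract_prob

    # Usage: mixed = MixScite()([title_prob_tensor, abstract_prob_tensor])

The whole thing could then be trained end-to-end with the same binary cross-entropy loss, or the two per-channel models could be frozen and only "a" fitted.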
Then we need to think about how/if we want to include this on SciRate itself. One somewhat-involved way to implement it would be this:
- Have an interface where each user can configure their own model to be learned. Things going into this are:
* Which users' data to use (I might like to learn from my own data and Aram's)
* Which arxiv categories to learn within (I may only be interested in Aram's interests inside quant-ph)
* Timescale to learn over
- Support the training of these models in a scalable way on the server; on my laptop (16 GB RAM Lenovo X1, 4th gen) training only takes 2-3 minutes, and we could probably speed this up.
- Figure out how to make the recommendations in the UI; maybe an email would be the easiest way until we figure out some spot to show the suggestions on the interface.
At this point I'm interested in thoughts on all aspects. I'm even quite happy to not deploy this to SciRate at all for the time being. I still need to compare this system with "typical" recommendation approaches (e.g. simple pre-defined keyword matching, and a bag-of-words classifier).
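For reference, the bag-of-words baseline I have in mind is roughly this (sketched with scikit-learn; the variable names are placeholders, and the actual comparison may be set up differently):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # texts: title+abstract strings; labels: 0/1 scite decisions from my history
    baseline = make_pipeline(CountVectorizer(min_df=5),
                             LogisticRegression(max_iter=1000))
    # baseline.fit(train_texts, train_labels)
    # print(baseline.score(test_texts, test_labels))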
It'll be a few weeks at least before I'm even ready to think about deploying it properly.
Let me know your thoughts. If you want more information, just let me know; at some point soon I'll write up a blog post with a lot more detail and the comparisons, so we can see if it's even worthwhile.
At the moment both models independently reproduce my own Scite-NoScite choices ~81-84% of the time; I think combining them will do a few percent better.
--
Noon