Trending docs algorithm / sorting

137 views
Skip to first unread message

Paul Hinett

unread,
Jul 24, 2012, 4:52:19 PM7/24/12
to rav...@googlegroups.com
Hi,

I am trying refactor and simplify something in my app.
 
I have a 200,000+ audio documents, i want to display to users of my site the currently trending / hot audio documents based on a couple of variables (number of downloads + age of uploaded audio).

At the moment it's basically ordered by number of downloads it has received in the last 7 days), and how i get this is fairly complex and long winded with a console application which runs nightly to calculate the number of downloads received in 7 days).

I have read about the way 'Hacker News' does there sorting which seems relatively simple and gives a score based on the number of votes and the age of the post.

(p - 1) / (t + 2)^1.5

Description: 

Votes divided by age factor

p = votes (points) from users.
t = time since submission in hours.

p is subtracted by 1 to negate submitters vote.
age factor is (time since submission in hours plus two) to the power of 1.5.


I can't do this within my index because i would need to use DateTime.Now to get how many hours old this document is, does anyone have any suggestions on how I could work around this or if this is even possible with RavenDb.

If not, does anyone have any suggestions on how I could do this without running a console application nightly to process my audio documents in batches.

Thanks,
Paul

Ryan Heath

unread,
Jul 24, 2012, 5:16:09 PM7/24/12
to rav...@googlegroups.com
I don't think you need DateTime.Now.
You could use some Date as a starting point, say 20100101.
Then use the hours since that starting point. Then sort the docs
desc(!) using the formula.
I believe you should get the same results.

// Ryan

Paul Hinett

unread,
Jul 24, 2012, 5:23:45 PM7/24/12
to rav...@googlegroups.com
But wouldn't this store the calculated value in the index...what
happens 7 days later and that audio document still has the same score
in the index because it hasn't been re-indexed?

Matt Warren

unread,
Jul 24, 2012, 5:30:11 PM7/24/12
to rav...@googlegroups.com
@Paul

You're right, that's the problem. With Raven I can't see how you can do calculation at query time, which is what's needed here. You can only do calculation at indexing time, which is different.

Even TransformResults won't help, because it's run after the query has executed.

Ryan Heath

unread,
Jul 24, 2012, 5:31:12 PM7/24/12
to rav...@googlegroups.com
I think the trick is that the docs get a fixed 'age' value.
Newer docs always have a higher age.

Look at this example using hours since submission
1 2 3 4 5 (because new docs are always lower, but change in time)
With votes of 100 for each and a simplied formula: votes/hours we
would get these values
100 50 33 25 20

But now use the age since a fixed starting point
5 4 3 2 1 (because new docs are always higher and stay fixed in time)
With votes of 100 for each and a simplied formula: votes/hours we
would get these values
20 25 33 50 100 to get the same order as above reverse the sorting order.

I hope it makes sense ...

// Ryan

On Tue, Jul 24, 2012 at 11:23 PM, Paul Hinett

Matt Warren

unread,
Jul 24, 2012, 5:33:15 PM7/24/12
to rav...@googlegroups.com
What about doing this?

Everytime you update the upvote value, you also update the time value, i.e. time since the doc was created. AND update the results of the formula and store that in the doc (then you can sort on it in the query)


p = votes (points) from users.
t = time since submission in hours.

Then you'll be able to do the formula using t = "time between submission and most recent vote". Which is not exactly the same, but probably a close approximation?

Paul Hinett

unread,
Jul 24, 2012, 5:38:16 PM7/24/12
to rav...@googlegroups.com
Initially that sounded a great idea, but again what would happen if a
document got a very high score from lots of votes in the first day, but
then wasn't voted on for another 2 weeks, it would still keep that high
score for 2 weeks.

Ryan Heath

unread,
Jul 24, 2012, 5:40:18 PM7/24/12
to rav...@googlegroups.com
Not if an other doc is posted, because its fixed age would be always
higher than all existing docs.
I think ... I am not sure about it too, but I have feeling it could work ...

// Ryan

On Tue, Jul 24, 2012 at 11:38 PM, Paul Hinett

Paul Hinett

unread,
Jul 24, 2012, 5:41:40 PM7/24/12
to rav...@googlegroups.com
I'll do some tests and see how it looks.

Thanks!

Ryan Heath

unread,
Jul 24, 2012, 5:41:54 PM7/24/12
to rav...@googlegroups.com
Look it the other way: when no one would vote any doc, the order of
docs would not change, because all docs get older.

// Ryan

Paul Hinett

unread,
Jul 24, 2012, 5:54:14 PM7/24/12
to rav...@googlegroups.com
But wouldn't the index contain the same results, even though they are
getting older, the index doesn't actually know that.

Or am i missing the point?

Ryan Heath

unread,
Jul 24, 2012, 5:56:24 PM7/24/12
to rav...@googlegroups.com
What I meant is that is shouldnt matter, since in time the order of
all docs would not change (if no one voted)

// Ryan

On Tue, Jul 24, 2012 at 11:54 PM, Paul Hinett

Matt Warren

unread,
Jul 24, 2012, 6:02:11 PM7/24/12
to rav...@googlegroups.com
No but it would change if some docs were voted on and not others (which is also a problem with my suggestion).
And because the time is the divisor, something with a small time period (because it hadn't been updated) would end up with a large value.

Either way I'm not sure you can totally replicate the Hacker News method in Raven, so an alternative method like your is the best approach.

Ryan Heath

unread,
Jul 24, 2012, 6:55:21 PM7/24/12
to rav...@googlegroups.com
Maybe we should change the formula too ...

Something like (p - 1) * (t + 2)^0.25
p = votes
t = submitted - startingpoint.

I did some excel foo and it looks ok ...
I'm sure whether the -1 and +2 still make sense ...

// Ryan

Paul Hinett

unread,
Jul 24, 2012, 7:10:39 PM7/24/12
to rav...@googlegroups.com
At first glance and a couple of tests on the calculator this formula
seems to be good.

Do you have the graph to email over?

Thanks!

On 24 July 2012 22:41:54, Ryan Heath wrote:

Ryan Heath

unread,
Jul 25, 2012, 5:00:43 AM7/25/12
to rav...@googlegroups.com
I'm NOT whether the -1 and +2 still make sense ...

// Ryan

Ryan Heath

unread,
Jul 25, 2012, 5:02:15 AM7/25/12
to rav...@googlegroups.com
Paul,

Here is an excel file. Play with the values, the two graphs should
keep the same relative form.
Then the order would be te same too.

// Ryan
forumulafoo.xlsx

Matt Warren

unread,
Jul 25, 2012, 5:13:35 AM7/25/12
to rav...@googlegroups.com
Isn't the ranking different when you use the 2 formulas?

In the spreadsheet you sent, A has the highest value using the orig formula, but C has the highest value using the new adapted formula?

Ryan Heath

unread,
Jul 25, 2012, 5:19:42 AM7/25/12
to rav...@googlegroups.com
Yes, that's true. It isnt 100% equal, maybe the adapted formula should
be different.
Be I believe overall you should get the same results more or less.

// Ryan

Ryan Heath

unread,
Jul 25, 2012, 5:32:27 AM7/25/12
to rav...@googlegroups.com
Change the adapted forumula into something like

t * p^0.25 (first cell =B5*POWER(B2,0.25))

And it looks beter already.

Motivation: the sort order normaly depends on t (submission - startingpoint)
and it is altered slightly by p (votes)

Whether this is good enough is something Paul should decide with his data.

// Ryan

Ryan Heath

unread,
Jul 25, 2012, 5:54:03 AM7/25/12
to rav...@googlegroups.com
You can also see a nice effect when you alter cell B3 to simulate time progress.
The graphs stay relatively the same.

// Ryan

On Wed, Jul 25, 2012 at 11:13 AM, Matt Warren <matt...@gmail.com> wrote:

Kijana Woodard

unread,
Jul 25, 2012, 10:48:43 PM7/25/12
to rav...@googlegroups.com
@Ryan I happened to be looking into solr date boosting today and ran across this:

I immediately thought of this thread because it looks very similar to what you've suggested.

Speaking of boosting, couldn't we leave out the formula, calculate the fixed age in the index, and then use Boost in the Index and/or on the client to get the results the way we want?

Chris Marisic

unread,
Jul 26, 2012, 11:26:05 AM7/26/12
to rav...@googlegroups.com
Wouldn't this be something you'd want to do as a reduce index?

Since you can then pass at query time the time filter (most like you're within 2 weeks or whatnot).

For storing who does what, I'd just store whenever a user downloads or votes or whatever inside their user/account document, or create a specific container document like VoteBag, DownloadBag etc so that a user would have 1 of them, or perhaps they'd get 1 document for every 4000 or some finite number of actions so apply sane partitioning incase one of your users really goes hog wild and downloads tens of thousands of items that you aren't loading a doc with that much json.

You could then have the map part of the index flatting all of those documents into data like { userId, productId/songId etc, vote value etc, time of action } and then reduce into the results you want.

Paul Hinett

unread,
Jul 27, 2012, 6:22:16 AM7/27/12
to rav...@googlegroups.com

Thanks for the help on this. I got a little help from SO for an
algorithm which gives a score for each audio document. Newer items are
given a higher score and downloads/plays etc improve the score too,
after around 12-14 days popular items should drop off for newer audios
to get a chance at trending.

Here is the algorithm if anyone is interested:
https://gist.github.com/3187299

And you can see it in action here:
http://www.house-mixes.com/mixes/trending/

Paul

Ryan Heath

unread,
Jul 30, 2012, 3:38:15 AM7/30/12
to rav...@googlegroups.com
So you are not using nightly console app runs anymore? Is RDB capabele to reindex your votes, downloads etc in near realtime?
I'd like to know whether this approach gained you the goal(s) you were looking for. 

// Ryan

Sent from my iPhone

Paul Hinett

unread,
Jul 30, 2012, 4:57:34 AM7/30/12
to rav...@googlegroups.com
At the moment I am running this side-by-side with my console app...i
have a trending audios section as well as a popular audio section
(calculated from my console app), eventually once I am satisfied with
the algorithm i will fade out the popular section.

It seems to be handling ok and updates an audios score every time it is
played/downloaded/favorited etc. I always seem to have around 3 indexes
stale, but these are different indexes (multimap/reduce indexes for
other documents).

The score seems to be working ok at the moment, with a mixture of
fairly new mixes as well as mixes from a couple of weeks ago with very
high stats near the top of the results, i may have to adjust the
algorithm slightly over time to get the perfect results...but so far i
am happy.

Paul

On 30 July 2012 08:38:15, Ryan Heath wrote:
> So you are not using nightly console app runs anymore? Is RDB capabele
> to reindex your votes, downloads etc in near realtime?
> I'd like to know whether this approach gained you the goal(s) you were
> looking for.
>
> // Ryan
>
> Sent from my iPhone
>
> On Friday, July 27, 2012, Paul Hinett wrote:
>
>
> Thanks for the help on this. I got a little help from SO for an
> algorithm which gives a score for each audio document. Newer items
> are given a higher score and downloads/plays etc improve the score
> too, after around 12-14 days popular items should drop off for
> newer audios to get a chance at trending.
>
> Here is the algorithm if anyone is interested:
> https://gist.github.com/__3187299 <https://gist.github.com/3187299>
>
> And you can see it in action here:
> http://www.house-mixes.com/__mixes/trending/
Reply all
Reply to author
Forward
0 new messages