highlighting interesting items

Aaron Swartz

Jan 18, 2008, 11:41:06 AM

to view.theinfo

In many applications, you have a long list of things of which you want
to highlight some interesting ones to the user. As a concrete example,
let's imagine you have a series of photos and you want to show the
user the most interesting photos in any section.

One traditional way of doing this is to let users vote on photos, then
sort photos by the number of votes. This is the solution employed by,
for example, bash.org. Unfortunately, it has a serious flaw: the
average user looks at the things with the most votes, likes most of
them, and gives them more votes. The result is a classic Matthew
effect: whichever items happen to end up on top stay there for just
about ever. (Furthermore, if a popular site links to one particular
item on your site, everyone votes that item up, giving it an absurd
number of votes.)

One solution to this is to rank not just on votes but on recency of
those votes. This is roughly what Reddit does. This has the advantage
that there are basically new things every day, but the disadvantage
that inter-day comparisons are basically worthless. Because traffic to
the site is growing over time, new stories always tend to have more
votes than old ones; thus the all-time hits on Reddit are totally
uninteresting.

So here's my new idea and I want to hear your thoughts:

What you really want to rank things by is the probability someone will
like it, which, if we leave personalization out for the moment, is
just the percentage of people who like it. Now obviously it's
impossible to know the percentage of people who will like an item in
advance, but each vote gives you more information about that. So
instead of sorting by the number of votes, you sort by the estimated
percentage of people who like something (which you calculate from the
votes so far using Bayes' theorem).
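
As a minimal sketch of the estimate described above: assuming each vote is either a like or a dislike, and assuming a uniform Beta(1, 1) prior over the unknown like-percentage (the prior is an assumption, not something specified in this post), Bayes' theorem gives a posterior mean of (likes + 1) / (votes + 2). The function name and the sample photo counts are hypothetical.

```python
def estimated_like_rate(likes, dislikes, prior_likes=1.0, prior_dislikes=1.0):
    """Posterior mean of the like-probability under a Beta prior.

    With the default Beta(1, 1) prior (an assumption), an item with no
    votes gets 0.5, and each vote moves the estimate toward the observed
    fraction of likes.
    """
    total = likes + dislikes + prior_likes + prior_dislikes
    return (likes + prior_likes) / total


# Hypothetical photos: name -> (likes, dislikes)
photos = {
    "sunset": (90, 10),
    "kitten": (3, 0),
    "blurry": (200, 300),
}

# Sort by estimated like-percentage instead of by raw vote count.
ranked = sorted(
    photos,
    key=lambda name: estimated_like_rate(*photos[name]),
    reverse=True,
)
print(ranked)
```

Sorting by raw like counts would put "blurry" first because it has the most likes, whereas the estimate ranks it last because most people who voted on it disliked it.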

Now by default, this would just mean that new stories would tend to
settle in the middle of the pack, where they're unlikely to get voted
on. So when you're drawing top pages, randomly increase the expected
probability of items you're unsure about (i.e. have few votes). That
way, when users view the top stories, they'll also get a few
could-be-top-stories mixed in; they'll vote on those, you'll know
whether they're good or not, and that'll improve the rankings for next
time.
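
One possible way to sketch the random boost described above, assuming the same Beta(1, 1) estimate: add a random bonus that shrinks as an item collects votes. The shape of the bonus (uniform noise scaled by 1 / sqrt(votes + 1)) and the `max_boost` parameter are illustrative assumptions, not a worked-out answer to the math question.

```python
import math
import random


def posterior_mean(likes, dislikes):
    """Estimated like-probability under an assumed uniform Beta(1, 1) prior."""
    return (likes + 1) / (likes + dislikes + 2)


def exploratory_score(likes, dislikes, rng, max_boost=0.5):
    """Estimate plus a random bonus that is larger for items with few votes."""
    votes = likes + dislikes
    # The bonus is never negative and shrinks roughly as 1 / sqrt(votes).
    boost = rng.random() * max_boost / math.sqrt(votes + 1)
    return min(1.0, posterior_mean(likes, dislikes) + boost)


def top_page(items, k, rng=None):
    """Return the k item names with the highest exploratory scores."""
    rng = rng or random.Random()
    scored = {
        name: exploratory_score(likes, dislikes, rng)
        for name, (likes, dislikes) in items.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical items: name -> (likes, dislikes)
items = {"proven": (90, 10), "brand_new": (0, 0), "disliked": (10, 90)}
print(top_page(items, 2))
```

With these numbers, the unvoted item always makes the top two alongside the proven one, while the item with many dislikes never does, because its bonus is too small to make up the difference.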

Thoughts on the concept? Help with the math?

Dave Pawson

Jan 18, 2008, 11:53:45 AM

to view-t...@googlegroups.com

On 18/01/2008, Aaron Swartz <m...@aaronsw.com> wrote:
>
> In many applications, you have a long list of things of which you want
> to highlight some interesting ones to the user. As a concrete example,
> let's imagine you have a series of photos and you want to show the
> user the most interesting photos in any section.

Or abstract it to 'I've selected n items given your search criteria'.

>
> One traditional way of doing this is to let users vote on photos, then
> sort photos by the number of votes.

> it has a serious flaw: the
> average user looks at the things with the most votes, likes most of
> them, and gives them more votes. The result is a classic Matthew
> effect: whichever items happen to end up on top stay there for just
> about ever.

> So here's my new idea and I want to hear your thoughts:

>
> What you really want to rank things by is the probability someone will
> like it,

> So
> instead of sorting by the number of votes, you sort by the estimated
> percentage of people who like something (which you calculate from the
> votes so far using Bayes' theorem).

Comparing the above... it seems to me that you're going to finish up
with the same items (excepting your random insertions)?

Votes vs 'votes so far using Bayes' ?

Is there a biggish difference I'm missing Aaron?

regards

--
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk

Aaron Swartz

Jan 20, 2008, 10:42:48 AM

to view-t...@googlegroups.com

> > instead of sorting by the number of votes, you sort by the estimated
> > percentage of people who like something (which you calculate from the
> > votes so far using Bayes' theorem).
>
> Comparing the above... it seems to me that you're going to finish up
> with the same items (excepting your random insertions)?
>
> Votes vs 'votes so far using Bayes' ?
>
> Is there a biggish difference I'm missing Aaron?

There are two big differences:

1. The randomness
2. The fact that it's a percentage and not a flat number

Sorting by raw vote counts leads to the runaway Matthew effect I described,
whereas percentages cannot go above 100. And depending on whether you make it
percent-of-votes-that-are-positive or
percent-of-views-that-led-to-a-positive-vote, I think you'll see some
very different results.
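
To illustrate how the two denominators can disagree, here is a small sketch with made-up numbers. The function names and the counts are hypothetical, and it assumes views are tracked separately from votes.

```python
def percent_of_votes_positive(up, down):
    """Share of cast votes that were positive, as a percentage."""
    total = up + down
    return 100.0 * up / total if total else 0.0


def percent_of_views_positive(up, views):
    """Share of views that led to a positive vote, as a percentage."""
    return 100.0 * up / views if views else 0.0


# Hypothetical photos: (positive votes, negative votes, views)
photo_a = (9, 1, 1000)    # few people bothered to vote, but most liked it
photo_b = (60, 40, 200)   # divisive, but many viewers voted it up

for name, (up, down, views) in {"A": photo_a, "B": photo_b}.items():
    print(
        name,
        percent_of_votes_positive(up, down),
        percent_of_views_positive(up, views),
    )
```

Photo A wins on percent-of-votes-that-are-positive (90 against 60), while photo B wins on percent-of-views-that-led-to-a-positive-vote (30 against roughly 0.9), so the choice of denominator changes the ranking.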
