Feedback on the search project

Hailin

Jul 27, 2008, 9:38:33 PM
to WordPress Summer of Code 2008
The current implementation indexes blog content (post_title, comments,
post_content, etc.) by unique words, and stores the words and documents
in additional database tables for search querying. That meets the
requirements of the original proposal.
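
To make the indexing step concrete, here is a minimal sketch of a word index ranked by plain frequency. It is not the project's actual code: the in-memory dicts stand in for the additional database tables, and the function names and sample posts are assumptions for illustration only.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(posts):
    """Map each unique word to {post_id: occurrence count}.

    `posts` is a hypothetical dict of post_id -> {field name: text}.
    """
    index = defaultdict(dict)
    for post_id, fields in posts.items():
        for text in fields.values():
            for word in tokenize(text):
                index[word][post_id] = index[word].get(post_id, 0) + 1
    return index

def search(index, query):
    """Rank posts by the total frequency of the query words (no weighting)."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for post_id, count in index.get(word, {}).items():
            scores[post_id] += count
    return sorted(scores, key=lambda pid: (-scores[pid], pid))

# Made-up sample data; the field names follow the ones mentioned above.
posts = {
    1: {"post_title": "Internet in China",
        "post_content": "Some Internet stats in China."},
    2: {"post_title": "Travel notes",
        "post_content": "car car car car car"},
}
index = build_index(posts)
```

Because the score is only a word count, a post that repeats a word many times always wins for that word, which is the weakness discussed next.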

The limitation I see in this search project, and in other blog search
approaches such as Flaptor’s on wordpress.com, is the lack of
accuracy.

Search in the early days of the Internet basically used a similar
approach: index by keywords and rank by word frequency. So it was
very inaccurate.
Spammers would put lots of tiny text such as “car car car car car”
around a sex website, so searching for “car” returned a bunch of spam
websites.

The Google guys did some very simple things to improve accuracy.
For example, they used more contextual information to improve
relevancy, such as giving more weight to page titles, taking font size
into consideration, etc.
These simple techniques can be incorporated into the current search
project.
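
As a rough sketch of what giving more weight to titles could look like, the snippet below multiplies each hit by a per-field weight. The weights and the sample posts are made up for illustration; they are not values from the project or from any search engine.

```python
import re

# Hypothetical weights: a hit in the title counts five times a hit in the body.
FIELD_WEIGHTS = {"post_title": 5.0, "post_content": 1.0, "comments": 0.5}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def weighted_score(post, query):
    """Sum (field weight x number of query-word hits) over all fields."""
    terms = set(tokenize(query))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokenize(post.get(field, ""))
        score += weight * sum(1 for w in words if w in terms)
    return score

# Made-up posts: one stuffs the keyword into the body, one has it in the title.
spam = {"post_title": "Cheap deals", "post_content": "car car car car"}
real = {"post_title": "Buying a used car",
        "post_content": "What to check before you buy."}
```

With these weights `weighted_score(real, "car")` is 5.0 and `weighted_score(spam, "car")` is 4.0, so the title hit outranks four repetitions in the body. Field weights alone do not stop keyword stuffing (a fifth repetition would tie), so in practice the per-field count would probably also need a cap or damping.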

Ex: I wrote a post titled “Internet in China”, which describes some
Internet stats in China.

And suppose another post contains the following sentence:

“Internet connections were not very good when I was traveling; yet the
food was just delicious in China”

So searching for “internet in china” should give more weight to my
first post, because the words are in the title, the post is linked
from other posts, and “Internet” and “China” are adjacent to each
other.
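
One possible way to measure how close the query words are to each other is the length of the shortest window of text that contains all of them. The sketch below is only an illustration of that idea, not part of the current project, and `min_span` is a made-up name.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_span(text, query):
    """Length in words of the shortest window containing every query word.

    Returns None when the query is empty or a query word is missing.
    """
    words = tokenize(text)
    needed = set(tokenize(query))
    if not needed:
        return None
    counts = Counter()  # query words inside the current window
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in needed:
            counts[word] += 1
        # Shrink from the left while the window still covers every query word.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            leftmost = words[left]
            if leftmost in needed:
                counts[leftmost] -= 1
                if counts[leftmost] == 0:
                    del counts[leftmost]
            left += 1
    return best

title = "Internet in China"
sentence = ("Internet connections were not very good when I was traveling; "
            "yet the food was just delicious in China")
```

For the query “internet in china”, `min_span(title, ...)` is 3 while `min_span(sentence, ...)` is 18, so a ranking that rewards a small span would prefer the first post. Note that this needs word positions at query time, which a plain word-count index does not keep.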

Among these techniques, the PageRank algorithm was the most effective.
It’s an old concept in academia: when one looks for research papers on
a subject, one starts with the reference lists at the end of the
papers. If many authors, especially well-known authors, reference a
particular paper, that one is likely the best result.
When the sampling space is enlarged to the whole Internet, it gives
amazingly accurate results.
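
For reference, the idea can be sketched as a small power iteration over a link graph. This is a textbook-style simplification, not Google's implementation; the damping factor of 0.85 is the commonly quoted value, and the post names in the graph are made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of outgoing links."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in links.items():
            if not targets:
                continue
            # Each node passes an equal share of its rank to every link target.
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph between posts on a single blog.
links = {
    "post-a": ["internet-in-china"],
    "post-b": ["internet-in-china"],
    "post-c": ["internet-in-china", "post-a"],
    "internet-in-china": [],
}
ranks = pagerank(links)
```

Here “internet-in-china” ends up with the highest rank because three posts link to it. With only a handful of posts and links, though, the ranks say very little, which is the sampling-space problem described below.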

I think blog search is limited precisely by the lack of ranking
algorithms that stay effective without a big enough sampling space. We
need to address or solve this problem in order to design an effective
search algorithm for blogs.

Kodie

Jul 28, 2008, 1:55:32 AM
to wordpress...@googlegroups.com
Hello

Hailin wrote:
> The current implementation indexes blog content (post_title, comments,
> post_content, etc.) by unique words, and stores the words and documents
> in additional database tables for search querying. That meets the
> requirements of the original proposal.
>
> The limitation I see in this search project, and in other blog search
> approaches such as Flaptor’s on wordpress.com, is the lack of
> accuracy.
>
> Search in the early days of the Internet basically used a similar
> approach: index by keywords and rank by word frequency. So it was
> very inaccurate.
> Spammers would put lots of tiny text such as “car car car car car”
> around a sex website, so searching for “car” returned a bunch of spam
> websites.

That's a very good point.

>
> The Google guys did some very simple things to improve accuracy.
> For example, they used more contextual information to improve
> relevancy, such as giving more weight to page titles, taking font size
> into consideration, etc.
> These simple techniques can be incorporated into the current search
> project.

Yeah, I think we could do that simply. Taking word positions in the
text into account is far more difficult. For a start, the size of the
index scares me a bit if we also store word positions. Using database
tables probably wouldn't be a good idea in that case.

>
> Ex: I wrote a post titled “Internet in China”, which describes some
> Internet stats in China.
>
> And suppose another post contains the following sentence:
>
> “Internet connections were not very good when I was traveling; yet the
> food was just delicious in China”
>
> So searching for “internet in china” should give more weight to my
> first post, because the words are in the title, the post is linked
> from other posts, and “Internet” and “China” are adjacent to each
> other.
>

That's very true.

> Among these techniques, the PageRank algorithm was the most effective.
> It’s an old concept in academia: when one looks for research papers on
> a subject, one starts with the reference lists at the end of the
> papers. If many authors, especially well-known authors, reference a
> particular paper, that one is likely the best result.
> When the sampling space is enlarged to the whole Internet, it gives
> amazingly accurate results.

Do you think we could use pingbacks for it? However, Pingback isn't
used by all web applications, so it would give only partial
information about links. We could use referrers, but this may not be a
good idea because of spam or faked referrers.

>
> I think blog search is limited precisely by the lack of ranking
> algorithms that stay effective without a big enough sampling space. We
> need to address or solve this problem in order to design an effective
> search algorithm for blogs.
>

Improving the algorithm seems to be the top priority now. Does anyone
have other ideas?

Best wishes,
Kodie
