Re: Search Project


Hailin

Aug 12, 2008, 2:58:12 PM
to WordPress Summer of Code 2008
Here are some discussions on the Search project:
spikeflaptor
===========

“precision” is an incredibly imprecise word. Moreover, “relevance”
means different things to different people. If you’re serious about
improving relevance, the only way to go is to assemble a test team
large enough to be representative of most of the search interests
of potential users. In our experience, this is no trivial task, as
you’ll soon realize that all your friends, coworkers, family,
acquaintances, etc., lie in a very particular ecological niche
regarding search habits.

The current search implementation at WordPress does weight different
parts of the page, such as title and body. I don’t recall if we’re indexing
comments… our fear was leaving the system very vulnerable to spamming.

Spam control is a never-ending story. It’s a war that forces continuous
adaptations on both sides, while at the same time, it’s not an end in
itself and we’re trying not to waste resources on it. This is why I
think that advanced spam control techniques are better left alone
until needed: show off all your arsenal right from the beginning, and
you’ll be attacked only once.

You’re absolutely right about phrase queries, I think we should enable
them by default.



Hailin Wu wrote
==========

Thanks for your feedback.

Let me give two simple examples to illustrate what I mean by
relevance:

http://en.search.wordpress.com/?q=hailin+wu

gives no results, although I have a few WP blogs,
and a Google search gives hailin.wordpress.com, my main blog.

http://en.search.wordpress.com/?q=matt
gives random results with matt in the post title.
A Google search gives ma.tt

I guess it’s inherently difficult to solve this issue
when the set of documents being searched is small, such as when it’s
limited to wordpress.com blogs. PageRank, which is the secret to Google’s
precision, cannot be adequately applied when the space is small.

However, it seems Google has a way to deal with this. Google sells
a site search appliance: http://www.google.com/enterprise/
I am just wondering what algorithms they use to produce more relevant
results.


Kodie

Aug 14, 2008, 3:31:48 AM
to wordpress...@googlegroups.com
Hey,

Thanks for this discussion. It shows that the most important thing in a
search engine is relevance. However, I find it hard to test alone.
Improving search results needs real blogs, real installations, and many
users. For now our search engine supports weighting words differently
depending on whether they come from the title, body, tags, etc. Moreover,
it matters how many times a word is used in a document and in how many
documents it appears, so more common words are less important.
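
To make the weighting idea above concrete, here is a rough sketch of field-weighted TF-IDF scoring in Python. It is only an illustration, not our actual code: the field weights, the smoothed IDF formula, the whitespace tokenizer, and all function names are assumptions made up for this example.

```python
import math
from collections import Counter

# Hypothetical field weights; the real values used by the engine are not
# stated in this thread.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace; a real engine would do more."""
    return text.lower().split()


def weighted_tf(doc):
    """Count each term per field, scaling the count by the field's weight."""
    tf = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for term in tokenize(doc.get(field, "")):
            tf[term] += weight
    return tf


def idf(term, docs):
    """Smoothed inverse document frequency: rarer terms score higher."""
    df = sum(1 for d in docs if term in weighted_tf(d))
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def score(query, doc, docs):
    """Sum of weighted term frequency times IDF over the query terms."""
    tf = weighted_tf(doc)
    return sum(tf[t] * idf(t, docs) for t in tokenize(query))


def search(query, docs):
    """Return ids of matching documents, best score first."""
    scored = [(score(query, d, docs), d["id"]) for d in docs]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in ranked if s > 0]


SAMPLE_DOCS = [
    {"id": 1, "title": "matt", "tags": "", "body": "a post about the weather"},
    {"id": 2, "title": "weather", "tags": "", "body": "matt wrote a post"},
    {"id": 3, "title": "recipes", "tags": "food", "body": "a post about soup"},
]

print(search("matt", SAMPLE_DOCS))  # [1, 2]: the title match outranks the body match
```

With these made-up weights, a match in the title counts three times as much as a match in the body, and a word like "post" that appears in every sample document gets a lower IDF than "matt".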

Regarding indexing comments, I think that comments are very often
valuable and can contain the information someone is looking for. I think
that antispam solutions should be used on a regular basis. Maybe the
ability to turn off comment indexing would be a nice feature to add.

I am thinking about improvements. Do you have any idea where we could
test our search engine?

Best wishes,
Kodie