Hailin wrote:
> The current implementation indexes blog content (post_title, comments,
> post_content, etc.) by unique words, and stores words and documents
> in additional database tables for search querying. That meets the
> requirement of the original proposal.
>
> The limitation I see in this search project, and in other blog search
> approaches such as Flaptor’s on wordpress.com, is the lack of
> accuracy.
>
> Search in the early days of the Internet basically used a similar
> approach: index by keywords, and rank by word frequency. So it was
> very inaccurate.
> And spammers would put lots of tiny text such as “car car car car car”
> around sex websites. So searching for “car” gave a bunch of spam
> websites.
That's a very good point.
>
> The Google guys did some very simple things to improve accuracy.
> For example, they used more contextual information to improve relevancy,
> such as giving more weight to page titles, taking font size into
> consideration, etc.
> These simple techniques can be incorporated into the current search
> project.
Yeah, I think we could do that fairly simply. Taking word positions in
the text into account is far more difficult. For a start, the size of
the index scares me a bit if we also store word positions. Using
database tables probably wouldn't be a good idea in that case.
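
To make the title weighting concrete, here is a rough sketch of what I
have in mind. It is only an illustration: the weights are guesses that
would need tuning, `score` and `tokenize` are names I made up, and it
works on plain dicts rather than on our index tables.

```python
from collections import Counter
import re

# Hypothetical weights: a hit in the title counts more than a hit in the
# body or in the comments. The numbers are guesses, not measured values.
FIELD_WEIGHTS = {"post_title": 5.0, "post_content": 1.0, "comments": 0.5}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())

def score(post, query):
    """Sum the weighted frequencies of the query words over all fields."""
    terms = set(tokenize(query))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(post.get(field, "")))
        total += weight * sum(counts[t] for t in terms)
    return total

posts = [
    {"post_title": "Internet in China",
     "post_content": "Some Internet stats in China."},
    {"post_title": "My trip",
     "post_content": "Internet connections were not very good when I was "
                     "traveling; yet the food was just delicious in China"},
]

# Highest score first.
ranked = sorted(posts, key=lambda p: score(p, "internet in china"),
                reverse=True)
```

With plain word frequency both posts would match the body equally; the
title weight is what pushes the first post to the top.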
>
> Ex: I wrote a post titled “Internet in China”, which describes some
> Internet stats in China.
>
> And suppose another post contains the following sentence:
>
> “Internet connections were not very good when I was traveling; yet the
> food was just delicious in China”
>
> So searching for “internet in china” should give more weight to my
> first post, because the words are in the title, the post is linked
> from other posts, and “Internet” and “China” are adjacent to each
> other.
>
That's very true.
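
For the adjacency part, a minimal sketch of a proximity measure could
look like the one below. It recomputes word positions from the raw text
at query time instead of storing them in the index, which is only an
assumption to avoid a bigger index; `min_distance` is a made-up name.

```python
import re

def min_distance(text, word_a, word_b):
    """Smallest gap, in words, between word_a and word_b.

    Returns None if either word does not occur in the text.
    """
    tokens = re.findall(r"\w+", text.lower())
    best = None
    last_a = last_b = None
    for i, tok in enumerate(tokens):
        if tok == word_a:
            last_a = i
        if tok == word_b:
            last_b = i
        # Once both words have been seen, compare their latest positions.
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

title = "Internet in China"
sentence = ("Internet connections was not very good when I was traveling; "
            "yet the food was just delicious in China")

near = min_distance(title, "internet", "china")
far = min_distance(sentence, "internet", "china")
```

A smaller gap could then be turned into a bonus on top of the word
frequency score, so the title above would beat the travel sentence.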
> Among the techniques, the PageRank algorithm was the most effective.
> It’s an old concept in academia – when one looks for research papers on
> a subject, one starts with the reference lists at the end of papers. If
> many authors, especially well-known authors, reference a particular
> paper, that one is likely the best result.
> When the sampling space is enlarged to the whole Internet, it gives
> amazingly accurate results.
Do you think we could use pingbacks for that? Pingback isn't supported
by all web applications, though, so it would give only partial
information about links. We could use referrers, but that may not be a
good idea because referrers are easy to spam or fake.
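
If we did collect links from pingbacks, a simplified PageRank-style
computation over them might look like this. It is a sketch under
assumptions, not Google's actual implementation: the damping factor of
0.85 and the fixed iteration count are common textbook choices, and the
post names are invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank posts by incoming links.

    links maps each post to the list of posts it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this post's rank evenly among the posts it links to.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A post with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical pingback data: post-a and post-b both link to post-c.
pingbacks = {"post-a": ["post-c"], "post-b": ["post-c"], "post-c": ["post-a"]}
ranks = pagerank(pingbacks)
```

Here post-c ends up with the highest rank because two posts link to it.
With only partial pingback data the graph would be sparse, so the
result would be a rough signal at best.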
>
> I think blog search is limited precisely by the lack of ranking
> algorithms that stay effective without a big enough sampling space.
> We need to address or solve this problem in order to design an
> effective search algorithm for blogs.
>
Improving the ranking algorithm seems to be the top priority now. Does
anyone have any other ideas?
Best wishes,
Kodie