Faster Better Cheaper Search Engines

John

Oct 25, 2009, 8:44:50 PM
Searching for documents and other items on the Web or computers is
often tedious and time consuming. Time is money. Highly paid
professionals spend hours, days, and even longer searching for
information on the Web or computers. Most search today is done using
key word and phrase matching, often combined with various ranking
schemes for the search results. Occasionally more advanced methods
such as logical queries, e.g. search for “rocket scientist” and NOT
“space”, and regular expressions are used. All of these methods have
significant limitations and often require lengthy human review and
further manual searching of the search results.
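As a rough illustration of the logical query mentioned above (search for "rocket scientist" and NOT "space"), here is a minimal Python sketch. The `matches` helper and the three-document corpus are invented for this example and do not come from the linked article.

```python
def matches(text, required, excluded=()):
    """Return True if text contains every required phrase and no excluded phrase."""
    lowered = text.lower()
    has_all = all(phrase.lower() in lowered for phrase in required)
    has_none = not any(phrase.lower() in lowered for phrase in excluded)
    return has_all and has_none


# Hypothetical corpus, invented for illustration.
docs = [
    "She is a rocket scientist working on space probes.",
    "You do not need to be a rocket scientist to cook rice.",
    "The space station orbits the Earth.",
]

# "rocket scientist" AND NOT "space"
hits = [d for d in docs if matches(d, required=["rocket scientist"], excluded=["space"])]
for hit in hits:
    print(hit)
```

Because this is plain substring matching, excluding "space" would also exclude a page that only mentions "spaceship", which is one small instance of the limitations described above.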

The dream search engine would search by topic, by the detailed content
of the items searched, ideally finding the desired information
immediately. Actual understanding of text remains an unfulfilled
promise of artificial intelligence. Statistical language processing
can achieve a degree of searching by topic. This article introduces
the basic concepts and mathematics of statistical language processing
and its applications to search. It gives a brief introduction and
overview of more advanced techniques in statistical language
processing as applied to search. It also includes sample Ruby code
illustrating some simple statistical language processing methods.

http://math-blog.com/2009/10/25/faster-better-cheaper-search-engines/

Ian Parker

Oct 26, 2009, 7:29:09 AM
It is easy to say these things. In fact the modern search engine is
extremely sophisticated in what it is trying to do.
Most search engines these days use LSI (Latent Semantic Indexing).
This represents each web page as a vector. Some remarkable
associations emerge between websites that have similar vectors.

http://chris.ikit.org/ksv2.pdf

is very impressive. It should be pointed out that LSI scanning is CPU
intensive. OK once a page has been done it has been done.
One thing I would like Google to do is to use vectors when searching
from within a document you are writing. It does not appear to do this.
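
To make the "each web page as a vector" idea concrete, here is a minimal Python sketch that turns pages into term-count vectors and compares them by cosine similarity. This is only the plain vector-space step: LSI would additionally apply a truncated singular value decomposition to the term-document matrix, which is not shown here. The page texts and function names are invented for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a page as a sparse vector of term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical pages, invented for illustration.
pages = {
    "engines": "rocket engine burns liquid fuel",
    "design": "liquid fuel rocket engine design",
    "cake": "recipe for chocolate cake",
}
vectors = {name: term_vector(text) for name, text in pages.items()}

print(cosine(vectors["engines"], vectors["design"]))
print(cosine(vectors["engines"], vectors["cake"]))
```

Pages that share vocabulary score close to 1 and pages with no terms in common score 0. The raw counts only capture literal word overlap; the associations between pages that share no words are what the SVD step in LSI is meant to add.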

- Ian Parker

Ted Dunning

Nov 4, 2009, 2:57:16 PM
I hate to be negative, but ...

On Oct 26, 3:29 am, Ian Parker <ianpark...@gmail.com> wrote:
> It is easy to say these things. In fact the modern search engine is
> extremely sophisticated in what it is trying to do.
> Most search engines these days use LSI (Latent Semantic Indexing).

This is just plain silly. In fact, very few search engines use LSI
outside of research. Even fewer search engines in production use LSI
directly. A very few engines use some form of random indexing (which
is similar). Off-hand, I can only think of non-search production
applications that use this form of comparison (essay scoring, a (very)
little bit of fraud modeling, some recommendation engines, perhaps one
or two other applications).

> This represents each web page as a vector. Some remarkable
> associations emerge between websites that have similar vectors.

Moderately interesting is what I would say rather than "remarkable".

> .... It should be pointed out that LSI scanning is CPU
> intensive. OK once a page has been done it has been done.

The first is true, the second is not.

The cost of searching with LSI is exactly proportional to corpus size and is
usually bottle-necked by memory bandwidth and secondarily by.
Searching using conventional techniques is sub-linear in corpus size
when you start getting really large corpora. The cost of LSI is
prohibitive for most large search engines on several axes.
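
The scaling difference described above can be sketched in Python under some simplifying assumptions. A conventional engine looks up query terms in an inverted index and only touches the documents listed for those terms, whereas comparing a query against one dense vector per document means visiting every document, much like the full scan below. The corpus and function names are invented for illustration, and real engines use sorted, compressed posting lists rather than Python sets.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def search_inverted(index, query):
    """AND query: intersect posting sets, touching only documents that contain the terms."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


def search_scan(docs, query):
    """Same AND query by visiting every document, so the work grows with corpus size."""
    terms = set(query.lower().split())
    return {i for i, text in enumerate(docs) if terms <= set(text.lower().split())}


# Hypothetical corpus, invented for illustration.
docs = [
    "latent semantic indexing of web pages",
    "keyword search of web pages",
    "essay scoring with latent semantic analysis",
]
index = build_inverted_index(docs)

print(search_inverted(index, "latent semantic"))
print(search_scan(docs, "latent semantic"))
```

Both functions return the same document ids. The difference is only in how much of the corpus each one has to look at per query.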

In addition, LSI is typically best in recall while modern search
applications are (mostly) dominated by considerations of first page
precision. This makes LSI a very bad match to (most) modern needs.

Ian Parker

Nov 5, 2009, 9:47:52 AM

I have a Google alert "Latent Semantic Analysis", and loads of
articles come up describing how to optimise your search for Google's
new techniques. It would seem that they are all under an illusion.

It is hard to see how Web 3.0 is ever going to work without some form
of LSA being used to produce precise word meanings.


- Ian Parker

Ted Dunning

Nov 6, 2009, 12:32:47 PM
On Nov 5, 6:47 am, Ian Parker <ianpark...@gmail.com> wrote:
> > In addition, LSI is typically best in recall while modern search
> > applications are (mostly) dominated by considerations of first page
> > precision.  This makes LSI a very bad match to (most) modern needs.
>
> I have a Google alert "Latent Semantic Analysis", and loads of
> articles come up describing how to optimise your search for Google's
> new techniques. It would seem that they are all under an illusion.

Well, that would not be the first time that the SEO community have
gone all a-twitter about rumors that have nothing to do with the
reality of how search engines work.

My own working approximation is that SEO "expert" knowledge of search
technology is zero. I have only very rarely seen any counter-evidence.

LSI and LSA are very particular technical terms that refer to spectral
decompositions of occurrence patterns. Google has done work in
probabilistic LSI (I believe that Hofmann works there now), but I am
pretty sure (without direct knowledge of the code) that the techniques
actually in production use term expansion instead of dimensionality
reduction, and that the term expansion is done primarily at index
time.
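
For readers unfamiliar with the term, here is a minimal Python sketch of what term expansion at index time could look like in general. It is not a description of any production system. The expansion table, corpus, and function name are all invented for illustration. Each document is indexed under its own terms plus related terms, so no extra work is needed at query time.

```python
from collections import defaultdict

# Hypothetical expansion table, invented for illustration. A real system
# would presumably derive related terms statistically rather than by hand.
EXPANSIONS = {
    "car": ["automobile"],
    "automobile": ["car"],
    "film": ["movie"],
    "movie": ["film"],
}


def index_with_expansion(docs, expansions):
    """Build an inverted index where each document is also filed under related terms."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
            for related in expansions.get(term, ()):
                index[related].add(doc_id)
    return index


docs = ["used car prices", "classic film reviews"]
index = index_with_expansion(docs, EXPANSIONS)

# A query for "automobile" finds the document that only says "car".
print(index.get("automobile", set()))
print(index.get("movie", set()))
```

The index stays an ordinary inverted index, so lookups keep the cost profile of conventional keyword search, in contrast to comparing reduced-dimension vectors for every document.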

>
> It is hard to see how Web 3.0 is ever going to work without some form
> of LSA being used to produce precise word meanings.

The point of LSA is to get just the opposite of "precise word
meanings".

Ian Parker

Nov 6, 2009, 3:58:06 PM
On 6 Nov, 17:32, Ted Dunning <ted.dunn...@gmail.com> wrote:

> The point of LSA is to get just the opposite of "precise word
> meanings".

I disagree, although we may be talking at cross purposes. There is a
set of precise words that is a (possibly quite small) subset of total
words, but which refer to a unique precise concept. Sometimes, as is
the case with Blue Giant, we have other meanings. As I said "Blue
Giant" is a music group. We need a way of distinguishing between music
groups and massive main sequence stars. LSA is the "kneejerk" way of
doing this.
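
As a toy illustration of the disambiguation problem, here is a Python sketch that guesses which "Blue Giant" is meant from the surrounding words. This is simple context-word overlap against hand-written sense profiles, not LSA. The profiles and function name are invented for the example.

```python
# Hypothetical sense profiles, invented for illustration.
SENSE_PROFILES = {
    "star": {"star", "stars", "stellar", "mass", "luminosity", "sequence", "supernova"},
    "band": {"band", "album", "concert", "music", "tour", "song", "songs"},
}


def guess_sense(context, profiles):
    """Pick the sense whose profile shares the most words with the context."""
    words = set(context.lower().split())
    scores = {sense: len(words & vocab) for sense, vocab in profiles.items()}
    best = max(scores, key=scores.get)
    # No overlap with any profile means there is no evidence either way.
    return best if scores[best] > 0 else None


print(guess_sense("a blue giant is a massive main sequence star", SENSE_PROFILES))
print(guess_sense("blue giant released a new album and announced a tour", SENSE_PROFILES))
```

An LSA-style approach would replace the hand-written profiles with vectors learned from co-occurrence statistics, but the decision step of comparing the context against each candidate sense would have a similar shape.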


- Ian Parker
