LSA: Identifying most representative terms and documents of a dimension


sleepyoliver

Jan 17, 2012, 11:48:25 AM
to Semantic Vectors
Hi there,

I am interested in identifying the most representative terms and
documents of each dimension generated by LSA. How can I obtain this
information? Import svd_docvectors.txt and svd_termvectors.txt to
Excel and sort by columns? How do I interpret the values of the
vectors? High value means high relevance? What about negative values?

Best regards,
Oliver

Dominic

Jan 20, 2012, 2:05:58 AM
to Semantic Vectors
Hi Oliver,

On Jan 17, 8:48 am, sleepyoliver <oli.muel...@gmail.com> wrote:
> Hi there,
>
> I am interested in identifying the most representative terms and
> documents of each dimension generated by LSA. How can I obtain this
> information? Import svd_docvectors.txt and svd_termvectors.txt to
> Excel and sort by columns?

That's a great idea; I hadn't thought of that. There are a variety of
other scripting and hacking solutions I can think of, but none as
quick as this.

> How do I interpret the values of the
> vectors? High value means high relevance? What about negative values?

Yes, a high value in a particular column would indicate high relevance
to a particular latent dimension. Whether you interpret these as
topics or not depends on preference, really. In my experience some are
good and some look pretty random.
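
For anyone who would rather script the sort than use Excel, here is a
minimal sketch. It assumes (this is not confirmed anywhere in this
thread) that the text vector files hold one "name|v1|v2|..." line per
vector after an optional header line; parse_vectors and extremes are
hypothetical helper names, not part of Semantic Vectors.

```python
def parse_vectors(lines):
    """Parse 'name|v1|v2|...' lines into {name: [floats]}.

    The pipe-separated layout is an assumption about the text format;
    lines without a '|' (such as a header) are skipped.
    """
    vectors = {}
    for line in lines:
        line = line.strip()
        if "|" not in line:
            continue
        name, *values = line.split("|")
        vectors[name] = [float(v) for v in values]
    return vectors


def extremes(vectors, dim, k=3):
    """Return the k highest and the k most negative names for one dimension."""
    ranked = sorted(vectors, key=lambda name: vectors[name][dim], reverse=True)
    return ranked[:k], ranked[-k:][::-1]


# Made-up sample data, standing in for the contents of a term vector file.
sample = [
    "-vectortype REAL -dimension 2",
    "cat|0.9|-0.1",
    "dog|0.7|0.2",
    "tax|-0.8|0.5",
    "law|-0.6|0.6",
]
vecs = parse_vectors(sample)
top, bottom = extremes(vecs, dim=0, k=2)
print(top, bottom)
```

To run it on a real file, pass an open file object instead of the
sample list, e.g. parse_vectors(open("svd_termvectors.txt")).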

Negative values are an interpretation problem for SVD in my opinion.
The basic mathematics descends from (I think) Euler's discovery that
the stable axes of rotation for a solid body are orthogonal and of
course intersect in the center of gravity. You can see exactly why
negative values are important if your measurements are spatial
distances and your origin is the centre of gravity.

If what we're measuring is a number of observations (in this case,
number of times each term occurs in each context), all the counts are
initially positive. So while it's easy to see that an axial
decomposition would give rise to some negative values, it's hard to
interpret these as relating at all to the number of times something
happens.
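
A small worked example (an illustration made up for this point, not
taken from any LSA output) shows how negative entries appear even when
every count is positive:

```latex
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = U S V^{T}, \qquad
U = V = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
S = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}

% Flipping the sign of a matching pair of singular vectors leaves A unchanged:
u_k \, s_k \, v_k^{T} = (-u_k) \, s_k \, (-v_k)^{T}
```

The first singular vector has all-positive entries, and the second has
to be orthogonal to it, so it is forced to mix signs. The second line
also shows that the overall sign of a dimension is arbitrary, so a
large negative value is probably best read as the opposite pole of
that dimension rather than as "low relevance".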

Note that we get even more negative entries with random projection (as
many negatives as positives), but since we don't generally try to
interpret the individual coordinates at all it doesn't really matter.

Best wishes,
Dominic


sleepyoliver

Feb 23, 2012, 3:29:15 AM
to Semantic Vectors
Thanks Dominic!

sleepyoliver

Mar 23, 2012, 3:45:35 AM
to Semantic Vectors
Hello again,

I have one more question regarding identifying high-loading terms and
documents of a dimension. As far as I know, in standard LSA, the
values of the SVD matrix range between -1 and +1. However, the term
and document matrices I get from SV contain some very large positive
(e.g., +6000) and some very large negative (e.g., -6000) values. How
does that happen?

Best regards
Oliver



Trevor Cohen

Mar 23, 2012, 10:02:23 AM
to semanti...@googlegroups.com
Hi Oliver,
Did you use any of the term weighting options when constructing the matrix? I'd recommend using -termweight logentropy, as this is standard practice with LSA (e.g., http://lsa.colorado.edu/papers/plato/plato.annote.html) and tends to improve results. It's also common practice to use a stopword list to eliminate uninformative terms with high global frequency, although this can be approximated to some degree by using a maximum frequency threshold.
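
To give a feel for what log-entropy weighting does to raw counts, here is a rough sketch of the usual textbook formulation: local weight log(1 + tf), multiplied by a global weight of 1 plus the normalized entropy term. It is an assumption that the -termweight logentropy option computes exactly this; log_entropy is a hypothetical helper name and the counts are made up.

```python
import math


def log_entropy(counts):
    """Apply textbook log-entropy weighting.

    counts[i][j] is the raw frequency of term i in document j.
    """
    n_docs = len(counts[0])
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")
    weighted = []
    for row in counts:
        total = sum(row)
        if total == 0:
            weighted.append([0.0] * n_docs)
            continue
        # Sum of p * log(p) over documents where the term occurs (always <= 0).
        plogp = sum((c / total) * math.log(c / total) for c in row if c > 0)
        global_weight = 1.0 + plogp / math.log(n_docs)
        weighted.append([global_weight * math.log(1 + c) for c in row])
    return weighted


counts = [
    [5, 5, 5, 5],   # spread evenly over all documents, like a stopword
    [20, 0, 0, 0],  # concentrated in a single document
]
weights = log_entropy(counts)
print(weights)
```

A term spread evenly over every document ends up with a weight near zero, while a term concentrated in one document keeps its full log-scaled count, so the very large raw frequencies of common words no longer dominate the matrix.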
Regards,
Trevor  



Dominic Widdows

Mar 23, 2012, 12:49:48 PM
to semanti...@googlegroups.com
Hi Oliver,

Another question - are you seeing large values in your term vectors,
your document vectors, or both?
In the code at
http://code.google.com/p/semanticvectors/source/browse/trunk/src/pitt/search/semanticvectors/LSA.java,
we explicitly normalize term vectors but take doc vectors straight
from the U matrix of the SVD decomposition.

The docvectors part wouldn't surprise me, because large values can
occur in SVD depending on your choice of representation.

In the decomposition A = U * S * V^T, the columns of U are the left
singular vectors, the columns of V are the right singular vectors, and
S holds the singular values. By moving multiplicative factors around
from one matrix to another, you can have quite a lot of leeway. You
can choose parameters so that at least one of U and V is a unitary
matrix, possibly both; I'd have to check the maths / literature to
make sure.
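
For reference, the leeway can be written out as follows. This is the
textbook convention; which of these groupings a given SVD library
actually writes out is something to check rather than assume.

```latex
A = U S V^{T} = (U S)\, V^{T} = U\, (S V^{T})
  = (U S^{1/2})\,(S^{1/2} V^{T}),
\qquad U^{T} U = V^{T} V = I
```

With orthonormal columns, every entry of U and V lies between -1 and
+1, and all of the scale sits in S. Vectors taken from a grouping such
as U S instead can have entries as large as the singular values, which
grow with the size of the counts in A.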

I'm not an SVD expert and I haven't actually checked what convention
the library uses. So I can't answer the question "What exactly is
going on?" off the top of my head, but at least I can say that it's
not altogether surprising to find large values in the document
vectors. If you're getting large values in your term vectors, that's
another matter and is definitely weird.
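
If the large values are only a matter of scale, one way to bring
vectors back into the -1 to +1 range is to normalize each one to unit
length. A minimal sketch with made-up numbers (unit_normalize is a
hypothetical helper, not a Semantic Vectors function):

```python
import math


def unit_normalize(vector):
    """Scale a vector to Euclidean length 1; leave a zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


doc_vector = [6000.0, -6000.0, 3000.0]  # invented values of the reported size
unit = unit_normalize(doc_vector)
print(unit)
```

Normalizing leaves cosine similarities untouched, but it rescales each
document by a different factor, so the ordering of documents within a
single column can change. Whether to sort the raw or the normalized
values is a judgment call.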

Best wishes,
Dominic
