Some Initial Results

Darren Govoni

Oct 9, 2008, 9:48:59 PM
to semanti...@googlegroups.com
Hi again,
I indexed all the short abstracts on Wikipedia and was able to build
the semantic vectors, etc.

- For 1.1 million documents, Lucene took 3 hours to index.
- SemanticVectors took 4 minutes to generate vectors over that index.
IMPRESSIVE!!

Running some tests I get some varied and at times unexpected results.
Consider the output below. [1] produces some expected terms such as
'pitcher', 'major', and 'league', but also 'elephants'. Hehe. For [2] I
don't really see any terms I would expect to see, especially 'snoopy'.
Given that the data set is ALL of the Wikipedia abstracts, I would expect
more semantically relevant vectors because of the breadth of documents.

My question is: Is there a data set and examples that demonstrate
clustering, LSA, and those sorts of things more readily?

[1] 'baseball'
------------------------------
java pitt.search.semanticvectors.Search baseball
Opening query vector store from file: termvectors.bin
Dimensions = 200
Lowercasing term: baseball
Searching term vectors, searchtype SUM ... Search output follows ...
1.0:baseball
0.39926556:pitcher
0.35765815:major
0.34752852:league
0.31036237:handed
0.29435244:batted
0.28889602:rotation
0.28867063:oven
0.28463548:elephants
0.2721926:soe
0.27092353:bankhead
0.2704968:columba
0.26990595:areas
0.269557:hairstyles
0.26269984:superbas
0.2594802:faria
0.25887036:mariinsky
0.25632423:thq
0.2561813:bedourie
0.25433058:bazar


[2] 'politics'
------------------------------
java pitt.search.semanticvectors.Search politics
Opening query vector store from file: termvectors.bin
Dimensions = 200
Lowercasing term: politics
Searching term vectors, searchtype SUM ... Search output follows ...
1.0000002:politics
0.3098981:rolf
0.27607232:syndicalism
0.27446535:caroliniana
0.27189368:benatar
0.2679594:transnational
0.26459378:snoopy
0.2573341:separations
0.2569847:majored
0.2514971:brampton
0.25055647:fulltime
0.25007513:sacrificing
0.24993348:dramatic
0.24831256:limiter
0.24809343:liberated
0.24738097:investigative
0.24737078:tübingen
0.24671082:erle
0.24581943:isopropanol
0.24579354:kargil


Dominic

Oct 10, 2008, 10:31:13 AM
to Semantic Vectors
Hi Darren,

> Hi again,
> I indexed all the short abstracts on Wikipedia and was able to build
> the semantic vectors, etc.


Interesting. Could you post a link to where you get these abstracts
from? I know a lot of people are using Wikipedia as a corpus nowadays,
so it's a very good thing to be involved in.



> - For 1.1 million documents, Lucene took 3 hours to index.
> - SemanticVectors took 4 minutes to generate vectors over that index.
> IMPRESSIVE!!


That's great!

> Running some tests I get some varied and at times unexpected results.
> Consider the output below. [1] produces some expected terms such as
> 'pitcher', 'major', and 'league', but also 'elephants'. Hehe. For [2] I
> don't really see any terms I would expect to see, especially 'snoopy'.
> Given that the data set is ALL of the Wikipedia abstracts, I would expect
> more semantically relevant vectors because of the breadth of documents.


Yes, I'm surprised at the politics result.
Things you could try include:
i. using more dimensions (-d option).
ii. using more training cycles (-tc option).
iii. building a positional index (see http://code.google.com/p/semanticvectors/wiki/PositionalIndexes)
Of these, I would expect iii. to make the most difference, though
you'll probably need to build a whole new Lucene index.
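
For i. and ii., the invocation looks something like the lines below - I'm
writing this from memory, the numbers are just for illustration, and the
trailing argument stands for wherever your Lucene index lives, so do check
the usage message that BuildIndex prints:

java pitt.search.semanticvectors.BuildIndex -d 500 -tc 3 /path/to/lucene/index

For iii., if I remember right the positional variant has its own build class
(the wiki page above has the details):

java pitt.search.semanticvectors.BuildPositionalIndex /path/to/lucene/index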

> My question is: Is there a data set and examples that demonstrate
> clustering, LSA, and those sorts of things more readily?


I tend to use the Bible, because it's small and easy - and contrary to
what is often said in the literature, it seems that you can build a
perfectly good model using fewer than 2 million words in your corpus.
I have scraped data from the web in the past, with some good results
and some not so good. (It's pretty easy to implement if you're willing
to pick a list of sites and then run wget over this batch - but if you
start to do this for more than a few small sites, check that your
institution / infrastructure is OK with this, use settings that
respect robots.txt, etc. - I know this can all be done but I wouldn't
advise anyone to do it without properly considering the implications.)
Also I've used the European Parliament data and the Ohsumed corpus for
medical stuff - they're easy to find on the web as well.

But really I'd be surprised if any of these is better than Wikipedia
as a source for this sort of work - hopefully there are ways of
getting better results here.

Another thing to consider is method rather than data - on my vague
TODO list for a while has been the idea of adding support for SVD as
well as random projection into semantic vectors. It really shouldn't
be too hard - though I'd be amazed if there weren't serious
performance problems.

Thanks for keeping in touch - please keep asking and posting results,
this is great for our small community.
Best wishes,
Dominic

Lance Norskog

Oct 10, 2008, 1:30:01 PM
to semanti...@googlegroups.com
I got better results with domain-specific indexes. So, if you only
index one part of Wikipedia, say a top-level category, you might get
more sensible results.


--
"The Playboy reader invites a female acquaintance in for a quiet
discussion of Picasso, Nietzsche, jazz, sex." - Hugh Hefner

Darren Govoni

Oct 10, 2008, 4:02:10 PM
to semanti...@googlegroups.com
Yeah, I noticed a degradation in "expected" results the more documents I
indexed, but here's the catch: semantic vectoring like this becomes more
attractive in theory as the data and result sets grow, yet the coherence
of the results _seems_ to diminish (in my one experiment so far, at least).

I'm still learning about this promising project, so bear with my initial
observations. :)

Even so, wouldn't one expect the strongest vector neighbors to condense
despite (or because of) the number of documents? In my results, the
neighbors look more or less random from a semantic perspective.

Darren

PS. I haven't got Dominic's response in my inbox yet. Will respond after.

Lance Norskog

Oct 10, 2008, 8:58:36 PM
to semanti...@googlegroups.com
My test was hundreds of thousands of web items referring to famous movies. "Snow" turned up "White", "Dwarves", and "Dwarf" as close relations. Also, words are not the atomic data to index: phrases are. Phrases are much more isolated than words.

If you build an index with Solr it is possible to do some nice things to prep your index for better cross-correlations. For example, storing the words as phonemes instead of spellings, or stemming the words so that 'get' and 'getting' become the same indexed term. (I don't think stemming will handle 'dwarf' and 'dwarves'.)
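
To give the flavor in plain Lucene terms - this is a from-memory sketch
rather than code from my setup, and in Solr you would express the same
thing as an analyzer chain in schema.xml:

// Index-time stemming analyzer (Lucene 2.x era API): tokenize, lowercase,
// then Porter-stem each token. This maps 'getting' -> 'get', but will not
// unify irregular forms like 'dwarf' and 'dwarves'.
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PorterStemFilter(new LowerCaseFilter(new StandardTokenizer(reader)));
  }
}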

Christian Prokopp

Oct 12, 2008, 10:42:42 PM
to semanti...@googlegroups.com
When you check your results, make sure you cross-check them against the corpus, e.g. Wikipedia. I ran experiments with INFOMAP and Wikipedia a while ago which seemed to make no (semantic) sense until I checked the corpus. There are surprises sometimes. In my case I was looking for religious terms like 'god' and 'lord' and got Scandinavian names back. When I checked, it turned out that there was a comprehensive collection of articles about old Scandinavian religions and their gods. This fits semantically; my understanding of the topic was just too one-dimensional to realise it.

Cheers,
Christian

curioussemant...@gmail.com

Dec 2, 2008, 2:28:51 PM
to Semantic Vectors
I am interested in using Solr with SemanticVectors. When I attempt to run
pitt.search.semanticvectors.BuildIndex on the Solr index, I get an error.
Does the Solr index need to be modified in some way to work with SV?
Exploring with Luke shows the index is as expected.

Dominic

Dec 2, 2008, 4:40:13 PM
to Semantic Vectors
Hi Lance,

I'm hoping this is the same issue raised concurrently in this thread:
http://groups.google.com/group/semanticvectors/browse_thread/thread/c29c7414b762eb94

If you reply could you reply to that thread, please? (Not sure if
there's a better way to formally merge two threads!)

Best wishes,
Dominic

Lance Norskog

Dec 2, 2008, 6:27:57 PM
to semanti...@googlegroups.com
The SemanticVectors distributed source (as of last summer) expected a particular pair of field names in the terms. I changed the app to match my Solr-generated index. You have to know your Solr index schema and how it becomes fields & terms to integrate with the SV programs.

I did something even bigger: I wanted to use a huge index and reading all of the terms was not possible. I added a Lucene search and drove the SV code from a HitCollector.
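
Roughly, the collecting side looked like the sketch below. Again, this is
from memory rather than my actual code, and 'contents' is a stand-in for
whatever field your schema really indexes:

// Gather the Lucene doc ids matching a query (Lucene 2.x HitCollector API),
// so vector building can be driven from a search result set instead of a
// scan over every term in a huge index.
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SubsetCollector {
  public static List<Integer> matchingDocs(String indexDir, String queryText)
      throws Exception {
    IndexSearcher searcher = new IndexSearcher(indexDir);
    Query query = new QueryParser("contents", new StandardAnalyzer()).parse(queryText);
    final List<Integer> docIds = new ArrayList<Integer>();
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        docIds.add(doc); // keep every hit; we only need ids, not scores
      }
    });
    searcher.close();
    return docIds; // hand these ids to the vector-building loop
  }
}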

This would be a really cool thing to add to Solr. I would do it the same way they did spell-correction: that component scans the main index and makes its own small index with a different schema. Similarly, an SV plugin would be driven from a Solr search and build its own vector database. This would create a different kind of "More Like This" than the Lucene implementation.

Lance

Dominic

Dec 2, 2008, 7:11:43 PM
to Semantic Vectors
Hi Lance,

On 12/2/08, Lance Norskog <nors...@gmail.com> wrote:
> The SemanticVectors distributed source (as of last summer) expected a
> particular pair of field names in the terms. I changed the app to match
> my Solr-generated index. You have to know your Solr index schema and how
> it becomes fields & terms to integrate with the SV programs.
>
> I did something even bigger: I wanted to use a huge index and reading all of
> the terms was not possible. I added a Lucene search and drove the SV code
> from a HitCollector.

Did you find that the latency of search requests was acceptable with
this approach? It sounds like a lot to ask while a user is watching,
but I could be wrong. Or is this partly what motivates your other
question about index compression?

> This would be a really cool thing to add to Solr. I would do it the same
> way they did spell-correction: that component scans the main index and
> makes its own small index with a different schema. Similarly, an SV plugin
> would be driven from a Solr search and build its own vector database. This
> would create a different kind of "More Like This" than the Lucene
> implementation.

Please keep the list in touch with your progress if you get something
working in this realm - this is great.

If you want to open-source the code, it would be fine to host it with
the SV project. I don't know whether this would make more or less sense
than making it part of Solr.

When this came up with LeActiveMath, we did some work to make sure
that SemanticVectors behaves reasonably well as a component -
particularly with throwing sensible exceptions instead of exiting the
JVM when something goes wrong. Hopefully we're in reasonably good
shape on this score now, but glad to hear further suggestions if you
have them.

Best wishes,
Dominic