Help with LSA


David Webb

Jul 16, 2011, 3:53:54 PM
to S-Space Package Users
I have successfully created an .sspace file from the lsa.jar using
several thousand resumes. My end goal is to do the same for about 2mm
resumes.

I am a 16-year Java veteran and a 2-day LSA/S-Space newbie, if that
helps in your answer :).

When I run the SSE utility and get-neighbors, I notice there is a lot
of punctuation in my results.

1) What is the best way to remove the punctuation from files I am
indexing? Should I do it beforehand, or can I write a class to strip
it out and pass it into the call to the LSA.jar? I am open to either
one.

I also notice that case is an issue. For example, "gn java" returns
Java as the top neighbor:

> gn java
XML 0.7651379533548325
system 0.7652091463258547
Web 0.7653763835607287
Server 0.7658603226937245
JSP, 0.7668629964288048
application 0.7669354240916089
Apache 0.768246371517607
design 0.7688026265633942
using 0.7767917054057629
Java 0.7792353126764983

2) How can I tell the LSA program to ignore case when determining what
to index/tokenize?

3) I tried to use the org.tartarus.snowball.ext.englishStemmer class
as my stemmer. I compiled it from the latest Java lib download from
the snowball site, and I get a CCE as follows...any suggestions? I
fully understand the CCE, so I must be using the wrong package. :)

java.lang.ClassCastException: org.tartarus.snowball.ext.englishStemmer
cannot be cast to edu.ucla.sspace.text.Stemmer
       at edu.ucla.sspace.text.IteratorFactory.setProperties(IteratorFactory.java:260)
       at edu.ucla.sspace.mains.GenericMain.run(GenericMain.java:422)
       at edu.ucla.sspace.mains.LSAMain.main(LSAMain.java:147)

Thank you very much in advance!

DW




David Jurgens

Jul 16, 2011, 8:13:11 PM
to s-spac...@googlegroups.com
Hi David,
 
When I run the SSE utility and get-neighbors, I notice there is a lot
of punctuation in my results.

1) What is the best way to remove the punctuation from files I am
indexing?  Should I do it beforehand, or can I write a class to strip
it out and pass it into the call to the LSA.jar?  I am open to either
one.

It's probably best to remove the punctuation beforehand.  The second option is much cleaner, but we don't have support for it at the moment.  I've filed Issue 98 to keep track of being able to pass in a custom tokenizer / preprocessor, as it would definitely make life easier.  :)  I don't think we added this in the first place because we generally test multiple algorithms on the corpus, so it's faster to preprocess once rather than on the fly each time.
 
I also notice that case is an issue.  for example "gn java" returns
Java as the top neighbor

> gn java
XML     0.7651379533548325
system  0.7652091463258547
Web     0.7653763835607287
Server  0.7658603226937245
JSP,    0.7668629964288048
application     0.7669354240916089
Apache  0.768246371517607
design  0.7688026265633942
using   0.7767917054057629
Java    0.7792353126764983

2) How can I tell the LSA program to ignore case when determining what
to index/tokenize.

This is another side effect of the current tokenizing.  You'll want to lower-case all your text when you do the above preprocessing step.

We have a few tools to help, so you might try something like the following to do the cleaning:

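    // Needs java.io.File, java.io.IOException, java.io.PrintWriter, plus
    // edu.ucla.sspace.util.DirectoryWalker and edu.ucla.sspace.util.LineReader
    public void clean(File inputDirectory, File outputFile) throws IOException {
        PrintWriter outputWriter = new PrintWriter(outputFile);
        // Each resume file becomes a single line in the output file
        for (File resume : new DirectoryWalker(inputDirectory)) {
            StringBuilder doc = new StringBuilder((int)resume.length());
            for (String line : new LineReader(resume)) {
                // Pad every run of punctuation with spaces so it splits off from words
                line = line.replaceAll("(\\p{Punct}+)", " $1 ");
                line = line.toLowerCase();
                doc.append(line).append(' ');
            }
            outputWriter.println(doc);
        }
        outputWriter.close();
    }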

That should lower-case everything and separate the punctuation from any words, so that "word," and "word" now tokenize to the same token.  If you want to remove the punctuation altogether, just swap the " $1 " with " ".  You can use the resulting output file with the -d option for lsa.jar.
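
If you want to check what the regex does before running it over the whole corpus, here is a minimal standalone sketch of just the per-line step, using only the standard library.  The class name TextCleaner and the keepPunctuation flag are made up for illustration and are not part of S-Space; the extra whitespace collapsing is also my addition.

```java
import java.util.Locale;

// Hypothetical helper for trying out the per-line cleaning step in isolation.
class TextCleaner {

    /**
     * Lower-cases a line and either pads runs of punctuation with spaces
     * (keepPunctuation = true) or replaces them with a single space.
     */
    static String clean(String line, boolean keepPunctuation) {
        String replacement = keepPunctuation ? " $1 " : " ";
        String cleaned = line.replaceAll("(\\p{Punct}+)", replacement);
        // Collapse the extra spaces the replacement introduces and trim the ends
        return cleaned.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Java, JSP, and XML.", true));   // java , jsp , and xml .
        System.out.println(clean("Java, JSP, and XML.", false));  // java jsp and xml
    }
}
```

One thing to watch for with resumes: \p{Punct} also matches characters such as '+', '#' and '.', so terms like "C++" or ".NET" will be split or stripped as well.  You may want to special-case those before cleaning.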
 

3) I tried to use the org.tartarus.snowball.ext.englishStemmer class
as my stemmer.  I compiled it from the latest Java lib download from
the snowball site, and I get a CCE as follows...any suggestions?  I
fully understand the CCE, so I must be using the wrong package. :)

java.lang.ClassCastException: org.tartarus.snowball.ext.englishStemmer
cannot be cast to edu.ucla.sspace.text.Stemmer
       at edu.ucla.sspace.text.IteratorFactory.setProperties(IteratorFactory.java:260)
       at edu.ucla.sspace.mains.GenericMain.run(GenericMain.java:422)
       at edu.ucla.sspace.mains.LSAMain.main(LSAMain.java:147)

We have a custom Stemmer interface which lets us wrap all of the existing Snowball stemmers.  I think you can just use edu.ucla.sspace.text.EnglishStemmer with the -Z option, which should correctly wrap the Snowball englishStemmer.  You might try running with and without stemming.  I'd imagine that resumes have a lot of technical language, and the stemmers may not correctly handle such words.
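
To illustrate why the cast fails and what the wrapping looks like, here is a rough sketch with stand-in types.  Everything in it is hypothetical: the Stemmer interface below only assumes a single String-to-String method, and ToySnowballStemmer just mimics the set-word / stem / read-back style of the Snowball classes with a toy rule.  It is not the actual S-Space or Snowball code.

```java
// Stand-in for the S-Space interface (assumed shape, for illustration only).
interface Stemmer {
    String stem(String token);
}

// Stand-in for a Snowball-style stemmer: stateful, and unrelated to Stemmer.
class ToySnowballStemmer {
    private String current = "";

    void setCurrent(String word) { current = word; }

    boolean stem() {
        // Toy rule only: drop one trailing "s" from words longer than 3 chars
        if (current.length() > 3 && current.endsWith("s")) {
            current = current.substring(0, current.length() - 1);
        }
        return true;
    }

    String getCurrent() { return current; }
}

// Adapter: implements the expected interface and delegates to the wrapped stemmer.
class ToyStemmerAdapter implements Stemmer {
    private final ToySnowballStemmer delegate = new ToySnowballStemmer();

    public String stem(String token) {
        delegate.setCurrent(token);
        delegate.stem();
        return delegate.getCurrent();
    }
}

class StemmerDemo {
    public static void main(String[] args) {
        Object raw = new ToySnowballStemmer();
        // The raw stemmer does not implement Stemmer, so a cast would throw a CCE
        System.out.println(raw instanceof Stemmer);   // false

        Object wrapped = new ToyStemmerAdapter();
        Stemmer stemmer = (Stemmer) wrapped;          // cast succeeds
        System.out.println(stemmer.stem("servers"));  // server
    }
}
```

A class loaded by name has to implement the expected interface for the cast to succeed, which is why a wrapper class is what gets passed in rather than the raw Snowball class.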

If you run into issues, let us know so we can help. 

  Thanks,
  David
 

David Webb

Jul 17, 2011, 11:00:39 AM
to S-Space Package Users
Thank you for all the help yesterday. Your recommendations worked
great and my gn commands are returning awesome results!