It's probably best to remove the punctuation beforehand. The second option is much cleaner, but we don't support it at the moment. I've filed Issue 98 to keep track of being able to pass in a custom tokenizer / preprocessor, as it would definitely make life easier. :) I don't think we added this in the first place because we generally test multiple algorithms on the corpus, so it's faster to preprocess once rather than on the fly each time.
This is another side effect of the current tokenizing. You'll want to lowercase all your text when you do the above preprocessing step.
We have a few tools to help, so you might try something like the following to do the cleaning:
    public void clean(File inputDirectory, File outputFile) throws IOException {
        PrintWriter outputWriter = new PrintWriter(outputFile);
        for (File resume : new DirectoryWalker(inputDirectory)) {
            StringBuilder doc = new StringBuilder((int)resume.length());
            for (String line : new LineReader(resume)) {
                line = line.replaceAll("(\\p{Punct}+)", " $1 ");
                line = line.toLowerCase();
                doc.append(line).append(' ');
            }
            outputWriter.println(doc);
        }
        outputWriter.close();
    }
That should lowercase everything and separate the punctuation from any words, so that "word," and "word" now tokenize to the same instance. If you want to remove the punctuation altogether, just swap the " $1 " with " ". Each resume ends up on its own line of the output file, which you can then use with the -d option for lsa.jar.
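If you want to check what the regular expression does before running it over the whole directory, here is a small standalone sketch that applies the same two steps to a single string. The class name PunctuationDemo and its methods are made up for illustration and are not part of the S-Space package; it only uses the standard Java library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of S-Space.
public class PunctuationDemo {

    // Pads every run of punctuation with spaces, then lowercases the line.
    static String separate(String line) {
        return line.replaceAll("(\\p{Punct}+)", " $1 ").toLowerCase(Locale.ROOT);
    }

    // Replaces every run of punctuation with a single space, then lowercases.
    static String strip(String line) {
        return line.replaceAll("\\p{Punct}+", " ").toLowerCase(Locale.ROOT);
    }

    // Splits on whitespace, roughly what a whitespace tokenizer would see.
    static List<String> tokens(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String sample = "Java, C++ and SQL.";
        System.out.println(tokens(separate(sample)));
        System.out.println(tokens(strip(sample)));
    }
}
```

With the separating version, "word," splits into the tokens "word" and ",". With the stripping version, the punctuation disappears entirely, which also means something like "C++" collapses to just "c", so it is worth checking a few technical terms from your resumes before choosing between the two.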
We have a custom Stemmer interface which lets us wrap all of the existing Snowball stemmers. I think you can just use edu.ucla.sspace.text.EnglishStemmer with the -Z option, which should correctly wrap the EnglishStemmer. You might try running both with and without stemming and compare the results. I'd imagine that resumes have a lot of technical language, and the stemmers may not correctly handle such words.
If you run into issues, let us know so we can help.
Thanks,
David