Finding strongest terms in text


David Webb

Jul 21, 2011, 2:08:00 PM
to S-Space Package Users
I have a use case where I need to take in a paragraph of text and
determine the "key" terms in that single document.

For example, the input might be a software developer's resume, in
which the most frequently occurring terms are "Java", "HTML", and
"XML".

Does your awesome sspace-lib have anything that can give me that type
of result?

Thank you.

David Jurgens

Jul 22, 2011, 8:37:30 PM
to s-spac...@googlegroups.com
Hi David,

  You might try using the TF-IDF scores of the document's tokens.  I'm not sure there's an easy way to expose those at the moment, though.  You'd essentially need to run the VectorSpaceModel with some hacking inside the processSpace method to figure out which terms have the highest weights.  Both pieces (the TF-IDF weighting and the VectorSpaceModel) are already implemented.

  One question is how you want to determine what is "key."  Your earlier post made it sound like the most important aspect is frequency.  However, terms like "the" will also show up fairly frequently.  Does it matter how often the terms show up in other resumes as well (e.g., will rarer terms count more towards being "key")?  There's no one answer for how to do this, so it will probably depend on what kind of terms you want to find.
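
  To make the difference between the two notions of "key" concrete, here is a minimal sketch in plain Python.  It is not S-Space code and does not use VectorSpaceModel or processSpace; the function names, the crude tokenizer, and the toy documents are all made up for illustration, and the weighting is one common TF-IDF variant (raw count times log(N / document frequency)), not necessarily the one the package uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/digits (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_terms_by_frequency(doc, k=3):
    """Rank terms by raw count in the single document."""
    return [term for term, _ in Counter(tokenize(doc)).most_common(k)]


def top_terms_by_tfidf(doc, background_docs, k=3):
    """Rank terms by count * log(N / df), where df counts documents containing the term.

    The target document is included in the corpus, so df is never zero.
    """
    tf = Counter(tokenize(doc))
    corpus = [set(tokenize(d)) for d in background_docs] + [set(tf)]
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for terms in corpus if term in terms)
        scores[term] = count * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically so the result is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy data: one "resume" and two unrelated background documents.
RESUME = ("The developer wrote the Java services and the XML parsers. "
          "The Java code used the HTML templates and the Java tools.")
OTHERS = [
    "The manager led the sales team and the marketing staff.",
    "The nurse assisted the doctors and the patients.",
]

print(top_terms_by_frequency(RESUME))
print(top_terms_by_tfidf(RESUME, OTHERS))
```

  With raw counts, "the" comes out on top.  With the TF-IDF variant above, any term that appears in every document gets a weight of zero, so "the" and "and" drop out and "Java" ranks first.  How well this works depends heavily on what you use as the background collection (other resumes versus general text, for example).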

  Thanks,
  David