Compound term tokens

David Webb

Jul 17, 2011, 11:10:45 AM

to S-Space Package Users

I have a question about compound terms and words. I understand the
concept behind the following option when building the S-Space.

-C, --compoundWords=FILE

However, it is impossible for me to know and document every possible
combination of compound words within the space of resumes and skill
sets, especially since new ones are added so frequently.

Is there a way to automatically create compound tokens based on a high
occurrence of specific words being found adjacent to one another in
the corpus?

For example:

gn java would return, as an example, perhaps one or more of the
following very common compound terms.
"java architect"
"j2ee architect"
"struts 1.1"
"weblogic 8.1"
"jboss 5.1.0"

gn windows might return
"windows nt"
"windows 2000"
"windows xp"

What is the proper mechanism to determine what I am looking for?

Thank you for the assistance.

David Webb

Jul 19, 2011, 4:30:30 PM

to s-spac...@googlegroups.com

Can anyone point me in the right direction here?

What is the terminology or algorithm for analyzing a corpus and then generating a list of the most commonly paired words?

To create my compound tokens, is it a certain manual process, or is there automation and mathematical hope? :)

I appreciate the help.

David Jurgens

Jul 19, 2011, 5:49:27 PM

to s-spac...@googlegroups.com

Hi David,

There is quite a bit of mathematical hope :) What you want is something to extract collocations, which are two-word phrases where the two tokens appear together more frequently than expected by chance. Extracting collocations can be done in many ways (see the linked wikipedia page), but luckily we do have a tool for finding them. The edu.ucla.sspace.tools.BigramExtractor class (though incorrectly named) will identify collocations according to the selected significance test; the tests are also built in. It's been a while since we used that class, so it may have a few rough edges, but it's got help documentation if you run it without any arguments, which can hopefully get you started extracting collocations. If you run into issues with it, let us know so we can make it more useful.

If you feel comfortable in Perl, there's also the Ngram Statistics Package, which will let you find compound words with more than two tokens.
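
To make the idea concrete, here is a minimal Python sketch of chi-squared scoring over adjacent word pairs. It is not the BigramExtractor implementation: the function name `chi_squared_bigrams` and the `min_count` parameter are made up for illustration, and the input is assumed to be an already tokenized list of words. Pairs that occur less often than chance would predict are skipped, since the statistic alone does not say in which direction a pair deviates.

```python
from collections import Counter


def chi_squared_bigrams(tokens, min_count=1):
    """Score adjacent word pairs with Pearson's chi-squared on a 2x2 table.

    Illustrative sketch only; names and defaults are assumptions.
    """
    first = Counter(tokens[:-1])    # how often each word starts a bigram
    second = Counter(tokens[1:])    # how often each word ends a bigram
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())       # total number of bigrams in the corpus

    scores = {}
    for (w1, w2), o11 in bigrams.items():
        if o11 < min_count:
            continue                # too rare to give a reliable score
        o12 = first[w1] - o11       # w1 followed by some other word
        o21 = second[w2] - o11      # some other word followed by w2
        o22 = n - o11 - o12 - o21   # bigrams containing neither in position
        if o11 * o22 <= o12 * o21:
            continue                # not positively associated
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        if denom == 0:
            continue                # degenerate table, score undefined
        scores[(w1, w2)] = n * (o11 * o22 - o12 * o21) ** 2 / denom
    return scores


if __name__ == "__main__":
    tokens = "windows xp a b windows xp c d".split()
    ranked = sorted(chi_squared_bigrams(tokens, min_count=2).items(),
                    key=lambda item: item[1], reverse=True)
    for (w1, w2), score in ranked:
        print(f"{score:.2f}\t{w1} {w2}")
```

With `min_count=1` on a tiny corpus like this one, pairs seen only once can score as high as the repeated pair, which is why a minimum count is worth setting on real data.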

Thanks,

David

David Webb

Jul 19, 2011, 7:47:19 PM

to S-Space Package Users

David,

Thank you for the education. I was able to create a list of
collocated terms using your utility.

I chose CHI_SQUARED as the SignificanceTest since it was implemented
and mentioned on the wiki page.

My biggest suggestion would be to add the -f option to pass a list of
the files to read (Similar to the lsa.jar)

-f, --fileList=FILE[,FILE...] a list of document files

With the -M option for min occurrences, is that the min occurrences in
each document, or in the entire corpus that is analyzed?

As for the output, what does the number in the first column represent?

I am thinking that if I can pass this several thousand documents and
set the -M to a high number like 100 or 1000, then the results should
be very common terms that are collocated. Does that sound like the
right approach to you?

Thank you very much!

David Jurgens

Jul 19, 2011, 8:56:09 PM

to s-spac...@googlegroups.com

> I chose CHI_SQUARED as the SignificanceTest since it was implemented
> and mentioned on the wiki page.

Great to hear you got it working. Chi Squared is a pretty common test for collocations, so I think it should serve you well. I forgot that I hadn't implemented a few of the listed tests, but we rarely use this tool (you might be the only person we know of who has tried it out :) ).

> My biggest suggestion would be to add the -f option to pass a list of
> the files to read (Similar to the lsa.jar)
> -f, --fileList=FILE[,FILE...] a list of document files

Good idea. I'll put that on the agenda, as it shouldn't be too hard to add.

> With the -M option for min occurrences, is that the min occurrences in
> each document, or in the entire corpus that is analyzed?

That's in the entire corpus. Some statistical tests are sensitive to low word frequencies, so the minFrequency option gives the user a chance to filter out potential collocations that barely appeared and don't have reliable data (e.g., consider a word pair where both words only show up once, right next to each other).
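
A small numeric illustration of that sensitivity, as a hedged Python sketch: the helper `chi_squared_2x2` and all of the counts below are invented for the example and do not come from the tool or from any real corpus. The first table describes two words that each occur once, adjacent to each other; the second describes a pair seen 500 times whose words also occur elsewhere.

```python
def chi_squared_2x2(o11, o12, o21, o22):
    """Pearson's chi-squared statistic for a 2x2 contingency table.

    o11: w1 followed by w2      o12: w1 followed by another word
    o21: another word then w2   o22: neither word in its position
    """
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


# Hypothetical corpus size, measured in bigrams.
n = 1_000_000

# A pair seen exactly once, where neither word appears anywhere else.
hapax = chi_squared_2x2(1, 0, 0, n - 1)

# A pair seen 500 times; the first word appears 2000 times, the second 1000.
common = chi_squared_2x2(500, 1500, 500, n - 2500)

if __name__ == "__main__":
    print("seen once:     ", hapax)
    print("seen 500 times:", common)
```

Under these made-up counts the one-off pair gets a score equal to the corpus size and outranks the pair seen 500 times, even though a single observation is hardly evidence of anything. A minimum frequency cutoff removes such pairs before they are scored.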

> As for the output, what does the number in the first column represent?

Those are the statistical test scores (in your case, for the Chi Squared statistic). Roughly speaking, the higher the value, the more likely the word pair is a collocation.

> I am thinking that if I can pass this several thousand documents and
> set the -M to a high number like 100 or 1000, then the results should
> be very common terms that are collocated. Does that sound like the
> right approach to you?

Definitely. Alternatively, you can still keep M low and then filter the output by keeping only those bigrams with a test score above a certain threshold. If you're just looking for frequent collocations, then the -M option should also help filter to meet your needs.
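
The threshold filtering could be sketched in Python roughly as follows. This rests on assumptions: the thread only establishes that the first column of the output is the test score, so the layout assumed here (score, then the two tokens, separated by whitespace) is a guess, as is the idea that a compound words file lists one phrase per line. The function name `select_compounds` and the sample lines are hypothetical.

```python
def select_compounds(lines, min_score):
    """Return phrases whose score (first column) is at least min_score.

    Assumes each line looks like "<score> <token1> <token2>"; this layout
    is an assumption, so adjust the parsing to the tool's real output.
    """
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue                      # blank or malformed line
        try:
            score = float(parts[0])
        except ValueError:
            continue                      # first column is not a number
        if score >= min_score:
            kept.append((score, " ".join(parts[1:])))
    kept.sort(key=lambda pair: pair[0], reverse=True)  # best scores first
    return [phrase for _, phrase in kept]


if __name__ == "__main__":
    # Made-up sample lines, not real tool output.
    sample = [
        "120.5 windows xp",
        "3.2 of the",
        "980.0 j2ee architect",
    ]
    for phrase in select_compounds(sample, min_score=100.0):
        print(phrase)
```

The resulting phrases could then be written to a file, one per line, and reviewed by hand before being passed to the -C option.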
