I chose CHI_SQUARED as the SignificanceTest since it was implemented
and mentioned on the wiki page.
Great to hear you got it working. Chi-squared is a pretty common test for collocations, so I think it should serve you well. I forgot that I hadn't implemented a few of the listed tests, but we rarely use this tool (you might be the only person we know of who has tried it out :) ).
My biggest suggestion would be to add the -f option to pass a list of
the files to read (Similar to the lsa.jar)
-f, --fileList=FILE[,FILE...] a list of document files
Good idea. I'll put that on the agenda, as it shouldn't be too hard to add.
With the -M option for min occurrences, is that the min occurrences in
each document, or in the entire corpus that is analyzed?
That's in the entire corpus. Some statistical tests are sensitive to low word frequencies, so the minFrequency option gives the user a chance to filter out potential collocations that barely appeared and don't have reliable data (e.g., consider a word pair that shows up only once, with the two words right next to each other).
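To make the corpus-wide counting concrete, here is a rough Python sketch. It is not the tool's actual code: the whitespace tokenization, the function names, and the "at least M" comparison are all assumptions made for illustration.

```python
from collections import Counter


def corpus_bigram_counts(documents):
    """Count adjacent word pairs across ALL documents (hypothetical helper)."""
    counts = Counter()
    for text in documents:
        # Assumption: lowercase + whitespace tokenization; the real tool may differ.
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def filter_by_min_frequency(counts, min_frequency):
    """Keep pairs whose corpus-wide count is at least min_frequency (assumed >=)."""
    return {pair: n for pair, n in counts.items() if n >= min_frequency}


docs = ["new york is large", "she moved to new york", "a new idea"]
counts = corpus_bigram_counts(docs)
frequent = filter_by_min_frequency(counts, 2)
print(frequent)
```

Here "new york" never occurs twice within a single document, but it survives a minimum of 2 because the count is taken over the whole corpus.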
As for the output, what does the number in the first column represent?
Those are the statistical test scores (in your case, for the Chi Squared statistic). Roughly speaking, the higher the value, the more likely the word pair is a collocation.
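For intuition about what such a score measures, below is a sketch of the textbook Pearson chi-squared statistic for a 2x2 table of bigram counts. I'm assuming the standard formula here; the tool's implementation may differ in details, and the function and argument names are made up for the example.

```python
def chi_squared(o11, o12, o21, o22):
    """Pearson's chi-squared for a 2x2 contingency table of bigram counts.

    o11: bigrams that are (w1, w2)
    o12: bigrams that start with w1 but do not end with w2
    o21: bigrams that end with w2 but do not start with w1
    o22: all other bigrams
    """
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        # A row or column is empty, so the statistic is undefined; report 0.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


# Two words that each occur exactly once, next to each other, among 10,000 bigrams.
hapax_score = chi_squared(1, 0, 0, 9999)
print(hapax_score)
```

The single-occurrence pair gets a score equal to the number of bigrams in the corpus, which looks like a very strong collocation even though it rests on one observation. That is the kind of unreliable result a minimum frequency is meant to filter out.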
I am thinking that if I can pass this several thousand documents and
set the -M to a high number like 100 or 1000, then the results should
be very common terms that are collocated. Does that sound like the
right approach to you?
Definitely. Alternatively, you can keep M low and then filter the output by keeping only those bigrams with a test score above a certain threshold. If you're just looking for frequent collocations, then the -M option should also help filter to meet your needs.
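If you go the threshold route, a small post-processing script along these lines might do it. The only thing taken from the actual output is that the score is in the first column; the whitespace separator, the rest of the line being the word pair, and the sample values below are assumptions, so adjust to whatever the file really looks like.

```python
def filter_by_score(lines, threshold):
    """Keep (score, bigram) entries whose score exceeds threshold, best first."""
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        try:
            score = float(parts[0])  # assumption: score is the first column
        except ValueError:
            continue  # skip lines that don't start with a number
        if score > threshold:
            kept.append((score, " ".join(parts[1:])))
    kept.sort(reverse=True)
    return kept


# Made-up lines in the assumed "score word1 word2" layout, for illustration only.
sample_output = [
    "5321.7 new york",
    "12.4 of the",
    "880.0 machine learning",
]
strong = filter_by_score(sample_output, 100.0)
print(strong)
```

For real output you would pass an open file object instead of the sample list. The right threshold depends on your corpus, so it is worth looking at the score distribution before picking one.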