I have two approaches to the solution.
SOLUTION 1:
The first is a 5-step approach based on cooccurrence and word frequencies.
It finds words that describe roughly the same topic.
Examples from result (some are good, some are bad):
london pm 03.15 12.00 mackenzie 03/08/2001 07/08/2001 06/08/2001 stoy hayward bdo
research breast associate cancer graduate student sims kath jobling flint pam
hospital kilmallock loughfarm limerick nhs kilteely university macmillan nurse cancer trust
advice citizens bureau travel centres albans bately office telecom 1782 chickenley
windows desktop pocket amigaos portable qnx 12.30 linux cardiff xp 03.00
risk assessment assesment assessments margin madness coverage cancer r.a mine cardiac
oil 0.6-1.7 glue 0.3-2.7 sulfur wool refineries 14/17 5/14 0.5-2.2 soya
FULL:
http://seelf.com/result-50-3000.txt
Took about 60 minutes to finish on a single core.
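The five steps are not spelled out above, so the following is only a minimal sketch of one common way to combine cooccurrence with word frequencies, not the actual implementation. It assumes that "cooccurrence" means two words appearing in the same document and it scores candidates by pointwise mutual information (PMI), which divides the joint count by the individual frequencies. The function name `related_words` and the parameters `top_n` and `min_count` are hypothetical.

```python
import math
from collections import Counter


def related_words(docs, seed, top_n=10, min_count=2):
    """Rank words by PMI with `seed` (assumed scoring, not the original 5 steps).

    docs: list of token lists; two words cooccur if they share a document.
    """
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both seed and word
    for tokens in docs:
        vocab = set(tokens)
        word_df.update(vocab)
        if seed in vocab:
            for w in vocab:
                if w != seed:
                    pair_df[w] += 1

    n = len(docs)
    scores = {}
    for w, c in pair_df.items():
        if c < min_count:  # drop pairs seen too rarely to trust
            continue
        # PMI = log( P(seed, w) / (P(seed) * P(w)) )
        scores[w] = math.log(c * n / (word_df[seed] * word_df[w]))

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Dividing by the individual frequencies penalises words that appear everywhere, which tends to favour rare, topic-specific terms. The flip side is that rare tokens such as dates and numeric ranges can score highly when they happen to share documents with the seed word.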
SOLUTION 2:
The other one is a very simple 1-step algorithm that doesn't consider word
frequencies.
It finds words that are similar even though they are about different
topics.
Examples from result:
london uk british 2 year manchester city scottish american english 1
research work project information development studies group support study education programme
hospital road school area college hospitals house centre university community street
advice information support services work service issues website team system policy
windows window use microsoft office software system win computer 2 linux
risk project information data waste knowledge performance change quality time stress
oil water gas food fuel energy house oils 2 1 chemical
FULL:
http://seelf.com/result-interc.txt
Took about 40 minutes to finish on a single core.
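The 1-step algorithm is not described in detail either, so this is a sketch under the assumption that it simply ranks words by how often they share a document with the query word, with no correction for how frequent each word is overall. The function name `top_cooccurring` and the parameter `top_n` are hypothetical.

```python
from collections import Counter


def top_cooccurring(docs, seed, top_n=10):
    """Rank words by raw cooccurrence count with `seed`.

    Assumed behaviour for illustration: no frequency normalisation at all.
    docs: list of token lists; two words cooccur if they share a document.
    """
    counts = Counter()
    for tokens in docs:
        vocab = set(tokens)
        if seed in vocab:
            counts.update(vocab - {seed})

    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_n]]
```

Without normalisation, words that are frequent everywhere rise to the top. That would be consistent with generic entries such as "information", "2" and "1" showing up in the lists above, though this is a guess about the cause.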