It seems to me that just using co-occurrence frequencies should work
for this problem. I've posted a very simple Python implementation on
github:
https://github.com/tgflynn/NLP-Challenge and a brief
discussion on my blog here :
http://cogniception.com/wp/.
Running this on the posted dataset took about 20 minutes on my
machine. The results look reasonable. For example here is the output
line for the word hospital:
hospital trust nhs royal general university london hospital community
eye medical
One comment I have on the data is that it contains many non-ascii
characters, and even many words which consist solely of non-ascii
characters or digits. I haven't filtered these out but I suspect
doing so would substantially reduce the size of the set and make the
output easier to read.