(...) "The reason I use “content mining” and not “Text and Data Mining” is that science consists of more than text – images, audio, video, code and much more. Text is the best known and the most immediately tractable, and many scientific disciplines have developed Natural Language Processing (NLP). In our group Lezan Hawizy, Peter Corbett, David Jessop, Daniel Lowe and others have developed ChemicalTagger, OSCAR, Patent Analysis, and OPSIN (http://www-pmr.ch.cam.ac.uk/wiki/Main_Page). So contentmine.org is exactly that – an org that mines content.
But words are often a poor way of representing science, and images are common. A general approach to processing all images is very hard, and two years ago I thought it was effectively impossible. However, with hard work some subsets can be tractable. Here we show you some of the possibilities in phylogenetic trees (evolutionary trees). What is described below is simple to follow and simple to carry out, but it took me some months of exploration to find the best strategy. And I owe a great debt to Noureddin Sadawi, who introduced me to thinning – I haven’t used his code but his experience was invaluable". (...)