The main challenge with entity extraction of German content is, surprisingly, not the language. Our natural language processing algorithms can be adapted to other languages with relatively little effort. Rather, the main challenge with German is the common use of the Gothic, or Fraktur, typeface in print. This script-like typeface is difficult to recognize even with the best OCR tools available today. Words, and especially names, can be misread because many of the characters in this typeface look extremely similar.
To improve precision, we have built a learning QA system. Our internal reviewers import the list of all surnames and given names produced by entity extraction into a QA program, which checks each instance against a German name authority. Words that are not found in the authority are sent to an interface where the reviewer can mark each one as “New” (a word to add to the authority), “Delete” (a word to delete from the database), or “Map” (a word that the OCR engine misread, but that we can correct).
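The review loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the class name `NameQA`, its methods, and the in-memory sets and dictionary are hypothetical stand-ins for the name authority, the database, and the reviewer interface. It also assumes that "learning" means reviewer decisions are remembered and applied automatically on later runs.

```python
from dataclasses import dataclass, field


@dataclass
class NameQA:
    """Hypothetical sketch of the name QA loop; all names here are assumptions."""

    authority: set                                    # known German names
    corrections: dict = field(default_factory=dict)   # OCR misreading -> corrected name
    rejected: set = field(default_factory=set)        # words marked "Delete"

    def check(self, names):
        """Split extracted names into accepted names and names needing review."""
        accepted, pending = [], []
        for name in names:
            # Apply any correction a reviewer has already recorded ("Map").
            name = self.corrections.get(name, name)
            if name in self.rejected:
                continue  # previously marked "Delete", so drop it silently
            if name in self.authority:
                accepted.append(name)
            else:
                pending.append(name)  # unknown: send to the reviewer
        return accepted, pending

    def review(self, name, action, target=None):
        """Record a reviewer decision so later runs can reuse it."""
        if action == "New":
            self.authority.add(name)
        elif action == "Delete":
            self.rejected.add(name)
        elif action == "Map":
            if target is None:
                raise ValueError("Map requires the corrected name")
            self.corrections[name] = target
        else:
            raise ValueError(f"unknown action: {action}")


# Illustrative data only; "Schmibt" stands in for an OCR misreading of "Schmidt".
qa = NameQA(authority={"Müller", "Schmidt"})
accepted, pending = qa.check(["Müller", "Schmibt", "Hofmann", "Xqz"])

qa.review("Schmibt", "Map", target="Schmidt")
qa.review("Hofmann", "New")
qa.review("Xqz", "Delete")

# A second pass over the same names now needs no manual review.
accepted_again, pending_again = qa.check(["Müller", "Schmibt", "Hofmann", "Xqz"])
```

Because each decision is stored, the number of words sent to the reviewer should shrink as more material is processed. A mapped word whose corrected form is still missing from the authority would simply return to the pending list.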