No, no preprocessing is normally required. Whatever text you give it is simply used to estimate the probabilities. Note that line breaks matter, since the model also models the start of each line. Furthermore, contexts longer than 3-4 characters may make the model too sparse, since there is no back-off right now. The trickiest part of getting the language model to work is finding the right weights for characters, language models, and whitespace (specified with command-line parameters to ocropus-ngraphs during matching). These weights are a tradeoff that depends on how well your documents match your corpus, on document quality, and on recognizer quality.
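To illustrate the sparsity point, here is a minimal sketch of a character n-gram model without back-off. It is not the ocropus-ngraphs code; the function names, the padding symbol, and the toy corpus are all made up for illustration. It pads each line with a start marker (which is why line breaks in the corpus matter) and reports how often a held-out text hits a context that never occurred in training.

```python
from collections import Counter, defaultdict

START = "\x02"  # hypothetical line-start padding symbol (an assumption)
END = "\n"      # each corpus line is treated as ending in a line break


def train(lines, order):
    """Count next-character frequencies for each context of `order` characters."""
    counts = defaultdict(Counter)
    for line in lines:
        padded = START * order + line + END
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts


def prob(counts, context, ch):
    """Relative-frequency estimate with no back-off: unseen contexts give 0."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[ch] / sum(seen.values())


def unseen_fraction(counts, lines, order):
    """Fraction of positions in `lines` whose context never occurred in training."""
    total = missing = 0
    for line in lines:
        padded = START * order + line + END
        for i in range(order, len(padded)):
            total += 1
            if padded[i - order:i] not in counts:
                missing += 1
    return missing / total if total else 0.0


# Toy data, invented for this sketch only.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
held_out = ["the cat sat on the log", "a dog on a mat"]

fractions = []
for order in (1, 2, 3, 4):
    model = train(corpus, order)
    fractions.append(unseen_fraction(model, held_out, order))
    print(order, round(fractions[-1], 2))
```

With a fixed corpus, the fraction of unseen contexts can only stay the same or grow as the context gets longer, and without back-off every unseen context gets probability zero. More corpus text, or shorter contexts, is the only remedy in a model like this.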
Tom