I have a corpus in a specific format on HDFS that I want to process with a UIMA pipeline. A document in the corpus can span multiple lines. From previous threads on the mailing list I gathered that I have to override DocumentTextExtractor, but that still does not let me handle documents that span multiple lines. After browsing the source code, I have the impression that the only way to handle my files is to implement my own RecordReader. Is there an easier way to solve this problem? Another question: is there a way to read XMI files from HDFS?
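To illustrate what I mean by multi-line documents, here is a minimal sketch of the grouping logic I would expect to end up inside a custom RecordReader. It is plain Java with no Hadoop or UIMA dependency. The class name, the method name and the `###END###` delimiter line are all made up for illustration; my real format differs, and this is not code from the library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MultiLineDocumentSplitter {

    // Hypothetical marker line that closes one document (an assumption,
    // not part of any real corpus format or library).
    static final String DELIMITER = "###END###";

    // Reads lines and groups them into documents: every line up to the
    // next delimiter line belongs to the same document.
    static List<String> split(BufferedReader reader) throws IOException {
        List<String> documents = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals(DELIMITER)) {
                // Delimiter reached: emit the accumulated document.
                documents.add(current.toString());
                current.setLength(0);
            } else {
                // Keep the line break so the document text stays multi-line.
                current.append(line).append('\n');
            }
        }
        // Emit a trailing document that has no closing delimiter.
        if (current.length() > 0) {
            documents.add(current.toString());
        }
        return documents;
    }

    public static void main(String[] args) throws IOException {
        String corpus = "first line\nsecond line\n###END###\nonly line\n###END###\n";
        List<String> docs = split(new BufferedReader(new StringReader(corpus)));
        System.out.println(docs.size() + " documents");
    }
}
```

In a real RecordReader the same loop would additionally have to respect input split boundaries, which is the part I was hoping to avoid writing myself.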
Greetings from Bauhaus
Thanks