Hi All,
I'm working with a data set in CoNLL format for which I don't immediately have the original text. After eyeballing a few examples, I concluded that the tokenization is close enough to factorie's deterministic tokenizer for now. So I imitated it: I plugged new tokens into each section of each document, the way the deterministic tokenizer does, but using the one-word-per-line (owpl) tokenization from my data. But when I pull up the results in the REPL, I can't verify that the document has the Token annotation; mydoc.hasAnnotation(classOf[cc.factorie.app.nlp.Token]) returns false.
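For reference, the construction I'm doing looks roughly like this (a minimal sketch of my code, not a verbatim excerpt; documentFromOwpl is just my helper name, and I'm imitating the Token/Sentence constructors I saw in the load classes, so the exact signatures may be slightly off):

    import cc.factorie.app.nlp.{Document, Sentence, Token}

    // My helper: build a factorie Document from pre-tokenized owpl data,
    // mimicking what the deterministic tokenizer would have produced.
    def documentFromOwpl(name: String, sentences: Seq[Seq[String]]): Document = {
      val doc = new Document("").setName(name)
      for (tokenStrings <- sentences) {
        // Append an empty Sentence to the document's default Section
        val sentence = new Sentence(doc.asSection)
        // Append each Token string, extending the sentence and the document text
        for (s <- tokenStrings)
          new Token(sentence, s)
      }
      doc
    }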
My intuition is that the pipeline code I adapted from your codebase will determine that the Token prerequisite hasn't been satisfied and will re-tokenize the sentences. So I'm wondering if anyone has insight into how to "fake" the tokenization so that the pipeline thinks it has already been done, or a suggestion for a better way to do this.
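For concreteness, the kind of "faking" I had in mind, based on my reading of the load classes, is to register a placeholder annotator in the document's annotators map, something like the following (assuming I'm reading UnknownDocumentAnnotator and the annotators map correctly; please correct me if not):

    import cc.factorie.app.nlp.{Sentence, Token, UnknownDocumentAnnotator}

    // Record that Token and Sentence annotations are already present,
    // so a downstream pipeline should not schedule re-tokenization.
    doc.annotators(classOf[Token]) = UnknownDocumentAnnotator.getClass
    doc.annotators(classOf[Sentence]) = UnknownDocumentAnnotator.getClass

    // After this I would expect:
    //   doc.hasAnnotation(classOf[cc.factorie.app.nlp.Token]) == true

But I don't know whether that is the sanctioned approach or whether it has side effects elsewhere.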
Thanks!
Josh