Faking tokenization


Joshua Cason

Mar 2, 2015, 2:01:54 PM
to dis...@factorie.cs.umass.edu
Hi All,

I'm working with a data set in CoNLL format for which I don't immediately have the original text. After eyeballing a few examples, I concluded that the tokenization was close enough to Factorie's deterministic tokenizer for now. So I imitated it: using the one-word-per-line (OWPL) tokenization from my data, I plugged new Tokens into each section of each document, just as the deterministic tokenizer does. But when I pull up the results in the REPL, I can't verify that the document "hasAnnotation" Token, i.e., I tried mydoc.hasAnnotation(classOf[cc.factorie.app.nlp.Token]) and got false.

My intuition is that the pipeline code I adapted from your codebase will determine that the Token prerequisite hasn't been satisfied and will re-tokenize the sentences. So I'm wondering whether anyone has insight into how to "fake" the tokenization so that the pipeline thinks it has already been done, or a suggestion for a better way to do this.
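For concreteness, the check in the REPL looks roughly like this (mydoc is the Document I populated with my own Tokens):

    // mydoc was built by adding Tokens directly, imitating the output
    // of the deterministic tokenizer
    mydoc.hasAnnotation(classOf[cc.factorie.app.nlp.Token])  // currently returns false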

Thanks!

Josh

Emma Strubell

Mar 2, 2015, 2:09:58 PM
to dis...@factorie.cs.umass.edu
Hi Josh,

I suspect you're missing the following line (where "doc" is the Document into which you're loading Tokens):
doc.annotators(classOf[Token]) = UnknownDocumentAnnotator.getClass

We do this tokenization "faking" in most, if not all, of our loaders (package app.nlp.load); I recommend checking out the OWPL loader (LoadOWPL.scala) for a simple example.
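For example, here's a minimal sketch of building a pre-tokenized Document by hand (myTokens is a hypothetical Seq[Seq[String]], one inner Seq of token strings per sentence; adapt the names to your own loader):

    import cc.factorie.app.nlp.{Document, Sentence, Token, UnknownDocumentAnnotator}

    // Build a Document from already-tokenized text, and register the Token and
    // Sentence annotations so downstream annotators won't try to re-tokenize.
    def fromPretokenized(myTokens: Seq[Seq[String]]): Document = {
      val doc = new Document("")
      doc.annotators(classOf[Token]) = UnknownDocumentAnnotator.getClass
      doc.annotators(classOf[Sentence]) = UnknownDocumentAnnotator.getClass
      for (sentenceTokens <- myTokens) {
        val sentence = new Sentence(doc)
        for (word <- sentenceTokens) {
          if (sentence.length > 0) doc.appendString(" ")
          new Token(sentence, word) // the Token constructor appends `word` to the document string
        }
        doc.appendString("\n")
      }
      doc
    }

After this, doc.hasAnnotation(classOf[Token]) should come back true, and the pipeline should treat tokenization as already done.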

Hope this helps!

Emma




Joshua Cason

Mar 2, 2015, 2:13:09 PM
to dis...@factorie.cs.umass.edu
Hi Emma,

Yes, that sounds like exactly what I was looking for. Thanks!

Josh

Pallika Kanani

Mar 2, 2015, 2:14:28 PM
to dis...@factorie.cs.umass.edu
Check out LoadConll2003 for some example code. Here's what I usually do:

import scala.collection.mutable.ArrayBuffer
import cc.factorie.app.nlp.{Document, Sentence, Token, UnknownDocumentAnnotator}
import cc.factorie.app.nlp.load.Load

class LoadWHATEVERNER extends Load {

  def fromSource(source: io.Source): Seq[Document] = {
    def newDocument(name: String): Document = {
      val document = new Document("").setName(name)
      document.annotators(classOf[Token]) = UnknownDocumentAnnotator.getClass    // register that we have token boundaries
      document.annotators(classOf[Sentence]) = UnknownDocumentAnnotator.getClass // register that we have sentence boundaries
      document
    }

    val documents = new ArrayBuffer[Document]
    var document = newDocument("doc-" + documents.length)
    documents += document
    var sentence = new Sentence(document)
    for (line <- source.getLines()) {
      if (line.length < 2) {
        // Sentence boundary: close out the current document and start a new one
        document.appendString("\n")
        document = newDocument("doc-" + documents.length)
        documents += document
        sentence = new Sentence(document)
      } else {
        val fields = line.split("\t")
        assert(fields.length == 2)
        val word = fields(0)
        val nerTag = fields(1)

        if (sentence.length > 0) document.appendString(" ")
        val token = new Token(sentence, word)
        token.attr += new LabeledWHATEVERNerTag(token, nerTag) // your task-specific label class
      }
    }
    documents
  }
}
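Usage would look something like this (a sketch; the file name is a placeholder, and LabeledWHATEVERNerTag stands in for your task-specific label class):

    // Hypothetical usage: parse a tab-separated token/label file into Documents.
    val docs = new LoadWHATEVERNER().fromSource(io.Source.fromFile("my-data.conll"))
    // Because the annotators were registered in newDocument, downstream
    // components should see tokenization and sentence splitting as done:
    docs.head.hasAnnotation(classOf[Token])  // expected: true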

Best,
Pallika



Joshua Cason

Mar 2, 2015, 2:24:08 PM
to dis...@factorie.cs.umass.edu
Hi Pallika,

That's pretty cool. Thanks for the sample.

Josh