Faking tokenization


Joshua Cason

Mar 2, 2015, 2:01:54 PM
to dis...@factorie.cs.umass.edu
Hi All,

I'm working with a data set in CoNLL format for which I don't immediately have the original text. After eyeballing a few examples, I concluded that the tokenization was close enough to Factorie's deterministic tokenizer for now. So I imitated it: using the one-word-per-line (OWPL) tokenization from my data, I plugged new Tokens into each section of each document, just as the deterministic tokenizer does. But when I pull up the results in the REPL, I can't verify that the document "hasAnnotation" Token, i.e., I tried mydoc.hasAnnotation(classOf[cc.factorie.app.nlp.Token]) and got false.

My intuition is that the pipeline code I adapted from your codebase will determine that the Token prerequisite hasn't been satisfied and will re-tokenize the sentences. So I'm wondering whether anyone has insight into how to "fake" the tokenization so that the pipeline thinks it has already been done, or a suggestion for a better way to do this.
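For concreteness, the check in the REPL looks roughly like this (mydoc is the Document I populated with my own Tokens):

    // mydoc was built by adding Tokens directly, imitating the output
    // of the deterministic tokenizer
    mydoc.hasAnnotation(classOf[cc.factorie.app.nlp.Token])  // currently returns false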

Thanks!

Josh

Emma Strubell

Mar 2, 2015, 2:09:58 PM
to dis...@factorie.cs.umass.edu
Hi Josh,

I suspect you're missing the following line (where "doc" is the Document into which you're loading Tokens):
doc.annotators(classOf[Token]) = UnknownDocumentAnnotator.getClass

We do this tokenization "faking" in most, if not all, of our loaders (package app.nlp.load); I recommend checking out the OWPL loader (LoadOWPL.scala) for a simple example.
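For example, here's a minimal sketch of building a pre-tokenized Document by hand (myTokens is a hypothetical Seq[Seq[String]], one inner Seq of token strings per sentence; adapt the names to your own loader):

    import cc.factorie.app.nlp.{Document, Sentence, Token, UnknownDocumentAnnotator}

    // Build a Document from already-tokenized text, and register the Token and
    // Sentence annotations so downstream annotators won't try to re-tokenize.
    def fromPretokenized(myTokens: Seq[Seq[String]]): Document = {
      val doc = new Document("")
      doc.annotators(classOf[Token]) = UnknownDocumentAnnotator.getClass
      doc.annotators(classOf[Sentence]) = UnknownDocumentAnnotator.getClass
      for (sentenceTokens <- myTokens) {
        val sentence = new Sentence(doc)
        for (word <- sentenceTokens) {
          if (sentence.length > 0) doc.appendString(" ")
          new Token(sentence, word) // the Token constructor appends `word` to the document string
        }
        doc.appendString("\n")
      }
      doc
    }

After this, doc.hasAnnotation(classOf[Token]) should come back true, and the pipeline should treat tokenization as already done.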

Hope this helps!

Emma




Joshua Cason

Mar 2, 2015, 2:13:09 PM
to dis...@factorie.cs.umass.edu
Hi Emma,

Yes, that sounds like exactly what I was looking for. Thanks!

Josh

Pallika Kanani

Mar 2, 2015, 2:14:28 PM
to dis...@factorie.cs.umass.edu
Check out LoadConll2003 for some example code. Here's what I usually do:

import scala.collection.mutable.ArrayBuffer
import cc.factorie.app.nlp.{Document, Sentence, Token, UnknownDocumentAnnotator}
import cc.factorie.app.nlp.load.Load

class LoadWHATEVERNER extends Load {

  def fromSource(source: io.Source): Seq[Document] = {
    def newDocument(name: String): Document = {
      val document = new Document("").setName(name)
      document.annotators(classOf[Token]) = UnknownDocumentAnnotator.getClass    // register that we have token boundaries
      document.annotators(classOf[Sentence]) = UnknownDocumentAnnotator.getClass // register that we have sentence boundaries
      document
    }

    val documents = new ArrayBuffer[Document]
    var document = newDocument("doc-" + documents.length)
    documents += document
    var sentence = new Sentence(document)
    for (line <- source.getLines()) {
      if (line.length < 2) {
        // Sentence boundary: close out the current document and start a new one
        document.appendString("\n")
        document = newDocument("doc-" + documents.length)
        documents += document
        sentence = new Sentence(document)
      } else {
        val fields = line.split("\t")
        assert(fields.length == 2)
        val word = fields(0)
        val nerTag = fields(1)

        if (sentence.length > 0) document.appendString(" ")
        val token = new Token(sentence, word)
        token.attr += new LabeledWHATEVERNerTag(token, nerTag) // your task-specific label class
      }
    }
    documents
  }
}
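Usage would look something like this (a sketch; the file name is a placeholder, and LabeledWHATEVERNerTag stands in for your task-specific label class):

    // Hypothetical usage: parse a tab-separated token/label file into Documents.
    val docs = new LoadWHATEVERNER().fromSource(io.Source.fromFile("my-data.conll"))
    // Because the annotators were registered in newDocument, downstream
    // components should see tokenization and sentence splitting as done:
    docs.head.hasAnnotation(classOf[Token])  // expected: true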

Best,
Pallika



Joshua Cason

Mar 2, 2015, 2:24:08 PM
to dis...@factorie.cs.umass.edu
Hi Pallika,

That's pretty cool. Thanks for the sample.

Josh