German POS tagger and corpora

Earl Brown

Sep 1, 2015, 2:59:32 PM
to CorpLing with R
First, two questions and then the motivation for the questions:

(1) Is TreeTagger a good POS tagger for German? Are there others worth considering?
(2) Is the Database of Spoken German a good corpus for spontaneously spoken German?

Motivation for first question:
I'm helping an MA student with some POS tagging in German. I've successfully used the koRpus package, which uses the TreeTagger POS tagger, to tag a sample sentence in German (that Google Translate helped me make). I'm wondering which other POS taggers for German the corpus-linguistics-R-ists on this list have experience with.

Motivation for second question:
Also, another professor suggested that the student use the Database of Spoken German as the source of spontaneous spoken German for his study. As I don't study German (I'm merely providing the corpus linguistics support), I'm wondering whether another (publicly available) corpus might be better.

Thanks in advance for any words of wisdom.

Christophe Bechet

Sep 10, 2015, 10:10:53 AM
to CorpLing with R
I can't answer those questions since I'm a beginner, but I'd also like to ask a question about using TreeTagger through R. I can't manage to use it to tag a Dutch corpus. Many examples are given for English and some for French, but Dutch is much harder to get working. Here is an example of how to use treetag with a French text. When I try the same call with Dutch (lang="ndl", preset="dutch-utf8"), it fails. Where's the problem?

library(koRpus)

# fichier_exemple is assumed to hold the path of the text file to tag
treetag(fichier_exemple,
        treetagger = "manual",
        lang = "fr",
        TT.options = list(path = "C:/soft/TreeTagger", preset = "fr-utf8"))

Earl Brown

Sep 12, 2015, 9:17:36 PM
to CorpLing with R, meik.m...@hhu.de
It appears that while TreeTagger supports Dutch, koRpus does not. However, I'm sure the creator of koRpus would enjoy receiving help to enable koRpus to support Dutch. It's not too difficult; I helped koRpus learn Spanish, and others have helped with other languages.

If you want to use TreeTagger to tag Dutch without koRpus, you'll have to do so from the command line in a terminal window. I like using koRpus because it lets me do the pre- and post-processing in R around the TreeTagger step.


Christophe Bechet

Sep 14, 2015, 1:08:01 PM
to corplin...@googlegroups.com
OK, thank you for the information. I've managed to use TreeTagger from the terminal window, but the output of the POS tagging doesn't suit me. It produces a three-column format with one word per line. Is it possible instead to output running text with the tags attached inline to each word?


Earl Brown

Sep 14, 2015, 10:43:18 PM
to CorpLing with R, meik.m...@uni-duesseldorf.de
While I'm not aware of an argument in TreeTagger to change the output, you can do some good old-fashioned text manipulation on the tabular data format returned by TreeTagger. Here's a toy example:

# resulting tabular data from TreeTagger
tagged <- data.frame(
  word = c("The", "TreeTagger", "is", "easy", "to", "use", "."),
  pos = c("DT", "NP", "VBZ", "JJ", "TO", "VB", "SENT"),
  lemma = c("the", "TreeTagger", "be", "easy", "to", "use", ".")
)

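# good old-fashioned text manipulation: attach each tag to its word,
# then join the tokens with single spaces (no trailing space)
with(tagged, paste(paste0(word, "_", pos), collapse = " "))

# or with angle brackets
with(tagged, paste(paste0(word, "<", pos, ">"), collapse = " "))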
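
If the tagging was done from the terminal, the same idea could be applied to TreeTagger's three-column output directly. The following is only a minimal sketch: it assumes the output is tab-separated in the order word, tag, lemma, and that the sentence-final tag is "SENT" as in the toy example above (other tagsets use a different tag). The sample text and the object names are made up for illustration; in practice you would pass the path of your own output file to read.delim() instead of `text =`.

```r
# Hypothetical sample of tab-separated TreeTagger output (word, tag, lemma)
raw_output <- paste(
  "The\tDT\tthe",
  "TreeTagger\tNP\tTreeTagger",
  "is\tVBZ\tbe",
  "easy\tJJ\teasy",
  "to\tTO\tto",
  "use\tVB\tuse",
  ".\tSENT\t.",
  "It\tPP\tit",
  "works\tVBZ\twork",
  ".\tSENT\t.",
  sep = "\n"
)

# Read the three columns; quote = "" keeps quotation-mark tokens intact
tagged <- read.delim(
  text = raw_output,
  header = FALSE,
  col.names = c("word", "pos", "lemma"),
  quote = "",
  stringsAsFactors = FALSE
)

# Assumption: "SENT" marks the last token of a sentence in this tagset.
# A new sentence starts at the first token and after every "SENT" token.
is_end <- tagged$pos == "SENT"
sentence_id <- cumsum(c(TRUE, head(is_end, -1)))

# Attach each tag to its word, then rebuild one line of text per sentence
tokens <- paste0(tagged$word, "_", tagged$pos)
tagged_sentences <- unname(
  vapply(split(tokens, sentence_id), paste, character(1), collapse = " ")
)

writeLines(tagged_sentences)
```

Grouping by sentence is optional; dropping the `split()` step and collapsing `tokens` in one go gives a single line of tagged text, as in the toy example.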
