Edit NLTK CESS_ESP corpus?


Le Cras

May 14, 2016, 4:57:19 AM
to nltk-users

I finally trained a tagger for Spanish, based on the NLTK CESS_ESP corpus. Many Spanish words are missing from this corpus. I know where the corpus is saved, and I have opened several of its files. Is there a guide or a way to add more words to the existing data trees of this corpus?

nltk_data/corpora/cess_esp



Example of a tree:

(sn-SUJ
      (espec.mp
        (da0mp0 los el))
      (grup.nom.mp
        (ncmp000 abogados abogado)
        (sp
          (prep
            (sps00 de de))
          (sn
            (espec.fs
              (da0fs0 la el))
            (grup.nom.fs
              (ncfs000 empresa empresa))))))

Alexis

May 15, 2016, 7:08:00 PM
to nltk-...@googlegroups.com
There are several things you can do to make a better Spanish tagger.

1. *Any* tagger, no matter how much data you train it on, will have to deal with unknown words. Read the NLTK book to learn how to set up a chain of "backoff taggers", where each tagger hands a word it cannot tag to the next one. In the worst case you fall back to a "default tagger" that tags all unknown words as nouns.

2. To add more training data, there's no reason to tweak the `cess_esp` corpus itself. You can train a tagger with as much data as you can assemble from multiple sources. And don't bother with trees: you're training a tagger, so any POS-tagged corpus will do.

3. The combined training data must use the same set of POS tags, of course. While you'll find different tagsets in use, many NLTK corpora can be read with an alternative "universal" tagset, so if you don't need very precise tagging you can map all your training data to this. See the help for each corpus reader.

4. There are a number of tagged Spanish corpora in the NLTK (though of course you can also use non-NLTK corpora). Check out the Spanish parts of `conll2002` (the CoNLL 2002 corpus) and `universal_treebanks_v20`. There must be more among the corpora the NLTK distributes, but you'd probably be better off looking elsewhere for a large tagged corpus that you can download.
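
To make points 1 and 2 concrete, here is a minimal sketch of a backoff chain trained on data pooled from two sources. The sentences are toy stand-ins so the snippet runs without downloading anything; the helper name `build_tagger` and the choice of `ncms000` as the default noun tag are my own assumptions for illustration, so pick whatever noun tag suits your tagset.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Toy tagged sentences standing in for real training data.
# With the corpus installed you would use something like
# nltk.corpus.cess_esp.tagged_sents() instead of this list.
cess_like = [
    [("los", "da0mp0"), ("abogados", "ncmp000"), ("de", "sps00"),
     ("la", "da0fs0"), ("empresa", "ncfs000")],
]

# Extra sentences from some other source, using the SAME tagset.
extra = [
    [("la", "da0fs0"), ("casa", "ncfs000")],
]


def build_tagger(tagged_sents, default_tag="ncms000"):
    """Bigram -> unigram -> default backoff chain (hypothetical helper)."""
    # Last resort: tag every unknown word as a noun (assumed tag).
    default = DefaultTagger(default_tag)
    # Most frequent tag per word; unseen words go to the default tagger.
    unigram = UnigramTagger(tagged_sents, backoff=default)
    # Uses the previous tag as context; unseen contexts go to the unigram tagger.
    return BigramTagger(tagged_sents, backoff=unigram)


# Pool the sources by simple list concatenation, then train once.
tagger = build_tagger(cess_like + extra)

# "xyz" appears nowhere in the training data, so it reaches the default tagger.
result = tagger.tag(["los", "abogados", "de", "la", "casa", "xyz"])
print(result)
```

Note that "casa" is not in the first source at all; it gets tagged only because the second list was added to the training data, with no edits to the corpus files.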

Alexis

