Raphael,

If you have a moment, this question could only be answered by you. It looks like the tagging is done within the AtD code. I was wondering why you chose to do this, rather than use an available 3rd party PoS tagger? Was it for performance reasons? (I know your design document mentioned that performance was considered critical for the service). For my needs, accuracy is going to trump performance and I'm considering swapping out the tagger with a more accurate one. As I develop rules, I'm finding that I need to get very creative in order to work around improperly tagged sentences. I end up having to considerably narrow the scope of my rules, or abandon the rule completely due to so many false positives. If you have a minute, I'd be interested in hearing your thoughts on the PoS tagging and the idea of swapping that portion out. Any pitfalls? Suggestions? Warnings?
Also, if anyone else has modified, or attempted to modify, the AtD PoS tagging, I would love to hear your experiences.
Hi Jay,

Next to myself, you're the most involved in AtD rule development of anyone I have interacted with to this point. There may be someone else doing it in secret, but I don't know who they are. :)
AtD's tagger is trigram-based, and it uses a few rules to correct some of the trigram tagger's output. I don't write rules assuming an accurate tagging of a sentence; I write rules based on how the tagger interprets an incorrect sentence. The trigram tagger is not as accurate as an HMM tagger, but in a situation where a lot of the text may be wrong, it doesn't make sense to have a super-accurate tagger either. An error in the sentence (what we're checking for!) may throw the tagger off. LanguageTool uses an even simpler dictionary-based tagger for similar reasons; see: http://languagetool.wikidot.com/developing-a-tagger-dictionary
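For anyone following along, the general shape of a trigram tagger with a rule-based fix-up pass can be sketched in a few lines of Python. This is only an illustration, not AtD's code: the function names (`train`, `tag`, `apply_fixups`), the toy corpus, the add-one smoothing, the greedy left-to-right search, and the example fix-up rule are all assumptions made up for this sketch.

```python
from collections import Counter, defaultdict

START = "<s>"  # padding tag for the start of a sentence


def train(tagged_sentences):
    """Count (tag-2, tag-1, tag) trigrams and per-word tag frequencies."""
    trigrams = Counter()
    lexicon = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev2, prev1 = START, START
        for word, tag_ in sentence:
            trigrams[(prev2, prev1, tag_)] += 1
            lexicon[word.lower()][tag_] += 1
            prev2, prev1 = prev1, tag_
    return trigrams, lexicon


def tag(words, trigrams, lexicon, default="NN"):
    """Greedy tagging: score each candidate tag by how often the word took
    that tag, times how often the tag followed the two previous tags
    (add-one smoothed). Unknown words get the default tag."""
    tags = []
    prev2, prev1 = START, START
    for word in words:
        candidates = lexicon.get(word.lower())
        if not candidates:
            best = default
        else:
            best = max(
                candidates,
                key=lambda t: candidates[t] * (trigrams[(prev2, prev1, t)] + 1),
            )
        tags.append(best)
        prev2, prev1 = prev1, best
    return tags


def apply_fixups(tags, fixups):
    """Post-correct tags with (previous_tag, wrong_tag, right_tag) rules."""
    fixed = list(tags)
    for i in range(1, len(fixed)):
        for prev_tag, wrong, right in fixups:
            if fixed[i - 1] == prev_tag and fixed[i] == wrong:
                fixed[i] = right
    return fixed


# Toy training data (hypothetical), using Penn Treebank style tags.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("she", "PRP"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS"), ("scored", "VBD")],
    [("the", "DT"), ("runs", "NNS"), ("ended", "VBD")],
]
trigrams, lexicon = train(corpus)

print(tag(["the", "dog", "runs"], trigrams, lexicon))
print(tag(["the", "runs"], trigrams, lexicon))

# Hypothetical fix-up: a verb tag right after a determiner becomes a plural noun.
print(apply_fixups(["DT", "VBZ"], [("DT", "VBZ", "NNS")]))
```

In this toy setup the same word ("runs") gets a different tag depending on the two tags before it, which is also why an error earlier in a sentence can change the tags that follow it.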
If you change out AtD's tagger, beware that it may break a lot of rules as they rely on the output of the existing tagger. When I developed my tagger training and evaluation sets, I used the following two taggers:
Fun trivia: these taggers arrive at the same result except for a few cases. The speed difference between the two is incredible, though. The .jp tagger went through and helped me build my data sets in minutes. The Stanford tagger had to run for a weekend to do the same thing.
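If you do try a different tagger, one way to gauge how many rules might be affected is to run both taggers over the same tokens and look at where their tags differ. A rough Python sketch follows; the `agreement` helper and the sample tag sequences are made up for illustration and are not output from any of the taggers mentioned here.

```python
def agreement(tags_a, tags_b):
    """Return (fraction of tokens tagged identically, indexes that differ)."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both taggers must tag the same tokens")
    diffs = [i for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b]
    rate = 1.0 - len(diffs) / len(tags_a) if tags_a else 1.0
    return rate, diffs


# Hypothetical output of two taggers for the same four tokens.
words = ["I", "saw", "her", "duck"]
tagger_a = ["PRP", "VBD", "PRP$", "NN"]
tagger_b = ["PRP", "VBD", "PRP", "VB"]

rate, diffs = agreement(tagger_a, tagger_b)
print(f"agreement: {rate:.0%}")
for i in diffs:
    print(f"{words[i]}: {tagger_a[i]} vs {tagger_b[i]}")
```

The tokens that come back in `diffs` are the places where rules written against the existing tagger's output are most likely to behave differently.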
On Fri, Jun 17, 2011 at 4:55 PM, jf <jfis...@gmail.com> wrote:
> Raphael,
> If you have a moment, this question could only be answered by you. It looks
> like the tagging is done within the AtD code. I was wondering why you chose
> to do this, rather than use an available 3rd party PoS tagger? Was it for
> performance reasons? (I know your design document mentioned that
> performance was considered critical for the service). For my needs,
> accuracy is going to trump performance and I'm considering swapping out the
> tagger with a more accurate one. As I develop rules, I'm finding that I
> need to get very creative in order to work around improperly tagged
> sentences. I end up having to considerably narrow the scope of my rules, or
> abandon the rule completely due to so many false positives. If you have a
> minute, I'd be interested in hearing your thoughts on the PoS tagging and
> the idea of swapping that portion out. Any pitfalls? Suggestions?
> Warnings?
>
> Also, if anyone else has modified, or attempted to modify, the AtD PoS
> tagging, I would love to hear your experiences.
>
> Thanks,
> Jay
>
> --
> You received this message because you are subscribed to the Google Groups
> "AtD Developers" group.
> To post to this group, send email to atd-developers@googlegroups.com
> To unsubscribe from this group, send email to
> atd-developers+unsubscribe@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/atd-developers?hl=en
Raphael,

Thanks for the background on the tagging. I was considering the Stanford parser, which you mentioned. Sounds like it would not be worth the extra effort.
On Sat, Jun 18, 2011 at 12:43 AM, Raphael Mudge <rsmu...@gmail.com> wrote:
> Hi Jay,
> Next to myself, you're the most involved in AtD rule development of
> anyone I have interacted with to this point. There may be someone else
> doing it in secret, but I don't know who they are. :)
>
> AtD's tagger is trigram based and it uses a few rules to correct some
> of the trigram tagger output. I don't write rules assuming an accurate
> tagging of a sentence. I write rules based on how the tagger
> interprets an incorrect sentence. The trigram tagger is not as
> accurate as an HMM tagger, but in a situation where a lot of the text
> may be wrong, it doesn't make sense to have a super-accurate tagger
> either. An error in the sentence (what we're checking for!) may throw
> the tagger off. LanguageTool uses an even simpler dictionary-based
> tagger for similar reasons, see:
> http://languagetool.wikidot.com/developing-a-tagger-dictionary
>
> If you change out AtD's tagger, beware that it may break a lot of
> rules as they rely on the output of the existing tagger. When I
> developed my tagger training and evaluation sets, I used the following
> two taggers:
>
> Fun trivia, these taggers arrive at the same result except for a few
> cases. The speed difference between the two is incredible though. The
> .jp tagger went through and helped me build my data sets in minutes.
> The Stanford tagger had to run for a weekend to do the same thing.
>
> -- Raphael