PoS tagging


alex boulton

Jan 11, 2013, 6:29:19 AM
to ant...@googlegroups.com
Dear Laurence and all
  1. AntClaws requires prior installation of CLAWS from UCREL, is that right? But this is a paying service, as far as I know. Is there any way to PoS tag a corpus (about 800K words) free for easy use with AntConc? (Which is what makes AntConc so wonderful for teachers and students as well as corpus linguists -- free and easy as well as high quality!)
  2. Related to this, I've used Yasumasa Someya's lemma list successfully before with AntConc, but how would it work in conjunction with a tagged corpus?
I couldn't find a previous post on this, so I may be missing something very obvious... So thanks for any help, or tact in telling me if I'm asking a stupid question ;o)

May your 2013 continue as unmayanly as 2012
alex

Laurence Anthony

Jan 13, 2013, 8:16:05 PM
to ant...@googlegroups.com
Hi Alex,

> AntClaws requires prior installation of CLAWS from UCREL, is that right? But
> this is a paying service, as far as I know. Is there any way to PoS tag a
> corpus (about 800K words) free for easy use with AntConc? (Which is what
> makes AntConc so wonderful for teachers and students as well as corpus
> linguists -- free and easy as well as high quality!)

You are correct that AntClaws requires the paid CLAWS engine from
UCREL. If you do not have this, you might try using GoTagger, which is
a free and very simple tagging tool that I think is based on the Brill
tagger:

Unfortunately, the site link seems to be broken at the moment. I can
send you the program if you want.

> Related to this, I've used Yasumasa Someya's lemma list successfully before
> with Antconc, but how would it work in conjunction with a tagged corpus?

Someya's lemma list has some odd cases that need to be processed
carefully to be used successfully in AntConc. In particular, it
includes words with apostrophes and hyphens that in the default
setting of AntConc will be split in an inappropriate way. On my site
is an edited version of his lemma list with the problematic items
removed.
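
As a rough illustration of the kind of filtering involved (not necessarily how the edited list was actually produced), the Python sketch below drops forms containing apostrophes or hyphens. It assumes a `lemma -> form1, form2` line format; the function name and the sample entries are made up for the example.

```python
import re

# Characters that a default tokenizer may split on (assumption for this sketch).
PROBLEM_CHARS = re.compile(r"['’\-]")


def clean_lemma_lines(lines):
    """Drop problematic forms, and whole entries whose lemma is problematic."""
    cleaned = []
    for line in lines:
        if "->" not in line:
            continue  # skip blank lines and anything that is not an entry
        lemma, forms = line.split("->", 1)
        lemma = lemma.strip()
        if PROBLEM_CHARS.search(lemma):
            continue
        kept = [f.strip() for f in forms.split(",")
                if f.strip() and not PROBLEM_CHARS.search(f)]
        if kept:
            cleaned.append(f"{lemma} -> {', '.join(kept)}")
    return cleaned


# Hypothetical sample entries, for illustration only.
sample = [
    "be -> am, are, is, isn't, was, were",
    "co-operate -> co-operates, co-operated",
    "book -> books",
]
result = clean_lemma_lines(sample)
print("\n".join(result))
```

Whether to drop such items or rewrite them to match the tokenizer settings is a judgment call; dropping is simply the easier option to show.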

Using Someya's lemma list with a tagged corpus would be problematic
(unless, of course, you simply ignore all the tags and treat the
corpus as a plain text corpus - via the AntConc global settings). What
you would need to do is tag the words in the lemma list with tags that
match those in the corpus. Then, AntConc would work fine.
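
One possible way to do that tagging, sketched in Python under some loud assumptions: the corpus uses a word_TAG format, the lemma list uses `lemma -> form1, form2` lines, and every tagged form actually seen in the corpus is simply collected under its lemma. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def parse_lemma_list(lines):
    """Map each word form (and the lemma itself) to its lemma."""
    form_to_lemma = {}
    for line in lines:
        if "->" not in line:
            continue
        lemma, forms = (part.strip() for part in line.split("->", 1))
        form_to_lemma.setdefault(lemma, lemma)
        for form in forms.split(","):
            if form.strip():
                form_to_lemma.setdefault(form.strip(), lemma)
    return form_to_lemma


def tagged_lemma_entries(corpus_text, form_to_lemma):
    """Collect the word_TAG tokens found in the corpus under each lemma."""
    entries = defaultdict(set)
    for token in corpus_text.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            continue  # token carries no tag
        lemma = form_to_lemma.get(word.lower())
        if lemma:
            entries[lemma].add(f"{word.lower()}_{tag}")
    return {lemma: sorted(forms) for lemma, forms in entries.items()}


# Hypothetical sample data, for illustration only.
lemma_lines = ["read -> reads, reading", "book -> books"]
corpus = "He_PRP reads_VBZ a_DT book_NN ._. He_PRP was_VBD reading_VBG books_NNS ._."

tagged = tagged_lemma_entries(corpus, parse_lemma_list(lemma_lines))
for lemma, forms in tagged.items():
    print(f"{lemma} -> {', '.join(forms)}")
```

Note that this lumps together forms that share a spelling but differ in part of speech (a verbal and a nominal "book" would both land under one lemma), so a real list would need the lemma side to carry tag information as well.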

What I would recommend is that you create a completely new lemma list
with the tag information incorporated from the beginning. Actually,
I'm going to be making this as part of a different project very
shortly utilizing the lemma information in the BNC. I'll upload it to
my site when it's finished. (Still, the tags will have to match your
corpus tags.)


> May your 2013 continue as unmayanly as 2012
> alex

Happy new year to you, too!

(I'll see you at AACL 2013 - I'm presenting immediately after you!)

Laurence.

alex boulton

Jan 14, 2013, 3:37:51 AM
to ant...@googlegroups.com
Thanks Laurence.
I used your version of Someya's lemma list from the AntConc page with no problem.
I'm sure I could get my corpus tagged using CLAWS, but I'm keen only to use free things available to students. Then it's a question of juggling how to combine PoS-tagging and lemmatisation (the latter without the former can lead to a number of problems).
Snow forecast here, hope I'll make it to the AACL on time! Looking forward to seeing you there, and no doubt hassling you about AntConc too ;o)
Cheers
alex

Laurence Anthony

Jan 14, 2013, 3:50:17 AM
to ant...@googlegroups.com
On Mon, Jan 14, 2013 at 5:37 PM, alex boulton
<alex.b...@univ-lorraine.fr> wrote:
> Thanks Laurence.
> I used your version of Someya's lemma list from the AntConc page with no
> problem.
> I'm sure I could get my corpus tagged using CLAWS, but I'm keen only to use
> free things available to students. Then it's a question of juggling how to
> combine PoS-tagging and lemmatisation (the latter without the former can
> lead to a number of problems).
> Snow forecast here, hope I'll make it to the AACL on time! Looking forward
> to seeing you there, and no doubt hassling you about AntConc too ;o)
> Cheers
> alex
>

Can you just confirm what you want the Word List results to look like
(if POS and lemmas are combined)?

For example, say you have the following three-sentence corpus:
He_PRP reads_VBZ a_DT book_NN ._.
He_PRP is_VBZ reading_VBG the_DT book_NN ._.
He_PRP was_VBD reading_VBG books_NNS ._.

What results list (Word List) do you want AntConc to generate?
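
To make the options concrete, here is a small Python sketch that computes three candidate word lists from the corpus above. The lemma map and the "first letter of the tag" coarse class are assumptions made purely for illustration; none of this reflects how AntConc works internally.

```python
from collections import Counter

corpus = """He_PRP reads_VBZ a_DT book_NN ._.
He_PRP is_VBZ reading_VBG the_DT book_NN ._.
He_PRP was_VBD reading_VBG books_NNS ._."""


def split_token(token):
    """Split word_TAG on the last underscore and lowercase the word."""
    word, _, tag = token.rpartition("_")
    return word.lower(), tag


tokens = [split_token(t) for t in corpus.split()]
tokens = [(w, t) for w, t in tokens if w.isalpha()]  # drop punctuation

# Option A: every word_TAG pair is its own type.
by_word_tag = Counter(f"{w}_{t}" for w, t in tokens)

# Option B: tags are ignored entirely.
by_word = Counter(w for w, _ in tokens)

# Option C: forms grouped by lemma plus a coarse POS class.
# LEMMAS is a hypothetical hand-made map; the coarse class is just the
# first letter of the tag (a crude assumption for this sketch).
LEMMAS = {"reads": "read", "reading": "read", "is": "be", "was": "be",
          "books": "book"}
by_lemma_pos = Counter(f"{LEMMAS.get(w, w)}_{t[0]}" for w, t in tokens)

for name, counts in [("word_TAG", by_word_tag), ("word", by_word),
                     ("lemma_POS", by_lemma_pos)]:
    print(name, counts.most_common())
```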

Laurence.

p.s. It's snowing here in Tokyo, too!

Robert Fuchs

Jan 14, 2013, 4:01:21 AM
to ant...@googlegroups.com
Hi Alex,

I have successfully used the Stanford tagger in the past (which is free), and using regex I searched the output with AntConc: http://nlp.stanford.edu/software/tagger.shtml
They also have a lemmatiser, which is part of the following (haven't tried this yet): http://nlp.stanford.edu/software/corenlp.shtml
I guess you can only really make use of these tools via the command line. Making sure everything works correctly with Java in your PATH and all this may be too complicated for students, depending on their background. Once everything is in place I think they should manage.

Best wishes,
Robert
--
You received this message because you are subscribed to the Google Groups "AntConc-discussion" group.
To view this discussion on the web visit https://groups.google.com/d/msg/antconc/-/zw6d80UMj3EJ.
To post to this group, send email to ant...@googlegroups.com.
To unsubscribe from this group, send email to antconc+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/antconc?hl=en.

alex boulton

Jan 14, 2013, 4:08:24 AM
to ant...@googlegroups.com
My understanding is that AntConc doesn't "understand" tags, and just treats a tagged item as a longer stretch of text. In which case the tagged corpus would only really be useful at the word level, and an untagged corpus much more appropriate for longer strings.
Not just for the word list, but to be able to search for e.g. use*_NN* (i.e. all forms of use as a noun, inevitably including usefulness etc.), to distinguish from use*_VV* (forms of use as a verb). I experimented with a segment of the corpus following CLAWS tagging: it worked fine for this type of thing, but I'll need to experiment further.
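
For anyone wanting to try the same kind of search outside AntConc, the Python sketch below translates a wildcard pattern such as use*_NN* into a regular expression over word_TAG tokens. It assumes `*` means "zero or more characters within the word or the tag" and that tokens are separated by whitespace; the function name and the sample sentence (with CLAWS-style tags) are invented for the example.

```python
import re


def wildcard_to_regex(pattern):
    """Turn e.g. 'use*_NN*' into a regex matching whole word_TAG tokens."""
    parts = [re.escape(p) for p in pattern.split("*")]
    # '*' may not cross whitespace or the underscore separating word and tag.
    body = r"[^\s_]*".join(parts)
    return re.compile(rf"(?<!\S){body}(?!\S)", re.IGNORECASE)


# Hypothetical sample text, for illustration only.
text = ("Its_APPGE usefulness_NN1 depends_VVZ on_II use_NN1 ;_; "
        "they_PPHS2 used_VVD it_PPH1 and_CC use_VV0 it_PPH1 still_RR ._.")

nouns = wildcard_to_regex("use*_NN*").findall(text)
verbs = wildcard_to_regex("use*_VV*").findall(text)
print(nouns)
print(verbs)
```

As noted above, the noun search inevitably picks up words like usefulness along with the forms of use itself.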
Presumably keyword analysis would only work with a reference corpus with exactly the same tagset?
I have to say these are basic things I haven't done before, being generally a parasite of other people's corpus linguistics work rather than a corpus linguist myself. ;o)
alex

alex boulton

Jan 14, 2013, 4:11:54 AM
to ant...@googlegroups.com
Thanks Robert. I thought initially of CLAWS as Laurence mentions it on the AntConc page, but will certainly give Stanford a go too.
Cheers
alex

Laurence Anthony

Jan 14, 2013, 4:29:54 AM
to ant...@googlegroups.com
On Mon, Jan 14, 2013 at 6:08 PM, alex boulton
<alex.b...@univ-lorraine.fr> wrote:
> My understanding is that AntConc doesn't "understand" tags, and just treats
> a tagged item as a longer stretch of text. In which case the tagged corpus
> would only really be useful at the word level, and an untagged corpus much
> more appropriate for longer strings.
> Not just for the word list, but to be able to search for e.g. use*_NN* (i.e.
> all forms of use as a noun, inevitably including usefulness etc.), to
> distinguish from use*_VV* (forms of use as a verb). I experimented with a
> segment of the corpus following CLAWS tagging: it worked fine for this type
> of thing, but I'll need to experiment further.
> Presumably keyword analysis would only work with a reference corpus with
> exactly the same tagset?
> I have to say these are basic things I haven't done before, being generally
> a parasite of other people's corpus linguistics work rather than a corpus
> linguist myself. ;o)
> alex

Is this a response to my question below?

>> Can you just confirm what you want the Word List results to look like
>> (if POS and lemmas are combined)?
>>
>> For example, say you have the following three-sentence corpus:
>> He_PRP reads_VBZ a_DT book_NN ._.
>> He_PRP is_VBZ reading_VBG the_DT book_NN ._.
>> He_PRP was_VBD reading_VBG books_NNS ._.
>>
>> What results list (Word List) do you want AntConc to generate?

I'm a bit confused how your response answers the questions.

Regarding keywords lists (in combination with lemma lists?), again,
I'm not exactly sure what kind of results you're looking for. On my
YouTube site, I give a demonstration of how to generate lemma
keywords, but I think you might be wanting something else.

Laurence.

Laurence Anthony

Jan 14, 2013, 4:34:17 AM
to ant...@googlegroups.com
On Mon, Jan 14, 2013 at 6:01 PM, Robert Fuchs <robert....@gmail.com> wrote:
> Hi Alex,
>
> I have successfully used the Stanford tagger in the past (which is free),
> and using regex I searched the output with AntConc:
> http://nlp.stanford.edu/software/tagger.shtml
> They also have a lemmatiser, which is part of the following (haven't tried
> this yet) http://nlp.stanford.edu/software/corenlp.shtml
> I guess you can only really make use of these tools via the command line.
> Making sure everything works correctly with Java in your PATH and all this
> may be too complicated for students, depending on their background. Once
> everything is in place I think they should manage.
>
> Best wishes,
> Robert
>

Hi Robert,

Before answering Alex's original question, I had a look at how easy it
would be to use the Stanford Tagger. It comes with a simple Java based
graphical interface for tagging short extracts of text, but offers no
batch function for tagging a whole corpus. It says in the
documentation that for doing anything slightly complex, you need to go
to the command line.

Still, it's a very good tagger.
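
For the command-line route, a batch run over a folder of text files could be scripted roughly as follows. This is only a sketch: it assumes Java is on the PATH, that the tagger writes its tagged text to standard output, and that the jar and model paths (placeholders here) point at files from the Stanford download. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

TAGGER_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger"


def build_tagger_command(jar_path, model_path, text_file):
    """Assemble the java command for tagging a single text file."""
    return ["java", "-mx300m", "-classpath", str(jar_path), TAGGER_CLASS,
            "-model", str(model_path), "-textFile", str(text_file)]


def tag_corpus(corpus_dir, out_dir, jar_path, model_path):
    """Tag every .txt file in corpus_dir and save the output in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for text_file in sorted(Path(corpus_dir).glob("*.txt")):
        command = build_tagger_command(jar_path, model_path, text_file)
        # Tagged text is read from stdout; progress messages go to stderr.
        result = subprocess.run(command, capture_output=True, text=True,
                                check=True)
        (out_dir / text_file.name).write_text(result.stdout, encoding="utf-8")


# Placeholder paths (assumptions); adjust to the local installation.
example = build_tagger_command("stanford-postagger.jar",
                               "models/english-left3words-distsim.tagger",
                               "corpus/file01.txt")
print(" ".join(example))
```

Calling `tag_corpus("corpus", "tagged", jar, model)` with real paths would then produce one tagged file per input file, which can be loaded into AntConc as usual.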

Laurence.

Brian Schanding

Mar 18, 2013, 5:31:51 PM
to ant...@googlegroups.com
I'm glad I found this discussion. I hope you don't mind my taking it on just a slight detour for a second...
 
I've purchased the CLAWS tagger and can't even get it to open up to be used. The readme file is not much help. Any kind soul out there feel like explaining how to get it to operate? It's supposed to have a Java version that is more user friendly, but it won't open for me. And apparently I'm supposed to be able to use Windows command prompt to run also, but I can't get that working either.
 
Sorry to nose in with this question, but I am having no luck so far finding information that helps.

Laurence Anthony

Mar 18, 2013, 8:10:54 PM
to ant...@googlegroups.com
Dear Brian,

Your question is certainly beyond the scope of the AntConc discussion
list. However, I have another software tool (AntCLAWS-GUI) that might
be of help. Download this tool from my website, put it in the
same folder as your Claws.exe executable, and launch it. This will
provide a bridge to your CLAWS engine allowing you to type in
sentences directly and have them tagged, and also load in multiple
files for tagging. They will all be tagged as horizontal texts with a
Brown corpus _TAG format.

I would also recommend that you contact the developers of CLAWS who I
am sure will be happy to help you get the Java tools and command line
tool working.

I hope that helps!
Laurence.

Brian Schanding

Mar 21, 2013, 9:16:54 PM
to ant...@googlegroups.com
SO HELPFUL! I can't thank you enough, really! Works well, and works quickly! 
Brian

Laurence Anthony

Apr 11, 2013, 11:50:31 AM
to ant...@googlegroups.com
Dear Brian and All,

I'd like to announce a new version of AntCLAWS-GUI that fixes a bug causing apostrophe s to be mis-tagged. It is available for download from here:
http://www.antlab.sci.waseda.ac.jp/software.html


I made a few other little changes to the program so that it now reports to the user if the CLAWS engine is not in the same folder as the main program. I hope it's useful.

Laurence.






###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################



Brian Schanding

Apr 12, 2013, 11:45:21 AM
to ant...@googlegroups.com
Thanks for this update and for letting us know!