apostrophe


Haidee Thomson

Feb 17, 2015, 9:42:47 PM
to ant...@googlegroups.com
Hello all,

I am new to AntConc and I have a couple of questions that you might be able to help me with:

1) I am interested in n-grams. Currently AntConc is showing the word "don't" as a 3-gram (don x t). How can I get the apostrophe counted as part of one of the words, so that "don't" is counted as a 2-gram?

2) I have multiple texts within one file. Do I need to separate these into individual files in order for AntConc to accurately identify the range?

Thank you in advance!
Haidee

Laurence Anthony

Feb 18, 2015, 4:43:13 PM
to ant...@googlegroups.com
Hi Haidee,

Your question 1) is a common one. Just append the apostrophe to the token definition in the AntConc global settings. Note, though, that many corpus linguists think this is a bad idea and that don and t should be treated as separate tokens.

For 2), yes, you will need to split your files because range is a measure of dispersion across files. With a small amount of scripting, you should be able to do the splitting automatically.
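As a rough illustration of the kind of splitting script meant here, the following Python sketch writes each text in a combined file to its own file. It assumes the texts are separated by a line containing only "=====". That delimiter, the function name split_texts, and the output file naming are all hypothetical, so adjust them to whatever actually marks the boundaries in your corpus.

```python
import re
from pathlib import Path
from typing import List


def split_texts(source: Path, out_dir: Path,
                delimiter: str = r"^=====[ \t]*$") -> List[Path]:
    """Write each text in `source` to its own file inside `out_dir`.

    The delimiter pattern is an assumption: a line consisting only of "=====".
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    content = source.read_text(encoding="utf-8")

    # Split on delimiter lines, then drop chunks that are empty after trimming.
    chunks = [c.strip() for c in re.split(delimiter, content, flags=re.MULTILINE)]
    chunks = [c for c in chunks if c]

    paths = []
    for i, chunk in enumerate(chunks, start=1):
        # Hypothetical naming scheme: <original name>_001.txt, _002.txt, ...
        path = out_dir / f"{source.stem}_{i:03d}.txt"
        path.write_text(chunk + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling split_texts(Path("corpus.txt"), Path("split")) would then produce one file per text in the "split" folder, which can be loaded into AntConc so that range is counted per text.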

I hope that helps.

Laurence.

Haidee Thomson

Feb 18, 2015, 10:07:26 PM
to ant...@googlegroups.com
Hi Laurence,

Thank you for your quick response. I had seen this apostrophe problem in the Google Group discussion threads, but I think I have misunderstood something about "appending the apostrophe". The instructions as I understand them are: Global settings - token definition - check the box "append following definition". The field below already has the apostrophe in it, so I simply check the box and hit "apply". Then I re-try the n-gram analysis, but nothing changes. Am I missing something?

I agree that don and t are two separate items, but my problem is with the apostrophe becoming another separate item (making what should be a 2-gram into a 3-gram), e.g. don x t rather than don t.

The scripting method sounds like a good shortcut. Do you have any idea what the script would be?

Many thanks,
Haidee


--
You received this message because you are subscribed to a topic in the Google Groups "AntConc-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antconc/JSHeCbDd0T0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antconc+u...@googlegroups.com.
To post to this group, send email to ant...@googlegroups.com.
Visit this group at http://groups.google.com/group/antconc.
For more options, visit https://groups.google.com/d/optout.



--

Haidee Thomson 
URL: http://haideethomson.com/

TEL: 080-4308-2853 

Laurence Anthony

Feb 18, 2015, 10:19:22 PM
to ant...@googlegroups.com
Haidee,

Let me clarify. An n-gram is a sequence of token strings, so a 2-gram is a sequence of two token strings. What is your definition of a token? Based on that definition, how do you want to divide <don't> up? There are really only two options: <don> + <t> or <don't>, depending on whether you count <'> as part of the definition of a token.
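To make the two options concrete, here is a minimal Python sketch of how the token definition changes the n-grams that get counted. It is not how AntConc itself is implemented. The function ngrams and the two regular expressions are illustrative assumptions that stand in for "letters only" and "letters plus apostrophe" token definitions.

```python
import re
from collections import Counter


def ngrams(text, n, token_pattern):
    """Count n-grams over the tokens matched by `token_pattern`."""
    tokens = re.findall(token_pattern, text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Two hypothetical token definitions (ASCII letters only, for simplicity).
LETTERS_ONLY = r"[a-z]+"             # <'> is not part of a token
LETTERS_AND_APOSTROPHE = r"[a-z']+"  # <'> is part of a token

text = "I don't know"
print(sorted(ngrams(text, 2, LETTERS_ONLY)))            # ['don t', 'i don', 't know']
print(sorted(ngrams(text, 2, LETTERS_AND_APOSTROPHE)))  # ["don't know", "i don't"]
```

With the first definition <don't> contributes the 2-gram "don t". With the second it is a single token. Neither definition produces a third item for the apostrophe itself.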

The script would be something that you have to write yourself.

Laurence.



###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################

--
You received this message because you are subscribed to the Google Groups "AntConc-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antconc+u...@googlegroups.com.

Haidee Thomson

Feb 19, 2015, 3:12:49 PM
to ant...@googlegroups.com
Hi Laurence,

Thank you for your reply. I consider each word to be a token, and contracted words are counted as the sum of the separate words they contain. So I would count <don't> as 2 tokens: <don> + <t>.
Attached is a screenshot showing my problem: don't is showing up as a 3-gram.

Thank you so much for your explanations,
Haidee
ngram apostrophe.jpg

Laurence Anthony

Feb 19, 2015, 5:29:11 PM
to ant...@googlegroups.com
Haidee,

You are suffering from a classic problem: the character encoding in the AntConc global settings does not match the encoding of the files you are using. As a result, the 'curly' quotes in your files are not being rendered correctly. If you look at the file view tool, you will see that all quotes are being rendered incorrectly. I repeat: the problem has nothing to do with <don't>. It is simply a character encoding problem.

AntConc defaults to UTF-8. I suspect that you are using files encoded in Latin 1 or perhaps some variant. Try changing the encoding in AntConc to one of these. If you are unsure, use my new EncodeAnt tool that will detect and convert your file encodings to the UTF-8 international standard.
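The effect of such a mismatch can be reproduced in a few lines of Python. This sketch assumes the file was saved as Windows-1252 (cp1252), a common Latin 1 variant in which the curly apostrophe is the single byte 0x92. The actual encoding of the files in question is not known, and exactly how the damaged character ends up as a separate item in the n-gram list depends on the tool.

```python
# "don’t" with a curly apostrophe (U+2019), saved in the assumed encoding.
raw = "don\u2019t".encode("cp1252")
print(raw)  # b'don\x92t'

# Reading those bytes as UTF-8 fails on 0x92; with errors="replace" the
# apostrophe becomes U+FFFD, the Unicode replacement character.
wrong = raw.decode("utf-8", errors="replace")
print(ascii(wrong))  # 'don\ufffdt'

# Reading them with the matching encoding recovers the original text.
right = raw.decode("cp1252")
print(ascii(right))  # 'don\u2019t'
```

In the mis-decoded string there is no apostrophe left at all, only an unknown character between don and t, which is why changing the token definition makes no difference.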

I hope that helps.

Regards,

Laurence.

Haidee Thomson

Feb 19, 2015, 8:27:45 PM
to ant...@googlegroups.com
Hi Laurence,

Thank you! I am splitting the files manually, so I can change the encoding to UTF-8 at the same time. I ran a couple of the new files through AntConc and the apostrophe is no longer a problem.
Thank you for your patient explanation!

Haidee


Laurence Anthony

Feb 19, 2015, 8:43:51 PM
to ant...@googlegroups.com
Great. EncodeAnt can do a complete batch character encoding change for all your files. So, if the conversion is taking time at the save step, just use EncodeAnt later. It will do everything in seconds.
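For readers who prefer a script, a batch conversion of this kind can also be sketched in Python. This is not EncodeAnt's implementation. The function name convert_to_utf8, the "*.txt" filter, and the default source encoding of cp1252 are assumptions, and unlike EncodeAnt this sketch does not detect the source encoding for you.

```python
from pathlib import Path


def convert_to_utf8(src_dir: Path, dst_dir: Path, source_encoding: str = "cp1252"):
    """Re-encode every .txt file in `src_dir` as UTF-8, writing copies to `dst_dir`.

    `source_encoding` is an assumption and must match how the files were saved.
    Decoding is strict, so a wrong guess raises UnicodeDecodeError on bytes that
    are invalid in that encoding instead of silently corrupting the text.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(src_dir.glob("*.txt")):
        text = src.read_bytes().decode(source_encoding)
        dst = dst_dir / src.name
        dst.write_bytes(text.encode("utf-8"))
        converted.append(dst)
    return converted
```

Writing the converted copies to a separate folder keeps the original files untouched in case the assumed encoding turns out to be wrong.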

Regards,

Laurence.




Haidee Thomson

Feb 19, 2015, 8:56:56 PM
to ant...@googlegroups.com
Thank you! I'm sure that will save some time :-)