Apostrophe in Token Definition


nat...@gmail.com

Mar 5, 2017, 8:58:52 PM
to AntConc-Discussion
Hi.
I have gone into Global Settings and checked the box for Append Following Definition and have an apostrophe in the box. I clicked "apply."
I want it to be able to count words like "it's" and "you'll" as single words, as I am including them in a lemma list.

I created a lemma list for pronouns like  you-> you'll,you're,you'd,you've  and it-> it's,it'd,it'll
I loaded my lemma list and clicked "apply."

When I run a word list, I get the words "t" and "s" which are single letters. When I then check the concordance for these "words," some contractions are popping up.

Example:
In the word list, I have the word "s" occurring 705 times. When I check the concordance, several instances of "s" are really the word "it's", which should've shown up as an "it" word.
Also, if I do a search for *'s, it occurs 301 times.

It seems like even though I am asking it to use the apostrophe as part of a contraction, it is using it to separate the words into two: don plus t and it plus s. It is finding some of these contractions, if they are in my lemma list, but not all of them.

Any advice? Is there something else I am supposed to click on?

--Natalie

nat...@gmail.com

Mar 5, 2017, 9:22:03 PM
to AntConc-Discussion
I should also mention that all my files are in UTF-8. Even though they already were, I "converted" them again using EncodeAnt just in case I was wrong.

Thanks.

--Natalie

Laurence Anthony

Mar 6, 2017, 5:24:38 AM
to ant...@googlegroups.com
Hi Natalie,

There is a chance that your apostrophes are of different types. For example, Microsoft Word uses "curly apostrophes", which are different characters from the standard ASCII one. All the variations will need to be added to the token definition.

I suggest you start with a small file of just a few words that seem to show this problem (e.g. by cutting and pasting the problem sentences from the original larger corpus) and test it. This will make it much easier to identify where the problem is.

I hope that helps.

Laurence.
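
The effect described above can be reproduced outside AntConc. The following is a minimal Python sketch, assuming a token definition that behaves like "letters plus any appended characters". The `tokenize` function is a hypothetical stand-in for illustration, not AntConc's actual implementation.

```python
import re

def tokenize(text, extra_chars=""):
    """Hypothetical token definition: ASCII letters plus any appended characters."""
    pattern = "[A-Za-z" + re.escape(extra_chars) + "]+"
    return re.findall(pattern, text)

straight = "it's"        # U+0027 APOSTROPHE
curly = "it\u2019s"      # U+2019 RIGHT SINGLE QUOTATION MARK

# Only the straight apostrophe has been appended to the token definition.
print(tokenize(straight, "'"))      # one token: it's
print(tokenize(curly, "'"))         # two tokens: it, s

# Both variants appended: the curly form is now kept together as well.
print(tokenize(curly, "'\u2019"))
```

If a test file behaves like the second call, a mix of apostrophe characters is a likely cause.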




###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################

nat...@gmail.com

Mar 6, 2017, 10:03:01 PM
to AntConc-Discussion
Thanks, Laurence. You helped with one thing! I discovered that the lemma list I loaded was not in UTF-8 format, but rather in ASCII format. That was good to know.

However, when I used EncodeAnt to convert it, it did not convert it to UTF-8, despite my clicking on the "convert" button. I resaved it in Word in the UTF-8 format, checked the encoding in EncodeAnt, and confirmed it was now in UTF-8 format.

Then, I reloaded my lemma list and tried again. My lemma list is for the pronouns "you, we, I, your, it".   I got the same results.

I did discover this, for example:
The frequency for "you'll" in my word list (showing as a lemma form of "you") is 26. When I look up the word "ll" from my word list, it finds 30 instances of "ll" that are in fact the word "you'll". This tells me that not all of my contractions are being accounted for in my lemma list. My next step is to go through those 30 and see if there is anything noticeable about them that would indicate why 4 of them are not showing up in my lemmas for "you."

It can be truly frustrating. Thanks, though, for making this easier than going through and counting every single thing manually all the time!

If you have any other ideas based on this info, I'm happy to hear them.

--Natalie
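
One way to inspect occurrences like these without reading each one by eye is to tally which character actually sits inside the contraction. This is a rough Python sketch. The list of apostrophe-like characters is an assumption and may need extending for a given corpus, and `apostrophe_variants` is a hypothetical helper name.

```python
import re
import unicodedata
from collections import Counter

# Characters commonly typed or auto-inserted as apostrophes (assumed list).
APOSTROPHE_LIKE = "'\u2019\u2018\u02bc\u0060\u00b4"

def apostrophe_variants(text, stem="you", suffix="ll"):
    """Count which character joins stem and suffix, e.g. in you'll."""
    pattern = re.compile(
        r"\b" + re.escape(stem)
        + "([" + APOSTROPHE_LIKE + "])"
        + re.escape(suffix) + r"\b",
        re.IGNORECASE,
    )
    counts = Counter(m.group(1) for m in pattern.finditer(text))
    # Label each variant with its code point and official Unicode name.
    return {f"U+{ord(ch):04X} {unicodedata.name(ch)}": n
            for ch, n in counts.items()}

sample = "you'll see. You\u2019ll see. And you\u2019ll see again."
print(apostrophe_variants(sample))
```

More than one key in the result means the same contraction is spelled with different characters in the text.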




nat...@gmail.com

Mar 8, 2017, 5:22:52 AM
to AntConc-Discussion
I have figured out my problem! You were right, Laurence!

All of my text files are UTF-8 format. However, even though they are UTF-8, some of the apostrophes are the curly kind and some are the straight kind.  I would have thought that when I converted to UTF-8 it would have changed them all to the straight kind, but it didn't. 

I don't know anything about encoding or how the conversion works.


Does anyone know if there is a way I can "convert" all the curly apostrophes to straight apostrophes so I don't have to go into 800+ files and change them manually?

Any help is much appreciated!

--Natalie
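
On the encoding question: UTF-8 is a way of storing characters as bytes, so converting a file to UTF-8 keeps every character as it is. The curly and straight apostrophes are two different characters, and both are valid in UTF-8. A small Python illustration (the variable names are only for this example):

```python
straight = "'"        # U+0027 APOSTROPHE
curly = "\u2019"      # U+2019 RIGHT SINGLE QUOTATION MARK

# The straight apostrophe is stored as one byte, the curly one as three.
straight_bytes = straight.encode("utf-8")
curly_bytes = curly.encode("utf-8")
print(straight_bytes, len(straight_bytes))
print(curly_bytes, len(curly_bytes))

# Decoding gives back exactly the character that went in, so saving as
# UTF-8 never turns a curly apostrophe into a straight one.
round_trip = curly_bytes.decode("utf-8")
print(round_trip == curly, round_trip == straight)
```

Changing one character into the other therefore needs an explicit search and replace, not a re-encoding.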




JFlorian

Mar 8, 2017, 1:48:51 PM
to ant...@googlegroups.com
MS Word has a bunch of coding behind what you can see. It is a *bear* to strip out manually. You could locate the code in one file and do a Find and Replace. Here is a description of the % codes to use: <http://stackoverflow.com/questions/2826191/converting-ms-word-quotes-and-apostrophes>

[Alternatively, you could take the MS Word files and first make sure all ARE set as MS Word codes for single quote, double quote, and apostrophe (lots of work for each contraction). Then put them into Notepad, convert to plain txt, and then do the AntConc conversion. The long way around.]

OR, just pop each file into Windows Notepad and save it as a plain txt file. :-D The plain text *should* strip out all codes. Then, convert it in AntConc.

Judy

Laurence Anthony

Mar 8, 2017, 4:23:59 PM
to ant...@googlegroups.com
Hi Natalie,

If you just have mixed curly and straight quotes and apostrophes, the easiest thing to do is just open *all* your text files in a good text editor like Notepad++ and then search and replace the curly quotes and apostrophes with straight quotes and apostrophes in a single batch step. There is only one "curly" character of each type, so it shouldn't be too hard.

Load them back into AntConc (while still open in Notepad++) and if you find that there are still some other problematic characters, just replace those in Notepad++ and these changes will be immediately reflected in AntConc.

Working like this, you should be able to find and correct all your problems in just 2-3 min.

I hope that helps.

Laurence.
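
For anyone who prefers a script to an editor, the same batch replacement can be sketched in Python. This is an illustration under stated assumptions, not part of AntConc: the files are assumed to be UTF-8 `.txt` files in a single folder, the function names are hypothetical, and the mapping covers the left and right forms of the single and double curly quotes. Work on a copy of the corpus, since the files are rewritten in place.

```python
from pathlib import Path

# Curly characters mapped to their straight equivalents.
REPLACEMENTS = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def straighten(text):
    """Return text with curly quotes and apostrophes made straight."""
    return text.translate(REPLACEMENTS)

def straighten_folder(folder, pattern="*.txt"):
    """Rewrite matching files in place; return how many files changed."""
    changed = 0
    for path in Path(folder).glob(pattern):
        # Read and write bytes so line endings are left untouched.
        original = path.read_bytes().decode("utf-8")
        fixed = straighten(original)
        if fixed != original:
            path.write_bytes(fixed.encode("utf-8"))
            changed += 1
    return changed

print(straighten("it\u2019s \u201cfine\u201d"))
```

Calling `straighten_folder("corpus_copy")`, where `corpus_copy` is a placeholder folder name, would process every `.txt` file in that folder and report how many were modified.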


