Lemmatized types and tokens for specific tags?

LaTreese

Apr 30, 2021, 12:12:17 PM
to chibolts
Hello there!

This may be an easy issue to solve, but I cannot figure it out. I have relatively little experience with CLAN, so please be gentle.

I have several different tags in my transcripts (e.g., @z:shp to denote shape words). So, in the transcript, "square" and "squares" would be coded as square@z:shp and squares@z:shp, respectively. However, when I try to analyze them for types and tokens, different forms of the same stem are counted as separate types (e.g., square and squares are counted as 2 types).

I am currently using this command to get types and tokens of all of my different categories:
freq +s*@z:* 

I read in the manual that I should create a MOR line to be able to run types and tokens on lemmas so that "square" and "squares" are counted as one type. I did this, but now my tags are not available on the MOR line.

Is there a way to get the lemmatized type and token counts for specific tags?

Thank you so much!

LaTreese Hall
Florida International University

Leonid Spektor

Apr 30, 2021, 3:08:55 PM
to chib...@googlegroups.com
Hi,

The plain command freq +s*@z:* will not merge them, because the words "square" and "squares" are spelled differently. If you want both words counted as one type, you need to search on lemmas, i.e. stems only, on the %mor tier. To create the %mor tier in your data files, you first need to get the appropriate language grammar from the web. It looks like you are working with English data, so to get the English grammar, start CLAN and select the menu "File->Get MOR Grammar->English - eng". This will download the grammar to your computer. After that, run the MOR command on your data files: in the "Commands" window, type mor *.cha. This assumes your data filenames end with the .cha extension. Once the MOR command finishes without reporting any words it cannot identify, you can use the following command to find what you want:

freq +d7 +s*@z:* +sm;*,o%
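
As a tool-independent illustration of the distinction above (counting spellings versus counting stems), here is a minimal Python sketch. It is not how CLAN implements FREQ. The LEMMAS table is a hand-written, hypothetical stand-in for the stems that MOR would place on the %mor tier, and type_token_counts is an illustrative name.

```python
from collections import Counter

# Hypothetical stand-in for the stems MOR would write to the %mor tier.
LEMMAS = {
    "square": "square", "squares": "square",
    "circle": "circle", "circles": "circle",
}

def type_token_counts(tagged_words, tag_prefix="@z:", lemmas=None):
    """Return (types, tokens, counts) for words whose tag starts with tag_prefix.

    If a lemma table is given, words are grouped by stem; otherwise by spelling.
    """
    counts = Counter()
    for item in tagged_words:
        word, sep, tag = item.partition("@")
        # Skip untagged words and words carrying some other tag.
        if not sep or not ("@" + tag).startswith(tag_prefix):
            continue
        key = lemmas.get(word, word) if lemmas else word
        counts[key] += 1
    return len(counts), sum(counts.values()), counts

words = ["square@z:shp", "squares@z:shp", "circle@z:shp", "circles@z:shp", "the"]

surface = type_token_counts(words)                  # grouped by spelling
lemmatized = type_token_counts(words, lemmas=LEMMAS)  # grouped by stem

print(surface[:2])     # (4, 4)
print(lemmatized[:2])  # (2, 4)
```

Grouping by stem changes only the number of types; every tagged word still counts as one token.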


Leonid.

--
You received this message because you are subscribed to the Google Groups "chibolts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chibolts+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/chibolts/628a167c-ada8-4df7-8c91-19b0933f0743n%40googlegroups.com.

LaTreese Hall

May 3, 2021, 10:05:05 AM
to chib...@googlegroups.com
Hello Leonid,

Thank you for responding! I had already done all of this, but it is still counting the singular and plural forms of the same word as two different types.

This is what the output looks like when I run freq +d7 +s*@z:* +sm;*,o% 

  1 circle
      1 circle@z:shp
  1 circles
      1 circles@z:shp
  1 square
      1 square@z:shp
  1 squares
      1 squares@z:shp
------------------------------
   4  Total number of different item types used
   4  Total number of items (tokens)
1.000  Type/Token ratio

Any idea how to get type and token info for the lemmas, so that the above output indicates 2 types and 2 tokens?
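
For reference, the summary a lemma-based count would give for this listing can be worked out by hand. The Python sketch below merges the four top-level rows by stem; the stem_of table is a hypothetical, hand-written mapping (in CLAN the stems would come from the %mor tier), and summarize is an illustrative name, not a CLAN function.

```python
# Top-level counts copied from the FREQ listing above: (count, word form).
freq_rows = [(1, "circle"), (1, "circles"), (1, "square"), (1, "squares")]

# Hypothetical stem for each form; a stand-in for what the %mor tier would supply.
stem_of = {
    "circle": "circle", "circles": "circle",
    "square": "square", "squares": "square",
}

def summarize(rows, stems):
    """Merge counts by stem and return (types, tokens, type/token ratio)."""
    merged = {}
    for count, form in rows:
        stem = stems.get(form, form)
        merged[stem] = merged.get(stem, 0) + count
    types = len(merged)
    tokens = sum(merged.values())
    return types, tokens, types / tokens

types, tokens, ttr = summarize(freq_rows, stem_of)
print(f"{types} types, {tokens} tokens, TTR {ttr:.3f}")  # 2 types, 4 tokens, TTR 0.500
```

Under these assumptions the merged listing has 2 types, while the token count stays at 4, because each occurrence is still one token.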

Thank you!
LaTreese


