Question about collostructional analysis

Matías Guzmán Naranjo

Apr 25, 2013, 1:04:24 PM

to corplin...@googlegroups.com

Hi everyone,

I'm having a hard time doing collostructional analysis with a corpus. If I follow the methods outlined here: http://www-user.uni-bremen.de/~anatol/crs/caw/sclx.html the test indicates that somehow almost all lemmas are used significantly more often with the analysed construction than expected. However, if instead of using the number of other lemmas of the same type in the construction and related constructions, I use the number of sentences for the construction and related constructions, I get sensible results.

Why is this procedure with counts of sentences wrong? What might it be that I'm doing wrong?

Thanks a lot,

Matías

Stefan Th. Gries

Apr 25, 2013, 1:13:11 PM

to corplin...@googlegroups.com

> I follow the methods outlined here: http://www-user.uni-bremen.de/~anatol/crs/caw/sclx.html

That page is a little outdated (last updated 2005 ...) in terms of the
comments re the scripts etc. (cf. <http://tinyurl.com/collostructions>
for a more recent page, which e.g. doesn't say my script only works
with Windoze) but of course the underlying logic is still accurately
described.

> The test indicates that somehow almost all lemmas are used significantly more often with the analysed construction than expected.

Well, in some sense, that's obvious: language isn't used randomly so a
test that is based on comparing observed co-occurrence to
expected/random co-occurrence says that the observed co-occurrence is
not random. What's wrong with that ;-) This is why the important point
about the results is the ranking, not that >=1.301 threshold (i.e.,
-log10 of p_FYE = 0.05).

> However, if instead of using the number of other lemmas of the same type in the construction and related constructions, I use the number of sentences for the construction and related constructions, I get sensible results.

"sensible" only in terms of the arbitrary threshold of 0.05. The
results ARE sensible along the above lines and Gries (2012h) shows
that, usually, the rankings resulting from different a+b+c+d=N are
very similar.

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
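
The point about ranking versus threshold can be sketched in Python. This is only a minimal illustration, not the logic of any published script: it assumes a one-sided Fisher-Yates exact p-value taken in the direction of the deviation from the expected frequency, and all lemma names and counts are made up.

```python
import math

THRESHOLD = -math.log10(0.05)  # about 1.301, i.e. p = 0.05 on the -log10 scale


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def collostruction_strength(a, word_freq, cx_freq, n_total):
    """Return (direction, -log10 p) for one lemma.

    a         = occurrences of the lemma in the construction
    word_freq = occurrences of the lemma overall (a + b)
    cx_freq   = tokens of the construction (a + c)
    n_total   = a + b + c + d
    Assumption: p is one-sided, summed over outcomes at least as extreme
    as a in the direction of the deviation from the expected frequency.
    """
    expected = word_freq * cx_freq / n_total
    lowest = max(0, word_freq + cx_freq - n_total)
    highest = min(word_freq, cx_freq)
    if a >= expected:
        direction, ks = "attraction", range(a, highest + 1)
    else:
        direction, ks = "repulsion", range(lowest, a + 1)
    # Hypergeometric log-probabilities, summed stably (log-sum-exp).
    log_terms = [
        _log_comb(word_freq, k)
        + _log_comb(n_total - word_freq, cx_freq - k)
        - _log_comb(n_total, cx_freq)
        for k in ks
    ]
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return direction, -log_p / math.log(10)


# Hypothetical counts: lemma -> (frequency in construction, overall frequency)
counts = {"lemma_a": (30, 200), "lemma_b": (12, 150), "lemma_c": (5, 40)}
CX_FREQ, N_TOTAL = 1_000, 2_000_000

ranking = sorted(
    ((lemma, *collostruction_strength(a, wf, CX_FREQ, N_TOTAL))
     for lemma, (a, wf) in counts.items()),
    key=lambda row: row[2],
    reverse=True,
)
for lemma, direction, strength in ranking:
    print(f"{lemma}: {direction}, strength {strength:.2f}, "
          f"above threshold: {strength > THRESHOLD}")
```

With counts like these every lemma clears the threshold, so the threshold says little; the order of the rows is what carries the information.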

Matías Guzmán Naranjo

Apr 25, 2013, 1:50:08 PM

to corplin...@googlegroups.com

Thanks for your answer.

Maybe I'm not understanding something here. I'm studying a kind of psych verbs (X) in Spanish that present two possibilities A and B. I want to know which words are attracted to and repelled by each form compared to other psych verbs that don't present this alternation (~X). What I'm doing is comparing how often each word appears in the construction and in the total set of psych verbs ~X, and comparing this to what would be expected given the proportion of sentences in X and ~X (this is equivalent to the number of psych verbs). Done this way, it works, showing both repelled and attracted words.

Where it doesn't work at all is if I do the comparison with the number of words of the same type as the word I'm testing (for example Preposition, or Common Noun, etc.). This way EVERY word is classified as having a higher frequency than expected, and no words are repelled. I'm guessing my code has some error, but is this really the way I should be doing it?

Since my corpus is too big to use R (about 50 million words or so), I'm programming the analysis in Python, so I can't really use your scripts; it would take too long and my computer freezes.

Thanks again

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with...@googlegroups.com.
To post to this group, send email to corplin...@googlegroups.com.
Visit this group at http://groups.google.com/group/corpling-with-r?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

Stefan Th. Gries

Apr 25, 2013, 1:58:35 PM

to corplin...@googlegroups.com

I am not sure I understand what you're doing; for example, the
sentence "I'm studying a kind of psych verbs (X) in Spanish that
present two possibilities A and B." is not clear to me. It seems to me
you may have a 3-dimensional table there but I don't get your
description; if you do, see Stefanowitsch and Gries (2005) on ways to
handle 3-dimensional tables. In any case, the general point is: the
overall N in the 2x2 tables you use should be the N of the units that
are on the same level of granularity as what you're studying (or a
reasonably close approximation), which is why for argument structure
Cx Anatol and I often used the number of verbs in the corpus and why
people who use the script for purely lexical collocations use the
number of words.

Incidentally, 50m is not too big for R - I routinely use R on the BNC
without any problems.
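
To make the role of N concrete, here is a small Python sketch of how the 2x2 table and the expected frequency could be derived from three counts plus N. The numbers are invented and the function names are hypothetical; the only point is that the same observed frequency can look repelled or attracted depending on which N is plugged in, which is why N should match the granularity of the units being studied.

```python
def two_by_two(a, word_freq, cx_freq, n_total):
    """Build the cells (a, b, c, d) of the 2x2 table from marginal counts."""
    b = word_freq - a                        # word outside the construction
    c = cx_freq - a                          # other words in the construction
    d = n_total - word_freq - cx_freq + a    # everything else
    if min(a, b, c, d) < 0:
        raise ValueError("counts are inconsistent with n_total")
    return a, b, c, d


def expected_a(a, b, c, d):
    """Expected frequency of cell a under independence."""
    return (a + b) * (a + c) / (a + b + c + d)


def direction(a, b, c, d):
    """Compare the observed cell a with its expected frequency."""
    exp = expected_a(a, b, c, d)
    if a > exp:
        return "attracted"
    if a < exp:
        return "repelled"
    return "neutral"


# Hypothetical word: 40 hits in a construction with 2,000 tokens,
# 500 hits overall; only N differs between the two runs.
for n_total in (10_000, 5_000_000):
    table = two_by_two(40, 500, 2_000, n_total)
    print(n_total, table, expected_a(*table), direction(*table))
```

With the smaller N the expected frequency is above the observed 40, so the word counts as repelled; with the larger N the expected frequency drops far below 40 and the same word counts as attracted.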

Christophe Bechet

Jul 30, 2015, 7:04:08 AM

to CorpLing with R, morte...@gmail.com

Dear all,

I'm also having a hard time doing collostructional analysis. I'm using a 573,987-word corpus of historical Dutch which is not annotated. I would like to use collexeme analysis in order to determine which verbs in a given slot are most strongly attracted to a grammatical construction, namely "instead of" (its Dutch counterpart). Here follows the configuration of my construction:

[Subject VERB X instead of Y] (e.g. "I'm wearing the red shirt instead of the blue one").

For coll. analysis, I need the following:

- the name of the construction I wish to study ("instead of"); ==> OK
- the frequency of all elements that could potentially occur in the construction, namely all verbs (tokens) that occur in the corpus; ==> PROBLEMATIC!
- the combined frequency of all construction tokens (in my case: 24); ==> OK
- A table with the three columns ‘VERB’, ‘corpus_frequency’, and ‘construction_frequency’. ==> OK

The second element that is needed can hardly be obtained since the corpus is not tagged for POS. Does it mean that coll. analysis can only be used with annotated corpora, or is there another alternative to coll. analysis?

Best regards,

C. Bechet

Hardie, Andrew

Jul 30, 2015, 9:07:58 AM

to corplin...@googlegroups.com

The bigger question, it seems to me, is: How much will the collostruction analysis really add if all you have to work with is 24 examples?

best

Andrew.


Christophe Bechet

Jul 30, 2015, 4:03:26 PM

to corplin...@googlegroups.com

Yes, that is the biggest issue. Complex prepositions are hard to study quantitatively as far as diachrony is concerned. Examples are sometimes scarce in historical corpora. In a corpus of present-day Dutch, for instance, there are more than 40,000 instances of the construction "in plaats van", against only 27 for "in stede van". How could I compare collocational preferences for each construction in order to find significant semantic differences between them? Distinctive collexeme analysis seemed the best option at first sight...
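
For orientation, the 2x2 table that a distinctive collexeme analysis would test for a single verb can be sketched in Python as below. The construction totals are taken from the figures above (40,000 is a rounding of "more than 40,000"); the verb counts are purely hypothetical, and the function name is made up for the illustration. The sketch only lays out the table and the expected counts; it does not run a significance test.

```python
def distinctive_table(freq_in_a, freq_in_b, total_a, total_b):
    """Lay out the 2x2 table for one collexeme across two constructions.

    Rows: this verb vs. all other verbs; columns: construction A vs. B.
    Returns the observed cells and the expected counts of the verb in
    each construction under independence.
    """
    a, b = freq_in_a, freq_in_b
    c, d = total_a - a, total_b - b
    if min(a, b, c, d) < 0:
        raise ValueError("verb counts exceed construction totals")
    n = total_a + total_b
    verb_total = a + b
    expected_in_a = verb_total * total_a / n
    expected_in_b = verb_total * total_b / n
    return (a, b, c, d), (expected_in_a, expected_in_b)


# Construction totals from the thread; verb counts are invented.
observed, expected = distinctive_table(
    freq_in_a=400,    # hypothetical: verb with "in plaats van"
    freq_in_b=2,      # hypothetical: verb with "in stede van"
    total_a=40_000,
    total_b=27,
)
print("observed cells (a, b, c, d):", observed)
print("expected in A, expected in B:", expected)
```

With only 27 tokens in one column, the expected count of almost any single verb in that column is well below 1, which is the practical form of the sample-size concern raised earlier in the thread.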
