large contingency tables with many zeros


Nina Julich

Jul 22, 2019, 7:00:29 AM
to StatForLing with R

Hi all,


in a study, we are interested in colour-feeling associations. For the study, participants were given 15 feelings (angry, furious, passionate etc.) and they had to select a colour from a set of 27 colour shades for each feeling.


This is the result; the 27 colour shades were collapsed into 14 broader colour terms.



black blue brown colour.not.listed green grey no.association orange pink purple red tile white yellow
angry701011120071000
bored13340019151020090
calm0403092200201862
cheerful0311802610101150
confident51611601248218209
depressed282711201210011010
excited07257113117360121
fearful19471251921316113
furious700311210067020
happy0601100545412046
jealous1020472912116012
joyful01011008312302143
passionate0302004441155001
sad354720920060010
shy211111210140792942


For the results, we are interested in which associations are particularly strong. 


My initial idea was to perform a chisquare test on the contingency table, and to plot the residuals to show which associations are significant and which ones are repelled.


However, there are many zeros in the contingency table which means that a chisquare test should not be performed. 
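A minimal sketch of how one might check this in R, using a small made-up table `tab` (hypothetical counts, not the study's data). The usual rule of thumb for the chi-square approximation concerns low expected counts rather than observed zeros as such, so it can help to look at the `expected` and `residuals` components that `chisq.test()` returns.

```r
# Hypothetical toy counts (NOT the study's data): 3 feelings x 4 colours
tab <- matrix(c(7, 0,  1, 12,
                3, 9,  2,  0,
                0, 4, 10,  1),
              nrow = 3, byrow = TRUE,
              dimnames = list(feeling = c("angry", "bored", "calm"),
                              colour  = c("black", "grey", "blue", "red")))

# suppressWarnings() hides R's "approximation may be incorrect" warning,
# which is triggered by low expected counts
res <- suppressWarnings(chisq.test(tab))

# Expected counts under independence: row total * column total / grand total
round(res$expected, 2)

# Proportion of cells whose expected count is below 5
mean(res$expected < 5)

# Pearson residuals: (observed - expected) / sqrt(expected)
round(res$residuals, 2)
```

If most expected counts are small, the residuals can still be computed and inspected descriptively, but their p-value interpretation becomes shaky.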


If I perform a fisher test, I get the following error message:


“Error in fisher.test(english) : FEXACT error 5.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.”
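One possible workaround, sketched here on hypothetical toy counts rather than the `english` table from the error message: both `fisher.test()` and `chisq.test()` accept `simulate.p.value = TRUE`, which replaces the exact computation with a Monte Carlo p-value. The number of replicates, `B = 10000`, is an arbitrary choice for illustration.

```r
# Hypothetical toy counts (NOT the study's data): 3 feelings x 4 colours
tab <- matrix(c(7, 0,  1, 12,
                3, 9,  2,  0,
                0, 4, 10,  1),
              nrow = 3, byrow = TRUE,
              dimnames = list(feeling = c("angry", "bored", "calm"),
                              colour  = c("black", "grey", "blue", "red")))

set.seed(1)  # make the simulated p-values reproducible

# Fisher's test with a Monte Carlo p-value instead of the exact network algorithm
mc_fisher <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
mc_fisher$p.value

# Chi-square test with a Monte Carlo p-value; the residuals are still returned
mc_chisq <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
mc_chisq$p.value
round(mc_chisq$residuals, 2)
```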


Irrespective of the fisher test, in order to get at the residuals to compute strength of association, I have to perform the chisquare test. But again, I am not sure whether I can do this, given all the zeros in my contingency table.


Shall I refrain from analysing the residuals and look at percentages instead?


Thanks for your help!


Best,

Nina


Stefan Th. Gries

Jul 22, 2019, 7:06:48 AM
to StatForLing with R
My first thought would be to go with association rules (packages arules and arulesViz, I think), which can return values that tell you very similar things to residuals etc. Of course, arules are also somewhat sensitive to the extremely small n you have (for that many cells), like any approach would be, but measures such as support, confidence, and lift might be useful to explore the kinds of data you have.
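To make the three measures concrete, here is a rough base-R sketch that computes them by hand from a contingency table, without arules. It uses hypothetical toy counts, not the study's data, and assumes each response is treated as a (feeling, colour) pair with rules read in the direction feeling => colour.

```r
# Hypothetical toy counts (NOT the study's data): 3 feelings x 4 colours
tab <- matrix(c(7, 0,  1, 12,
                3, 9,  2,  0,
                0, 4, 10,  1),
              nrow = 3, byrow = TRUE,
              dimnames = list(feeling = c("angry", "bored", "calm"),
                              colour  = c("black", "grey", "blue", "red")))

N <- sum(tab)

# Support: share of all responses that combine this feeling with this colour
support <- tab / N

# Confidence of feeling => colour: share of the feeling's responses with this colour
confidence <- prop.table(tab, margin = 1)

# Lift: observed count divided by the count expected under independence
expected <- outer(rowSums(tab), colSums(tab)) / N
lift <- tab / expected

round(support, 2)
round(confidence, 2)
round(lift, 2)
```

A lift above 1 marks a pair that occurs more often than independence predicts, which is where the similarity to residuals comes from. With small counts, a high lift on very low support deserves caution.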

HTH,
STG

Nina Julich

Jul 22, 2019, 7:13:13 AM
to StatForLing with R
Ok. Thank you! I will have a look at association rules.
Just to make sure, you would definitely advise against using residuals retrieved from a chisquare test?

Best,
Nina

Stefan Th. Gries

Jul 22, 2019, 8:03:26 AM
to StatForLing with R
> Just to make sure, you would definitely advise against using residuals retrieved from a chisquare test?
I wouldn't go that far (and some of the association rules statistics
are in fact similar to the residuals), but of course they can become
quite inflated given the low expected frequencies, and so a combination
of association rules statistics might help with filtering out
interesting things. You'll have to be creative ...

Nina Julich

Jul 22, 2019, 11:18:45 AM
to StatForLing with R
:-) Ok. We will be. Thanks a lot! Also for your immediate response.

ekbrown77

Jul 22, 2019, 1:30:48 PM
to StatForLing with R
Another nice visualization for categorical data is the association plot (as Stefan presents in his book). Base R's assocplot() function is alright, but I like vcd::assoc(), as it color-codes Pearson residuals in the plot. There's a good example here, as well as in Levshina (2015)
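A short sketch of both plots, on hypothetical toy counts rather than the data from this thread. The `vcd::assoc()` call is guarded because vcd is an add-on package that may not be installed; `shade = TRUE` is the argument that turns on residual-based colouring.

```r
# Hypothetical toy counts (NOT the study's data): 3 feelings x 4 colours
tab <- matrix(c(7, 0,  1, 12,
                3, 9,  2,  0,
                0, 4, 10,  1),
              nrow = 3, byrow = TRUE,
              dimnames = list(feeling = c("angry", "bored", "calm"),
                              colour  = c("black", "grey", "blue", "red")))

# Base R: bar height is proportional to the Pearson residual of each cell
assocplot(tab, main = "Feeling by colour (toy data)")

# vcd: same idea, with cells shaded according to their Pearson residuals
if (requireNamespace("vcd", quietly = TRUE)) {
  vcd::assoc(as.table(tab), shade = TRUE)
}
```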

Stefan Th. Gries

Jul 22, 2019, 1:33:36 PM
to StatForLing with R
plus, let's not forget the kinds of plots that a correspondence
analysis could offer for this, which are potentially also good for you
since that wouldn't require a p-value somewhere down the road.
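As a rough illustration, a simple correspondence analysis could look like the sketch below. It uses hypothetical toy counts, not the thread's data, and `MASS::corresp()` only because MASS ships with R; the ca package is a common alternative.

```r
# Hypothetical toy counts (NOT the study's data): 3 feelings x 4 colours
tab <- matrix(c(7, 0,  1, 12,
                3, 9,  2,  0,
                0, 4, 10,  1),
              nrow = 3, byrow = TRUE,
              dimnames = list(feeling = c("angry", "bored", "calm"),
                              colour  = c("black", "grey", "blue", "red")))

# Simple correspondence analysis with two dimensions
# (the maximum here is min(rows, columns) - 1 = 2)
ca_fit <- MASS::corresp(tab, nf = 2)

# Canonical correlations and the row/column scores on each dimension
ca_fit$cor
round(ca_fit$rscore, 2)
round(ca_fit$cscore, 2)

# Joint map: feelings and colours that plot close together tend to co-occur
biplot(ca_fit)
```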




--
Best,
STG
--
Stefan Th. Gries
------------------------------
UC Santa Barbara & JLU Giessen
http://www.stgries.info
------------------------------

Martin Schweinberger

Sep 18, 2019, 8:57:36 PM
to statforli...@googlegroups.com

Hi all,


I am posting this (with Stefan's approval, of course) because the School of Languages and Cultures at The University of Queensland (UQ) is advertising a research assistant position in computational linguistics and/or corpus linguistics with a focus on quantitative analysis in R.

The job advertisement can be found here: https://linguistlist.org/issues/30/30-3484.html

Seeking outstanding Research Assistant with skills in research technology and digital data analysis methods relating to the use of technology in linguistic or cultural research. The School of Languages and Cultures is one of the largest Schools of language instruction in Australia with over 52 ...

I think the position is a great opportunity for early-career researchers to further develop their skills in computational approaches to analyzing language data and to experience working in an English-speaking country, at a research-intensive university in a truly beautiful city (obviously, this applies only to those who do not already do so).

UQ ranks in the top 50 as measured by the Performance Ranking of Scientific Papers for World Universities. The University also ranks 48 in the QS World University Rankings, 45 in the US News Best Global Universities Rankings, 65 in the Times Higher Education World University Rankings and 55 in the Academic Ranking of World Universities.


Kind regards

Martin

=====================================
Dr. Martin Schweinberger
5/221 Sir Fred Schonell Drive
St Lucia, QLD, 4067

Phone: +61 (0)404 228 226
Home: http://www.martinschweinberger.de/