Mixed-effects linear model for placement of nuclear stress in 10-word turns


Christoph Ruehlemann

Jan 10, 2019, 11:23:48 AM
to corplin...@googlegroups.com
Hi all,

I'm trying to model the placement of nuclear stress in 10-word turns from the BNC in a linear mixed model but am very new to mixed modeling. The model includes these variables:

  • STRSS, the binary response variable; the 10-word turns have been selected in such a way that only 1 word carries the nuclear stress
  • INFMX, a binary explanatory variable denoting whether a word carries the maximum informativity (i.e., 'surprisal' given the preceding word)
  • CLASS, an explanatory variable with three levels: function word, interjection, or content word
  • POST, an explanatory variable denoting whether the nuclear stress occurs early in the turn (words 1-3), in mid-turn position (words 4-6), or late in the turn (words 7-10)
  • STRCT, an explanatory variable denoting whether the nuclear stress falls on a word inside what is called the turn constructional unit (TCU) or not
  • SPKR, a random factor referring to speaker IDs, and
  • SEQU, another random factor referring each word to its place in the sequence of exactly 10 words, considered random because only 10-word turns are examined here, not turns of other lengths

Here's some reproducible data:

df <- data.frame(
  SPKR = c(rep("A", 10), rep("B", 10), rep("C", 10)),
  SEQU = rep(1:10, 3),
  STRSS = rep(c(rep("notS", 8), "S", "notS"), 3),
  INFMX = rep(c(rep("notMax", 8), "priorMax", "Max"), 3),
  CLASS = rep(c(rep("fnc", 3), rep("itj", 1), rep("cnt", 6)), 3),
  POST = rep(c(rep("earl", 3), rep("mid", 3), rep("lte", 4)), 3),
  STRCT = rep(c(rep("notTCU", 2), rep("TCU", 6), rep("notTCU", 2)), 3)
)
df
   SPKR SEQU STRSS    INFMX CLASS POST  STRCT
1     A    1  notS   notMax   fnc earl notTCU
2     A    2  notS   notMax   fnc earl notTCU
3     A    3  notS   notMax   fnc earl    TCU
4     A    4  notS   notMax   itj  mid    TCU
5     A    5  notS   notMax   cnt  mid    TCU
6     A    6  notS   notMax   cnt  mid    TCU
7     A    7  notS   notMax   cnt  lte    TCU
8     A    8  notS   notMax   cnt  lte    TCU
9     A    9     S priorMax   cnt  lte notTCU
10    A   10  notS      Max   cnt  lte notTCU
11    B    1  notS   notMax   fnc earl notTCU
12    B    2  notS   notMax   fnc earl notTCU
13    B    3  notS   notMax   fnc earl    TCU
14    B    4  notS   notMax   itj  mid    TCU
15    B    5  notS   notMax   cnt  mid    TCU
16    B    6  notS   notMax   cnt  mid    TCU
17    B    7  notS   notMax   cnt  lte    TCU
18    B    8  notS   notMax   cnt  lte    TCU
19    B    9     S priorMax   cnt  lte notTCU
20    B   10  notS      Max   cnt  lte notTCU
21    C    1  notS   notMax   fnc earl notTCU
22    C    2  notS   notMax   fnc earl notTCU
23    C    3  notS   notMax   fnc earl    TCU
24    C    4  notS   notMax   itj  mid    TCU
25    C    5  notS   notMax   cnt  mid    TCU
26    C    6  notS   notMax   cnt  mid    TCU
27    C    7  notS   notMax   cnt  lte    TCU
28    C    8  notS   notMax   cnt  lte    TCU
29    C    9     S priorMax   cnt  lte notTCU
30    C   10  notS      Max   cnt  lte notTCU

My hypothesis is that a word will carry nuclear stress (i.e., df$STRSS=="S") if

  • df$INFMX=="priorMax", i.e., the word with the greatest informativity immediately follows the word with the nuclear stress
  • df$CLASS=="cnt", i.e., the word is a content word
  • df$STRCT=="notTCU", i.e., the word lies outside the TCU
  • df$POST=="lte", i.e., the word occurs late in the turn
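
As a rough check of how these four conditions line up with STRSS in the toy data above, they can be combined into one logical vector and cross-tabulated against the response. This is only a sketch on the 30-row example; `hyp` is a helper name introduced here for illustration and is not part of the original data.

```r
# Same toy data as in the post above
df <- data.frame(
  SPKR = c(rep("A", 10), rep("B", 10), rep("C", 10)),
  SEQU = rep(1:10, 3),
  STRSS = rep(c(rep("notS", 8), "S", "notS"), 3),
  INFMX = rep(c(rep("notMax", 8), "priorMax", "Max"), 3),
  CLASS = rep(c(rep("fnc", 3), rep("itj", 1), rep("cnt", 6)), 3),
  POST = rep(c(rep("earl", 3), rep("mid", 3), rep("lte", 4)), 3),
  STRCT = rep(c(rep("notTCU", 2), rep("TCU", 6), rep("notTCU", 2)), 3)
)

# TRUE where all four hypothesised conditions hold for a word
hyp <- with(df, INFMX == "priorMax" & CLASS == "cnt" &
                STRCT == "notTCU" & POST == "lte")

# Cross-tabulate the combined condition against the response
table(stressed = df$STRSS == "S", all_conditions = hyp)

# The same comparison for a single predictor
with(df, table(STRSS, INFMX))
```

In this toy data the combined condition is TRUE for exactly the three stressed words, so the predictors separate the two outcomes perfectly. A logistic model has no finite maximum-likelihood estimates in that situation; real data with more variation may not show it.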

Given that the response variable is binary, I've tried a generalized linear mixed model so far, using glmer() (from lme4, loaded here via library("mlmRev")):

model1 <- glmer(STRSS ~ (INFMX + CLASS + POST + STRCT)^2 + 
           (1 | SPKR) + (1 | SEQU), data = df, family = binomial(link = "logit"), nAGQ = 1)

The problems I'd appreciate help with are the following:

  • Is this the right approach? I.e., is this, at least in principle, the right model?
  • The model call produces some unpleasant information--what to make of it?

    fixed-effect model matrix is rank deficient so dropping 19 columns / coefficients
    Warning messages:
    1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
    unable to evaluate scaled gradient
    2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
    Hessian is numerically singular: parameters are not uniquely determined
    
  • And finally, how to read the output of the model summary?

    summary(model1)
    Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
    Family: binomial  ( logit )
    Formula: STRSS ~ (INFMX + CLASS + POST + STRCT)^2 + (1 | SPKR) + (1 |      SEQU)
      Data: df
    
     AIC      BIC   logLik deviance df.resid 
     18.0     30.6      0.0      0.0       21 
    
    Scaled residuals: 
     Min        1Q    Median        3Q       Max 
    -1.49e-08  1.49e-08  1.49e-08  1.49e-08  1.49e-08 
    
    Random effects:
    Groups Name        Variance Std.Dev.
    SEQU   (Intercept) 0.83102  0.9116  
    SPKR   (Intercept) 0.05073  0.2252  
    Number of obs: 30, groups:  SEQU, 10; SPKR, 3
    
    Fixed effects:
                    Estimate Std. Error z value Pr(>|z|)
    (Intercept)    3.972e+01  7.249e+07       0        1
    INFMXnotMax   -4.107e-01  6.711e+07       0        1
    INFMXpriorMax -7.929e+01  5.479e+07       0        1
    CLASSfnc       3.565e-05  4.745e+07       0        1
    CLASSitj       1.581e-06  4.745e+07       0        1
    POSTlte        1.847e-05  3.875e+07       0        1
    STRCTnotTCU   -1.472e-05  4.745e+07       0        1
    
    Correlation of Fixed Effects:
                (Intr) INFMXnM INFMXpM CLASSf CLASSt POSTlt
    INFMXnotMax -0.926
    INFMXprirMx -0.378   0.408
    CLASSfnc     0.218  -0.471   0.000
    CLASSitj    -0.218   0.000   0.000  0.333
    POSTlte     -0.535   0.289   0.000  0.408  0.408
    STRCTnotTCU -0.655   0.707   0.000 -0.667  0.000  0.000
    fit warnings:
    fixed-effect model matrix is rank deficient so dropping 19 columns / coefficients
    convergence code: 0
     unable to evaluate scaled gradient
     Hessian is numerically singular: parameters are not uniquely determined
    
    Warning messages:
    1: In vcov.merMod(object, use.hessian = use.hessian) :
    variance-covariance matrix computed from finite-difference Hessian is
    not positive definite or contains NA values: falling back to var-cov estimated from RX
    2: In vcov.merMod(object, correlation = correlation, sigm = sig) :
    variance-covariance matrix computed from finite-difference Hessian is
    not positive definite or contains NA values: falling back to var-cov estimated from RX
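
One way to see where the "rank deficient" message comes from, sketched in base R without fitting anything: build the fixed-effects design matrix for the same formula and compare its number of columns with its rank. `X` and `patterns` are names introduced here for illustration.

```r
# Same toy data as in the post above
df <- data.frame(
  SPKR = c(rep("A", 10), rep("B", 10), rep("C", 10)),
  SEQU = rep(1:10, 3),
  STRSS = rep(c(rep("notS", 8), "S", "notS"), 3),
  INFMX = rep(c(rep("notMax", 8), "priorMax", "Max"), 3),
  CLASS = rep(c(rep("fnc", 3), rep("itj", 1), rep("cnt", 6)), 3),
  POST = rep(c(rep("earl", 3), rep("mid", 3), rep("lte", 4)), 3),
  STRCT = rep(c(rep("notTCU", 2), rep("TCU", 6), rep("notTCU", 2)), 3)
)

# Design matrix for the fixed-effects part of the model formula
# (main effects plus all two-way interactions)
X <- model.matrix(~ (INFMX + CLASS + POST + STRCT)^2, data = df)

ncol(X)     # number of coefficients the formula asks for
qr(X)$rank  # number of linearly independent columns in the data

# Distinct combinations of predictor values that actually occur
patterns <- unique(df[, c("INFMX", "CLASS", "POST", "STRCT")])
nrow(patterns)
```

With the toy data the formula requests 26 columns, but the 30 rows contain only 7 distinct predictor combinations, so at most 7 coefficients can be estimated. That matches the 19 dropped columns in the message and the 7 rows in the fixed-effects table.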
    

I'm quite aware that this post is demanding a lot. Helpful pointers are appreciated all the more!

Chris


Bob Green

Feb 23, 2019, 6:23:01 PM
to corplin...@googlegroups.com
Hello,

I was hoping for some advice on reading PDF articles with three columns into a text file.

An example of an article is:
https://academic.oup.com/schizophreniabulletin/article/19/1/165/1895752

If you click on the PDF link there, it takes you to the article. That URL is very long, so I haven't posted the direct link, but I can.

Each article is laid out in three columns; the far-left column on the first page is often blank. There is a page header and line separators, and articles can span several pages.

I used the following code:

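library(pdftools)  # pdf_text()
library(tm)        # Corpus(), VectorSource()
library(reshape2)
library(ggplot2)

setwd("E:/Firstperson/MyCorpus")

# All PDF files in the working directory
files <- list.files(pattern = "\\.pdf$")
#Rpdf <- readPDF(control = list(doc_id = 1:10, text = "-layout"))

# pdf_text() returns one character string per page, so this is a list
# with one character vector per file
opinions2 <- lapply(files, pdf_text)
corp <- Corpus(VectorSource(opinions2))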

Below are the first 15 lines of the Corpus file, which is a scrambled version of the first page of the PDF.

Any advice on a better way to read the text in would be appreciated.

Regards

Bob


c("VOL. 19, NO. 1, 1993 First Person Account: 165
The Onset of Paranoia Downloaded from
https://academic.oup.com/schizophreniabulletin/article-abstract/19/1/165/1895752
by Queensland Health District user on 18 February
2019 by William D. Bowden Abstract lieve that I
am not schizophrenic; I believe that I am a
psychic, that I The article that follows is part
\"broadcast\" my thoughts to any- of the
Schizophrenia Bulletin's one who is—what? In my
immedi- ongoing First Person Accounts se- ate
vicinity? Mentally focused on ries. We hope that
mental health me? Maybe even anywhere on
professionals—the Bulletin's pri- Earth? I don't
know. I have be- mary audience—will take this
lieved all of these possibilities and opportunity
to learn about the more, but presently I believe
that issues and difficulties confronted people
can \"read my mind\" only by consumers of mental
health if they are in my immediate care. In
addition, we hope that vicinity. these accounts
will give patients I am going to try, in this
short and families a better sense of essay, to
explain how I came to not being alone in
confronting believe in this psychic phe- the
problems that can be antici- nomenon. My belief
has withstood pated by persons with serious
attack from anyone I've shared it emotional
difficulties. We wel- with. It is also something
that I come other contributions from truly wish
would stop. I also patients, ex-patients, or
family wish, even if this phenomenon is members.
Our major editorial re- true, that I did not
believe it, be- quirement is that such contribu-
cause I can find no other person tions be clearly
written and who will admit that I am psychic.
organized, and that a novel or Most people I talk
with claim that unique aspect of schizophrenia it
is an entirely erroneous belief. be described,
with special I now know that I started to be-
emphasis on points that will be come
schizophrenic before I real- important for
professionals. ized anything out of the ordinary
Clinicians who see articulate pa- was taking
place. I was a boiler tients, with experiences
they be- technician in the Navy when I lieve
should be shared, might first started to become
schizo- encourage these patients to sub- phrenic.
I was 19 when I started a mit their articles to
First Person pattern of thinking that would
Accounts, Division of Clinical lead to full-blown
schizophrenia, and Treatment Research, NIMH,
paranoid type. I began taking an 5600 Fishers
Lane, Rm. 18C-06, interest in psychic phenomena
and Rockville, MD 20857.—The also in religion. I
did not have a Editors. clearly defined faith
then; my re- ligious beliefs were a composite of
scraps picked up from intermittent My name is
William D. Bowden. I attendance at church,
psychic am 35 years old. I'm a paranoid claims
read in supermarket tab- schizophrenic. Many
psychologists loids, and the then popular view
and psychiatrists have told me this among
myth-believers that extra- over the years, but I
am just com- terrestrials were visiting Earth and
ing to the point where I actually psychically
altering Earthlings' believe it. This reluctance
to be- lieve that 1 am schizophrenic is due to
the intensity of my belief in my \"delusional
system.\" My Reprint requests should be sent to
Mr. W.D. Bowden, 19 Cayuga Rd., delusional system
leads me to be- Sea Ranch Lakes, FL 33308. ",
"166 SCHIZOPHRENIA BULLETIN Downloaded from
https://academic.oup.com/schizophreniabulletin/art
icle-abstract/19/1/165/1895752

Earl Brown

Mar 2, 2019, 1:31:35 PM
to CorpLing with R
When I download the most recent version of the xpdf command-line tools (not the "XpdfReader") from http://www.xpdfreader.com/download.html and use the command-line tool "pdftotext" within R (or from a system-level command-line terminal) to extract the text from the PDF file you linked to (after downloading that file):

system("/Users/ekb5/xpdf-tools-mac-4.00/bin64/pdftotext /Users/ekb5/Downloads/19-1-165.pdf")

a new TXT file is created in the same directory as your PDF file, with the same name but the extension .txt, that is, "19-1-165.txt". (I'll attach the resulting TXT file.) The syntax of the command-line call is:
  1. the path to the pdftotext executable on your computer
  2. a space (no commas anywhere)
  3. the path to the PDF file you want to extract text from
The pdftotext tool is awesome: it automatically detects columns and keeps text from the same column together, and it also removes hyphens present in the original PDF.
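
The same call could be repeated over a whole folder of PDFs. The following is a minimal sketch, not a tested pipeline: the executable path is a placeholder, the folder is the one from the earlier post, and `txt_path_for` is a helper name introduced here. It uses system2(), which takes the executable and its arguments separately, and relies on pdftotext writing the .txt file next to the PDF as described above.

```r
# Hypothetical locations; replace with the paths on your own machine
pdftotext_exe <- "/path/to/xpdf-tools/bin64/pdftotext"
pdf_dir <- "E:/Firstperson/MyCorpus"

# Full paths of all PDF files in the folder
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

# Name of the TXT file that pdftotext creates for a given PDF
txt_path_for <- function(pdf) sub("\\.pdf$", ".txt", pdf)

# Only run the conversion if the executable is really there
if (file.exists(pdftotext_exe)) {
  for (pdf in pdf_files) {
    # shQuote() protects paths that contain spaces
    system2(pdftotext_exe, args = shQuote(pdf))
  }
  # Read each resulting TXT file back in as one string per article
  texts <- lapply(txt_path_for(pdf_files), function(f) {
    paste(readLines(f, warn = FALSE), collapse = "\n")
  })
}
```

The resulting list `texts` could then be passed to VectorSource() in place of the pdf_text() output.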

You could do some post-processing clean-up (probably with gsub() or stringr::str_replace()) to remove page numbers and header info that appears within the resulting text. Perhaps an even better way would be first to crop the input PDF pages before calling pdftotext, but I've only ever done that in Python with PyPDF2, so I can't help with cropping PDF pages in R. 
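
A minimal sketch of that kind of clean-up in base R, assuming the extracted text has been read in line by line. The header and download-notice strings are taken from the excerpt posted earlier in the thread; the body lines are shortened, and the patterns are guesses that would need checking against the real pdftotext output.

```r
# Toy input shaped like extracted text from this article
txt <- c(
  "VOL. 19, NO. 1, 1993",
  "My name is William D. Bowden.   I am 35 years old.",
  "Downloaded from https://academic.oup.com/schizophreniabulletin/article-abstract/19/1/165/1895752",
  "by Queensland Health District user on 18 February 2019",
  "166 SCHIZOPHRENIA BULLETIN",
  "I was a boiler technician in the Navy."
)

# Lines to drop (assumed layouts; adjust after inspecting real output)
drop_patterns <- c(
  "^VOL\\. [0-9]+, NO\\. [0-9]+, [0-9]{4}",      # running header as on p. 165
  "^[0-9]+ SCHIZOPHRENIA BULLETIN$",            # running header as on p. 166
  "^Downloaded from https?://",                 # publisher download notice
  "^by .* user on [0-9]+ [A-Za-z]+ [0-9]{4}$"   # second line of that notice
)
is_noise <- grepl(paste(drop_patterns, collapse = "|"), txt)

# Keep the body lines and collapse runs of whitespace with gsub()
clean <- gsub("\\s+", " ", trimws(txt[!is_noise]))
clean
```

Dropping whole lines with grepl() is used here because the headers sit on their own lines in this toy input; if they were embedded inside longer lines, gsub() with the same patterns (without the anchors) would be the closer fit.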
19-1-165.txt

Bob Green

Mar 3, 2019, 4:51:29 PM
to CorpLing with R

Many thanks Earl. I'll do some experimentation.

It may also be the case I can just read them as
pdf, without needing to format them as text files.

Regards

Bob


