Modelling individual's data over time

729 views
Skip to first unread message

Gregor Gorjanc

unread,
Oct 20, 2017, 8:20:59 PM10/20/17
to R-inla discussion group
Hi,

I am modelling individual measurements observed over time. The data is unbalanced, i.e., not all individuals are measured at every time point. I would like to produce "smoothed" individual profile for each individual over the whole time period. I am struggling to come up with an appropriate INLA call, partially because I am not sure about the model.

I would like to:
1) model variation between individuals - this is easy with f(idx, model="iid)
2) take into account that measurement of an individual at a particular point correlates with his/her measurements before/after the point - I am thinking about the AR1 or perhaps even RW1 or RW2, e.g., f(time, model="ar1/rw1/rw2), but I wonder if I should fit this irrespective of the idx factor or  should I use the replicate feature to get evaluation of all individuals at all time-points (all-ind-at-all-times)
3) obtain individual profiles - should I do "properly designed" linear combinations of f(idx, model="iid) and f(time, model="...") or should I pull all-ind-at-all-times evaluations from the previous line

I believe Gianluca was asking the same thing and phrased it as "centered AR" http://www.r-inla.org/comments-1?place=msg%2Fr-inla-discussion-group%2FopN7IjEb3hs%2FuAoOkU9UBwAJ . I would like that the individual's time-trend would "oscillate" around individual's mean, but that is achieved by f(idx, model="iid) + f(time, model="ar1), no? Am I just confusing data modelling and processing of the model fit (to obtain individual profiles)?

Thanks, Gregor

# Simulate data
nInd = 10
nTime = 8
IndEff = rnorm(n=nInd)
TimeEff = rnorm(n=nTime)
y = matrix(nrow=nInd, ncol=nTime)
for (j in 1:nTime) {
  for (i in 1:nInd) {
    if (runif(n=1) > 0.25) {
      y[i, j] = IndEff[i] + TimeEff[j] # simple additive effect here, but the true data might not be that simple
    }
  }
}
n = sum(!is.na(y))
Data = data.frame(y=rep(NA, times=n),
                  Ind=rep(NA, times=n),
                  Time=rep(NA, times=n))
k = 0
for (j in 1:nTime) {
  for (i in 1:nInd) {
    if (!is.na(y[i, j])) {
      k = k + 1
      Data[k, ] = c(y[i, j], Ind=i, Time=j)
    }
  }
}

# Fit simple model
# y ~ Normal(mu + ind + time, sigma2e)
# ind ~ Normal(0, simga2i)
# time[i] ~ Normal(\rho time[i-1], simga2t)
inla(formula=y ~ f(Ind, model="iid") + f(Time, model="ar1"), data=Data)

# Fit with replicates?

Haakon Bakka

unread,
Oct 21, 2017, 8:26:29 AM10/21/17
to Gregor Gorjanc, R-inla discussion group
Hi,

My suggestion would be to do an AR1 that is replicated over the individuals. But, to make the process consistent when they are measured at different time points and different missing values, I would consider the continuous version of the ar1, the ou process (inla.doc("ou")).

An alternative to the AR1 would be the RW2, where the continuous is called CRW2.

In the continuous models, you could put a values=... argument to make sure that the random effect are computed for the values you need; using the same values across all individuals. This would make it easier to plot the results.

## AR1 alternative
An alternative to "values" would be to use the AR1, but to make a fine grid over your time points and aggregating all data to that grid (and putting fake NA data where there is no data to force the computation of the posterior).

## Oscillating around the mean
Yes, a f(idx, model="iid) + f(time, model="ar1", replicate=...) would make the ar1 models oscillate around the mean. Same with a rw2 model.

However, the mean of the ar1 would not be exactly zero, but a little bit off. This is because part of the mean structure is actually a random contribution from the time series. If you want you could just compute this time series mean and ad it back on to the iid effect.

There is a way to explicitly remove the mean by using constraints, but I think this will be much slower (since you are replicating the time series) than just fixing any issues after the model is fitted.

## General point
Please be careful about overfitting! Look at the posterior value of the Gaussian precision and see if the likelihood standard deviation is too small.

Kind regards,
Haakon



--
You received this message because you are subscribed to the Google Groups "R-inla discussion group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to r-inla-discussion-group+unsub...@googlegroups.com.
To post to this group, send email to r-inla-discussion-group@googlegroups.com.
Visit this group at https://groups.google.com/group/r-inla-discussion-group.
For more options, visit https://groups.google.com/d/optout.

Haakon Bakka

unread,
Oct 21, 2017, 8:31:02 AM10/21/17
to Gregor Gorjanc, R-inla discussion group
I think the correct code for you example would be
r = inla(formula=y ~ f(Ind, model="iid") + f(Time, model="ar1", replicate = Ind), data=Data)

But I recommend PC priors for all the parameters.

Haakon


To unsubscribe from this group and stop receiving emails from it, send an email to r-inla-discussion-group+unsubscr...@googlegroups.com.

Helpdesk

unread,
Oct 21, 2017, 12:31:52 PM10/21/17
to Haakon Bakka, Gregor Gorjanc, R-inla discussion group
On Sat, 2017-10-21 at 15:31 +0300, Haakon Bakka wrote:
> I think the correct code for you example would be
> r = inla(formula=y ~ f(Ind, model="iid") + f(Time, model="ar1",
> replicate = Ind), data=Data)

as Haakon says, be aware that you might need to define the time points where the process is defined,

compare these

> r = inla(y ~ -1 + f(idx, model="ar1"), data = data.frame(y=0:1, idx=c(10,100)))
> r = inla(y ~ -1 + f(idx, model="ar1", values=1:100), data = data.frame(y=0:1, idx=c(10,100)))


in the first, the timepoints are 10 and 100, or eqv 1 and 2!

in the second, you have all the timepoints between 1 and 100, even though only timepoint 10 and 100 are used.



--
Håvard Rue
Helpdesk
he...@r-inla.org

Gregor Gorjanc

unread,
Oct 23, 2017, 5:20:45 AM10/23/17
to he...@r-inla.org, Haakon Bakka, R-inla discussion group
Dear Haakon and Havard,

Thanks for your answers. I have been able to obtain individual-time profiles following your advice. I have essentially used:

y ~ "some covariates" +
    f(IndI, model="iid") +
    f(TimeI, model="ar1", value=TimeIValues, replicate=IndI)

Below I summarise my experience and ask some questions.

Regards, Gregor

----

1) I used AR1 model, because my time variable is discrete. I am worried about overfitting, so I payed close attention to precision estimates. I did not notice major changes in precision for the error term after adding time component into the model. However, precision for the time component tends to be very large. Rho for AR1 is very close to 1. When I checked use of RW1 or RW2 I obtained crazy high precision estimates for the time component. Do these observations suggest overfitting? 

f(IndI, model="iid")
                                           mean        sd     0.025q     0.5q     0.975q    mode
Precision for the Gaussian observations   40.51     1.571     37.479    40.49      43.67   40.46
Precision for IndI                        46.25     8.810     31.089    45.54      65.62   44.21

f(IndI, model="iid") + f(TimeI, model="iid")
                                           mean        sd     0.025q     0.5q     0.975q    mode
Precision for the Gaussian observations   42.48     1.739     39.142    42.46      45.99   42.42
Precision for IndI                        47.45     9.032     31.908    46.72      67.31   45.35
Precision for DateYearMonthI             925.87   495.161    369.498   796.28    2221.26  613.05

f(IndI, model="iid") + f(TimeI, model="ar1")
                                           mean        sd     0.025q     0.5q    0.975q    mode
Precision for the Gaussian observations   41.21    1.6061    38.1346   41.190    44.457   41.14
Precision for IndI                        46.60    8.8544    31.3679   45.891    66.078   44.55
Precision for TimeI                     2538.43 2325.1339   388.2139 1874.091  8696.255 1000.28
Rho for TimeI                              0.97    0.0308     0.8879    0.980     0.997    0.99

f(IndI, model="iid") + f(TimeI, model="rw1")
                                            mean        sd     0.025q     0.5q     0.975q     mode
Precision for the Gaussian observations    41.25     1.610      38.21    41.20      44.52    41.10
Precision for IndI                         46.38     8.864     31.387    45.56      66.13    43.96
Precision for DateYearMonthI            37771.18 21158.603   10537.16 33394.37   90959.30 24934.59

f(IndI, model="iid") + f(TimeI, model="rw2")
                                             mean        sd     0.025q     0.5q      0.975q     mode
Precision for the Gaussian observations     41.22     1.606     38.115     41.20      44.44    41.18
Precision for IndI                          46.51     8.826     31.037     45.93      65.60    44.90
Precision for DateYearMonthI            109054.58 42294.622  45722.114 102877.11  209546.46 90408.61

f(IndI, model="iid") + f(TimeI, model="ar1", replicate=IndI)
                                            mean        sd     0.025q      0.5q     0.975q      mode
Precision for the Gaussian observations    42.81    1.7652    39.4432    42.776    46.3877   42.7084
Precision for IndI                         54.75   13.5578    33.4403    52.927    86.3869   49.3999
Precision for TimeI                       299.02  195.4974    65.5443   253.903   802.9862  168.2894
Rho for TimeI                               0.98    0.0113     0.9557     0.988     0.9976    0.9933

f(IndI, model="iid") + f(TimeI, model="rw1", replicate=IndI)
                                            mean       sd     0.025q     0.5q     0.975q     mode
Precision for the Gaussian observations    42.96    1.749     39.613    42.92      46.49    42.86
Precision for IndI                         50.88   10.404     33.391    49.88      74.18    47.95
Precision for TimeI                     13800.06 5888.249   6042.377 12564.92   28610.86 10532.02

f(IndI, model="iid") + f(TimeI, model="rw2", replicate=IndI)
                                             mean        sd     0.025q      0.5q     0.975q      mode
Precision for the Gaussian observations     43.33     1.748  4.001e+01     43.30      46.88     43.22
Precision for IndI                          91.79    31.919  4.393e+01     86.96     167.97     77.95
Precision for TimeI                     462998.47 78008.863  3.252e+05 458146.23  630159.17 449701.81

----

2) Is my understanding correct that with f(x, replicate=y) the order of results in summary.random$x is

---
x1 y1
x2 y1
...
xn y1
---
x1 y2
x2 y2
...
xn y2
---
...
x1 ym
x2 ym
...
xn ym
---

Would it make sense to add replicate info (as an additional column) in summary.random$x?

----

3) I noticed that when I had Ind and Time variables formatted as factors (I defined Time as either Year-Month, Year-WeekOfTheYear, or Year-DayOfTheYear, with appropriate set of levels, i.e., levels covered all possible Year-SubYear combinations, I provide function to construct such factors below) I got this error:

y ~ "some covariates" +
    f(IndF, model="iid") +
    f(TimeF, model="ar1", value=levels(TimeF), replicate=IndF)
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,  : 
  ‘max’ not meaningful for factors

I solved it by using integer indexes, which I defined as:

IndI = as.numeric(IndF)
TimeI = as.numeric(TimeF)
TimeIValues = 1:nlevels(TimeF)

This is the function I used to construct time factors

BuildDateFactor = function(x, Month=FALSE, Week=FALSE, Day=FALSE) {
  # Build date factor with levels that contain also non observed dates.
  # This can then be used to model distance between dates properly.
  # x - Date
  # Month, Week, Day - logical, should resulting factor have year-{month,week,day} resolution
  if (Month + Week + Day > 1) {
    stop("Only one of Month, Week, or Day must be true")
  }
  RangeX = range(x)
  Year = format(x, format="%Y")
  if (Month) {
    By = "month"
    Format = "%Y-%m"
    SubYear = format(x, format="%m")
  }
  if (Week) {
    By = "week"
    Format = "%Y-%V"
    SubYear = format(x, format="%V")
  }
  if (Day) {
    By = "day"
    Format = "%Y-%j"
    SubYear = format(x, format="%j")
  }
  LevelsX = sort(unique(format(seq(from=RangeX[1], to=RangeX[2], by=By), format=Format)))
  factor(paste(Year, SubYear, sep="-"), levels=LevelsX)
}

----

Finn Lindgren

unread,
Oct 23, 2017, 5:32:49 AM10/23/17
to Gregor Gorjanc, he...@r-inla.org, Haakon Bakka, R-inla discussion group
Hi Gregor,

even though time is discrete in your case, you have unequal spacing.
This means that the "ou" model is likely a more appropriate choice.
When the spacing is equal, it's equivalent to an AR(1) process.

Having time as a factor doesn't play well with a time-ordered model. A
factor should be seen as a glorified collection of indicator
variables, not an ordered sequence.
Indexing is the way to do it.

Finn
> --
> You received this message because you are subscribed to the Google Groups
> "R-inla discussion group" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to r-inla-discussion...@googlegroups.com.
> To post to this group, send email to
> r-inla-disc...@googlegroups.com.
--
Finn Lindgren
email: finn.l...@gmail.com

Gregor Gorjanc

unread,
Oct 23, 2017, 6:28:27 AM10/23/17
to Finn Lindgren, he...@r-inla.org, Haakon Bakka, R-inla discussion group
Hi Finn,

Thanks for this input. Actually, I defined time as Year-Month (or Year-Week or Year-Day) so I do have equal spacing between the time points (give or take few days difference between months for the Year-month formulation). I tried now the continuous time models ("ou" and "crw2"). Indeed I see that "ar1" and "ou" give me more or less the same results ("ar1" rho ~ 0.97, while "ou" phi ~ 0.03 --> rho ~ exp(-0.03) = 0.97), which I believe suggest that spacing is equidistant.

Should I be worried that rho is so high or that RW models have so large precisions?

Regards, Gregor

f(IndI, model="iid") + f(TimeI, model="ou", values=...)
                                             mean        sd 0.025quant 0.5quant 0.975quant     mode
Precision for the Gaussian observations   41.2774    1.6302    38.0301   41.305    44.4368   41.442
Precision for IndI                        46.2184    9.1834    31.4376   45.072    67.2728   42.720
Precision for TimeI                     2666.4974 2378.4611   424.1470 1992.906  8968.3901 1087.889
Phi for TimeI                              0.0324    0.0351     0.0033    0.022     0.1249    0.009

f(IndI, model="iid") + f(TimeI, model="crw2", values=...)
                                             mean        sd 0.025quant  0.5quant 0.975quant     mode
Precision for the Gaussian observations     41.21     1.606      38.13     41.18      44.46    41.12
Precision for IndI                          46.19     8.859      31.29     45.34      65.97    43.68
Precision for TimeI                     108426.28 42377.699   45855.08 101904.19  210040.94 89194.81

f(IndI, model="iid") + f(TimeI, model="ou", values=..., replicate=IndI)
                                            mean       sd 0.025quant 0.5quant 0.975quant     mode
Precision for the Gaussian observations  42.8940   1.7702    39.5121  42.8598    46.4779  42.7957
Precision for IndI                       54.5974  13.4951    33.4635  52.7430    86.1993  49.1619
Precision for TimeI                     295.3813 184.1402    71.8762 253.7867   765.9362 176.7437
Phi for TimeI                             0.0166   0.0127     0.0027   0.0133     0.0498   0.0075

f(IndI, model="iid") + f(TimeI, model="crw2", values=..., replicate=IndI)
                                             mean        sd 0.025quant  0.5quant 0.975quant      mode
Precision for the Gaussian observations     43.34     1.747  3.998e+01     43.32      46.86     43.28
Precision for IndI                          91.47    31.776  4.380e+01     86.67     167.27     77.72
Precision for TimeI                     462269.60 77985.826  3.251e+05 457032.31  630495.86 447605.97


> To post to this group, send email to



--
Finn Lindgren
email: finn.l...@gmail.com

--
You received this message because you are subscribed to a topic in the Google Groups "R-inla discussion group" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/r-inla-discussion-group/kOZfV29zxc8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to r-inla-discussion-group+unsub...@googlegroups.com.
To post to this group, send an email to r-inla-discussion-group@googlegroups.com.



--
-- 
Regards, Gregor

Finn Lindgren

unread,
Oct 23, 2017, 6:40:17 AM10/23/17
to Gregor Gorjanc, he...@r-inla.org, Haakon Bakka, R-inla discussion group
It’s hard to interpret the precision parameters by themselves. Easier to look at the estimated process values themselves. High precision may indicate “ no effect”, but that depends on the scale of the problem.

Finn

Haakon Bakka

unread,
Oct 25, 2017, 3:19:39 AM10/25/17
to Finn Lindgren, Gregor Gorjanc, he...@r-inla.org, R-inla discussion group
You might want to do the following:
1. PC priors for all precision parameters
- inla.doc("pc.prec")
2. Transform the posterior precision to posterior of sigma/standard deviation
3. Plot all of these posteriors together
4. Understand the size of the effects compared to each-other and compared to zero (no effect)

A word of warning is that the standard deviation parameters must actually represent the standard deviations. I believe this should be ok for the ou and ar1 and iid models. This is usually not true for intrinsic models like crw2!!
(This you could try to check by looking at the posterior summaries for the random effects, e.g. sd(...$mean) or mean(...$sd) or similar summaries of the summaries)

PS: You could try to fix the problem for crw2 by scale.model=T.

Kind regards,
Haakon Bakka





Finn

> To post to this group, send email to
> r-inla-discussion-group@googlegroups.com.
>
> Visit this group at https://groups.google.com/group/r-inla-discussion-group.
> For more options, visit https://groups.google.com/d/optout.



--
Finn Lindgren
email: finn.l...@gmail.com

--
You received this message because you are subscribed to a topic in the Google Groups "R-inla discussion group" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/r-inla-discussion-group/kOZfV29zxc8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to r-inla-discussion-group+unsubscr...@googlegroups.com.

To post to this group, send an email to r-inla-discussion-group@googlegroups.com.
Visit this group at https://groups.google.com/group/r-inla-discussion-group.
For more options, visit https://groups.google.com/d/optout.



--
-- 
Regards, Gregor

Haakon Bakka

unread,
Oct 25, 2017, 3:47:02 AM10/25/17
to Finn Lindgren, Gregor Gorjanc, he...@r-inla.org, R-inla discussion group
Let me just clarify a few things with my previous answers.
1) Using rw2 (and inla.group) is usually better than using crw2, in the sense that it is more numerically stable. However, the precision/sigma parameter is not easy to interpret, since the model is intrinsic.
2) RW2 is not "stationary": The standard deviation is not the same across the different covariate values, see e.g. the plots in
- Here you can also see the scaling option
3) The AR1 model is stationary (because of good choices), and so, the sigma (or precision) is much easier to interpret.

Kind regards,
Haakon


Reply all
Reply to author
Forward
0 new messages