Is it possible to add an Autoregressive term for panel time series data?

95 views
Skip to first unread message

JJ Hubbard

unread,
Aug 12, 2022, 11:23:46 AM8/12/22
to R-inla discussion group
I have time series observations from multiple participants. I have added an AR1 term to my mixed effects model like so:

f(idx, model='ar',
        order = 1,
        constr=FALSE
    )

But since there are multiple participants, the lag may not necessarily be from the same source. Is there a way to group the index 'idx' by participant so that I am getting the lags of that participant's time series?

Thanks for your time.

Helpdesk

unread,
Aug 12, 2022, 11:27:16 AM8/12/22
to JJ Hubbard, R-inla discussion group
you mean each particiant has its own AR1 but they all share the same
hyperparameters ? then its

f(time, model="ar1", replicate=r, locations=1:n)

where r=1,2,3.... to #particiants

and its a good idea to define using 'locations' the time points for the
union of AR1's, then each participants can just use a part of its own
time series
> --
> You received this message because you are subscribed to the Google
> Groups "R-inla discussion group" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to r-inla-discussion...@googlegroups.com.
> To view this discussion on the web, visit
> https://groups.google.com/d/msgid/r-inla-discussion-group/581af2cb-c452-473c-acf8-a8a4e051228cn%40googlegroups.com
> .

--
Håvard Rue
he...@r-inla.org

Finn Lindgren

unread,
Aug 12, 2022, 11:33:01 AM8/12/22
to Helpdesk, JJ Hubbard, R-inla discussion group
I think you meant values=1:n, not locations=1:n?
Finn
> To view this discussion on the web, visit https://groups.google.com/d/msgid/r-inla-discussion-group/9bc175fd8b9c2c378d6c8e3e804441f463127ec7.camel%40r-inla.org.



--
Finn Lindgren
email: finn.l...@gmail.com
Message has been deleted

JJ Hubbard

unread,
Aug 12, 2022, 11:55:04 AM8/12/22
to R-inla discussion group
I think so, I haven't really considered whether the hypers should be shared or independent.

When I instead try 

f(idx, model='ar',
        order = 1,
        replicate=1:r,
        values=1:n,
        constr=FALSE
    )

I get this error:

Error in inla.core(formula = formula, family = family, contrasts = contrasts, : length(replicate) != length(xx) 9432 != 10000

This would seem to suggest that the # of participants must equal the # of observations

INLA help

unread,
Aug 12, 2022, 11:58:32 AM8/12/22
to JJ Hubbard, R-inla discussion group
No….

From: r-inla-disc...@googlegroups.com <r-inla-disc...@googlegroups.com> on behalf of JJ Hubbard <jordanj...@gmail.com>
Sent: Friday, August 12, 2022 5:55:04 PM
To: R-inla discussion group <r-inla-disc...@googlegroups.com>
Subject: Re: [r-inla] Is it possible to add an Autoregressive term for panel time series data?
 

INLA help

unread,
Aug 12, 2022, 12:00:20 PM8/12/22
to JJ Hubbard, R-inla discussion group
idx[i]  is the time point for participant r[i]

From: INLA help <he...@r-inla.org>
Sent: Friday, August 12, 2022 5:58:28 PM
To: JJ Hubbard <jordanj...@gmail.com>; R-inla discussion group <r-inla-disc...@googlegroups.com>

JJ Hubbard

unread,
Aug 12, 2022, 12:03:37 PM8/12/22
to R-inla discussion group
Ah, so idx should be of dimension n x r then? A column of observations for each participant? If so, how would I handle having a different # of observations for participants? Would it be best to leave those as 0 or NA?

Thanks a lot for your help.

Finn Lindgren

unread,
Aug 12, 2022, 12:13:44 PM8/12/22
to JJ Hubbard, R-inla discussion group
No, you need to supply a replicate vector of the same length as you data.
Normally this would be a column of your data.frame, indicating from which individual each observation belongs.

Finn

On 12 Aug 2022, at 16:55, JJ Hubbard <jordanj...@gmail.com> wrote:

I think so, I haven't really considered whether the hypers should be shared or independent.

INLA help

unread,
Aug 12, 2022, 12:26:21 PM8/12/22
to Finn Lindgren, JJ Hubbard, R-inla discussion group
You just define the idx and r for those observations you have or those you want to predict 

From: r-inla-disc...@googlegroups.com <r-inla-disc...@googlegroups.com> on behalf of Finn Lindgren <finn.l...@gmail.com>
Sent: Friday, August 12, 2022 6:13:39 PM
To: JJ Hubbard <jordanj...@gmail.com>
Cc: R-inla discussion group <r-inla-disc...@googlegroups.com>

Subject: Re: [r-inla] Is it possible to add an Autoregressive term for panel time series data?

JJ Hubbard

unread,
Aug 12, 2022, 12:59:21 PM8/12/22
to R-inla discussion group
Thanks a lot Finn, that cleared up what replicate was for me. I got it running now.

New question, my model is running extremely slowly now, I imagine I must have incorrectly specified something which is slowing it down. Currently my model looks like this:

df$idx <- 1L:nrow(df)
N=5
m1 <- inla(Y ~ 1 +
   
    # Fixed Effects
    + fixed_effect_1
   
    # Autoregressive
    + f(idx, model='ar1',
        replicate=participant_id,
        values=1:N,
        constr=FALSE
    )
           

    , data=df[1:N,], family="gaussian"
    , control.inla=list(strategy="adaptive",  int.strategy="auto")
    , control.compute=list(return.marginals.predictor=TRUE, cpo=TRUE, waic=TRUE, dic=TRUE, config=TRUE)
    , verbose=TRUE
    )

With only the fixed effect, I can fit 1000 observations in 1 second. With the AR1 added it takes 1m30s to fit only 5 observations. I must be doing something terribly wrong here...



Helpdesk

unread,
Aug 12, 2022, 1:05:31 PM8/12/22
to Finn Lindgren, JJ Hubbard, R-inla discussion group
with person.1 being observed at time 1,2,4,4, (yes, twice at time 4),
and person.2 being observed at time 2,3,5, then

idx=c(1,2,4,4,2,3,5)
r=c(1,1,1,1,2,2,2)
values=1:5

(oops, it is *values* as Finn noted)




On Fri, 2022-08-12 at 16:26 +0000, INLA help wrote:
> You just define the idx and r for those observations you have or those
> you want to predict 
>
> Get Outlook for iOS
> > > > https://groups.google.com/d/msgid/r-inla-discussion-group/9bc175fd8b9c2c378d6c8e3e804441f463127ec7.camel%40r-inla.org
> > > > .
> > >
> > >
> > >
> > > --
> > > Finn Lindgren
> > > email: finn.l...@gmail.com
> > --
> > You received this message because you are subscribed to the Google
> > Groups "R-inla discussion group" group.
> > To unsubscribe from this group and stop receiving emails from it,
> > send an email
> > tor-inla-discussio...@googlegroups.com.
> > To view this discussion on the web, visit
> > https://groups.google.com/d/msgid/r-inla-discussion-group/50e9dfce-9e08-4fa7-a958-09cb7926d33cn%40googlegroups.com

Helpdesk

unread,
Aug 12, 2022, 1:07:31 PM8/12/22
to JJ Hubbard, R-inla discussion group
particiant_id must be integer from 1...n, where n is the number of
participants
Håvard Rue
he...@r-inla.org

JJ Hubbard

unread,
Aug 12, 2022, 2:19:50 PM8/12/22
to R-inla discussion group
Thanks a ton for your help so far. Here is what I have now:

# make integer mappings for timestamp and ID, both are strings by default
df$timestamp_f <- as.integer(as.factor(df$timestamp))
df$participant_id_f <- as.integer(as.factor(df$participant_id))
n = max(df$timestamp_f)
m1 <- inla(Y ~ 1
    + f(timestamp_f, model="ar1",
        replicate=participant_id_f,
        values=1:n,
        constr=FALSE
    ),
    data=df,
    family="gaussian"
    )

Unfortunately, it still seems painfully slow. Is there anything else I am doing wrong?

Finn Lindgren

unread,
Aug 12, 2022, 2:24:35 PM8/12/22
to JJ Hubbard, R-inla discussion group
How large is the model? I.e. what is  n  and what is max(df$participant_id_f) ?

Finn

On 12 Aug 2022, at 19:19, JJ Hubbard <jordanj...@gmail.com> wrote:

Thanks a ton for your help so far. Here is what I have now:

JJ Hubbard

unread,
Aug 12, 2022, 3:29:32 PM8/12/22
to R-inla discussion group
I have been testing with small samples of about 5-20 rows from my dataframe, which is about 300,000 rows total. 5-10 rows usually takes about 1m-1m30s with the single fixed effect and AR1 included. With just the singled fixed effect I can run about 2000 rows in 1s.

When the sample size is small, max(df$participant_id_f) is typically equal to n, because participants are rarely observed more frequently than once every 3-4 weeks. So when I run n = 20, max(df$participant_id_f) = 20 too.

I ran a test to see if the issue was the fact that these subsets all started at the beginning and thus had no second occurrences of a given participant. Subsetting my df like df[df$participant_id_f %in% 1:100)] produced a new dataframe with n = 608, which ran in 15s. I did not change anything else.

Not sure what causes this.

Finn Lindgren

unread,
Aug 12, 2022, 3:41:46 PM8/12/22
to JJ Hubbard, R-inla discussion group
So the inla() call takes longer when you have too little data to reliably estimate the model? That might seem surprising, but it may be that the likelihood is very flat, making reliable optimisation difficult.
For testing the estimation method, I would instead suggest subsetting the data in a more stratified way, e.g. by including all data from a subset of the individuals, to ensure adequate temporal coverage when testing.

Finn

On 12 Aug 2022, at 20:29, JJ Hubbard <jordanj...@gmail.com> wrote:

I have been testing with small samples of about 5-20 rows from my dataframe, which is about 300,000 rows total. 5-10 rows usually takes about 1m-1m30s with the single fixed effect and AR1 included. With just the singled fixed effect I can run about 2000 rows in 1s.

JJ Hubbard

unread,
Aug 12, 2022, 3:52:51 PM8/12/22
to R-inla discussion group
Your stratified subset is something I tried, with positive results. Much faster when subsetting only 100 participants, but all of their observations. In order to run the model on my full data when I am ready, I imagine I would need to reorder the data so that I do not have a large bulk of participants who the model has not encountered yet, which is what seems to result in the slow down.

But I feel like reordering the data will also introduce temporal leakage, since some observations will have a prior that has been informed by an observation which the model fit to previously, but actually had not occurred yet at that point in time.

Finn Lindgren

unread,
Aug 12, 2022, 4:00:37 PM8/12/22
to JJ Hubbard, R-inla discussion group
When subsetting data by individuals, you need to make sure the replicate= information is for the subset, I.e. unless you only use the individuals with the lowest I’d, you need a new I’d vector.
At least in inlabru this can be done a bit more automatically by using a mapping from unique factor/character IDs to an index vector, but for plain inla you need to be very precise with what data you specify.

I don’t understand your second paragraph. The estimation is done jointly for the model as a whole, and not sequentially.
It may help to consider that the model can be defined regardless of whether it’s being observed or not.

Finn

On 12 Aug 2022, at 20:52, JJ Hubbard <jordanj...@gmail.com> wrote:

Your stratified subset is something I tried, with positive results. Much faster when subsetting only 100 participants, but all of their observations. In order to run the model on my full data when I am ready, I imagine I would need to reorder the data so that I do not have a large bulk of participants who the model has not encountered yet, which is what seems to result in the slow down.

JJ Hubbard

unread,
Aug 12, 2022, 4:09:49 PM8/12/22
to R-inla discussion group
I am getting my replicate vector by calling

unique(participant_id_f)

Does that seem acceptable? I figured this was the best way to get "all values assumed by the covariate" as per the documentation for the values argument. I subsetted using the first 100 IDs anyways, via df2 <- df[df$participant_id_f %in% 1:100,], so hopefully it was all kosher anyways?

It seems my understanding of INLA is incorrect, I assumed it fit observations sequentially, updating the prior each time so the next estimation is more informed. But this is actually done in parallel?

Finn Lindgren

unread,
Aug 12, 2022, 4:33:12 PM8/12/22
to JJ Hubbard, R-inla discussion group
The replicate model will include all indices from 1 to max(replicate), regardless of which ones have any observations, so using unique() is not sufficient.

Finn

On 12 Aug 2022, at 21:09, JJ Hubbard <jordanj...@gmail.com> wrote:

I am getting my replicate vector by calling

Finn Lindgren

unread,
Aug 19, 2022, 12:35:48 PM8/19/22
to JJ Hubbard, R-inla discussion group
Ah, yes, that’s why I needed separate t,j, and i indexes when describing the model!
t_i and j_i as stored in the df need to be the time and individual ID for observation index nr i (row i of the df).

Finn

On 19 Aug 2022, at 17:31, JJ Hubbard <jordanj...@gmail.com> wrote:

I think so, I haven't really considered whether I would want them to share hypers or have their own. I don't think I did the best job explaining my setup, let me clarify.

Currently, my idx argument is just 1:nrow(df), which I modeled from reading this documentation:  ar.pdf (r-inla-download.org)
But since my data comes from multiple participants, for any index i, index i-1 is likely not from the same participant, so my AR right now I think is nonsensical, as most of the time the lag is not actually the current individual's Y_t-1, but actually some other individual's Y_t.

Thanks for your prompt response.
Reply all
Reply to author
Forward
Message has been deleted
0 new messages