Reshape data to match statsmodels and exog contains inf or nans

D. K

Jan 15, 2023, 8:02:01 PM

to pystatsmodels

Hello

I am trying to set up a panel TVP-VAR using statsmodels, starting with a local model.

I am following the tutorial on the statsmodels website step by step.

Coming from Stata, I am confused about how to properly reshape my data to match what statsmodels expects.

The data are saved in a long-format Stata file, as the attached screenshot shows.

There is an identifier (id), year, country, and then a set of thirty (30) variables, say variable1 to variable30, for each country and year. It is a typical long panel data format.

I am getting an error

An unsupported index was provided and will be ignored when e.g. forecasting. self._init_dates(dates, freq)

So, my first question is how to properly reshape my data so that it is compatible with statsmodels for a local and a panel TVP-VAR model.

The second error, which I get when I run the TVP-VAR model, is:

exog contains inf or nans

I do have gaps in the data, of course. I run two types of VAR models. In the first one, all my variables are endogenous. In the second one, I treat some variables, mostly dummies, as exogenous. How is that solved? Could simply setting exog=None be a solution?

Apart from the attached screenshot, a small sample of my data is available at the following link:

https://drive.google.com/file/d/1YmKseNKEGZTQk_II4fOwgUVfZLAgVqJT/view?usp=share_link

For the first question I set up the panel framework as follows:

%matplotlib inline

from importlib import reload
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

from scipy.stats import invwishart, invgamma

#1

import pyreadstat

dtafile = 'panel.dta'
dta, meta = pyreadstat.read_dta(dtafile)
dta.tail()

labels=list(meta.column_labels)
column=list(meta.column_names)

# Panel data settings
year = dta.year
year = pd.Categorical(dta.year)
dta = dta.set_index([ "country", "year"])
dta["year"] = year

dta.head()

Thank you in advance for your help.

Screen Shot 2023-01-15 at 22.29.32.png

Chad Fulton

Jan 15, 2023, 8:13:33 PM

to pystat...@googlegroups.com

Hello,

Statsmodels does not include a built-in panel TVP-VAR model (or even a built-in TVP-VAR model), so there is no answer to your question.

If you are creating such a model yourself using the state space framework, then the base classes ultimately expect that the data passed to them will have the shape (n, k) where n is the number of observations in your dataset and k is the number of observed variables in the state space model.

However, the way you map your dataset (with id, year, country, etc.) to these variables would depend on how you were setting up the state space form of the panel TVP-VAR, and that is up to you, so there is no answer that we can provide to that question either.

Best,

Chad

--
You received this message because you are subscribed to the Google Groups "pystatsmodels" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pystatsmodels/befd912b-a24b-43af-ad78-cefc68fe2029n%40googlegroups.com.

D. K

Jan 16, 2023, 2:33:24 PM

to pystatsmodels

Dear Chad,

Thank you so much for your reply. I have realized that my initial question was unclear.

My model is a panel TVP-VAR written as a normal linear state space model, composed of a state equation and a measurement equation. I have managed to write it as in eq. 33 of Canova and Ciccarelli (2013).
The key model equation, where X t = Xt and ut = Xt′+ut with UtN = 0 (I + 2 Xt′ Xt), is attached.

I use exactly this class of models from your site: TVP-VAR, MCMC, and sparse simulation smoothing.

https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_tvpvar_mcmc_cfa.html

When I run the local model, I get the attached local graph for the 'Simulations based on KFS approach, MLE parameters' and 'Simulations based on CFA approach, MLE parameters' plots, where some countries and years appear in an unexpected format. I suspect it has to do with the data shape I am using. You can see my actual data shape in the attached screenshot.
When I run the 'Simulations with alternative parameterization yielding a smoother trend' step, one of the errors I get is "'value' must be an instance of str or bytes, not a tuple."
This is in addition to the earlier message "An unsupported index was provided and will be ignored when e.g. forecasting. self._init_dates(dates, freq)".

I suspect this has to do with my data shape and index. Because I created my dataset in Stata, it is in long format.
My question is a bit naive. How do I reshape my data so that it is compatible with statsmodels? How do I rewrite my code to bring my data into an acceptable shape for running the TVP-VAR, MCMC, and sparse simulation smoothing example?

I hope it is clear what I am looking for. The code I am now using is:

%matplotlib inline

from importlib import reload
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

from scipy.stats import invwishart, invgamma

#1

import pyreadstat

dtafile = 'panel.dta'
dta, meta = pyreadstat.read_dta(dtafile)
dta.tail()

labels=list(meta.column_labels)
column=list(meta.column_names)

# Panel data settings
year = dta.year
year = pd.Categorical(dta.year)
dta = dta.set_index([ "country", "year"])
dta["year"] = year

dta.head()

Thank you for your help

Best
David Kaufman

On Monday 01/ 13/ 2023 at 3:13:33 a.m. UTC user chadf...@gmail.com wrote

model.png

Data shape.png

local.png

Chad Fulton

Jan 16, 2023, 7:44:36 PM

to pystat...@googlegroups.com

Ultimately the data passed to the state space base class must be a 2-dimensional array / DataFrame, where the rows are the time dimension. In your case, I guess the time dimension is years, so your DataFrame index should only be years.

In your code

# Panel data settings
year = dta.year
year = pd.Categorical(dta.year)
dta = dta.set_index([ "country", "year"])
dta["year"] = year

it looks like you are setting the index to be a MultiIndex of country and year, which will not work with the state space framework. All of the non-date dimensions in your dataset (country, id, group, etc.) must be part of the columns.
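One way to get from a long panel to that layout is sketched below, assuming pandas and a tiny made-up frame. The names long_df and wide and all values are hypothetical; only the column names country, year, variable1 and variable2 follow the description earlier in the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format panel: one row per (country, year). Values are made up.
long_df = pd.DataFrame({
    "country": ["AUS", "AUS", "AUS", "CAN", "CAN", "CAN"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "variable1": [1.0, 1.5, np.nan, 2.0, 2.5, 3.0],
    "variable2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Move country out of the rows and into the columns: one row per year.
wide = long_df.pivot(index="year", columns="country")

# Flatten the (variable, country) column levels into names like AUS_variable1.
wide.columns = [f"{country}_{var}" for var, country in wide.columns]

# Use an annual PeriodIndex so the index is time-only and carries a frequency.
wide.index = pd.PeriodIndex(wide.index.astype(str), freq="Y", name="year")

print(wide.shape)
print(list(wide.columns))
```

The result has one row per year and one column per country-variable pair. Missing values in the original panel simply stay as NaN in the corresponding cell.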

Hope that helps,

Chad

D. K

Jan 20, 2023, 5:40:53 PM

to pystatsmodels

Thank you very much, Chad, for your help so far. I apologize if I am bothering you; I feel bad about taking up the time of those who are trying to help me. I would appreciate your understanding, because there are still two or three things I have not understood. The first has to do with the construction of the data; the others are more theoretical.

Following your suggestion, I changed the data so that the index is the time variable (year) and all other variables, including the string ones, are columns. I did it in the following way.

I added the three-character ISO country code as a prefix to each variable name. For example, the variable burden becomes AUS_burden for Australia, CAN_burden for Canada, and so on. I have grouped the columns by country, as Screenshot1 shows. I am new to Python, so I made the change in Stata, but that should not matter. A second possible grouping is by variable, as screenshot2 shows. My question is: which one is correct, screenshot 1 or screenshot 2?

Mainly, I am wondering how to select the variable I am interested in, given that there are now 80 countries in the panel and over 1800 columns. These are the variables with the country prefix. That is, I end up with many columns for the same underlying variable, e.g. AUS_burden, CAN_burden, etc., instead of a single burden. How can I have only one? I found this construction in a paper by Korobilis on PVARs. I read somewhere that "perhaps there should be a list or dictionary." I do not know how this is done, or whether it can be done. In general, I understand the correct layout for a single country, as you explained, and I tested it on my data, but how is it done for a panel? My original long-format data had about thirty or more variables for 80 countries.

My code now is:

import pyreadstat

dtafile = 'panel.dta'
dta, meta = pyreadstat.read_dta(dtafile)
dta.tail()

labels=list(meta.column_labels)
column=list(meta.column_names)

# Panel data settings

year = pd.date_range('1945', freq='A', periods=76)
dta = dta.set_index(["year"])
dta.head()

Second, in the local model with an AR(1) process, the series should not be stationary. Or am I misunderstanding this?

Moreover, in step 7 of the TVP-VAR code from your site, I get a MissingDataError: exog contains inf or nans. I also tried it with a single country. Of course, I have gaps in the data, but how do I solve this without deleting rows or taking averages?

Finally, in the same model, how can I add the time-varying Cholesky impulse response function?

Thank you so much. I apologize if I have asked too much.

David

screenshot2.png

ScreenShot1 .png

Chad Fulton

Jan 21, 2023, 10:24:33 AM

to pystat...@googlegroups.com

I'm sorry that I can't be of more help, but as I mentioned earlier, we do not have a panel TVP-VAR model in Statsmodels, so I do not know how your model is set up and therefore I do not know how your variables should be defined.

My advice would be to start by understanding your model and its state space form: what are the observed variables, what are the unobserved states, and how do those variables map to those states. I am not familiar with your model and unfortunately I don't have the bandwidth to determine / construct your model for you.

Once you understand your model, you should try to create a very small example with only a couple of observed variables to see if it works before you try to scale it up to your entire dataset (aside: a VAR model with 80 countries each with 30 variables sounds to me like it will just be too big, but I don't know what you're planning).
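As an illustration of starting small, the sketch below picks a handful of columns out of a wide, country-prefixed frame using pandas. The frame wide, the country codes, the second variable name gdp, and all values are made up for the example; only the AUS_burden / CAN_burden naming pattern comes from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: yearly rows, columns named <country>_<variable>.
years = pd.period_range("2000", periods=5, freq="Y")
rng = np.random.default_rng(0)
cols = [f"{c}_{v}" for c in ["AUS", "CAN", "DEU"] for v in ["burden", "gdp"]]
wide = pd.DataFrame(rng.standard_normal((5, len(cols))), index=years, columns=cols)

# A very small test case: one variable for two countries.
small = wide[["AUS_burden", "CAN_burden"]]

# All countries for a single variable, selected by the column-name suffix.
burden = wide.filter(regex=r"_burden$")

print(small.shape)
print(list(burden.columns))
```

A two-column frame like small is easy to inspect by eye, which helps when checking that the state space form behaves as intended before more countries or variables are added.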

Best,

Chad
