Spatial regression when the y variable is point data and some of the x values are polygons


Stephen Clark

Oct 26, 2021, 1:08:35 PM
to Openspace List
Hello,

I am looking for some pointers on how best to estimate relationships in my data. The y data that I have relate to businesses that have closed (there are about 1000 businesses). I want to see how the probability of closure is related to a number of factors, x, some of which are intrinsic to the business and others of which relate to the neighbourhood (there are about 3000 neighbourhoods). Without a spatial complication I would just run a generalised linear model with a logit/probit link. However, to my mind this is not the best approach.

The main complication is that the y variable (and the business-specific factors) are point data, whilst the x variables mainly relate to neighbourhoods, which are polygons. I can work out which neighbourhood each business is in and ascribe that neighbourhood's characteristics to the business. But sometimes a business is at the edge of a neighbourhood, and the attributes of surrounding neighbourhoods become important. (By the way, I do have access to population-weighted centroids for the neighbourhood polygons.)

I have looked at SAR and SEM models, but they require the data to be in the same geography. This I do not have, so what would the weights be: a distance-based matrix built from the locations of the businesses, or an adjacency matrix based on the neighbourhoods? I am reluctant to aggregate the businesses into counts of closed businesses in each neighbourhood because this would lose the business factors that I hold.
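To make the first of these two options concrete, here is a minimal sketch of a distance-band weights matrix built from business locations and then row-standardised. The coordinates, the 2.5-unit threshold and the function name `distance_band_weights` are all made up for illustration. In practice the threshold would need to be chosen with care, and projected coordinates are assumed.

```python
import math

def distance_band_weights(points, threshold):
    """Binary distance-band weights, row-standardised so each non-empty row sums to 1."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        xi, yi = points[i]
        for j in range(n):
            xj, yj = points[j]
            # Two different points are neighbours if they lie within the threshold.
            if i != j and math.hypot(xi - xj, yi - yj) <= threshold:
                W[i][j] = 1.0
        row_sum = sum(W[i])
        if row_sum > 0:
            W[i] = [w / row_sum for w in W[i]]
        # A point with no neighbour inside the band keeps an all-zero row.
    return W

# Hypothetical business coordinates in projected units.
businesses = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
W = distance_band_weights(businesses, threshold=2.5)
```

The last business has no neighbour within the band, so its row stays at zero. Such isolates are one reason a k-nearest-neighbour definition is sometimes preferred for point data.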

I am comfortable with GeoDa and R if anyone has any suggestions for me to explore. Thanks.

Danlin Yu

Oct 26, 2021, 1:47:58 PM
to openspa...@googlegroups.com

Stephen:

In the scenario you describe, I would suggest attributing the characteristics of a neighborhood to the business points that fall within it (a point-in-polygon spatial join does that nicely). It is certainly true that some of the businesses near a neighborhood's border might be influenced by neighboring neighborhoods, but we also cannot rule out the possibility that even businesses in the dead center of a neighborhood are influenced by other neighborhoods. The argument here is not perfect, but practical: if a business falls within a neighborhood, we take the chance of assigning that neighborhood's characteristics to the business and attribute the other neighborhoods' influence to the stochastic process, and our residuals will capture that effect. The spatial autoregressive models (SAR/SEM) are then applied to see whether such a design makes sense. After all, models are tools to facilitate the interpretation of the underlying research question; it is up to the scholar (us) to interpret the modeled result.
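As an illustration of the point-in-polygon join described above, the sketch below uses a standard ray-casting test on made-up data. The neighbourhood rings, the `income` attribute and the function names are hypothetical. A GIS library would normally handle this, along with edge cases such as points lying exactly on a boundary.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices without a repeated end point."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical neighbourhood polygons with one attribute each.
neighbourhoods = {
    "A": {"ring": [(0, 0), (2, 0), (2, 2), (0, 2)], "income": 30000},
    "B": {"ring": [(2, 0), (4, 0), (4, 2), (2, 2)], "income": 45000},
}
# Hypothetical business points with a closure indicator.
businesses = [
    {"id": 1, "x": 0.5, "y": 1.0, "closed": 1},
    {"id": 2, "x": 3.0, "y": 1.5, "closed": 0},
    {"id": 3, "x": 9.0, "y": 9.0, "closed": 0},
]

def spatial_join(businesses, neighbourhoods):
    """Attach the containing neighbourhood's name and income to each business."""
    joined = []
    for b in businesses:
        match = None
        for name, nb in neighbourhoods.items():
            if point_in_polygon(b["x"], b["y"], nb["ring"]):
                match = name
                break
        income = neighbourhoods[match]["income"] if match else None
        joined.append(dict(b, neighbourhood=match, income=income))
    return joined

joined = spatial_join(businesses, neighbourhoods)
```

Businesses that fall outside every polygon (the third one here) come back with no neighbourhood, which is worth checking for before any model is fitted.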

Besides SAR/SEM, SpatialFiltering in R/spdep might be another alternative worth looking at.

Hope this helps.

Best,

Danlin

-- 
___________________________________________
Danlin Yu, Ph.D.
Professor of GIS and Urban Geography
Department of Earth & Environmental Studies
Montclair State University
Montclair, NJ, 07043
Tel: 973-655-4313
Fax: 973-655-4072
Office: CELS 314
Email: y...@mail.montclair.edu
webpage: csam.montclair.edu/~yu

MIGUEL GOMEZ DE ANTONIO

Oct 26, 2021, 2:03:24 PM
to openspa...@googlegroups.com
Dear Stephen,

Have you explored the possibility of estimating models for point processes? You can avoid the use of polygons and estimate the models in continuous space. These models are suitable for determining which factors might explain the probability of a business going bankrupt.

The R package "spatstat" allows the estimation of models for inhomogeneous processes, Gibbs models and other point processes.

Miguel Gómez de Antonio
Profesor Titular de Universidad
Dpto. Economía Aplicada, Pública y Política
Universidad Complutense de Madrid


Stephen Clark

Oct 26, 2021, 3:25:00 PM
to Openspace List
Thanks Danlin and Miguel for these suggestions.

I have been reading about Spatially Lagged X (SLX) models, Spatial Durbin Models and Spatial Durbin Error Models. All of these include a WXθ term that 'borrows strength' from the neighbouring X variables, e.g. population density, income, ...

Spatially Lagged X: y = Xβ + WXθ + ε
Spatial Durbin Model: y = φWy + Xβ + WXθ + ε
Spatial Durbin Error Model: y = Xβ + WXθ + u, u = λWu + ε

LR tests are available to help choose between the models. The only issue is that in R I can only find functions that fit lm versions of these models, not glm versions.
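One observation that may help with the SLX case specifically: the model has no spatial lag of y and no spatially structured error term, so WX can be computed beforehand and entered as extra columns in an ordinary logit. This does not carry over to the Durbin or Durbin Error variants. The sketch below is only an illustration under that assumption. The data and the neighbour list are made up, and the hand-rolled gradient-ascent fit stands in for whatever GLM routine would be used in practice.

```python
import math

# Hypothetical neighbour lists for six businesses (a simple chain).
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
x = [0.2, 0.9, 0.4, 0.8, 0.1, 0.7]   # made-up covariate
y = [0, 1, 0, 1, 1, 0]               # made-up closure indicator

# Spatial lag WX with row-standardised weights: the mean of x over each unit's neighbours.
wx = [sum(x[j] for j in neighbours[i]) / len(neighbours[i]) for i in range(len(x))]

# Design matrix with columns: intercept, X, WX.
rows = [[1.0, xi, wxi] for xi, wxi in zip(x, wx)]

def log_lik(coefs):
    """Bernoulli log-likelihood under a logit link."""
    total = 0.0
    for row, yi in zip(rows, y):
        eta = sum(c * v for c, v in zip(coefs, row))
        total += yi * eta - math.log1p(math.exp(eta))
    return total

def fit_logit(steps=5000, lr=0.5):
    """Plain gradient ascent on the log-likelihood; adequate for a toy example only."""
    coefs = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for row, yi in zip(rows, y):
            eta = sum(c * v for c, v in zip(coefs, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            for k in range(3):
                grad[k] += (yi - p) * row[k]
        coefs = [c + lr * g / n for c, g in zip(coefs, grad)]
    return coefs

beta0, beta, theta = fit_logit()
```

Here `beta` plays the role of β and `theta` the role of θ in the SLX equation, interpreted on the logit scale. With real data the same design matrix could be passed to any standard logit or probit routine.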

Marshall Feldman

Oct 28, 2021, 12:21:35 PM
to openspa...@googlegroups.com

Dear Stephen,

I don't know who your audience is or your purposes for making these estimates.

With these caveats aside, I'd advise starting with the simplest estimates you can. I suspect in your case this would be assigning polygon values to each business and then running a logit/probit estimate. Even then you'd have to account for spatial autocorrelation, but I think even the importance of this would depend on the sizes of your polygons and the spatial pattern of your businesses.
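One simple way to gauge whether spatial autocorrelation matters after a plain logit is to compute Moran's I on the model residuals. The sketch below implements the usual formula, I = (n / S0) · Σ w_ij z_i z_j / Σ z_i², on made-up residuals and a hypothetical binary chain adjacency. It is a descriptive check only and not a formal test for logit residuals.

```python
def morans_i(values, W):
    """Moran's I for a list of values and a square weights matrix W."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                  # deviations from the mean
    s0 = sum(sum(row) for row in W)                 # sum of all weights
    num = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / s0) * num / den

# Hypothetical binary adjacency for four locations along a line.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, -1.0, -1.0]     # similar residuals sit next to each other
alternating = [1.0, -1.0, 1.0, -1.0]   # neighbouring residuals have opposite signs

i_clustered = morans_i(clustered, W)       # positive
i_alternating = morans_i(alternating, W)   # negative
```

Values well above the null expectation of -1/(n-1) suggest that nearby residuals are similar, which would be a reason to move on to the spatial models discussed in this thread.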

You can also get carried away with resampling methods and the like, largely because there may be no analytical solutions for your estimating statistics. But ask yourself what difference this would make.

Even commonly used OLS estimators are "BLUE" (Best Linear Unbiased Estimators), and the preference for linearity is due to a combination of analytical tractability and limitations on 20th-century computing power. In other words, mathematically arbitrary, pragmatic concessions.

In my experience, seeking the best possible estimator is often a case of the juice not being worth the squeeze: i.e., the substantive interpretation of the more complicated results does not differ from that of the simple, but mathematically flawed, results. This is especially true if your intended audience is unfamiliar with the math and you have no understandable way to explain your methods. With logit models one can display logistic curves, but other methods may have nothing similar.

Additionally, I have most often found little difference between logit models and simpler methods. I always run OLS or GLS models first; typically the substantive interpretation of the OLS model's results is identical to that of the GLS and MLE-logit models' results. If not, then this is another, methodological finding regarding the circumstances under which the simpler-but-wrong method leads to substantively wrong results.

I'm not saying don't attempt something more complicated: just start with baby steps. If you go further, I think Miguel's suggestion about point processes is a good one.

But if you go to such trouble, be cautious with R. Years ago I estimated a GLM in S but was suspicious of the results: they were too close to the OLS estimates. So I took a textbook GLM model and estimated it. The results differed from the textbook's. Thankfully, S (and R) allows access to the source code, and by inspecting it I found a bug. After fixing it, both the textbook model and my original were correct. I'm currently working to modify an existing R package to analyze time-series data. You may have to do something similar.

I'd add that, IIRC, David Birch found the half-life of small businesses to be about five years. There are so many considerations like this (e.g., spatio-temporal status of the macro economy, firm niches in industrial sectors, etc.) that you may run out of degrees of freedom by the time you take all relevant considerations into account. Starting with simpler models would reveal this before you waste time with more complex modeling strategies.

Good luck!

Marshall Feldman
Emeritus Professor of Urban Studies and Labor Research
The University of Rhode Island
