comparing raster area in R


Rebecca Stubbs

Dec 4, 2017, 6:20:12 AM
to GIS In Ecology Forum
I have rasters for multiple species distributions under present climate conditions and also under projected future climate change scenarios. I was quantifying changes between present and future suitable habitat using the nicheOverlap function and Schoener's D statistic. Quite a few of the organisms in my study are simply moving farther up mountains, so there is a lot of overlap, as the future distribution sits inside the present distribution (just at higher elevations). But by looking at the raster files in QGIS I can see that there is less suitable habitat in the future, so I wanted to quantify this. I have scoured the internet for a good way to calculate area for rasters and never found anything that perfectly suited my fancy, so I wrote up something that is an amalgamation of bits and pieces of various scripts. It is pasted below. I would like to stay in R because I need to run this for many species, but I am open to other suggestions.

Two questions:
1) do you all agree this is doing what I think it is doing (calculating area in square kilometers)?
2) is there a way to simplify this? Specifically, you'll see I go from a raster to a dataframe and back to a raster. Could I stay in rasters throughout?

Thanks for any input!

Rebecca


####
library(raster)

#load rasters
m <- raster("SpeciesA_avg.asc")
mf <- raster("SpeciesA_future_layers_avg.asc")

#change to dataframe
m.df <- as.data.frame(m, xy=TRUE)

#get rid of NAs
m.df1 <- na.omit(m.df)

#keep only cells that have a suitability score above 0.5 (scores range from 0 to 1)
m.df2 <- m.df1[m.df1$SpeciesA_avg > 0.5, ]

#re-rasterize just the suitable area
m.raster <- rasterFromXYZ(m.df2)

##same as above but for future projection
mf.df <- as.data.frame(mf, xy=TRUE)
mf.df1 <- na.omit(mf.df)
mf.df2 <- mf.df1[mf.df1$SpeciesA_future_layers_avg > 0.5, ]
mf.raster <- rasterFromXYZ(mf.df2)

#get sizes of all cells in current distribution raster
#note my original layers were 30 arc-seconds, i.e. roughly 1 km2 per cell.
#I considered just multiplying the cell count by the resolution (0.008333333),
#but as I am working far from the equator I know every cell is not going to be
#exactly 1 km2. So I'm using the area function instead.
cell_size <- area(m.raster, na.rm=TRUE, weights=FALSE)

#drop NAs from the vector of cell sizes. It looks like these come back when
#switching from dataframe to raster
cell_size1 <- cell_size[!is.na(cell_size)]

#compute area [km2] of all cells in the present-day raster
raster_area_present <- length(cell_size1) * median(cell_size1)
raster_area_present

#get sizes of all cells in future raster [km2]
cell_size <- area(mf.raster, na.rm=TRUE, weights=FALSE)

#drop NAs from the vector of cell sizes
cell_size1 <- cell_size[!is.na(cell_size)]

#compute area [km2] of all cells in the future raster
raster_area_future <- length(cell_size1) * median(cell_size1)
raster_area_future

##calculate change in area
dif_area <- raster_area_present - raster_area_future
dif_area

Jeffrey Evans

Dec 5, 2017, 5:55:07 AM
to GIS In Ecology Forum
You have some unnecessary steps and your code can be simplified quite a bit. It would be prudent to conduct your analysis in a projected coordinate system rather than a geographic one. If that is not possible, note that the area function returns square kilometers per cell, so the sum of the raster represents the total area in square kilometers. As for streamlining your analysis, here is some code that may help:

First, let's add the required packages and create some example data.

    library(raster)
    r.current <- raster(ncols=100, nrows=100)
      r.current[] <- runif(ncell(r.current), 0, 1)
      r.current <- focal(r.current, w=focalWeight(r.current, 6, "Gauss"))
    r.future <- raster(ncols=100, nrows=100)
      r.future[] <- runif(ncell(r.future), 0, 1)
      r.future <- focal(r.future, w=focalWeight(r.future, 6, "Gauss"))


You really do not need to coerce your raster data to a data.frame to truncate values. A vectorized replacement will turn all values under your threshold into NAs.
 
    p=0.5
    r.current[r.current < p] <- NA 
    r.future[r.future < p] <- NA 

No data (NA) begets no data, so individual cell areas stay NA in the rasters above. As such, there is no reason to drop the NAs before calling the area function; NA cells are not included in the cell area values.

    a.current <- area(r.current, na.rm=TRUE, weights=FALSE)
    a.future <- area(r.future, na.rm=TRUE, weights=FALSE)


We can plot the data to check that the NAs were created and retained correctly.

    par(mfrow=c(2,2))
      plot(r.current, main="current distribution")
      plot(a.current, main="current cell areas")
      plot(r.future, main="future distribution ")
      plot(a.future, main="future cell areas")

Now, we can take the sum of the cell areas corresponding with estimated niche as well as take the difference.

    sum(getValues(a.current), na.rm=TRUE)
    sum(getValues(a.future), na.rm=TRUE)
    sum(getValues(a.current), na.rm=TRUE) - sum(getValues(a.future), na.rm=TRUE)

  
If you would like to create a raster of niche agreement you can pass a custom function to overlay. Since you are working with probabilities, the function needs to account for values occurring in both x (current) and y (future) rather than an exact value match. Since the background values were set to NA, a simple AND statement will work within an ifelse.

    d <- function(x, y) { ifelse(!is.na(x) & !is.na(y), 1, NA) }
    r.agree <- overlay(stack(r.current, r.future), fun=d)
    plot(r.agree)


Best,
Jeff

Jeffrey S. Evans, Ph.D.,  | Senior Landscape Ecologist 

The Nature Conservancy | Global Lands Science Team

Affiliate Assistant Professor | Zoology & Physiology | University of Wyoming

Laramie, WY | jeffre...@tnc.org | (970) 672-6766

GIS in Ecology

Dec 5, 2017, 6:37:35 AM
to GIS In Ecology Forum
Hi Rebecca,

Jeff has suggested some streamlining of your R code (thanks Jeff), but I'd like to suggest an alternative approach that I think would streamline the whole process even more while giving you more flexibility, for example if you'd like to change your threshold values. It would also make it easier to apply to non-projected raster data layers (although you'd be much better off working with an appropriate projection when creating your raster data layers in the first place).

For this approach, you can stay within the QGIS environment (although you could do the same in R). First you create a polygon grid of the same size and extent as your raster data layer (so it directly overlays it). This is done with the CREATE VECTOR GRID tool. Next, for this grid you use the FIELD CALCULATOR tool to calculate the area of each individual polygon in the grid (if you're using a projected coordinate system, which is highly advisable, these should all be the same). After that, you use the POINT SAMPLING TOOL to extract the values of each raster data layer (current and future predicted distribution) into the attribute table of your polygon grid. This can be either the raw predicted likelihood of occurrence or the predicted presence-absence based on your threshold. Personally, I'd suggest the former.

Once you have the information in this table, you have a number of options. Firstly, you can calculate the total area of grid cells for each model (current and/or future distribution) by simply summing the area values for the polygon grid cells with values over your selected threshold. Secondly, you can look not just at the total area, but at which specific cells go from occupied to unoccupied, from unoccupied to occupied, stay occupied, or remain unoccupied over time based on your predictions.

This can be done either in QGIS (you can find more information about how to do this here: https://gis.stackexchange.com/questions/43037/getting-tabular-statistics-from-table-using-qgis), or you can move the attribute table of the polygon grid data layer into R and do it there. This may well make it much easier to sum the total area values only for those cells with a value over a specific threshold (e.g. 0.5).
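To make that last step concrete, here is a minimal sketch of the R side, assuming the attribute table has been exported from QGIS as a CSV; the file name and the column names (`area`, `current`, `future`) are hypothetical and will depend on how you named your fields:

    # Hypothetical attribute table exported from the polygon grid in QGIS:
    # one row per grid cell, with the polygon's area plus the sampled
    # suitability values for the current and future rasters.
    tab <- read.csv("species_grid_attributes.csv")

    threshold <- 0.5

    # Sum the cell areas where predicted suitability exceeds the threshold
    area_current <- sum(tab$area[tab$current > threshold], na.rm = TRUE)
    area_future  <- sum(tab$area[tab$future  > threshold], na.rm = TRUE)

    # Change in suitable area (same units as the 'area' field)
    area_current - area_future

Because the threshold only enters at this final step, re-running with a different value is a one-line change.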

Now, when it comes to working with multiple species, there are a number of key points. Firstly, when you use the POINT SAMPLING TOOL, you can extract values from pretty much as many rasters as you'd like in that one step; you just need to remember to use appropriate names so that you know which fields belong to which species and which time frames. This means you only need to do this step once to create one very large table with the information from all the species in it (as well as which cells have changed over time).

Second, once you have this table, you can calculate the total area of cells above your chosen threshold just as before, but for multiple species. To automate this in QGIS, you might have to use something like the graphical modeller or batch processing, but these are relatively easy to use. However, since you know R, I would suggest doing it in R and simply running the code repeatedly, subsetting the table on each modelled value field and your given threshold before calculating the total area.
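In R, that repetition collapses to a loop over the suitability columns. A hedged sketch, again assuming a hypothetical table with an `area` field plus one suitability column per species and time frame:

    # Hypothetical column layout: 'id' and 'area', plus columns such as
    # 'SpeciesA_current', 'SpeciesA_future', 'SpeciesB_current', ...
    tab <- read.csv("all_species_grid_attributes.csv")
    threshold <- 0.5

    # Every column that is not an identifier or the polygon area is
    # treated as a modelled suitability field
    suit_cols <- setdiff(names(tab), c("id", "area"))

    # Total suitable area for every species/time-frame column at once
    suitable_area <- sapply(suit_cols, function(cl) {
      sum(tab$area[tab[[cl]] > threshold], na.rm = TRUE)
    })
    suitable_area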

The main benefits of this are: 

1. It removes any problems you may create through the incorrect use of projections in R, which can be hard to detect since it is hard to visualise your data in R.
2. If someone suggests you should use a different threshold value, or wants you to explore the impact of your chosen threshold value, you can work with the same attribute table and just adjust the threshold you use in the final part.
3. You can use the polygon grid data layer to assess not just the change in the area occupied, but also which cells occupancy is predicted to change in, and this can be just as informative as looking at the changes in total area. In particular, if there are barriers to movement between currently suitable areas and future predicted suitable areas, then no matter how suitable the new areas are, a population may not be able to reach them and so will decline in distribution rather than shift to new areas.

This isn't necessarily a recommendation that you use a polygon grid data layer approach rather than a raster-based one, just that it's an alternative worth thinking about, especially if you're working with raster data layers that are in a geographic coordinate system rather than a true projected one (though it would be better not to be doing that in the first place).

All the best,

Colin

Jeffrey Evans

Dec 6, 2017, 7:36:40 AM
to gis-in-eco...@googlegroups.com
Colin's recommendation is sound, but not necessarily memory safe. For large rasters this method may be untenable, as QGIS is not always good with large-data problems. The R raster package will keep the problem somewhat memory safe. If you want to carry out Colin's analysis in R, which is a good suggestion, I would point you towards a different object class.

The sp SpatialPixelsDataFrame object class is, in effect, exactly what Colin describes in QGIS, but it is much simpler to get to a polygon topology with all of the associated raster values. In this sp class, each pixel is a polygon representing its place in the array. Just as in QGIS, the cell size will vary if the projection distorts the geographic space. However, once again I urge you to use a projected coordinate system. The raster::area function is an approximation, and you can lose considerable precision that, over large areas, could be expressed as a notable bias. Here I provide a worked example using a UTM (meters) projection. For large areas, Albers (meters) is often a good choice, but there are many other options depending on where your analysis extent resides; it just takes a little bit of research.

The example below uses an sp SpatialPixelsDataFrame class object. I would, however, recommend using the raster::stack function to read in the data: this way you can read many rasters into a single raster stack object and then coerce it into an sp object. In this example I create dummy data, put the rasters into a raster stack, and then coerce that into a SpatialPixelsDataFrame.

Again, first add packages and create some example data, but this time in a projected coordinate system with units in meters.

    library(raster)
    library(sp)
    library(rgeos)

    utm12 <- CRS("+proj=utm +zone=12 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0")
    r.current <- raster(nrows=180, ncols=360, xmn=571823.6, xmx=616763.6,
                        ymn=4423540, ymx=4453690, resolution=250, crs=utm12)
      r.current[] <- runif(ncell(r.current), 0, 1)
      r.current <- focal(r.current, w=focalWeight(r.current, d=1000, "Gauss"))
    r.future <- raster(nrows=180, ncols=360, xmn=571823.6, xmx=616763.6,
                       ymn=4423540, ymx=4453690, resolution=250, crs=utm12)
      r.future[] <- runif(ncell(r.future), 0, 1)
      r.future <- focal(r.future, w=focalWeight(r.future, d=1000, "Gauss"))

Now we can coerce the raster stack into a SpatialPixelsDataFrame object, where the cells are functionally polygons and the individual raster values are represented as columns of a data.frame in the @data slot. We can check the class of the niche object and its @data slot, and display the first few rows of data.
     
    niche <- as(stack(r.current, r.future), "SpatialPixelsDataFrame")     
      class(niche)
      names(niche) <- c("current", "future")
      head(niche@data)
      class(niche@data)

In much the same way you manipulate a raster class object, here we use indexing to replace values in the data.frame.

    niche@data[niche@data < 0.5] <- NA
      niche@data[1:100,]

Since our raster cells all have the same area, we can use the cell count to derive area.

    ( n.current <- length(niche@data$current[!is.na(niche@data$current)]) )
    ( n.future <- length(niche@data$future[!is.na(niche@data$future)]) )
    abs(n.current - n.future)
 
Alternatively, we can use gArea in the rgeos package to return an area value for each cell (in square meters, since our projection units are meters). Then, using a query on the data.frame, we can sum the associated areas.
  
    niche@data$area <- gArea( as(niche, "SpatialPolygonsDataFrame"), byid=TRUE )

    ( m2.current <- sum(niche@data[!is.na(niche@data$current),]$area) )
    ( m2.future <- sum(niche@data[!is.na(niche@data$future),]$area) )
 
Now, here is where it gets interesting for future analysis. In my previous example we applied a function using overlay. We can do much the same here using apply, which has the additional advantage of applying a function over the rows (MARGIN=1) or columns (MARGIN=2) of the array. If you have several species to analyze, this provides the flexibility for fairly complex analyses across the estimated niche spaces. Note that in the new "d" function we only define x (not x and y as in the previous example). In this case x represents each row, and the index x[i] is the position of a value within that row's vector. When we call apply we can also subset the data.frame: since we are only interested in the first two columns, we index it as niche@data[,1:2], which limits each row vector to the first two columns and thus passes a vector of two values to the function.
    
    d <- function(x) { ifelse(!is.na(x[1]) & !is.na(x[2]), 1, NA) }
    niche@data$agree <- apply(niche@data[,1:2], MARGIN=1, FUN=d)
    plot(raster(niche, 4), main="Agreement between current and future niche")
  
If you are going to read your data into memory anyway, then using this data structure will open up your ability to perform fairly complex analyses down the road. For example, this data structure is already set up to perform a local autocorrelation analysis (e.g., LISA or Gi*) using the spdep package.
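To illustrate, here is a hedged sketch of a local Getis-Ord Gi* analysis on the current suitability surface; the 500 m neighbour distance, the binary weights style, and the NA-to-zero substitution are my assumptions, not something prescribed by the thread:

    # Sketch only: local Gi* hot-spot statistic over the pixel centroids
    library(spdep)

    xy <- coordinates(niche)                # pixel centroids (meters)
    nb <- dnearneigh(xy, d1 = 0, d2 = 500)  # assumed 500 m neighbour band
    lw <- nb2listw(include.self(nb), style = "B", zero.policy = TRUE)

    vals <- niche@data$current
    vals[is.na(vals)] <- 0                  # localG does not accept NAs

    niche@data$Gi <- as.numeric(localG(vals, lw, zero.policy = TRUE))
    plot(raster(niche, layer = which(names(niche) == "Gi")),
         main = "Local Gi* of current suitability")

Tune the neighbour distance to your cell size; with a 250 m resolution, 500 m captures the rook-and-queen neighbourhood around each pixel.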

Please note that I do not disparage the use of GIS software such as QGIS, but if one is going to perform any statistical analysis on spatial data, software such as R or S+ is a good choice to provide this extended capacity. Because these are scripting languages, there are also instances where data manipulation and querying are much more streamlined than in GIS software. If you are willing to step out into Python in QGIS or ArcGIS, much of the same functionality is available, sans the additional specialized package support that R provides.

Best,
Jeff


GIS in Ecology

Dec 6, 2017, 7:51:03 AM
to GIS In Ecology Forum
Hi Jeff,

That's a very good point! I should have made it clear that my suggested approach in QGIS will run into problems if you try to apply it to very large raster data layers. Thanks for the information on how to apply it in R instead; I'm sure a lot of people will find that very useful.

I think this is a good example of the need to explore different software packages to find the best solution to a particular problem: sometimes it will be QGIS, sometimes R, and sometimes something different altogether. The key is not to get stuck in the mindset that everything needs to be done in a single software package, and to structure your work so it is as easy as possible to move between different packages, as and when required, to make the most of the benefits each one can offer. In the long run this makes life easier and more productive.

All the best,

Colin