script optimisation

23 views
Skip to first unread message

Lauren Hodgson

unread,
Oct 2, 2012, 12:23:44 AM10/2/12
to tropi...@googlegroups.com
Hi all,

Does anyone have any ideas for optimising the following script? It
uses two inputs: a dataframe, and a list,

I have a data frame that is 1.4 million rows long. It has an ID for
stream segment, and the runoff from the land that goes directly into
that stream. called runoff

HydroID | runoff
-------------------------------
1 | 29
2 | 3468
3 | 542
...



I also have a list with each stream, and all streams that feed into
it. called flowdata.

eg.

3, 2, 1
2, 1
1
...


I want to calculate total runoff for the entire catchment above a
stream. ie. add up the runoff values for each value in each row of
the list.

I've tried two methods for this. One is an sql query, which takes 8 hours.
The second uses R, and iterates through the HydroIDs

for (id in accflow$HydroID) { cat(id, '\n')
i=i+1
##sum runoff from runoff table for a vector in flowdata
v=strsplit(flowdata[[i]],',') # isolate the list associated with HydroID
ids=as.vector(v[[1]]) #turn the list into a vector
tt=sum(runoff$runoff[which(runoff$node_id %in% ids)]) # find the
vector in node_id, and sum the runoff values for all


accflow$acc_runoff[which(accflow$HydroID==id)]=tt #print accumulated
runoff to another table

}

Phillips, Ben

unread,
Oct 2, 2012, 2:05:20 AM10/2/12
to <tropical-r@googlegroups.com>
lapply or do.call will help here. May even be a parallel version of lapply nowadays...

Excuse the brevity: sent from my phone.
> --
> An R group for questions, tips and tricks relevant to spatial ecology and climate change.
> All R questions welcome.
> ---
> You received this message because you are subscribed to the Google Groups "Tropical R" group.
> To post to this group, send an email to tropi...@googlegroups.com.
> To unsubscribe from this group, send email to tropical-r+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

Lauren Hodgson

unread,
Oct 2, 2012, 10:13:33 PM10/2/12
to tropi...@googlegroups.com
Thanks Ben - it took me a while to get it working, but I'm using sapply:

accflow[,2]=sapply(flowdata, function(ids) {
return(sum(runoff$runoff[which(runoff$node_id %in% ids)])) })

It is at least 10 times faster.

Eventually I'll stop asking stupid questions where the answer is
'don't use a loop, use apply'

Lauren Hodgson

unread,
Oct 3, 2012, 2:14:54 AM10/3/12
to tropi...@googlegroups.com
and, yep, because even sapply was likely to take 40+ hours, I'm now
using parSapply over 10 cores.

It takes 1 min 30 to do what it took the loop 3 hours to do. Now the
total task should take 3 hours!

Thanks again, Ben!

alice Hughes

unread,
Feb 22, 2013, 10:28:12 PM2/22/13
to tropi...@googlegroups.com
Hi all


I have a group of maxent species "distribution" projections in the ascii format, all of which are thresholded (10% training).
I also have a table denoting which order each species falls into. Could someone help me write a couple of lines of code to: for each order classify each SR to binary, then summate them to get species richness for each order?
Thanks and all the best

Alice

> Date: Wed, 3 Oct 2012 16:14:54 +1000
> Subject: Re: [Trop-R] script optimisation
> From: lhodg...@gmail.com
> To: tropi...@googlegroups.com

Jeremy VanDerWal

unread,
Feb 24, 2013, 4:51:35 AM2/24/13
to tropi...@googlegroups.com
Alice, hope the following helps. Cheers, Jeremy

library(SDMTools)
working.dir = "some/working/directory"; setwd(working.dir) #define and set the working directory
ascii.dir = "directory.where.species.asciis.are" #define the directory of the species ascii files

order.info = read.csv("some.file") #read in the table of order and species columns
#I am assuming the order.info table has 2 columns... order and species

for (ord in unique(order.info$order)) { cat(ord,'\n')#cycle through each order
out = NULL #define the output
species = order.info$species[which(order.info$order==ord)] #get the species in this order
for (spp in species) { cat('\t',spp,'\n')#cycle through each species in that order
tasc = read.asc("the.species.distribution.asc") #read in teh species ascii
###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)
if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution
out = tasc #copy the species distribution
} else { # if out has some data, add the next species to it
out = out + tasc #add on the species data
}
}
write.asc(out,paste(ord,"_richness.asc",sep='')) #write out the richness for the order
}

alice Hughes

unread,
Feb 25, 2013, 10:27:41 PM2/25/13
to tropi...@googlegroups.com
Hi Jeremy

Thanks so much for that script, I really appreciate it.
Sorry for not replying yesterday-I realised that the data needed alot of cleaning up before I could try the script.

At the moment the sdms are not binary, but have a continuous probability of occurrence above the 10% training threshold, so I added a couple of lines to classify anything above 0 as 1, and keep 0 as 0.
 Because of the quantity of data when I was cleaning it I ended up listing the Genus, rather than species on the table (sample below), as there are well over 2000 genera being modelled (about 6000 species). So R needs to recognise the first half of the sdm output (i.e the ascii might be called "Alisma_orientale_thresholded.asc", then use the table to identify it as a Alismatale.
I tried to get the script running earlier, and deal with the errors-but in the end it just ran through without processing anything....
Thanks so much

Alice


Genus Order
Alisma Alismatales
Sagittaria Alismatales
Spirodela Alismatales
Hydrocharis Alismatales
Vallisneria Alismatales
Triglochin Alismatales
Zannichellia Alismatales
Ruppia Alismatales
Acronema Apiales
Aegopodium Apiales
Angelica Apiales
Arcuatopterus Apiales
Aulacospermum Apiales
Carum Apiales
Cortiella Apiales
Cyclorhiza Apiales
Glehnia Apiales
Haplosphaera Apiales
Hyalolaena Apiales
Hymenidium Apiales
Keraymonia Apiales
Ligustcum Apiales
Lomatium Apiales
Nothosmyrnium Apiales
Osmorhiza Apiales
Peucedanum Apiales
Prangos Apiales
Ptilimnium Apiales
Semenovia Apiales
Sinodielsia Apiales
Sium Apiales
Trachyspermum Apiales
Vicatia Apiales
Yushuia Apiales
Acanthopanax Apiales
Aralia Apiales
Brassaiopsis Apiales
Dendropanax Apiales
Eleutherococcus Apiales
Evodiopanax Apiales
Gamblea Apiales
Hedera Apiales
Kalopanax Apiales
Macropanax Apiales
Merrilliopanax Apiales
Metapanax Apiales
Nothopanax Apiales
Panax Apiales
Pentapanax Apiales
Polyscias Apiales
Pseudopanax Apiales
Schefflera Apiales
Trevesia Apiales
Bupleurm Apiales
Cicuta Apiales
Daucus Apiales
Foeniculum Apiales
Hansenia Apiales
Oreomyrrhis Apiales
Selinum Apiales
Alocasia Arales
Amorphophallus Arales
Arisaema Arales
Arum Arales
Colocasia Arales
Homalomena Arales
Pinellia Arales
Pothos Arales
Remusatia Arales
Rhaphidophora Arales
Sauromatum Arales
Typhonium Arales
Unk.araceae Arales
Zantedeschia Arales
Hypopterygium Archidiales
Dixonia Archidiales
Powellia Archidiales
Racopilum Archidiales
Archontophoenix Arecales
Areca Arecales
Caryota Arecales
Guihaia Arecales
Aristolochia Aristolochiales
Asarum Aristolochiales
Southbya Aristolochiales
Allium Asparagales
Milula Asparagales
Ophiopogon Asparagales
Polygonatum Asparagales
Smilacina Asparagales
Tupistra Asparagales
Burmannia Asparagales
Molineria Asparagales
Acampe Asparagales
Aerides Asparagales
Agrostophyllum Asparagales
Amitostigma Asparagales
Androcorys Asparagales
Anochilus Asparagales

From: jjvan...@gmail.com
Date: Sun, 24 Feb 2013 19:51:35 +1000
Subject: Re: [Trop-R] Species richness by order
To: tropi...@googlegroups.com
--
An R group for questions, tips and tricks relevant to spatial ecology and climate change.
All R questions welcome.
---
You received this message because you are subscribed to the Google Groups "Tropical R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tropical-r+...@googlegroups.com.

To post to this group, send an email to tropi...@googlegroups.com.

Stewart Macdonald

unread,
Feb 25, 2013, 11:24:57 PM2/25/13
to tropi...@googlegroups.com
See if this works. The gist is to:
-read in the list of genera and orders
--loop through each order
---loop over each genus in this order
----get a list of all .asc files for this genus
----loop over these .asc files, combining them for overall richness
--write out a richness .asc file for each order

You'll obviously need to change the variables up the top, and add in your thresholding code if need be. I don't have any .asc files to test this with, so I haven't changed the last section of the code and I haven't actually tested this to see if it works.


Stewart


############################

library(SDMTools)
working.dir = ~/"mydata/";
ascii.dir = "sdms/" #define the directory of the species ascii files
genusOrder.file <- "genus-order.txt" # this should be a tab-delimited file with two columns and a heading of "Genus\tOrder"

setwd(working.dir) #define and set the working directory

order.info = read.table(genusOrder.file, header=T, sep="\t") #read in the table of order and genus

for (ord in unique(order.info$Order)) {
cat('Doing order:', ord,'\n')#cycle through each order
#out = NULL #define the output
theseGenera = order.info$Genus[which(order.info$Order==ord)] #get the species in this order
for (thisGenus in theseGenera) {
cat('\tDoing genus', thisGenus, '\n')#cycle through each species in that order

theseSpecies <- list.files(ascii.dir, pattern=paste(thisGenus, "*", sep=''))

for (theseSpecies.file in theseSpecies) {

cat('\t\tDoing', theseSpecies.file, '\n')

tasc = read.asc( paste(ascii.dir, theseSpecies.file, sep='') ) #read in the species ascii

###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)

if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution
out = tasc #copy the species distribution
} else { # if out has some data, add the next species to it
out = out + tasc #add on the species data
}
}
}
write.asc(out, paste(ord, "_richness.asc", sep='')) #write out the richness for the order

alice Hughes

unread,
Feb 26, 2013, 1:23:27 AM2/26/13
to tropi...@googlegroups.com
Hi Stewart

Thanks so much-it thought about working, but on the first species of the first order and genus it gave the message that it could not find the files (?)

Here is my version, a few very minor changes, I think with just a tiny bit of tweeking it will work!
Thanks so much!

Alice

library(SDMTools)
work.dir = "E:/XTBG/model/plants/1/";setwd(work.dir)
ascii.dir = "E:/XTBG/model/plants/1/" #define the directory of the species ascii files
genusOrder.file = "E:/XTBG/model/plants/1/order-genus.txt"

order.info = read.table(genusOrder.file, header=T, sep="\t") #read in the table of order and genus

for (ord in unique(order.info$Order)) {
cat('Doing order:', ord,'\n')#cycle through each order
#out = NULL #define the output
theseGenera = order.info$Genus[which(order.info$Order==ord)] #get the species in this order
for (thisGenus in theseGenera) {
cat('\tDoing genus', thisGenus, '\n')#cycle through each species in that order

theseSpecies <- list.files(ascii.dir, pattern=paste(thisGenus, "*", sep=''))

for (theseSpecies.file in theseSpecies) {

cat('\t\tDoing', theseSpecies.file, '\n')

tasc = read.asc( paste(ascii.dir, theseSpecies.file, sep='') ) #read in the species ascii

###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)

richness=NULL
for (tfile in files) {cat(tfile, '\n')
thresholded=read.asc(tfile)
thresholded[which(thresholded==0)]=0  #set values less than 3 to zero
thresholded[which(thresholded<0)]=1  #set values equal to 3 (where current and future overlap) to 1
if (is.null(richness)) {richness = thresholded;  } #define the output & set everything = 0
richness = richness + thresholded
}
dev.off()
if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution
out = tasc #copy the species distribution
} else { # if out has some data, add the next species to it
out = out + tasc #add on the species data
}
}
}
write.asc(out, paste(ord, "_richness.asc", sep='')) #write out the richness for the order
}

> Subject: Re: [Trop-R] Species richness by order
> From: stewartm...@gmail.com
> Date: Tue, 26 Feb 2013 14:24:57 +1000
> To: tropi...@googlegroups.com

Stewart Macdonald

unread,
Feb 26, 2013, 1:36:57 AM2/26/13
to tropi...@googlegroups.com
> work.dir = "E:/XTBG/model/plants/1/";setwd(work.dir)
> ascii.dir = "E:/XTBG/model/plants/1/" #define the directory of the species ascii files
> genusOrder.file = "E:/XTBG/model/plants/1/order-genus.txt"

These are all absolute paths, rather than the relative paths I was using. Having only a cursory glance (and no knowledge of how R handles file paths on Windows), I suggest you change it to this:

> work.dir = "E:/XTBG/model/plants/"
> setwd(work.dir)
> ascii.dir = "1/" #define the directory of the species ascii files
> genusOrder.file = "order-genus.txt"

This sets the working directory using an absolute path, then references the other dir/files by using relative paths (paths relative to the working directory).

If that doesn't work, the first step of debugging will be to see what file R can't find. Change the 'tasc = read.asc...' line to this:

> thisFile <- paste(working.dir, ascii.dir, theseSpecies.file, sep='')

> cat("Looking for file:", thisFile)

> tasc = read.asc(thisFile) #read in the species ascii



Stewart

alice Hughes

unread,
Feb 26, 2013, 2:54:47 AM2/26/13
to tropi...@googlegroups.com
Thanks so much Stewart

Still no joy I'm afraid, here is the attempted run:

> library(SDMTools)
> work.dir = "E:/XTBG/model/plants/";setwd(work.dir)
> ascii.dir = "1/" #define the directory of the species ascii files
> genusOrder.file = "order-genus.txt"
> order.info = read.table(genusOrder.file, header=T, sep="\t") #read in the table of order and genus
> for (ord in unique(order.info$Order)) {
+ cat('Doing order:', ord,'\n')#cycle through each order
+ #out = NULL #define the output
+ theseGenera = order.info$Genus[which(order.info$Order==ord)] #get the species in this order
+ for (thisGenus in theseGenera) {
+ cat('\tDoing genus', thisGenus, '\n')#cycle through each species in that order
+ theseSpecies <- list.files(ascii.dir, pattern=paste(thisGenus, "*", sep=''))
+ for (theseSpecies.file in theseSpecies) {
+ cat('\t\tDoing', theseSpecies.file, '\n')
+ thisFile <-paste (work.dir, ascii.dir, theseSpecies.file, sep='')
+ cat("Looking for file:",thisFile)
+ tasc=read.asc(thisFile)
+ ###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)
+ richness=NULL
+ for (tfile in files) {cat(tfile, '\n')
+ thresholded=read.asc(tfile)
+ thresholded[which(thresholded==0)]=0  #set values less than 3 to zero
+ thresholded[which(thresholded<0)]=1  #set values equal to 3 (where current and future overlap) to 1
+ if (is.null(richness)) {richness = thresholded;  } #define the output & set everything = 0
+ richness = richness + thresholded
+ }
+ dev.off()
+ if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution
+ out = tasc #copy the species distribution
+ } else { # if out has some data, add the next species to it
+ out = out + tasc #add on the species data
+ }
+ }
+ }
+ write.asc(out, paste(ord, "_richness.asc", sep='')) #write out the richness for the order
+ }
Doing order: Acorales 
        Doing genus Acorus 
                Doing Acorus_calamus_thresholded.asc 
Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError: object 'files' not found
>






The file is there-so I'm not sure of the error source
Thanks so much for all your help!

Alice












> Subject: Re: [Trop-R] Species richness by order
> From: stewartm...@gmail.com
> Date: Tue, 26 Feb 2013 16:36:57 +1000
> To: tropi...@googlegroups.com

Stewart Macdonald

unread,
Feb 26, 2013, 2:57:40 AM2/26/13
to tropi...@googlegroups.com
> + for (tfile in files) {cat(tfile, '\n')

This is your problem. Have you copied-and-pasted that code without also pasting in whatever code creates the 'files' object?


Stewart

alice Hughes

unread,
Feb 26, 2013, 5:20:38 PM2/26/13
to tropi...@googlegroups.com
Hi Stewart 

Thanks again, you are right-there was a short line of code hiding before a comment.
However, when that line is added (as below) then it provides the warning
Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError: object 'out' not found

So I changed the "out" in the code to file, it thought about it a little longer then came up with 
Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError in file+ tasc :non-numeric argument to binary argument

I've tried changing the combinations of "file" and "out", but the error is either that it cannot fine an object, or that the argument in not numeric

Thanks so much (current code version below) I really appreciate it!

Alice


> Subject: Re: [Trop-R] Species richness by order
> From: stewartm...@gmail.com
> Date: Tue, 26 Feb 2013 17:57:40 +1000
> To: tropi...@googlegroups.com
> > + files=list.files (pattern='thresholded.asc')

> > + richness=NULL
> > + for (tfile in files) {cat(tfile, '\n')
> > + thresholded=read.asc(tfile)
> > + thresholded[which(thresholded==0)]=0 #set values less than 3 to zero
> > + thresholded[which(thresholded<0)]=1 #set values equal to 3 (where current and future overlap) to 1
> > +
> > + if (is.null(richness)) {richness = thresholded; } #define the output & set everything = 0
> > + richness = richness + thresholded
> > + }
> > +
> > + if (is.null(file)) { #if file is null, create the basic output asc as the first species distribution
> > + file = tasc #copy the species distribution

> > + } else { # if out has some data, add the next species to it
> > + file = file + tasc #add on the species data
> > + }
> > + }
> > + }
> > + write.asc(file, paste(ord, "_richness.asc", sep='')) #write out the richness for the order

> > + }
> > Doing order: Acorales
> > Doing genus Acorus
> > Doing Acorus_calamus_thresholded.asc
> > Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError: object 'out' not found

Stewart Macdonald

unread,
Feb 26, 2013, 7:11:43 PM2/26/13
to tropi...@googlegroups.com
Jeremy's original code created an object called 'out' to store the order's overall richness. It looks like you've renamed this to 'richness' at the start of the script, but not at the end.


Stewart

alice Hughes

unread,
Mar 6, 2013, 1:03:26 AM3/6/13
to tropi...@googlegroups.com
Hi Stewart

No joy yet-I think it may be a RAM issue (non-convenient data), I'll try to get the data on the supercomputer and give it another go

Thnaks and all the best

Alice

> Subject: Re: [Trop-R] Species richness by order

> Date: Tue, 26 Feb 2013 14:24:57 +1000
> To: tropi...@googlegroups.com
>
Reply all
Reply to author
Forward
0 new messages