script optimisation

Lauren Hodgson

unread,

Oct 2, 2012, 12:23:44 AM10/2/12

to tropi...@googlegroups.com

Hi all,

Does anyone have any ideas for optimising the following script? It
uses two inputs: a dataframe, and a list,

I have a data frame that is 1.4 million rows long. It has an ID for
stream segment, and the runoff from the land that goes directly into
that stream. called runoff

HydroID | runoff
-------------------------------
1 | 29
2 | 3468
3 | 542
...

I also have a list with each stream, and all streams that feed into
it. called flowdata.

eg.

3, 2, 1
2, 1
1
...

I want to calculate total runoff for the entire catchment above a
stream. ie. add up the runoff values for each value in each row of
the list.

I've tried two methods for this. One is an sql query, which takes 8 hours.
The second uses R, and iterates through the HydroIDs

for (id in accflow$HydroID) { cat(id, '\n')
i=i+1
##sum runoff from runoff table for a vector in flowdata
v=strsplit(flowdata[[i]],',') # isolate the list associated with HydroID
ids=as.vector(v[[1]]) #turn the list into a vector
tt=sum(runoff$runoff[which(runoff$node_id %in% ids)]) # find the
vector in node_id, and sum the runoff values for all

accflow$acc_runoff[which(accflow$HydroID==id)]=tt #print accumulated
runoff to another table

}

Phillips, Ben

unread,

Oct 2, 2012, 2:05:20 AM10/2/12

to <tropical-r@googlegroups.com>

lapply or do.call will help here. May even be a parallel version of lapply nowadays...

Excuse the brevity: sent from my phone.

> --
> An R group for questions, tips and tricks relevant to spatial ecology and climate change.
> All R questions welcome.
> ---
> You received this message because you are subscribed to the Google Groups "Tropical R" group.
> To post to this group, send an email to tropi...@googlegroups.com.
> To unsubscribe from this group, send email to tropical-r+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

Lauren Hodgson

unread,

Oct 2, 2012, 10:13:33 PM10/2/12

to tropi...@googlegroups.com

Thanks Ben - it took me a while to get it working, but I'm using sapply:

accflow[,2]=sapply(flowdata, function(ids) {
return(sum(runoff$runoff[which(runoff$node_id %in% ids)])) })

It is at least 10 times faster.

Eventually I'll stop asking stupid questions where the answer is
'don't use a loop, use apply'

Lauren Hodgson

unread,

Oct 3, 2012, 2:14:54 AM10/3/12

to tropi...@googlegroups.com

and, yep, because even sapply was likely to take 40+ hours, I'm now
using parSapply over 10 cores.

It takes 1 min 30 to do what it took the loop 3 hours to do. Now the
total task should take 3 hours!

Thanks again, Ben!

alice Hughes

unread,

Feb 22, 2013, 10:28:12 PM2/22/13

to tropi...@googlegroups.com

Hi all

I have a group of maxent species "distribution" projections in the ascii format, all of which are thresholded (10% training).

I also have a table denoting which order each species falls into. Could someone help me write a couple of lines of code to: for each order classify each SR to binary, then summate them to get species richness for each order?

Thanks and all the best

Alice

> Date: Wed, 3 Oct 2012 16:14:54 +1000
> Subject: Re: [Trop-R] script optimisation
> From: lhodg...@gmail.com
> To: tropi...@googlegroups.com

Jeremy VanDerWal

unread,

Feb 24, 2013, 4:51:35 AM2/24/13

to tropi...@googlegroups.com

Alice, hope the following helps. Cheers, Jeremy

library(SDMTools)

working.dir = "some/working/directory"; setwd(working.dir) #define and set the working directory

ascii.dir = "directory.where.species.asciis.are" #define the directory of the species ascii files

order.info = read.csv("some.file") #read in the table of order and species columns

#I am assuming the order.info table has 2 columns... order and species

for (ord in unique(order.info$order)) { cat(ord,'\n')#cycle through each order

out = NULL #define the output

species = order.info$species[which(order.info$order==ord)] #get the species in this order

for (spp in species) { cat('\t',spp,'\n')#cycle through each species in that order

tasc = read.asc("the.species.distribution.asc") #read in teh species ascii

###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)

if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution

out = tasc #copy the species distribution

} else { # if out has some data, add the next species to it

out = out + tasc #add on the species data

}

write.asc(out,paste(ord,"_richness.asc",sep='')) #write out the richness for the order

}

alice Hughes

unread,

Feb 25, 2013, 10:27:41 PM2/25/13

to tropi...@googlegroups.com

Hi Jeremy

Thanks so much for that script, I really appreciate it.

Sorry for not replying yesterday-I realised that the data needed alot of cleaning up before I could try the script.

At the moment the sdms are not binary, but have a continuous probability of occurrence above the 10% training threshold, so I added a couple of lines to classify anything above 0 as 1, and keep 0 as 0.

Because of the quantity of data when I was cleaning it I ended up listing the Genus, rather than species on the table (sample below), as there are well over 2000 genera being modelled (about 6000 species). So R needs to recognise the first half of the sdm output (i.e the ascii might be called "Alisma_orientale_thresholded.asc", then use the table to identify it as a Alismatale.

I tried to get the script running earlier, and deal with the errors-but in the end it just ran through without processing anything....

Thanks so much

Alice

Genus	Order
Alisma	Alismatales
Sagittaria	Alismatales
Spirodela	Alismatales
Hydrocharis	Alismatales
Vallisneria	Alismatales
Triglochin	Alismatales
Zannichellia	Alismatales
Ruppia	Alismatales
Acronema	Apiales
Aegopodium	Apiales
Angelica	Apiales
Arcuatopterus	Apiales
Aulacospermum	Apiales
Carum	Apiales
Cortiella	Apiales
Cyclorhiza	Apiales
Glehnia	Apiales
Haplosphaera	Apiales
Hyalolaena	Apiales
Hymenidium	Apiales
Keraymonia	Apiales
Ligustcum	Apiales
Lomatium	Apiales
Nothosmyrnium	Apiales
Osmorhiza	Apiales
Peucedanum	Apiales
Prangos	Apiales
Ptilimnium	Apiales
Semenovia	Apiales
Sinodielsia	Apiales
Sium	Apiales
Trachyspermum	Apiales
Vicatia	Apiales
Yushuia	Apiales
Acanthopanax	Apiales
Aralia	Apiales
Brassaiopsis	Apiales
Dendropanax	Apiales
Eleutherococcus	Apiales
Evodiopanax	Apiales
Gamblea	Apiales
Hedera	Apiales
Kalopanax	Apiales
Macropanax	Apiales
Merrilliopanax	Apiales
Metapanax	Apiales
Nothopanax	Apiales
Panax	Apiales
Pentapanax	Apiales
Polyscias	Apiales
Pseudopanax	Apiales
Schefflera	Apiales
Trevesia	Apiales
Bupleurm	Apiales
Cicuta	Apiales
Daucus	Apiales
Foeniculum	Apiales
Hansenia	Apiales
Oreomyrrhis	Apiales
Selinum	Apiales
Alocasia	Arales
Amorphophallus	Arales
Arisaema	Arales
Arum	Arales
Colocasia	Arales
Homalomena	Arales
Pinellia	Arales
Pothos	Arales
Remusatia	Arales
Rhaphidophora	Arales
Sauromatum	Arales
Typhonium	Arales
Unk.araceae	Arales
Zantedeschia	Arales
Hypopterygium	Archidiales
Dixonia	Archidiales
Powellia	Archidiales
Racopilum	Archidiales
Archontophoenix	Arecales
Areca	Arecales
Caryota	Arecales
Guihaia	Arecales
Aristolochia	Aristolochiales
Asarum	Aristolochiales
Southbya	Aristolochiales
Allium	Asparagales
Milula	Asparagales
Ophiopogon	Asparagales
Polygonatum	Asparagales
Smilacina	Asparagales
Tupistra	Asparagales
Burmannia	Asparagales
Molineria	Asparagales
Acampe	Asparagales
Aerides	Asparagales
Agrostophyllum	Asparagales
Amitostigma	Asparagales
Androcorys	Asparagales
Anochilus	Asparagales

From: jjvan...@gmail.com
Date: Sun, 24 Feb 2013 19:51:35 +1000
Subject: Re: [Trop-R] Species richness by order
To: tropi...@googlegroups.com

--
An R group for questions, tips and tricks relevant to spatial ecology and climate change.
All R questions welcome.
---
You received this message because you are subscribed to the Google Groups "Tropical R" group.

To unsubscribe from this group and stop receiving emails from it, send an email to tropical-r+...@googlegroups.com.

To post to this group, send an email to tropi...@googlegroups.com.

Stewart Macdonald

unread,

Feb 25, 2013, 11:24:57 PM2/25/13

to tropi...@googlegroups.com

See if this works. The gist is to:
-read in the list of genera and orders
--loop through each order
---loop over each genus in this order
----get a list of all .asc files for this genus
----loop over these .asc files, combining them for overall richness
--write out a richness .asc file for each order

You'll obviously need to change the variables up the top, and add in your thresholding code if need be. I don't have any .asc files to test this with, so I haven't changed the last section of the code and I haven't actually tested this to see if it works.

Stewart

############################

library(SDMTools)
working.dir = ~/"mydata/";
ascii.dir = "sdms/" #define the directory of the species ascii files
genusOrder.file <- "genus-order.txt" # this should be a tab-delimited file with two columns and a heading of "Genus\tOrder"

setwd(working.dir) #define and set the working directory

order.info = read.table(genusOrder.file, header=T, sep="\t") #read in the table of order and genus

for (ord in unique(order.info$Order)) {
cat('Doing order:', ord,'\n')#cycle through each order
#out = NULL #define the output
theseGenera = order.info$Genus[which(order.info$Order==ord)] #get the species in this order
for (thisGenus in theseGenera) {
cat('\tDoing genus', thisGenus, '\n')#cycle through each species in that order

theseSpecies <- list.files(ascii.dir, pattern=paste(thisGenus, "*", sep=''))

for (theseSpecies.file in theseSpecies) {

cat('\t\tDoing', theseSpecies.file, '\n')

tasc = read.asc( paste(ascii.dir, theseSpecies.file, sep='') ) #read in the species ascii

###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)

if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution
out = tasc #copy the species distribution
} else { # if out has some data, add the next species to it
out = out + tasc #add on the species data
}
}
}

write.asc(out, paste(ord, "_richness.asc", sep='')) #write out the richness for the order

alice Hughes

unread,

Feb 26, 2013, 1:23:27 AM2/26/13

to tropi...@googlegroups.com

Hi Stewart

Thanks so much-it thought about working, but on the first species of the first order and genus it gave the message that it could not find the files (?)

Here is my version, a few very minor changes, I think with just a tiny bit of tweeking it will work!

Thanks so much!

Alice

library(SDMTools)

work.dir = "E:/XTBG/model/plants/1/";setwd(work.dir)

ascii.dir = "E:/XTBG/model/plants/1/" #define the directory of the species ascii files

genusOrder.file = "E:/XTBG/model/plants/1/order-genus.txt"

order.info = read.table(genusOrder.file, header=T, sep="\t") #read in the table of order and genus

for (ord in unique(order.info$Order)) {

cat('Doing order:', ord,'\n')#cycle through each order

#out = NULL #define the output

theseGenera = order.info$Genus[which(order.info$Order==ord)] #get the species in this order

for (thisGenus in theseGenera) {

cat('\tDoing genus', thisGenus, '\n')#cycle through each species in that order

theseSpecies <- list.files(ascii.dir, pattern=paste(thisGenus, "*", sep=''))

for (theseSpecies.file in theseSpecies) {

cat('\t\tDoing', theseSpecies.file, '\n')

tasc = read.asc( paste(ascii.dir, theseSpecies.file, sep='') ) #read in the species ascii

###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)

richness=NULL

for (tfile in files) {cat(tfile, '\n')

thresholded=read.asc(tfile)

thresholded[which(thresholded==0)]=0 #set values less than 3 to zero

thresholded[which(thresholded<0)]=1 #set values equal to 3 (where current and future overlap) to 1

if (is.null(richness)) {richness = thresholded; } #define the output & set everything = 0

richness = richness + thresholded

}

dev.off()

if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution

out = tasc #copy the species distribution

} else { # if out has some data, add the next species to it

out = out + tasc #add on the species data

}

write.asc(out, paste(ord, "_richness.asc", sep='')) #write out the richness for the order

}

> Subject: Re: [Trop-R] Species richness by order

> From: stewartm...@gmail.com
> Date: Tue, 26 Feb 2013 14:24:57 +1000
> To: tropi...@googlegroups.com

Stewart Macdonald

unread,

Feb 26, 2013, 1:36:57 AM2/26/13

to tropi...@googlegroups.com

> work.dir = "E:/XTBG/model/plants/1/";setwd(work.dir)
> ascii.dir = "E:/XTBG/model/plants/1/" #define the directory of the species ascii files
> genusOrder.file = "E:/XTBG/model/plants/1/order-genus.txt"

These are all absolute paths, rather than the relative paths I was using. Having only a cursory glance (and no knowledge of how R handles file paths on Windows), I suggest you change it to this:

> work.dir = "E:/XTBG/model/plants/"
> setwd(work.dir)
> ascii.dir = "1/" #define the directory of the species ascii files
> genusOrder.file = "order-genus.txt"

This sets the working directory using an absolute path, then references the other dir/files by using relative paths (paths relative to the working directory).

If that doesn't work, the first step of debugging will be to see what file R can't find. Change the 'tasc = read.asc...' line to this:

> thisFile <- paste(working.dir, ascii.dir, theseSpecies.file, sep='')

> cat("Looking for file:", thisFile)

> tasc = read.asc(thisFile) #read in the species ascii

Stewart

alice Hughes

unread,

Feb 26, 2013, 2:54:47 AM2/26/13

to tropi...@googlegroups.com

Thanks so much Stewart

Still no joy I'm afraid, here is the attempted run:

> library(SDMTools)

> work.dir = "E:/XTBG/model/plants/";setwd(work.dir)

> ascii.dir = "1/" #define the directory of the species ascii files

> genusOrder.file = "order-genus.txt"

>

> order.info = read.table(genusOrder.file, header=T, sep="\t") #read in the table of order and genus

>

> for (ord in unique(order.info$Order)) {

+ cat('Doing order:', ord,'\n')#cycle through each order

+ #out = NULL #define the output

+ theseGenera = order.info$Genus[which(order.info$Order==ord)] #get the species in this order

+ for (thisGenus in theseGenera) {

+ cat('\tDoing genus', thisGenus, '\n')#cycle through each species in that order

+

+ theseSpecies <- list.files(ascii.dir, pattern=paste(thisGenus, "*", sep=''))

+

+ for (theseSpecies.file in theseSpecies) {

+

+ cat('\t\tDoing', theseSpecies.file, '\n')

+

+ thisFile <-paste (work.dir, ascii.dir, theseSpecies.file, sep='')

+ cat("Looking for file:",thisFile)

+ tasc=read.asc(thisFile)

+

+ ###you need to apply a threshold if it has not been applied... I am assuming that the species ascii is binary (0,1)

+

+ richness=NULL

+ for (tfile in files) {cat(tfile, '\n')

+ thresholded=read.asc(tfile)

+ thresholded[which(thresholded==0)]=0 #set values less than 3 to zero

+ thresholded[which(thresholded<0)]=1 #set values equal to 3 (where current and future overlap) to 1

+

+ if (is.null(richness)) {richness = thresholded; } #define the output & set everything = 0

+ richness = richness + thresholded

+ }

+ dev.off()

+ if (is.null(out)) { #if out is null, create the basic output asc as the first species distribution

+ out = tasc #copy the species distribution

+ } else { # if out has some data, add the next species to it

+ out = out + tasc #add on the species data

+ }

+ write.asc(out, paste(ord, "_richness.asc", sep='')) #write out the richness for the order

+ }

Doing order: Acorales

Doing genus Acorus

Doing Acorus_calamus_thresholded.asc

Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError: object 'files' not found

>

The file is there-so I'm not sure of the error source

Thanks so much for all your help!

Alice

> Subject: Re: [Trop-R] Species richness by order
> From: stewartm...@gmail.com

> Date: Tue, 26 Feb 2013 16:36:57 +1000
> To: tropi...@googlegroups.com

Stewart Macdonald

unread,

Feb 26, 2013, 2:57:40 AM2/26/13

to tropi...@googlegroups.com

> + for (tfile in files) {cat(tfile, '\n')

This is your problem. Have you copied-and-pasted that code without also pasting in whatever code creates the 'files' object?

Stewart

alice Hughes

unread,

Feb 26, 2013, 5:20:38 PM2/26/13

to tropi...@googlegroups.com

Hi Stewart

Thanks again, you are right-there was a short line of code hiding before a comment.

However, when that line is added (as below) then it provides the warning

Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError: object 'out' not found

So I changed the "out" in the code to file, it thought about it a little longer then came up with

Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError in file+ tasc :non-numeric argument to binary argument

I've tried changing the combinations of "file" and "out", but the error is either that it cannot fine an object, or that the argument in not numeric

Thanks so much (current code version below) I really appreciate it!

Alice

> Subject: Re: [Trop-R] Species richness by order
> From: stewartm...@gmail.com

> Date: Tue, 26 Feb 2013 17:57:40 +1000
> To: tropi...@googlegroups.com

> > + files=list.files (pattern='thresholded.asc')

> > + richness=NULL
> > + for (tfile in files) {cat(tfile, '\n')
> > + thresholded=read.asc(tfile)
> > + thresholded[which(thresholded==0)]=0 #set values less than 3 to zero
> > + thresholded[which(thresholded<0)]=1 #set values equal to 3 (where current and future overlap) to 1
> > +
> > + if (is.null(richness)) {richness = thresholded; } #define the output & set everything = 0
> > + richness = richness + thresholded
> > + }
> > +

> > + if (is.null(file)) { #if file is null, create the basic output asc as the first species distribution
> > + file = tasc #copy the species distribution

> > + } else { # if out has some data, add the next species to it

> > + file = file + tasc #add on the species data
> > + }
> > + }
> > + }
> > + write.asc(file, paste(ord, "_richness.asc", sep='')) #write out the richness for the order

> > + }
> > Doing order: Acorales
> > Doing genus Acorus
> > Doing Acorus_calamus_thresholded.asc

> > Looking for file: E:/XTBG/model/plants/1/Acorus_calamus_thresholded.ascError: object 'out' not found

Stewart Macdonald

unread,

Feb 26, 2013, 7:11:43 PM2/26/13

to tropi...@googlegroups.com

Jeremy's original code created an object called 'out' to store the order's overall richness. It looks like you've renamed this to 'richness' at the start of the script, but not at the end.

Stewart

alice Hughes

unread,

Mar 6, 2013, 1:03:26 AM3/6/13

to tropi...@googlegroups.com

Hi Stewart

No joy yet-I think it may be a RAM issue (non-convenient data), I'll try to get the data on the supercomputer and give it another go

Thnaks and all the best

Alice

> Subject: Re: [Trop-R] Species richness by order

> From: stewartm...@gmail.com

> Date: Tue, 26 Feb 2013 14:24:57 +1000
> To: tropi...@googlegroups.com
>

Reply all

Reply to author

Forward