CRAN Task View for Archaeology


Ben Marwick

Jan 20, 2015, 2:54:38 AM
to anti...@googlegroups.com
Hello,

If you are an R user, could I please ask you to have a look at this unofficial CRAN Task View for Archaeology that I have drafted?

The document is here: https://github.com/benmarwick/ctv-archaeology

I'd be most grateful if you could have a look and let me know of functions and packages that you often use but that I haven't yet included on this Task View.

I believe that every archaeologist could benefit from using R for their research, and I hope a short annotated list like this might help them get started.

thanks and best wishes,

Ben

-- 
Ben Marwick, Assistant Professor, Department of Anthropology
Denny Hall M32, Box 353100, University of Washington
Seattle, WA 98195-3100 USA

t. (+1) 206.552.9450   e. bmar...@uw.edu
f. (+1) 206.543.3285   w. http://faculty.washington.edu/bmarwick/ 

Allar Haav

Jan 20, 2015, 3:15:17 AM
to anti...@googlegroups.com
Hi,

I don't know how many archaeologists use GRASS GIS nowadays (I do!), but I have found a lot of use for the spgrass6 package (http://cran.r-project.org/web/packages/spgrass6/index.html), which lets you use GRASS mapsets and functions from within R.

Allar
--
You received this message because you are subscribed to the Google Groups "Antiquist" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antiquist+...@googlegroups.com.
To post to this group, send email to anti...@googlegroups.com.
Visit this group at http://groups.google.com/group/antiquist.
For more options, visit https://groups.google.com/d/optout.

Philip Riris

Jan 20, 2015, 4:25:48 AM
to Ben Marwick, anti...@googlegroups.com
Hi Ben,

I would say my principal uses for R are data representation and spatial analysis. The packages I use most include ggplot2, sp, spatstat, MASS, maptools and rgdal, but I lean heavily on spatstat for most things I do.
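
(For anyone who hasn't tried spatstat, here is a minimal illustrative sketch of that kind of point-pattern workflow; the coordinates and study window are invented, not from any real dataset:)

# minimal sketch: exploratory point-pattern analysis with spatstat
# (coordinates and window are invented for illustration only)
library(spatstat)
x <- runif(50, 0, 100)   # hypothetical site eastings (m)
y <- runif(50, 0, 100)   # hypothetical site northings (m)
sites <- ppp(x, y, window = owin(c(0, 100), c(0, 100)))
summary(sites)           # intensity and window summary
plot(density(sites))     # kernel density surface of the pattern
plot(Kest(sites))        # Ripley's K function to assess clustering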

Will take a look at the task view ASAP.

Phil

Sent from my Windows Phone. Please excuse my brevity.


Tom Brughmans

Jan 20, 2015, 4:47:16 AM
to anti...@googlegroups.com, Ben Marwick
Dear Ben,
 
Thanks for compiling this resource, it is very useful and will definitely be of interest to many archaeologists.
 
If you are considering adding a section on network science, I can recommend the following packages (a short igraph sketch follows the list). I have used most of these myself, and I took the descriptions given here from Wikipedia since they are concise, but feel free to replace them:
 
igraph: a generic network analysis package
sna: performs sociometric analysis of networks
network: manipulates and displays network objects
tnet: performs analysis of weighted networks, two-mode networks, and longitudinal networks (did not test this myself)
ergm: a set of tools to analyze and simulate networks based on exponential random graph models
Bergm: provides tools for Bayesian analysis of exponential random graph models; hergm implements hierarchical exponential random graph models (did not test these myself)
RSiena: allows analysis of the evolution of social networks using dynamic actor-oriented models
latentnet: has functions for network latent position and cluster models (did not test this myself)
degreenet: provides tools for statistical modeling of network degree distributions (did not test this myself)
networksis: provides tools for simulating bipartite networks with fixed marginals (did not test this myself)
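
(The promised minimal igraph sketch; the edge list of "sites" is invented purely for illustration:)

# minimal sketch: build and summarise a small undirected network with igraph
# (the site names and connections are invented)
library(igraph)
edges <- matrix(c("siteA", "siteB",
                  "siteB", "siteC",
                  "siteA", "siteC",
                  "siteC", "siteD"),
                ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edges, directed = FALSE)
degree(g)       # number of connections per site
plot(g)         # quick network diagram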
 
Best,
 
Tom
 
Postdoctoral researcher
Department of Computer and Information Science
University of Konstanz
HERA CARIB archaeological project


Lee Drake

Jan 20, 2015, 1:04:36 PM
to anti...@googlegroups.com
Hi all,

I second the thanks to Ben. My package recommendations are as follows:

Graphing
gridExtra: for plot layouts
wq: for plot layouts
rgl: for 3D plots
scatterplot3d: for 3D plots
ggplot2: for plotting

Chronology
Bchron: for radiocarbon date calibrations and building age models (see the short calibration sketch after this list)

Bayesian
bcp: for Bayesian change-point analysis in a time series
MCMCpack: for general Bayesian analysis

Mixing Models
SiAR: for creating proportional models of contribution to a diet/assemblage
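
(As a minimal sketch of the calibration step mentioned above; the age, error, and sample are placeholders, assuming the IntCal13 curve shipped with Bchron at the time:)

# minimal sketch: calibrate a single radiocarbon date with Bchron
# (age, error and curve name are placeholders, not real data)
library(Bchron)
cal <- BchronCalibrate(ages = 3000, ageSds = 30, calCurves = "intcal13")
summary(cal)
plot(cal)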


---

B. Lee Drake

Department of Anthropology
University of New Mexico
(505) 510.1518
b.lee...@gmail.com


Stefano Costa

Jan 20, 2015, 2:02:16 PM
to anti...@googlegroups.com

Ben,
that's a very useful resource (and thanks for mentioning the
Quantitative archaeology wiki ;)

I especially recommend environments like RStudio (good for beginners) and Emacs+ESS (which I use) for building up reproducible procedures in small steps. ggplot2 is my first choice for plotting; it seems to make exploratory data analysis more intuitive.

While not strictly R-specific, I also suggest the Cross Validated Stack
Exchange http://stats.stackexchange.com/ for help with statistical
methods rather than R coding.

Ciao,
steko

PS I submitted a pull request for two broken links I found

--
Stefano Costa
http://steko.iosa.it/
Editor, Journal of Open Archaeology Data
http://openarchaeologydata.metajnl.com/

Ben Marwick

Jan 20, 2015, 3:46:48 PM
to anti...@googlegroups.com
Hi Allar, Phil, Tom, Lee and Stefano,

Thank you all very much for your quick and helpful replies. I'm delighted to know of other R-using archaeologists and to learn which packages are key in your research. I've updated the task view with all of your suggestions.

Phil also mentioned that a task view like this might be a good place to list scholarly publications that use R, which is a great idea. If you have a publication that is available online, specifically cites R in the text, and has accompanying R code to generate the figures and tables in the publication, I'd be glad to list it. I'm very interested in this kind of article-level reproducibility; a wonderful example by a biologist just came out today: http://rspb.royalsocietypublishing.org/content/282/1801/20141631. Are there any similar examples in archaeology?

If you have any other suggestions, please do let me know. 

thanks again,

Ben

Lee Drake

Jan 20, 2015, 11:24:46 PM
to anti...@googlegroups.com
Excellent idea Ben! I have three using R that are open access; the other three are locked behind paywalls. I will put links at the end of this message, listed by the method used in R.

My colleagues and I tried to reproduce findings from strontium isotope sourcing studies in Chaco Canyon in northwestern New Mexico using Bayesian mixing models in R (the SiAR package) alongside other more classical techniques. Unfortunately, in this case we did not find strong support in the data for the previous conclusions, though at least one source suggested by the original work was supported by the analysis. That paper (and code) can be found here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0095580

As you noted, including R scripts as part of the publishing process (generally as supplemental material) can make reproducibility much easier, even for complex analyses. If you put your data on a server, the code can read it directly from its URL, as an example: read.csv(file="http://www.anarchaeologist.com/data/lithics.csv"). That way the R code runs as easily on someone else's computer as it did on yours, without having to navigate a file path.
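
(Spelling that pattern out as a minimal sketch; the URL is the example address Lee gives and the checks are generic:)

# minimal sketch: read data straight from a web address, so the script
# needs no local file paths (URL is Lee's illustrative example)
lithics <- read.csv(file = "http://www.anarchaeologist.com/data/lithics.csv")
str(lithics)       # inspect what was read in
summary(lithics)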

Also, critically, a bit of translation code can help plotting work on both Mac and Windows computers:

# Compatibility: on Windows, define quartz() as an alias for windows(),
# so a script written on a Mac (which calls quartz() to open a plot window)
# also runs unmodified on a Windows machine
if (.Platform$OS.type == "windows") {
  quartz <- function() windows()
}


Articles using R:

Open Access:

Modeling plant uptake of 14C during photosynthesis: https://journals.uair.arizona.edu/index.php/radiocarbon/article/view/16155/pdf



Behind paywall (contact me for manuscript and code if you do not have access):
d13C from radiocarbon dates as a climate indicator: http://www.sciencedirect.com/science/article/pii/S0305440312001471

Multi-proxy analysis of climate during the Late Bronze Age: http://www.sciencedirect.com/science/article/pii/S0305440312000416

Bayesian change point analysis of pollen: http://hol.sagepub.com/content/22/12/1353.short



---

B. Lee Drake

Department of Anthropology
University of New Mexico
(505) 510.1518
b.lee...@gmail.com


Philip Riris

Jan 21, 2015, 5:33:40 AM
to anti...@googlegroups.com
Dear all,

Ben has also suggested that the Task View be home to vignettes produced by the archaeological R community (that's us!).

For those who don't already know, R vignettes (link) document a specific problem or task, with a walkthrough and a more or less detailed description of the code. Usually they pertain to a particular package, and use multiple functions from within it to guide the user through their implementation and show off what the package is capable of.

Because R packages are generally speaking not made with archaeologists in mind, the compilation of archaeologically themed vignettes would permit users (I'm thinking of students in particular here) to see "how it's done". Vignettes written by and for archaeologists are potentially very useful in a didactic sense, as well as for the transparency, reproducibility, and reusability of our work.

Anyone can make vignettes in markdown with the 'rmarkdown' package, which is automatically included in RStudio. Thus, anyone using R for analysis or display already has the technical capability to produce a vignette as part of their normal workflow. The link above contains more technical details. Critically, however, a push for this relies on users (that's us!) seeing its value for the discipline and for driving new methods.
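
(A minimal sketch of that workflow, assuming the rmarkdown package is installed; the file name is hypothetical:)

# minimal sketch: render an R Markdown walkthrough to HTML
# ("my_vignette.Rmd" is a hypothetical file of text plus R code chunks)
library(rmarkdown)
render("my_vignette.Rmd")   # writes my_vignette.html next to the .Rmd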

Comments and suggestions are welcome.

Best,

Phil

Postdoctoral research assistant
Cotúa Island project
Institute of Archaeology
University College London

Enrico Crema

Jan 21, 2015, 9:12:21 AM
to anti...@googlegroups.com
Dear Ben & everybody,

This is a brilliant idea! I've seen on GitHub that you've already got most of the packages I use. Here are some others, with links to the papers where I use them (the most recent ones have the actual code as an online supplement):

Package: abc (for Approximate Bayesian Computation)
Paper: An Approximate Bayesian Computation approach for inferring patterns of cultural evolutionary change  (http://www.sciencedirect.com/science/article/pii/S0305440314002593)

Package: randomForest (for random forests)
Paper: Culture, space, and metapopulation: a simulation-based study for evaluating signals of blending and branching

Package: DistatisR (for Distatis analysis), pegas (for AMOVA, not ANOVA)
Paper: Isolation-by-distance, homophily, and “core” vs. “package” cultural evolution models in Neolithic Europe (http://www.sciencedirect.com/science/article/pii/S1090513814001251)

I have also used spatstat in a variety of places (the following exclude book chapter contributions and conference proceedings):

A house with a view? Multi-model inference, visibility fields, and point process analysis of a Bronze Age settlement on Leskernick Hill (Cornwall, UK)
A probabilistic framework for assessing spatio-temporal point patterns in the archaeological record

Other packages worth considering that I haven't used myself include spdep and gstat. I'm curious whether anyone has used the seriation package. The vignette mentions Petrie and archaeology; it would be sad if no one ever used it...

Best,
Enrico

Dave Potts

Jan 21, 2015, 9:16:02 AM
to anti...@googlegroups.com
Hi everybody

Sounds like a good idea, but before you do this, can you put together a list of external dependencies first? It's rather annoying, when you're downloading a long list of R packages, if the loader bails out halfway through because it's missing an external bit of software.

Dave.

Ben Marwick

Jan 22, 2015, 4:05:42 AM
to anti...@googlegroups.com, b.lee...@gmail.com
Hi Lee,

Thanks very much for these, that's all very impressive. How can we get everyone to do this and include R code with their publications?

I've added all except the two Radiocarbon papers since I can't find any supplementary material with code, can you point me in the right direction? 

I ran all the code I could find from your papers and have a few observations and questions:

* The code is very well commented, it's easy to see the code that generates each figure in the paper, and it's really nice to have little notes on how long things will take, etc. 

* Is any of this code under version control? I'd recommend git for version control in general, and github.com or bitbucket.org for public hosting of code, then archive the commit used for the author-accepted manuscript at figshare.com or zenodo.org, get a DOI for the specific commit that produced the results in the paper, and cite that DOI in the paper. Then you can continue to make corrections to the code after publication, and it's easier for others to reuse and verify your code, and contribute to it. Code as supplementary material is obviously a vast improvement on the status quo, and the next step is having it in a version controlled repository and a DOI for the commit used in the publication, which is the current best practice (cf. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745).

* In the comments at the top of your script file you should add the title of the article that the code relates to; that makes it easier to use. Currently you have the same boilerplate about copying and pasting, and with four of your scripts open at once on my computer it's difficult to know which belongs to which paper... Most people won't have that problem, but if the script becomes separated from the article PDF the poor reader is going to struggle to work out which article that script file is associated with, unless you add some metadata at the top.

* It's great that you include the specific version number of the contributed packages, but it should probably be done for all the contributed packages that you use (maybe not in the text of the article, but in the code or a readme.txt). I just paste in the results of sessionInfo() for this (see the short sketch below). For example, the current version of ggplot2 gives an error on opts() because theme() (with element_*()) is now the current interface, so I had to edit some of your plotting code to make it work. If your code was on GitHub I could make a pull request and you'd have a quick fix that would be immediately public. If you'd given the ggplot2 version that you used, I could have got that one from the CRAN archive and saved myself a bit of typing. This is a problem that packrat, checkpoint and Docker are trying to solve.
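
(A minimal sketch of what that looks like in practice; the file name in the last line is just a suggestion:)

# minimal sketch: record the exact package versions used in an analysis
sessionInfo()                  # R version, platform and all loaded packages
packageVersion("ggplot2")      # or query one package of interest
# save it alongside the script, e.g.:
# writeLines(capture.output(sessionInfo()), "sessionInfo.txt")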

* Some of the data files for the Lower Alentejo paper are not available. The code points to files in a directory at http://www.bleedrake.com/la/ but there's nothing there. Best practice here I think would be to bundle the data files with the code, ideally in a version controlled repository or similar DOI-issuing service such as figshare, zenodo, or other persistent URL-issuer such as opencontext or tDAR. Then the code can have a relative path to the data, and you don't have to worry about maintaining public availability of the data files on your website. 

* Code for the PLOS paper seems to violate the 'don't repeat yourself' principle in a few places, especially the numerous lines of mean() and sd(). Seems like lapply() or similar might have saved a lot of typing here (see the sketch below; I'm sure that code I wrote 3-5 years ago would have many worse problems than this!).
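
(A hedged illustration of the idea, using an invented data frame rather than the variables in the paper:)

# minimal sketch: summarise many columns at once instead of repeating
# mean()/sd() line by line (the data frame is invented)
dat <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
sapply(dat, mean)
sapply(dat, sd)
# or both at once:
sapply(dat, function(x) c(mean = mean(x), sd = sd(x)))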

* Have you thought of using a literate programming approach, for example with markdown? Then you can have the text of your paper and the code together in the same file, and execute the document (or knit it, which is the current verb in R) to produce the manuscript with text, figures and tables all together. That will reduce some of the platform dependencies, especially the graphics issues you've encountered. Carl Boettiger has some nice examples of this, most recently http://rspb.royalsocietypublishing.org/content/282/1801/20141631 and https://github.com/cboettig/nonparametric-bayes, which I've learned a lot from. And others have demonstrated how to make an R package from a research repository where the manuscript, code and data are all bundled together and the manuscript is the package vignette (cf. https://github.com/jhollist/manuscriptPackage), as Phil mentioned earlier.

Thanks again for sharing your outstanding examples of reproducible research, they're really wonderful to see.

best,

Ben

Ben Marwick

Jan 22, 2015, 5:35:12 AM
to anti...@googlegroups.com
Hi Enrico,

Thanks very much for your suggestions and for listing your papers for the task view. These are really wonderful examples of innovative science that is completely communicated with code and data, rather than being just the brief advertisement or tip-of-the-iceberg that most papers are. The code from the abc paper looks very well organised and useful for others to learn from. Is it in a version controlled repository? I ran the code in your 'Isolation-by-distance' paper and that all worked great. I couldn't find any code associated with the 'culture space and metapopulation' paper (though I read it with great interest!). I also looked at the seriation package and thought it would be useful, but I think I'll need to look more carefully; I couldn't see any obvious archaeological applications...

A few comments... 

* I learned about a lot of new (to me) packages, I've added them to the task view along with references to the abc paper and the isolation paper. 

* I'm not keen on rm(list=ls()) at the top of a script, it seems a bit unkind to the user. I think literate programming offers a better approach, since it creates a new environment when the code is executed (i.e. Sweave or knitr), so it's free of contaminating data objects but doesn't require the user to remove everything.

* Supplementary materials are not a very convenient  way to share code and data... the journal renames many of your files so the internal references in the code need to be edited, your R script files are supplied as zip files which adds a few extra steps to get to the code, and your CSV files (as they are referred to in the code) are Excel xls/x files in the supplementary materials. I see you have a comment in the code that says 'the excel worksheets should be individually exported as .csv files', but this is a bit tedious for the poor user and an unnecessary obstacle to using the data. Better would be to have the CSV files as the actual files in the supplementary materials, since Excel files are a poor choice for data longevity and portability. Those are all pretty small details, but they add up to an obstacle to sharing the really important parts of the paper, and burden the reader with a lot of tedium to access the code and data.  There are some nice guidelines on sharing code and data here: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542

* My suggestion to lower the burden in working with code and data that accompanies a paper is to put the code and data in an online repository that gives a persistent URL, and cite that URL in the paper. As I noted to Lee, many researchers are putting code and data on github,  then archiving a specific commit of their github repository in a repository like figshare.com or zenodo.org and getting a DOI from there  to cite in the paper. Those repositories are not going to mess around with file names or formats in the same way that the journal does, and it's easy for you to make corrections after publication, but still have a reference to the version current at the time of publication. You could have the usual supplementary material also, but the online repositories I'm referring to are also free to access, so people can get to them even if they don't subscribe to the journal (though I see your JAS papers are OA, so that's not a problem for those specific papers). 

* You don't have any kind of license on your code, which makes it difficult for others to know how they can reuse it. I see that Lee used GPL, but I prefer MIT (http://opensource.org/licenses/MIT) because it's not viral like GPL. Most people think of licenses as something that's only relevant where there are commercial applications, but I think they're useful ways to formally communicate to your users about your intentions for reuse (do you want attribution? are you ok with commercial reuse or do you want to limit to non-commercial use?). Licenses are also handy to absolve you of any responsibility for how others use your code (e.g. they misuse your code, publish, then it turns out they made a mistake and drama ensues; you can point to the license and say 'these are the conditions that you accept when you use my code, no warranty, no liability, etc.'). So I think all publicly available code should have a license, ideally one that is widely used (rather than one you make up yourself!).

* For your simulation code (which I haven't studied carefully, so I might be on the wrong track here), include a random seed value. If I understand this correctly, given the same initial seed, the same sequence of random numbers will be used in the analysis, giving identical results every time it is run.
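
(That is, something like the following near the top of the script; the seed value itself is arbitrary:)

# minimal sketch: fix the random seed so stochastic results repeat exactly
set.seed(42)    # any integer will do; just record whichever value you use
runif(3)        # these three numbers will now be identical on every run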

I've now finished running the code from the abc paper and got some errors at step 5. First some warnings, eg

> UBpost1<-abc(target=observed,sumstat=simresUB,param=simparamUB,tol=0.01,method="rejection")
Warning message:
In abc(target = observed, sumstat = simresUB, param = simparamUB,  :
  No summary statistics names are given, using S1, S2, ...

then an error:

> hist(UBpost1$unadj.values[,1])
Error in UBpost1$unadj.values[, 1] : incorrect number of dimensions

Any thoughts about that?

Thanks again for sharing these excellent examples. I'm glad Phil made the suggestion to list papers, because I've learned a lot from reading them, and I'm sure others will too. I will add my own papers to the task view and hope that you all might offer some feedback also.

best,

Ben

Ben Marwick

Jan 22, 2015, 5:53:04 AM
to anti...@googlegroups.com
Hi Dave,

Yes, good point, managing these external dependencies is a big challenge when working in a programming environment, and can be really frustrating. No doubt more than a few novices have been turned away from R (or Python, where the situation is much worse) for that reason. I've tried to list all the external dependencies next to each package where they appear in the task view, so that the reader is aware of them. This seems to be how the other task views handle the problem, so I'll stick with that. But it doesn't help much for a batch installation of all the packages in the task view.

Virtual environments are the current solution to this problem, for example a virtual machine image that has all the packages and all the dependencies together. I use that for teaching (since students' computers can be quite exotic), but it's a heavyweight solution, since even a small VM file is about 3 GB. More promising are Docker containers, which provide a very similar virtual environment but with a tiny memory and disk footprint. I do most of my work at the moment using RStudio and rocker (https://github.com/rocker-org/rocker; it's quite well documented on the wiki of that repository), so I'm using RStudio Server in a web browser in a Linux environment (on a Windows laptop). This gives a very high degree of isolation and reproducibility. The dockerfile that describes the Linux environment specifies the dependencies, so you or anyone can recreate my computational environment exactly (well, not the hardware) with minimal effort (one line of code, in my case: docker run -d -p 8787:8787 rocker/hadleyverse). This is a lot nicer than using a virtual machine because Docker uses a lot fewer system resources and is much less distracting (I don't have to work with a whole different desktop like I do in a typical VM).
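
(For what it's worth, a hedged sketch of what such a dockerfile might look like, building on the rocker/hadleyverse image mentioned above; the extra packages are only examples:)

# minimal sketch of a Dockerfile for a project environment
# (package names are illustrative; pin versions as needed)
FROM rocker/hadleyverse
RUN Rscript -e "install.packages(c('spatstat', 'Bchron'), repos = 'http://cran.rstudio.com')"
# build with:  docker build -t myproject .
# run with:    docker run -d -p 8787:8787 myproject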

There's a good paper on using docker and R here: 'An introduction to Docker for reproducible research, with examples from the R environment'  http://arxiv.org/abs/1410.0846

If you have other thoughts on how to make all of this easier, please do share them. 

best,

Ben

Enrico Crema

Jan 23, 2015, 7:02:50 AM
to anti...@googlegroups.com

Hi Ben,


Thanks for the extensive comments! The code was uploaded just for reference and partial reproducibility (in the case of ABC, you need a cluster). So it is not currently on GitHub or any version controlled repository (I should do this eventually). As for your comments:



* I'm not keen on rm(list=ls()) at the top of a script, it seems a bit  unkind to the user. I think literate programming offers a better approach since it creates a new environment when the code is executed (ie. sweave or knitr), so it's free of contaminating data objects, but doesn't require the user to remove everything. 

Oops... this was a left-over from my own script...

 
* Supplementary materials are not a very convenient  way to share code and data... the journal renames many of your files so the internal references in the code need to be edited, your R script files are supplied as zip files which adds a few extra steps to get to the code, and your CSV files (as they are referred to in the code) are Excel xls/x files in the supplementary materials. I see you have a comment in the code that says 'the excel worksheets should be individually exported as .csv files', but this is a bit tedious for the poor user and an unnecessary obstacle to using the data. Better would be to have the CSV files as the actual files in the supplementary materials, since Excel files are a poor choice for data longevity and portability. Those are all pretty small details, but they add up to an obstacle to sharing the really important parts of the paper, and burden the reader with a lot of tedium to access the code and data.  There are some nice guidelines on sharing code and data here: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542

Unfortunately I didn't have any control over how these data were deposited in the ESM. Indeed the renaming and grouping make things tedious... I don't recall exactly why (so I might be wrong), but I think the reason some files were in Excel was that there was a limit on the number of supplementary materials I could upload to E&HB. Again, this makes everything complicated...
 

* My suggestion to lower the burden in working with code and data that accompanies a paper is to put the code and data in an online repository that gives a persistent URL, and cite that URL in the paper. As I noted to Lee, many researchers are putting code and data on github,  then archiving a specific commit of their github repository in a repository like figshare.com or zenodo.org and getting a DOI from there  to cite in the paper. Those repositories are not going to mess around with file names or formats in the same way that the journal does, and it's easy for you to make corrections after publication, but still have a reference to the version current at the time of publication. You could have the usual supplementary material also, but the online repositories I'm referring to are also free to access, so people can get to them even if they don't subscribe to the journal (though I see your JAS papers are OA, so that's not a problem for those specific papers). 

That would be brilliant, especially the possibility of maintaining both a snapshot of the paper and adding corrections afterwards (see below).


* You don't have any kind of license on your code, which makes it difficult for others to know how they can reuse it. I see that Lee used GPL, but I prefer MIT (http://opensource.org/licenses/MITbecause it's not viral like GPL. Most people think of licenses as something that's only relevant where there are commercial applications, but I think they're useful ways to formally communicate to your users about your intentions for reuse (do you want attribution? are you ok with commercial reuse or do you want to limit to non-commercial use?).  Licenses are also handy to absolve you of any responsibility for how others use your code (eg. they misuse your code, publish, then it turns out they made a mistake and drama ensues, you can point to the license and say 'these are the conditions that you accept when you use my code, no warranty, no liability, etc.'). So I think all publicly available code should have a license, ideally one that is widely used (rather than one you make up yourself!). 

Agreed.

 
* For your simulation code (which I haven't studied carefully, so I might be on the wrong track here), include a random seed value. If I understand this correctly, given the same initial seed, all random numbers used in an analysis will be equal, thus giving identical results every time it is run. 

Well, this specific code was not designed for reproducibility but more as an illustration of how things are executed. The actual execution was carried out on a cluster, and the random seed was defined there. I could have supplied the entire workflow (including the submission file to the cluster, which includes the random seeds), but the intention in this case was slightly different. I should have clarified this better in the code and in the paper.



I've now finished running the code from the abc paper and got some errors at step 5. First some warnings, eg

> UBpost1<-abc(target=observed,sumstat=simresUB,param=simparamUB,tol=0.01,method="rejection")
Warning message:
In abc(target = observed, sumstat = simresUB, param = simparamUB,  :
  No summary statistics names are given, using S1, S2, ...

then an error:

> hist(UBpost1$unadj.values[,1])
Error in UBpost1$unadj.values[, 1] : incorrect number of dimensions

Any thoughts about that?

Yes, Mark Madsen pointed this out to me already. It's about the settings for the number of simulations. Somehow the number of simulations (nsim) was set to 5 rather than 100 on line 96, and later set to 100 at line 183. These should both be 100. Given that the tolerance level of abc was set to 0.01 (the parameter tol in abc and abc2), there is a good chance that the data frame in UBpost1 had only a single row. You can check by simply printing UBpost1$unadj.values; this should have tol*nsim rows. In fact even with 100 runs you will get just 1 row, so the tolerance should really be something like 0.2. Changing nsim to 100 and all tol parameters to 0.2 or so should solve your problem.
 

All Best,
Enrico

Mark Madsen

Jan 25, 2015, 12:19:10 PM
to anti...@googlegroups.com

Ben - 

This is a great resource, thank you!  There are a lot of terrific suggestions thus far, so I'll just add one for the moment.  

I think of caret not as a multivariate modeling package in its own right, but as a scaffold for fitting many kinds of models, and rigorously tuning and testing them, in a standardized way. I've been using it intensively across several projects this year, and I can't say enough good things about caret or its author, Max Kuhn.

I'd suggest breaking caret out of the multivariate models section and creating a small section for Model Testing and Validation. In this section, I would personally highlight caret as the "Swiss army knife" for model training, testing, and validation, and note that it has a really superb companion book, "Applied Predictive Modeling" by Max Kuhn, published by Springer (and available as an ebook).
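
(As a hedged sketch of the standardized train/tune/validate pattern described here, using a built-in dataset and a plain linear model so no extra packages are needed:)

# minimal sketch: cross-validated model training with caret
# (built-in iris data and a linear model, purely for illustration)
library(caret)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
fit <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = ctrl)
fit     # resampled RMSE and R-squared estimates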

I'd also put bootstrap here (although in most cases I'd let a higher level package actually *do* the bootstrap sampling in the context of model training, it's useful to do bootstrap samples yourself in writing functions), and mention Hmisc again in this context.  

Other packages that do cross-validation are cvTools and DAAG. I'm sure there are others lurking in CRAN, too.

In general, anything we can do to highlight methods for properly tuning and evaluating models is a good thing, since you almost never see an indication that archaeologists have evaluated the optimality of tuning parameters in their statistical models -- too often, sample sizes or number of predictors etc are chosen using conventions, rather than methods like cross-validation that tell you what the best fit really is.  

Ben Marwick

Feb 2, 2015, 5:15:31 AM
to anti...@googlegroups.com
Hi Enrico, Mark, et al.

Everyone - I've added a list of contributors to the task view document. If you suggested a package then I've added you as a contributor (if you'd prefer not to be listed, just let me know).  Thanks to you all for your suggestions! 

Enrico - thanks for your follow-up and explanation, I see what you mean about your code being more of a demonstration of your model. I think reproducibility exists on a long spectrum, from not at all reproducible (most articles in most journals) to push-button total reproducibility (specifically for the computational and statistical components; very few papers, but now we've got a few listed on the task view!), and any progress towards the fully reproducible end of the spectrum is worth recognising, such as your demonstration code. We're always going to have the challenge of analyses that are too intensive to run on a personal computer, data that are too big to share conveniently, and data and results we cannot ethically share due to cultural sensitivities, etc. So demanding that all papers be fully reproducible is unreasonable; instead we can strive to expose more of the workflow (as you have done), which is still a huge step forward for archaeology. By the way, have you considered documenting your computational environment in a dockerfile, so that someone could deploy your environment onto a server by running a docker container? I've been exploring this recently, cf. https://github.com/benmarwick/Steele_et_al_VR003_MSA_Pigments. The nice thing about the dockerfile is that it easily specifies all the software dependencies, and the resulting container is a Linux system almost totally isolated from the rest of the host system, so you have good assurance that if the docker container works on one cluster then it will work on another. For example, a docker image on my Windows laptop will be easily portable to an Amazon EC2 cluster or similar cloud compute service. I have a few rough notes on using docker here: https://gist.github.com/benmarwick/86aaa458df70ff202c27

Your comment on the journal's limitations on supplementary materials was quite interesting. I've noticed in some other fields that people are treating code and data more like first-class research products, and reserving the supp. for further discussion of material that was cut from the paper. For example, data files go to a repository like figshare.com, zenodo.org, dataverse.org, etc. that issues a DOI and can be cited in the paper as a regular research product. Code is treated similarly, with people using git for version control during development, then hosting the code online at GitHub, Bitbucket, GitLab, etc., and finally archiving a specific commit of the code in one of the data repositories, so the code gets its own DOI for citing. Then the original author can continue to develop and correct the code in public (and others can contribute), but there is still an easily accessible snapshot of the code as it was when the paper was published. DOIs have no special magical properties beyond the promise of persistence, but they are convenient as a widely recognised pointer to a scholarly product. So I think we should skip journal supplementary materials altogether and put code and data in open repositories, get DOIs, and cite them in the paper. In my case the data are usually small enough to be combined with the code, so my github repository has the code and data combined. This concept of a 'research repository' is based on the work of Gentleman, Donoho, Stodden, etc. (http://biostatistics.oxfordjournals.org/content/11/3/385.long, http://www.math.usu.edu/~corcoran/classes/14spring6550/handouts/reproducible_research.pdf, http://openresearchsoftware.metajnl.com/article/view/jors.ay/63; more: http://ropensci.github.io/reproducibility-guide/sections/references/)

R packages are especially well suited to creating research repositories (I mentioned this above; I've made a few personal packages for research and teaching), because they bundle together code, data, documentation, and tests in a systematic way that promotes good practices (e.g. documentation, licensing, etc.), and make it easy to share code with others. I find that the testing performed during the build process is especially useful for code development. Making packages recently became a lot easier with the appearance of Hadley Wickham's book (http://r-pkgs.had.co.nz/), which has excellent instructions for making packages with RStudio. If you write more than 2-3 functions for a research project, you'll probably benefit from making a package.
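
(A hedged sketch of that package-as-research-repository workflow with devtools; "mypaper" is a hypothetical package name and the layout follows the book linked above:)

# minimal sketch: develop a research repository as an R package
# ("mypaper" is a hypothetical package directory containing DESCRIPTION, R/, data/, vignettes/)
library(devtools)
document("mypaper")   # build help files from roxygen comments
check("mypaper")      # run R CMD check, including any tests
build("mypaper")      # produce a shareable source tarball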

Mark - thanks for those suggestions, I've made the model testing section as you suggested, please go ahead with a pull request if you want to finesse it a bit more. Some of the sub-headings probably need a bit more thought about how best to organise. 

best,

Ben

Lee Drake

Feb 5, 2015, 2:05:37 PM
to anti...@googlegroups.com
Hi all,

Thank you for the suggestions Ben - this is a rewarding discussion to be part of. I like the idea of bundling to avoid problems with changes in software versions. To answer your earlier question: Radiocarbon stores its supplemental files with the PDF page. If you click the 'view PDF' link, a link to the supplemental material will then open to the lower right. I've also restored the /la/ data folder to the website, not sure what happened to it originally. Finally, I strive for my code to be as redundant as possible. If you see it repeating, you can understand what is happening and make guesses about its overall organization. It also means that someone can focus in on one part of the code without being as dependent on an earlier section. I know there is a more elegant, minimalist approach to coding, but to encourage replication it is helpful for everything to be spelled out as simply as possible.

The bigger problem I have run into is the change in packages over time. As Ben mentioned, to change plot elements in ggplot2 in 2012 you had to use opts(). The writer of the package decided to change the function to theme(), and changed the syntax as well. The result is that the old code no longer plots properly. Ben noted potential solutions to this, but as it is today only a small minority of folks publishing in archaeology will go through all of these steps. The number of archaeologists I know of who actively publish and who would also store data or methods on GitHub is very small. This is in part a cultural issue, and in part a technical issue. The technological capability to store and share data has vastly outpaced the mores of the field. As a result, those who post data online along with code are going to be the minority for the foreseeable future.
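
(For readers who haven't hit this: the change was roughly from opts()/theme_text() to theme()/element_text(), e.g.:)

# minimal sketch of the ggplot2 syntax change described above
library(ggplot2)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
# old (pre-0.9.2, now defunct):
#   p + opts(axis.title.x = theme_text(size = 10))
# current equivalent:
p + theme(axis.title.x = element_text(size = 10))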

So the question is, how to draw attention to the issue? At this point - apologies in advance - it may be worth considering broadening the discussion past R. How do we foster a culture in archaeology where data is publicly available, digitally accessible, and replicable? And how do we communicate this to the field? 

The first solution is to have examples to point to in the literature. I would encourage a publication in a methodological journal, either Journal of Archaeological Method and Theory or Journal of Archaeological Science, that defines best practices. On the surface an edited volume would be the best fit, but that is difficult to access online. Perhaps there is another route as well.

A broader discussion about the issue would help as well. For example, a roundtable at the SAA or EAA (or both) that discusses replicability in archaeology, a digital archaeological commons. This could be focused on a) drawing attention to the issue in a public venue and b) getting broader input on how to accomplish this. 

Other fields have addressed this problem in innovative ways. My favorite of these is paleoclimatology. The National Oceanic and Atmospheric Administration has a data repository commonly used to store data. There are several gigabytes of data: you can download almost the entirety of the ice core records that document glacial advances for the past 800,000 years, as one example. This makes it very easy to download and analyze past work in the context of your own. As a result, in paleoclimatology there are many articles that can analyze a phenomenon on multiple scales using locally generated and globally available datasets. I am attaching one article by Eelco Rohling that uses such data sets to evaluate climatic conditions around the time of the Late Bronze Age collapse in the Eastern Mediterranean region. They can compare local cores to non-sea-salt chemistry deposited in Greenland to identify one large air mass (the Siberian High) that could have put evaporative pressure on terrestrial water resources. That kind of work would not be possible if the GISP2 data were not trivially easy to access.

In any case, as a question: how would such a system in archaeology work? There are good regional examples, like the Chaco Research Archive for the US Southwest, but not many others. Given the quantity of archaeology that is done on public land with public funding, it is surprising we have yet to build a centralized, publicly available data archive. As so much archaeological data is qualitative in nature (burial descriptions, etc.), it would be good to have a place for site reports and other so-called grey literature as well.



---

B. Lee Drake

Department of Anthropology
University of New Mexico
(505) 510.1518
b.lee...@gmail.com


Rohling et al 2009.pdf

Ben Marwick

Feb 10, 2015, 12:26:36 AM
to anti...@googlegroups.com, b.lee...@gmail.com
Hi Lee et al.,

Thanks for the further explanation about the location of the supps for Radiocarbon, I found them now after your more detailed instructions. 

Your comments on the cultural issue are an excellent analysis that I totally agree with. A few of us on this list will be part of an 'Open Methods' session at the upcoming SAA meeting, where we will be talking about scripted analyses, literate programming and other ways we're using to improve the reproducibility of our research.

I fully support your proposal to move the question beyond the R programming environment to open practices in general. A manifesto-like publication is one option we might consider (and I'd be happy to lead that if there's interest), but I also think there are a bunch of simple things we can do at smaller scales to sustainably change the default, which will have much greater impact because they directly engage with other people's work:

-- To prepare the next generation by making scripted analyses and use of open data repositories normal: 

* teach archaeology students to conduct analysis using an open source programming language. They don't need to become computer scientists, but they should be able to have basic proficiency in data cleaning, exploratory data analysis, visualisation and hypothesis testing. 

* teach archaeology students using datasets that are freely available in online repositories. There are a few repos that are aimed at archaeologists: tdar.org, opencontext.org, archaeologydataservice.ac.uk, but we don't have a strong culture of remixing other people's data (my preference is for figshare.com and zenodo.org because they're a lot easier to use, with nice GitHub hooks)

* have students attempt to reproduce each other's analyses to identify weak points as learning moments and cultivate good habits for coding reproducible research

-- To chip away at our peers and normalise open practices among established practitioners: 

* Publish our own work as examples of reproducible research in our field, put data and code in open data repositories, and cite them in our papers. This will help to raise data and code to be first-class research products, not throwaway pieces of workflow, as they currently are for most people. Citing code and data also supports microattribution in big multi-author projects where one person has done most of the work on archiving; with code and data citation, they can now get recognition for that. I challenge those of you currently writing long R scripts to have a go at writing complete R packages as the research repository for your research project, and then cite the package in your publication. Packages force you to write functions (rather than line-by-line code), automate some testing, and allow you to bundle data in with the code in a convenient way (to a point; gigabytes probably wouldn't work well). Functions are a huge feature of R and greatly improve the readability of code, minimize bugs and confusion by limiting the scope of variables (for me this is where the biggest payoff is), and minimize repetition. Plus you can specify dependencies exactly, so you can say your package depends on a specific version of another package. I don't mean a package on CRAN, but just on GitHub, or even just a tarball in a data repository. Hadley Wickham's recent book (http://r-pkgs.had.co.nz/, which I mentioned above) is a huge contribution that has substantially lowered the barrier to package creation.

* Request code & data when reviewing, and recommend to editors that papers only be published when code and data are made openly available in an appropriate repository (not in the supps, since they're only available to subscribers for non-OA journals, and they have the weird limitations that Enrico noted above). Of course most people won't have a clue what this is about, but it creates opportunities for discussion and education, and if editors get enough of these requests they may just make it part of the normal business of a journal article to include URLs to code and data.

* Submit to & review for journals that support reproducible research. One excellent way forward in this direction is a badging system for journal articles to signify when an article has accompanying open materials (cf. https://osf.io/tvyxz/wiki/home/). We can write to editors to encourage them to adopt a system like this (I've written to a few already, including Judith on this list, and if they get interest coming from a few directions there's a better chance they might make it happen)

* Critically review & audit data management plans in grant proposals. I mean, not just note that they're present, but check to see if the applicant really followed through with the data management plan of their last grant.

* Consider reproducibility wherever possible in hiring, promotion & reference letters. When writing these letters, we can favourably note open practices to show others that we value them, and that they should too. And when interviewing prospective hires, students, etc., we can ask questions about open practices to indicate that these are important to us. 

* host or lead training workshops to equip peers and students with the skills needed to work reproducibly. Software Carpentry (http://software-carpentry.org/) has an excellent introductory two-day curriculum for this that has been wildly successful in the natural and physical sciences. We run them every quarter at UW; the class fills in less than a week (I require my grad students to take it, then have them help out at subsequent workshops). Here are the details of our next one: http://efran.github.io/2015-04-09-UW/  They have a spin-off project called Data Carpentry (http://datacarpentry.org/) that is intended to be more domain-specific; we might consider contributing a few lessons to this project especially suited to archaeologists.

I'm sure there are other things we can do at a day-to-day level to normalise open practices and set the default to reproducible, and I'm keen to know what others are doing. Over the last few months I've been drafting a proposal for an 'Open Science Interest Group' as part of the Society for American Archaeology (cf. http://bit.ly/saaopensci, please let me know your suggestions!), which would be another way to raise awareness of these practices.

Ben