BiG CZ SSI proposal


Anthony Aufdenkampe

Feb 14, 2013, 11:15:19 PM
to Emilio Mayorga, Kerstin Lehnert, Zaslavsky, Ilya, Emma, Gary Berg-Cross, Valentine, David, Whitenack, Thomas, Aaron Packman, Critical Zone EarthCube Domain Group
Hi All,
    Here are some ideas for the title and goals of our proposed project.  Let me know what you think.

Proposal Title: 
An integrated biological-geological data discovery, access and publication system for critical zone sciences (BiG CZ)

  1. Develop the BiG CZ Access web app for map-based discovery of data on critical zone structure and function. 
    • This web app would be a client to the BiG CZ central system (goal 2, below) and would allow a user to zoom above or below the Earth’s surface similar to today’s ability to easily explore historical imagery of the earth’s surface using Google Earth.
    • Views would be filterable by a time period to display: 
      • point locations with sensor-based or sample-based time series observations, and direct access to that data
      • profiles from soil pits and boreholes with sample-based data, and direct access to that data
      • 2D & 3D images of structure obtained for the subsurface via geophysical approaches or for the surface obtained via LiDAR and other geospatial imaging approaches.
  2. Develop the BiG CZ Central software stack that bridges data systems developed for multiple critical zone domains.  BiG CZ Central would:
    • Register and catalog data-series level metadata from numerous domain-specific data services for discrete, feature based earth observations derived from sensors and samples using Observations Data Model 2.0;
    • Provide a web-services interoperability interface to provide single stream/query access to data from multiple, different data systems
    • Develop CZML, for web services in/out of ODM 2.0?
    • Develop a unifying controlled vocabulary ontology that matches metadata requirements for ODM 2.0 and translates/maps CVs from other domains/systems.
  3. Develop the BiG CZ Publication system that stores, organizes and shares user-uploaded data in a central data repository [on the cloud?]
    • [if IEDA fully adopts ODM 2, can we just use their systems?]
    • If we do develop local server systems to facilitate data publication, they should be built with open source, multi-platform software (e.g., Python and PostgreSQL)
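
To make goals 1 and 2 a bit more concrete, here is a rough sketch of how series-level metadata in a central catalog might be filtered by time period and feature type before being drawn on a map. Everything in it is an assumption for illustration only: `SeriesRecord`, `filter_catalog`, the field names, and the sample entries are hypothetical and are not taken from ODM 2.0 or any existing system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SeriesRecord:
    """Hypothetical series-level metadata entry held by a central catalog."""
    source: str        # data system that holds the actual values
    feature_type: str  # "point", "profile" or "image"
    variable: str
    latitude: float
    longitude: float
    begin: date        # first observation in the series
    end: date          # last observation in the series


def filter_catalog(records, start, stop, feature_type=None):
    """Return records whose time span overlaps [start, stop].

    If feature_type is given, only records of that type are returned.
    """
    hits = []
    for rec in records:
        overlaps = rec.begin <= stop and rec.end >= start
        type_ok = feature_type is None or rec.feature_type == feature_type
        if overlaps and type_ok:
            hits.append(rec)
    return hits


# Made-up catalog entries, one per kind of view listed under goal 1.
catalog = [
    SeriesRecord("system_a", "point", "water temperature",
                 39.86, -75.78, date(2009, 1, 1), date(2012, 12, 31)),
    SeriesRecord("system_b", "profile", "soil organic carbon",
                 39.86, -75.79, date(2011, 6, 1), date(2011, 6, 1)),
    SeriesRecord("system_c", "image", "lidar elevation",
                 39.87, -75.78, date(2010, 4, 15), date(2010, 4, 15)),
]

in_2011 = filter_catalog(catalog, date(2011, 1, 1), date(2011, 12, 31))
profiles_2011 = filter_catalog(catalog, date(2011, 1, 1), date(2011, 12, 31),
                               feature_type="profile")
```

The catalog here only stores metadata and a pointer to the source system; the values themselves would still be fetched from each domain-specific service.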

If this looks familiar, it's because it follows the structure of CUAHSI HIS!  I realized this after nearly completing the outline.  It's more or less a redo of HIS that is wrapped around the much more adaptable ODM 2.0, but with these important distinctions:
  • The BiG CZ Access client would be a sleek web app rather than HydroDesktop, which is MS Windows-specific and rather clunky.  The web app should be built on an open source stack, to avoid licensing fees and to allow others to easily adapt it.
  • We would develop a significant number of metadata services so that BiG CZ Central could catalog a wide variety of data sets that HIS Central does not (e.g., soil datasets).  A single metadata service to HIS Central would capture all the data systems that have already been cataloged by HIS Central.
I suggest that each of the above goals has a team leader.  These might be:
  1. BiG CZ Access web app: Anthony
  2. BiG CZ Central: Ilya
  3. BiG CZ Publication portal and system(s): Kerstin
I'm taking a long 4-day weekend, so I'll barely have email until Tuesday.


On Wed, Feb 13, 2013 at 2:03 PM, Aaron Packman wrote:
Hi all,

Sorry that I couldn't participate much in the conversation over the last two days.  I like the progress on the discussion, yet I see also that there is still uncertainty in the major challenges that can be addressed through this proposal.  I suggest focusing on biogeochemical processes at the CZO scale as a major theme, because that will require addressing both an important general challenge for analysis of CZ system dynamics and particular issues with both microbial data and experimental manipulations (which are in part related).

The central general challenge is that CZ systems are highly heterogeneous and require very different sampling approaches.  A wide variety of analytical methods are also used, with wide discrepancies in cost and ease of use, so different types of data are obtained at wildly different spatial and temporal resolutions, and there is no good approach for synthesizing them in a way that captures meaningful information on differences in data resolution or quality.  In particular, there is no general approach to recording data quality or uncertainties in averaged quantities (e.g., how should point measurement uncertainties be extrapolated to the scale implied by the sampling resolution?).  Estimates of a variety of CZ biogeochemical processes require integration of multiple types of information, but it is not currently possible to evaluate uncertainties in those estimates based on the combined uncertainties and resolutions of all of the component measurements.  Further, there aren't standard metadata formats or data archives for many of the relevant kinds of data.  This is a general EarthCube problem, but CZOs provide a particular focus on a set of scales, processes, and measurements.  Also it is useful and important that CZOs include both surface and subsurface processes, as that magnifies differences in sample acquisition and data resolution relative to more homogeneous systems like an open water body or the atmosphere or even a groundwater aquifer.  So I would keep the proposal pretty tightly focused on the CZO effort, but make the argument that tools useful for the CZO effort are generally useful for problems of upscaling disparate information in earth systems.
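
As a simple reference point for the averaged-quantity question, the textbook case can be sketched as below. It assumes the point measurement errors are independent and combines them in quadrature, which is standard error propagation and not a method anyone in this thread has proposed. The function name and the numbers are made up for illustration. It covers analytical uncertainty only and says nothing about how well the points represent the area or period being averaged, which is exactly the gap described above.

```python
import math


def mean_with_uncertainty(values, sigmas):
    """Mean of point measurements and its propagated standard uncertainty.

    Assumes independent errors, so variances add:
        sigma_mean = sqrt(sum(sigma_i ** 2)) / n
    """
    n = len(values)
    if n == 0 or n != len(sigmas):
        raise ValueError("values and sigmas must be non-empty and equal in length")
    mean = sum(values) / n
    sigma_mean = math.sqrt(sum(s * s for s in sigmas)) / n
    return mean, sigma_mean


# Two made-up point measurements with one-sigma analytical uncertainties.
mean, sigma = mean_with_uncertainty([2.0, 4.0], [0.3, 0.4])
```

If the errors are correlated, or the points sample the system unevenly, this formula understates the uncertainty of the upscaled estimate.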

The particular issue with microbial ecology is that we need to understand linkages between "structure" and "function" of the communities, and how both relate to prevailing environmental conditions.  Structure here refers to the basic organisms present, while function refers to the role of those organisms in effecting some important metabolic process or other transformation.  Most microbial ecology data is being obtained as gene sequences, and then the software and database systems that Emma mentioned are used to interpret those sequences.  It is possible to interrogate the resulting databases in terms of either structure or function, for example to ask what putative organisms are present, or to ask what sort of functional capability is present (e.g., capability for a particular type of biochemical process).  Further, there is an important distinction between genetic capability (organisms have genes for a particular process), gene expression (organisms that actually have some particular capability), resulting biochemical capability (number of particular enzymes, proteins, etc. that are produced), and resulting ecosystem function (total metabolic rate or chemical transformation rate in an ecosystem under specified environmental conditions).  There might be a great diversity of genetic capability present, but only a subset of this is expressed at any time, and the functional outcome varies almost continuously as environmental conditions change. 

Right now there is an enormous gulf between the interpretation of the basic community genomic data and the actual environmental functions, and this problem is greatly exacerbated by the strong disconnect in resolution between microbial community sampling, measurements of microbial community function, and large-scale biogeochemical patterns.  Functional assessments always involve some experimental manipulation or simulation because it is not possible to isolate the effects of individual organisms or populations of similar organisms in the natural environment.  So typically either some subsystem is isolated for experimentation, or a chemical is added to the environment and the transformation of that chemical interpreted in terms of microbial metabolism using a model.  The use cases that Emma identified represent a mixture of basic environmental observations plus additional measurements that can only be interpreted based on the result of a specific experimental perturbation of the system.  The latter measurements aren't interesting individually, and some of them simply are not meaningful on their own because they are conditional on the design of the experiment.  Further, the interpretation of the desired quantity (e.g., nutrient uptake, metabolic rate) depends very strongly on the assumptions about the system structure and the model used to interpret the integrated system-scale behavior.  In this case, the resulting parameter estimates should not be treated as data, but rather the collection of experimental observations needs to be considered in toto, and the metadata should comprise both the experiment design and the environmental conditions at the time of observations (or, perhaps better would be for the environmental conditions to be archived separately as their own data so they could also be used for other purposes).
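
One way to picture "considered in toto" is a record that keeps the full set of observations attached to the experiment design, with the environmental conditions referenced rather than copied. The sketch below is only an illustration under that assumption: `Observation`, `TracerExperiment`, the field names, and all values are hypothetical, and no interpreted parameter (uptake rate, metabolic rate) is stored as data.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One concentration measurement; not meaningful outside its experiment."""
    minutes_after_addition: float
    station: str
    analyte: str
    value: float
    units: str


@dataclass
class TracerExperiment:
    """Hypothetical container keeping observations with their design."""
    design: dict         # what was added, how, and where
    conditions_ref: str  # pointer to separately archived environmental data
    observations: list = field(default_factory=list)

    def series(self, analyte, station):
        """Return the time-ordered series for one analyte at one station."""
        rows = [o for o in self.observations
                if o.analyte == analyte and o.station == station]
        return sorted(rows, key=lambda o: o.minutes_after_addition)


# Made-up example values.
experiment = TracerExperiment(
    design={"tracer": "bromide", "addition": "constant-rate injection"},
    conditions_ref="hypothetical-id-of-environmental-record",
)
experiment.observations.append(Observation(30.0, "downstream", "bromide", 1.8, "mg/L"))
experiment.observations.append(Observation(10.0, "downstream", "bromide", 0.4, "mg/L"))
experiment.observations.append(Observation(10.0, "upstream", "bromide", 0.0, "mg/L"))

curve = experiment.series("bromide", "downstream")
```

Keeping only a reference to the environmental conditions follows the suggestion above that those be archived as their own data so they can be reused.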

For a CZO example, consider the Christina River CZO data repository.  The data listed under climate/meteorology, hydrology, temperature, chemistry, and geomorphology are all basic environmental observations -- interpreted to some extent because of the analytical methods involved, but with well understood interpretations so most people would accept these values as fact, subject to some analytical uncertainty.  However, the Biology/Ecology data are functional measures based on experiments -- the basic measurements here are time series of oxygen, nitrogen, and various inert tracers.  As in Emma's use cases, the individual concentration measurements of added tracers or nutrients are not meaningful, and instead it is the space/time-series of observations following the tracer addition that is important.  Further, in many cases the resulting interpreted parameters are strongly scale dependent, so it is not readily possible to compare these results in a simple database format. 

I think it would be useful to try to address the challenge of relating typical microbial community data to important functional outcomes at CZO scales (e.g., in soils and sediments, in aquifers and regoliths, in wetlands and rivers, and in entire watersheds and landscapes).  There are quite a few efforts to interpret community genomic data, but I don't know of any that have tried to integrate this with functional outcomes in natural systems.  This will require addressing questions of community structure vs. functional capability in the genomic data, observation vs. experimentation in environmental data, and sampling resolution and upscaling. 

To make this more tractable, it would be best to focus on a specific functional outcome. There are really just two big ones relevant to CZOs:  carbon and nutrient dynamics.  I like Emma's second proposed theme of comparing nitrogen transformations in terrestrial vs. aquatic systems, as that is an extremely important issue and encompasses a diversity of sub-environments within the CZ.  This case should receive wide support within NSF because human perturbations of the nitrogen cycle are a major concern now.  Beyond CZO, this effort should be of interest to NSF Ecosystems program, NEON and the related Macrosystems program in BIO, and the crosscutting SEES sustainability program.


On 2/13/2013 5:56 AM, Emilio Mayorga wrote:
Hi all,

At the call yesterday I volunteered to help with the 1-page draft summary. Steve has already moved this forward and created this draft:
(Steve, is it accessible to others?)

Emma has sent out what I think are two compelling descriptions of use cases. I really like the stories. She also had an excellent and concise set of key ideas from the call yesterday (I'm copying them here):
- This tool should be broader than CZO scientists and data, but certainly inclusive of it.
- The tool should focus on microbial data and the intersection between microbial data and other geoscience data types.
- The proposal needs to emphasize software creation over data curation, even though we all agree there will end up being a large amount of data creation necessary.

I'm not sure where the ideas about experimental data, from yesterday's call, fit into a cohesive picture (my connection to the call was dropped during much of that discussion). Roelof's project ideas document focuses on experimental data, though his citations also touch on microbial topics.

I've looked at all these materials closely and still find myself unable to identify our core selling point in the context of the SSI opportunity. So instead of editing Steve's draft project summary, I'd like to raise several questions and comments:

- Was there agreement at the call to set aside a focus on experimental data?
- If not, is it compelling to have *both* experimental data and microbial genomics data at the core of this proposal? I think both have very good arguments going for them, but I'm not sure that together they make for a cohesive presentation that doesn't sound too forced. They bring a different set of challenges into focus: a need for integration with other types of data (geoscience, observational, etc) can be woven around both, but experimental data seems to have a greater and broader need for general information modeling, management and curation. It sounds like microbial genomics per se is already being addressed (see several project links Emma and others have pointed us to over the last 10 days), but the integration with other environmental (and experimental) data is the critical gap we would address, as Emma said.

- So, if it's not already clear, we need to decide ASAP on one topic, or both.

- As a clarification, Steve's project summary sometimes talks about all biological data rather than microbial ecology and genomics proper. I think we should be careful not to define our focus as biological/ecological data as a whole; one could argue that while many challenges remain, there is lots of good work being done in that field by NCEAS and the Ecoinformatics community (e.g., the LTER informatics group, DataONE, and many others). Heck, there's even a journal called "Ecological Informatics" that's been around for a few years. The gap I *think* I see is narrower and focuses on integrating microbial genomic data with both microbial function and a wide array of other data, to facilitate the kind of use cases Emma describes.

- In defining our scope and target, I think we also need to be careful not to give the impression that we're trying to attack the very broad problem of integration across all environmental domains related to the critical zone. Many of our draft statements (in the summary, the use cases, and previous docs) convey that impression. My concern is that such expansiveness starts sounding like we're trying to do all of EarthCube in one project, as well as duplicate a lot of what the new CZO Data project is setting out to do (except open ended and not tied down to the CZO network and its specific sites). While integration will be an important element of any proposal to come out of this group, I would think that a clearer and narrower focus is helpful.

Personally, I'd like to hear what Ilya, Kerstin and Ruth think regarding an expansive vs narrower focus; they have a much better understanding than me of what is an appropriate scope for this funding opportunity. As it is, I'm finding it very hard to conceive of well defined software that addresses such a broad scope; instead, this currently looks more like an integration effort (a really cool one!) with lots of system development and software component integration (glue, refinements, convention development, data browsing and querying, etc), to be sure, but not well defined pieces of software.

I'll end with a tiny note about the acronym "EML" that Roelof proposed on his project ideas document: it's already taken by the Ecological Metadata Language.
I hope these thoughts are of some use.

On Tue, Feb 12, 2013 at 7:00 PM, Kerstin Lehnert wrote:
Can we play with BiG (Bio-Geo)?


Dr. Kerstin Lehnert
Director, Integrated Earth Data Applications Research Group
Director, Geoinformatics for Geochemistry

Lamont-Doherty Earth Observatory
Columbia University
Palisades, NY, 10964

From: "Zaslavsky, Ilya"

Date: Tuesday, February 12, 2013 21:17 PM
To: Emma, Anthony Aufdenkampe
Cc: Gary Berg-Cross, Kerstin Lehnert, David Valentine, Stephen M Richard

Subject: RE: [czo-earthcube] Input necessary by FRIDAY Feb 8 Re: NSF CZO proposal summary

Thanks Emma, this is excellent.

I’ll be able to work on it tomorrow morning and on Thursday.


I just talked with Barbara Ransom, and got some additional advice on how to move forward. To summarize:

1)      Expand it beyond CZO (basically what we concluded on the call, as Emma mentioned)

2)      It would be good to have both GEO and BIO know about the idea. On the BIO side, we’d need to speak with Peter McCartney after we have put the initial thoughts about BIO integration on paper. We may also need canonical BIO-funded people on board. Anthony, Aaron, or others – do you know Peter? I can also talk to him.

3)      It is important to demonstrate wide community involvement, and have strong metrics showing that.


What would be a catchy abbreviation for the infrastructure pieces we develop?


-          Ilya



From: Emma
Sent: Tuesday, February 12, 2013 1:37 PM
To: Anthony Aufdenkampe
Cc: Gary Berg-Cross; Kerstin Lehnert; Zaslavsky, Ilya; Valentine, David; Whitenack, Thomas
Subject: Re: [czo-earthcube] Input necessary by FRIDAY Feb 8 Re: NSF CZO proposal summary


Hello All,

I am thrilled to see that Anthony is interested in what we have been working on. One of the things we discussed yesterday is that we think you could be a prime candidate to be the lead PI (or one of two lead PIs), if you were comfortable.

A couple of important ideas from the call yesterday:

  • This tool should be broader than CZO scientists and data, but certainly inclusive of it.
  • The tool should focus on microbial data and the intersection between microbial data and other geoscience data types.
  • The proposal needs to emphasize software creation over data curation, even though we all agree there will end up being a large amount of data creation necessary.

A few people stepped forward to work on a 1-page description of the idea we are converging on for this proposal, but wanted to see some use-case scenarios first. I have attached a file that includes two very specific use-cases that I wrote, and one general one on the second page from Roelof. My use-cases are focused on site selection, collaboration and publication of data coincident with associated articles. These are the roles that I see this tool filling, however, I am very happy to adapt these use-cases if the group feels they are not quite right.

I had mentioned that Aaron Packman might be able to work on these with me, but he has been a bit under the weather the last couple of days. I hope to develop these use-case scenarios further with him over the rest of the week, as I get feedback from others on what changes would be helpful.

I have also attached a document called project ideas, which is from Roelof Versteeg per the conversation yesterday.

Finally, I have made a doodle to choose a time for a call at the end of this week to check in and see where we are. Please respond by Wednesday afternoon (evening in EST) so that I can send out the agreed-upon time by Wednesday Feb 13 at night.


On Tue, Feb 12, 2013 at 6:05 AM, Anthony Aufdenkampe wrote:

Hi All,

    I'm sorry that I was not able to make the Skype call yesterday, but I'm very interested in where the discussion went.  In short, now that I have our CZO proposal submitted and have had time to reflect, I'm very interested in participating in this.  I've had time to read over the "NSF CZO proposal summary" Google Doc, and have thoughts/ideas.  Ilya and I also discussed many possibilities last Thursday on a phone call.

    I'd love an update on the group's thinking.




On Mon, Feb 11, 2013 at 6:04 PM, Gary Berg-Cross wrote:

I should have my Skype up in a few minutes.. garyb.cross




On Mon, Feb 11, 2013 at 6:00 PM, Kerstin Lehnert wrote:

How do I connect with you?







From: "Zaslavsky, Ilya"
Date: Monday, February 11, 2013 11:13 AM
To: Emma
Cc: Anthony Aufdenkampe, Kerstin Lehnert, David Valentine, Stephen M Richard
Subject: RE: [czo-earthcube] Input necessary by FRIDAY Feb 8 Re: NSF CZO proposal summary


Hi All,


A lot of good thoughts are shared in the Google doc. I hope we can find time to discuss and coordinate sometime today, and decide on moving forward. Time is very short, not to mention that folks potentially on the proposal team are deciding which proposal they are writing (there is a limit of one proposal per senior participant). Would connecting around 3pm PST today, over Skype, work?


-          Ilya

On Behalf Of Emma
Sent: Wednesday, February 06, 2013 10:35 AM
Cc: Valentine, David; Whitenack, Thomas; Zaslavsky, Ilya
Subject: [czo-earthcube] Input necessary by FRIDAY Feb 8 Re: NSF CZO proposal summary


Dear SSI Proposal Group,


The link goes to a google doc Roelof Versteeg has written based on some of what was discussed in the phone call last Friday. For those who missed the call, there were some different views on what should be included in this proposal, so there are a few different choices listed and described. I think the third is the most original, with the smallest degree of overlap with existing projects. However, it would also present the most difficulty. I want to stress that this proposal is an opportunity for us to propose a problem that we want to address, and hopefully fix. We need to describe the process we will use to address it, but we do not need to know how we will fix the problem at the outset. If it were easy, they wouldn't want to shell out $1M for it!


So, if you would like to be involved moving forward, please input your information in a box at the top of the document and then add any input you have. Feel free to edit what is there, not just add. 


I hope you can add your input by Friday so that we can move forward and contact the program officers with whatever we have decided. We are on a short timetable and I would like to move quickly to decide what the proposal will be on, so that we can begin writing ASAP. 


If you would not like to be involved any longer, please simply do not input your information. We will use the self-provided contact info moving forward. 






On Wed, Feb 6, 2013 at 9:47 AM, Emma Aronson (Google Drive) wrote:

I've shared an item with you.



Emma L Aronson, PhD




Gary Berg-Cross, Ph.D.  


SOCoP Executive Secretary

Knowledge Strategies    

Potomac, MD


Anthony K. Aufdenkampe, Ph.D.
Associate Research Scientist - Isotope & Organic Geochemistry
Stroud Water Research Center
970 Spencer Road, Avondale, PA 19311
Tel. 610-268-2153 ext. 263; Fax 610-268-0490


Dr. Aaron I. Packman
Professor, Department of Civil and Environmental Engineering, Northwestern University
address: 2145 Sheridan Road, Tech A314, Evanston IL 60208-3109
phone:  847-491-9902, fax:  847-491-4011
