Cloud Update from EarthScope - April 2024

Scott Johnson

Apr 11, 2024, 1:19:02 PM

to EarthScope General

Cloud Update

With computational notebooks, code can tell a story

As we build out cloud data systems, we want novices as well as pros to be able to use them. One of EarthScope’s initial interfaces to help investigators explore cloud computing with NSF SAGE/GAGE data will be an interactive notebook computing platform. So if you aren’t yet familiar with notebooks, here’s a primer on what they are—and why many in science find them useful.

What’s a notebook?

Notebooks are an alternative way to write and run code, one that looks more like embedding code in an interactive document. The concept can be implemented in different forms (see our JavaScript data access notebooks, for example), but we’ll focus here on Jupyter notebooks.

These notebooks can run code in a number of languages, such as Python, R, or Julia. The kernel process that executes code can run on your local machine, but it can also run on a server, allowing users to operate the notebook with nothing but a web browser.

Sections (or “cells”) of a notebook can contain code or they can contain formatted text and other media content. This allows a notebook to contain rich directions, context, explanation, documentation, or anything else useful that you can think of. Notebooks can be much more user friendly and flexible than a script because of this ability to add “narrative” to your software tools.
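To make the idea of cells concrete, here is a minimal sketch that builds a two-cell notebook by hand using only Python's standard library. It assumes version 4 of the Jupyter notebook file format, in which a notebook is a JSON document holding a list of cells. The helper names and the cell contents are hypothetical, chosen only for illustration.

```python
import json


def markdown_cell(text):
    """Return a format-4 markdown ("narrative") cell."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Return a format-4 code cell that has not been executed yet."""
    return {
        "cell_type": "code",
        "execution_count": None,  # filled in by the kernel when the cell runs
        "metadata": {},
        "outputs": [],            # plots and printed results are stored here
        "source": source,
    }


def build_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Hypothetical content: one narrative cell followed by one code cell.
notebook = build_notebook([
    markdown_cell("## Step 1\nCount the stations in our list."),
    code_cell("stations = ['A', 'B', 'C']\nlen(stations)"),
])

# Serialize to the JSON text that would be stored in an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Writing `notebook_json` to a file with an `.ipynb` extension should give a file that Jupyter can open, with the narrative cell rendered as formatted text above the runnable code cell.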

The code cells in a notebook can be executed independently within a shared environment. Although out-of-order execution may sometimes require some care, this allows you to break your code into documented steps.
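The shared environment, and the care that out-of-order execution requires, can be illustrated with a small simulation. This is a sketch under stated assumptions, not how Jupyter itself is implemented: each string stands in for one code cell, a plain dictionary stands in for the kernel's namespace, and the cell names and values are hypothetical.

```python
# Each string stands in for one notebook code cell (hypothetical contents).
cells = {
    "load": "magnitudes = [4.5, 5.1, 6.3]",
    "filter": "large = [m for m in magnitudes if m >= 5.0]",
    "summarize": "count = len(large)",
}


def run_cells(order, namespace=None):
    """Run the named cells in the given order against one shared namespace."""
    namespace = {} if namespace is None else namespace
    for name in order:
        exec(cells[name], namespace)  # every cell sees earlier cells' variables
    return namespace


# Running the cells top to bottom works: each step builds on the last.
env = run_cells(["load", "filter", "summarize"])
print(env["count"])  # 2

# Running "filter" first in a fresh environment fails, because the
# variable it depends on has not been defined yet.
try:
    run_cells(["filter"])
    out_of_order_error = None
except NameError as err:
    out_of_order_error = err

print(type(out_of_order_error).__name__)  # NameError
```

The same mechanism explains why restarting the kernel and running every cell from the top is a common way to check that a notebook still works as written.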

The output of your code—like visualizations or processed results—can also be displayed inside the notebook. This can make a notebook a full-service tool for interacting with and visualizing data for students, for example. Or it can make it an exploratory tool for quickly iterating to understand the shape of a new dataset or technique.

screenshot of Jupyter notebook showing input, code cells, and output calculating the radius around an earthquake where GPS stations should record offset

screenshot of a jupyter notebook with input fields for creating a map plotting GPS station locations

Above: examples from an EarthScope Jupyter notebook used to create maps and plots for GPS stations near earthquakes.

Why use notebooks?

The cofounders of Jupyter have written that “even though Jupyter helps users perform complex, technical work, Jupyter itself solves problems that are fundamentally human in nature. Namely, Jupyter helps humans to think and tell stories with code and data.”

There are a number of use cases that have made notebooks popular in science, specifically. One of these is the ease with which you can share them with others. When hosted on a server, multiple users can engage with them immediately—no setup required—in the same environment. Colleagues can more easily collaborate on a notebook, and students can jump right into that workflow and focus on scientific content and analysis rather than management and configuration of the technical environment.

The NISAR Science Team, for example, has shared data processing algorithms with its community as notebooks prior to launch. Likewise, the open science Pangeo project has been using notebooks to share tools that “allow researchers to access, process, and analyze NASA data in the commercial cloud without having to download the data”.

This simplified sharing also makes notebooks useful for reproducibility and transparency efforts, as you can document your analysis methods in a ready-to-run package. When the Laser Interferometer Gravitational-Wave Observatory (LIGO) team announced the detection of gravitational waves in 2016, there was obvious interest in examining and replicating the data analysis. The team published detailed notebooks that served as a tutorial and documentation for their results.

It’s also worth noting that notebooks are more flexible and extensible than you might think at first glance. Notebooks can be turned into static web pages, run by a browser even without a server, or embedded in web pages to power web apps. They can also interact with systems that aren’t notebooks, like web services, software installed in the same environment, or even a distributed computing architecture.

How to get started

You can create Jupyter Notebooks to run on your local computer by installing JupyterLab, or through Microsoft’s Visual Studio Code. Alternatively, you can work with Python Jupyter notebooks hosted by remote servers such as Google Colab or Binder.

There are many beginner-friendly tutorials for these tools (or for coding languages like Python) that will help you quickly get your feet wet, in addition to geophysical notebook collections you could explore once you’re comfortable.

Just as Google Colab runs on Google’s servers, it’s possible to set up JupyterLab on a server or cloud space you control for multiple users to share, using a platform known as JupyterHub. Partnering with 2i2c, we’ll soon be launching a JupyterHub to run notebooks in our cloud data system that you’ll be able to access using your EarthScope data login. In addition to being an easy entry point for building notebooks, our hub will offer some benefits specific to our community.

Future advantages

As we optimize the data archive for cloud processing, there will be functional advantages to performing data analysis from our notebook hub. Because the hub and the data will exist in the same cloud ecosystem, you’ll be able to leverage server-side processing to run data-intensive workflows extremely efficiently and you won’t have to spend long periods of time waiting for data to download—making it more intuitive and faster to run your analysis.

infographic titled ''Cloud On-Ramp Construction'', top progress bar represents data systems and notes that progress goes from lifting existing data systems into the cloud to power up data architecture for full cloud capabilities, with benefits ranging from existing services and workflows preserved to new data query capabilities added to scalable computing adjacent to data supported; second progress bar labeled ''GeoLab JupyterHub'' noting progress from open access to a shared JupyterHub to GeoLab leverages cloud-native data analysis methods, with a benefit of a platform for reproducibility and collaboration; third progress bar titled ''Training and Documentation'' with progress from introductory information to initial training opportunities to full how-to resources, and an identified benefit of the community being equipped for cloud computing

Here’s where we are currently, and what we’re heading towards.

Initially, you’ll be able to take advantage of notebooks’ strengths for exploratory cloud data analysis, visualization, and data storytelling. But beyond that, we intend to map pathways to more data-intensive cloud computing, so users can migrate their workflows to more script-based development for improved versioning, testing, and scalability as needed.

This notebook hub will also facilitate community collaboration. Technical short courses will be able to simplify and centralize access to tutorial content. (So short courses may double as your introduction to the hub.) It will provide a common location where teams can work together on data analysis methods. And as time goes on, a growing set of curated community tools will be available to everyone. It is our hope that this platform will prove to be convenient and extremely useful—particularly for optimizing your work for cloud computing.

We’ll have much more information to share about the notebook hub in a future update!

The EarthScope-operated data systems of the NSF GAGE and SAGE Facilities are migrating to cloud services.

To learn more about this effort and find resources, visit: earthscope.org/data/cloud

webpage screenshot with title ''Cloud Data Systems''

video thumbnail with title ''Cloud Computing for Science''

Check out our two-minute video for an introduction to cloud computing for science