

Lalo Scalf

Aug 18, 2024, 4:31:53 PM

The rest of the year, I am a short distance away in the Biology Department at the university, but immersed worlds away in the field of conservation biology as a PhD student. My research goal is to identify poaching hotspots of the most trafficked mammal in the world, pangolins, to inform conservation management and wildlife enforcement decisions. This translates into a lot of time looking for pangolin scats in forests with detection dogs, extracting DNA from pangolin specimens in museums, and conducting genetic analyses in labs.

At DSSG, there was an emphasis on stakeholder analysis and inclusion from the very beginning. For seamo, we were fortunate to have the project's main stakeholder, the Seattle Department of Transportation (SDOT), as the project lead, with whom we could brainstorm, discuss, and ask questions at each crucial decision point. In addition, we had a meeting with an extended panel of stakeholders from SDOT and the Office of the Mayor. We developed a system for exchanging information and providing feedback, resulting in continued interactions that were invaluable in gaining insight into the specific interests and potential uses of the Seattle Mobility Index.

Having heard the views of Seattle Mobility Index stakeholders, the seamo team has adjusted its course to take account of the feedback. The main addition to the project was the development of personas that represent key groups of people living in Seattle. The purpose of the personas is to add a filter to the Seattle Mobility Index that shows what mobility looks like for different types of people. The mobility index of a single professional who likes to bike everywhere will look different from that of a family with two children who use a car as their primary mode of travel. Along with the major addition of personas, we made a number of smaller adjustments to the project that we did not foresee at the beginning. The continued feedback from our stakeholders has kept us flexible, and each mobility index has gone through a number of iterations to provide the most informative score for the stakeholders.

Eventually, though, I read the manual. I searched for recipes online. I visited bookstores to peruse their pressure cooker cookbooks. After just a few weeks, I felt like an alimentary genius; I now aver to passersby that the pressure cooker has changed my life for the better. For far too long, I had dismissed this magical gadget that allows me to make hot, flavorful, fragrant cuisine in a fraction of the time and effort of using a stove or oven. If only I had been aware of its sorcery sooner, I could have avoided years of heartbreak in the kitchen.

For our project, a major task has been importing and preprocessing geospatial data to train an object detection algorithm that will help the team identify damaged buildings from post-hurricane satellite imagery. At times, the going has been tough. Anyone who has worked with large amounts of geospatial data understands that GIS processing is not always efficient. Load times for large datasets in standard formats can be prohibitively long. Proper formatting of geospatial data often requires numerous intermediate steps that can stall if the available memory is insufficient for the task at hand. Moreover, the analyst needs a careful checklist of data-wrangling steps to avoid unnecessary mistakes.
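One way to keep memory under control is to avoid loading an entire scene at once. The sketch below is a minimal illustration of that idea, not our actual pipeline: it assumes a hypothetical post-event GeoTIFF named post_event.tif and uses the rasterio library to cut the raster into fixed-size chips with windowed reads, so only one tile is ever held in memory.

```python
# Minimal sketch: tile a large post-hurricane GeoTIFF into 512x512 chips
# using windowed reads so the full scene never has to fit in memory.
# File names and tile size are hypothetical, not taken from the project.
from pathlib import Path

import rasterio
from rasterio.windows import Window

TILE = 512
src_path = Path("post_event.tif")   # hypothetical input scene
out_dir = Path("tiles")             # hypothetical output directory
out_dir.mkdir(exist_ok=True)

with rasterio.open(src_path) as src:
    meta = src.meta.copy()
    for row_off in range(0, src.height, TILE):
        for col_off in range(0, src.width, TILE):
            # Clip the window at the raster edge so partial tiles stay valid.
            window = Window(col_off, row_off,
                            min(TILE, src.width - col_off),
                            min(TILE, src.height - row_off))
            chip = src.read(window=window)  # only this tile is loaded
            meta.update(height=window.height,
                        width=window.width,
                        transform=src.window_transform(window))
            out_path = out_dir / f"tile_{row_off}_{col_off}.tif"
            with rasterio.open(out_path, "w", **meta) as dst:
                dst.write(chip)
```

Chips produced this way can later be paired with building footprints or damage labels for training, without the preprocessing step ever needing the whole scene in memory.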

For the past nine weeks, I have been part of a data science project whose aim is to identify damaged or flooded buildings after a hurricane event from satellite imagery. Aside from the research experience and technical skills gained through the project, I am thrilled to find that the eScience Institute at the University of Washington places a great deal of emphasis on data management and reproducibility: this is a culture not often seen in other academic environments. As I am also working on another ongoing project that surveys Earth System Science researchers about their experiences with data reuse and reproducible research, I feel an impetus to share my thoughts on the definitions of reproducibility, current awareness of reproducibility in the data science community, and some good practices for enhancing reproducibility.

To make datasets reproducible, it is critical to provide access to the source materials and any code or tools used to process the source data. In our project, all the data and tools used are open source. We use a GitHub repository to manage all our data processing scripts and document the data pipeline. By following the instructions in this repository, others should be able to reproduce any intermediate data used by our team. There are some cases in which it is not possible to exactly reproduce datasets. For example, machine learning projects often use human annotators to create training data, and these training data are subject to human biases and random mistakes. It would be very helpful to establish a standard procedure and quality check for the human annotation phase and to document everything in reasonable detail. It is also difficult to recreate observational data and field experiment data. Again, sufficient documentation of how the data were generated would be helpful to others.
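Beyond written instructions in a README, one lightweight way to make intermediate data verifiable is to record a manifest of input checksums, parameters, and output checksums for each processing step. The sketch below uses only the Python standard library; the step name, paths, and parameter values are hypothetical placeholders, and it illustrates the idea rather than the repository's actual tooling.

```python
# Minimal sketch: record a provenance manifest for one processing step so
# others can verify that they reproduced the same intermediate data.
# Step name, paths, and parameters are hypothetical placeholders.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


manifest = {
    "step": "tile_post_event_imagery",
    "run_at": datetime.now(timezone.utc).isoformat(),
    "parameters": {"tile_size": 512},
    "inputs": {"post_event.tif": sha256(Path("post_event.tif"))},
    "outputs": {p.name: sha256(p) for p in sorted(Path("tiles").glob("*.tif"))},
}

Path("tiles/manifest.json").write_text(json.dumps(manifest, indent=2))
```

If a collaborator reruns the step and gets matching checksums, the intermediate data have been reproduced exactly; a mismatch points directly at the step that diverged.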

Data science can be computationally heavy. Thanks to today's thriving open-source community, many software programs and libraries are freely available to anyone, making exact replication much easier than it used to be. Apart from providing access to the code and tools used in the research, documentation of methods is also vital to ensuring method reproducibility. With enough detail, another researcher would, in theory, be able to reimplement the methods even when the code and tools used by the original study are not available. There are many good practices for documenting research methods besides papers and their supplemental materials, including executable notebooks such as Jupyter (formerly IPython) notebooks, readable code comments, and GitHub wiki pages. Our team uses all of the tools mentioned above to keep things organized.
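Documenting the computational environment itself also goes a long way toward method reproducibility. The sketch below is one simple way to do this, assuming NumPy is installed: it pins the random seeds and writes out the Python and package versions used in a run. The seed value and output file name are arbitrary placeholders.

```python
# Minimal sketch: pin random seeds and record the software environment so a
# method can be rerun under the same conditions. Seed value and output file
# name are arbitrary placeholders.
import json
import platform
import random
from importlib import metadata

import numpy as np

SEED = 42  # hypothetical fixed seed, reused wherever randomness appears
random.seed(SEED)
np.random.seed(SEED)

environment = {
    "python": platform.python_version(),
    "platform": platform.platform(),
    "seed": SEED,
    "packages": {pkg: metadata.version(pkg) for pkg in ["numpy"]},
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```

A pinned requirements file or an exported conda environment serves the same purpose at the level of the whole project.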

Exact replication can identify mistakes in data and methods. However, with the same or similar research questions, reproducing a study using different data or methods can test the validity and robustness of a claim. For example, this can mean conducting the research in a different geographic area with the same method as the original study. To ensure result reproducibility, one good practice is to repeat the study multiple times with different subsamples of the data before arriving at any final result. In our project, we applied an object detection method to imagery of Hurricane Harvey. One of our future goals is to incorporate data from other events, test the generalizability of our model, and then try to make it more generalizable.
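To make the subsampling idea concrete, the sketch below repeatedly evaluates a score on random subsamples of a label set and reports the spread of the results. The score function, sample sizes, and synthetic labels are hypothetical stand-ins, not the project's actual detection metric.

```python
# Minimal sketch: check result stability by re-running an evaluation on many
# random subsamples and reporting the spread. The score function and data
# are hypothetical stand-ins, not the project's real detection metric.
import numpy as np

rng = np.random.default_rng(42)

# Placeholder labels: 1 = building damaged, 0 = intact.
y_true = rng.integers(0, 2, size=1000)
y_pred = np.where(rng.random(1000) < 0.9, y_true, 1 - y_true)  # ~90% agree


def accuracy(t: np.ndarray, p: np.ndarray) -> float:
    return float(np.mean(t == p))


scores = []
for _ in range(100):
    idx = rng.choice(len(y_true), size=500, replace=False)  # random subsample
    scores.append(accuracy(y_true[idx], y_pred[idx]))

print(f"accuracy over subsamples: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

If the spread is small, the headline number is not an artifact of one lucky split; a large spread is a warning to collect more data or to report the uncertainty alongside the result.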

To summarize, research reproducibility has become a concern for the data science community. Reproducibility can be examined through data reproducibility, method reproducibility, result reproducibility, and inferential reproducibility. Although exact replication is not possible in some cases, providing open access to raw materials and documenting the details of the research process are of the utmost importance. Many changes in culture, reward structures, and funding policies need to happen in the long run, but as researchers, we can make a difference in a short period of time by adopting good practices.
