Legacy data (n) - Information stored in an old or obsolete format or computer system that is, therefore, difficult to access or process. (Business Dictionary, 2016)
For over 135 years, the U.S. Geological Survey has collected diverse information about the natural world and how it interacts with society. Much of this legacy information is one-of-a-kind and in danger of being lost forever through decay of materials, obsolete technology, or staff changes. Several laws and orders require federal agencies to preserve federally collected scientific information and provide the public access to it. The information is to be archived in a manner that allows others to examine the materials for new information or interpretations. Data-at-Risk is a systematic way for the USGS to continue efforts to meet the challenge of preserving and making accessible the enormous amount of information locked away in inaccessible formats. Data-at-Risk efforts inventory and prioritize inaccessible information and assist with the preservation and release of the information into the public domain. Much of the information the USGS collects has permanent or long-term value to the Nation and the world through its contributions to furthering scientific discovery, public policies, or decisions. These information collections represent observations and events that will never be repeated and warrant preservation so that future generations can learn and benefit from them.
Goal: Expand the USGS contribution to scientific discovery and knowledge by demonstrating a long-term approach to inventorying, prioritizing and releasing to the public the wealth of USGS legacy scientific data.
Implement a systematic workflow to create a USGS Legacy Data Inventory that catalogs and describes known USGS legacy data sets.
Develop a methodology to evaluate and prioritize USGS legacy data sets based on USGS mission and program objectives and potential of successful release within USGS records management and open data policies.
Preserve and release select, priority legacy data sets through the USGS IPDS data release workflow.
Analyze the time and resources required to preserve/release legacy data and develop estimates to inform future legacy data inventory efforts.
As one of the largest and oldest earth science organizations in the world, the USGS has as its scientific legacy the data gathered over 130 years of research, including but not limited to images, video, audio files, and physical samples, and the scientific knowledge derived from them. However, it is widely understood that high-quality data collected and analyzed as part of now completed projects are hidden away in case files, file cabinets and hard drives housed in USGS facilities. Therefore, despite their potential significance to current USGS mission and program research objectives, these “legacy data” are unavailable. In addition, legacy data are by definition at risk of permanent loss or damage because they pre-date current, open-data policies, standards and formats. Risks to legacy data can be technical, such as obsolescence of the data’s storage media and format, or they can be organizational, such as a lack of funding or facility storage. Conveniently, addressing legacy data risks such as these generally results in the science data becoming usable by modern data tools, as well as accessible to the broader scientific community.
Building on past USGS legacy data inventory and preservation projects
USGS has a long history of proactively researching and developing solutions to data management needs, including legacy data inventory and preservation. For example, in 1994 USGS was instrumental in establishing the FGDC-CSDGM metadata standard for geospatial scientific data that is still part of the foundation of USGS data management. Today, USGS is a lead agency in establishing meaningful and actionable policies that facilitate data release to the greater, public scientific community. In recent years, CDI has invested in several legacy data inventory and preservation projects, including the “Legacy Data Inventory” project (aka, “Data Mine” 2013-present), which examined the time, resources and workflows needed for science centers to inventory legacy data. Another CDI project, the “North American Bat Data Recovery and Integration” project (2014-present), is preserving previously unavailable bat banding data (1932-1972) and white-nose syndrome disease data and making them available via APIs. Both of these CDI projects were forward-thinking legacy data initiatives, several years ahead of Federal open data policies and mandates.
However, one of the most comprehensive, Bureau-level legacy data preservation efforts was the USGS Data Rescue project, which provided funding, tools, and support to USGS scientists to preserve legacy data sets at imminent risk of permanent loss or damage. A small sample of USGS science data rescued over those eight fiscal years included:
Inventoried, catalogued, indexed, and preserved Famine Early Warning one-of-a-kind, hardcopy maps.
Landsat orphan scenes, totaling over 146,000 were retrieved and processed, allowing the land research community to access previously unavailable satellite records.
Through a partnership with the Alaska State Division of Geological and Geophysical Surveys, the Alaska Water Science Center scanned, added metadata to, and included in a database volcano imagery dating from the 1950s to 2004.
20,000 original, historical stream flow measurements from Kentucky dating from the early 1900s to the late 1980s were scanned and entered into NWIS.
Central Mineral and Environmental Resources Science Center geochemical data conversion totaling approximately 250,000 primary documents from paper to electronic format were completed.
California Water Science Center migrated paper well schedules and other groundwater records dating back more than 100 years old. The records define historical climate variability, geologic conditions where natural hazards occur, and the extents of freshwater resources.
Over 100 projects were supported in the 8 years the Data Rescue project was in operation (2006-2013), while an additional 300 projects went unfunded, providing a glimpse of the potential trove of USGS legacy data at risk of damage or loss. The urgency of and strategies for preserving USGS legacy data were discussed at length at the 2014 CSAS&L Data Management Workshop and the 2015 CDI Workshop, further emphasizing a Bureau-wide recognition of the importance of legacy data preservation and release. During the 2015 CDI Workshop, legacy data preservation was rated a top FY16 priority by the Data Management Working Group, laying the groundwork for this proposal, which intends to apply the legacy data inventory and evaluation methods developed through the CDI Legacy Data Inventory project to formalize and extend the inventory successfully started through the Data Rescue Program. By creating a formal method to submit, document and evaluate legacy data known to be in need of preservation, USGS would have a tool that USGS scientists, science centers, and mission areas can use to identify significant historical legacy data that can inform new, data-intensive scientific efforts.
Challenges and improvements for USGS legacy data preservation and release
Based on our experiences managing and preserving USGS legacy data, we have seen two challenges that often undermine legacy data preservation and release:
The most scientifically significant legacy data aren’t always the most recoverable: Legacy data by definition are “dated” because some length of time has passed between when the data were collected, when the project was completed, and when recovery efforts began. The longer the time that’s passed, the more likely it is that project staff aren’t available and supporting project and data documents are lost. Lacking this knowledge and/or documentation, metadata may not be completed, resulting in preserved data that aren’t usable. Usability is a critical element of the USGS data release peer review and approval process, so data that are not usable are more difficult to release. Critically evaluating legacy data for their “release potential,” not just their scientific significance, increases the likelihood of selecting legacy data that will be successfully released.
Research scientists may not have data science skills/expertise/resources: Traditionally, legacy data efforts provide funding directly to the data owner, who is generally a principal investigator and knows the data intimately, but may lack the data science experience, time and tools to preserve and release data in an open format with complete, compliant metadata. In our experience, this can lead to delays in preserving and releasing legacy data. Data scientists cannot and should not replace data owners, but they can provide a significant level of assistance to data owners by applying their data and metadata development experience and tools.
We believe that each of these challenges has a good solution that can improve the efficiency and predictability of preservation and release efforts:
Make “potential for successful release” a primary evaluation factor in prioritizing and selecting legacy data for preservation and release. By developing a method of estimating the feasibility and cost of preserving and releasing data and incorporating it into the evaluation and priority criteria, we can better select and prioritize data sets.
Provide funding to a USGS data scientist to collaborate with data owners and ensure preservation and releases are consistently produced and of the highest quality.
Each objective of this proposal will be addressed in a sequence of 3 phases:
Legacy Data Inventory Submission Period
Evaluation and prioritization of the Legacy Data Inventory; selection of data sets for preservation and release.
Preservation and release of selected datasets.
Phase I: Identification and inventory of USGS data at risk
Data owners will document their legacy data sets electronically, providing the primary project and data set metadata elements needed to score, evaluate and prioritize the legacy data inventory. The core of these metadata elements will be derived from the established “USGS Metadata 20 Questions” form, which has proven effective at gathering metadata from research scientists with little/no data science experience. Narrative fields will be used for evaluating need. Categorical fields will be used to calculate feasibility scores used to determine level of effort required to successfully rescue the proposed data.
Phase II: Evaluation and prioritization of the USGS data at risk requests
The CDI Data Management Working Group’s Data at Risk sub-group will facilitate the evaluation and prioritization of the legacy data inventory. Mission Areas will be engaged to verify inventory submissions are supported programmatically and meet mission objectives. The USGS Records Management Program, Enterprise Publishing Program, and Sciencebase will be consulted to verify submitted legacy data inventory submissions can be released within Bureau records management and data release policies. Once these checkpoints have been verified the Data at Risk sub-group and data scientist will score and prioritize the legacy data inventory based on the following criteria:
Scientific value/significance to USGS mission area and program objectives.
Potential of successfully preserving and releasing the data by the data scientist.
Severity/Imminence of loss or damage to data based on identified risk factors.
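The three criteria above could be combined into a single priority score. The sketch below is illustrative only: the proposal does not say how the criteria are weighted or scaled, so the 1-5 scale, the weights, the `Submission` fields, and the example data sets are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights: the proposal names the criteria but not how they are combined.
WEIGHTS = {"value": 0.4, "release_potential": 0.4, "risk": 0.2}


@dataclass
class Submission:
    title: str
    value: int              # scientific value to mission/program objectives (assumed 1-5)
    release_potential: int  # potential for successful preservation and release (assumed 1-5)
    risk: int               # severity/imminence of loss or damage (assumed 1-5)


def priority_score(s: Submission) -> float:
    """Weighted sum of the three criterion scores."""
    return (WEIGHTS["value"] * s.value
            + WEIGHTS["release_potential"] * s.release_potential
            + WEIGHTS["risk"] * s.risk)


def prioritize(submissions):
    """Return submissions ordered from highest to lowest priority score."""
    return sorted(submissions, key=priority_score, reverse=True)


# Made-up inventory entries, for illustration only.
inventory = [
    Submission("Paper well schedules", value=4, release_potential=5, risk=3),
    Submission("Undocumented tape archive", value=5, release_potential=1, risk=5),
    Submission("Scanned field notebooks", value=3, release_potential=4, risk=2),
]
ranked = prioritize(inventory)
for s in ranked:
    print(f"{priority_score(s):.1f}  {s.title}")
```

With these assumed weights, a well-documented data set of moderate value can outrank a more significant one that has little chance of being released, which is the trade-off the "release potential" criterion is meant to capture.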
Phase III: Preservation and Release of Select, Priority Legacy Data
Working in order of priority as set in Phase II, the data scientist(s) will collaborate with the data owner to complete the process of preserving and releasing their legacy data. Through this data owner/scientist collaboration, the data scientist will create and validate the FGDC-CSDGM metadata and develop the data set in an open-format as documented in the metadata. By process, the data scientist will act as an agent of the data owner, coordinating and completing all steps in each workflow until the IPDS record is approved and disseminated by the Bureau and the Sciencebase data release item(s) are approved, locked and made public by the Sciencebase team. However, while the data scientist is responsible for ensuring all preservation and release tasks are completed consistently and within policies and best practices, the data owner retains all approval of final metadata attribution (e.g., title, authorship), as well as disposition of their legacy data (e.g., pre/post processing methods; derivative data architectures).
At the completion of Phase III, each legacy data release will have the following created by the data scientist:
complete, compliant FGDC-CSDGM metadata
legacy data set(s) in an open-format, publicly discoverable and available from Sciencebase.
a USGS highlight submitted through the SW Region to Reston.
a CDI update describing the data set(s) released and a summary of time and resources required to complete the release.
The Community for Data Integration (CDI) represents a dynamic community of practice focused on advancing science data and information management and integration capabilities across the U.S. Geological Survey and the CDI community. This annual report describes the various presentations, activities, and outcomes of the CDI monthly forums, working groups, virtual training series, and other CDI-sponsored events in fiscal year 2016. The report also describes the objectives and accomplishments of the 13 CDI-funded projects in fiscal year 2016.
First estimates of the probability of survival in a small-bodied, high-elevation frog (Boreal Chorus Frog, Pseudacris maculata), or how historical data can be useful
Muths, E.L., R.D. Scherer, S.M. Amburgey, T. Matthews, A.W. Spencer, and P.S. Corn
In an era of shrinking budgets yet increasing demands for conservation, the value of existing (i.e., historical) data is elevated. Lengthy time series on common, or previously common, species are particularly valuable and may be available only through the use of historical information. We provide first estimates of the probability of survival and longevity (0.67–0.79 and 5–7 years, respectively) for a subalpine population of a small-bodied, ostensibly common amphibian, the Boreal Chorus Frog (Pseudacris maculata (Agassiz, 1850)), using historical data and contemporary, hypothesis-driven information–theoretic analyses. We also test a priori hypotheses about the effects of color morph (as suggested by early reports) and of drought (as suggested by recent climate predictions) on survival. Using robust mark–recapture models, we find some support for early hypotheses regarding the effect of color on survival, but we find no effect of drought. The congruence between early findings and our analyses highlights the usefulness of historical information in providing raw data for contemporary analyses and context for conservation and management decisions.
Landscape intactness has been defined as a quantifiable estimate of naturalness measured on a gradient of anthropogenic influence. We developed a multiscale index of landscape intactness for the Bureau of Land Management’s (BLM) landscape approach, which requires multiple scales of information to quantify the cumulative effects of land use. The multiscale index of landscape intactness represents a gradient of anthropogenic influence as represented by development levels at two analysis scales.
To create the index, we first mapped the surface disturbance footprint of development, for the western U.S., by compiling and combining spatial data for urban development, agriculture, energy and minerals, and transportation for 17 states. All linear features and points were buffered to create a surface disturbance footprint. Buffered footprints and polygonal data were rasterized at 15-meter (m), aggregated to 30-m, and then combined with the existing 30-m inputs for urban development and cultivated croplands. The footprint area was represented as a proportion of the cell and was summed using a raster calculator. To reduce processing time, the 30-m disturbance footprint was aggregated to 90-m. The 90-m resolution surface disturbance footprint is retained as a separate raster data set in this data release (Surface Disturbance Footprint from Development for the Western United States). We used a circular moving window to create a terrestrial development index for two scales of analysis, 2.5-kilometer (km) and 20-km, by calculating the percent of the surface disturbance footprint at each scale. The terrestrial development indexes at both the 2.5-km (Terrestrial Development Index for the Western United States: 2.5-km moving window) and 20-km (Terrestrial Development Index for the Western United States: 20-km moving window) scales were retained as separate raster data sets in this data release. The terrestrial development indexes at two analysis scales were ranked and combined to create the multiscale index of landscape intactness (retained as Landscape Intactness Index for the Western United States) in this data release. To identify intact areas, we focused on terrestrial development index scores less than or equal to 3 percent, which represented relatively low levels of development on multiple-use lands managed by the BLM and other land management agencies.
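The moving-window step can be illustrated with a small, pure-Python sketch. It is not the authors' processing chain: the tiny grid, the function names, and the treatment of window cells that fall outside the grid (they are simply ignored) are assumptions made for illustration.

```python
def circular_offsets(radius_cells):
    """Row/column offsets of cells whose centers lie within the given radius."""
    r = radius_cells
    return [(dr, dc)
            for dr in range(-r, r + 1)
            for dc in range(-r, r + 1)
            if dr * dr + dc * dc <= r * r]


def development_index(footprint, radius_cells):
    """Percent surface disturbance within a circular window around each cell.

    footprint: 2D list holding the disturbed proportion (0.0-1.0) of each cell.
    Window cells outside the grid are ignored (an assumption of this sketch).
    """
    rows, cols = len(footprint), len(footprint[0])
    offsets = circular_offsets(radius_cells)
    index = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, n = 0.0, 0
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += footprint[rr][cc]
                    n += 1
            index[r][c] = 100.0 * total / n
    return index


# Toy 3 x 3 footprint with one fully disturbed cell in the middle.
footprint = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
tdi = development_index(footprint, radius_cells=1)

# Flag cells at or below the 3 percent threshold used to identify intact areas.
intact = [[value <= 3.0 for value in row] for row in tdi]
```

Running the same function with two different radii would give the two analysis scales; how the two resulting indexes are ranked and combined is not specified in enough detail here to sketch.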
The multiscale index of landscape intactness was designed to be flexible, transparent, defensible, and applicable across multiple spatial scales, ecological boundaries, and jurisdictions. To foster transparency and facilitate interpretation, the multiscale index of landscape intactness data release retains four component data sets to enable users to interpret the multiscale index of landscape intactness: the surface disturbance footprint, the terrestrial development index summarized at two scales (2.5-km and 20-km circular moving windows), and the overall landscape intactness index. The multiscale index is a proposed core indicator to quantify landscape integrity for the BLM Assessment, Inventory, and Monitoring program and is intended to be used in conjunction with additional regional- or local-level information not available at national levels (such as invasive species occurrence) necessary to evaluate ecological integrity for the BLM landscape approach.
Measuring precipitation in semi-arid landscapes is important for understanding the processes related to rainfall and run-off; however, measuring precipitation accurately can often be challenging, especially within remote regions where precipitation instruments are scarce. Typically, rain-gauges are sparsely distributed, and research comparing rain-gauge and RADAR precipitation estimates reveals that RADAR data are often misleading, especially for monsoon season convective storms. This study investigates an alternative way to map the spatial and temporal variation of precipitation inputs along ephemeral stream channels using Normalized Difference Vegetation Index (NDVI) derived from Landsat Thematic Mapper imagery. NDVI values from 26 years of pre- and post-monsoon season Landsat imagery were derived across Yuma Proving Ground (YPG), a region covering 3,367 km² of semiarid landscapes in southwestern Arizona, USA. The change in NDVI from a pre- to post-monsoon season image along ephemeral stream channels explained 73% of the variance in annual monsoonal precipitation totals from a nearby rain-gauge. In addition, large seasonal changes in NDVI along channels were useful in determining when and where flow events have occurred.
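As a rough illustration of the quantity being compared, NDVI is computed per pixel as (NIR - Red) / (NIR + Red); for Landsat Thematic Mapper these are bands 4 and 3. The sketch below computes the mean pre- to post-monsoon change over a set of channel pixels. The reflectance values and function names are made up for illustration and are not taken from the study.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    denom = nir + red
    return (nir - red) / denom if denom else 0.0


def mean_ndvi_change(pre, post):
    """Mean post- minus pre-monsoon NDVI over the same channel pixels.

    pre, post: lists of (nir, red) reflectance pairs, one pair per pixel.
    """
    changes = [ndvi(*after) - ndvi(*before) for before, after in zip(pre, post)]
    return sum(changes) / len(changes)


# Hypothetical reflectances for two channel pixels.
pre_monsoon = [(0.30, 0.20), (0.28, 0.21)]
post_monsoon = [(0.45, 0.15), (0.40, 0.16)]
delta = mean_ndvi_change(pre_monsoon, post_monsoon)
print(f"Mean NDVI change: {delta:.3f}")
```

A positive change indicates greening along the channel after the monsoon; relating that change to rain-gauge totals, as the study does, would require a regression over many years of image pairs.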
Topock Marsh is a large wetland adjacent to the Colorado River and the main feature of Havasu National Wildlife Refuge (Havasu NWR) in southern Arizona. In 2010, the U.S. Fish and Wildlife Service (FWS) and Bureau of Reclamation began a project to improve water management capabilities at Topock Marsh and protect habitats and species. Initial construction required a drawdown, which caused below-average inflows and water depths in 2010–11. U.S. Geological Survey Fort Collins Science Center (FORT) scientists collected an assemblage of biotic, abiotic, and hydrologic data from Topock Marsh during the drawdown and immediately after, thus obtaining valuable information needed by FWS.
Building upon that work, FORT developed a decision support system (DSS) to better understand ecosystem health and function of Topock Marsh under various hydrologic conditions. The DSS was developed using a spatially explicit geographic information system package of historical data, habitat indices, and analytical tools to synthesize outputs for hydrologic time periods. Deliverables include high-resolution orthorectified imagery of Topock Marsh; a DSS tool that can be used by Havasu NWR to compare habitat availability associated with three hydrologic scenarios (dry, average, wet years); and this final report which details study results. This project, therefore, has addressed critical FWS management questions by integrating ecologic and hydrologic information into a DSS framework. This DSS will assist refuge management to make better informed decisions about refuge operations and better understand the ecological results of those decisions by providing tools to identify the effects of water operations on species-specific habitat and ecological processes. While this approach was developed to help FWS use the best available science to determine more effective water management strategies at Havasu NWR, technologies used in this study could be applied elsewhere within the region.
Estimating the economic impacts of ecosystem restoration—Methods and case studies
Cathy Cullinane Thomas
Cullinane Thomas, Catherine, Christopher Huber, Kristin Skrabis and Joshua Sidon
Federal investments in ecosystem restoration projects protect Federal trusts, ensure public health and safety, and preserve and enhance essential ecosystem services. These investments also generate business activity and create jobs. It is important for restoration practitioners to be able to quantify the economic impacts of individual restoration projects in order to communicate the contribution of these activities to local and national stakeholders. This report provides a detailed description of the methods used to estimate economic impacts of case study projects and also provides suggestions, lessons learned, and trade-offs between potential analysis methods.
This analysis estimates the economic impacts of a wide variety of ecosystem restoration projects associated with U.S. Department of the Interior (DOI) lands and programs. Specifically, the report provides estimated economic impacts for 21 DOI restoration projects associated with Natural Resource Damage Assessment and Restoration cases and Bureau of Land Management lands. The study indicates that ecosystem restoration projects provide meaningful economic contributions to local economies and to broader regional and national economies, and, based on the case studies, we estimate that between 13 and 32 job-years and between $2.2 and $3.4 million in total economic output are contributed to the U.S. economy for every $1 million invested in ecosystem restoration. These results highlight the magnitude and variability in the economic impacts associated with ecosystem restoration projects and demonstrate how investments in ecosystem restoration support jobs and livelihoods, small businesses, and rural economies. In addition to providing improved information on the economic impacts of restoration, the case studies included with this report highlight DOI restoration efforts and tell personalized stories about each project and the communities that are positively affected by restoration activities. Individual case studies are provided in appendix 1 of this report and are available from an online database at https://www.fort.usgs.gov/economic-impacts-restoration.
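To make the per-$1 million ranges concrete, the sketch below scales them to a hypothetical $5 million investment. Linear scaling is an assumption of this example, not a claim of the report; actual impacts depend on the project and the local economy.

```python
# Ranges reported per $1 million invested in ecosystem restoration.
JOB_YEARS_PER_MILLION = (13, 32)
OUTPUT_MILLIONS_PER_MILLION = (2.2, 3.4)


def impact_range(investment_millions):
    """Scale the reported per-$1M ranges linearly (an assumption of this sketch)."""
    jobs = tuple(rate * investment_millions for rate in JOB_YEARS_PER_MILLION)
    output = tuple(rate * investment_millions for rate in OUTPUT_MILLIONS_PER_MILLION)
    return jobs, output


# Hypothetical $5 million restoration investment.
jobs, output = impact_range(5)
print(f"Job-years: {jobs[0]} to {jobs[1]}")
print(f"Total output: ${output[0]:.1f} to ${output[1]:.1f} million")
```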
Spatial occupancy models for predicting metapopulation dynamics and viability following reintroduction
Chandler, R.B., E.L. Muths, B.H. Sigafus, C.R. Schwalbe, C.J. Jarchow, J. Christopher and B.R. Hossack
The reintroduction of a species into its historic range is a critical component of conservation programmes designed to restore extirpated metapopulations. However, many reintroduction efforts fail, and the lack of rigorous monitoring programmes and statistical models has prevented a general understanding of the factors affecting metapopulation viability following reintroduction.
Spatially explicit metapopulation theory provides the basis for understanding the dynamics of fragmented populations linked by dispersal, but the theory has rarely been used to guide reintroduction programmes because most spatial metapopulation models require presence–absence data from every site in the network, and they do not allow for observation error such as imperfect detection.
We develop a spatial occupancy model that relaxes these restrictive assumptions and allows for inference about metapopulation extinction risk and connectivity. We demonstrate the utility of the model using six years of data on the Chiricahua leopard frog Lithobates chiricahuensis, a threatened desert-breeding amphibian that was reintroduced to a network of sites in Arizona, USA in 2003.
Our results indicate that the model can generate precise predictions of extinction risk and produce connectivity maps that can guide conservation efforts following reintroduction. In the case of L. chiricahuensis, many sites were functionally isolated, and 82% of sites were characterized by intermittent water availability and high local extinction probabilities (0·84, 95% CI: 0·64–0·99). However, under the current hydrological conditions and spatial arrangement of sites, the risk of metapopulation extinction is estimated to be <3% over a 50-year time horizon.
Low metapopulation extinction risk appears to result from the high dispersal capability of the species, the high density of sites in the region and the existence of predator-free permanent wetlands with low local extinction probabilities. Should management be required, extinction risk can be reduced by either increasing the hydroperiod of existing sites or by creating new sites to increase connectivity.
Synthesis and applications. This work demonstrates how spatio-temporal statistical models based on ecological theory can be applied to forecast the outcomes of conservation actions such as reintroduction. Our spatial occupancy model should be particularly useful when management agencies lack the funds to collect intensive individual-level data.
Tamarisk beetle (Diorhabda spp.) in the Colorado River basin: Synthesis of an expert panel forum
Bloodworth, Benjamin R.; Shafroth, Patrick B.; Sher, Anna A.; Manners, Rebecca B.; Bean, Daniel W.; Johnson, Matthew J.; Hinojosa-Huerta, Osvel
In 2001, the U.S. Department of Agriculture approved the release of a biological control agent, the tamarisk beetle (Diorhabda spp.), to naturally control tamarisk populations and provide a less costly, and potentially more effective, means of removal compared with mechanical and chemical methods. The invasive plant tamarisk (Tamarix spp.; saltcedar) occupies hundreds of thousands of acres of river floodplains and terraces across the western half of the North American continent. Its abundance varies, but can include dense monocultures, and can alter some physical and ecological processes associated with riparian ecosystems.
The tamarisk beetle now occupies hundreds of miles of rivers throughout the Upper Colorado River Basin (UCRB) and is spreading into the Lower Basin. The efficacy of the beetle is evident, with many areas repeatedly experiencing tamarisk defoliation. While many welcome the beetle as a management tool, others are concerned by the ecosystem implications of widespread defoliation of a dominant woody species. As an example, defoliation may possibly affect the nesting success of the endangered southwestern willow flycatcher (Empidonax traillii extimus).
In January 2015, the Tamarisk Coalition convened a panel of experts to discuss and present information on probable ecological trajectories in the face of widespread beetle presence and to consider opportunities for restoration and management of riparian systems in the Colorado River Basin (CRB). An in-depth description of the panel discussion follows.
Wood decay in desert riverine environments
Andersen, Douglas; Stricker, Craig A.; Nelson, S. Mark
Floodplain forests and the woody debris they produce are major components of riverine ecosystems in many arid and semiarid regions (drylands). We monitored breakdown and nitrogen dynamics in wood and bark from a native riparian tree, Fremont cottonwood (Populus deltoides subsp. wislizeni), along four North American desert streams. We placed locally-obtained, fresh, coarse material [disks or cylinders (∼500–2000 cm³)] along two cold-desert and two warm-desert rivers in the Colorado River Basin. Material was placed in both floodplain and aquatic environments, and left in situ for up to 12 years. We tested the hypothesis that breakdown would be fastest in relatively warm and moist aerobic environments by comparing the time required for 50% loss of initial ash-free dry matter (T50) calculated using exponential decay models incorporating a lag term. In cold-desert sites (Green and Yampa rivers, Colorado), disks of wood with bark attached exposed for up to 12 years in locations rarely inundated lost mass at a slower rate (T50 = 34 yr) than in locations inundated during most spring floods (T50 = 12 yr). At the latter locations, bark alone lost mass at a rate initially similar to whole disks (T50 = 13 yr), but which subsequently slowed. In warm-desert sites monitored for 3 years, cylinders of wood with bark removed lost mass very slowly (T50 = 60 yr) at a location never inundated (Bill Williams River, Arizona), whereas decay rate varied among aquatic locations (T50 = 20 yr in Bill Williams River; T50 = 3 yr in Las Vegas Wash, an effluent-dominated stream warmed by treated wastewater inflows). Invertebrates had a minor role in wood breakdown except at in-stream locations in Las Vegas Wash. The presence and form of change in nitrogen content during exposure varied among riverine environments.
Our results suggest woody debris breakdown in desert riverine ecosystems is primarily a microbial process with rates determined by landscape position, local weather, and especially the regional climate through its effect on the flow regime. The increased warmth and aridity expected to accompany climate change in the North American southwest will likely retard the already slow wood decay process on naturally functioning desert river floodplains. Our results have implications for designing environmental flows to manage floodplain forest wood budgets, carbon storage, and nutrient cycling along regulated dryland rivers.
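The T50 values reported above come from exponential decay models with a lag term. The abstract does not give the exact model form, so the sketch below assumes one common formulation: no mass loss until the lag time has elapsed, then simple exponential decay at rate k. Under that assumption, T50 = lag + ln(2) / k. The parameter values are hypothetical and are not fitted to the study's data.

```python
import math


def mass_remaining(t, k, lag):
    """Fraction of initial ash-free dry mass remaining after t years.

    Assumed form: no loss during the lag period, then exponential decay at rate k.
    """
    if t <= lag:
        return 1.0
    return math.exp(-k * (t - lag))


def t50(k, lag):
    """Time (years) required for 50% mass loss under the assumed model."""
    return lag + math.log(2) / k


# Hypothetical parameters, not taken from the study.
k = 0.1    # decay rate per year
lag = 1.0  # years before measurable mass loss begins

half_time = t50(k, lag)
print(f"T50 = {half_time:.1f} yr")
```

A longer lag or a smaller decay rate both increase T50, which is why slow-decaying sites such as rarely inundated floodplain locations show the largest values.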