The complexity of natural resource issues requires U.S. Geological Survey (USGS) scientists to be able to share and integrate data and ideas across a variety of scientific disciplines and information systems. However, because many information systems and databases are initially conceived to address more local needs and issues, they tend to be geographically dispersed and built on a variety of platforms. Although all are useful within their domains, they are neither connected nor comprehensive, and accessibility is not universal.
Consider the following “user story”:
The Wyoming Landscape Conservation Initiative (WLCI) comprises many federal and state agencies, conservation districts, county commissions, private landowners, and nongovernmental organizations involved in a coordinated strategy to address conservation of wildlife and other natural resources across southwest Wyoming. To conduct the necessary research for informing resource management decisions, data are needed not only on the natural resources themselves but also on human activities and infrastructure across the landscape, such as energy development and roads. Many datasets exist in various places with varying degrees of accessibility, and new data need to be collected. All of the datasets need to be located, described, consolidated, and made available for a variety of scientific and management needs. What datasets exist and where? Are they usable? What new data are being collected? What areas do the data cover geographically, geologically, or jurisdictionally?
Where does an individual WLCI researcher, manager, or citizen even begin to look for the information they need?
It would be so much easier if they had a single, comprehensive data resource that could look for them and not only find and retrieve existing data, but also store and organize new data as they become available….
One solution to meet this need is the "ScienceBase" Catalog. ScienceBase provides a data cataloging and collaborative data management platform for USGS scientists and partners. It is essentially a vast structured database containing mostly "information stubs" that point to data, ranging from field records to Web sites to publications (Fig. 1). This integrated database of information pulls from many different data sources, such as existing data systems, metadata catalog systems, nondigitized collections, and new original content. In addition, ScienceBase offers Web services that drive core applications or platforms like myUSGS, the Geographic Management Information System, and other evolving systems. In short, it is a one-stop shop for finding data.
Metadata (descriptive information about a particular data source) is really what makes ScienceBase tick. Many, if not most, of the items coming into ScienceBase are and will be coming from other data systems. To accommodate this diversity, ScienceBase uses a fairly simple and high-level set of metadata fields to describe ScienceBase Items (databases, publications, people, Web sites, etc.), along with extended attributes as necessary. This allows USGS developers, including the USGS Fort Collins Science Center's Web Applications ("Apps") Team and the USGS Central Region Geospatial Information Office, to quickly build on the vast store of basic information and make valuable connections between items (e.g., identifying the data resources necessary to drive a scientific model), while allowing metadata to grow over time through contributions from scientists and data management professionals.
The primary function of the ScienceBase Metadata Catalog is to integrate metadata about many types of items of importance to scientific research and science-based decision-making, and to do so in a format that provides for discovery and access of resources from the best available sources. A secondary function is to provide for ongoing improvements in the knowledge about scientific data and information resources through logical indexing of common elements (like topics or geographic context) and capturing of value-added metadata. Examples might be short descriptions of relationships between items, written reviews and ratings from experts and other users, or topical “tags” to classify items.
Given the many sources from which data and other resources might come, the following guiding principle for development of the ScienceBase metadata profile is fundamental:
Items discovered by way of ScienceBase applications must be understandable to a user outside of the items’ original context, yet clearly be linked back to their original context.
Therefore, where possible, a ScienceBase item will draw its core fields from, and maintain a link to, its original metadata record coming from a metadata catalog service connection, metadata document harvesting operation, or some other established source for metadata (Fig. 2).
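As a rough illustration of this principle, the sketch below models an item that carries a small set of core fields, optional extended attributes, and a link back to the record it was drawn from. It is not the actual ScienceBase data model: the class names, field names, sample record, and example.org URL are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceLink:
    """Pointer back to the record an item was derived from (hypothetical)."""
    system: str      # name of the originating catalog or data system
    record_id: str   # identifier of the record within that system
    url: str         # where the full original metadata can be read


@dataclass
class CatalogItem:
    """A small, high-level set of core fields plus optional extensions."""
    title: str
    item_type: str                        # e.g. "publication", "project"
    summary: str = ""
    source: Optional[SourceLink] = None   # None when the catalog is the item's home
    extended: dict = field(default_factory=dict)  # extra attributes as needed
    tags: set = field(default_factory=set)        # user-added metadata


def from_harvested_record(record: dict, system: str) -> CatalogItem:
    """Build an item from a harvested record, keeping the link to its origin."""
    core_keys = {"id", "title", "type", "summary", "url"}
    return CatalogItem(
        title=record["title"],
        item_type=record.get("type", "unknown"),
        summary=record.get("summary", ""),
        source=SourceLink(system=system, record_id=record["id"], url=record["url"]),
        # Anything outside the core fields is kept as an extended attribute.
        extended={k: v for k, v in record.items() if k not in core_keys},
    )


# Illustrative record only; not real data.
harvested = {
    "id": "rec-001",
    "title": "Example road network dataset",
    "type": "dataset",
    "summary": "Placeholder description.",
    "url": "https://example.org/records/rec-001",
    "projection": "UTM zone 12N",
}
item = from_harvested_record(harvested, system="example-metadata-catalog")
item.tags.add("infrastructure")
print(item.title, "->", item.source.url)
```

Keeping the core fields few and pushing everything else into an open-ended dictionary is one way to let records from very different systems sit side by side while metadata grows over time.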
At the core of ScienceBase are "ScienceBase Items," the individual data records making up the aggregated metadata database. With the relatively simple, yet powerful metadata scheme described above, ScienceBase will support many different types of base items, which fall generally into two categories: Foundational Data and Resource Data. (All ScienceBase items share certain characteristics, however; see box at right.)
The following types of items are considered foundational ScienceBase “item types” that are generally stored directly in ScienceBase (meaning ScienceBase is their primary "home" and they are maintained in the ScienceBase data model). Although these items may be originally derived from another source, they are now a core part of ScienceBase. They are considered foundational in that they connect to several other items and serve to facilitate integrated discovery of other resources.
People, Organizations, and Teams. Finding a scientist or organization involved in a project or the creation of a database can often lead to the discovery of other important resources. Therefore, information about people, organizations, and teams (and their locations) is often a key metadata attribute of other types of resources that can be critical in the discovery process. An existing application developed for myUSGS and other platforms, called PLOT (“People, Locations, Organizations, and Teams”), has been incorporated into ScienceBase. PLOT provides contact information, physical locations, descriptions or job titles, and related Web sites for individuals and groups working on projects catalogued in ScienceBase.
Citations/Publications. These will often be accessed as files for download, possibly with preview or full-view options within ScienceBase. In some cases, ScienceBase may contain basic bibliographic references, harvested from another source, that link to a Web site where the publication may be obtained. A publication record can also refer to an item that is unavailable online, with instructions on how to physically obtain it. The aim of ScienceBase is to make it as seamless and simple as possible to get to the actual final document. Publications also will have relationships to many other items from a reference standpoint and to people as author references. Sources include the USGS Publications Warehouse as well as ScienceDirect, one of the major sources USGS uses for access to journal articles by USGS scientists.
Projects. Along with people, organizations, teams, and publications, projects are a critical underlying building block for data integration, exploration, and discovery. Projects tie together people and organizations working on an effort along with the products (data assets and publications) that are generated as part of a project. Because much of the scientific community views the world in terms of “projects,” project records are a central organizing factor around which other data and information can be accumulated. Similarly, project records in ScienceBase are an important reference element since they will often "spawn" other ScienceBase Item Types (e.g., publications, databases, applications, and the like). Information about science and other projects can be aggregated from entries in the USGS or other project tracking systems.
Scientific Sites (data collection locations). One of the central user stories behind ScienceBase involves a person drawing a box around an area of interest on a map and finding out all of the places in that area where data have been collected or produced. This is partly accomplished by including online data resources that have a geospatial component inherent in their data. However, the FORT Web Apps Team is creating a specialized "index" of the scientific sites discovered through all of the various data sources catalogued in ScienceBase. This service will facilitate discovery of important sites, which can then be used to seek actual data from those sites available through other ScienceBase services.
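The "draw a box on a map" user story can be sketched as a simple filter over an index of sites. This is a minimal illustration, not the index the FORT Web Apps Team is building: the `Site` and `BoundingBox` types, the site names, and the coordinates are all made up, and the box test assumes decimal degrees and does not handle boxes that cross the 180° meridian.

```python
from typing import NamedTuple


class Site(NamedTuple):
    name: str
    lon: float    # decimal degrees, east positive
    lat: float    # decimal degrees, north positive
    source: str   # catalogued data source the site was discovered through


class BoundingBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float

    def contains(self, site: Site) -> bool:
        """True when the site falls inside or on the edge of the box."""
        return (self.west <= site.lon <= self.east
                and self.south <= site.lat <= self.north)


def sites_in_box(sites, box):
    """Return every indexed site located within the user's box."""
    return [s for s in sites if box.contains(s)]


# Hypothetical index entries; coordinates are illustrative only.
sites = [
    Site("Site A", -109.5, 41.8, "example-source-1"),
    Site("Site B", -108.9, 42.3, "example-source-2"),
    Site("Site C", -104.8, 41.1, "example-source-1"),
]
box = BoundingBox(west=-111.0, south=41.0, east=-107.0, north=43.0)
found = sites_in_box(sites, box)
print([s.name for s in found])  # ['Site A', 'Site B']
```

Because each site records the source it came from, the result of such a query can serve as the starting point for requesting the actual data from that source.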
Resource Data items, by contrast, are the actual data and information resources that users search for via the ScienceBase Catalog to drive a particular application or scientific analysis.
Applications. A scientific application is like a model: something users can interact with or run against data in some way. A core component of the ScienceBase concept is the continual addition of applications that can be plugged into many different systems over time. Anyone can tap into this “toolbox” and take advantage of its capabilities. The toolbox includes existing USGS tools designed to help with data and information management and allows new tools to be added over time. These tools might be accessed through a Web link, a specific interface, or a downloadable file.
Web Resources. These refer to an entire site somewhere on the Web or intranet. Web resources are generally collections of information. Sources include USGS Web sites and Web links in the Comprehensive Science Catalog.
Offline Data or Physical Collections and Items. This general category can encompass several other specific item types. Currently, the Web Apps Team is working with examples from geoscience collections held by the USGS or State Geological Surveys. These are generally physical things like rock cores but also include paper records of various types. The general category of "offline data" can include digital files that are not directly accessible online. Examples might include proprietary file formats that could possibly be made available online by request.
Ultimately, ScienceBase is about collecting a little bit of information about many different resources in a database that can be used to drive robust search interfaces and Web services. These features in turn allow users to create their own customized search interfaces and will facilitate collecting user-added metadata like tags, reviews, and relationships. For example, research communities can set up their own "virtual catalogs" (called “contexts” in ScienceBase) that contain items of particular importance to their work. To illustrate, let’s revisit the WLCI user story presented above, which now reads as follows because of ScienceBase:
The Wyoming Landscape Conservation Initiative generates data from many different data collection efforts. It also uses source data from a number of non-WLCI projects (national and state-wide) to build WLCI-specific models and derived datasets. The ScienceBase Catalog allows the WLCI community to document, search for, view, distinguish, and access both types of data. The search function evaluates the entire body of metadata submitted for the object, not just a title or summary. Data can be viewed in tabular form or as interactive maps (Fig. 2) built on submitted project or dataset footprints, or generated from the metadata itself. Now, when WLCI scientists and managers are looking for specific data, they can go to one Web resource that integrates and searches all of the available data sources for what is needed (Fig. 3).
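The difference between searching only a title or summary and searching the entire body of metadata can be shown with a small sketch. It is an assumption-laden illustration rather than the ScienceBase search implementation: the record layout, field names, and sample values are hypothetical, and matching is a plain case-insensitive substring test.

```python
def flatten_text(value):
    """Yield every string found anywhere in a nested metadata record."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from flatten_text(v)
    elif isinstance(value, (list, tuple, set)):
        for v in value:
            yield from flatten_text(v)
    # Non-text values (numbers, None) are ignored in this sketch.


def search(records, term):
    """Return records whose metadata contains the term in any field."""
    term = term.lower()
    return [r for r in records
            if any(term in text.lower() for text in flatten_text(r))]


# Hypothetical records; the search term appears outside title and summary.
records = [
    {"title": "Dataset one", "summary": "Placeholder.",
     "keywords": ["sagebrush", "habitat"]},
    {"title": "Dataset two", "summary": "Placeholder.",
     "contacts": [{"org": "Example Conservation District"}]},
]
hits = search(records, "sagebrush")
print([r["title"] for r in hits])  # ['Dataset one']
```

A title-only search would miss both records here, since the distinguishing terms appear only in keywords and contact details.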
When fully implemented, ScienceBase will serve as a dynamic, user-driven resource for the advancement and support of USGS science across all of its disciplines.