Fort Collins Science Center


High Throughput Computing: A Solution for Scientific Analysis

Geostatistical Simulations and Statistical Analysis

Background

The U.S. Geological Survey, other agencies in the Department of the Interior, and users in the Geographic Information Science (GIS)/Remote Sensing (RS) communities regularly work with raster datasets such as Landfire program products and other remote-sensing data. In many scientific applications, these datasets are clipped and projected from a regional map projection to a local map projection. Such map projection changes are required to examine ecological relationships within a local study area. Investigators at the USGS Fort Collins Science Center (FORT) have discovered that changing raster datasets between regional and local map projections can drastically change the data, which inadvertently introduces error and affects the interpretation of any analysis using these data. Although several scientific journal articles have demonstrated that errors occur while projecting global raster datasets, few if any have considered the effects of regional and local map projection changes because they were not believed to be significant. Additionally, no known studies have examined how different characteristics of the data affect the type and magnitude of change.

Data characteristics that can precipitate changes during projection include (1) inherent errors associated with the to-and-from map projection (influenced by shape, area, direction, and distance distortions); (2) spatial autocorrelation (heterogeneity of surface); (3) number of categories (discrete) and the distributions across these categories (composition); (4) resolutions; and (5) structural characteristics such as global/local variance, isotropic/anisotropic conditions, and stationary/non-stationary conditions (Webster and Oliver 2007; Smith et al. 2007; Cressie 1993).
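To make item (3) concrete, the sketch below resamples a toy categorical raster onto a rotated and rescaled grid and compares class proportions before and after. It is illustrative only: the rotation and rescaling stand in for a real map projection change, nearest-neighbor resampling is assumed, and every function name is hypothetical rather than part of the FORT workflow.

```python
import math
from collections import Counter


def make_raster(n):
    """Toy categorical raster: three classes laid out in vertical bands."""
    return [[(3 * c) // n for c in range(n)] for _ in range(n)]


def resample_nearest(src, angle_deg, scale):
    """Resample src onto a rotated, rescaled grid using nearest-neighbor lookup.

    The rotation and scale are a stand-in for a projection change.
    Output cells that fall outside the source are set to None (NoData).
    """
    n = len(src)
    centre = (n - 1) / 2.0
    a = math.radians(angle_deg)
    out = []
    for r in range(n):
        row = []
        for c in range(n):
            # Map the output cell centre back into source coordinates.
            x, y = (c - centre) * scale, (r - centre) * scale
            sx = centre + x * math.cos(a) - y * math.sin(a)
            sy = centre + x * math.sin(a) + y * math.cos(a)
            i, j = int(round(sy)), int(round(sx))
            row.append(src[i][j] if 0 <= i < n and 0 <= j < n else None)
        out.append(row)
    return out


def composition(raster):
    """Proportion of valid (non-NoData) cells in each class."""
    cells = [v for row in raster for v in row if v is not None]
    counts = Counter(cells)
    return {k: counts[k] / len(cells) for k in sorted(counts)}


if __name__ == "__main__":
    native = make_raster(30)
    warped = resample_nearest(native, angle_deg=10.0, scale=0.9)
    print("native composition:", composition(native))
    print("warped composition:", composition(warped))
```

With a zero angle and unit scale the output equals the input, which gives a quick sanity check before trying other settings.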

FORT is evaluating changes in raster data for both national datasets (e.g., Landfire data products) and simulated datasets dispersed across the United States. Simulated datasets will provide a means to control characteristics of the landscape and determine which characteristics have the greatest influence on how the data change when re-projected. We are simulating the data using a geostatistical approach, which will provide equally probable realizations of a random field given the random field’s structure (e.g., mean, residual variogram, intrinsic stationarity). An example of several simulations using varying parameters is provided in Figure 1.
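As a rough illustration of what "equally probable realizations of a random field" means, the sketch below draws unconditional Gaussian realizations on a small grid by Cholesky factorization of a covariance matrix. This is one textbook technique, not necessarily the method used at FORT. The exponential covariance model, the geometric anisotropy parameters (angle_deg, ratio), and all function names are assumptions chosen for illustration; under second-order stationarity the covariance is the sill minus the variogram.

```python
import math
import random


def aniso_distance(p, q, angle_deg, ratio):
    """Separation distance under geometric anisotropy.

    The lag is rotated onto the major axis (angle_deg) and the
    minor-axis component is divided by ratio (0 < ratio <= 1), so
    correlation decays faster across the major direction.
    """
    a = math.radians(angle_deg)
    dx, dy = q[0] - p[0], q[1] - p[1]
    major = dx * math.cos(a) + dy * math.sin(a)
    minor = -dx * math.sin(a) + dy * math.cos(a)
    return math.hypot(major, minor / ratio)


def cholesky(m):
    """Lower-triangular L with L * L^T == m (m symmetric positive definite)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def simulate_field(nrows, ncols, mean, sill, corr_range, angle_deg, ratio, seed):
    """One realization: mean + L z, with z standard normal and C = L L^T."""
    rng = random.Random(seed)
    pts = [(c, r) for r in range(nrows) for c in range(ncols)]
    cov = [[sill * math.exp(-aniso_distance(p, q, angle_deg, ratio) / corr_range)
            for q in pts] for p in pts]
    for i in range(len(pts)):
        cov[i][i] += 1e-9  # tiny jitter for numerical stability
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in pts]
    vals = [mean + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(pts))]
    return [vals[r * ncols:(r + 1) * ncols] for r in range(nrows)]


if __name__ == "__main__":
    # Two seeds give two different realizations of the same field structure.
    for seed in (1, 2):
        field = simulate_field(6, 6, mean=10.0, sill=4.0, corr_range=3.0,
                               angle_deg=45.0, ratio=0.3, seed=seed)
        print([round(v, 2) for v in field[0]])
```

Dense Cholesky factorization scales poorly with grid size, so a sketch like this is only practical for small grids. Production work would normally rely on an established geostatistics package.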

Figure 1. Stochastic simulations using varying anisotropic conditions.

 

Problem

We are analyzing two main types of data (national remotely sensed data and simulated data). For each dataset, we will be quantifying numerous spatial and naïve statistics and comparing these between the native map projection and the re-projected dataset. We are also selecting 12 areas of interest (AOIs) for national products and 1 AOI for simulated data. Therefore, our analysis involves a significant amount of data and computations, which will require substantial processing time.
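The page does not name the specific spatial statistics, so the sketch below uses Moran's I with rook (edge-sharing) neighbors as one common measure of spatial autocorrelation. Computing it on the native raster and again on the re-projected raster gives one number per version to compare. The function name and the toy grids are hypothetical.

```python
def morans_i(grid):
    """Moran's I for a rectangular grid with binary rook-contiguity weights.

    I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where W is the sum of all weights and m is the grid mean.
    """
    nrows, ncols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant grid")
    num = 0.0
    wsum = 0
    for r in range(nrows):
        for c in range(ncols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    wsum += 1
    return (n / wsum) * (num / denom)


if __name__ == "__main__":
    # A checkerboard is perfectly dispersed; a gradient is clustered.
    checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
    gradient = [[r for _ in range(4)] for r in range(4)]
    print("checkerboard:", morans_i(checker))
    print("gradient:", morans_i(gradient))
```

Values near -1 indicate strong dispersion, values near 0 indicate no spatial pattern, and positive values indicate clustering.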

Solution

Step 1

Due to the quantity of data and processing time (Table 1), we will be using High Throughput Computing (HTC) and possibly High Throughput Performance Computing (HTPC) systems. Figure 2 describes the general workflow required for implementation of this analysis as well as how HTC/HTPC are an integral part of the process.

Table 1. Synopsis of data development. We will post the estimated processing times once the project is completed (currently shown as Not Available [NA]).

Description                          Total Number Files   File Size (GB)   Estimated Processing Time (per job)
National Data (Unique)               6                    --               --
National Data (12 AOIs)              72                   --               --
National-Projected Data (x2)         144                  --               NA
Simulated Nonspatial Data (1 AOI)    1,100                NA               NA
Simulated Spatial Data (1 AOI)       1,100                NA               NA
Simulated-Projected Data (x2)        2,200                NA               NA
Total                                4,622                NA               NA

 

Figure 2. Project workflow. The lower right quadrant (yellow) is where HTPC will likely be used. Everything else is HTC.
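A workflow like this can be written down as a directed acyclic graph for HTCondor's DAGMan, which Step 2 says will be used for job submission. The sketch below generates a DAGMan input file from a job list and a set of parent/child edges. The JOB, RETRY, and PARENT ... CHILD lines are standard DAGMan syntax, but the job names, submit-file names, and retry count are hypothetical placeholders, not the project's actual configuration.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def build_dag(jobs, edges, retries=2):
    """Return DAGMan input text.

    jobs:  dict mapping job name -> HTCondor submit file (hypothetical names)
    edges: list of (parent, child) pairs; child runs after parent succeeds
    Raises CycleError if the edges contain a cycle, because a DAG cannot.
    """
    for parent, child in edges:
        if parent not in jobs or child not in jobs:
            raise KeyError(f"edge ({parent}, {child}) names an unknown job")
    sorter = TopologicalSorter({name: set() for name in jobs})
    for parent, child in edges:
        sorter.add(child, parent)
    order = list(sorter.static_order())  # validates acyclicity
    lines = [f"JOB {name} {jobs[name]}" for name in order]
    lines += [f"RETRY {name} {retries}" for name in order]
    lines += [f"PARENT {parent} CHILD {child}" for parent, child in edges]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical four-stage workflow: clip, two re-projections, statistics.
    jobs = {
        "clip": "clip.sub",
        "project_a": "project_a.sub",
        "project_b": "project_b.sub",
        "stats": "stats.sub",
    }
    edges = [
        ("clip", "project_a"),
        ("clip", "project_b"),
        ("project_a", "stats"),
        ("project_b", "stats"),
    ]
    print(build_dag(jobs, edges), end="")
```

The resulting text would be saved to a file and submitted with condor_submit_dag. RETRY tells DAGMan how many times to re-run a failed node, and the PARENT/CHILD lines encode the ordering between jobs.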

Step 2

ESRI ArcGIS¹ software, which is restricted to a Windows OS, will be used to manipulate map projections via geoprocessing. Normalized Root Mean Square Error (NRMSE), naïve statistical properties, spatial autocorrelation, and additional statistics will be assessed in R¹ (http://www.r-project.org/) using HTC. This project will use an HTCondor¹ HTC system with approximately 250 computer cores, a dedicated High Performance Computing (HPC) server, and a network file server. Directed acyclic graphs (DAGs) will be used for submitting jobs via HTC and HTPC, which will assist with re-submitting failed jobs as well as handling dependencies between jobs.
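For reference, a minimal NRMSE calculation might look like the sketch below. The project runs its statistics in R; Python is used here only for illustration. The sketch assumes the native and re-projected values have already been paired cell by cell, and it normalizes by the range of the reference values, which is one of several conventions in use (dividing by the mean is another). The function name and sample values are hypothetical.

```python
import math


def nrmse(reference, candidate):
    """Root mean square error divided by the range of the reference values."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    span = max(reference) - min(reference)
    if span == 0:
        raise ValueError("reference values have zero range")
    return math.sqrt(mse) / span


if __name__ == "__main__":
    # Hypothetical paired cell values before and after re-projection.
    native = [10, 12, 15, 20, 18, 11]
    reprojected = [10, 13, 15, 19, 18, 12]
    print("NRMSE:", nrmse(native, reprojected))
```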

Preliminary results for the national data assessment show that running these analyses on a single machine requires approximately 460 hours, while HTC requires approximately 36 hours. As for the geostatistical simulations, many require less than 1 hour each to run, but many others require 4 or more days each. (These estimates do not include computing time for the statistical analyses.) The time needed to run each individual analysis (for both national data and geostatistical simulations) will not be reduced, but distributing these analyses via HTC will reduce the overall elapsed time. As with the other examples, this analysis would be difficult without HTC. Though the process is complex, understanding how re-projecting raster data can change the underlying data values will provide extremely important information to researchers in many disciplines.
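The distinction between per-job time and elapsed time can be illustrated with a small scheduling sketch. Each job keeps its own duration, and only the wall-clock time to finish the whole batch shrinks as workers are added, down to the length of the longest single job. The 460-hour and 36-hour figures come from the text above; the job durations and the greedy longest-first assignment are assumptions for illustration and do not model HTCondor's actual matchmaking.

```python
import heapq


def makespan(durations, workers):
    """Elapsed time when each job (longest first) goes to the least-loaded worker."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        # Pop the least-loaded worker, give it the job, push it back.
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)


if __name__ == "__main__":
    # Speedup reported for the national data assessment.
    print("reported speedup:", round(460 / 36, 1))

    # Hypothetical mix: twenty 1-hour simulations and four 96-hour (4-day) ones.
    jobs = [1.0] * 20 + [96.0] * 4
    print("1 worker:   ", makespan(jobs, 1), "hours")
    print("250 workers:", makespan(jobs, 250), "hours")
```

With enough workers the elapsed time equals the longest job, which is why the multi-day simulations set a floor on the total turnaround.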

References

 

¹The use of any trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

U.S. Department of the Interior | U.S. Geological Survey
URL: http://www.fort.usgs.gov/Condor/GeostatSims.asp
Page Contact Information: AskFORT
Page Last Modified: 2:25:30 AM