
Verification of Probabilistic Quantitative Precipitation Forecasts over the Southwest United States during Winter 2002/03 by the RSM Ensemble System

HUILING YUAN
Department of Civil and Environmental Engineering, University of California, Irvine, Irvine, California

STEVEN L. MULLEN
Department of Atmospheric Sciences, The University of Arizona, Tucson, Arizona

XIAOGANG GAO AND SOROOSH SOROOSHIAN
Department of Civil and Environmental Engineering, University of California, Irvine, Irvine, California

JUN DU AND HANN-MING HENRY JUANG
NWS/NOAA/NCEP/Environmental Modeling Center, Washington, D.C.

(Manuscript received 26 March 2004, in final form 20 June 2004)

Corresponding author address: Huiling Yuan, Department of Civil and Environmental Engineering, University of California, Irvine, E-4130 Engineering Gateway, Irvine, CA 92697-2175. E-mail: [email protected]

ABSTRACT

The National Centers for Environmental Prediction (NCEP) Regional Spectral Model (RSM) is used to generate ensemble forecasts over the southwest United States during the 151 days from 1 November 2002 to 31 March 2003. RSM forecasts to 24 h on a 12-km grid are produced from 0000 and 1200 UTC initial conditions. Eleven ensemble members are run each forecast cycle from the NCEP Global Forecast System (GFS) ensemble analyses (one control and five pairs of bred modes) and forecast lateral boundary conditions. The model domain covers two NOAA River Forecast Centers: the California Nevada River Forecast Center (CNRFC) and the Colorado Basin River Forecast Center (CBRFC). Ensemble performance is evaluated for probabilistic forecasts of 24-h accumulated precipitation in terms of several accuracy and skill measures. Differences among several NCEP precipitation analyses are assessed along with their impact on model verification, with the NCEP stage IV blended analyses selected to represent "truth." Forecast quality and potential value are found to depend strongly on the verification dataset, geographic region, and precipitation threshold. In general, the RSM forecasts are skillful over the CNRFC region for thresholds between 1 and 50 mm but are unskillful over the CBRFC region. The model exhibits a wet bias for all thresholds that is larger over Nevada and the CBRFC region than over California. Mitigation of such biases over the Southwest will pose serious challenges to the modeling community in view of the uncertainties inherent in verifying analyses.

1. Introduction

The southwest United States is marked by highly heterogeneous topography and diverse vegetation. The complexity of topographical, hydrological, and biological scenarios poses challenges for numerical weather prediction over this region. Because accurate precipitation forecasts have significant economic and social impacts on this fast-developing area (Pielke and Downton 2000), quantitative precipitation forecasts (QPFs) and probabilistic quantitative precipitation forecasts (PQPFs) are considered critical elements of weather prediction (Fritsch et al. 1998). Unfortunately, the skill of precipitation forecasts has improved relatively slowly (Sanders 1986) compared to other weather elements. Rapid loss of skill is a consequence, in part, of initial errors that saturate much faster for precipitation forecasting than for other sensible weather elements (Wandishin et al. 2001). Model errors associated with physical parameterizations also play a large role in the rapid degradation of precipitation forecasts.

Numerous early studies (e.g., Epstein 1969; Leith 1974; Hoffman and Kalnay 1983; Kalnay 2003) suggested that ensemble forecasts, the running of several model forecasts, could provide more useful and more skillful weather forecasts than a single deterministic forecast run at higher resolution.
Ensemble forecasts are performed by introducing perturbations in initial conditions or in model physical processes, or by using different model configurations. One common method of producing ensemble forecasts is to run multiple members using the same model but from slightly different initial conditions for the same forecast cycle. Global ensemble forecasts have been run operationally at both the National Centers for Environmental Prediction (NCEP; Tracton and Kalnay 1993; Toth and Kalnay 1993; Toth et al. 1997) and the European Centre for Medium-Range Weather Forecasts (ECMWF; Palmer et al. 1993; Molteni et al. 1996) since December 1992. The success of global ensemble systems (e.g., Tracton and Kalnay 1993; Toth and Kalnay 1993; Mureau et al. 1993; Molteni et al. 1996) led to examination of ensemble forecasts with limited-area models at finer resolutions. It was found that the ensemble approach could also improve short-range weather forecasts, especially precipitation forecasting (Brooks et al. 1995; Du et al. 1997; Hamill and Colucci 1998; Buizza et al. 1999; Mullen et al. 1999; Stensrud et al. 1999; Hamill et al. 2000; Toth et al. 2002). Currently, NCEP runs the regional short-range ensemble forecasting (SREF; Du and Tracton 2001) system (0–63 h starting from 0900 and 2100 UTC initial conditions), which uses the Regional Spectral Model (RSM; Juang and Kanamitsu 1994; Juang et al. 1997) and the Eta Model (Black 1994; Rogers et al. 1995) at an equivalent grid spacing of 48 km (32 km was implemented in summer 2004) over the contiguous United States (CONUS).

Motivated by the previous investigations of SREF and the operational implementation of regional ensemble forecasting at NCEP, this paper assesses twice-daily ensemble forecasts over the southwest United States using the RSM (1997 version) at a grid spacing of 12 km, one much finer than the current operational ensemble configuration. Daily forecasts at 0000 and 1200 UTC with 6-h interval output are performed for winter 2002/03 (a weak El Niño year; http://ggweather.com/winter0203.htm), a period during which several major winter storms, floods, and heavy snowfalls occurred over the region (Kelsch 2004). We selected a cool season for this study because runoff from wintertime precipitation provides most of the freshwater supply for the western United States (Palmer 1988; Serreze et al. 1999). The objective of this study is to examine how much a short-range ensemble system, at a model resolution typical of the next-generation ensemble systems, can benefit PQPFs during a cool season. The verification focuses on evaluating 24-h PQPFs for hydrological regions. To meet the high-resolution requirements of hydrological applications, the NCEP Stage IV 4-km precipitation analyses are selected as verification data. The ensemble configuration and experimental design are described in section 2. Verification procedures, differences among potential verification datasets, and their impact on assessment of forecast accuracy and
skill are explained in section 3. Section 4 presents the results of PQPFs and analyzes the model performance for different hydrologic regions. Findings are summarized and recommendations for future research are addressed in section 5.

2. Overview of the RSM ensemble prediction system

The NCEP RSM ensemble forecasting system uses the RSM, 1997 version. The model can be run in two configurations: the hydrostatic version (Juang and Kanamitsu 1994) is called the RSM, while the nonhydrostatic version is called the mesoscale spectral model (Juang 2000). Physical packages in the RSM include a surface layer with planetary boundary layer physics, shallow convection, a simplified Arakawa–Schubert convective parameterization scheme, large-scale precipitation, shortwave and longwave radiation with diurnal variation, gravity wave drag, double-Laplacian-type horizontal diffusion to control noise accumulation, and some hydrological processes (Kanamitsu 1989). A lateral boundary relaxation is applied to reduce the noise from the lateral boundary conditions (Juang et al. 1997).

The NCEP RSM ensemble system uses the breeding method to produce initial perturbations (Toth and Kalnay 1997; Kalnay 2003). The ensemble consists of 11 members: an unperturbed control forecast and five pairs of bred forecasts. Each pair of bred members consists of twin forecasts with positive and negative initial perturbations. Four meteorological variables, U, V, T, and Q (the two wind components, temperature, and humidity) at all model levels, plus mean sea level pressure (MSLP), are perturbed in the breeding process.

In this study, the RSM ensemble forecasts are run to 24 h at 12-km grid spacing with 28 sigma levels over the southwest United States using the hydrostatic version (the RSM). The daily forecasts for each 0000 and 1200 UTC cycle are performed for all 151 days during the period 1 November 2002 to 31 March 2003. Initial conditions (ICs) are generated from the NCEP Global Forecast System (GFS; http://meted.ucar.edu/nwp/pcu2/avintro.htm) products. At 0000 and 1200 UTC, the GFS provides global analyses (once called AVN) at T254L64 resolution (T denotes triangular wave truncation, L denotes vertical layers) and runs a deterministic forecast to 84 h at T254L64. To maintain consistency between the ICs and the global ensemble forecasts at T126L28 that are used for the boundary conditions (BCs) of regional ensemble models, the ICs of the control run at 0000 and 1200 UTC are truncated from the T254L64 global analyses to T126L28 resolution. At the initial integration time (0000 UTC 1 November 2002), the ICs for the five pairs of positive and negative bred perturbations are directly obtained from five pairs of the GFS ensemble runs at T126L28 resolution. Afterward, the perturbations of the bred members are constructed from the previous 12-h RSM forecasts, so-called regional breeding (Tracton and Du 2003, manuscript submitted to Wea. Forecasting). BCs for the control run at 0000 UTC and the five pairs of bred members come from 6-hourly updates of the GFS ensemble outputs. The GFS ensemble forecasts of the 1200 UTC control run at T126L28 resolution are not archived at NCEP, so as a substitute the BCs of the control run at 1200 UTC are truncated from the GFS deterministic global forecasts from T254L64 to T126L28 resolution.

Computational limitations always constrain the choice of domain size, grid spacing, and number of ensemble members (Hamill et al. 2001). Du and Tracton (1999) find that restricting domain size has a detrimental effect on the performance of regional ensembles through suppression of spread. They also note, however, that precipitation possesses a much lower dependence on the domain size than other model parameters (e.g., geopotential height, temperature, wind). Because the sole focus of this study is PQPF, a limited domain is chosen to allow for a 12-km grid separation, or a minimum wavelength of 24 km in the RSM, that covers the entire Southwest.
The model domain contains 162 × 171 grid points and is centered at 36.5°N, 114.5°W (Fig. 1). We hereafter refer to this 12-km RSM configuration as RSM12. The grid covers two National Oceanic and Atmospheric Administration (NOAA) RFCs over the Southwest: the California Nevada River Forecast Center (CNRFC) and the Colorado Basin River Forecast Center (CBRFC). It also includes four entire U.S. Geological Survey (USGS) hydrologic unit regions: the Upper Colorado region, the Lower Colorado region, the Great Basin region, and the California region. Model elevation is interpolated from 5′ × 5′ topographic data (equivalent to about 5 km × 5 km; Juang 2000) to the 12-km grid over the study domain.

FIG. 1. The study area (163 × 172 grid points, 12-km mesh) and topography (contour interval 500 m). There are two River Forecast Centers (bold solid line): (left) the CNRFC and (right) the CBRFC, and four USGS hydrologic regions (shaded area): A—the Upper Colorado region, B—the Lower Colorado region, C—the Great Basin region, and D—the California region.
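To make the breeding step described in this section concrete, the sketch below (ours, not NCEP code) shows one regional-breeding cycle for a single bred pair: the perturbation grown over the previous 12-h forecast is rescaled to a fixed amplitude and then added to, and subtracted from, the new control analysis. The RMS norm and the amplitude constant `c` are illustrative assumptions; the operational system perturbs U, V, T, Q, and MSLP with its own rescaling conventions.

```python
import numpy as np

def breed_step(control_ic, pert_fcst, control_fcst, c=1.0):
    """One regional-breeding rescale for a single bred pair.

    control_ic:   new control analysis (array of model state values)
    pert_fcst:    previous 12-h forecast from the perturbed member
    control_fcst: previous 12-h forecast from the control member
    Returns the positive and negative bred initial conditions.
    """
    raw = pert_fcst - control_fcst            # perturbation grown by the model
    amp = np.sqrt(np.mean(raw ** 2))          # simple RMS norm (an assumption)
    scaled = c * raw / max(amp, 1e-12)        # rescale to fixed amplitude c
    return control_ic + scaled, control_ic - scaled  # twin bred members
```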

3. Accuracy metrics and verification analyses

Multiple verification measures are needed to sample the high dimensionality of forecast quality (Murphy and Winkler 1987; Murphy 1991). For that reason, a suite of measures appropriate for assessing
probabilistic forecasts is employed. These include the Brier skill score (BSS; Stanski et al. 1989; Wilks 1995), the ranked probability skill score (RPSS; Stanski et al. 1989; Wilks 1995), rank histograms (RH; Anderson 1996; Hamill and Colucci 1998; Hamill 2001), attributes diagrams (AD; Stanski et al. 1989; Wilks 1995), relative operating characteristic (ROC; Stanski et al. 1989; Wilson 2000) curves, and potential economic value (PEV; Richardson 2000; Buizza 2001; Zhu et al. 2002). Details on these verification techniques can be found in textbooks (e.g., Stanski et al. 1989; Wilks 1995; Jolliffe and Stephenson 2003) and the references contained within. Confidence bounds for the verification scores are estimated using resampling procedures (Hamill 1999; Mullen and Buizza 2001).

Distributed hydrological models often require precipitation input at resolutions finer than the RSM12. Because of their fine resolution, the NCEP stage IV precipitation analyses, which are mapped onto a 4 km × 4 km national grid (hereafter Stage4; http://wwwt.emc.ncep.noaa.gov/mmb/ylin/pcpanl/stage4/) for the CONUS, are selected to verify QPFs and PQPFs. Stage4 provides quantitative precipitation estimates (QPEs) for the 6-h accumulations valid at 0000, 0600, 1200, and 1800 UTC, whereas many NCEP datasets only archive 24-h estimates valid at 1200 UTC. Stage4 24-h QPEs are obtained from the 6-h QPEs for the 12 RFCs of the CONUS. Approximately 3000 automated hourly rain gauge observations and hourly radar estimates of precipitation are blended together to generate hourly and 6-h precipitation analyses at most RFCs. The algorithm produces an estimate for any grid cell within approximately 50 km of the nearest gauge report or within approximately 200 km of each successfully decoded radar report. The gauge data are subjected to manual quality control (Young et al. 2000), and the radar data undergo a bias adjustment (Fulton et al. 1998). The CNRFC and CBRFC use the Mountain Mapper method with the Precipitation–Elevation Regressions on Independent Slopes Model (PRISM; Daly et al. 1994) to produce precipitation distributions that incorporate gauge data (http://www.nws.noaa.gov/oh/hrl/hag/summary2.htm).

Major sources of uncertainty in the Stage4 analyses over the southwest United States come from variations in the spatial density of rain gauges (Fig. 2) and imperfect radar estimation associated with the radar reflectivity over the complex terrain (Fulton et al. 1998). Another source of uncertainty is that Stage4 does not supply a mask for missing or suspect data. Although the grid cells outside the RFC domains are easy to process as missing data in Stage4, it is difficult to discriminate between zero values and missing data over land, especially over regions with relatively few rain gauges (e.g., Nevada).

FIG. 2. The distribution of daily gauge stations at the RFCs for the study period of 151 days. The grayscale represents the percentage of days with reports available.

As a way to examine the impact of a different QPE with a comparable resolution on forecast verification, analyses based solely on the RFC gauge data (hereafter RFC4) are used. The RFC4 analyses utilize the same 4-km grid and the same gauge-to-grid algorithm as Stage4, but automated and manual gauge reports numbering 7000 to 8000 (Higgins et al. 2003) are analyzed. The input data for the RFC4 analyses are not subjected to the quality control done at the RFCs, however. The RFC4 mask is then applied to the Stage4 analyses to create a third verification analysis (hereafter RfcS4) that allows comparison over the same set of grid points. The impact of using a coarser-resolution QPE on skill can be examined by comparing verifications based on 1/8° (~14 km) 24-h analyses (hereafter RFC14) from the NCEP Climate Prediction Center (CPC) against the Stage4 analyses. The PRISM/least squares scheme (more information available online at http://www.emc.ncep.noaa.gov/mmb/ylin/pcpverif/scores/docs/QandA.html) is used to map the 7000–8000 RFC daily gauges (Higgins et al. 2003) onto the 1/8° grid. The RFC14 scheme also includes a quality control step that combines gauge, radar, and satellite data. Note that the 1/8° spacing is close to the native grid spacing of the RSM forecasts.

TABLE 1. NCEP precipitation analyses. QC: quality control done at RFCs; CONUS: continental United States.

          Resolution      Data source     QC    Interval   Time (UTC)    Mask   Gauges for CONUS
RFC14     1/8° (~14 km)   Radar + gauge   Yes   24 h       1200          Yes    7000–8000
RFC4      4 km            Gauge only      No    24 h       1200          Yes    7000–8000
Stage4    4 km            Radar + gauge   Yes   6/24 h     0000, 1200    No     ~3000

FIG. 3. The 24-h precipitation during the period 1200 UTC 8 Nov 2002–1200 UTC 9 Nov 2002 for three datasets and RSM forecasts (see section 3): (a) RFC14 (1/8°), (b) RFC4 (4 km), (c) Stage4 (4 km), (d) RSM member n5, (e) RSM member p5, and (f) RSM ensemble mean. The grayscale indicates the precipitation amount (mm); blank areas have no data.

Thus, 24-h precipitation at 1200 UTC is available for four datasets: RFC14 (1/8°, or ~14 km), RFC4 (4 km, with its own mask), Stage4 (4 km, no mask), and RfcS4 (4 km, with the RFC4 mask).
The main differences among the first three QPEs are summarized in Table 1. Figures 3a–c illustrate representative differences among the three QPEs for the 24-h period ending at 1200 UTC 9 November 2002. There are obvious local inconsistencies among the QPEs in terms of precipitation amount and masked areas. Figure 3 also shows 24-h precipitation forecasts for two ensemble members (Figs. 3d,e) and the 11-member ensemble mean (Fig. 3f); noteworthy differences are apparent between the RSM12 forecasts. Comparison of the three QPEs and the two RSM12 forecasts indicates localized differences among the QPEs that are comparable to the differences between the RSM12 QPFs. The general impression is that the differences among the QPEs are smaller than the differences between the ensemble members, but the QPE differences are not negligible.

The differences among the panels of Fig. 3 suggest that estimates of forecast skill can change markedly if different verification analyses are used as "truth." To illustrate this point, the BSS over the entire verification domain for all 151 forecasts was computed with the four QPEs (Fig. 4a), with the respective sample frequencies used as the reference forecasts in the skill estimates (Fig. 4b). Bilinear interpolation is used to transform model output to the analysis grids for this estimate and for all verifications that follow. The most meaningful comparisons are Stage4–RfcS4, to assess the impact of masked regions; RfcS4–RFC4, to assess the impact of different analyses on the same 4-km grid; and Stage4–RFC14, to assess the impact of a coarser verification. Elimination of potentially suspect grid points from the Stage4 analyses (Stage4–RfcS4) has a minimal (but mostly positive) impact on skill. The skill is significantly higher for the gauge-only analyses (RFC4) than for the blended analyses at the same grid points (RfcS4) for thresholds below 20 mm. This comparison vividly points out the large impact that the choice of verifying analysis can have on skill assessment. In fact, the Brier skill score for the lightest thresholds is markedly higher for the RFC14 and RFC4 analyses than for the Stage4/RfcS4 products. The fact that the RFC14 analyses yield the highest skill is not surprising, since they do not contain a strong influence from scales that are too small to be resolved by the RSM12. On the other hand, the comparable performance of the RFC4 gauge-only analyses is more surprising. Differences in performance at the small thresholds are related, in good part, to variations in sample climatology (Fig. 4b). The Stage4 and RfcS4 analyses systematically possess a lower frequency of occurrence below 25 mm than the RFC4 and RFC14 analyses. For example, note the ~0.14 frequency at 1 mm for Stage4/RfcS4 versus 0.20–0.22 for RFC14 and RFC4. Differences of this magnitude would yield large differences in estimates of model biases. It turns out that the RSM ensemble possesses a large wet bias (next section), so use of the "drier" Stage4 analyses leads to a degradation in skill relative to use of the RFC4 and RFC14 analyses.
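The bilinear step mentioned above can be sketched as follows. This is a generic illustration assuming regular, ascending latitude–longitude coordinate vectors; scipy's RegularGridInterpolator stands in for whatever interpolation routine was actually used.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def to_analysis_grid(field12, lat12, lon12, lat4, lon4):
    """Bilinearly interpolate a 12-km model field onto a finer
    (e.g., 4-km Stage4) verification grid.

    field12:      2-D array on the model grid (lat12 x lon12)
    lat12, lon12: ascending 1-D coordinate vectors of the model grid
    lat4, lon4:   ascending 1-D coordinate vectors of the target grid
    """
    interp = RegularGridInterpolator(
        (lat12, lon12), field12, method="linear",
        bounds_error=False, fill_value=np.nan)   # NaN outside the model domain
    glat, glon = np.meshgrid(lat4, lon4, indexing="ij")
    pts = np.column_stack([glat.ravel(), glon.ravel()])
    return interp(pts).reshape(glat.shape)
```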

FIG. 4. The (a) Brier skill score and (b) sample climatological frequency of 24-h precipitation (1200 UTC) at eight thresholds of 1–75 mm during 151 days for four datasets (see section 3): RFC14, RFC4, RfcS4, and Stage4. Error bars in (a) give 90% confidence interval for the Brier skill score based on the Stage4 analyses.

Many hydrological applications, such as flood forecasting, require precipitation input at fine resolutions (Droegemeier et al. 2000). For that reason, the 4-km Stage4 is selected as baseline “truth” for the scores that follow. Major drawbacks of using Stage4 products are questions of QPE fidelity in regions with complex terrain and a paucity of rain gauges, the lack of a mask for missing or suspect data, and errors associated with the interpolation of 12-km model fields to 4-km grids. The reader should keep in mind that the verification statistics that follow exhibit variations if different analyses are used for “truth.”
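The resampling procedure used for the confidence bounds reported below (cf. Hamill 1999) can be sketched as follows. We assume simple day-level resampling with replacement, which may differ in detail from the blocking actually used; the function and variable names are ours.

```python
import numpy as np

def brier_skill(prob, obs, clim):
    """BSS of probability forecasts against a climatological reference.
    prob, obs: (days, gridpoints) arrays; obs is 0/1; clim is a scalar."""
    bs = np.mean((prob - obs) ** 2)
    bs_ref = np.mean((clim - obs) ** 2)
    return 1.0 - bs / bs_ref

def bootstrap_bss(prob, obs, clim, n=1000, alpha=0.10, seed=0):
    """90% (by default) confidence bounds from resampling forecast days."""
    rng = np.random.default_rng(seed)
    days = prob.shape[0]
    scores = []
    for _ in range(n):
        idx = rng.integers(0, days, days)        # resample days with replacement
        scores.append(brier_skill(prob[idx], obs[idx], clim))
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])
```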

4. Forecast quality and skill

a. Comparison of deterministic RSM12 forecasts and NCEP operational forecasts

It is of interest to examine briefly whether the accuracy of the RSM12 precipitation forecasts is comparable to, and preferably better than, that of the NCEP suite of operational models. It is difficult to justify, a priori, the running of higher-resolution ensemble forecasts based solely on initial perturbations if the RSM12 performance is significantly worse. NCEP continually evaluates QPF performance from its deterministic models and from manual forecasts issued by the Hydrometeorological Prediction Center (HPC), NWS Weather Forecast Offices (WFOs), and River Forecast Centers (RFCs) (e.g., Charba et al. 2003; Reynolds 2003). A detailed, quantitative comparison between the RSM12 and NCEP forecasts for the 2002/03 cool season is not possible in view of differences in the verification period (November–March for the RSM12 versus October–March for NCEP). Nevertheless, a qualitative comparison (results not shown) of the biases, mean absolute errors, and threat scores reveals that the RSM12 control forecast possesses an accuracy comparable to the Nested Grid Model (NGM), the Eta Model, the GFS, and the HPC and CNRFC/CBRFC forecasts during the 2002/03 cool season. We believe that the comparable performance of the RSM12 configuration justifies its use for short-range ensemble PQPFs.

FIG. 5. Monthly averaged precipitation for the five study months. (a) The Stage4 average monthly precipitation. (b) The average monthly difference between the ensemble mean of the 0000 UTC forecasts and the Stage4 precipitation. (c) The average monthly difference of the ensemble-mean precipitation between the 1200 and 0000 UTC cycles. The top grayscale is for (a) and (b), while the bottom scale is for (c); units are mm per 30 days (month).
b. Biases

The average monthly precipitation for the 5-month forecast period (Fig. 5a) contains two heavy precipitation bands along the Coastal Range and the Sierra Nevada of California. Drier conditions prevailed over the Great Basin region and the Desert Southwest, with higher precipitation confined to the higher terrain of the CBRFC. Monthly precipitation varied significantly from month to month during the 2002/03 winter (not shown). California experienced widespread regions of heavy precipitation (>200 mm month⁻¹) in November and December, with much drier conditions thereafter. The CBRFC zone experienced an opposite trend: a relatively dry November–January, replaced by a wetter pattern in February and March.

The difference between the 151-day average precipitation from the 0000 UTC ensembles and the Stage4 average reveals a widespread wet bias (Fig. 5b) in the RSM forecasts. The wet bias is worse in the 1200 UTC forecasts over most regions (Fig. 5c). Moreover, this tendency for wetter 1200 UTC cycles is noted in all five months (not shown).

Rank histograms (Fig. 6) reveal distributions consistent with a wet bias. Distributions for both 0000 and 1200 UTC deviate significantly from a uniform rank, the expected distribution for a reliable ensemble. The "L" shape indicates that the verification preferentially falls at the lower end of the distribution, which means that the ensemble members are too frequently wetter than the verification. The bias is more severe for the 1200 UTC cycle, with the lowest rank being 4% more populated at 1200 UTC than at 0000 UTC.

FIG. 6. Rank histograms of 24-h precipitation for the 0000 UTC cycle (black bars) and the 1200 UTC cycle (white bars) during 151 days over the whole domain. The abscissa shows the rank of the Stage4 value among the 12 values formed by Stage4 and the 11 ensemble forecasts. The ordinate shows frequency. The horizontal dashed line denotes the frequency for a uniform rank distribution. The error bars plotted to the right of each bar give the 90% confidence intervals for 0000 and 1200 UTC, respectively.

Attributes diagrams also indicate that the wet bias extends to probability space for every threshold examined. The reliability for the 0000 UTC forecasts is somewhat better than for the 1200 UTC forecasts (Fig. 7), albeit both curves lie significantly below the 1:1 diagonal for nonzero forecast probabilities (i.e., forecast probabilities are higher than observed frequencies), which indicates a conditional wet bias. Furthermore, the distributions of the forecast probabilities (inset histograms) reveal that the 1200 UTC cycle produces fewer 0% forecasts and more nonzero forecasts across the vast majority of probability values than the 0000 UTC cycle. In fact, the conditional bias at 1200 UTC is so severe that only the 0% and very highest probability forecasts make positive contributions to the Brier skill score; that is, the reliability curve for those probability categories is closer to the 1:1 diagonal than to the "no skill" line.

The Brier skill score can be decomposed into three terms: reliability, resolution, and uncertainty. The uncertainty term [f × (1 − f); Fig. 7] depends solely on the sample climatological frequency (f). The resolution term is positively oriented (higher values are better) and measures the ability to properly sort occurrences from nonoccurrences at a level above climatology; on an attributes diagram this corresponds to the distance squared between the reliability curve and the sample climatological frequency, weighted by the forecast frequency. The reliability term is negatively oriented and measures the agreement between the forecast probability and the observed frequency of occurrence; this corresponds to the distance squared between the reliability curve and the 1:1 diagonal line. Whenever the resolution term exceeds the reliability term, the forecast system is skillful with respect to sample climatology; this corresponds to the reliability curve lying farther from the horizontal "climatology" line than the no-skill line. Note that forecasts based on sample climatology have perfect reliability, but they possess no ability to discriminate events. Figure 8 shows that the resolution terms for the two cycles are virtually equivalent, but the reliability terms for the 1200 UTC forecasts are noticeably larger (worse) than the 0000 UTC ones for the 1–25-mm thresholds. Evidently, the degraded skill at 1200 UTC is due solely to a larger conditional wet bias, and not to a diminished capacity to discriminate precipitation events.
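A minimal sketch of this three-term decomposition, assuming forecast probabilities are restricted to i/11 for an 11-member ensemble (variable names are ours, not from the paper):

```python
import numpy as np

def brier_decomposition(prob, obs, n_members=11):
    """Murphy decomposition BS = REL - RES + UNC.
    prob, obs: flat arrays over all forecasts; obs is 0/1;
    prob takes values i/n_members for i = 0..n_members."""
    f_bar = obs.mean()                         # sample climatological frequency
    rel = res = 0.0
    for i in range(n_members + 1):
        sel = np.isclose(prob, i / n_members)  # all cases in this probability bin
        if sel.any():
            o_k = obs[sel].mean()              # observed frequency in the bin
            w = sel.mean()                     # bin weight n_k / N
            rel += w * (i / n_members - o_k) ** 2   # reliability (lower is better)
            res += w * (o_k - f_bar) ** 2           # resolution (higher is better)
    unc = f_bar * (1.0 - f_bar)                # uncertainty, f(1 - f)
    return rel, res, unc
```

The forecast system is skillful with respect to sample climatology exactly when `res` exceeds `rel`, since BSS = (res − rel)/unc for a climatological reference.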

FIG. 7. Attributes diagrams for 24-h precipitation at four thresholds: (a) 1, (b) 10, (c) 20, and (d) 50 mm during 151 days over the whole domain. The horizontal dashed line denotes the sample climatological frequency (f). Solid curves with black dots and white circles denote reliability at 0000 and 1200 UTC, respectively. Error bars give the 90% confidence bounds for reliability at 0000 UTC. The abscissa shows the number of forecast members out of 11 (probability × 1/11). The ordinate shows the observed frequency (Stage4). The sloped solid line denotes a perfect forecast; the sloped dotted line denotes no skill. The inset bar graphs show the percentage of grid points in each forecast probability category: the number of 0% forecasts is shown in the left bar graph, and the numbers for 1/11, 2/11, . . . , 11/11 are shown in the right bar graph of each panel.

FIG. 8. Decomposition of the Brier skill score for the (a) 0000 and (b) 1200 UTC cycles for 151 days over the whole domain into three terms: reliability (chain line with circles), resolution (solid line with asterisks), and uncertainty (dashed line with triangles). The three terms are on a log10 scale.

In summary, the RSM12 ensemble shows a significant wet bias that is reflected in the winter average precipitation, in the frequency of observations that lie outside the ensemble range, and in the forecast probabilities for thresholds up to 50 mm day⁻¹. The wet bias is significantly larger for the 1200 UTC cycle than for the 0000 UTC cycle.

c. Regional and monthly verification

Comparison of the BSS for the whole domain, the two RFC districts, and the four USGS hydrological basins (Fig. 9) reveals large regional differences in skill. The CNRFC (Fig. 9b) and California (Fig. 9f) are the only regional domains with positive BSS values for the 0000 and 1200 UTC cycles over the entire range of thresholds, although the confidence intervals (CIs) for the 50- and 75-mm thresholds extend below zero. The other domains are deemed unskillful: their 90% CIs extend so far below the zero line that they should probably be considered lacking skill even where the mean value lies above it. The regional stratification also shows a strong tendency for smaller BSS values at 1200 UTC in every domain.

FIG. 9. The Brier skill scores for 24-h accumulated precipitation for the 0000 UTC (solid line with asterisks) and 1200 UTC (dashed line with dots) cycles at eight thresholds (1–75 mm) for 151 days. Seven domains are shown: (a) the whole domain, (b) the CNRFC, (c) the CBRFC, (d) the Great Basin region, (e) the Upper Colorado region, (f) the California region, and (g) the Lower Colorado region. Error bars give the 90% confidence bounds of the BSS.

The RPSS, which is an extension of the BSS to multicategory forecasts, measures the agreement between the forecast and observed probability distributions. The RSM ensembles (Fig. 10) are most skillful (RPSS > 0) over California, followed by the CNRFC. Forecasts over the CBRFC, the Great Basin, and the Upper and Lower Colorado watersheds are unskillful. The 0000 UTC forecasts also systematically exhibit higher RPSS than the 1200 UTC forecasts. The RPSS distributions are in basic agreement with the BSS results.

The gridpoint distribution of the RPSS at 0000 UTC (Fig. 11) reveals that no more than half of the verification domain contains positive skill. The most skillful regions are situated along the mountain ranges and coastal regions of California and over the high terrain of the interior. Areas with the highest skill (RPSS > 0.5) are confined to the Pacific coast, the windward slopes of the Sierra Nevada, and the Mogollon Rim of central Arizona. On the other hand, forecasts over the Great Basin generally lack skill, except over the highest terrain.

BSS values over the whole domain (Fig. 12) exhibit the highest scores in November and December, when the monthly averaged precipitation intensified along the windward slopes in California while conditions were relatively dry elsewhere. California received less precipitation in February and March, while the Great Basin region and the CBRFC received heavier precipitation than early in the season. It appears that RSM skill is related to the distribution of precipitation, with wet conditions in California favoring high skill.

FIG. 12. Monthly Brier skill scores for 24-h accumulated precipitation at 0000 UTC for (left) the whole domain and (right) the CNRFC at 1–75-mm thresholds. The plain solid line represents the 5-month average.
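For reference, a minimal RPS/RPSS sketch, under the assumption that precipitation has already been binned into ordered categories (e.g., at the 1-, 10-, 20-, and 50-mm thresholds):

```python
import numpy as np

def rps(prob_cat, obs_cat):
    """Ranked probability score.
    prob_cat: (cases, categories) forecast probabilities summing to 1 per case.
    obs_cat:  one-hot array of the same shape marking the observed category."""
    cf = np.cumsum(prob_cat, axis=1)   # cumulative forecast distribution
    co = np.cumsum(obs_cat, axis=1)    # cumulative observed (step) distribution
    return np.mean(np.sum((cf - co) ** 2, axis=1))

def rpss(prob_cat, obs_cat, clim_cat):
    """RPSS = 1 - RPS_forecast / RPS_climatology; clim_cat is the
    (categories,) vector of sample climatological probabilities."""
    clim = np.broadcast_to(clim_cat, prob_cat.shape)
    return 1.0 - rps(prob_cat, obs_cat) / rps(clim, obs_cat)
```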

d. Discriminating ability and potential economic value

The ROC curve extends the concepts of the hit rate and false alarm rate to probability forecasts (Table 2).

TABLE 2. The contingency table of forecasts and observed events. The hit rate is X/(X + Y), and the false alarm rate is Z/(Z + W).

                        Observed
Forecast        Yes         No          Total
Yes             X           Z           X + Z
No              Y           W           Y + W
Total           X + Y       Z + W       Total
Because the ROC curve stratifies events according to the observations, it is insensitive to conditional biases. The area under the ROC curve provides a scalar summary of the ability of a forecast system to discriminate a dichotomous event. An area of 1.0 denotes a perfect deterministic forecast, while an area of 0.5 indicates no discriminating ability. ROC areas (Fig. 13) for the entire domain and the various hydrological regions run ~0.9 or higher for thresholds of 1–75 mm. The ROC areas, unlike the bias score itself or verification scores (BSS and RPSS) that are sensitive to bias, show little variation between the 0000 and 1200 UTC forecasts. Buizza et al. (1999) suggest that a ROC area of 0.7 might be considered a threshold for a useful forecast system, and that an area of 0.8 or higher indicates good discriminating ability. It has been argued that the ROC area should be computed after modeling the hit rates and false alarm rates as straight lines in normal deviate space (e.g., Wilson 2000), but a fitted curve gives even higher ROC areas than those in Fig. 13 (not shown). In any event, a "straight line" ROC curve with an area of 0.9 must be considered indicative of excellent discriminating ability. Moreover, these high ROC areas indicate that the BSS and RPSS would increase markedly after removal of conditional biases through postprocessing procedures that do not reduce the resolution term.

Potential economic value curves can be computed from the ROC curve. The PEV curves give the forecast value relative to the economic value of using climatological information. The simple economic model assumes that the cost of taking a preventive action to protect against a loss is less than the loss from the weather event itself, and that the protective action reduces the loss to zero. A value of 1.0 denotes the PEV of a perfect deterministic forecast, while 0.0 denotes the value of climatology.
Forecast value varies with the sample frequency (Fig. 4b), the cost/loss (C/L) ratio, and the probability threshold at which a preventive action is taken. The optimal PEV is defined as the highest value among all possible probability thresholds. The PEV for the 0000 UTC forecasts (Fig. 14) is positive for C/L ratios between 0.009 and 0.6 at 1 mm, and gradually evolves to a range of 0.001–0.6 at 50 mm. A positive value for C/L ratios up to 0.6 indicates that some users who require quite confident forecasts could benefit from the RSM12 forecasts. However, the range of C/L ratios associated with a relatively high level of value, say 0.5 or higher, shrinks considerably, to C/L = 0.2. Value curves for the 1200 UTC forecasts (not shown) are very similar to the 0000 UTC ones.

FIG. 13. Area under the relative operating characteristic curve for 24-h accumulated precipitation for the 0000 UTC (dotted line) and 1200 UTC (dashed line) cycles during 151 days over the whole domain at 1–75-mm thresholds. Error bars give the 90% confidence bounds for the ROC areas.

FIG. 14. The optimal potential economic value (negative values not shown) for 24-h precipitation at 0000 UTC for 151 days over the whole domain at four thresholds: 1 mm (dotted line), 10 mm (solid line), 20 mm (dashed line), and 50 mm (chain-dashed line).
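A hedged sketch of the cost/loss value computation as we read the Richardson (2000) formulation, with the optimal PEV taken as the envelope over probability thresholds (the hit/false-alarm pairs would come from a ROC sweep like the one above):

```python
import numpy as np

def relative_value(hit, far, s, alpha):
    """Relative economic value for one probability threshold.
    hit, far: hit and false alarm rates; s: event sample frequency;
    alpha: cost/loss ratio C/L (0 < alpha < 1)."""
    e_fcst = far * (1 - s) * alpha + hit * s * alpha + (1 - hit) * s
    e_clim = min(alpha, s)       # cheaper of always/never protecting
    e_perf = s * alpha           # expense with a perfect forecast
    denom = e_clim - e_perf
    # undefined where alpha == s; return NaN there
    return (e_clim - e_fcst) / denom if denom else float("nan")

def optimal_pev(hits, fars, s, alphas):
    """Envelope over thresholds: best achievable value per C/L ratio."""
    return np.array([max(relative_value(h, f, s, a)
                         for h, f in zip(hits, fars))
                     for a in alphas])
```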

5. Summary and conclusions

The NCEP RSM was used to generate ensemble forecasts over the southwest United States during the 151 days from 1 November 2002 to 31 March 2003. RSM forecasts to 24 h on a 12-km grid were produced from 0000 and 1200 UTC initial conditions. Eleven ensemble members were run each forecast cycle starting from NCEP GFS ensemble analyses (one control and five pairs of bred modes) and forecast lateral boundary conditions. Various verification metrics for 24-h accumulations were computed for hydrological zones using several different NCEP precipitation analyses as truth, but results for the 4-km Stage4, which allowed comparison of the 0000 and 1200 UTC forecast cycles, were emphasized. The main findings of the study are as follows:

• Skill scores show significant sensitivity to the analysis that is used for verification. Brier skill scores for the RSM can indicate either skill or no skill depending on the analysis.
• The RSM ensemble possesses a significant wet bias. The wet bias exists over most of the domain and during both forecast cycles.
• Forecasts starting at 1200 UTC show a strong tendency for lower skill than forecasts starting at 0000 UTC.
• There are large spatial variations in skill. In general, California is a region of high skill, whereas the Great Basin is a region with little or no skill.
• The RSM ensemble is able to discern precipitation events over a wide range of thresholds. Discriminating ability is highest over California, but it is also exceptional over the individual hydrologic zones, with ROC areas on the order of 0.9. The ROC curves, and hence the discriminating ability, do not show a significant variation with analysis cycle.

FIG. 10. The ranked probability skill score (RPSS) for 0000 UTC (black bars) and 1200 UTC (white bars) over different hydrological regions using four thresholds (1, 10, 20, and 50 mm). From left to right: the whole domain, the CNRFC, the CBRFC, the Upper Colorado region, the Lower Colorado region, the Great Basin region, and the California region. Error bars give the 90% confidence bounds for the RPSS.

FIG. 11. Spatial distribution of the ranked probability skill score for the 0000 UTC cycle forecasts.
Perhaps the most perplexing finding is the large difference in verification scores between the 0000 and 1200 UTC forecast cycles. The underlying cause of this difference is not known, but a couple of possibilities could be contributing factors. The 1200 UTC analyses over the Pacific Ocean, the region upwind of the model domain, correspond to a 6-h assimilation cycle during nighttime hours. This means an absence of visible satellite data and fewer reports from ships of opportunity during the 1200 UTC cycle. The relative lack of data could produce "poorer" analyses at 1200 UTC, which in turn could lead to lower skill. However, the ROC areas and PEV (Figs. 13 and 14) do not change appreciably between 0000 and 1200 UTC; thus it is unclear why scores insensitive to biases would not also be degraded at 1200 UTC if poorer analyses were a major factor. "Spinup," together with a significant diurnal cycle of precipitation, is another possibility that could cause a bias differential if the diurnal cycle of wintertime precipitation over virtually all of the western United States had its maximum much closer to 0000 than to 1200 UTC. However, the diurnal cycle during winter is weak over the West, where "morning maxima seem to be prevalent at most stations" (Wallace 1975). On the other hand, if the wet bias began during a specific portion of the diurnal cycle, then it could produce 0000–1200 UTC bias differences that would be especially large in the first 24 h of the forecast. For example, if the wet bias began during the 6-h period 1800–0000 UTC, then the 24-h forecasts starting at 1200 UTC would exhibit a larger bias than the 24-h forecasts starting at 0000 UTC. We note that simulations with coarser versions of the RSM exhibit a diurnal cycle in which convection initiates too early during the 1800–0000 UTC period (Hong and Leetmaa 1999). Whatever the source of the differences noted here, model deficiency or sampling fluctuation, the situation warrants further examination to see if the behavior continues.

The diurnal issue aside, the existence of an overall wet bias, relative to the Stage4 analyses, is undeniable. It appears in the spatial distributions of time-averaged rainfall, the attributes diagrams, and the rank histograms.
Forecast quality jumps significantly if the RSM is verified against the RFC4 analyses or the RFC14 analyses, however. The improvement comes solely from a reduction of the wet bias. The frequency of precipitation events is noticeably higher in the RFC4 and RFC14 analyses than in the Stage4 analyses for thresholds of 25 mm and smaller, precisely those thresholds for which the improvement is greatest. Evidently, the exclusion of rain gauges in the Stage4 product vis-à-vis RFC4 leads to the difference in sample frequencies. It is plausible that the practice of eliminating rain gauges in a region of sparse gauge distribution (Fig. 2) and spotty radar coverage owing to terrain blockage (Maddox et al. 2002) could be adversely affecting the Stage4 analyses, and for that reason alone we recommend that NCEP review its practice over the Intermountain West. On the other hand, the RSM wet bias could indeed be "real" if the Stage4 analyses are closer to the "truth" than the RFC4 and RFC14. In that case, the large RSM overestimation could be produced by a variety of model deficiencies, for example, insufficient representation of topography at 12 km or parameterization schemes that have not been carefully "tuned" for a 12-km grid in the presence of the extremely heterogeneous terrain, among many others. Unfortunately, we have no way of knowing. What is clear is that mitigation of biases and other forecast errors over the West, if they are real, will pose challenges to the modeling community in view of the uncertainties inherent in the verifying precipitation analyses.

Comparison of the spatial distributions of the rain gauge density (Fig. 2), the observed precipitation (Fig. 5), and the RPSS (Fig. 11) indicates a similar pattern. Areas with dense gauge coverage and high monthly precipitation seem to coincide with the areas of high skill. The spatial correlation coefficient between the RPSS and monthly precipitation runs ~0.6; the correlation between the RPSS and gauge density, though much lower, is significantly positive (~0.3). This raises the issue
of "skill" for the RSM being related to precipitation amount and data density. A positive correlation between skill and average precipitation is arguably not surprising in view of the RSM wet bias, but any significant correlation between skill and gauge density, however small, only serves to cloud the interpretation of these results further. The Brier and ranked probability skill scores seem to contain spatial variations too large to be attributed to differential gauge density. RSM precipitation forecasts are clearly most accurate along the windward slopes of the Sierra Nevada and the Coastal Range of California, and to a lesser degree over the Mogollon Rim of Arizona. Skill worsens downstream of these ranges, which act as initial barriers to moisture-laden westerly flow from the Pacific Ocean. RSM performance in the interior valleys of the West seems particularly poor. It appears that upslope rains during times of "wet" winter flow (Fig. 11) are relatively easy for the RSM ensemble to predict, whereas precipitation forecasts over the interior are more problematic. These regional variations in skill are almost exclusively related to biases, however. Because statistical postprocessing techniques are excellent at mitigating conditional biases in probabilistic forecasts without significant deterioration of the resolution term (e.g., Hamill and Colucci 1998; Eckel and Walters 1998), this suggests that calibration of the RSM12 ensemble forecasts, interpolated to the finer 4-km Stage4 grid, has the potential to yield unbiased PQPFs that can discriminate 24-h rain events with a high level of confidence.
However, the sample size and minimal training period required for regional precipitation calibration need to be established, especially under the challenge of frequent changes to the operational analysis–forecast system.
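As one illustration of such postprocessing, the sketch below remaps each raw ensemble probability to the observed frequency at which it verified over a training period. This is a generic reliability-based recalibration of our own construction, not the specific schemes of Hamill and Colucci (1998) or Eckel and Walters (1998).

```python
import numpy as np

def fit_calibration(train_prob, train_obs, n_members=11):
    """Map each raw probability i/n_members to its training-sample
    observed frequency; bins never visited fall back to the raw value."""
    mapping = np.array([i / n_members for i in range(n_members + 1)])
    for i in range(n_members + 1):
        sel = np.isclose(train_prob, i / n_members)
        if sel.any():
            mapping[i] = train_obs[sel].mean()   # observed frequency in bin
    return mapping

def calibrate(prob, mapping, n_members=11):
    """Apply the fitted mapping to new raw probabilities."""
    idx = np.rint(prob * n_members).astype(int)
    return mapping[idx]
```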


Accurate estimates of precipitation amounts at fine spatial (4 km × 4 km) and temporal resolution are a critical input for hydrologic flood and river flow forecasting models (Droegemeier et al. 2000). Current operational hydrologic models use 6-h accumulations for general flood forecasting (http://www.wrh.noaa.gov/cnrfc/flood_forecasting.html), and they require precipitation intensity for accumulation periods as short as 30 min for flash flood forecasting (Kelsch 2002). With recent improvements in short-range precipitation forecasts generated by mesoscale ensemble systems, it is becoming feasible to consider their use to drive hydrologic models for general flood forecasting. The 4-km Stage4 analyses and the RFC hourly rain gauge reports offer the opportunity to verify and calibrate precipitation forecasts near the requisite minimal temporal and spatial scales. Although we realize that the 24-h accumulation period used in this study has not yet reached the desired temporal scales for many hydrologic applications, we believe the results of this paper support the notion that the time is ripe to pursue an accelerated development of ensemble hydrometeorological prediction systems. Before the goal of a skillful coupled ensemble forecast–runoff system can be routinely realized, however, comprehensive validation studies and multivariate calibration efforts must be performed for atmospheric variables that historically have not been scrutinized. Such studies must be extended to longer forecast projections and the more difficult warm season, plus they should include the impact of analysis/observation uncertainty on verification. In that spirit, verification of the RSM12 ensemble for 6-h accumulations is underway and will be reported in due course.

Acknowledgments. The authors acknowledge the support of NASA EOS-IDS Grant NAG5-3460 and the NSF STC program (Agreement EAR-9876800). The second author (SLM) also received support from ONR N00014-99-1-0181. Computer resources were obtained under support of ONR N00014-00-1-0613. NCEP provided the RSM model. We thank Mr. M. Leuthold and Dr. J. E. Combariza for assisting with the configuration and maintenance of the RSM ensemble system at the University of Arizona. Dr. R. Wobus, Dr. Z. Toth, and others at NCEP provided GSM breeding perturbation code. Y. Lin, B. A. Gordon, and others at NCEP facilitated data access. T. E. Vega, D. K. Braithwaite, J. Broermann, and E. Halper at the University of Arizona assisted with downloading data and providing the coverage information of the river basins. Dr. L. J. Wilson provided ROC curve fitting code. We also thank the two anonymous reviewers for their insightful reviews.


REFERENCES

Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530.

Black, T. L., 1994: The new NMC mesoscale Eta Model: Description and forecast examples. Wea. Forecasting, 9, 265–284.
Brooks, H. E., M. S. Tracton, D. J. Stensrud, G. DiMego, and Z. Toth, 1995: Short-range ensemble forecasting: Report from a workshop, 25–27 July 1994. Bull. Amer. Meteor. Soc., 76, 1617–1624.
Buizza, R., 2001: Accuracy and potential economic value of categorical and probabilistic forecasts of discrete events. Mon. Wea. Rev., 129, 2329–2345.
——, A. Hollingsworth, F. Lalaurette, and A. Ghelli, 1999: Probabilistic predictions of precipitation using the ECMWF ensemble prediction system. Wea. Forecasting, 14, 168–189.
Charba, J. P., D. W. Reynolds, B. E. McDonald, and G. M. Carter, 2003: Comparative verification of recent quantitative precipitation forecasts in the National Weather Service: A simple approach for scoring forecast accuracy. Wea. Forecasting, 18, 161–183.
Daly, C., R. P. Neilson, and D. L. Phillips, 1994: A statistical–topographic model for mapping climatological precipitation over mountainous terrain. J. Appl. Meteor., 33, 140–158.
Droegemeier, K. K., and Coauthors, 2000: Hydrological aspects of weather prediction and flood warnings: Report of the ninth prospectus development team of the U.S. weather research program. Bull. Amer. Meteor. Soc., 81, 2665–2680.
Du, J., and M. S. Tracton, 1999: Impact of lateral boundary conditions on regional-model ensemble prediction. Research Activities in Atmospheric and Oceanic Modelling, H. Ritchie, Ed., Rep. 28, CAS/JSC Working Group on Numerical Experimentation, WMO/TD-No. 942, 6.7–6.8.
——, and ——, 2001: Implementation of a real-time short-range ensemble forecasting system at NCEP: An update. Preprints, Ninth Conf. on Mesoscale Processes, Ft. Lauderdale, FL, Amer. Meteor. Soc., 355–356.
——, S. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125, 2427–2459.
Eckel, F. A., and M. K. Walters, 1998: Calibrated probabilistic quantitative precipitation forecasts based on the MRF ensemble. Wea. Forecasting, 13, 1132–1147.
Epstein, E. S., 1969: Stochastic dynamic prediction. Tellus, 21, 739–759.
Fritsch, J. M., and Coauthors, 1998: Quantitative precipitation forecasting: Report of the eighth prospectus development team, U.S. weather research program. Bull. Amer. Meteor. Soc., 79, 285–299.
Fulton, R. A., J. P. Breidenbach, D.-J. Seo, D. A. Miller, and T. O'Bannon, 1998: The WSR-88D rainfall algorithm. Wea. Forecasting, 13, 377–395.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.
——, 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560.
——, and S. J. Colucci, 1998: Evaluation of Eta–RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–724.
——, S. L. Mullen, C. Snyder, Z. Toth, and D. P. Baumhefner, 2000: Ensemble forecasting in the short to medium range: Report from a workshop. Bull. Amer. Meteor. Soc., 81, 2653–2664.
——, J. S. Whitaker, and C. Snyder, 2001: Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter. Mon. Wea. Rev., 129, 2776–2790.
Higgins, R. W., W. Shi, E. Yarosh, and R. Joyce, 2003: Improved United States Precipitation Quality Control System and Analysis. NCEP/CPC Atlas 7, 47 pp. [Available online at http://www.cpc.ncep.noaa.gov/research_papers/ncep_cpc_atlas/7/index.html.]
Hoffman, R. N., and E. Kalnay, 1983: Lagged average forecasting, an alternative to Monte Carlo forecasting. Tellus, 35A, 100–118.


Hong, S.-Y., and A. Leetmaa, 1999: An evaluation of the NCEP RSM for regional climate modeling. J. Climate, 12, 592–609.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley and Sons, 240 pp.
Juang, H.-M. H., 2000: The NCEP mesoscale spectral model: A revised version of the nonhydrostatic regional spectral model. Mon. Wea. Rev., 128, 2329–2362.
——, and M. Kanamitsu, 1994: The NMC nested regional spectral model. Mon. Wea. Rev., 122, 3–26.
——, S.-Y. Hong, and M. Kanamitsu, 1997: The NCEP regional spectral model: An update. Bull. Amer. Meteor. Soc., 78, 2125–2143.
Kalnay, E., 2003: Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 341 pp.
Kanamitsu, M., 1989: Description of the NMC global data assimilation and forecast system. Wea. Forecasting, 4, 335–342.
Kelsch, M., 2002: COMET flash flood cases: Summary of characteristics. Preprints, 16th Conf. on Hydrology, Orlando, FL, Amer. Meteor. Soc., CD-ROM, 2.1.
——, 2004: A review of some significant urban floods across the United States in 2003. Preprints, 2004 AMS Annual Weather Review Preliminary Program, Seattle, WA, Amer. Meteor. Soc., 2–3.
Leith, C. E., 1974: Theoretical skill of Monte Carlo forecasts. Mon. Wea. Rev., 102, 409–418.
Maddox, R. A., J. Zhang, J. J. Gourley, and K. W. Howard, 2002: Weather radar coverage over the contiguous United States. Wea. Forecasting, 17, 927–934.
Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF ensemble prediction system: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119.
Mullen, S. L., and R. Buizza, 2001: Quantitative precipitation forecasts over the United States by the ECMWF ensemble prediction system. Mon. Wea. Rev., 129, 638–663.
——, J. Du, and F. Sanders, 1999: The dependence of ensemble dispersion on analysis–forecast systems: Implications to short-range ensemble forecasting of precipitation. Mon. Wea. Rev., 127, 1674–1686.
Mureau, R., F. Molteni, and T. N. Palmer, 1993: Ensemble prediction using dynamically-conditioned perturbations. Quart. J. Roy. Meteor. Soc., 119, 299–323.
Murphy, A. H., 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.
——, and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
Palmer, P. L., 1988: The SCS snow survey water supply forecasting program: Current operations and future directions. Proc. 56th Annual Western Snow Conf., Kalispell, MT, Western Snow Conference, 43–51.
Palmer, T. N., F. Molteni, R. Mureau, R. Buizza, P. Chapelet, and J. Tribbia, 1993: Ensemble prediction. Proc. ECMWF Seminar on Validation of Models over Europe, Vol. 1, ECMWF, Shinfield Park, Reading, UK, 21–66.
Pielke, R. A., Jr., and M. W. Downton, 2000: Precipitation and damaging floods: Trends in the United States, 1932–97. J. Climate, 13, 3625–3637.
Reynolds, D., 2003: Value-added quantitative precipitation forecasts: How valuable is the forecaster? Bull. Amer. Meteor. Soc., 84, 876–878.
Richardson, D. S., 2000: Skill and economic value of the ECMWF ensemble prediction system. Quart. J. Roy. Meteor. Soc., 126, 649–668.
Rogers, E., D. G. Deaven, and G. S. DiMego, 1995: The regional analysis system for the operational "early" Eta Model: Original 80-km configuration and recent changes. Wea. Forecasting, 10, 810–825.
Sanders, F., 1986: Trends in skill of Boston forecasts made at MIT, 1966–84. Bull. Amer. Meteor. Soc., 67, 170–176.
Serreze, M., M. Clark, R. Armstrong, D. McGinnis, and R. Pulwarty, 1999: Characteristics of western U.S. snowpack telemetry (SNOTEL) data. Water Resour. Res., 35, 2145–2160.
Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. World Meteorological Organization, World Weather Watch Rep. 8, Tech. Doc. 358, 114 pp.
Stensrud, D. J., H. E. Brooks, J. Du, M. S. Tracton, and E. Rogers, 1999: Using ensembles for short-range forecasting. Mon. Wea. Rev., 127, 433–446.
Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330.
——, and ——, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319.
——, ——, M. S. Tracton, R. Wobus, and J. Irwin, 1997: A synoptic evaluation of the NCEP ensemble. Wea. Forecasting, 12, 140–153.
——, Y. Zhu, I. Szunyogh, M. Iredell, and R. Wobus, 2002: Does increased model resolution enhance predictability? Preprints, Symp. on Observations, Data Assimilation, and Probabilistic Prediction, Orlando, FL, Amer. Meteor. Soc., CD-ROM, J1.9.


Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. Wea. Forecasting, 8, 379–400.
Wallace, J. M., 1975: Diurnal variations in precipitation and thunderstorm frequency over the conterminous United States. Mon. Wea. Rev., 103, 406–419.
Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. Mon. Wea. Rev., 129, 729–747.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. International Geophysics Series, Vol. 59, Academic Press, 467 pp.
Wilson, L. J., 2000: Comments on "Probabilistic predictions of precipitation using the ECMWF ensemble prediction system." Wea. Forecasting, 15, 361–364.
Young, C. B., A. A. Bradley, W. F. Krajewski, A. Kruger, and M. L. Morrissey, 2000: Evaluating NEXRAD multisensor precipitation estimates for operational hydrologic forecasting. J. Hydrometeor., 1, 241–254.
Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73–83.