RELIABILITY OF STATISTICAL SOFTWARE

OLUWAROTIMI O. ODEH, ALLEN M. FEATHERSTONE, AND JASON S. BERGTOLD

Key words: accuracy, benchmarking, linear, nonlinear, reliability, statistical software. JEL codes: C80.

The ability to combine econometric techniques, economic theory, and data to analyze societal problems is one of the major strengths of applied economics. The inability of applied economists to perform this task well in some cases has been linked to the fragility of econometric results (Leamer 1983; Tomek 1993). While small changes in model specification may result in changes in empirical results, McCullough and Vinod (1999) have shown that different software packages might give different results to the same problem. An example of this is a scenario where two authors, solving the same problem but using different software packages, report different results. In such a situation, either the packages (or at least one of them) are inaccurate or one or both authors have not properly documented the statistical procedures used. McCullough and Vinod (1999) argue that important empirical results might be sensitive to the choice of software, potentially weakening the value of applied econometrics (Tomek 1993). Unfortunately, benchmarking statistical software is time consuming. In most cases, a researcher is more concerned with

Oluwarotimi Odeh is an assistant professor at Virginia State University; Allen Featherstone and Jason Bergtold are, respectively, professor and assistant professor at Kansas State University. The authors would like to acknowledge the helpful comments of John Crespi, Terry Kastens, Abdullahi Abdulkadri, two anonymous reviewers, and Paul Preckel.

the research problem and assumes that the estimation results obtained from a built-in estimation routine are reliable. Often, the researcher tries to find errors in the statistical procedures rather than considering the software as a possible source of error. In addition, many economists argue that examining the reliability of software packages should be left to computer engineers, since the economics profession is less trained in this area. According to McCullough and Vinod (1999), this has made the problem of software inconsistencies more prevalent than it ought to be. Economists may place more emphasis on speed and user-friendliness, but accuracy and reliability may suffer as a result. The foregoing arguments raise questions about whether problems identified in previous studies still exist. Another issue is whether problems are limited to the packages already examined. For example, Matrix Laboratory (MATLAB) and the General Algebraic Modeling System (GAMS), thought of primarily as optimization software, have not to our knowledge been examined in this respect. Generally, while impressive developments have been made in computer technology, there is a need to examine the reliability of statistical software packages on an ongoing basis and determine whether researchers are getting reliable and accurate estimates. Estimation accuracy is important because considerable value (predictive, policy, analytical, etc.) is

Amer. J. Agr. Econ. 92(5): 1472–1489; doi: 10.1093/ajae/aaq068 Received July 2006; accepted June 2010; published online September 9, 2010 © The Author (2010). Published by Oxford University Press on behalf of the Agricultural and Applied Economics Association. All rights reserved. For permissions, please e-mail: [email protected]


The reliability of several statistical software packages was examined using the National Institute of Standards and Technology linear and nonlinear least squares datasets and models. The software tested includes Excel 2007, GAMS 23.4, GAUSS 9.0, LIMDEP 8.0, Mathematica 7.0, MATLAB 7.5, R 2.10, SAS 9.1, SHAZAM 10, and Stata 10. While some of these packages have been previously examined, others, including GAMS and MATLAB, have not been extensively examined. Reliability tests indicate improvements in some of the software packages that were previously tested, but some of these packages failed reliability tests under certain conditions. The findings underscore the need to benchmark software packages to ascertain reliability before use and the importance of solving econometric problems using more than one package.


Previous Studies

Several previous studies have examined the efficiency of various mathematical programming algorithms (Land and Powell 1973; Ibaraki 1976; Benichou et al. 1977; Crowder, Dembo, and Mulvey 1978). Tice and Kletke (1984) examined the reliability of linear programming packages used in the applied economics profession, focusing on three versions of IBM's linear programming software. They reported that repeated estimations of the problem yielded different optimum values. In fact, they stated that "it has been generally assumed by researchers that widely used computer software packages written and supported by reputable firms give reliable, consistent,


and accurate solutions" (Tice and Kletke 1984, p. 104), and hence raised questions about why widely used analytical software packages may fail. Tice and Kletke further recommended that users of the examined version of IBM's Mathematical Programming System Extended should consider using other alternatives. Since the study by Tice and Kletke (1984), optimization software has seen tremendous advancements and improvements. In addition, with the increased use of nonlinear econometric methods, nonlinear optimization has become more important in econometrics. However, research has shown that available software packages are not foolproof and may not be as efficient and accurate as researchers assume (Sawitzki 1994; McCullough 1999a, 1999b; McCullough and Wilson 2002). Compounding errors, convergence problems, and errors due to how software packages read, interpret, and process data impact the value of analytical results (Tomek 1993; Dewald, Thursby, and Anderson 1986). It is obvious that incorrect parameter estimates, when used for the desired purpose (policy, analytical, or predictive), will deviate from the reality the researcher intends to model. When researchers take results as foolproof, without rigorous cross-program testing and validation, parameter estimates and the implications drawn from them may be flawed (Tomek 1993). Tomek (1993) underscores the importance of conducting confirmation research and replication to confirm published results. More importantly, he identifies the use of alternative estimators as one of the major causes of divergent econometric results. Dewald, Thursby, and Anderson (1986) reported errors in privately written computer codes. An earlier version of SAS was reported to compute the Durbin-Watson statistic incorrectly (Tomek 1993). As a result, Tomek (1993) and Dewald, Thursby, and Anderson (1986) urged professional journals to require authors to "submit programs and data" and give details on data transformations, model restrictions, and estimators.1 Many software developers, such as LIMDEP/NLOGIT, SAS Institute (2003), Stata (2007), and Time Series Processor (TSP; 2010) now provide evidence of reliability, while

1 The Journal of Money, Credit and Banking, following Dewald, Thursby, and Anderson (1986), started requesting authors to submit their data, while the National Science Foundation started archiving authors’ data at the University of Michigan’s InterUniversity Consortium for Political and Social Research (McCullough and Vinod 1999).


placed on the estimates derived from research results. If these results vary with the statistical procedures adopted, software used, and the programming codes/options specified, the credibility of these results could be weakened. The objective of this study is to examine the reliability of ten selected software packages (Excel 2007, GAMS 23.4, GAUSS 9.0, LIMDEP 8.0, Mathematica 7.0, MATLAB 7.5, R 2.10, SAS 9.1, SHAZAM 10, and Stata 10) that applied economic researchers commonly use in data analyses. Specifically, the accuracy of estimation results from these packages is assessed using the National Institute of Standards and Technology (NIST) linear and nonlinear regression benchmark tests. Our focus is on the accuracy of results of the packages and whether errors are correctly identified and reported during estimation of linear and nonlinear regression models. This article contributes to existing literature by conducting a comparative examination of the reliability of commonly used packages in agricultural economics, and applied economics in general. In addition, we examine whether observed inadequacies noted in earlier editions have been corrected in the latest versions of these packages. For instance, previous studies have examined Excel 2003 and 2007 (Keeling and Pavur 2007; McCullough and Heiser 2008; McCullough and Wilson 2005), SHAZAM 6.0 and 8.0 (McCullough 1999a; Silk 1996), and GAUSS 3.27 and 8.0 (Vinod 2000; Yalta 2007). A comparison of the reliability test results of these latest versions with earlier ones in the literature provides a means of verifying whether earlier inadequacies have been corrected.


2 Some of these packages discuss limitations of some of their programming procedures, functions, and commands in their respective help sections. SAS and MATLAB particularly have extensive help facilities. 3 More often, computation errors do not offset one another. Rather they build up as an increasing function of the number of successive operations; total error is of the order Ne, where N is the number of operations and e is the error. See McCullough and Vinod (1999, p. 642) for further discussion.
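Footnote 3's point that computational error builds up with the number of operations, together with the truncation and cancellation errors discussed below, is easy to reproduce. The following is a minimal sketch (plain Python, written purely for illustration; it is not code from any of the packages examined):

```python
import random

# (1) Accumulated truncation error, in the spirit of the Vancouver Stock
# Exchange episode discussed below: update an index 10,000 times with small,
# zero-drift moves, truncating to three decimals after every update.
random.seed(0)
exact = truncated = 1000.0
for _ in range(10_000):
    r = 1.0 + random.uniform(-5e-4, 5e-4)         # tiny up/down move, no drift on average
    exact *= r
    truncated = int(truncated * r * 1000) / 1000  # truncate (not round) to 3 decimals
print(round(exact, 3), truncated)                 # exact stays near 1,000; truncated ends lower

# (2) Catastrophic cancellation: the one-pass variance formula subtracts two
# nearly equal large numbers when the data have little relative variation.
data = [1e9 + i for i in range(10)]               # true sample variance is 9.1666...
n = len(data)
mean = sum(data) / n
one_pass = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)
print(one_pass, two_pass)                         # one-pass is dominated by rounding error
```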

Desk, Excel, Interpretive Software Products, Generalized Linear Interactive Modelling, SAS, Statistical Package for the Social Sciences [SPSS], S-Plus, and Statgraphics) and reported that all the packages failed basic computational tests. McCullough (1999a) identified weaknesses in (a) random number generators of SAS, SPSS, and S-Plus; (b) the nonlinear least squares routines and the one-way analysis of variance in SAS and SPSS; and (c) the correlation procedure in S-Plus. McCullough (1999a) examined the reliability of random number generators in SAS, SPSS, and S-Plus and found that each of the random number generators suffered from flaws. L'Ecuyer and Simard (2007) developed a software library for testing uniform random number generators and tested a number of random number generators that are widely used. They found that the default generators of Excel, MATLAB, Mathematica, and R failed several of their tests. McCullough (2006) provides some guidance on selecting random number generators for the practicing economist. He argues that economists should "avoid random number generators that fail 'reasonable' tests" (p. 462). In addition to random number generators, there are also issues with statistical distributions. McCullough (1999a) proposed using both ELV (Knüsel 2003) and DCDFLIB (Double Precision Cumulative Distribution Function Library) (Brown 2003) for testing statistical distributions. He found that the normal distribution seems to be correct for SAS, SPSS, and S-Plus. The chi-square was correct for SAS. There were issues for the F-distribution for SAS and SPSS. The Student's t distribution and the chi-square distribution had issues for S-Plus. Knüsel (1995) examined GAUSS statistical distributions and found that the accuracy of noncentral distributions was suspect. Knüsel (1998, 2003, 2005) and Yalta (2008) examined various versions of Excel for issues with statistical distributions and found problems for the binomial, Poisson, inverse standard normal, beta, Student's t, and F-distributions. Silk (1996) compared SAS, SHAZAM, and TSP based on estimation of linear systems. The study reported that while the software packages generally produced identical results for such routines as two-stage least squares, three-stage least squares, and seemingly unrelated regression, performance differences did occur between the limited information maximum likelihood and the full information maximum likelihood procedures.


others state limitations2 of their products in specific areas. Simon and Lesage (1988) found that computational inaccuracies can occur due to truncation, cancellation, and accumulation errors. Truncation errors occur when a package handles or stores decimal numbers incorrectly. This type of error can propagate into larger errors. Cancellation errors occur when the data being used have a low level of relative variation. Errors that occur in direct proportion to the total number of arithmetic computations in the estimation process are referred to as accumulation errors. These types of errors will be in direct proportion to the total number of observations in each dataset in the univariate case. McCullough and Vinod (1999) further discuss more serious situations of truncation or rounding error that arise in cases where errors occur at the intermediate stages of estimation and eventually conflate to give entirely different results. They provide a case in point. The Vancouver Stock Exchange index began with a value of 1,000.000 and was recalculated to four decimal places, but truncated to three decimal places after every transaction. After some months the index had fallen to 520 in the face of stable economic activity. The index was later recalculated correctly and found to be 1098.892 after reducing the truncation errors. Whenever errors occur in computation, a proportion of the final result constitutes error. Determining what proportion is correct is a difficult issue, and the potential error may not be insignificant. Specifically, the general simplifying assumption of an independent error structure might break down when the computation error propagates3 itself through the entire estimation process, resulting in the errors being neither independent nor uncorrelated. More recent studies have reported inadequacies in how some software packages perform statistical computations, leading to inaccuracies in the computed results. Sawitzki (1994) conducted a series of tests on the numerical reliability of nine packages (BMDP, Data


"Mathematica achieves unparalleled accuracy and reliability on the NIST [Statistical Reference Dataset] and on the ELV benchmark for statistical distributions" (pp. 295–6). The random number generators did have some weaknesses, but he argued that those issues can be readily handled. Nerlove (2005) examined Mathematica 5.0's statistical add-on package, testing it with the NIST linear and nonlinear regression benchmarks, datasets, and models.4 He found that Mathematica performed well with the linear regression models. In addition, he found that in all but five of the twenty-seven nonlinear regression models, Mathematica converged to the certified values using both sets of starting points ("near" and "far") provided with the NIST benchmark datasets. Keeling and Pavur (2007) conducted a comparative examination of nine statistical software packages (SAS 9.1, SPSS 12.0, Excel 2003, Minitab 14.0, Stata 8.1, S-Plus 6.2, R 1.9.1, JMP 5.0, and StatCrunch 3.0) using four suites (univariate summary statistics, analysis of variance, linear regression, and nonlinear regression) in the NIST benchmark tests. They reported that substantive improvements have been made in recent versions of some of the packages (Excel 2003, SAS, SPSS, and S-Plus) when compared with older versions. However, they cautioned that inaccuracies still existed in some of the packages. McCullough and Heiser (2008) examined the accuracy of Excel 2007. They found that it failed accuracy tests for statistical distributions, random number generation, and estimation. They argue: "Given Microsoft's track record, we recommend that no statistical procedure be used" (p. 4570). They determined that the Excel add-on Solver has a tendency to stop and declare the result a solution at a point that is not an optimal solution (i.e., a point where the gradient is not equal to zero). McCullough and Heiser (2008) found that Solver was able to reliably estimate sixteen of the twenty-seven nonlinear problems from the NIST benchmark tests correctly.

Procedures and Data

The ten software packages examined in this study are (a) Microsoft Excel 2007 version 12.0.6524.5003 SP2; (b) GAMS version 23.4.3, May 24, 2010; (c) GAUSS version 9.0.3, 2008; (d) LIMDEP/NLOGIT version 9.0/4.3, March 4

The add-on is now part of the base Mathematica software.


In two related studies, McCullough (1998, 1999a) assessed the reliability of three statistical software packages (SAS, SPSS, and S-Plus), focusing on three areas: estimation (linear and nonlinear), random number generation, and statistical distribution. Weaknesses were reported in each of the three areas, but the packages performed well when estimating univariate statistics and linear regression models. The author suggested that conducting more reliability tests would improve the quality of statistical software, thereby remedying situations in which "different packages give different answers to the same problem" (McCullough 1999a, p. 158). A similar conclusion was drawn by McCullough (1999b) from a study of EViews, LIMDEP, SHAZAM, and TSP. The author recommended that default nonlinear options (e.g., algorithmic choice, tolerance level, starting point) should not be relied upon. McCullough and Wilson (1999) examined the accuracy of the statistical procedures in Microsoft Excel 97. They found that Excel performed poorly in linear and nonlinear estimation, random number generation, and statistical distributions (calculation of p-values). The authors recommended that Excel should not be used for statistical analysis. In a follow-up study, McCullough and Wilson (2002) examined the accuracy of statistical procedures in Excel 2000 and Excel XP. The authors reported that flaws identified in Excel 4.0 by Sawitzki (1994), including the instability of the sample variance estimator, were still present in Excel 2000 and Excel XP. McCullough and Wilson (2002) further concluded that though slight improvement was noticed in some aspects of Excel's computations, inaccuracies still existed in the latest versions. In yet another study, McCullough and Wilson (2005) reported that problems identified in previous versions of Excel were not rectified in Excel 2003. They stated that the package still performed poorly in solving nonlinear problems and advised users against using it. McCullough (2004) assessed the numerical accuracy of EViews 3.0, LIMDEP 7.0, RATS (Regression Analysis of Time Series) 4.3, SHAZAM 8.0, and TSP 4.4. He reported that though the use of reliability tests has influenced software developers to improve these packages, many deficiencies still existed. He called for the examination of the new versions of these packages to ascertain whether identified flaws have been fixed. McCullough (2000) examined the accuracy of Mathematica 4. He concluded that


NIST Datasets

NIST offers five suites of benchmark datasets for reliability testing (National Institute of Standards and Technology, 2003).6 The five suites are univariate summary statistics, analysis of variance, linear regression, Markov chain Monte Carlo estimation, and nonlinear regression. The data in the suites either are generated or are from experiments (Altman and McDonald 2001). The NIST website offers a compilation of eleven linear and twenty-seven nonlinear least squares problems. The site provides test problems with three levels7

5 While efforts were made to use the latest versions available, we recognize that more recent versions (which we hope might have addressed these identified anomalies) may be available. 6 For those that are interested in the univariate summary statistics and the analysis of variance results, Keeling and Pavur (2007) provide a good discussion for commonly used econometric packages. 7 Classifying the datasets (and the respective models) into levels of difficulty is expected to be a rough guide to users. According

of difficulty (low, average, and high), two sets of starting values for nonlinear problems (near and far), the data (in American Standard Code for Information Interchange format), and certified results8 (providing up to fifteen and eleven significant digits for the linear and nonlinear procedures, respectively), making it easy to compare results across software packages (McCullough 1999a). According to Altman, Gill, and McDonald (2004), nonlinear problems are more complex to estimate than the other suites due to the need for a set of starting values, user-specified options that may affect solver performance, and selection of the type of algorithm used to solve the problem. The two sets of starting values provided by NIST include one set that is "far" from the solution and one set that is "near" to the solution. Regarding the options, nonlinear optimization packages sometimes offer flexibility to the user such as the convergence criteria, the method for calculating derivatives, and the algorithmic choice for finding the optimum. In addition, some packages allow the user to specify the method for calculating standard errors.

Accuracy Measures

For a measure of accuracy of the estimation results, the base 10 logarithm of relative error (LRE) is commonly used to examine statistical software reliability (McCullough and Wilson 1999; McCullough 1998). This accuracy measure is used in assessing the reliability of estimation results for the NIST test problems for all the software packages. The LRE is specified as

(1)   LRE = −log10(|q − c| / |c|)

where q is the estimated value and c is the correct (certified) value. Whenever c equals zero, the LRE measure is undefined; in such a case, the Log Absolute Error (LAE = −log10|q|) is used. We refer to both as LRE. Only the first nonzero digit and the succeeding digits are considered.9 The LRE measures the number of

to the NIST, solving a problem at the high-difficulty level is not a guarantee that a given package will solve another problem in the average or low class or vice versa. 8 The problems were solved using multiple precision systems with different algorithms and different implementations to ensure accuracy. For a more detailed description of the problem selection, difficulty rating, solution method, and other computational details, see Rogers et al. (1998) and McCullough (1998). 9 For a more detailed discussion on LRE, see McCullough (1998, 1999a).


1, 2010; (e) Mathematica version 7.0.1.0, 2009; (f) MATLAB version 7.5.0.0342 (R2007b), August 15, 2007; (g) R version 2.10.0, released October 26, 2009; (h) SAS version 9.1.3 Service Pack 4; (i) SHAZAM version 10 (SP1), 2006; and (j) Stata/IC version 10, March 3, 2008. The choice of these packages was based on their wide usage for statistical and econometric analyses.5 While GAMS, Excel, MATLAB, and Mathematica are not normally considered econometric software packages, they have gained increased use among applied economists because of their distinct capabilities in performing specific tasks. For instance, Excel was examined because of its wide and extensive use by economists and researchers as a primary computer package for statistical and quantitative analyses, research, and instructional purposes. MATLAB has a statistical toolbox available that has been extensively expanded in recent years. Economists also carry out econometric modeling and analyses using Mathematica and MATLAB because of their data manipulation and powerful matrix capabilities. Also, GAMS and Mathematica are commonly used by applied economists working on dynamic programming, risk analysis, neural networks, and other analytical techniques, although users often need to manually program relevant statistical measures (e.g., parameter standard errors), since the packages do not always have built-in routines.


Discussion of Statistical Packages and Estimation

Table 1 summarizes the software and settings used to solve the linear and nonlinear NIST problems for all ten software packages tested, including: procedures used; nonlinear algorithms examined; method of standard error estimation; and optimality tolerance levels. When built-in linear and nonlinear least squares statistical procedures were not available for a given software package, the procedures were manually coded. In Excel 2007, nonlinear least squares benchmarks were evaluated by manually coding the nonlinear least squares problem and then solving each using the Solver Add-In. For GAMS, all linear and nonlinear problems were manually coded and solved using the MINOS and CONOPT solvers. There are a number of nonlinear solvers available for consideration in GAMS (GAMSIDE 2010). Two algorithms were examined in this study, CONOPT3 and GAMS/MINOS 5.51. CONOPT is argued to be well suited for models with very nonlinear constraints, while MINOS is argued to be well suited to handle problems with little nonlinearity outside of the objective function. Because the econometric problems are formulated as unconstrained, MINOS may have an advantage; however, CONOPT has more built-in tests for poorly scaled models. CONOPT allows selection of the method for calculating the search directions with conjugate


gradient, reduced Hessian, steepest descent, and quasi-Newton methods available (Gill, Murray, and Wright 1981). The solution method in CONOPT is selected dynamically based on algorithmic performance. In addition, CONOPT has an automatic scaling feature. MINOS uses a combination of quasi-Newton and reduced gradient algorithms to solve nonlinear problems. The other software packages use more conventional algorithms, including: quasi-Newton, Gauss-Newton, conjugate gradient, and Levenberg-Marquardt. In assessing the reliability of the software packages, when alternative algorithmic and tolerance options were available, these were explored to obtain results closer to the certified values. While maximizing the performance of the package can be done when certified values are known, this can be problematic in economic problems when they are unknown. This approach allows us to examine the algorithmic options necessary to solve linear and nonlinear regression problems. The algorithms listed in Table 1 were examined for each nonlinear least squares problem. In addition, tolerance levels were increased or chosen to maximize the accuracy of the estimation procedure being tested. It should be noted that the software packages default to performing numerical computations using double precision floating point accuracy. It should be noted that SHAZAM offers quadratic precision and, in some cases, that improves the performance of the estimators. For all algorithms, derivatives were calculated using numerical methods (differencing) except for GAMS, which calculates analytic derivatives by default. In Excel, the "automatic scaling" option in the Solver Add-In was used when it improved estimation results. Not all of the software packages provided estimates of standard errors. When standard errors were not provided in statistical packages, they were manually estimated using the estimator:

(2)   Vs(β̂) = s²(X̂′X̂)⁻¹

where s² = (n − k)⁻¹SSR(β̂), n is the number of observations, k is the number of parameters being estimated, SSR(β̂) is the residual sum of squares evaluated at β̂, and X̂ is the Jacobian of first partial derivatives of the explanatory variables with respect to the conditional mean being estimated (Davidson and MacKinnon 1993). This method for calculating standard


significant digits of the estimation result when it is close to the certified solution. The higher the LRE measure, the better the estimator. An LRE score of 4.70 means that the program is accurate to four significant digits. Whenever the estimation result is far from the correct result, the negative LRE is set equal to zero (McCullough and Wilson 1999). For linear procedures, an LRE score of 9 is desired. That is, it is expected that a reliable algorithm should return nine accurate digits for parameter estimates. A program that fails to give accuracy to nine digits estimating an easy problem should perform less well when faced with more difficult problems. For nonlinear problems, a minimum accuracy (or LRE score) of four digits (an LRE score of 4) is expected (McCullough 1998). These thresholds are used in this study to assess the reliability of the software packages being examined.
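The scoring rule in equation (1) is mechanical to apply. A minimal sketch (plain Python for illustration; this is not code from any of the packages tested, and the 15-digit cap simply reflects the precision of the certified values for the linear problems):

```python
import math

def lre(estimate: float, certified: float) -> float:
    """Log relative error of equation (1); uses the log absolute error (LAE)
    when the certified value is zero, floors negative scores at zero, and caps
    an exact match at the 15 certified digits."""
    err = abs(estimate) if certified == 0.0 else abs(estimate - certified) / abs(certified)
    if err == 0.0:
        return 15.0
    return max(0.0, -math.log10(err))

print(round(lre(1.00002, 1.0), 2))   # 4.7: agreement to roughly four significant digits
```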

Table 1. Software Packages and Settings Used

Software | Version | Linear Least Squares Estimation Procedure | Nonlinear Least Squares Estimation Procedure | Nonlinear Algorithms Examined (a) | Manual Standard Error Estimation (Yes/No) (b) | Tolerance Settings (default) (c)
Excel | 2007 | Regression Tool in the Analysis ToolPak Add-In | Manually coded and solved using the Solver Add-In | (1) Quasi-Newton (default); (2) Conjugate Gradient | Yes | 1e-20 (1e-4)
GAMS | 23.4.3 | Manually coded and solved using licensed algorithms | Manually coded and solved using licensed algorithms | (1) MINOS 5.51; (2) CONOPT 3 | Yes | MINOS: 1e-6 (1e-6); CONOPT: 9e-8 (9e-8)
GAUSS | 9.0 | olsqr2 | CurveFit | (1) Levenberg-Marquardt; (2) Polak-Ribiere Conjugate Gradient (default) | No | Relative gradient: 1e2 to 1e-11 (1e-5)
LIMDEP/NLOGIT | 9.0/4.3 | REGRESS | NLSQ | (1) Gauss-Newton; (2) BFGS Quasi-Newton; (3) Davidon-Fletcher-Powell (default); (4) Berndt, Hall, Hall, and Hausman | No | Parameters: 1e-9 to 1e-15; Obj. function: 1e-9 to 1e-15; Gradient: 1e-9 to 1e-15; Overall: 1e-21 to 1e-30 (1e-4 for all)
Mathematica | 7.0.1.0 | LinearModelFit | NonlinearModelFit | (1) Levenberg-Marquardt; (2) Quasi-Newton; (3) Conjugate Gradient | No | Accuracy goal: 1e-6 (1e-8); Precision goal: 1e-6 (1e-8)
MATLAB | R2007b (7.5.0.0342) | lscov | nlinfit (Statistics Toolbox) | Levenberg-Marquardt (only option) | No | Parameter bound: 1e-20; Obj. function: 1e-20; Parameters: 1e-20 (1e-8 for all)
R | 2.10.0 | lm | nls | (1) Gauss-Newton (default); (2) "nl2sol" algorithm from the PORT library (a) | No | Relative gradient: 1e-4 to 1e-20 (1e-5); Min factor: 1e-4 to 1e-12 (9.765625e-4)
SAS | 9.1.3 Service Pack 4 | PROC REG | PROC NLIN | (1) Gauss-Newton (default); (2) Marquardt | No | Parameter: 1e-5 to 1e-11 (1e-5)
SHAZAM | 10 (SP1) | ols | nl | DFP Quasi-Newton | Yes and No | Parameters: 1e-6 or 1e-11 (1e-5); Step size: 1e-4 or 1e-6 (1e-4)
Stata/IC | 10 | reg | nl | Gauss-Newton (only option) | No | Parameters: 1e-15; Obj. function: 1e-15 (1e-10 for all)

Note: (a) In R, the "nl2sol" algorithm from the PORT library is an option developed by Bell Labs (Gay 1990). (b) Standard errors were estimated using the Jacobian of the estimated function with respect to the parameters, following Davidson and MacKinnon (1993). (c) Mathematica defaults to half of the machine precision for its default tolerance criteria.


Results

Each of the NIST linear and nonlinear models was estimated for each of the ten statistical software packages. When multiple methods were used within a given package, the "best" LRE was reported. We first discuss the linear least squares regression results and then present the nonlinear least squares regression results.

Linear Regression

The linear regression minimum LRE results for the parameter estimates and their respective standard errors are provided in table 2. An entry of NS in the table indicates that

the package did not converge to a solution and that the package flagged this with an error message. A score of 0.0 indicates that the LRE was negative or far from the solution and that the package did not note problems with the solution. LREs greater than 9 meet McCullough's (1998) standard for linear regression. Excel, GAUSS, GAMS, LIMDEP, Mathematica, MATLAB, R, and SHAZAM met the minimum LRE in eight of eleven cases. Stata met the criteria in six of eleven cases. SAS met the criteria in five of eleven cases. The results for SAS and Stata are nearly identical to those obtained by Keeling and Pavur (2007), while Excel 2007 performed a bit better than prior versions of Excel. Each of the packages performed well on parameter estimation for the models in the low and average difficulty categories. None of the packages met the LRE standard of 9 for Filip,10 Wampler4, or Wampler5, which was consistent with results found by Keeling and Pavur (2007). GAUSS, MATLAB, R, SAS, and Stata were unable to get solutions for Filip (which was a tenth-order polynomial regression, linear in the parameters). Stata dropped two variables, while the other packages indicated an issue with singularity in the design matrix. According to Murray (1972), a nonlinear procedure can fail in at least one of two ways: when a computer reports an error message after a failed attempt at solving the problem or when the computer incorrectly solves the problem and reports the results without any error message. Murray referred to the former as "miserable failure" and the latter as "disastrous failure." The LRE for the standard errors generally followed the ability of the package to calculate the parameter estimates (table 2). The standard errors were calculated using the package defaults, except for GAMS, for which they were manually coded. GAMS was unable to calculate standard error estimates for Filip due to an inability to invert the X̂′X̂ matrix in equation (2). GAUSS and MATLAB did not perform as well on the Wampler1 model.

Nonlinear Regression

When comparing results with previous studies, it is important to note that the performance of nonlinear solvers is much more sensitive to user choices. For example, Keeling and Pavur

10 McCullough (1999b, p. 195) reported that EViews and TSP could not compute the Filip model because of "near-singularity of the design matrix."


errors is used by NIST to calculate the certified values for the standard errors for nonlinear least squares (National Institute of Standards and Technology 2003). To calculate the matrix inversion in GAMS, a system of equations was solved (i.e., A⁻¹A = I, where I is the identity matrix and A is X̂′X̂) using the constrained nonlinear system (CNS) procedure (GAMS online documentation, http://www.gams.com/docs/document.htm). Given issues with estimation of standard errors in nonlinear models in SHAZAM, both the reported and manually estimated standard errors are presented. SHAZAM offers several methods of calculating the standard error within its nonlinear least squares procedure. We chose the option of using numeric differences to compute the covariance matrix after estimation (NUMCOV) in most cases. We note that nonlinear least squares solvers in all packages are particularly sensitive to the starting values, making it difficult to find a global minimum in the presence of multiple local minima. All nonlinear models are estimated for both the "far" and "near" starting points provided in the NIST datasets. In general, it is more probable that a particular code will fail when the starting values are far from the solution than when the starting values are relatively close. However, the seriousness of the failure depends largely on the problem being solved, and this makes the NIST suites useful in our test of software reliability. The naïve expectation is that solvers starting from values that are close to the certified values should produce more accurate results.
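The manual calculation in equation (2), together with the solve-a-system-instead-of-inverting idea used for GAMS above, can be sketched as follows (numpy assumed; an illustration of the formula, not the code of any package examined):

```python
import numpy as np

def nls_standard_errors(residuals: np.ndarray, jacobian: np.ndarray) -> np.ndarray:
    """Equation (2): Vs(beta_hat) = s^2 (X_hat' X_hat)^(-1), with s^2 = SSR/(n - k)
    and X_hat the Jacobian of the fitted function at the estimates."""
    n, k = jacobian.shape
    s2 = residuals @ residuals / (n - k)
    xtx = jacobian.T @ jacobian
    # Solve X'X C = s^2 I rather than forming an explicit inverse, mirroring the
    # A^(-1)A = I system described above for GAMS.
    cov = np.linalg.solve(xtx, s2 * np.eye(k))
    return np.sqrt(np.diag(cov))

# Ordinary least squares is the special case in which the Jacobian is the design matrix.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, nls_standard_errors(y - X @ beta, X))
```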


Table 2. Minimum LRE for Parameter Estimates and Standard Errors for StRD Linear Regression Models (columns: Excel, GAMS, GAUSS, LIMDEP, Mathematica, MATLAB, R, SAS, SHAZAM, Stata)

Parameter Estimates 14.0 13.2

12.9 12.5

12.8 12.6

12.2 11.9

12.9 13.1

12.3 12.3 12.3 11.5

13.8 13.0

12.8 14.3

14.7 15.0

14.7 15.0

14.7 15.0

14.7 15.0

14.7 15.0

14.7 14.7 15.0 15.0

14.7 15.0

14.7 15.0

7.0 10.9 10.8 13.2 10.3 8.6 6.1

0.0 11.2 9.3 13.4 9.6 8.3 6.3

7.5 14.4 9.7 13.9 9.6 7.4 6.7

7.3 11.8 9.6 12.9 9.3 7.8 6.1

NS 11.2 9.3 12.3 9.4 8.2 6.2

NS NS 12.9 8.6 9.3 6.6 12.9 9.6 9.6 6.6 8.3 6.6 6.3 6.6

7.9 11.5 10.7 13.1 10.1 7.9 5.9

NS 12.1 6.9 9.7 6.5 6.5 6.4

Standard Error Estimates Low difficulty Norris 14.2 Pontius 12.7 Average difficulty Noint1 15.0 Noint2 14.8 High difficulty Filip 7.2 Longley 14.8 Wampler1 10.4 Wampler2 15.0 Wampler3 11.4 Wampler4 11.8 Wampler5 12.0

13.8 12.3

14.4 12.5

13.6 14.4

13.8 14.4

10.8 8.2

13.5 11.8 12.6 8.6

13.6 13.9

13.5 13.0

15.0 15.0

14.4 15.0

15.0 15.0

15.0 15.0

13.1 14.0

15.0 14.0 15.0 15.0

15.0 15.0

15.0 15.0

NS 7.9 15.0 15.0 7.3 5.3 6.1

0.0 8.6 10.1 14.8 10.6 10.6 10.6

6.8 12.7 9.0 13.9 12.1 12.1 7.6

7.7 12.7 8.9 14.2 13.7 13.7 13.7

NS 10.7 1.6 6.5 10.3 13.4 13.4

NS 14.1 10.1 14.8 14.3 13.8 13.8

7.7 11.5 9.7 14.7 13.7 13.7 13.7

NS 12.9 15.0 15.0 10.8 10.8 10.8

NS 10.3 15.0 15.0 11.2 11.2 11.2

Note: StRD = Statistical Reference Dataset; NS = the package indicated a problem during estimation.
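The Filip failures in table 2 are a conditioning story: the design matrix of a tenth-order polynomial is nearly singular, and any routine that forms the normal equations squares that condition number. A minimal sketch on synthetic data (numpy assumed; this is not the NIST Filip dataset itself):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(4.0, 6.0, 82)                    # narrow x-range, like the harder StRD problems
beta_true = rng.normal(size=11)                  # synthetic 10th-degree polynomial coefficients
X = np.vander(x, 11, increasing=True)
y = X @ beta_true                                # noise-free: only rounding error is at play

print(f"cond(X) ~ {np.linalg.cond(X):.1e}")      # already large; forming X'X squares it

b_normal = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # orthogonal/SVD solver applied to X itself

def worst_relative_error(b):
    return np.max(np.abs(b - beta_true) / np.abs(beta_true))

print(f"normal equations: worst relative error {worst_relative_error(b_normal):.1e}")
print(f"lstsq           : worst relative error {worst_relative_error(b_lstsq):.1e}")
```

Solvers that factor X directly (QR or SVD) typically retain several more correct digits here than those that solve the normal equations, which is one reason otherwise correct implementations can disagree on Filip-type problems.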

(2007) report results using the defaults specified by the package and solutions with those parameters adjusted by the user. McCullough (1999a) indicates that convergence criteria, choice of algorithm, tolerance, and the use of numeric or analytical derivatives can affect the optimal solution (or estimates). In this study, we adjusted the first three choices but always used the default method for calculating derivatives, which were numerical except in the case of GAMS. In addition, it is important to recognize that the actual solution is known. However, most applied economic studies occur without knowing the optimal solution. Therefore, given that multiple solutions can be obtained from the same package under different tolerances, it is imperative to have a process for determining the optimal solution. McCullough (2004) identified four criteria to consider in determining the reliability of

nonlinear results and whether convergence has been achieved. These are gradient information, the trace,11 the Hessian,12 and a profile of the objective function (McCullough 2004). The nonlinear regression minimum LRE results for the parameter estimates are provided in table 3 for both sets of starting values, and the standard error estimates for both sets of starting values are provided in table 4. A value of NS may indicate a multitude of different errors, such as the algorithm hitting the specified iteration limit, false convergence,

11 This refers to the estimation output at each iteration. By default SHAZAM and GAUSS (depending on the algorithm used) offer the trace. 12 Here, the relative gradient (g′H⁻¹g) or the norm of the gradient should be zero (preferably 1e-4 or smaller) and the Hessian should be positive definite for a typical least squares problem. This can be checked by ensuring that all eigenvalues are positive.
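The checks in footnote 12 are straightforward to run once the gradient and Hessian at the reported solution are available. A minimal sketch (numpy assumed; the 1e-4 threshold follows the footnote, everything else is illustrative):

```python
import numpy as np

def convergence_diagnostics(grad: np.ndarray, hess: np.ndarray, tol: float = 1e-4) -> dict:
    """Footnote 12's checks: the relative gradient g'H^(-1)g should be near zero
    and the Hessian should be positive definite (all eigenvalues positive)."""
    rel_grad = float(grad @ np.linalg.solve(hess, grad))
    eigvals = np.linalg.eigvalsh(hess)            # symmetric Hessian in least squares
    return {
        "relative_gradient": rel_grad,
        "gradient_ok": abs(rel_grad) <= tol,
        "hessian_positive_definite": bool(np.all(eigvals > 0.0)),
        "smallest_eigenvalue": float(eigvals.min()),
    }

# A well-behaved quadratic objective evaluated at (essentially) its minimum.
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1e-9, -2e-9])
print(convergence_diagnostics(g, H))
```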


Low difficulty Norris 12.0 Pontius 12.0 Average difficulty Noint1 14.7 Noint2 15.3 High difficulty Filip 7.2 Longley 13.4 Wampler1 10.0 Wampler2 13.4 Wampler3 10.0 Wampler4 8.1 Wampler5 6.1


Excel. Reliability results for Excel indicated that eighteen of twenty-seven models

estimated satisfactorily using the “far” starting values, while twenty of twenty-seven models estimated reliably using the “near” starting values (table 3). We generally found better LRE performance for Excel than was found in the previous literature. Compared with the findings of Keeling and Pavur (2007), Excel performed much better on several models. However, that does not indicate that there are not issues with Excel. In addition, we added user-defined routines to provide standard error estimates. When the LREs for the parameter estimates were high, the LREs for the standard errors were high as well. To obtain acceptable solutions, in some cases nonnegativity constraints were imposed on the parameter estimates (Lanczos3 and Rat43), and in other cases the solver stopped and had to be restarted to obtain the optimal solution (Hahn1, Lanczos1, Lanczos2, BoxBOD, and Mgh10). The randomness of adding nonnegativity constraints to obtain an acceptable solution certainly can be a problem unless there is a priori knowledge of the appropriateness of these constraints. However, adding this type of constraint can be a useful tool in trying to determine the area of the parameter space in which the optimal solution occurs. Solver did not converge to a solution after 10,000 iterations for the Mgh17 and Mgh09 models using the “far” starting values. To illustrate the issue of why Solver had to be restarted, consider the BoxBOD model. Solver stopped with an objective value of 9,771.50 and estimates for β1 and β2 of 172.50 and 20.62, respectively. (The NIST-certified values for β1 and β2 are 213.80940889 and 0.54723748542, respectively.) The minimum LRE measure for the parameter estimates was 0.0. After restarting Solver, using the prior solution from the first stopping point as the new starting point, Solver found an improved objective value of 1,168.01 and estimates for β1 and β2 of 213.809 and 0.547, respectively. The minimum LRE measure for the parameter estimates was 8.066 after restarting. Restarting the solution again resulted in no change. Therefore, it is recommended to always restart Solver in Excel to verify that the solution obtained does not change.13

13 If the solution found in Excel is optimal, restarting solver should not find an alternative solution. Therefore, restarting solver from the last solution found should help to check optimality.
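The restart check is not specific to Excel; the same pattern applies to any local solver. A minimal sketch using scipy as a stand-in optimizer (the BoxBOD functional form y = β1(1 − exp(−β2 x)) follows the NIST definition; the data themselves are left as a placeholder rather than reproduced here):

```python
import numpy as np
from scipy.optimize import least_squares

def boxbod_residuals(beta, x, y):
    # BoxBOD model: y = b1 * (1 - exp(-b2 * x))
    return y - beta[0] * (1.0 - np.exp(-beta[1] * x))

def fit_with_restart(x, y, start, restarts=2):
    """Fit, then restart the solver from its own solution until the objective
    stops improving -- the same check recommended above for Excel's Solver."""
    fit = least_squares(boxbod_residuals, start, args=(x, y))
    for _ in range(restarts):
        refit = least_squares(boxbod_residuals, fit.x, args=(x, y))
        if refit.cost >= fit.cost - 1e-12:   # no further improvement: accept the solution
            break
        fit = refit                           # improved: the earlier stop was premature
    return fit

# x, y = ...  # load the BoxBOD data from the NIST StRD collection (not reproduced here)
# fit = fit_with_restart(x, y, start=np.array([1.0, 1.0]))
# The certified values to compare against are beta1 = 213.80940889 and beta2 = 0.54723748542.
```

If the reported solution is a genuine optimum, restarting leaves both the parameters and the objective unchanged, which is exactly the check in footnote 13.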


or no solution found. LREs greater than 4 meet McCullough's (1998) standard of reliability for nonlinear regression models. Generally, the software packages performed marginally better on the starting values near to the optimal value than the starting values farther away. When trying to obtain the optimal solution, alternative tolerance levels, solution algorithms, and other options were adjusted to obtain the solution closest to the certified values. This was done to determine whether there were any consistent adjustments that could be made to the various packages to increase accuracy. No consistent method of improving accuracy was found. Mathematica met McCullough's criteria of a minimum LRE of 4 in all twenty-seven models for both starting points (table 3). R, SAS, and Stata also met the criteria for all twenty-seven sets, and LIMDEP and MATLAB met the criteria in twenty-six of twenty-seven models using the "near" starting values. GAUSS met the criteria for twenty-five of twenty-seven models using the "near" starting point. SHAZAM met the criteria in twenty-one of twenty-seven models using the "near" starting values. Excel and GAMS met the criteria for twenty of twenty-seven models at the "near" starting values. Starting values had the largest effect for R and Stata, where the closer values resulted in meeting McCullough's criteria in twenty-seven of twenty-seven models, while the set of starting values farther from the optimum met the criteria in only twenty-two of the twenty-seven models. Each of the packages performed well on Misra1a, Chwirut2, Chwirut1, Gauss2, Misra1c, Misra1d, and Roszman1 (table 3). At least one package failed to solve each of the other models. There were no consistent patterns among the packages and their levels of difficulty. For example, some of the packages performed well on high-difficulty problems but not so well on low-difficulty problems. The LRE for the standard errors generally followed the ability of the package to calculate the parameter estimates (table 4). The standard errors were calculated using the method offered by the package and manually following equation (2) for Excel, GAMS, and SHAZAM. For SHAZAM, the standard error LREs were generally much higher when the standard errors were calculated manually using equation (2) than when the package defaults were used.
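The search over options described above can be written down generically. A minimal sketch with scipy as a stand-in solver (its method names and tolerance arguments are scipy's own, used for illustration; they are not the option names of the packages in table 1):

```python
import itertools
import numpy as np
from scipy.optimize import least_squares

def min_lre(estimates, certified):
    """Smallest equation-(1) score across a model's parameters, floored at zero."""
    rel = np.abs(np.asarray(estimates) - np.asarray(certified)) / np.abs(np.asarray(certified))
    scores = np.where(rel == 0, 15.0, -np.log10(np.maximum(rel, 1e-300)))
    return max(0.0, float(scores.min()))

def best_fit(residual_fn, start, certified, x, y):
    """Try several algorithm/tolerance combinations and keep the most accurate fit,
    mimicking how each package's options were adjusted against the certified values."""
    best = None
    for method, tol in itertools.product(["trf", "lm"], [1e-8, 1e-12, 1e-15]):
        res = least_squares(residual_fn, start, args=(x, y),
                            method=method, ftol=tol, xtol=tol, gtol=tol)
        score = min_lre(res.x, certified)
        if best is None or score > best[0]:
            best = (score, method, tol, res.x)
    return best
```

In applied work the certified values are unknown, so there is no analogous way to rank the candidate solutions; that is precisely why the diagnostics in footnote 12 matter.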


Dataset

E2

GM1

GM2

GS1

GS2

L1

L2

MA1

MA2

MT1

MT2

R1

R2

S1

S2

SH1

SH2

ST1

ST2

Low difficulty Misra1a 8.1 Chwirut2 8.7 Chwirut1 8.9 Lanczos3 0.0 Gauss1 9.1 Gauss2 9.2 Danwood 9.4 Misra1b 7.4

9.0 10.4 8.3 0.0 9.2 8.2 8.5 6.1

11.1 5.0 10.7 0.0 10.6 10.3 0.5 0.0

11.1 5.0 10.7 0.0 10.6 10.3 0.9 0.0

8.8 5.0 8.4 5.5 9.0 8.3 9.1 9.0

9.0 5.0 8.1 6.2 9.9 7.3 8.9 8.8

8.4 5.0 8.5 8.3 8.8 8.9 10.9 8.2

8.4 5.0 9.6 7.9 9.7 8.9 10.2 8.2

11.0 5.0 8.5 6.9 10.6 10.3 10.0 10.9

11.0 5.0 8.5 9.3 10.6 10.3 8.7 10.9

8.1 5.0 6.7 6.2 7.4 7.1 8.9 8.1

8.1 5.0 6.5 6.2 7.3 7.1 7.4 8.1

9.5 5.0 6.5 5.2 8.2 8.9 8.5 8.9

7.9 5.0 6.7 6.1 8.5 8.9 8.7 8.0

10.9 5.0 10.7 10.5 10.6 10.3 11.0 11.0

11.0 5.0 10.7 6.5 10.6 10.3 11.0 11.0

9.3 5.0 10.9 0.0 0.0 9.8 11.2 11.0

11.1 5.0 10.7 0.0 0.0 10.3 10.4 11.0

9.2 5.0 7.6 6.2 8.9 8.2 8.6 9.9

9.5 5.0 7.6 6.8 8.5 8.2 8.7 9.5

Average Difficulty Kirby2 8.4 Hahn1 0.3 Nelson 6.7 Mgh17 NS Lanczos1 0.0 Lanczos2 0.0 Gauss3 9.0 Misra1c 8.9 Misra1d 8.9 Roszman1 8.4 Enso 6.6

7.5 3.7 6.4 7.8 0.0 0.0 8.5 8.0 8.7 8.3 7.3

10.5 10.8 10.9 6.4 0.0 0.0 1.4 10.8 11.0 8.4 9.6

10.5 10.8 10.9 9.9 3.0 0.1 1.4 10.8 11.0 10.6 9.5

7.3 4.2 7.1 0.0 10.6 6.6 8.8 8.2 10.0 7.6 5.8

7.2 4.2 7.9 7.0 10.4 5.6 8.9 8.7 8.6 6.7 5.9

4.3 NS 7.7 0.0 10.6 10.6 9.4 7.8 8.0 8.9 6.1

4.4 NS 7.5 8.8 10.6 9.7 9.4 7.8 8.0 8.9 6.1

9.6 9.1 7.9 7.6 10.6 8.4 10.4 8.3 11.0 10.2 6.8

9.0 8.5 8.3 9.7 10.6 9.1 10.5 10.8 11.0 10.2 6.7

7.0 6.6 6.9 NS 10.5 7.6 7.0 8.2 8.1 6.4 3.6

7.0 6.0 6.9 6.0 10.5 7.7 7.0 8.2 8.1 6.4 3.9

7.4 7.5 NS NS 10.6 5.6 9.5 8.9 9.7 6.9 5.9

7.3 6.6 7.0 7.2 10.6 6.2 9.3 7.9 8.7 6.8 6.0

9.6 8.7 10.9 0.0 10.6 10.4 10.5 10.8 11.0 10.5 7.9

10.3 9.8 10.9 10.8 10.6 10.4 10.4 10.8 9.1 10.9 7.9

9.4 8.4 8.4 NS 0.0 0.0 8.7 10.6 10.2 7.8 7.2

9.4 7.5 7.9 NS 0.0 0.0 9.1 8.7 9.5 8.5 6.7

7.5 7.1 6.8 0.0 10.6 7.5 8.2 9.7 9.4 7.7 4.7

7.6 7.1 6.8 7.7 10.6 7.2 8.2 9.4 9.2 7.7 4.7

High difficulty Mgh09 NS Thurber 8.7 BoxBod 7.8 Rat42 9.7 MGH10 2.2 Eckerle4 9.0 Rat43 0.0 Bennett5 0.8

7.8 8.4 8.8 9.1 3.8 9.0 8.1 0.4

7.7 10.4 11.0 11.0 0.0 8.7 11.0 0.7

5.9 10.4 11.0 11.0 10.9 6.9 11.0 0.9

6.8 7.4 8.6 8.8 6.4 10.2 7.9 5.4

7.4 7.4 8.6 8.8 7.5 9.5 6.8 5.7

0.0 7.7 9.1 9.8 8.1 6.6 NS 9.1

6.0 7.7 9.1 9.5 7.8 7.4 7.1 8.9

7.3 8.5 8.9 10.4 8.9 9.7 8.7 11.0

8.4 9.1 9.2 10.7 10.9 9.6 9.0 11.0

5.8 5.2 NS 6.5 0.0 7.1 4.8 7.1

5.8 5.2 7.1 6.4 6.7 7.1 5.3 7.7

NS 7.7 NS 7.6 NS 8.0 5.7 5.8

7.5 7.4 6.6 8.6 7.0 8.9 6.5 5.1

NS 8.0 10.8 11.0 NS NS 9.1 10.3

7.6 9.0 10.7 11.0 10.8 10.6 9.1 10.0

0.0 8.4 0.0 0.0 0.0 10.5 8.3 6.5

7.1 9.9 0.0 11.0 10.9 10.1 7.9 6.5

NS 6.5 7.3 7.6 NS 0.0 0.0 6.3

7.0 6.5 8.5 7.6 7.7 8.3 6.0 6.4

Note: StRD = Statistical Reference Dataset; E = Excel, GM = GAMS, GS = GAUSS, L = LIMDEP, MA = MATHEMATICA, MT = MATLAB, R = R, S = SAS, SH = SHAZAM, ST = Stata, 1 = Far Starting Point 1 and 2 = Near Starting Point 2, NS = Did not converge to a solution.


E1


Table 3. Minimum LRE for Parameter Estimates of StRD Nonlinear Regression Models



Table 4. Minimum LRE for Parameter Standard Error Estimates of StRD Nonlinear Regression Models

E1

E2

L1

L2

MA1 MA2 MT1 MT2

R1

R2

S1

S2

Low difficulty Misra1a 7.9 7.9 Chwirut2 9.1 10.7 Chwirut1 9.1 8.7 Lanczos3 0.0 NS Gauss1 9.1 8.9 Gauss2 8.7 8.0 Danwood 9.5 8.7 Misra1b 7.0 5.8

10.8 5.6 10.9 0.0 10.6 10.4 0.0 0.0

10.8 5.6 10.9 0.0 10.6 10.4 0.0 0.0

6.6 5.6 7.2 4.7 7.8 7.3 7.8 6.5

6.5 5.6 7.5 4.1 7.7 7.3 7.6 6.4

6.2 5.6 7.8 7.1 6.3 6.2 7.9 5.9

6.2 5.6 7.8 7.2 6.3 6.2 7.9 5.9

10.4 5.6 8.8 6.9 10.6 10.4 10.1 9.7

10.4 5.6 8.8 9.3 10.4 10.5 8.8 9.7

5.2 5.1 5.1 4.7 5.1 4.7 5.0 5.3

5.2 5.1 5.1 4.7 5.1 4.7 5.0 5.3

6.9 5.5 6.8 4.3 7.5 7.3 7.5 6.2

7.2 5.5 6.9 5.6 7.7 7.3 7.7 6.4

11.0 5.6 10.9 8.5 10.6 10.4 11.0 10.8

10.8 5.6 10.9 6.5 10.6 10.4 11.0 10.8

0.0 1.9 1.8 0.0 0.0 1.7 0.7 0.0

0.0 1.9 1.8 0.0 0.0 1.7 0.7 0.0

9.0 5.6 10.8 0.0 0.0 9.6 11.1 10.8

10.8 5.6 10.9 0.0 0.0 10.4 10.4 10.8

6.4 5.6 6.3 6.1 6.3 5.9 6.2 6.5

6.4 5.6 6.3 5.8 6.3 5.9 6.2 6.4

Average Difficulty Kirby2 8.7 8.0 Hahn1 0.0 0.2 Nelson 6.8 6.5 Mgh17 NS 7.3 Lanczos1 0.0 0.0 Lanczos2 NS NS Gauss3 8.7 8.2 Misra1c 8.6 7.7 Misra1d 9.6 9.1 Roszman1 8.5 9.2 Enso 7.7 8.4

10.7 10.3 10.6 6.0 0.0 0.0 0.3 10.6 11.0 10.9 10.6

10.7 10.3 10.6 9.4 0.0 0.0 0.3 10.6 11.0 10.9 10.4

5.1 2.2 6.5 0.0 3.9 5.1 7.0 6.2 6.3 7.2 6.6

5.1 2.2 6.7 6.0 3.5 4.6 6.9 6.2 6.4 6.5 6.6

2.6 NS 5.9 0.0 3.2 7.1 6.0 5.3 5.6 8.2 5.7

2.6 NS 5.9 7.2 4.1 7.1 6.0 5.3 5.6 8.2 5.7

9.4 8.2 7.9 7.2 1.0 8.4 10.3 8.1 10.0 10.5 8.0

9.4 8.1 7.7 9.2 2.4 9.0 10.5 9.0 10.0 10.5 7.9

5.1 4.4 4.0 NS 3.0 4.7 4.3 5.3 5.3 5.3 4.1

5.1 4.4 4.0 4.9 2.9 4.7 4.3 5.3 5.3 5.3 4.2

6.7 6.6 NS NS 3.3 4.3 7.1 6.2 7.3 7.1 3.8

7.5 6.4 7.0 5.9 3.2 4.9 7.0 6.3 7.7 6.2 3.8

9.9 9.3 10.6 0.0 3.0 8.6 10.4 10.6 11.0 10.8 9.0

10.6 10.4 10.6 10.6 3.2 9.1 10.1 10.6 8.8 11.0 9.0

0.0 0.0 0.0 NS 0.0 0.0 1.8 0.0 0.0 1.2 0.4

0.0 0.0 0.0 NS 0.0 0.0 1.8 0.0 0.0 0.1 0.7

9.7 7.9 8.4 NS 0.0 0.0 8.4 10.6 10.0 6.4 8.4

9.7 7.9 7.9 NS 0.0 0.0 8.7 8.4 9.2 6.2 7.7

6.3 5.1 5.2 0.0 2.9 6.0 5.5 6.5 6.5 6.4 5.3

6.3 5.1 5.2 6.1 3.0 5.3 5.5 6.5 6.4 6.4 5.3

High difficulty Mgh09 NS Thurber 7.9 BoxBod 7.7 Rat42 9.5 MGH10 2.2 Eckerle4 9.4 Rat43 NS Bennett5 0.0

8.0 10.7 10.4 10.6 0.0 8.7 10.7 0.0

6.0 10.7 10.4 10.6 4.8 6.8 10.7 0.0

7.1 6.7 8.6 7.6 5.5 7.9 6.4 5.0

6.8 6.5 8.0 7.4 5.3 7.7 6.4 5.7

0.0 6.3 8.3 7.6 5.6 4.5 NS 7.3

6.2 6.3 8.3 7.6 5.6 4.5 6.8 6.4

7.4 7.8 8.8 10.1 8.9 9.7 8.7 10.3

8.4 8.4 9.1 10.2 11.0 9.6 9.1 10.5

5.3 4.2 NS 4.8 0.0 5.2 3.8 4.8

5.3 4.2 5.7 4.8 3.5 5.2 3.9 4.8

NS 4.2 NS 7.1 NS 7.6 5.7 5.2

6.6 NS 7.7 4.2 7.3 8.3 6.6 10.6 10.3 7.3 10.8 10.6 5.3 NS 9.6 7.7 NS 10.4 7.0 9.3 9.1 4.8 7.2 7.1

0.0 0.5 0.0 0.0 0.0 1.4 0.7 0.0

0.4 0.6 0.0 0.7 0.0 1.4 0.7 0.0

NS 7.6 0.0 0.0 0.0 10.5 8.4 6.4

7.1 10.0 0.0 10.6 9.4 10.1 8.1 6.5

NS 5.4 6.7 6.0 NS 0.0 0.0 6.0

6.5 5.4 6.8 6.0 4.7 6.4 5.0 6.0

8.7 7.7 8.8 8.8 3.9 8.7 1.0 0.0

SH1 SH2 SHB1 SHB2 ST1 ST2

Note: StRD = Statistical Reference Dataset; E = Excel, GM = GAMS, GS = GAUSS, L = LIMDEP, MA = MATHEMATICA, MT= MATLAB, R = R, S = SAS, SH = SHAZAM, SHB = SHAZAM with User Calculated Standard Errors, ST = Stata, 1 = Far Starting Point 1 and 2 = Near Starting Point 2, NS = Did not converge to a solution.


GM1 GM2 GS1 GS2


Dataset



GAUSS. The software package performed reliably on twenty-five of the twenty-seven models. When considering both the reliability of the parameter and standard error estimates, performance on three of the models (Hahn1, Lanczos1, and MGH17) yielded LREs less than 4 for the "far" starting point, but the model did perform better for the "near" starting point for MGH17. Results are comparable to those of Yalta (2007), but differences may result, as only numerical derivatives were utilized in this study. While two algorithms (Levenberg-Marquardt and Conjugate Gradient) are available in the CurveFit module, for the given problems the Levenberg-Marquardt algorithm tended to provide better performance. LIMDEP. The Help Topics in the LIMDEP software provide the NIST nonlinear regression datasets and code for conducting estimation and performing the reliability tests. LIMDEP indicates that it was able to successfully estimate twenty of the twenty-seven nonlinear models using the first starting point; five of the remaining models using the second starting point; and the remaining two by using analytic derivatives rather than program defaults (i.e., numerical derivatives) (LIMDEP 2002). We were able to reliably obtain the parameter estimates and associated standard errors for twenty-two of the twenty-seven models for both starting points, and an additional three of the remaining models (except Hahn1 and Kirby2) using the "near" starting point. While parameter estimates were reliably estimated for the Kirby2 model, standard error estimates fell below the LRE threshold for


reliability. We were not successful in estimating the Hahn1 model using the numerical derivative routines in the LIMDEP software. These results are an improvement over those found by McCullough (1999b). Mathematica. The package reliably estimated each of the twenty-seven problems with an LRE of greater than 4 with either set of starting values. In addition, each of the standard error estimates has an LRE of 4 or greater except for Lanczos1. While Mathematica proved to be highly reliable, the software package has only a limited suite of built-in statistical regression routines (e.g., linear regression, nonlinear regression [least squares], generalized linear models, logistic regression, and probit models).14 McCullough (2000) showed that when working precision is increased to fifty digits in Mathematica 4 (which implies a default convergence of 1e-25), the minimum LREs for both parameter and standard error estimates for all the nonlinear models were equal to 11. MATLAB. This software package was able to estimate the parameters reliably for twenty-three and twenty-six of the twenty-seven models for the "far" and "near" starting values, respectively. Results for the standard error estimates tended to follow the same pattern. As with the other packages, MATLAB generally performed well in estimating the standard errors when the parameters were estimated well. MATLAB was unable to converge to a solution (i.e., diverged to an infinite solution and the Jacobian was ill-conditioned) for the MGH17 and BoxBOD models at the "far" starting values and did not perform reliably on the ENSO model for either set of starting values. A shortcoming of the nonlinear least squares routine in the Statistics Toolbox for MATLAB is that it provides only one algorithm for parameter estimation (i.e., Levenberg-Marquardt). R. This package was able to reliably estimate twenty-two and twenty-seven of the twenty-seven models for the "far" and "near" starting values, respectively. Keeling and Pavur (2007) found that R version 1.9.1 was able to solve twenty and twenty-six of the twenty-seven models using the "far" and "near" starting values, respectively. For problems that did not


SAS. This software was able to reliably estimate the parameters for twenty-three and twenty-seven of the twenty-seven models for the “far” and “near” starting values, respectively. The standard error estimates for the Lanczos1 model fell below the reliability threshold even though the parameters were reliably estimated; the remaining standard error results followed those for the parameter estimates. The results were nearly identical to those found by Keeling and Pavur (2007) and identical to those found by SAS Institute (2003) for version 8.02 of the software.

SHAZAM. This software package reliably estimated eighteen and twenty-one of the twenty-seven models for the “far” and “near” starting values, respectively. This is an improvement over prior assessments, as McCullough (1999b) found that SHAZAM was able to reliably estimate fifteen of twenty-three models using version 8.0. While there was some improvement in the LRE results using the “near” starting values, there were cases where the “far” starting values produced more accurate estimates. We tested two convergence criteria: 1e-6, the default specified in SHAZAM, and 1e-11, the value recommended by McCullough (1999b). In some cases the smaller optimality tolerance resulted in higher LREs, and in other cases the larger tolerance did. One issue with SHAZAM is the calculation of standard errors within the package: McCullough (1999b) found that, using a criterion of two accurate digits, SHAZAM met this criterion on eleven of twenty-two models. We used several methods available in SHAZAM, including numerical differences, to calculate the covariance matrix needed for the parameter standard errors. While the accuracy of the standard errors differed depending on the method chosen, none performed very well. We further examined this issue by calculating the standard errors manually using equation (2). Comparing SH1 with SHB1, and SH2 with SHB2, in table 4 shows a dramatic improvement in the accuracy of the standard errors when they are calculated manually by the user (SHB). Users should recognize that, when calculating standard errors in SHAZAM, manually computed (user-defined) standard errors may be more reliable than those produced by the built-in routines.
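Equation (2) is not repeated here, but assuming it is the usual asymptotic nonlinear least squares covariance (the residual variance times the inverse of J'J, with J the Jacobian evaluated at the solution), the manual calculation described for SHAZAM, and required within GAMS, can be sketched as follows. The model, data, and variable names are illustrative stand-ins rather than anything from the study.

import numpy as np
from scipy.optimize import least_squares

# Manual standard errors for a nonlinear least squares fit: residual variance
# times the inverse of J'J, with J the Jacobian evaluated at the solution.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y = 2.0 * np.exp(-0.8 * x) + 0.01 * rng.standard_normal(x.size)

def residuals(b, x, y):
    return b[0] * np.exp(b[1] * x) - y

fit = least_squares(residuals, [1.5, -0.5], args=(x, y), method="lm")
n_obs, n_par = x.size, fit.x.size
s2 = np.sum(fit.fun ** 2) / (n_obs - n_par)      # residual variance
cov = s2 * np.linalg.inv(fit.jac.T @ fit.jac)    # asymptotic covariance of the estimates
std_errors = np.sqrt(np.diag(cov))
print("estimates:", fit.x)
print("standard errors:", std_errors)

Because the explicit inverse of J'J can itself be inaccurate for ill-conditioned problems, comparing such hand-computed standard errors against a package's built-in values is a useful diagnostic rather than a guaranteed improvement.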

STATA. This software reliably estimated the parameter values for twenty-two and twenty-seven of the twenty-seven models for the “far” and “near” starting values, respectively. The standard error estimates tended to be reliable when the parameters were reliably estimated; the only exception was the Lanczos1 model. Some of the results differed slightly from those reported by Stata (2007) on its website, but this is likely due to differences in option settings, computer platform, and operating system.15

15 One reviewer pointed out that, in addition to the test results, the code used to generate those results should also be posted on the website. Stata does a nice job of providing the code used to obtain its reported results.

Conclusion

The reliability of ten selected statistical software packages widely used by applied economists for research analyses was examined: Excel 2007, GAMS 23.4, GAUSS 9.0, LIMDEP 9.0, Mathematica 7.0, MATLAB 7.5, R 2.10, SAS 9.1, SHAZAM 10, and Stata 10. Two types of tests (linear and nonlinear least squares regression) were performed using the NIST benchmark datasets. The methods used are consistent with the previous literature, which adjusts options within the packages to obtain values close to those certified by NIST. In actual practice the certified values are unknown, leaving the econometrician to use the suggestions of McCullough (2004) to verify results. This study expands on the literature in this area by examining GAMS and MATLAB, which to the authors’ knowledge have not been extensively examined in this fashion.

Our findings indicate that GAUSS, LIMDEP, Mathematica, MATLAB, R, SAS, and Stata provided consistently reliable estimation results for both parameter and standard error estimates across the benchmark datasets. While some of the other packages performed admirably, shortcomings did exist. Although Excel 2007 performed better than past versions, Solver still does not always converge to the optimal solution, or it stops prematurely while indicating that an optimal solution has been found. This quirk can be dealt with by rerunning Solver from the current solution as the starting point to see whether the convergence criteria are actually satisfied. SHAZAM had issues with consistently estimating standard errors using its built-in routines, but this was overcome for some of the problems by estimating the standard errors manually.

Researchers need to recognize that blindly trusting the accuracy of statistical software packages may lead to nonoptimal, biased, or erroneous results, which affects applied economic theory and the conclusions and policy recommendations derived from it. The findings underscore the need to cross-validate research results: applying different software packages to the same estimation problem helps develop confidence in a solution. While the robustness and reliability of nonlinear least squares parameter estimates depend on the algorithm employed, good solutions also depend on the implementation procedure, the choice of starting value(s), the nature of the problem, and the dataset used. Researchers need to evaluate estimation results carefully until they are shown to be reliable, especially when multicollinearity or substantial nonlinearities are present. Teachers of econometrics need to make students more aware of the limitations of the available software packages. Using the package defaults alone for nonlinear estimation may produce inaccurate estimates; it is therefore advisable to change the defaults to determine the robustness of the estimates. In addition, implementing multiple packages and accepting a solution only when the results agree across packages may be more reliable than relying on a single package. Software developers also need to benchmark their packages and document that reliability for researchers, as some distributors (LIMDEP, SAS, Stata, and TSP) have done. We also suggest that independent reviews of new versions of the software packages used by applied economists be conducted.16

While it is often assumed that the responsibility for ensuring the accuracy and reliability of these products lies with software developers, users should also be aware of software limitations and problems. If users prefer user-friendliness and speed to accuracy and reliability, then research results (and the economic implications drawn from them) may be unreliable and contingent on the quality of the software used. Future research needs to expand the set of benchmark datasets for testing statistical and econometric software to other models and estimators, including discrete choice, simultaneous equation, and Bayesian models. Future research also needs to continue examining issues identified by others regarding random number generators and statistical distributions. Finally, it is important to note that reliability tests are conducted knowing the “true” values; in most cases, adjustments are made, especially in nonlinear estimation, to get as close as possible to the “true” value. Because applied economists do not usually know the “true” values, cross-validation of research results for nonlinear regression becomes even more critical.

16 This could be a function of this Journal, with space devoted annually to reporting the latest results.
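As a practical illustration of the cross-checking recommended above, the sketch below fits the same synthetic model with two different algorithms and tolerances and counts the approximate digits of agreement between the two sets of estimates. In applied work the comparison would be made across packages rather than within a single library, and all names here are illustrative.

import numpy as np
from scipy.optimize import least_squares

# Cross-check the same problem with two different algorithms and tolerances and
# count the approximate digits of agreement between the two parameter vectors.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y = 2.0 * np.exp(-0.8 * x) + 0.01 * rng.standard_normal(x.size)

def residuals(b, x, y):
    return b[0] * np.exp(b[1] * x) - y

fit_a = least_squares(residuals, [1.0, -1.0], args=(x, y), method="lm", xtol=1e-12)
fit_b = least_squares(residuals, [1.0, -1.0], args=(x, y), method="trf", xtol=1e-8)

diff = np.maximum(np.abs(fit_a.x - fit_b.x), 1e-15)  # floor avoids log10(0)
digits = -np.log10(diff / np.abs(fit_b.x))
print("estimates A:", fit_a.x)
print("estimates B:", fit_b.x)
print("approximate digits of agreement:", digits)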

References

Altman, M., and M. P. McDonald. 2001. Choosing Reliable Statistical Software. Political Science and Politics 34: 681–687.
Altman, M., J. Gill, and M. P. McDonald. 2004. Numerical Issues in Statistical Computing for the Social Scientist. Hoboken, NJ: John Wiley & Sons.
Benichou, M., J. M. Gauthier, G. Hentges, and G. Ribiere. 1977. The Efficient Solution of Large-Scale Linear Programming Problems—Some Algorithmic Techniques and Computational Results. Mathematical Programming 13: 280–322.
Brown, B. W. 2003. DCDFLIB: Library of Routines for Cumulative Distribution Functions, Inverses, and Other Parameters. Division of Quantitative Science, University of Texas MD Anderson Cancer Center. http://biostatistics.mdanderson.org/SoftwareDownload/SingleSoftware.aspx?Software_Id=21 (accessed June 2, 2010).
Crowder, H. P., R. S. Dembo, and J. M. Mulvey. 1978. Reporting Computational Experiments in Mathematical Programming. Mathematical Programming 15: 316–329.
Davidson, R., and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.
Dewald, W. G., T. G. Thursby, and R. G. Anderson. 1986. Replication in Empirical Economics: The Journal of Money, Credit and Banking Project. American Economic Review 76: 587–603.
Gay, D. 1990. Usage Summary for Selected Optimization Routines. Computing Science Technical Report No. 153. Murray Hill, NJ: AT&T Bell Laboratories. http://netlib.bell-labs.com/cm/cs/cstr/153.pdf (accessed June 2, 2010).
Gill, P. E., W. Murray, and M. H. Wright. 1981. Practical Optimization. London: Academic Press.
Ibaraki, T. 1976. Theoretical Comparisons of Search Strategies in Branch-and-Bound Algorithms. International Journal of Computer Information Science 5: 315–344.
Keeling, K. B., and R. J. Pavur. 2007. A Comparative Study of Nine Statistical Software Packages. Computational Statistics and Data Analysis 51: 3811–3831.
Knüsel, L. 1995. On the Accuracy of Statistical Distributions in GAUSS. Computational Statistics and Data Analysis 20: 699–702.
———. 1998. On the Accuracy of Statistical Distributions in Microsoft Excel 97. Computational Statistics and Data Analysis 26: 375–377.
———. 2003. Computation of Statistical Distributions: Documentation of the Program ELV, 2nd ed. Department of Statistics, University of Munich. http://www.stat.uni-muenchen.de/~knuesel/elv/elv_docu.pdf (accessed June 2, 2010).
———. 2005. On the Accuracy of Statistical Distributions in Microsoft Excel 2003. Computational Statistics and Data Analysis 48: 445–449.
Land, A. H., and S. Powell. 1973. Fortran Codes for Mathematical Programming: Linear, Quadratic and Discrete. New York: John Wiley & Sons.
Leamer, E. E. 1983. Let's Take the Con out of Econometrics. American Economic Review 73: 31–43.
L'Ecuyer, P., and R. Simard. 2007. TestU01: A C Library for Empirical Testing of Random Number Generators. ACM Transactions on Mathematical Software 33(4). http://portal.acm.org/citation.cfm?id=1268777 (accessed June 2, 2010).
McCullough, B. D. 1998. Assessing the Reliability of Statistical Software: Part I. American Statistician 52(4): 358–366.
———. 1999a. Assessing the Reliability of Statistical Software: Part II. American Statistician 53(2): 149–159.
———. 1999b. Econometric Software Reliability: EViews, LIMDEP, SHAZAM and TSP. Journal of Applied Econometrics 14(2): 191–202.
———. 2000. The Accuracy of Mathematica 4 as a Statistical Package. Computational Statistics 15: 279–299.
———. 2004. Wilkinson's Tests and Econometric Software. Journal of Economic and Social Measurement 29: 261–272.
———. 2006. A Review of TestU01. Journal of Applied Econometrics 21: 677–682.
McCullough, B. D., and D. A. Heiser. 2008. On the Accuracy of Statistical Procedures in Microsoft Excel 2007. Computational Statistics and Data Analysis 52: 4570–4578.
McCullough, B. D., and H. D. Vinod. 1999. The Numerical Reliability of Econometric Software. Journal of Economic Literature 37(2): 633–665.
McCullough, B. D., and B. Wilson. 1999. On the Accuracy of Statistical Procedures in Excel 97. Computational Statistics and Data Analysis 31(1): 27–37.
———. 2002. On the Accuracy of Statistical Procedures in Excel 2000. Computational Statistics and Data Analysis 40(4): 713–721.
———. 2005. On the Accuracy of Statistical Procedures in Microsoft Excel 2003. Computational Statistics and Data Analysis 49: 1244–1252.
Murray, W. 1972. Failure, the Causes and Cures. In Numerical Methods for Unconstrained Optimization, ed. W. Murray, 107–122. New York: Academic Press.
National Institute of Standards and Technology. 2003. Statistical Reference Datasets. http://www.itl.nist.gov/div898/strd (accessed March 2, 2010).
Nerlove, M. 2005. On the Numerical Accuracy of Mathematica 5.0 for Doing Linear and Nonlinear Regression. Mathematica Journal 9(4): 824–851.
Rogers, J., J. Filliben, L. Gill, W. Guthrie, E. Lagergren, and M. Vangel. 1998. StRD: Statistical Reference Datasets for Assessing the Numerical Accuracy of Statistical Software. NIST Technical Note 1396. National Institute of Standards and Technology.
SAS Institute. 2003. Assessing the Numerical Accuracy of SAS Software, version 2. http://support.sas.com/rnd/app/papers/statisticalaccuracy.pdf (accessed June 2, 2010).
Sawitzki, G. 1994. Numerical Reliability of Data Analysis Systems. Computational Statistics and Data Analysis 18(2): 269–286.
Silk, J. 1996. System Estimation: A Comparison of SAS, SHAZAM, and TSP. Journal of Applied Econometrics 11(4): 437–450.
Simon, S. D., and J. P. LeSage. 1988. Benchmarking Numerical Accuracy of Statistical Algorithms. Computational Statistics and Data Analysis 7: 197–209.
Stata. 2007. NIST StRD Certification Results Using STATA 10. http://www.stata.com/support/cert/nist10/ (accessed June 2, 2010).
Tice, T. F., and M. G. Kletke. 1984. Reliability of Linear Programming Software: An Experience with the IBM Mathematical Programming System Series. American Journal of Agricultural Economics 66: 104–107.
Tomek, W. G. 1993. Confirmation and Replication in Empirical Econometrics: A Step Toward Improved Scholarship. American Journal of Agricultural Economics 75: 6–14.
Time Series Processor. 2010. Benchmarks. http://www.tspintl.com/products/tsp/benchmarks/index.htm (accessed June 2, 2010).
Vinod, H. D. 2000. Review of GAUSS for Windows, Including Its Numerical Accuracy. Journal of Applied Econometrics 15(2): 211–220.
Yalta, A. T. 2007. The Numerical Reliability of GAUSS 8.0. The American Statistician 61: 262–268.
———. 2008. The Accuracy of Statistical Distributions in Microsoft Excel 2007. Computational Statistics and Data Analysis 52: 4579–4586.

