Statistical Inference and Prediction in Nonlinear Models using Additive Random Fields∗

Christian M. Dahl†
Department of Economics, Purdue University
[email protected]

Gloria González-Rivera
Department of Economics, University of California, Riverside
[email protected]

Yu Qin
Countrywide Financial Corporation
yu [email protected]

April 5, 2005

Abstract

We study nonlinear models within the context of the flexible parametric random field regression model proposed by Hamilton (2001). Though the model is parametric, it enjoys the flexibility of the nonparametric approach since it can approximate a large collection of nonlinear functions, and it has the added advantage that there is no “curse of dimensionality.” The fundamental premise of this paper is that the parametric random field model, though a good approximation to the true data generating process, is still a misspecified model. We study the asymptotic properties of the estimators of the parameters in the conditional mean and variance of a generalized additive random field regression under misspecification.

∗ The notation follows Abadir and Magnus (2002). All software developed for this paper can be obtained from the corresponding author.
† Corresponding author. Address: 403 West State Street, Purdue University, West Lafayette, IN 47907-2056, USA. E-mail: [email protected]. Phone: +1 765-494-4503. Fax: +1 765-496-1778.


The additive specification approximates the contribution of each regressor to the model by an individual random field, such that the conditional mean is the sum of as many independent random fields as the number of regressors. This new specification can be viewed as a generalization of Hamilton’s and, therefore, our results provide “tools” for classical statistical inference that will also apply to his model. We develop a test for additivity that is based on a measure of goodness of fit. The test has a limiting Gaussian distribution and is very easy to compute. Through extensive Monte Carlo simulations, we assess the out-of-sample predictive accuracy of the additive random field model and the finite sample properties of the test for additivity, comparing its performance to existing tests. The results are very encouraging and render the parametric additive random field model a good alternative specification relative to its nonparametric counterparts, particularly in small samples.

Keywords: Random Field Regression, Nonlocal Misspecification, Asymptotics, Test for Additivity. JEL classification: C12; C15; C22.

1 Introduction

We study nonlinear regression models within the context of the parametric random field model proposed by Hamilton (2001). Stationary random fields have been long-standing tools for the analysis of spatial data, primarily in the fields of geostatistics, environmental sciences, agriculture, and computer design; see, e.g., Cressie (1993). Several applications can be found in Loh and Lam (2000), Abt and Welch (1998), Ying (1991, 1993), and Mardia and Marshall (1984). The analysis of economic data with random fields is still in a preliminary state of development. In a regression framework, random fields were first introduced by Hamilton (2001) and further developed in Dahl (2002), Dahl and González-Rivera (2003), Dahl and Qin (2004), and Dahl and Hylleberg (2004). Random fields are closely related to universal kriging and thin plate splines. Dahl (2002) points out that Hamilton’s estimator of the conditional mean function becomes


identical to the cubic spline smoother when the conditional mean function is viewed as a realization of a Brownian motion process.¹ In addition, Dahl (2002) shows that the random field approach has superior predictive accuracy compared to popular nonparametric estimators, e.g., the spline smoother, when the data are generated from popular econometric models such as LSTAR/ESTAR and various bilinear specifications. Though the random field model is parametric, it enjoys the flexibility of the nonparametric approach since it can approximate a large collection of nonlinear functions, but with the added advantage that there is no “curse of dimensionality” and no dependence on bandwidth selection or on a smoothing parameter. In many of the aforementioned applications (those not related to econometrics and/or economics), the random field model has been treated as the true data generating process. The view presented in this paper is that the parametric random field model, though a good approximation to the true data generating mechanism, is still a misspecified model. However, despite misspecification, Hamilton (2001) shows that it is possible to obtain a consistent estimator of the overall conditional mean function under very general conditions. The theoretical contribution of this paper focuses on the consequences of misspecification for the asymptotic properties of the estimators of the parameters found in the conditional mean and in the variance of the error term of the model. This is an important and additional contribution to Hamilton’s results, which only focused on the overall conditional mean function. An additional innovation of this paper is the specification of additive random fields to model the conditional mean of the variable of interest.
Within the nonparametric literature, additive models play a very predominant role because they mitigate the “curse of dimensionality” and their estimators have faster convergence rates than those of the nonadditive nonparametric models; see, e.g., Stone (1985, 1986). Hastie and Tibshirani (1990) provide an important and thorough analysis of additive models. There is a vast literature on nonparametric estimation of additive models. Some recent contributions are Sperlich, Tjostheim and Yang (2002), Yang (2002), and Carroll, Hardle and Mammen (2002). In this paper, we investigate how and to what extent imposing an additive

¹ Using the results of Kimeldorf and Wahba (1971) and Wahba (1978, 1990) we show how this result generalizes to x_t ∈ R^k.


structure on the random field regression model improves its statistical properties when the true data generating process is additive in one or more of its arguments (nonzero interaction terms are allowed). In particular, we propose (in the simplest case, where there are no interaction terms) to approximate the individual contribution of each of the k regressors to the conditional mean by a random field, such that the nonlinear part is the sum of k individual and independent random fields. Our approach differs from Hamilton’s model, where one “comprehensive” random field approximates the joint contribution of the k regressors. This new specification can be viewed as a generalized random field, as it collapses to Hamilton’s model when no assumptions regarding additivity are imposed. This implies that all the asymptotic results that we derive in this paper will also apply to Hamilton’s (2001) model. We will assess the gains in the predictive accuracy of the model when the additive structure is incorporated. First, we provide a complete characterization of the asymptotic theory associated with the estimators of the parameters of the generalized additive random field model. Our environment is similar to Hamilton’s (2001). Our results facilitate classical statistical inference in random field regression models. Classical statistical inference is far less computationally intensive than the Bayesian inference suggested by Hamilton (2001). Establishing the asymptotic properties of the estimators is non-trivial, mainly because the random field model is viewed as an approximation to the true data generating process, and as such we need to deal with misspecification concerns. Secondly, when imposing additivity, we need to evaluate the validity of such a restriction. There is a vast literature on specification tests for additivity in a nonparametric setting.
For example, Barry (1993) developed a test for additivity within a nonparametric model with sampling on a discrete grid. Eubank, Hart, Simpson, and Stefanski (1995) showed the asymptotic performance of a Tukey-type additivity test based on Fourier series estimation of nonparametric models, and they proposed a new test, which delivers a consistent estimator of the interaction terms. Chen, Liu, and Tsay (1995) relaxed the restriction of sampling on a grid and proposed an LM-type test for additivity in autoregressive processes. Hong and White (1995) and Wooldridge (1992) proposed several specification tests, which can also be used to test for additivity. Most of these studies on additivity testing rely on nonparametric models. In addition to being computationally demanding, these methods depend very heavily on the bandwidth selection and on the construction of the weight function, which typically are not easily obtained. Our contribution is a new test for additivity that does not depend on a nonparametric estimation procedure. It is based on a measure of goodness of fit, which permits fast computation. The test has a limiting Gaussian distribution, and a Monte Carlo study illustrates that it has very good size and power for a large class of additive and nonadditive data generating processes. A potential drawback of additive models is the large number of parameters to be estimated. We introduce a more restrictive specification, called the “proportional additive random field model,” where the weight ratio between any two random fields is kept fixed. This modification substantially improves the computational/numerical aspects of the estimation and testing procedures. Based on extensive Monte Carlo studies, we find that these improvements are achieved without sacrificing predictive efficiency in relation to the generalized additive model. The organization of the paper is as follows. In Section 2 we present the additive random field model. In Section 3 we establish the asymptotic properties of the estimators of the parameters of the model. In Section 4, we develop a new test for additivity and characterize its asymptotic distribution. In Section 5 we conduct various Monte Carlo experiments to analyze the small sample properties of the predictive accuracy of the estimated additive random field model and the size and power properties of the proposed test for additivity. We conclude in Section 6. All proofs can be found in the mathematical appendix.

2 Preliminaries

2.1 The additive random field regression model

Let y_t ∈ R, x_t· ∈ R^k and consider the model

y_t = µ(x_t·) + ε̃_t,   (1)

where ε̃_t is a sequence of independent and identically N(0, σ̃²) distributed random variables and µ(·) : R^k → R is a random function of a k × 1 vector x_t·, which is assumed to be deterministic.² The vector of explanatory variables is partitioned as (x_t1·, ..., x_tI·)′, where x_ti· ∈ R^{k_i} and Σ_{i=1}^I k_i = k.³ Model (1) is called an I-dimensional additive random field model if the conditional mean of y_t, i.e. µ(x_t·), has a linear component and a stochastic additive nonlinear component such as

µ(x_t·) = x_t· β + Σ_{i=1}^I λ̃_i m_i(g_i ⊙ x_ti·),   (2)

where for i = 1, 2, ..., I, λ̃_i ∈ R_+, g_i ∈ R^{k_i}_+, I ≤ k, and for any choice of z ∈ R^{k_i}, m_i(z) is a realization of a random field. Model (2) is said to be fully additive when I = k, or partially additive when I < k. For example, suppose that the true functional form of the conditional mean is given by

y_t = β_1 x_t1 + β_2 x²_t1 + β_3 sin(x_t2 x_t3) + ε̃_t.

This model is partially additive in x_t1. Individual random fields approximate the nonlinear components, i.e. λ̃_1 m_1(g_1 · x_t1) ≈ β_2 x²_t1 and λ̃_2 m_2(g_2 · x_t2, g_3 · x_t3) ≈ β_3 sin(x_t2 x_t3), for I = 2, k_1 = 1, and k_2 = 2. Each of the I random fields is assumed to have the following distribution:

m_i(z) ∼ N(0, 1),   (3)

E(m_i(z) m_j(w)) = H_i(h) if i = j, and O_T if i ≠ j,   (4)

where h is the Euclidean distance h ≡ ½[(z − w)′(z − w)]^{1/2}.⁴ The realization of m_i(·) for i = 1, 2, ..., I is considered predetermined and independent of {x_1·, ..., x_T·, ε̃_1, ..., ε̃_T}. The i’th covariance matrix H_i(h) is defined as

H_i(h) = G_{k_i−1}(h, 1)/G_{k_i−1}(0, 1) if h ≤ 1, and 0 if h > 1,   (5)

² Without loss of generality we assume that all variables are demeaned.
³ This partition of x_t· is made to simplify the exposition. In general, regressors can enter one or more of the random fields in the additive random field model without altering any of the asymptotic results.
⁴ g is a k × 1 vector of parameters and ⊙ denotes element-by-element multiplication, i.e. g ⊙ x_t is the Hadamard product. β is a k × 1 vector of coefficients.

where G_k(h, r), 0 < h ≤ r, is⁵

G_k(h, r) = ∫_h^r (r² − z²)^{k/2} dz.   (6)

When the model is fully additive, x_ti· becomes a scalar for all i and expression (5) reduces to H_i(h) = (1 − h) 1(h ≤ 1).^{6,7} Since m_i(z) is not observable for any choice of z, the functional form of µ(x_t·) cannot be observed. Hence, inference about the unknown parameters (β, λ̃_1, ..., λ̃_I, g_1, ..., g_I, σ̃²) must be based on the observed realizations of y_t and x_t· only. For this purpose we rewrite model (1) as

y = Xβ + ε,   (7)

where y ≡ y_T is a T × 1 vector with t-th element equal to y_t, X ≡ X_T is a T × k matrix with t-th row equal to x_t·, ε is a T × 1 random vector with t-th element equal to Σ_{i=1}^I λ̃_i m_i(g_i ⊙ x_ti·) + ε̃_t, and ε ∼ N(0_T, H(λ) + σI_T), where H(λ) ≡ Σ_{i=1}^I λ_i H_i and (λ_i, σ) ≡ (λ̃²_i, σ̃²). To avoid identification issues, the shape parameters g_i in the random field model are assumed to be fixed as in Hamilton (2001). Furthermore, we assume that the components of the vector (λ_1, λ_2, ..., λ_I, σ)′ are strictly positive.

Assumption i. The parameter vector g_i ∈ R^{k_i}_+ for i = 1, 2, ..., I in the additive random field model (1) is fixed with typical element equal to g_{j_i} = 1/(2 k_i s²_{j_i})^{1/2}, for i = 1, ..., I and j_i ∈ [1; k], where s²_{j_i} = (1/T) Σ_{t=1}^T (x_{j_i t} − x̄_{j_i})² and x̄_{j_i} is the sample mean of the j_i-th explanatory variable.

⁵ From Hamilton (2001), G_k(h, r) can be computed recursively from
G_0(h, r) = r − h,
G_1(h, r) = (π/4) r² − 0.5 h (r² − h²)^{1/2} − (r²/2) sin^{−1}(h/r),
G_k(h, r) = −h (r² − h²)^{k/2}/(1 + k) + [k r²/(1 + k)] G_{k−2}(h, r),
for k = 2, 3, ....
⁶ The correlation between m(z) and m(w) is given by the volume of the intersection of a k-dimensional unit spheroid centered at z and a k-dimensional unit spheroid centered at w, relative to the volume of a k-dimensional unit spheroid. Hence, the correlation between m(z) and m(w) is zero if the Euclidean distance between z and w is h ≥ 2.
⁷ The reader interested in a critical review on the choice of an appropriate covariance function is referred to Dahl and González-Rivera (2003).
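As a quick illustration (our own sketch, not the paper’s companion software; all names are ours), the recursion in footnote 5 and the typical covariance entry in (5) can be coded directly:

```python
import math

def G(k, h, r):
    """G_k(h, r) = integral from h to r of (r^2 - z^2)^(k/2) dz,
    computed via Hamilton's (2001) recursion (footnote 5)."""
    if k == 0:
        return r - h
    if k == 1:
        return (math.pi / 4.0) * r ** 2 - 0.5 * h * math.sqrt(r ** 2 - h ** 2) \
               - (r ** 2 / 2.0) * math.asin(h / r)
    # k >= 2: G_k = [-h (r^2 - h^2)^(k/2) + k r^2 G_{k-2}] / (1 + k)
    return (-h * (r ** 2 - h ** 2) ** (k / 2.0)
            + k * r ** 2 * G(k - 2, h, r)) / (1.0 + k)

def H_entry(k_i, h):
    """Typical entry of H_i in (5): G_{k_i-1}(h,1)/G_{k_i-1}(0,1) for h <= 1,
    and 0 for h > 1; reduces to (1 - h) when k_i = 1 (fully additive case)."""
    if h > 1.0:
        return 0.0
    return G(k_i - 1, h, 1.0) / G(k_i - 1, 0.0, 1.0)
```

The recursion reproduces the closed-form integral; e.g., for k_i = 1 the entry is the triangular kernel 1 − h.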

Assumption ii. We define θ ≡ (λ_1, ..., λ_I, σ)′ ∈ Θ ⊆ R^{I+1}_+, where Θ is a compact parameter space. There exist sufficiently small but positive real numbers θ̲ = (λ̲_1, ..., λ̲_I, σ̲)′ and sufficiently large positive real numbers θ̄ = (λ̄_1, ..., λ̄_I, σ̄)′, such that λ_i ∈ [λ̲_i, λ̄_i] for all i = 1, ..., I and σ ∈ [σ̲, σ̄].

We focus on deriving the limiting properties of the maximum likelihood estimators of θ = (λ_1, λ_2, ..., λ_I, σ)′ (the parameters of the nonlinear part of the model). The vector β will be considered as a nuisance parameter vector, which can be estimated consistently by ordinary least squares by simply ignoring the possibly nonlinear part of the model (independently of whether this part is additive or not). Model (7) can be viewed as a generalized least squares representation where ε has a non-spherical covariance matrix C = H(λ) + σI_T. We can write the average log-likelihood function of ε as (apart from a constant term)

Q_T(θ, β) = −(1/2T) ln det C − (1/2T)(y − Xβ)′ C^{−1} (y − Xβ).   (8)

2.2 The data generating process

The random field model is an approximation to the true functional form of the conditional mean. We assume that there is a data generating mechanism that needs to be discovered. The important question is that of “representability” of the true functional form through some function of the covariance function of the random field. We follow similar arguments as in Hamilton (2001). We assume that y_t is generated according to the process

y_t = ψ(x_t·) + e_t,   (9)

for t = 1, 2, ..., T, where ψ : R^k → R is given by the additive function

ψ(x_t·) = x_t· α + Σ_{i=1}^I l_i(x_ti·).   (10)

In addition, the following assumptions are imposed.

Assumption 1: The sequence {x_t} is dense. The deterministic sequence {x_ti·}, with x_ti· ∈ A_i and A_i = A_1 × A_2 × ... × A_{k_i} a closed rectangular subset of R^{k_i}, and λ ∈ Γ_0, is said to be dense for A_i uniformly on the compact space A_i × Γ_0 ⊂ R^{k_i} × R if there exists a continuous f_i : A_i → R such that f_i(x_ti·) > 0 for all i and x_ti·, and such that for any ǫ > 0 and any continuous φ_i : A_i × A_i × Γ_0 → R there exists an N such that for all T ≥ N,

sup_{A_i×Γ_0} | (1/T) Σ_{s=1}^T φ_i(x_ti·, x_si·; λ) − ∫_{A_i} φ_i(x_ti·, x; λ) f_i(x) dx | < ǫ

for all i = 1, 2, ..., k.

Assumption 2: The function l_i : R^{k_i} → R is representable. Let A_i and Γ_0 be given as in Assumption 1 and let l_i : A_i × Γ_0 → R be an arbitrary continuous function. We say that l_i(·) is representable with respect to φ_i(·) if there exists a continuous function f_i : A_i → R such that

l_i(x_ti·; λ) = ∫_{A_i} φ_i(x_ti·, x; λ) f_i(x) dx.

Assumption 1 is important because it implies that we can write

l_i(x_ti·; λ_i) = lim_{T→∞} (1/T) Σ_{s=1}^T φ_i(x_ti·, x_si·; λ_i)
             = lim_{T→∞} (1/T) Σ_{s=1}^T λ_i H_i(x_ti·, x_si·) φ̃_i(x_si·),

where λ_i is given as in Assumption ii., H_i(x_ti·, x_si·) denotes the (t, s) entry in H_i given by (5), and φ̃_i : R^{k_i} → R is an arbitrary continuous function. Note that, by varying φ̃_i(·), Assumption 2 describes the general class of nonlinear/linear functions for which l_i(x_ti·; λ_i) is representable in terms of the spherical covariance function. Importantly, Hamilton (2001) shows that Taylor and Fourier sine series expansions are representable under Assumption 2. We can thus expect the random field model to have good approximation properties over a very broad class of functions. Defining the sample version of l_i(·) as

l_Ti(x_ti·; λ_i) = (1/T) Σ_{s=1}^T λ_i H_i(x_ti·, x_si·) φ̃_i(x_si·),   (11)

it follows that lim_{T→∞} l_Ti(x_ti·; λ_i) → l_i(x_ti·; λ_i) uniformly on A_i × Γ_0 for all t, hereby providing a necessary link between the approximating random field model and the true

data generating process. Hamilton (2001) discusses pointwise convergence of l_Ti(·) to l_i(·) in x_ti· for all t. Following similar arguments as in Dahl and Qin (2004), we generalize the convergence to uniform convergence.

Assumption 3: Distribution of e_t. The error term e_t is assumed to be an i.i.d. Gaussian distributed random variable with zero mean and variance σ²_e.

Assumption 4: Limiting behavior of “second moments” of ψ(X) and X. Let ψ(X) = (ψ(x_1·), ..., ψ(x_T·))′, where ψ(·) is given by (10). Assume: i. lim_{T→∞} (1/T) X′X converges to a finite nonsingular matrix. ii. lim_{T→∞} (1/T) ψ(X)′ ψ(X) converges to a finite scalar uniformly in α. iii. lim_{T→∞} (1/T) X′ ψ(X) converges to a finite k × 1 vector uniformly in α.

Assumptions 3 and 4 seem somewhat restrictive but are a consequence of working in Hamilton’s (2001) environment and they serve primarily to shorten the proofs. We find it important to establish the asymptotic results under these basic assumptions first before extending Hamilton’s (2001) fundamental assumptions. We conjecture that replacing Gaussianity with stationarity, ergodicity, a sufficient number of moment conditions in Assumption 3, and imposing similar conditions on (y, x′ )′ in Assumption 4 would not alter the main results.
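Before turning to the asymptotics, a small hedged Python sketch (our own illustration, not the paper’s software; the coefficients and design are arbitrary) shows how data from the partially additive example DGP of Section 2.1 can be simulated under Assumption 3, and how the fixed shape parameters of Assumption i are computed:

```python
import math
import random

def simulate_dgp(T, sigma_e=0.1, seed=0):
    """Simulate y_t = b1*x_t1 + b2*x_t1**2 + b3*sin(x_t2*x_t3) + e_t,
    with e_t ~ i.i.d. N(0, sigma_e**2); coefficients are illustrative only."""
    rng = random.Random(seed)
    b1, b2, b3 = 0.5, -0.3, 1.0
    ys, xs = [], []
    for _ in range(T):
        x1, x2, x3 = (rng.uniform(-1.0, 1.0) for _ in range(3))
        psi = b1 * x1 + b2 * x1 ** 2 + b3 * math.sin(x2 * x3)  # psi(x_t.)
        ys.append(psi + rng.gauss(0.0, sigma_e))
        xs.append((x1, x2, x3))
    return ys, xs

def shape_parameter(xcol, k_i):
    """Assumption i: g_{j_i} = 1 / sqrt(2 * k_i * s2), where s2 is the
    (biased) sample variance of the j_i-th explanatory variable."""
    T = len(xcol)
    xbar = sum(xcol) / T
    s2 = sum((x - xbar) ** 2 for x in xcol) / T
    return 1.0 / math.sqrt(2.0 * k_i * s2)
```

Demeaning the regressors first (footnote 2) leaves the shape parameters unchanged, since only the sample variance enters.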

3 Asymptotics

In this section we establish three important asymptotic results: (1) consistency of the estimator of the conditional mean in model (2); (2) consistency of the maximum likelihood estimators of the parameters in the nonlinear component of the model; and (3) their asymptotic normality. The estimation procedure is a two-stage approach. In the first stage, we estimate the parameters β in the linear component of the model by OLS. In the second stage, the parameters θ = θ(β) in the nonlinear part of the model are estimated by maximizing the objective function (8). We establish the asymptotic theory for the estimators of β and θ under the additivity assumption in (10). To this end, we need to derive a set of results on uniform convergence of deterministic functions. In the following

subsections, we state the main theorems; their proofs can be found in the Mathematical Appendix.

3.1 Consistency of the conditional mean estimator

Let µ_T = (µ(x_1·), µ(x_2·), ..., µ(x_T·))′ and ξ_T = (ξ_T(x_1·), ξ_T(x_2·), ..., ξ_T(x_T·))′, where ξ_T(x_t·) = E(µ(x_t·) | y_T, x_T·, y_{T−1}, x_{T−1}·, ...). Under the assumption of joint normality of y_T and µ_T, the expectation of the conditional mean function, conditional on y_T, x_T·, y_{T−1}, x_{T−1}·, ..., is given by

ξ_T = Xβ + H(λ)(H(λ) + σI_T)^{−1}(y − Xβ).
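The fitted conditional mean above is a matrix-weighted shrinkage of the OLS residuals. A hedged Python sketch (ours; names are our own) for the scalar-regressor, fully additive case, with H_i(h) = (1 − h)1(h ≤ 1) and h = ½|g(x_t − x_s)|, is:

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting (A small)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][j] * w[j] for j in range(r + 1, n))) / M[r][r]
    return w

def xi_T(y, x, beta, lam, sigma, g):
    """xi_T = X*beta + H(lam) (H(lam) + sigma*I_T)^{-1} (y - X*beta)
    for one scalar regressor with the fully additive spherical covariance."""
    T = len(y)
    H = [[lam * max(0.0, 1.0 - 0.5 * abs(g * (x[t] - x[s])))
          for s in range(T)] for t in range(T)]
    resid = [y[t] - beta * x[t] for t in range(T)]
    C = [[H[t][s] + (sigma if t == s else 0.0) for s in range(T)]
         for t in range(T)]
    w = solve(C, resid)  # (H + sigma*I)^{-1} (y - X*beta)
    return [beta * x[t] + sum(H[t][s] * w[s] for s in range(T)) for t in range(T)]
```

When λ = 0 the random field drops out and ξ_T reduces to the linear fit Xβ; a large σ likewise shrinks the nonlinear correction toward zero.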

For the misspecified additive random field model (2), the following theorem states that ξ_T is a consistent estimator of µ_T.

Theorem 1. Let assumptions 1, 2, and 3 hold and define

L_T(λ) = (l_T(x_1·; λ), l_T(x_2·; λ), ..., l_T(x_T·; λ))′,

where λ = (λ_1, λ_2, ..., λ_I)′, and

L_T(λ) = (1/T) Σ_{i=1}^I λ_i H_i φ̃_i, where φ̃_i = (φ̃_i(x_1i·), φ̃_i(x_2i·), ..., φ̃_i(x_Ti·))′.

Then,

lim_{T→∞} sup_{Θ×A²} (1/T) E[(ξ_T − Xβ − L_T(λ))′ (ξ_T − Xβ − L_T(λ))] → 0.   (12)

Theorem 1 generalizes Theorem 4.7 in Hamilton (2001) in two directions: First, it establishes that the convergence is uniform on Θ × A2 , and secondly, it applies to a richer class of random field regression models. From Theorem 1, we can derive some additional results regarding uniform convergence of the following deterministic sequences. These results will play an important role later on, when we establish the asymptotic distribution of the estimators of the parameters.


Corollary 1. Let assumptions i., ii., and 1 to 4 hold. Then,

lim_{T→∞} sup_{Θ×B×A} (1/T) (ψ(X) − Xβ)′ (H(λ) + σI_T)^{−2} (ψ(X) − Xβ) → 0,   (13)

lim_{T→∞} sup_{Θ×B×A} (1/T) tr[(H(λ) + σI_T)^{−1} H(λ) H(λ) (H(λ) + σI_T)^{−1}] → 0,   (14)

lim_{T→∞} sup_{Θ×B×A} (1/T) tr[(H(λ) + σI_T)^{−1} H(λ) (H(λ) + σI_T)^{−1} H(λ)] → 0.   (15)

Corollary 2. Let assumptions i., ii., and 1 to 4 hold. Then, for all i, j = 1, 2, ..., I,

lim_{T→∞} sup_{Θ×B×A} (1/T) tr[(H(λ) + σI_T)^{−1} H_i (H(λ) + σI_T)^{−1} H_j] → 0.

3.2 Consistency of the parameter estimators

Recall that the average likelihood function (8) associated with model (7) is given as

Q_T(θ, β) = −(1/2T) ln det C − (1/2T)(y − Xβ)′ C^{−1} (y − Xβ).

In this section, we will establish consistency of the maximum likelihood estimator of the parameter vector θ ≡ (λ_1, ..., λ_I, σ)′. The consistency of β̂, i.e., β̂ →p β∗, is shown by Dahl and Qin (2004). We proceed as follows: first, we prove that the expectation of Q_T(θ, β) converges uniformly to a limiting function Q∗(θ, β). In the second stage, we prove the main theorem, which states that θ̂ →p θ∗, where θ∗ maximizes Q∗(θ, β∗). For these results to hold we need a set of auxiliary propositions on uniform convergence of deterministic sequences. Specifically, let us define

R_T ≡ (1/T) log det C and U_T ≡ (1/T)(ψ(X) − Xβ)′ C^{−1} (ψ(X) − Xβ),

with corresponding differentials D^l_{i_1,...,i_l} R_T and D^l_{i_1,...,i_l} U_T denoting the l-th order partial derivatives of R_T and U_T with respect to the parameters indexed by i_1, ..., i_l = 0, 1, ..., I, l = 1, 2, ..., where the index 0 refers to σ and the index i ∈ {1, ..., I} refers to λ_i; in particular, D_0 R_T = ∂R_T/∂σ and D_0 U_T = ∂U_T/∂σ. Furthermore, define the following limits: D^l_{i_1,...,i_l} R = lim_{T→∞} D^l_{i_1,...,i_l} R_T, D^l_{i_1,...,i_l} U = lim_{T→∞} D^l_{i_1,...,i_l} U_T, D_0 R = lim_{T→∞} D_0 R_T, and D_0 U = lim_{T→∞} D_0 U_T. The following propositions establish that, under the assumptions stated above, the sequences R_T and U_T, as well as their differentials, converge uniformly.

Proposition 1. Given assumptions i., ii., and 1 to 4, D_i R, D²_{i,j} R and D_0 U are equal to zero uniformly on Θ × B × A for all i, j = 1, ..., I.

Proposition 2. Given assumptions i., ii., and 1 to 4, all of the function sequences {D^l_{i_1...i_l} R_T}_T for l = 1, 2, ... and i_1, ..., i_l = 0, 1, ..., I, are equicontinuous. Furthermore, each function sequence converges uniformly on Θ × A as T → ∞.

Proposition 3. Given assumptions i., ii., and 1 to 4, all of the function sequences {D^l_{i_1...i_l} U_T}_T for l = 1, 2, ... and i_1, ..., i_l = 0, 1, ..., I, are equicontinuous. Furthermore, each function sequence converges uniformly on Θ × B × A as T → ∞.

Next, we define the following second order sample moment matrices:

M_{x·i x·j}(θ) ≡ (1/T) x′_·i (H(λ) + σI_T)^{−1} x_·j,
M_{ψ x·i}(θ) ≡ (1/T) ψ(X)′ (H(λ) + σI_T)^{−1} x_·i,
M_{ψψ}(θ) ≡ (1/T) ψ(X)′ (H(λ) + σI_T)^{−1} ψ(X),

where x_·i = (x_1i, ..., x_Ti)′ for i, j = 1, 2, ..., k. Then, we have the following result.

Proposition 4. Given assumptions i., ii., and 1 to 4, all of the function sequences {D^l_{i_1...i_l} M_{x·i x·j}}_T, {D^l_{i_1...i_l} M_{ψ x·i}}_T, and {D^l_{i_1...i_l} M_{ψψ}}_T for l = 1, 2, ... and i_1, ..., i_l = 0, 1, ..., I, are equicontinuous. Furthermore, each function sequence converges uniformly on Θ × B × A as T → ∞, and

lim_{T→∞} sup_{Θ×B×A} D_0 M_{x·i x·j}(θ) → 0,   (16)
lim_{T→∞} sup_{Θ×B×A} D_0 M_{ψ x·i}(θ) → 0,   (17)
lim_{T→∞} sup_{Θ×B×A} D_0 M_{ψψ}(θ) → 0.   (18)

Next, consider the objective function (8). Let Q∗_T(θ, β) = E(Q_T(θ, β)). Substituting y = ψ(X) + e in (8) and taking expectations, we have

Q∗_T(θ, β) = −(1/2T) log det C − (1/2T)(ψ(X) − Xβ)′ C^{−1} (ψ(X) − Xβ) − (σ²_e/2T) tr(C^{−1}),   (19)

which can also be written, according to the notation established previously, as

Q∗_T(θ, β) = −(1/2) R_T − (1/2) U_T − (σ²_e/2) D_0 R_T.

Theorem 2. Let assumptions i., ii., and 1 to 4 hold. Then

lim_{T→∞} sup_{Θ×B×A} |Q_T(θ, β) − Q∗_T(θ, β)| →p 0,   (20)

and

lim_{T→∞} sup_{Θ×B×A} |Q∗_T(θ, β) − Q∗(θ, β)| → 0,   (21)

where Q∗(θ, β) = −(1/2) R − (1/2) U − (σ²_e/2) D_0 R.

By Theorem 2, it can be seen that Q_T(θ, β) →p Q∗(θ, β). The unique existence of θ∗ ∈ Θ that maximizes Q∗(θ, β) for a given β is guaranteed in the following theorem.

Theorem 3. Let U(θ) be a convex function of θ ∈ Θ and tr(C^{−1} C^{−1}) < 2σ²_e tr(C^{−1} C^{−1} C^{−1}). Then Q∗(θ; β) is a concave function of θ ∈ Θ and has a unique maximizer θ∗ in Θ. A necessary condition for concavity of Q∗(θ; β) in σ is given by the condition σ ≤ 2σ²_e.

In the following sections, we will derive a consistent estimator of σ²_e that will permit actual empirical verification of the necessary and sufficient conditions for identification. Now, we can establish consistency of the estimators of the parameters.

Theorem 4. Let assumptions i., ii., and 1 to 4 hold, and let β̂ be the first-stage OLS estimator of the linear parameter β such that β̂ →p β∗. Then

θ̂ →p θ∗,

where θ̂ is the two-stage estimator and θ∗ maximizes Q∗(θ, β∗).
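To make the two-stage scheme concrete, here is a hedged, self-contained Python sketch (our own toy version, not the paper’s software) for one scalar regressor with the fully additive spherical covariance H(h) = (1 − h)1(h ≤ 1), h = ½|g(x_t − x_s)|; stage 2 uses a coarse grid search over θ = (λ, σ) instead of a full numerical optimizer:

```python
import math

def chol_logdet_solve(C, b):
    """For symmetric positive definite C, return (log det C, C^{-1} b)
    via a plain Cholesky factorization C = L L'."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    z = [0.0] * n                      # forward solve L z = b
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    w = [0.0] * n                      # back solve L' w = z
    for i in range(n - 1, -1, -1):
        w[i] = (z[i] - sum(L[k][i] * w[k] for k in range(i + 1, n))) / L[i][i]
    return logdet, w

def Q_T(y, x, beta, lam, sigma, g):
    """Average log-likelihood (8):
    -(1/2T) ln det C - (1/2T)(y - X*beta)' C^{-1} (y - X*beta)."""
    T = len(y)
    C = [[lam * max(0.0, 1.0 - 0.5 * abs(g * (x[t] - x[s])))
          + (sigma if t == s else 0.0) for s in range(T)] for t in range(T)]
    r = [y[t] - beta * x[t] for t in range(T)]
    ld, w = chol_logdet_solve(C, r)
    return -ld / (2.0 * T) - sum(r[t] * w[t] for t in range(T)) / (2.0 * T)

def two_stage(y, x, g=1.0, grid=(0.05, 0.1, 0.25, 0.5, 1.0)):
    """Stage 1: OLS for beta; stage 2: maximize Q_T over a grid for (lam, sigma)."""
    beta = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    lam, sigma = max(((l, s) for l in grid for s in grid),
                     key=lambda th: Q_T(y, x, beta, th[0], th[1], g))
    return beta, (lam, sigma)
```

The grid search is purely illustrative; the point is the structure of the procedure, with β treated as a nuisance parameter estimated once by OLS and held fixed when (8) is maximized over θ.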

3.3 Asymptotic normality

In this section we establish the asymptotic distribution of θ̂. In the first stage of estimation, we compute the OLS estimator β̂, which is taken as a nuisance parameter in the second stage. In the second stage, we estimate θ. The estimation of β will introduce further variability in the estimation of θ that will be reflected in the moments of its asymptotic distribution. We begin with the derivation of the asymptotic distribution of the score


function. The objective function associated with the first stage OLS estimation is given as

m_T(β) = −(1/T) Σ_{t=1}^T (y_t − x_t· β)².   (22)

We stack the gradient vector of (22) and the score vector associated with (8) in a (I + 1 + k) × 1 vector as

g_T(θ, β) = (D_θ Q_T(θ, β)′, D_β m_T(β)′)′,   (23)

where

D_θ Q_T(θ, β) = (D_1 Q_T(θ, β), ..., D_I Q_T(θ, β), D_0 Q_T(θ, β))′,

with

D_i Q_T(θ, β) = −(1/2T) tr(C^{−1} H_i) + (1/2T)(y − Xβ)′ C^{−1} H_i C^{−1} (y − Xβ), for i = 1, ..., I,
D_0 Q_T(θ, β) = −(1/2T) tr(C^{−1}) + (1/2T)(y − Xβ)′ C^{−1} C^{−1} (y − Xβ),   (24)

and

D_β m_T(β) = (2/T) (Σ_{t=1}^T x_t1 (y_t − x_t· β), ..., Σ_{t=1}^T x_tk (y_t − x_t· β))′.

The following proposition provides us with the exact variance of √T g_T.

Proposition 5. Let assumptions i., ii., and 1 to 4 hold. Then

cov(√T D_{i_1} Q_T(θ, β), √T D_{i_2} Q_T(θ, β)) = (σ⁴_e/2)(1/T) tr(C^{−1} H_{i_1} C^{−1} C^{−1} H_{i_2} C^{−1}) + (σ²_e/T) c′ C^{−1} H_{i_1} C^{−1} C^{−1} H_{i_2} C^{−1} c,   (25)

cov(√T D_{i_1} Q_T(θ, β), √T D_{j_1} m_T(β)) = −2σ²_e D_{i_1} (M_{x·j_1 ψ} − Σ_{j=1}^k M_{x·j_1 x·j} β_j),   (26)

cov(√T D_{j_1} m_T(β), √T D_{j_2} m_T(β)) = (4σ²_e/T) x′_{·j_1} x_{·j_2},   (27)

for all i_1, i_2 = 0, 1, 2, ..., I and j_1, j_2 = 1, 2, ..., k. Furthermore, (25), (26) and (27) converge uniformly on Θ × B × A as T → ∞.

The asymptotic distribution of √T g_T is a result of the following theorem.

Theorem 5. Let g_T(θ, β) be given by (23) and let assumptions i., ii., and 1 to 4 hold. Then, as T → ∞,

√T g_T(θ∗, β∗) →d N(0, Σ∗),

where Σ∗ = Σ(θ∗, β∗) = [Σ∗_11, Σ∗_12; Σ∗′_12, Σ∗_22] and

Σ∗_11 = (σ⁴_e/2) lim_{T→∞} (1/T) tr(C^{−1} H_{i_1} C^{−1} C^{−1} H_{i_2} C^{−1}),
Σ∗_12 = −2σ²_e lim_{T→∞} D_{i_1} M_{x·j_1 ψ},
Σ∗_22 = 4σ²_e lim_{T→∞} (1/T) X′X.

Then, the asymptotic normality of θ̂ can be established in the following Theorem 6.

Theorem 6. Let assumptions i., ii., and 1 to 4 hold. Define ζ = (θ′, β′)′, let Q_T(ζ) and m_T(β) be given by (8) and (22) respectively, and let Σ∗ be defined as in Theorem 5. Then

√T (ζ̂_T − ζ∗) →d N(0, M∗),   (28)

where M∗ = G∗^{−1} Σ∗ G∗^{−1}′, for G∗ = lim_{T→∞} G_T(ζ∗), and

G_T(ζ) = D_ζ g_T(ζ) = [D²_θθ Q_T(ζ), D²_θβ Q_T(ζ); 0_{k×I}, D²_ββ m_T(β)].

For the parameter of interest θ, notice that

√T (θ̂_T − θ∗) →d N(0, M∗_11),

where M∗_11 is the upper left corner of the matrix M∗, which is equal to

M∗_11 = (D²_θθ Q∗(ζ∗))^{−1} Σ∗_11 (D²_θθ Q∗(ζ∗))^{−1} + σ²_e (D²_θθ Q∗(ζ∗))^{−1} D²_θβ Q∗(ζ∗) (lim_{T→∞} (1/T) X′X)^{−1} (D²_θβ Q∗(ζ∗))′ (D²_θθ Q∗(ζ∗))^{−1}.   (29)

A consistent estimator of the variance M∗_11 is obtained by substituting σ²_e and ψ(X) with their respective consistent estimators σ̂²_e and µ̂. Establishing the asymptotic normality and consistency of the estimators of the parameters of the additive random field model in Theorem 6 is useful for a number of reasons. These asymptotic results can be used to construct confidence bands for the estimated conditional mean function without the use of the computer-intensive Bayesian methods suggested by Hamilton (2001). Such bands are extremely useful in testing hypotheses about the functional form of the data generating process. Another important application of the asymptotic distribution is that it can be used to establish the asymptotic distribution of a very simple parametric test for additivity, which is discussed in the following section.

4 Testing for additivity

There are several additivity tests in the nonparametric literature, for example Chen, Liu, and Tsay (1995), and in the parametric literature, for example, Barry (1993), and Eubank, Hart, Simpson, and Stefanski (1995). The last two tests suffer from a restrictive sampling scheme because the explanatory variables are sampled on a grid, and the first two rely on the selection of a data-dependent bandwidth. We propose an additivity test within the framework of a parametric random field model that does not depend on either a restrictive sampling scheme or bandwidth selection. As a by-product, we also propose a “new” consistent estimator of σ²_e. The null hypothesis is that the data generating process is given by (9) and (10) described in Section 2.2. The test for additivity will assess the goodness of fit of the misspecified additive random field model µ(x_t·) = x_t· β + Σ_{i=1}^I λ̃_i m_i(g_i ⊙ x_ti·). A direct measure of the goodness of fit of the additive random field model is the estimation error. By Theorem 1, the mean squared error in the estimation of the conditional mean converges to 0 uniformly if the data generating process is additive in the regressors. Using this result, the test is based on the estimation error of the observed response y_t, which is the sum of the true conditional mean ψ(x_t·) and the error term e_t.


Theorem 7  Let assumptions i., ii., and 1 to 4 hold. Define the estimation error as $\hat\varepsilon = y - \big(X\beta + H(\lambda)C^{-1}(y - X\beta)\big)$, where $C = H(\lambda) + \sigma I_T$. Then
$$\frac{1}{\sqrt{T}}\Big(\hat\varepsilon'\hat\varepsilon + \sigma^2 T D_0U_T + \sigma^2\sigma_e^2 T D^2_{00}R_T\Big) \stackrel{a}{\sim} N\Big(0,\; -\frac{1}{3}\sigma^4\sigma_e^4 D^4_{0000}R\Big), \tag{30}$$
for any $(\theta',\beta')' \in \Theta\times B$.
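Because $C = H(\lambda) + \sigma I_T$, the estimation error above simplifies to $\hat\varepsilon = \sigma C^{-1}(y - X\beta)$, since $I_T - H(\lambda)C^{-1} = (C - H(\lambda))C^{-1} = \sigma C^{-1}$. A minimal numerical check of this identity (the positive semi-definite matrix `H` below is an arbitrary stand-in for the random field covariance; all names are ours, not from the authors' software):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
A = rng.standard_normal((T, T))
H = A @ A.T / T                      # arbitrary PSD stand-in for H(lambda)
sigma = 0.5
C = H + sigma * np.eye(T)

r = rng.standard_normal(T)           # plays the role of y - X beta
C_inv = np.linalg.inv(C)

eps_hat = r - H @ C_inv @ r          # definition in Theorem 7
eps_alt = sigma * C_inv @ r          # equivalent form via I - H C^{-1} = sigma C^{-1}

assert np.allclose(eps_hat, eps_alt)
```

The second form is the cheaper one in practice, since $C^{-1}(y - X\beta)$ is typically already available from the estimation step.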

Note that the quantities $D_0U_T(\theta)$, $D^2_{00}R_T(\theta)$, and $D^4_{0000}R(\theta)$ are all nonpositive. It follows from Theorem 7 that
$$\limsup_{T\to\infty}\sup_{\Theta\times B}\left(\frac{1}{T}\hat\varepsilon'\hat\varepsilon + \sigma^2\sigma_e^2 D^2_{00}R_T\right) = 0 \tag{31}$$

because the deterministic term $\sigma^2 D_0U_T$ converges to zero by Proposition 1. Furthermore, it follows from (31) that a consistent estimator of $\sigma_e^2$ can be constructed as described in the following corollary.

Corollary 3  Let assumptions i., ii., and 1 to 4 hold. Let $\hat\varepsilon$ be defined as in Theorem 7. Then, as $T\to\infty$,
$$\frac{\frac{1}{T}\hat\varepsilon'\hat\varepsilon}{-\hat\sigma^2 D^2_{00}\hat R_T} \stackrel{p}{\longrightarrow} \sigma_e^2, \tag{32}$$

where $\hat\theta = (\hat\lambda',\hat\sigma)'$ and $\hat\beta$ are the consistent two-stage estimators.
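In computable terms, $R_T = \frac{1}{T}\log\det C$ (see the proof of Proposition 1), so that, taking the differentiation index 0 to correspond to $\sigma$ with $H_0 = I_T$, $D^2_{00}\hat R_T = -\mathrm{tr}(\hat C^{-1}\hat C^{-1})/T$, and the estimator in (32) is a ratio of two scalars. A hedged sketch (the fitted $\hat C$, $\hat\sigma$, and $\hat\varepsilon$ are assumed to come from a previously estimated random field model; the function name and the $H_0 = I_T$ convention are our reading of the appendix, not the authors' code):

```python
import numpy as np

def sigma_e2_estimate(eps_hat, C_hat, sigma_hat):
    """Corollary 3 estimator: (eps'eps/T) / (-sigma^2 * D^2_00 R_T),
    with D^2_00 R_T = -tr(C^-2)/T (nonpositive, as noted in the text)."""
    T = len(eps_hat)
    C_inv = np.linalg.inv(C_hat)
    d2_00_R = -np.trace(C_inv @ C_inv) / T
    return (eps_hat @ eps_hat / T) / (-sigma_hat**2 * d2_00_R)

# Degenerate sanity check: with H(lambda) = 0 we have C = sigma*I, the
# estimation error equals the raw residual, and the estimator collapses
# to the sample second moment of the disturbances.
rng = np.random.default_rng(1)
T = 2000
e = rng.normal(0.0, 1.5, T)
sigma = 2.0
C = sigma * np.eye(T)
eps_hat = sigma * np.linalg.inv(C) @ e   # = e in this degenerate case
est = sigma_e2_estimate(eps_hat, C, sigma)
assert abs(est - np.mean(e**2)) < 1e-8
```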

Note that a careful examination of the proof of Theorem 7 indicates that the results in (31) and (32) do not rely specifically on the additive random field framework; they hold for any general random field model. Therefore, the consistent estimator of $\sigma_e^2$ can also be obtained by fitting a random field model to a nonadditive or additive model. One only needs to replace the corresponding quantities, for example $\hat\varepsilon$ and $D^2_{00}R_T$, with their sample counterparts in the random field model under consideration. Based on Theorem 7, we propose the following test statistic for additivity.

Corollary 4  Let $\hat\theta = (\hat\lambda',\hat\sigma)'$, $\hat\beta$, and $\hat\sigma_e^2$ be consistent estimators. Under the null hypothesis of an additive data generating process, the statistic $ResT_{DGQ}$ is asymptotically standard normally distributed, i.e.,
$$ResT_{DGQ} \equiv \frac{\frac{1}{T}\hat\varepsilon'\hat\varepsilon + \hat\sigma^2\hat\sigma_e^2 D^2_{00}\hat R_T}{\sqrt{-\frac{2}{3T}\hat\sigma^4\hat\sigma_e^2 D^3_{000}\hat U_T - \frac{1}{3T}\hat\sigma^4\hat\sigma_e^4 D^4_{0000}\hat R_T}} \stackrel{d}{\longrightarrow} N(0,1). \tag{33}$$

We will refer to $ResT_{DGQ}$ as the residual-based test statistic. It should be noticed that we have added the asymptotically vanishing term $D^3_{000}U_T$ to the variance and deleted the asymptotically vanishing term $D_0U_T$ from the mean to obtain better finite sample properties. In particular, the additional term in the variance corrects the tendency of the test to over-reject a true null hypothesis in finite samples. Note that, by Proposition 1, $\lim_{T\to\infty}D^2_{ij}R_T \to 0$ and $\lim_{T\to\infty}D_0U_T \to 0$ uniformly. While the first term relies solely on the covariance matrix of the random field, it is the second term that actually reflects the approximation of the true data generating process (hereafter, DGP) by the additive random field model. In other words, when the true DGP is not additive, the additive random field model will not be a good approximation, that is, $\lim_{T\to\infty}D_0U_T \nrightarrow 0$. However, $D_0U_T$ is bounded, as
$$|D_0U_T| \le \frac{1}{T\sigma^2}(\psi(X)-X\beta)'(\psi(X)-X\beta).$$

We can assume that when the true DGP is not additive, then $D_0U_T = O(1)$ for every $(\theta',\beta')' \in \Theta\times B$; thus, under the alternative hypothesis, $ResT_{DGQ} = O_p(\sqrt{T})$.⁸ This property agrees with most of the specification tests in the literature, see, e.g., Wooldridge (1992), providing an argument for the consistency of the test statistic $ResT_{DGQ}$. Generally, under a true alternative, $ResT_{DGQ}$ takes a large value.

The test statistic $ResT_{DGQ}$ is not yet directly computable due to the unknown but asymptotically vanishing term $D^3_{000}U_T$. Since the estimator of the conditional mean $\psi(X)$ is given by $\hat\mu = X\hat\beta + H(\hat\lambda)C^{-1}(\hat\theta)(y - X\hat\beta)$ and
$$\limsup_{T\to\infty}\sup_{\Theta\times B}\big|D^3_{000}\hat U_T\big| = 0, \tag{34}$$
where $\hat U_T = \frac{1}{T}(\hat\mu - X\hat\beta)'C^{-1}(\hat\theta)(\hat\mu - X\hat\beta)$, we suggest replacing $D^3_{000}U_T$ by $D^3_{000}\hat U_T$ when computing the test statistic $ResT_{DGQ}$. This will not result in a loss of asymptotic power but will improve the small sample properties of the test. Summarizing, the test statistic $ResT_{DGQ}$ can be computed by the following simple 3-step procedure:

⁸This result can be seen by multiplying (57) in the Mathematical Appendix by $\sqrt{T}$ and noticing that the resulting last two terms on the right hand side are bounded in probability by a central limit theorem. After multiplying by $\sqrt{T}$, the first term on the right hand side of (57), however, will be $O(\sqrt{T})$ when $D_0U_T = O(1)$.


Step 1  Fit a random field model with only one comprehensive random field as in Hamilton (2001), Dahl and González-Rivera (2002), or Dahl and Qin (2004). Compute the consistent estimator $\hat\sigma_e^2$ as in (32) based on this random field model.

Step 2  Fit the additive random field model (2), and then use the $\hat\sigma_e^2$ obtained in Step 1 to calculate the test statistic
$$\widehat{ResT}_{DGQ} = \frac{\frac{1}{T}\hat\varepsilon'\hat\varepsilon + \hat\sigma^2\hat\sigma_e^2 D^2_{00}\hat R_T}{\sqrt{-\frac{2}{3T}\hat\sigma^4\hat\sigma_e^2 D^3_{000}\hat U_T - \frac{1}{3T}\hat\sigma^4\hat\sigma_e^4 D^4_{0000}\hat R_T}},$$
by plugging in the two-stage estimates $\hat\beta$ and $\hat\theta$.

Step 3  Reject the null if $|\widehat{ResT}_{DGQ}| > \Phi^{-1}\big(1 - \tfrac{1}{2}\alpha\big)$, where $\Phi(\cdot)$ is the c.d.f. of the standard normal distribution and $\alpha$ denotes the nominal level of the test.
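Given the fitted quantities, Steps 2 and 3 reduce to assembling scalar traces and quadratic forms in $\hat C^{-1}$. The sketch below assumes the two-stage estimation has already produced $\hat\beta$, $\hat\sigma$, the fitted covariance $\hat C$, and $\hat\sigma_e^2$ from Step 1; it also assumes, as the appendix derivatives suggest, that $R_T = \log\det C/T$ and $\hat U_T$ is the quadratic form in (34), with index 0 corresponding to $\sigma$ and $H_0 = I_T$. Function and variable names are ours, not the authors':

```python
import numpy as np

def res_t_dgq(y, X, beta_hat, C_hat, sigma_hat, sigma_e2_hat):
    """Residual-based additivity statistic of Corollary 4 (a sketch).
    Derivative quantities used (our reading of the appendix):
      D^2_00 R_T     = -tr(C^-2)/T
      D^4_0000 R_T   = -6 tr(C^-4)/T
      D^3_000 U_hat_T = -6 (mu-Xb)' C^-4 (mu-Xb)/T   (cf. eq. (34))."""
    T = len(y)
    Ci = np.linalg.inv(C_hat)
    r = y - X @ beta_hat
    eps = sigma_hat * Ci @ r                             # estimation error (Theorem 7)
    mu_dev = (C_hat - sigma_hat * np.eye(T)) @ Ci @ r    # H C^{-1}(y - X beta) = mu_hat - X beta_hat
    Ci2 = Ci @ Ci
    d2_R = -np.trace(Ci2) / T
    d4_R = -6.0 * np.trace(Ci2 @ Ci2) / T
    d3_U = -6.0 * mu_dev @ Ci2 @ Ci2 @ mu_dev / T
    num = eps @ eps / T + sigma_hat**2 * sigma_e2_hat * d2_R
    var = (-2.0 / (3 * T)) * sigma_hat**4 * sigma_e2_hat * d3_U \
          - (1.0 / (3 * T)) * sigma_hat**4 * sigma_e2_hat**2 * d4_R
    return num / np.sqrt(var)

# toy call on synthetic inputs; Step 3: reject at the 5% level if |stat| > 1.96
rng = np.random.default_rng(3)
T = 30
A = rng.standard_normal((T, T))
C = A @ A.T / T + 0.8 * np.eye(T)
X = rng.standard_normal((T, 2))
beta = np.array([0.5, -0.2])
y = X @ beta + rng.standard_normal(T)
stat = res_t_dgq(y, X, beta, C, 0.8, 1.0)
```

Both variance pieces are nonnegative by construction (the $D^3$ and $D^4$ quantities are nonpositive), so the square root is well defined.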

The auxiliary random field model in Step 1 is only used to estimate the variance $\sigma_e^2$ of the true disturbance term. This objective can also be achieved by other consistent estimators of $\sigma_e^2$, such as the nonparametric estimator suggested by Hall, Kay, and Titterington (1990), which enjoys the parametric rate of convergence. We notice that an overly high estimate of $\sigma_e^2$ usually results in a less powerful but more conservative test. In the Monte Carlo experiment section we will discuss the degree of robustness of the residual-based test to "overestimates" of $\sigma_e^2$ within a large class of nonadditive data generating processes.

We conclude this section by discussing why it is not appropriate to construct a nested additivity test based on the likelihood function, such as an LM-type test as proposed by Hamilton (2001) and Dahl and González-Rivera (2003) to detect neglected nonlinearity of a more general form. To perform a nested additivity test, we need under the alternative hypothesis a general model that includes a comprehensive random field, i.e., $y_t = x_{t\cdot}\beta + \sum_{i=1}^{I}\tilde\lambda_i m_i(x_{ti\cdot}) + \tilde\lambda_h m_h(x_{t\cdot}) + \epsilon_t$, where $m_h(x_{t\cdot})$ denotes the comprehensive random field defined on the compact set $\mathcal{A}^k \subset \mathbb{R}^k$. Then, an LM-type test for additivity will have a null hypothesis defined as
$$H_0: \lambda_h \equiv \tilde\lambda_h^2 = 0. \tag{35}$$

In this setting, one has to allow the domain of the parameter $\lambda_h$ to be a compact set $\mathcal{A}_h \subset \mathbb{R}$ containing the origin. On theoretical grounds, the inclusion of the origin in the parameter space invalidates assumption ii., with critical consequences for the validity of the asymptotic results presented. On numerical grounds, we find that, under the null hypothesis (35), the elements of the Hessian matrix $D^2_{hh}Q_T(\theta;\beta)$, i.e., the expression $\frac{1}{T}c'C^{-1}H_hC^{-1}H_hC^{-1}c$, do not converge. In most cases, this quantity actually explodes!

5  Simulation experiments

We perform several Monte Carlo experiments with a wide range of additive and nonadditive data generating processes to evaluate the predictive power of the additive random field model and the performance of the residual-based additivity test. In Table 1, we present sixteen data generating processes: eight with an additive structure (A1 to A8) and eight with a nonadditive structure (N1 to N8). The nonlinear specifications are diverse: polynomials, logarithmic, exponential, threshold, and sine functions, which cover many of the most popular econometric models used in applied work. The explanatory variables $(x_{t1}, x_{t2})'$ are sampled independently from a uniform distribution, i.e., $x_{ti} \sim U(-5,5)$ for $i = 1,2$. Models A5-A8 and N5-N8 are adapted from Chen, Liu, and Tsay (1995), with certain modifications of the coefficients to accommodate the uniformly designed sampling domain. We conduct 1000 Monte Carlo replications with a sample size of 100 observations for estimation and 100 out-of-sample points for the one-step-ahead prediction.

In this section, we introduce an estimator/algorithm based on a simplified version of the additive random field model, which reduces the computational burden significantly but maintains almost the same predictive accuracy as the additive model described in the previous sections. We call the alternative specification the "proportional additive random field model," and it is given as
$$y_t = x_{t\cdot}\beta + \tilde\lambda\sum_{i=1}^{I}c_i\,m_i(g_i\odot x_{ti\cdot}) + \tilde\epsilon_t, \tag{36}$$
where $c_i$ is a predetermined proportional weight of the $i$-th random field, $\tilde\lambda$ is the total weight of the random field component, and $\tilde\epsilon_t \sim IN(0,\tilde\sigma^2)$. Studies of proportional additivity in nonparametric models can be found in Yang (2002) and Carroll, Härdle,

Table 1: Additive and nonadditive data generating processes. It is assumed that $x_{ti} \sim$ i.i.d. $U(-5,5)$ for $i = 1,2$ and $e_t \sim N(0,\sigma_e^2)$.

Model  True DGP
A1     $y_t = 1 + 0.1x_{t1}^2 + 0.2x_{t2} + e_t$
A2     $y_t = 0.1x_{t1}^2 + 0.2\ln(x_{t2}+6) + e_t$
A3     $y_t = \frac{0.5x_{t1}-1}{x_{t1}+6} - 0.5\exp(0.5x_{t2}) + e_t$
A4     $y_t = 1.5\sin(x_{t1}) + 2\sin(x_{t2}) + e_t$
A5     $y_t = 0.5x_{t1} + \sin(x_{t2}) + e_t$
A6     $y_t = 0.8x_{t1} - 0.3x_{t2} + e_t$
A7     $y_t = \exp(-0.5x_{t1}^2)x_{t1} - \exp(-0.5x_{t2}^2)x_{t2} + e_t$
A8     $y_t = -2x_{t1}\,1\{x_{t1}<0\} + \cdots\,1\{x_{t1}>0\} + e_t$
N1     $y_t = 0.3x_{t1}^2x_{t2} + 0.2x_{t2}^2 + e_t$
N2     $y_t = \sin(x_{t1}+x_{t2}) + e_t$
N3     $y_t = 1.5\sin(x_{t1}+0.2x_{t2}) + 2\sin(0.5x_{t1}+x_{t2}) + e_t$
N4     $y_t = 2\times 1\{x_{t1}+x_{t2}<\cdots\} + e_t$
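For reference, a few of the cleanly legible Table 1 designs can be generated as follows (a minimal sketch of the sampling scheme only; the error standard deviation $\sigma_e$ is left as a free parameter since its values are reported elsewhere in the paper, and the function name is ours):

```python
import numpy as np

def simulate_dgp(model, T, sigma_e, rng):
    """Draw (y, X) from some of the Table 1 designs:
    regressors i.i.d. U(-5, 5), errors N(0, sigma_e^2)."""
    x1 = rng.uniform(-5, 5, T)
    x2 = rng.uniform(-5, 5, T)
    e = rng.normal(0.0, sigma_e, T)
    mean = {
        "A1": 1 + 0.1 * x1**2 + 0.2 * x2,
        "A4": 1.5 * np.sin(x1) + 2 * np.sin(x2),
        "N1": 0.3 * x1**2 * x2 + 0.2 * x2**2,
        "N2": np.sin(x1 + x2),
    }[model]
    return mean + e, np.column_stack([x1, x2])

rng = np.random.default_rng(42)
y, X = simulate_dgp("A1", 100, 1.0, rng)   # estimation sample of 100 observations
```

The paper's design uses 100 observations for estimation plus another 100 out-of-sample draws for one-step-ahead prediction; the same function can generate both samples.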

$$\mathrm{tr}\left[C^{-1}\Big(\sum_{i=1}^{I}\lambda_iH_i\Big)C^{-1}\Big(\sum_{i=1}^{I}\lambda_iH_i\Big)\right] = \sum_{i=1}^{I}\lambda_i^2\,\mathrm{tr}\big(C^{-1}H_iC^{-1}H_i\big) + 2\sum_{i=1}^{I}\sum_{j>i}\lambda_i\lambda_j\,\mathrm{tr}\big(C^{-1}H_iC^{-1}H_j\big).$$

By combining (15) of Corollary 1 and Lemma A.3 (taking $A = C^{-1}H_iC^{-1}$ and $B = H_j$), each term on the right hand side is positive and converges uniformly to zero on $\Theta\times B\times\mathcal{A}$ as $T\to\infty$, which concludes the proof.

Proof of Proposition 1  From Magnus and Neudecker (1999, chapter 8),
$$D_iR_T = \frac{1}{T}D_i\log\det C = \frac{1}{T}\mathrm{tr}\big(C^{-1}(D_iC)\big) = \frac{1}{T}\mathrm{tr}\big(C^{-1}H_i\big),$$
and
$$D^2_{i,j}R_T = \frac{1}{T}\mathrm{tr}\big({-C^{-1}(D_iC)C^{-1}(D_jC)}\big) = -\frac{1}{T}\mathrm{tr}\big(C^{-1}H_iC^{-1}H_j\big).$$

From Corollary 2, it follows that $D^2_{i,j}R = \limsup_{T\to\infty}\sup_{\Theta\times\mathcal{A}}D^2_{i,j}R_T = 0$ for all $i,j = 1,\ldots,I$. From Corollary 1, it follows that $D_0U = 0$. Finally, let the eigenvalues of $C^{-1}H_i$ be $\gamma_1 \le \gamma_2 \le \ldots \le \gamma_T$. Then
$$D_iR_T = \frac{1}{T}\mathrm{tr}\big(C^{-1}H_i\big) = \frac{1}{T}\sum_{t=1}^{T}\gamma_t \le \sqrt{\frac{\sum_{t=1}^{T}\gamma_t^2}{T}} = \sqrt{-D^2_{i,i}R_T},$$
and $D_iR = 0$ immediately follows, for all $i = 1,2,\ldots,I$.

Proof of Proposition 2

The sequences of differentials of $R_T$ for $l = 3,4$ are given by
$$D^3_{i_1,i_2,i_3}R_T = \frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_3}C^{-1}H_{i_2}C^{-1}H_{i_1}\big) + \frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_2}C^{-1}H_{i_3}C^{-1}H_{i_1}\big), \tag{43}$$
and
$$\begin{aligned} D^4_{i_1,i_2,i_3,i_4}R_T = &-\frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_4}C^{-1}H_{i_3}C^{-1}H_{i_2}C^{-1}H_{i_1}\big) - \frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_3}C^{-1}H_{i_4}C^{-1}H_{i_2}C^{-1}H_{i_1}\big) \\ &- \frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_3}C^{-1}H_{i_2}C^{-1}H_{i_4}C^{-1}H_{i_1}\big) - \frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_4}C^{-1}H_{i_2}C^{-1}H_{i_3}C^{-1}H_{i_1}\big) \\ &- \frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_2}C^{-1}H_{i_4}C^{-1}H_{i_3}C^{-1}H_{i_1}\big) - \frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_2}C^{-1}H_{i_3}C^{-1}H_{i_4}C^{-1}H_{i_1}\big). \end{aligned} \tag{44}$$

Let us focus on the first term of (43). Recall the following property of the trace operator:
$$|\mathrm{tr}(A'B)| \le \sqrt{\mathrm{tr}(A'A)\,\mathrm{tr}(B'B)}. \tag{45}$$

Then,
$$\begin{aligned}
\frac{1}{T}\big|\mathrm{tr}\big(C^{-1}H_{i_3}C^{-1}H_{i_2}C^{-1}H_{i_1}\big)\big|
&\le \frac{1}{T}\sqrt{\mathrm{tr}\big(C^{-1}H_{i_3}H_{i_3}C^{-1}\big)\,\mathrm{tr}\big(H_{i_1}C^{-1}H_{i_2}C^{-1}C^{-1}H_{i_2}C^{-1}H_{i_1}\big)} \\
&\le \frac{1}{T}\sqrt{\frac{T}{\lambda_{i_3}^2}}\sqrt{\mathrm{tr}\big(H_{i_1}C^{-1}H_{i_2}C^{-1}C^{-1}H_{i_2}C^{-1}H_{i_1}\big)} \\
&\le \frac{1}{T}\sqrt{\frac{T}{\lambda_{i_3}^2}}\sqrt{\frac{T}{\lambda_{i_2}^2\lambda_{i_1}^2}} = \frac{1}{\lambda_{i_3}\lambda_{i_2}\lambda_{i_1}}. \tag{46}
\end{aligned}$$

The second inequality in (46) follows from $(C^{-1})^2 \le \big((\lambda_{i_3}H_{i_3})^{-1}\big)^2$ for all $i_3$. Thus,
$$\mathrm{tr}\big(C^{-1}H_{i_3}H_{i_3}C^{-1}\big) = \mathrm{tr}\big((C^{-1})^2(H_{i_3})^2\big) = \frac{1}{\lambda_{i_3}^2}\mathrm{tr}\big((C^{-1})^2(\lambda_{i_3}H_{i_3})^2\big) \le \frac{T}{\lambda_{i_3}^2}.$$

Similarly, the third inequality in (46) follows from
$$\begin{aligned}
\mathrm{tr}\big(H_{i_1}C^{-1}H_{i_2}C^{-1}C^{-1}H_{i_2}C^{-1}H_{i_1}\big)
&\le \mathrm{tr}\big(H_{i_1}C^{-1}H_{i_2}\big((\lambda_{i_2}H_{i_2})^{-1}\big)^2H_{i_2}C^{-1}H_{i_1}\big) \\
&= \frac{1}{\lambda_{i_2}^2}\mathrm{tr}\big(H_{i_1}C^{-1}C^{-1}H_{i_1}\big) \le \frac{1}{\lambda_{i_2}^2}\mathrm{tr}\big(H_{i_1}\big((\lambda_{i_1}H_{i_1})^{-1}\big)^2H_{i_1}\big) = \frac{T}{\lambda_{i_2}^2\lambda_{i_1}^2}.
\end{aligned}$$
The same argument can be applied to the second term of (43), which subsequently can be shown to be bounded by $\frac{1}{\lambda_{i_3}\lambda_{i_2}\lambda_{i_1}}$. It follows that $\sup_{\Theta\times\mathcal{A}}D^3_{i_1,i_2,i_3}R_T$ is bounded for all $T$, and all $i_1,i_2,i_3 = 0,1,2,\ldots,I$. Using similar arguments, we can show that $\sup_{\Theta\times\mathcal{A}}D_{i_1}R_T$, $\sup_{\Theta\times\mathcal{A}}D^2_{i_1,i_2}R_T$, and $\sup_{\Theta\times\mathcal{A}}D^4_{i_1,i_2,i_3,i_4}R_T$ are all bounded for all $T$ and for all $i_1,i_2,i_3,i_4 = 0,1,2,\ldots,I$. Since $\{D^l_{i_1\ldots i_l}R_T\}_T$ for $l = 1,2,\ldots,4$, $i_1,\ldots,i_l = 0,1,\ldots,I$, and all $T$ are bounded, the sequences $\{D^l_{i_1\ldots i_l}R_T\}_T$ for $l = 1,2,3$ and $i_1,\ldots,i_l = 0,1,\ldots,I$ are equicontinuous. Since at least one of these equicontinuous sequences converges (for example $\{D^2_{i_1,i_2}R_T\}_T$ in Proposition 1), Theorem 5 in Dahl and Qin (2004) and Theorem 7.16 in Rudin (1976) imply that all of the function sequences $\{D^l_{i_1\ldots i_l}R_T\}_T$ for $l = 1,2,3$ and $i_1,\ldots,i_l = 0,1,\ldots,I$ are uniformly convergent on $\Theta\times\mathcal{A}$ as $T\to\infty$. This completes the proof.

Proof of Proposition 3

Define $c \equiv (\psi(X)-X\beta)$. The sequences of differentials of $U_T$ for $l = 1,2$ and $i_1,\ldots,i_l = 0,1,\ldots,I$ are given by
$$D_{i_1}U_T = -\frac{1}{T}\mathrm{tr}\big(c'C^{-1}H_{i_1}C^{-1}c\big),$$
and
$$D^2_{i_1,i_2}U_T = \frac{1}{T}\mathrm{tr}\big(c'C^{-1}H_{i_2}C^{-1}H_{i_1}C^{-1}c\big) + \frac{1}{T}\mathrm{tr}\big(c'C^{-1}H_{i_1}C^{-1}H_{i_2}C^{-1}c\big).$$
Notice that
$$\frac{1}{T}\mathrm{tr}\big(c'C^{-1}H_{i_1}C^{-1}c\big) = \frac{1}{T\lambda_{i_1}}\mathrm{tr}\big(c'C^{-1}\lambda_{i_1}H_{i_1}C^{-1}c\big) \le \frac{1}{T\lambda_{i_1}}\mathrm{tr}\big(c'C^{-1}c\big) \le \frac{1}{\sigma\lambda_{i_1}}\frac{1}{T}\mathrm{tr}(c'c),$$
where, by Assumption 4, $\frac{1}{T}\mathrm{tr}(c'c)$ is bounded for all $T$. Note that the last inequality follows as $C^{-1} < \sigma^{-1}I$, while Assumption 2 guarantees the existence of $\frac{1}{\sigma\lambda_{i_1}}$. This implies that $\sup_{\Theta\times B\times\mathcal{A}}D_{i_1}U_T$ is bounded for all $i_1 = 1,2,\ldots,I$ and for all $T$. Following a similar argument as in Proposition 2, the first term of $D^2_{i_1,i_2}U_T$ is bounded:
$$\frac{1}{T}\mathrm{tr}\big(c'C^{-1}H_{i_2}C^{-1}H_{i_1}C^{-1}c\big) = \frac{1}{T\lambda_{i_2}\lambda_{i_1}}\mathrm{tr}\big(c'C^{-1}\lambda_{i_2}H_{i_2}C^{-1}\lambda_{i_1}H_{i_1}C^{-1}c\big) \le \frac{1}{\sigma\lambda_{i_1}\lambda_{i_2}}\frac{1}{T}\mathrm{tr}(c'c).$$
The same bound is obtained for the second term of $D^2_{i_1,i_2}U_T$. Consequently, $\sup_{\Theta\times B\times\mathcal{A}}D^2_{i_1,i_2}U_T$ is bounded for all $i_1,i_2 = 0,1,\ldots,I$ and for all $T$. It is not difficult to show that all terms of $D^l_{i_1\ldots i_l}U_T$ will be bounded by $\big(\sigma\prod_{i=1}^{l}\lambda_i\big)^{-1}\frac{1}{T}\mathrm{tr}(c'c)$, implying that $\sup_{\Theta\times B\times\mathcal{A}}D^l_{i_1\ldots i_l}U_T$ for $l = 1,2,\ldots$ and $i_1,\ldots,i_l = 0,1,\ldots,I$ is bounded for all $T$. Since $\{D^l_{i_1\ldots i_l}U_T\}_T$ for $l = 1,2,\ldots$, $i_1,\ldots,i_l = 0,1,\ldots,I$, and all $T$ are bounded, the sequences $\{D^l_{i_1\ldots i_l}U_T\}_T$ for $l = 1,2,\ldots$ and $i_1,\ldots,i_l = 0,1,\ldots,I$ are equicontinuous. Since at least one of these equicontinuous function sequences converges, for example $\{D_0U_T\}_T$ in Proposition 1,

Theorem 5 in Dahl and Qin (2004) and Theorem 7.16 in Rudin (1976) imply that all of the function sequences $\{D^l_{i_1\ldots i_l}U_T\}_T$ for $l = 1,2,\ldots$ and $i_1,\ldots,i_l = 0,1,\ldots,I$ are uniformly convergent on $\Theta\times B\times\mathcal{A}$ as $T\to\infty$. This completes the proof.

Proof of Proposition 4

The proof of equicontinuity proceeds in a similar fashion as in Propositions 2 and 3. The uniform convergence results of (16), (17), and (18) follow directly from Corollary 1. Inserting $\psi(X) = x_{\cdot i}$ and $\beta = 0$ in (13), we have
$$\lim_{T\to\infty}\sup_{\Theta\times\mathcal{A}}\frac{1}{T}x_{\cdot i}'\big((H(\lambda)+\sigma I_T)^{-1}\big)^2x_{\cdot i} \to 0.$$
Inserting $\psi(X) = x_{\cdot i} + x_{\cdot j}$ and $\beta = 0$ in (13), we can write
$$\lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{1}{T}c'\big(C^{-1}\big)^2c = \lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{1}{T}x_{\cdot i}'\big(C^{-1}\big)^2x_{\cdot i} + \lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{1}{T}x_{\cdot j}'\big(C^{-1}\big)^2x_{\cdot j} + \lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{2}{T}x_{\cdot i}'\big(C^{-1}\big)^2x_{\cdot j}.$$
Since the first two terms converge to zero, the last term must also converge to zero for (13) to hold. Result (17) follows from (13) when $\beta = 0$. Result (18) follows from (13) when $X\beta = x_{\cdot i}$.

Proof of Theorem 2  Define $f_T(e,X,\theta,\beta) \equiv Q_T(\theta,\beta) - Q^*_T(\theta,\beta)$. We write
$$f_T(e,X,\theta,\beta) = -\frac{1}{T}(\psi(X)-X\beta)'C^{-1}e - \frac{1}{2T}e'C^{-1}e + \frac{\sigma_e^2}{2T}\mathrm{tr}\big(C^{-1}\big).$$
We wish to show that
$$\lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}|f_T(e,X,\theta,\beta)| \stackrel{p}{\longrightarrow} 0,$$
which (according to Theorem 21.9 in Davidson (1994)) will be satisfied if and only if a) $f_T(e,X,\theta,\beta) \stackrel{p}{\longrightarrow} 0$ for each $(\theta,\beta)\in\Theta\times B$, and b) $f_T(e,X,\theta,\beta)$ is stochastically equicontinuous. Notice first that $E[f_T(e,X,\theta,\beta)] = 0$. By Chebyshev's inequality, condition a) will be satisfied if
$$\lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}E\big[f_T(e,X,\theta,\beta)^2\big] \to 0.$$
Now, let $0 < \gamma_1 \le \ldots \le \gamma_T$ be the eigenvalues of $H(\lambda)$, and $\nu_t$ for $t = 1,\ldots,T$ be the corresponding eigenvectors. Let $z_t \equiv \frac{1}{\sigma_e}\nu_t'e$ and $a_t \equiv \nu_t'(\psi(X)-X\beta)$ for $t = 1,\ldots,T$.


Then $z_t \sim IN(0,1)$ and we can write
$$f_T(e,X,\theta,\beta) = \frac{1}{T}\sum_{t=1}^{T}\left[-\frac{\sigma_e a_t z_t}{\gamma_t+\sigma} - \frac{\sigma_e^2 z_t^2}{2(\gamma_t+\sigma)} + \frac{\sigma_e^2}{2(\gamma_t+\sigma)}\right] = -\frac{1}{T}\sum_{t=1}^{T}\frac{2\sigma_e a_t z_t + \sigma_e^2 z_t^2 - \sigma_e^2}{2(\gamma_t+\sigma)}. \tag{47}$$

Therefore,
$$E\big[f_T(e,X,\theta,\beta)^2\big] = \frac{1}{T^2}\sum_{t=1}^{T}\frac{4\sigma_e^2a_t^2E(z_t^2) + \sigma_e^4E(z_t^4) + \sigma_e^4 + 4a_t\sigma_e^3E(z_t^3) - 4a_t\sigma_e^3E(z_t) - 2\sigma_e^4E(z_t^2)}{4(\gamma_t+\sigma)^2},$$
and consequently
$$\begin{aligned}
\lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}E\big[f_T(e,X,\theta,\beta)^2\big]
&= \lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{1}{T^2}\sum_{t=1}^{T}\frac{4\sigma_e^2a_t^2 + 3\sigma_e^4 + \sigma_e^4 - 2\sigma_e^4}{4(\gamma_t+\sigma)^2} \\
&= \lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{1}{T^2}\sum_{t=1}^{T}\frac{2\sigma_e^2a_t^2 + \sigma_e^4}{2(\gamma_t+\sigma)^2} \\
&= \lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{1}{T}\left(-\sigma_e^2D_0U_T + \frac{\sigma_e^4}{T}\sum_{t=1}^{T}\frac{1}{2(\gamma_t+\sigma)^2}\right) \\
&\le -\sigma_e^2D_0U + \lim_{T\to\infty}\sup_{\Theta\times B\times\mathcal{A}}\frac{\sigma_e^4}{2T\sigma^2} \to 0,
\end{aligned}$$
where the last inequality follows from assumption ii. and Proposition 1. This completes the proof of condition a). To verify condition b), define $\tilde f_T = f_T\big(e,X,\tilde\theta,\tilde\beta\big),$

and note that
$$\big|f_T - \tilde f_T\big| = \Big|\frac{1}{T}\mathrm{tr}\Big(\big[(\psi(X)-X\beta)'C^{-1} - (\psi(X)-X\tilde\beta)'\tilde C^{-1}\big]e\Big) + \frac{1}{2T}\mathrm{tr}\Big(e'\big[C^{-1}-\tilde C^{-1}\big]e\Big) + \frac{\sigma_e^2}{2T}\mathrm{tr}\big(C^{-1}-\tilde C^{-1}\big)\Big|$$
$$\le \frac{1}{T}\Big\|(\psi(X)-X\beta)'C^{-1} - (\psi(X)-X\tilde\beta)'\tilde C^{-1}\Big\|\sum_{t}e_t^2 + \left(\frac{1}{T}\sum_{t}e_t^2 + \frac{\sigma_e^2}{2T}\right)\Big\|C^{-1}-\tilde C^{-1}\Big\|.$$
It follows immediately (as $X$ is non-stochastic) that $\big\|(\psi(X)-X\beta)'C^{-1} - (\psi(X)-X\tilde\beta)'\tilde C^{-1}\big\| \downarrow 0$ and $\big\|C^{-1}-\tilde C^{-1}\big\| \downarrow 0$ when $(\theta',\beta')' \to (\tilde\theta',\tilde\beta')'$. Furthermore, since $\frac{1}{T}\sum_t e_t^2 = O_p(1)$ (by a law of large numbers) and $\frac{\sigma_e^2}{2T} = O(T^{-1})$, we can conclude that condition b) holds according to, e.g., Theorem 21.10 in Davidson (1994). This completes the proof of (20). Condition (21) follows directly from Propositions 2 and 3.

Proof of Theorem 3

Calculate the first- and second-order conditions of the function $Q^*(\theta;\beta)$. From Propositions 2 and 3, and Theorem 2, we can write the Hessian matrix as
$$H(\theta;\beta) = \begin{pmatrix} -\frac{1}{2}D^2_{11}U(\theta) & \cdots & -\frac{1}{2}D^2_{1I}U(\theta) & 0 \\ \vdots & \ddots & \vdots & \vdots \\ -\frac{1}{2}D^2_{I1}U(\theta) & \cdots & -\frac{1}{2}D^2_{II}U(\theta) & 0 \\ 0 & \cdots & 0 & -\frac{1}{2}\big(D^2_{00}R(\theta) + \sigma_e^2D^3_{000}R(\theta)\big) \end{pmatrix}. \tag{48}$$

The assumption of convexity of $U(\theta)$ guarantees that the upper block of the Hessian matrix (48) is negative definite and the function $Q^*(\theta;\beta)$ is concave in $(\lambda_1,\lambda_2,\ldots,\lambda_I)$. The lower right element of the Hessian matrix is
$$-\frac{1}{2}\big(D^2_{00}R(\theta) + \sigma_e^2D^3_{000}R(\theta)\big) = \frac{1}{2}\mathrm{tr}\big(C^{-1}C^{-1}\big) - \sigma_e^2\,\mathrm{tr}\big(C^{-1}C^{-1}C^{-1}\big).$$
For this term to be negative, it is necessary and sufficient that $\mathrm{tr}\big(C^{-1}C^{-1}\big) < 2\sigma_e^2\,\mathrm{tr}\big(C^{-1}C^{-1}C^{-1}\big)$. A necessary condition for concavity of $Q^*(\theta;\beta)$ in $\sigma$ is given by $\sigma \le 2\sigma_e^2$. This condition comes from considering the eigenvalues of the matrix $C^{-1}\sigma$, which are less than one (Magnus and Neudecker, 1999, p. 25). Then, $\mathrm{tr}\big(C^{-1}C^{-1}C^{-1}\big)\sigma^3 \le \mathrm{tr}\big(C^{-1}C^{-1}\big)\sigma^2$ and
$$\sigma \le \frac{\mathrm{tr}\big(C^{-1}C^{-1}\big)}{\mathrm{tr}\big(C^{-1}C^{-1}C^{-1}\big)} < 2\sigma_e^2.$$

Proof of Theorem 4  We follow the five conditions for consistency in Dahl and Qin (2004, Theorem 1). Condition i. requires that $\Theta$ and $B$ are compact parameter spaces, which is also required in our assumption ii. Condition ii., $\hat\beta \stackrel{p}{\longrightarrow} \beta^* \in B$, can be verified by a similar argument as in Dahl and Qin (2004). Condition iii. requires that $Q_T(\theta,\beta)$ is a continuous measurable function for all $T$, and it is satisfied trivially. Condition iv. requires that $Q_T(\theta,\beta) \stackrel{p}{\longrightarrow} Q^*(\theta,\beta)$ uniformly in $\Theta\times B$, and it is satisfied by Theorem 2. Condition v. requires the existence of a unique maximizer $\theta^* \in \Theta$ of $Q^*(\theta,\beta^*)$, and it is satisfied by Theorem 3. This completes the proof of consistency of $\hat\theta$.

Proof of Proposition 5  Define $v \equiv (y - X\beta)$ and $B_{i_1} \equiv \frac{1}{\sqrt{T}}C^{-1}H_{i_1}C^{-1}$, and notice

that $v \sim N_T(c, \sigma_e^2 I_T)$. First, we focus on proving equation (25), and for that purpose we use the moment generating function
$$M(s_{i_1}, s_{i_2}) = E\big(\exp(s_{i_1}v'B_{i_1}v + s_{i_2}v'B_{i_2}v)\big),$$
where $s_{i_1}, s_{i_2} \in \mathbb{R}$. Now, let
$$\begin{aligned}
\kappa &\equiv -\frac{1}{2\sigma_e^2}(v-c)'(v-c) + v'(s_{i_1}B_{i_1} + s_{i_2}B_{i_2})v \\
&= -\frac{1}{2\sigma_e^2}v'v + \frac{1}{\sigma_e^2}c'v - \frac{1}{2\sigma_e^2}c'c + v'(s_{i_1}B_{i_1} + s_{i_2}B_{i_2})v \\
&= -\frac{1}{2\sigma_e^2}\big[v'\big(I_T - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big)v - c'v - v'c + c'c\big] \\
&= -\frac{1}{2\sigma_e^2}(v-\tilde c)'\big(I_T - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big)(v-\tilde c) + c'(s_{i_1}B_{i_1} + s_{i_2}B_{i_2})\big(I_T - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big)^{-1}c,
\end{aligned}$$
where $\tilde c = \big(I_T - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big)^{-1}c$. By using the above formulation we can write
$$\begin{aligned}
M(s_{i_1},s_{i_2}) &= \int \frac{1}{(2\pi\sigma_e^2)^{T/2}}\exp(\kappa)\,dv \\
&= \big|I - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big|^{-1/2}\int \frac{\big|I - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big|^{1/2}}{(2\pi\sigma_e^2)^{T/2}}\exp(\kappa)\,dv \\
&= \big|I - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big|^{-1/2}\exp\Big(c'(s_{i_1}B_{i_1}+s_{i_2}B_{i_2})\big(I_T - 2s_{i_1}\sigma_e^2B_{i_1} - 2s_{i_2}\sigma_e^2B_{i_2}\big)^{-1}c\Big).
\end{aligned}$$

Since $\mathrm{cov}(v'B_{i_1}v, v'B_{i_2}v) = E[(v'B_{i_1}v)(v'B_{i_2}v)] - E(v'B_{i_1}v)E(v'B_{i_2}v)$ and $E[(v'B_{i_1}v)(v'B_{i_2}v)] = D^2_{s_{i_1}s_{i_2}}M(s_{i_1},s_{i_2})\big|_{s_{i_1}=0,\,s_{i_2}=0}$, we can write
$$\mathrm{cov}(v'B_{i_1}v, v'B_{i_2}v) = 2\sigma_e^4\,\mathrm{tr}(B_{i_1}B_{i_2}) + 4\sigma_e^2c'B_{i_1}B_{i_2}c = 2\sigma_e^4\frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}\big) + 4\sigma_e^2\frac{1}{T}c'C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}c.$$
Noticing that
$$\mathrm{cov}\big(\sqrt{T}D_{i_1}Q_T(\theta,\beta), \sqrt{T}D_{i_2}Q_T(\theta,\beta)\big) = \mathrm{cov}\Big(\frac{1}{2}v'B_{i_1}v, \frac{1}{2}v'B_{i_2}v\Big) = \frac{\sigma_e^4}{2}\frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}\big) + \sigma_e^2\frac{1}{T}c'C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}c$$
completes the proof of equation (25). To show that equation (26) holds we proceed exactly as above. Let $\tilde x_{\cdot j_1} = \frac{1}{\sqrt{T}}x_{\cdot j_1}$. We write
$$\begin{aligned}
\kappa &= -\frac{1}{2\sigma_e^2}(v-c)'(v-c) + s_{i_1}v'B_{i_1}v + s_{j_1}\tilde x_{\cdot j_1}'v \\
&= -\frac{1}{2\sigma_e^2}\big[v - (I_T - 2\sigma_e^2s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2s_{j_1}\tilde x_{\cdot j_1})\big]'(I_T - 2\sigma_e^2s_{i_1}B_{i_1})\big[v - (I_T - 2\sigma_e^2s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2s_{j_1}\tilde x_{\cdot j_1})\big] \\
&\quad + \frac{1}{2\sigma_e^2}(c + \sigma_e^2s_{j_1}\tilde x_{\cdot j_1})'(I_T - 2\sigma_e^2s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2s_{j_1}\tilde x_{\cdot j_1}) - \frac{1}{2\sigma_e^2}c'c.
\end{aligned}$$

Then, the corresponding moment generating function can be written as
$$M(s_{i_1},s_{j_1}) = E\exp\big(s_{i_1}v'B_{i_1}v + s_{j_1}\tilde x_{\cdot j_1}'v\big) = \big|I - 2s_{i_1}\sigma_e^2B_{i_1}\big|^{-1/2}\exp\Big(\frac{1}{2\sigma_e^2}(c + \sigma_e^2s_{j_1}\tilde x_{\cdot j_1})'(I_T - 2\sigma_e^2s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2s_{j_1}\tilde x_{\cdot j_1}) - \frac{1}{2\sigma_e^2}c'c\Big).$$
Immediately, using the matrix differentials $D\det(B) = \det(B)\,\mathrm{tr}\big(B^{-1}DB\big)$ and $DB^{-1} = -B^{-1}(DB)B^{-1}$, we get
$$E\big[(v'B_{i_1}v)(\tilde x_{\cdot j_1}'v)\big] = D^2_{s_{i_1}s_{j_1}}M(s_{i_1},s_{j_1})\big|_{s_{i_1}=0,\,s_{j_1}=0} = \sigma_e^2\,\mathrm{tr}(B_{i_1})\,\tilde x_{\cdot j_1}'c + \big(\tilde x_{\cdot j_1}'c\big)\big(c'B_{i_1}c\big) + 2\sigma_e^2\tilde x_{\cdot j_1}'B_{i_1}c.$$
Therefore, we have $\mathrm{cov}\big(v'B_{i_1}v, \tilde x_{\cdot j_1}'v\big) = E\big[(v'B_{i_1}v)(\tilde x_{\cdot j_1}'v)\big] - E\big(v'B_{i_1}v\big)E\big(\tilde x_{\cdot j_1}'v\big) = 2\sigma_e^2\tilde x_{\cdot j_1}'B_{i_1}c$. Equation (27) follows trivially. Finally, we note that, in expression (25), $\frac{1}{T}\mathrm{tr}\big(C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}\big)$ is a component in $D_{0i_10i_2}R_T$ and $\frac{1}{T}c'C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}c$ is a component in $D_{i_10i_2}U_T$. Therefore, (25) converges uniformly following Propositions 2 and 3. Uniform convergence of (26) and (27) results from Proposition 4 and Assumption 4, respectively.

Theorem A.1 (adapted from Davidson (1994))  Let $\{g_{t,T}\}_{t=1}^{T}$ be a triangular array of $p$-dimensional random vectors, let $\frac{1}{T}\mathrm{var}\big(\sum_{t=1}^{T}g_{t,T}\big)$ converge to a positive semi-definite matrix $\Sigma$, and let $E\big(\sum_{t=1}^{T}g_{t,T}\big) = 0_p$. If, for all $\alpha\in\mathbb{R}^p$ satisfying $\alpha'\alpha = 1$, $\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\alpha'g_{t,T}$ converges in distribution to a normal random variable as $T\to\infty$, then $\frac{1}{\sqrt{T}}\sum_{t=1}^{T}g_{t,T} \stackrel{d}{\longrightarrow} N(0_p,\Sigma)$.

Proof of Theorem 5  Building on Theorem A.1, let $p = k+I+1$ and notice that $g_T(\hat\theta,\hat\beta) = 0_p$. In Theorem 4, we have proven the consistency of $(\hat\theta,\hat\beta)$ with respect to $(\theta^*,\beta^*)$. To prove the asymptotic normality of $g_T(\theta^*,\beta^*)$, we need to show that $g_T(\hat\theta,\hat\beta)$ is normally distributed and, given the equicontinuity of the function $g_T(\theta,\beta)$ together with the consistency property, the desired result will follow. Note that $E(g_T(\theta^*,\beta^*)) = 0_p$ and, by Proposition 5, $\mathrm{var}(g_T(\theta^*,\beta^*))$ uniformly converges to a positive semi-definite matrix $\Sigma^*$ as $T\to\infty$. The normality of the last $k$ rows of $g_T(\theta,\beta)$, given by $D_\beta m_T(\beta)$, is a direct consequence of the normality of $y$. We need to show the asymptotic normality of the first $I+1$ rows of $g_T(\theta,\beta)$, given by $D_\theta Q_T(\theta;\beta)$. Using Theorem A.1, we define $\frac{1}{T}\sum_{t=1}^{T}g_{t,T} = D_\theta Q_T(\theta,\beta)$. Let $C_0$ denote a non-stochastic term that may depend on $\theta$ and $\beta$. For any $\alpha\in\mathbb{R}^{I+1}$ we have
$$\sqrt{T}\sum_{i=0}^{I}\alpha_iD_iQ_T(\theta,\beta) = \frac{1}{\sqrt{T}}(\psi(X)-X\beta)'C^{-1}\Big(\sum_{i=0}^{I}\alpha_iH_i\Big)C^{-1}e + \frac{1}{2\sqrt{T}}e'C^{-1}\Big(\sum_{i=0}^{I}\alpha_iH_i\Big)C^{-1}e + C_0 = \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\Big(\gamma_t\sigma_e\delta_tz_t + \frac{1}{2}\gamma_t\sigma_e^2z_t^2\Big) + C_0,$$

where $\gamma_t$ for $t = 1,\ldots,T$ are the eigenvalues of $C^{-1}\big(\sum_{i=0}^{I}\alpha_iH_i\big)C^{-1}$, $\nu_t$ are the corresponding eigenvectors such that $\nu_t'\nu_t = 1$, $z_t \equiv \nu_t'e/\sigma_e$ with $z_t \sim N(0,1)$, and $\delta_t \equiv \nu_t'(\psi(X)-X\beta)$. Notice that, by the Cauchy-Schwarz inequality,
$$\frac{1}{\sqrt{T}}|\delta_t| \le \sqrt{\frac{1}{T}(\psi(X)-X\beta)'(\psi(X)-X\beta)} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\big(\psi(x_{t\cdot})-x_{t\cdot}\beta\big)^2}. \tag{49}$$

By Assumption 4, this implies that there exists a constant $C_1 < \infty$ such that
$$\frac{1}{\sqrt{T}}|\delta_t| \le C_1,$$
for any $T$ and $t = 1,\ldots,T$ on $B$. Next we will show that the condition
$$|\gamma_t| \le C_2 < \infty \tag{50}$$
is satisfied for all $t$. Let $\kappa_t(A)$ denote the $t$'th eigenvalue of the $T\times T$ matrix $A$, where $\kappa_1(A) \le \kappa_2(A) \le \ldots \le \kappa_T(A)$. From Lütkepohl (1996, page 66, expression 13.a), we can write
$$\kappa_t\big(C^{-1}\alpha_iH_iC^{-1}\big) = \alpha_i\,\kappa_t\big(C^{-1}H_iC^{-1}\big), \tag{51}$$
and, from Weyl's theorem (Lütkepohl, 1996, page 162), we have
$$\gamma_T \equiv \kappa_T\left(C^{-1}\Big(\sum_{i=0}^{I}\alpha_iH_i\Big)C^{-1}\right) \le \sum_{i=0}^{I}\kappa_T\big(C^{-1}\alpha_iH_iC^{-1}\big) = \sum_{i=0}^{I}\alpha_i\,\kappa_T\big(C^{-1}H_iC^{-1}\big),$$
where the last equality follows from (51). By the triangle inequality we can write
$$|\gamma_T| \le \sum_{i=0}^{I}|\alpha_i|\,\kappa_T\big(C^{-1}H_iC^{-1}\big).$$
Since $\kappa_t\big(C^{-1}H_iC^{-1}\big) \ge 0$ for all $i = 0,1,\ldots,I$ and all $t = 1,2,\ldots,T$, condition (50) will hold if $\kappa_T\big(C^{-1}H_iC^{-1}\big)$ is bounded from above for all values of $i$. Since $C^{-1}H_iC^{-1}$ and $\frac{1}{\lambda_0\lambda_i}I_T - C^{-1}H_iC^{-1}$ are both positive definite (real symmetric) matrices for all $i$ (and $t$), we have that (Lütkepohl, 1996, page 162)
$$\kappa_T\big(C^{-1}H_iC^{-1}\big) \le \frac{1}{\lambda_0\lambda_i} < \infty,$$

1 < ∞, λ0 λi

for all i = 1, 2, ..., I, and we conclude that (50) holds (the last inequality follows from √ P Assumption ii. In order to show asymptotic normality of T Ii=0 αi Di QT (θ, β) we

need to check the moment condition in Liapunov’s Theorem (Theorem 23.11 in Davidson (1994)), i.e. verify that the following condition is satisfied

Define C3 ≡ C3

≤ = = ≤

T

1 √

T

PT

3 T 1 X 1 2 2 lim √ E γt σe δt zt + γt σe zt = 0. T →∞ T T 2 t=1

t=1

(52)

3 E γt σe δt zt + 12 γt σe2 zt2 . We have

3  T 1 1 X 2 2 √ E |γt σe δt zt | + γt σe zt 2 T T t=1 ! 3 T 1 2 2 3 2 2 2 3 1 X 3 2 2 2 √ E |γt σe δt zt | + E γt σe zt + E |γt σe δt zt | γt σe zt + E |γt σe δt zt | γt σe zt 2 2 4 T T t=1  T  1 1 X 3 3 3 3 3 3 6 3 2 4 3 5 √ |γt | σe3 |δt | E |zt | + |γt | σe6 E |zt | + |γt | |δt | σe4 E |zt | + |γt | |δt | σe5 E |zt | 8 2 4 T T t=1 C2 C1 σe3 E |zt |3 +

T T 6 1 X 2 2 C23 σe6 E |zt | 3 2 1 X √ + C2 C1 σe4 E |zt |4 γt δt + |γt | |δt | T t=1 2 T t=1 8 T

T 5 E |zt | 1 X √ |γt | |δt | . T t=1 T

3 2 5 4 C2 σe

Since zt is a Gaussian random variable, E |zt |j is bounded for all j ∈ N. For (52) to be satisfied, it is sufficient to show that the following two conditions hold: T 1X 2 2 γt δt T →∞ T t=1

lim

=

T 1X |γt | |δt | = T →∞ T t=1

lim

45

0

(53)

0.

(54)

To verify (53) notice that we can write the condition as ) ! ( I ) ( I T X X 1 X 2 2 1 αi Hi C −1 c αi Hi C −1 C −1 tr c′ C −1 γ δ = T t=1 t t T i=0 i=0 =

I  1 X 2 αi tr c′ C −1 Hi C −1 C −1 Hi C −1 c + T i=0

I I  1 XX αi αj tr c′ C −1 Hi C −1 C −1 Hj C −1 c T i=0 j>i

I I  1 XX αi αj tr c′ C −1 Hi C −1 C −1 Hj C −1 c . T j=0 i>j

Define ΥijT ≡

 1 tr c′ C −1 Hi C −1 C −1 Hj C −1 c , T

and notice that by Cauchy-Schwartz (Lutkepohl, 1996, page 43) |ΥijT | ≤

p p ΥiiT ΥjjT .

From this inequality, it is easy to verify that limT →∞ |ΥijT | = 0, since lim ΥiiT

T →∞



lim

T →∞

=



=

0,

 1 1 tr c′ C −1 C −1 c T λ2i

1 lim D0 UT λ2i T →∞

for all i = 0, 1, 2, ..., I, and from Proposition 1, we have that limT →∞ D0 UT = 0, and λi for i = 0, 1, 2, ..., I is bounded away from zero by Assumption ii. Furthermore, since PI 2 i=0 αi = 1 (see Theorem A.1), it follows that |αi | < 1 and |αi αj | < 1 for all i, j =

0, 1, ..., I. Consequently, we can write I X I I X I T I 1 X X X X 2 2 2 αi lim |ΥiiT | + |αi αj | lim |ΥijT | |αi αj | lim |ΥijT | + lim γt δt ≤ T →∞ T →∞ T →∞ T →∞ T t=1 i=0 i>j i=0 j>i i=0 =

0,

and condition (53) is verified. To verify condition (54), we use Cauchy’s inequality and obtain

T 1X |γt | |δt | ≤ T t=1

46

r

1 X 2 2 γt δt , T

and the desired result follows. Finally, the asymptotic variance Σ ∗ is obtained once we show that lim Di1 Mx·j1 x·j1 (θ) → 0,

T →∞

(55)

uniformly on Θ × B × A for all j1 = 1, 2, ...., k. To show condition (55), notice that Di Mx·j x·j (θ) = − Then r

 1 tr x′·j C −1 Hi C −1 x·j . T

r   1 1 tr x′·j C −1 Hi C −1 C −1 Hi C −1 x·j tr x′·j x·j lim T →∞ T T →∞ T s r  1 ′ 1 1 lim lim tr x′·j C −1 C −1 x·j x·j x·j 2 T →∞ T λi T →∞ T

lim Di Mx·j x·j (θ) ≤ T →∞



=

lim

0,

for all i = 0, 1, ..., I and j = 1, 2, ..., K, where the last equality follows from Proposition 4 (first term converges to zero) and Assumption 4 (last term converges to a finite constant). This completes the proof. Proof of Theorem 6

First, we need to show the convergence of the matrices $D^2_{\theta\theta}Q_T(\theta,\beta)$, $D^2_{\beta\beta}m_T(\beta)$, and $D^2_{\theta\beta}Q_T(\theta,\beta)$. The convergence of $D^2_{\theta\theta}Q_T(\theta,\beta)$ is established in Theorem 3. The convergence of $D^2_{\beta\beta}m_T(\beta)$ follows trivially from Assumption 4. We need to prove the convergence of $D^2_{\theta\beta}Q_T(\theta,\beta)$. Consider a typical element of $D^2_{\theta\beta}Q_T(\theta,\beta)$ given by $D^2_{i_1j_1}Q_T(\theta,\beta)$ for $i_1 = 0,1,\ldots,I$ and $j_1 = 1,2,\ldots,k$. We can write
$$D^2_{i_1j_1}Q_T(\theta,\beta) = -\frac{1}{T}x_{\cdot j_1}'C^{-1}H_{i_1}C^{-1}c - \frac{1}{T}x_{\cdot j_1}'C^{-1}H_{i_1}C^{-1}e = -D_{i_1}M_{\psi x_{\cdot j_1}} - \sum_{j=1}^{k}\beta_jD_{i_1}M_{x_{\cdot j_1}x_{\cdot j}} - \frac{1}{T}x_{\cdot j_1}'C^{-1}H_{i_1}C^{-1}e,$$
where the first term converges as $T\to\infty$ according to Proposition 4 for all $i_1, j_1$, and the last two terms converge to zero by Proposition 5 and Assumption 3, respectively.

Now, define $\zeta = (\theta',\beta')'$ and let $Q_T(\zeta)$ and $m_T(\beta)$ be given by (8) and (22), respectively. Under Assumptions 1 to 4, the following conditions are satisfied: i. $\hat\zeta_T \stackrel{p}{\longrightarrow} \zeta^*$ (by Theorem 4). ii. $Q_T(\zeta)$ and $m_T(\beta)$ are twice continuously differentiable. iii. $\sqrt{T}g_T(\zeta^*) = \big(\sqrt{T}D_\theta Q_T(\zeta^*)', \sqrt{T}D_\beta m_T(\beta^*)'\big)'$ converges in distribution to a normal random variable $N(0,\Sigma^*)$ (by Theorem 5). iv. $D^2_{\theta\theta}Q_T(\zeta)$, $D^2_{\beta\beta}m_T(\beta)$, and $D^2_{\theta\beta}Q_T(\zeta)$ converge to nonsingular matrices for any $\zeta$ in a neighborhood of $\zeta^*$. Conditions i.-iv. are sufficient conditions to obtain the desired result (Dahl and Qin, 2004; Theorem 9).

Proof of Theorem 7  Derive the first and second moments of $\frac{\hat\varepsilon'\hat\varepsilon}{\sqrt{T}}$:
$$\frac{1}{T}E(\hat\varepsilon'\hat\varepsilon) = \sigma^2\frac{1}{T}c'C^{-1}C^{-1}c + \sigma^2E\Big(\frac{1}{T}e'C^{-1}C^{-1}e\Big) = -\sigma^2D_0U_T - \sigma^2\sigma_e^2D^2_{00}R_T,$$
then
$$E\Big(\frac{\hat\varepsilon'\hat\varepsilon}{\sqrt{T}}\Big) = -\sigma^2\sqrt{T}D_0U_T - \sigma^2\sigma_e^2\sqrt{T}D^2_{00}R_T.$$

Notice that
$$\frac{1}{T}\hat\varepsilon'\hat\varepsilon = -\sigma^2D_0U_T + 2\sigma^2\frac{1}{T}c'C^{-1}C^{-1}e + \sigma^2\frac{1}{T}e'C^{-1}C^{-1}e. \tag{56}$$
Now, let the eigenvalues and eigenvectors of $C^{-1}$ be $\gamma_t$ and $\nu_t$, for $t = 1,\ldots,T$, respectively. Let $z_t = \frac{\nu_t'e}{\sigma_e} \sim N(0,1)$ and $a_t = \nu_t'c$, $t = 1,\ldots,T$. Then (56) can be written as
$$\frac{1}{T}\hat\varepsilon'\hat\varepsilon = -\sigma^2D_0U_T + 2\sigma^2\sigma_e\frac{1}{T}\sum_{t=1}^{T}a_t\gamma_t^2z_t + \sigma^2\sigma_e^2\frac{1}{T}\sum_{t=1}^{T}\gamma_t^2z_t^2. \tag{57}$$

Using that $\mathrm{cov}(z_t,z_t^2) = 0$ and $\mathrm{var}(z_t^2) = 2$, we write
$$\mathrm{var}\Big(\frac{1}{\sqrt{T}}\hat\varepsilon'\hat\varepsilon\Big) = \frac{4\sigma^4\sigma_e^2}{T}\sum_{t=1}^{T}a_t^2\gamma_t^4\,\mathrm{var}(z_t) + \frac{4\sigma^4\sigma_e^3}{T}\sum_{t=1}^{T}a_t\gamma_t^4\,\mathrm{cov}(z_t,z_t^2) + \frac{\sigma^4\sigma_e^4}{T}\sum_{t=1}^{T}\gamma_t^4\,\mathrm{var}(z_t^2) = -\frac{2}{3}\sigma^4\sigma_e^2D^3_{000}U_T - \frac{1}{3}\sigma^4\sigma_e^4D^4_{0000}R_T. \tag{58}$$
Asymptotically,
$$\lim_{T\to\infty}\mathrm{var}\Big(\frac{1}{\sqrt{T}}\hat\varepsilon'\hat\varepsilon\Big) \to -\frac{1}{3}\sigma^4\sigma_e^4D^4_{0000}R,$$
by Propositions 1, 2, and 3. The proof of asymptotic normality of $\frac{\hat\varepsilon'\hat\varepsilon}{\sqrt{T}}$ follows in a similar fashion to that of Theorem 5. The asymptotic normality of the last term of (56), which is a multiple of $D_0Q_T(\theta,\beta)$, follows immediately from Theorem 5. The second term of (56) is already a normal random variable by Assumption 3.
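The first-moment calculation above is easy to verify numerically: with $c$ and $C$ fixed and $e \sim N(0,\sigma_e^2 I_T)$, the sample mean of $\hat\varepsilon'\hat\varepsilon/T$ should approach $-\sigma^2D_0U_T - \sigma^2\sigma_e^2D^2_{00}R_T$, writing $D_0U_T = -c'C^{-2}c/T$ and $D^2_{00}R_T = -\mathrm{tr}(C^{-2})/T$ as in the appendix. A toy Monte Carlo check with an arbitrary $C$ (all names are ours, not the authors' software):

```python
import numpy as np

rng = np.random.default_rng(2)
T, sigma, sigma_e = 40, 0.7, 1.2
A = rng.standard_normal((T, T))
C = A @ A.T / T + sigma * np.eye(T)    # C = H + sigma*I with H PSD
Ci = np.linalg.inv(C)
c = rng.standard_normal(T)             # fixed "bias" vector psi(X) - X beta

d0_U = -c @ Ci @ Ci @ c / T            # D_0 U_T = -c' C^-2 c / T
d2_R = -np.trace(Ci @ Ci) / T          # D^2_00 R_T = -tr(C^-2) / T
theory = -sigma**2 * d0_U - sigma**2 * sigma_e**2 * d2_R

reps = 4000
E = sigma_e * rng.standard_normal((reps, T))
eps = sigma * (c + E) @ Ci             # estimation errors, one row per replication
vals = (eps**2).sum(axis=1) / T

assert abs(vals.mean() - theory) / theory < 0.05
```

The same simulation design can be reused to eyeball the variance formula (58) by comparing `vals.var()` against the corresponding trace expressions.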

Proof of Corollary 4  We standardize the asymptotically normal random variable on the left hand side of (30) by the actual standard deviation of $\frac{1}{\sqrt{T}}\hat\varepsilon'\hat\varepsilon$ in (58) and get
$$\frac{\frac{1}{\sqrt{T}}\big(\hat\varepsilon'\hat\varepsilon + \sigma^2TD_0U_T + \sigma^2\sigma_e^2TD^2_{00}R_T\big)}{\sqrt{-\frac{2}{3}\sigma^4\sigma_e^2D^3_{000}U_T - \frac{1}{3}\sigma^4\sigma_e^4D^4_{0000}R_T}} \stackrel{a}{\sim} N(0,1).$$
We replace the unknown parameters in the above expression by their consistent estimates. Then we multiply the numerator and the denominator of the left hand side of the above expression by $\frac{1}{\sqrt{T}}$. After removing the term $\sigma^2D_0U_T$, which converges to 0, we get the test statistic $ResT_{DGQ}$.
