Simulation Study on Covariance Component Estimation for Two

1 downloads 0 Views 924KB Size Report
(in our case A-' = A = I), and aj and ak are the current solutions for sire effects evaluated with the variance .... .os2. Bias. -.011. -.122. -.m3. -.Do1 t = Ibias/sEMI .87. 19.24. 154 .09. MSEs x lo-* .148 ..... USDA Animal Health and Disease Program.
Simulation Study on Covariance Component Estimation for Two Binary Traits in an Underlying Continuous Scale E. A. M~NlYsAARl,'s* R. L. QUAASP and Y. T. GRC)HN1 Cornel1 University lthaca, NY 14853

highest incidence level, 25%. For 5% incidence, the uncorrected estimate of residual correlation was 50% less than the true value, and after correction for incidence, the parameter was overestimated by 90%. The estimates of residual correlation from the threshold model were regarded fair, except at the lowest level of incidence, where the estimate was 27% higher than the true value. Results indicated that when an accurate estimate of residual correlation is needed, the marginal maximum likelihood estimates are superior to the estimates calculated with the linear model. Using correction for the incidence level for residual correlation did not work well except at the highest incidence level. (Key words: binary traits, variance components, Bayesian methods)

ABSTRACT

The usefulness of the variance and covariance component estimation methods based on a threshold model was studied in a multiple-trait situation with two binary traits. Estimation equations that yield marginal maximum likelihood estimates of variance components on the underlying continuous variable scale and point estimates of location parameters with empirical Bayesian properties are described. Methods were tested on simulated data sets that were generated to exhibit three different incidences, 25, 15, and 5%. Results were compared with analyses of the same data sets with a REML method based on normal distribution and a linear model. Heritabilities and residual correlations calculated from discrete observations were transformed to underlying parameters. In estimation of heritabilities, all methods performed equally well at all incidence levels and with no detectable bias. As suggested by threshold theory, the genetic correlation was accurately estimated directly from the observations without any need of correction for incidence. Marginal maximurn likelihood estimates of genetic correlations were similar to linear model estimates; discrepancies from the true parameters were consistent with both methods. In estimation of residual correlations, the method with the linear model approach yielded satisfactory estimates only at the

Abbreviation key: EM = expecWon-maXimization, h2 = heritability, MME = mixed model equations, MML = marginal maximum likelihood INTRODUCTION

A number of economically important traits for animal production are recorded in a categorical scale. Examples of such traits are calving ease, litter size, survival to fixed ages, susceptibility to diseases and nonretum rates. In the last decade, procedures to analyze such data have been developed by animal breeders in order to estimate breeding values of candidates for selection (8, 9, 11). These methods are based on a threshold concept (1) and are suitable for ordered categorical or binary data. Analogous to methods developed for continuous traits, the Received March 2, 1990. evaluation procedures for categorical traits have Accepted Jane 6, 1990. 'Section of Epidemiology. Department of clinical been extended to work with several traits siSciences,New Y a k State College of Veterinary Medicine. multaneuusly (3, 4, 12, 21). b t address: Agricultaral Research Center, SFIn prediction of breeding values, knowledge 31600 JoLioinen, Finlaod of genetic and phenotypic parameters, Le., heri3Depamerlt of Animal sci-. 1991 J Dairy Sci 74580-591

580

581

VARIANCE COMPONENTS FOR BINARY TRAITS

tabilities and genetic and phenotypic correlations, is essential. When the estimation is based on discrete observations using a threshold model, similar parameters on an underlying “liability” scale are required. Assuming that the assumptions behind the threshold model are valid, these parameters are in some sense the true parameters; they are the values that would have been realized if the trait had been measured on the continuous scale instead of the discrete scale. In a single-trait, single record analysis, the parameter needed is h2 of the underlying “liability” trait. The h2 of the character on the observed scale is a function of the underlying h2 (1); thus, it can be estimated by first estimating the h2 on the observed scale and then transforming it to the h2 of the underlying “liability”. When h2 is estimated directly on the underlying scale (9, ll), the beneficial properties of the threshold model are used in the variance component estimation also. Proposed methods for this (9, 11) are practical generalizations of REML (19) and the expectationmaximization algorithm (EM) (2). They are based on an approximation that the conditional expectations of random effects in EM REML equations can be r e p l a d by their current solutions from the categorical analysis. Thus, in computation, the methods (9, 11) rely on similarities of the estimation equations in the categorical analysis and the mixed model equations in ordinary linear analysis (11). For correlations required in multiple-trait analysis, the same estimation approaches as with a univariate problem can be used. Vinson

et aL (24) derived the relationship between the phenotypic correlation on the underlying scale with the same on the observed discrete scale. If correlation estimates are available on the observed scale, the relationship can be used to calculate the corresponding estimates on the underlying scale. As with the estimation of h2, the transformation works best if the data has no subclasses with different incidence levels. For genetic correlations, the estimates achieved with linear analysis from the observations could be used per se, because the theory (7) and the simulation studies (18) have proposed that the estimates are not much af€ected by the loss of the information due to the discreteness of the data Foulley et al. (5) proposed a more sophisticated estimation of underlying correlations based on Bayesian inference and a threshold model. They derived the marginal posterior distributions of genetic (G) and residual (R) variancecovariance matrices that were then Illilximized with respect to G and R. The estimator of h2 calculated this way is identical to that of Harville and Mee (11). The purpose of this study was to compare the efficiency of estimation methods that are based on the threshold model (5) to the estimation of heritabiities and correlations with a h e a r model. The comparison used simulated data where the true values of the parameters were known. Analyses were limited to models with two binary traits; thus, the parameters estimated were two h2, phenotypic correlation and genetic correlation. The estimates calculated with linear model were transfomed to the underlying scale (7, 24); thus, the effectiveness of such correction factors were also evaluated.

MATERIALS AND METHODS

Slmulatlon Model

The data were simulated according to a linear model:

[

YliJ-]= YZijYm’n

[ 3 [ 13 [ e] [ 3 [ +

+

+

+

“‘ij-1 qijum’n

[I1

where hlj and h2j represents effects of herd j for traits 1 and 2, respectively; alk, a=’, Slm, and Qm’ are effects of age group k (and k‘) and season m (m’); uln and u2, are effects of sire, n; and eliand ezijum’n are the residual environmental effects including the genetic effect of animal’s dam.

J o d of Dairy Science Vol. 74, No. 2, 1991

582

M h l T Y S A A R I ET AI...

The sire effects and residual effects were generated from the normal distribution:

where gll = g z = .05263 and 812 = .02632 were chosen so that the heritabilities (h; and would equal .20,and the genetic correlation r = SO. The values of U l n and uzn

g)

812

were independent across sires, Le., sires were assumed unrelated. The residual variances rll and r z were kept equal to unity;thus, the r12 = .30 equaled the residual correlation. The herd effects hlj and h2j were drawn from the uniform distribution with -.25, p. + -251, where p., the general mean, was either 1.705 (design .05), 1.100 (design .15), or .710 (design .25), depending on the desired incidence. For the age and season effects, four preselected arbitrary values were used within each p level. For the design with the lowest incidence = si = r.35 .18 -.E! -.29], for

were assigned -1. Both the binary and the continuous variables were saved. From here on, the binary variables are called the observed variables and the Continuous variables are called the underlying variables. Simulation Design

Three different values of p. were used to achieve three levels of incidence. At each level the data structure was the same, although the generation of the data was done independently. Number of sires was 100, and cows were assigned to each sire with equal probability. The number of herds generated was 50. For each herd, &he number of cows was randomly drawn from the N(80, 10) distribution and then rounded to nearest integer. Four thousand observations were expected per data set. Further, cows were assigned to age and season classes, the intermediate incidence = si = out of four of each, with every class having 1.34 .ll -.15 -.30], and for the design equal probability. Thus, the designs were unwith the highest incidence a; = { = balanced, especially with respect to distribution C.46 .13 -.19 -.MI, with 1 = 1, 2 were of sires to herds. Each design was replicated 10 times; 30 Used. For each generated animal two binary vari- different data sets were generated. Replications ables, 61i and hi, were formed by truncation. were generated independent of each other with Their values were assigned 1, if values of cor- different values for herds and sires and with responding y.1 were positive, otherwise they different design matrices. Table 1 shows the

4

4

TABLE 1. Frequencies of case.s in different age and season classes weraged over 10 replications in differmt overall incidence levels. Iocidence level

.is

.05

25

Level of factor

Trait 1

Trait2

Trait 1

Trait2

Trait1

Trait 2

Asel

.03

.03

.09

.09 .13 .I9 22

.14 22

.13

.os

Age2

.a

Age3 Ageq Season1 Season2 Season3 season4

.09 .09 .03

overall

.04

.13

.09 .10 .03

.19

.w

.w

.09

.08 .10 .06

.IO .06

Journal of Dairy Science Vol. 74, No. 2, 1991

23 .09 .13 .19 23 .16

.13

.19 23 .16

22

.32

.32

.38

.37

.14 .22

.13 .22

.31

.3 1

.39 26

26

.38

VARIANCE COMPONENTS FOR BINARY TRAITS

proportion of cases (negative values of binary variables) averaged over replications and across different age and season levels in different d e signs. Variance Component Estimation The estimation of variance components was done in three different ways. First, the values of gll, g12, g2. rll. r12, and r z were estimated from the underlying continuous scale variables; these results were used as a control to better take account of the variation in the samples. Second, the estimation was done in the observed discontinuous scale using the linear model as if the observations were normally distributed. The latter method was named “quasi-normal” by Haschele et aL (13). Third, the discontinuous observations were analyzed with a threshold model and marginal maximum likelihood (MML) method (5). Because the MML estimation assumes rll and r a to be equal to unity, the parameters estimated are essentially the h2 and genetic and residual correlations. Variance components from the linear models were used also to calculate h2 and comlations to enable comparisons between methods. Residual correlation and h2 were transformed to the underlying scale using correction factors derived by Dempster and Lemer (1) and Vinson et al. (24). Variance Component Estimation with the Linear M&l. In both the analysis of underlying variables and the analysis of obsemed variables, the statistical model as in muation [l] was used to describe observations, except the values of variance components were assumed unknown. The estimation was done using the REML (19) for a multipletrait modeL Maximization of the likelihood function was done by an EM algorithm (2). Computations were per-

5 83

formed as described in Mhtysaari and Van Vleck (17). Quasi-normal estimates of heritabilities were transformed to underlying parameter values using a formula from Dempster and Lemer (1):

where p is the overall incidence of cases (-1) in a population and Cp is an ordinate of a standard normal density function corresponding to a threshold that divides the probability mass to proportions p and 1 - p. Residual correlations estimated from the discrete categories were transformed to underlying scale by a formula after Vinson et al. (24):

where subscripts of p and $I refer to traits. Variance Component Estimation Using Marginal Maximum Likelihood. In a threshold model analysis, an unknown underlying variable was postulated for the both traits under study. Assumptions for these variables were as described in Equation [l]. According to the threshold concept (l),underlying variables can be thought of as the liabilities to the outcomes. As it was actually done when the data were simulated, the cases (negative values of 61 or are assumed to be observed only if the liability variables are negative. In the general threshold framework, this implies that the threshold has been set to zero and that all parameters related to underlying variables are scaled correspondingly.

Let Y 1 and Y2 be the vectors containing the values of underlying variables for all animals in the data. The linear model can be written in a matrix form:

where bl and are the unknown environmental effects including herd, age, and season affecting Y1 and Y2;u1 and u2 are vectors of unknown sire effects; and XI, X2, Z, are known incidence matrices. Further assume that

Journal of Dairy Science Vol. 74, No. 2, 1991

584

M S A A R I ET AL.

and

w',

a Bayesian approach is taken by assuming it to follow a uniform distribution so For b' = as to reflect complete ignorance about its prior values in the analysis. This corresponds to a "fixed" factor in a classical analysis. The values of observations 61 and 62 are completely determined by values of Yli and Yzi. Define 91i and 9zi to be the values of Yli and Y2i that could be predicted if b and u' = [uluiJ' were known,and, correspondingly, eli = 911-Yli and Q = 92,-Ya. Conditionally on 91i and 9~ the probability of observation of particular combination (61, &) is P(61i,

hi

I 91iV y d = m b {eii
. In their derivation they used a d o r m distribution as the prior f(g, r); thus, the maximization (or mode) of f(g, r I 61, &) produced the marginal maximum likelihood estimates of g and r. With multivariate normality these estimates correspond to REML estimates (10). In the following, only the final computation formulas that were used in this study ate given. Their general forms and their derivation for n binary traits are in Foulley et al. (5). The elements in g were estimated by iterating with

where q is the number of sires, A-' is the inverse of the additive relationship matrix among sires (in our case A-' = A = I), and aj and a k are the current solutions for sire effects evaluated with the variance components from round h. The solutions are found by solving a nonlinear analog to linear mixed model equations (MME): Journal of Dairy Science VoL 74, No. 2. 1991

585

VARIANCE COMPONENTS POR BINARY TRAITS

x;

w11x1

x;

w12x2

g w22x2 Symmetric

x; W l l Z x; w 1 2 z ZWl1Z + Ig"

x; Wl2Z

g!

w22z

ZW12Z + Ig12

zw,z + I

F

1

where the Y's are working variates

I101 and vi are from the first derivatives ofEquation Yr] with respect to b and u. For trait j in animal i

where

Matrix W is a weight matrix, corresponding to R-' in linear multitrait MME. It gives a unique weight to every observation. Let Wi be a 2 x 2 matrix that weights the observation vector of the animal i

sjj = v,&

+ s12 r

where QL is an ordinate of bivariate density. Finally, for traits j and k,

C ~ L= aV(uj, uk I 61.

h

g, r).

which is evaluated from the inverse of the coefficient matrix of Equation [9]. Foulley et al. (5) used notation different fromhere. Instead of having 6, = f l , they used w , that was assigned 0 or 1. Also, all signs in Equations [ll] and [12] are opposite those in Foulley et aL (5). Journal of Dairy Science Vol. 74, No. 2, 1991

5 86

M S A A R I ET AI...

Foulley et al. ( 5 ) estimated the underlying residual correlation by Fisher's scoring algorithm:

where L(r) is the marginal log likelihood function of the data The !&st derivative was computed as

where the summation is over animals and the derivative is evaluated at the g = gh and r = fi. The expectation of the second derivative is also evaluated with previous round variance components.

Formulas [13], [14], and [15] are also given by Tallis (22) for the maximum likelihood estimation of correlation from contingency tables. The iteration on variance components (Equations [8] and [13]) r;eqUires subiterations on location parameters, Le., u and b values (Equation [9]). Computationally the most demanding part is to set up the coefficient matrix in Equation [9] and calculation of new values of variance components will require only minimal work after current sire estimates are solved (by matrix inversion). Thus we chose a two-stage iteration: 1) with g = gh and r = ,iterate on bAh2 and Qh, , h2 = 1, . . . ,mr until the squared change in Q between rounds is smaller than e or the h2 exceeds the maximum allowed number of subiterations. W e iterating, perform summations [14] and [15]. 2)

'

h,+l

rhl

h,+l

solve g and r by Equations 181 and 1131 evaluated using variance components from round hl and location parameters from h2 - 1 within round hl. RESULTS AND DISCUSSION

Differences between estimated and true parameters were displayed by empirical mean squared errors (MSE) MSG2)

2(b2 - h2p = Number of replicates

To compare bias among estimation methods, a simple t test statistic was calculated for each method by incidence designs.

where bias was a difference between the true parameter value and the estimated value averJournal of Dairy Science Vol. 74. No. 2, 1991

aged over replications, and SEM the standard error of the bias. Results from the design targeted to 25% incidence are given in the Table 2. As expected,the estimates calculated from the continuous variable values am close to the true parameters. Standard deviations of estimates are small (.02 to .06) except for the genetic correlation (.14); still, the values of r statistic will remain below 1.5. When parameter estimation is performed from t5e discrete scores, but assuming normality (quasi-normal estimates), the values of h2 and residual correlations are clearly underestimated. The value of the bias is roughly 50% from the true values, and the r statistics are large (from 8.1 to 26.7). The estimate. of the genetic correlation remains unbiased although

5 87

VARIANCE COMPONENTS FQR BINARY TRAITS

TABLE 2. Estimates of heritabilities and genetic and residual comlations from data with two binary variables with 25% mcidence level. The REML estimates were calculated before underlying continuous variables were truncated to two classes. Ten replicates were used with each method. Method

REML Contirmons' Trait 1 Heritability

SD Bias r = Ibias/SEM MSE5 x 1F2 Trait 2 Heritability Bias r = Ibias/SEM MSE x 10-2 Genetic correlation

SD Bias I = Ibias/SEMI MSE x 1W2 Residual correlation

SD

.lo9

.I99

.!I35

.064

.024

-.091 8.11 .935

-.001 .07 .368

209 .OS6

.lo6 .m7

.os 1

.009

.48 .286 518 .137 .018 .41 1.718 .302 .019

O .m

Bias t = Ibias/sEM hfSE x W 2

.34 .034

l ~ e s t r i ~ t cmaximum d like-

QN corrected3

224 .05 1

1.47 .292

SD

-&normal distributed.

QN~

Mh4L4

223 .067 $023 1.10 ,452

.193

220

-.ow

-.om

+.MO

10.99 .958 508 .150 .008 .18 2.027 .175 .015 -.125 25.65 1574

.42 .237 .SO8 .150

121 28 ,527 .144 .027 .59 1.930 .309

.oos .18 2.027 .320 .029 .020 2.17 .115

,052

.M6 .009

1.10

.w

estimates from the d e r l y i n g trne variables.

estimates are. calculated with REML from the binary variables as though they would be normally

3~wsi-nonna1estimates corrected for incidence rate. 41vla@al maximum likelihood from the threshold model. b e a n sq~arederror.

its MSE becomes large due to the high variability among replicates. This, however, includes the large sampling variance in genetic correlation from the data generation. At this incidence level, the correction factors for h2 (1) work well. When the q u a s i - n o d estimates were corrected to the level of underlying h2, the bias disappeared and the values of t statistics a p proached levels corresponding to continuous variable estimates. For the residual correlation, the correction (7, 24) somewhat overcompensated, but the resulting estimate did not significantly differ from the true value. The heritabilities calculated via MML followed the true values closely. Also, the estimation of correlations seemed to give results that were nearly unbiased.

At the 15% incidence level the downward bias in all the parameter estimates from the quasi-normal method becomes larger (Table 3). Even after correction to the underlying level, the h2 will remain below the true values of .2, but there is no clear difference in the downward bias expressed in estimates from continuous variable. The estimate of genetic correlation is 10 percentage units lower than the true value, but due to the large SD the value of t statistic stays low, indicating, perhaps, random variation. After correction for incidence, the estimate of residual correlation shows upward bias from the true value. Bias is less than 6 percentage units, but due to small SD it gives a large value o f t statistic, which shows that systematic error exists. Heritability estimates with the Journal of Dairy Science Vol. 74, No. 2, 1991

5 88

IMANTYSAARI

m AL..

TABLE 3. IMmates of kitabilities and genetic and residual correlations from data with two binary variables with 15% incidence level. The REML estimates were tal- before underlying continuoas variables were troncated to two classes. Ten replicates were used with each method

~

Trait 1 H&tability

SD Bias t = Ibias/sEMI MSEs x lo-* Trait 2 Heritability

SD Bias t = Ibias/sEMI MSE x lff2 Genetic cornlation

SD Bias t = Ibias/sEMI MSE x lCr2 Residual correlation

SD Bias t = lbiadsEMl MSE x 1V2

.189 .039 -.011 .87 .148

.078 .020 -.122 19.24 1530

.183

.078

.036

.m

.017 1.46 .145 .497

-.122 13.72 1362 .397 230 -.lo3 1.42 5.837 .I55 .015 -.145 30.38 2.120

.ow

-.m .12 .635 .301

.010 .Do1

.3 1 .010

~

.I77

.I99

.046

.os2 -.Do1 .09 239

-.m3 154 .244

.178 .063 -.022 1.09 .401 .397 .230 -.103 1.42 5.837 .355

.197

.071 -.003 .16 .457 .382 134 -.118 1.60

6.324 .314

.w

.OM

.os5 4.40

.014 1.29 .125

.441

l ~ e ~ t r i c t maxinmm ai UZWXRXI estimm from the llnderlying tme variables. %)oasi-mrmaI estimates are calculated with REML h o r n thc binary variables as though they would be normaliy distributed. 3Qnasi-rKumal estimates corrected for incidence rate. 4~arginalmaximum l i k f r~~ mthe threshold m~del. 'Mean squared m.

MML method were as close to the true values as the corrected quasi-normal (QN) estimates. Genetic correlation was biassed downward as it was with QN corrected. Most likely, both methods performed equally, and the bias was generated by the truncation itself. The MML estimate of the residual correlation was slightly higher than the true value but not meaningfully so.

In the lowest level of incidence, the MML method became unstable. The problem was caused by a s m a l l number of subclasses without an observed case. This lead to numerical problems in approximation of probability density integrals. The phenomenon has been observed in many studies with the threshold model, and several ways to correct it have been suggested Joamal of Dairy Science VoL 74, No. 2, 1991

(11, 16). In our analyses, two approaches were tried; restricting the parameter space of herd effects and limiting the numerical values of to be strictly O