Simulating Univariate and Multivariate Nonnormal ...

3 downloads 0 Views 396KB Size Report
Apr 5, 2015 - which privacy rules restrict individually identifiable informa- tion by law, as in the Family Educational Rights and Privacy. Act (FERPA, 2000) or ...
Multivariate Behavioral Research, 0:1–17, 2015 C Taylor & Francis Group, LLC Copyright  ISSN: 0027-3171 print / 1532-7906 online DOI: 10.1080/00273171.2014.963194

Simulating Univariate and Multivariate Nonnormal Distributions through the Method of Percentiles Jennifer Koran, Todd C. Headrick, and Tzu Chun Kuo

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

Section on Statistics and Measurement, Southern Illinois University Carbondale

This article derives a standard normal-based power method polynomial transformation for Monte Carlo simulation studies, approximating distributions, and fitting distributions to data based on the method of percentiles. The proposed method is used primarily when (1) conventional (or L) moment-based estimators such as skew (or L-skew) and kurtosis (or L -kurtosis) are unknown or (2) data are unavailable but percentiles are known (e.g., standardized test score reports). The proposed transformation also has the advantage that solutions to polynomial coefficients are available in simple closed form and thus obviates numerical equation solving. A procedure is also described for simulating power method distributions with specified medians, inter-decile ranges, left-right tail-weight ratios (skew function), tail-weight factors (kurtosis function), and Spearman correlations. The Monte Carlo results presented in this study indicate that the estimators based on the method of percentiles are substantially superior to their corresponding conventional product-moment estimators in terms of relative bias. It is also shown that the percentile power method can be modified for generating nonnormal distributions with specified Pearson correlations. An illustration shows the applicability of the percentile power method technique to publicly available statistics from the Idaho state educational assessment.

Distributions of data in the social and behavioral sciences often deviate substantially from the normal distribution (e.g., reaction time) widely assumed in common univariate and multivariate techniques, such as those related to the general linear model and structural equation modeling (Bradley, 1982; Micceri, 1989). Simulation studies of nonnormal data have documented the effects of skew and kurtosis on simple statistical procedures, such as t-tests and F-tests, to provide researchers with guidance regarding when a lack of robustness will matter in the practical investigation of interest (e.g., Glass, Peckham, & Sanders, 1972; Pearson & Please, 1975; Sawilowsky & Blair, 1992). However, as statistical models become more complex (e.g., structural equation models), the impact of nonnormal distributions on the effect of interest depends on more factors (Bradley, 1978), and investigations of the effects of skew and kurtosis are more difficult to generalize. Researchers who want to know the extent to which nonnormality will influence the results of a particular study are Correspondence concerning this article should be addressed to Jennifer Koran, Section on Statistics and Measurement, Department EPSE, 222-F Wham Building, Southern Illinois University Carbondale, Carbondale, IL 62901-4618. E-mail: [email protected]

well advised to investigate this issue empirically. These investigations have historically taken two different forms. In the first more traditional form, statisticians have investigated the effects of presumably typical forms of nonnormality on popular statistical approaches like analysis of variance (e.g., Blair, 1981; Pearson & Please, 1975) and confirmatory factor analysis (e.g., Hu & Bentler, 1998). These simulation studies have provided general guidance that has allowed researchers to make informed choices among different analysis options. In the newer form, now available given researchers’ easy access to powerful computers and statistical software for conducting simulations, researchers may conduct their own simulations to investigate the performance of potential statistical tests prior to analyzing their own data (Muth´en & Muth´en, 2002; Steiger, 2006). The models investigated are those that will be tested in the study and the degree of nonnormality and structure of the simulated data are based on that observed in the sample to be analyzed. This newer form provides specific guidance to the researchers that will guide them in making decisions about their planned analyses. With recent advancements in simulation techniques, straightforward options are available for conducting investigations of the effects of nonnormality in the precise context of interest.

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

2

KORAN, HEADRICK, KUO

The power method (PM) polynomial transformation (Fleishman, 1978; Headrick, 2010; Headrick & Sawilowsky, 1999; Vale & Maurelli, 1983) is a mathematically elegant and popular approach for simulating nonnormal data in studies of multivariate statistical techniques. The essence of the PM approach is to generate normally distributed random variables and then use a simple polynomial transformation on the normally distributed data to produce the desired nonnormal distribution. While alternative multivariate approaches using copulas (Mair, Satorra, & Bentler, 2012) or iterative techniques (Ferrari & Barbiero, 2012; Ruscio & Kaczetow, 2008) have been proposed, the multivariate PM remains one of the simplest and most computationally efficient options for simulating nonnormal multivariate data. Application of the PM transformation requires that the user know the characteristics of the nonnormal distribution to be simulated. This is of little consequence when individual data are available and may be used to calculate these characteristics, such as skew and kurtosis. One problem that arises in the application of the traditional PM approach is when individual observations from which to calculate the particular distributional characteristics required for the PM are not available. This may be the case in secondary analysis of data collected in education or healthcare settings for which privacy rules restrict individually identifiable information by law, as in the Family Educational Rights and Privacy Act (FERPA, 2000) or the Health Insurance Portability and Accountability Act (HIPAA, 1996). In these situations the researcher may be restricted to the use of descriptive distributional statistics commonly released to the public, such as means, standard deviations, percentiles, and correlations. Typically, PM approaches rely on knowledge of the skewness and kurtosis of the distribution (see for example, Headrick, 2010, pp. 131–132), which may not be included in public reports. Proportions and percentiles may be more commonly included in public reports. Ferrari and Barbiero (2012) introduced a technique for simulating ordinal variables in which they defined the distributional characteristics using relative frequencies for each category of each variable. An option for continuous distributions is to use a parametric approach based on percentiles rather than the method of moments. Specifically, Karian and Dudewitz (1999) introduced a percentile method for the Generalized Lambda Distribution (GLD). Two problems remain with these alternative approaches, however. First, closedform solutions are not available, thus requiring computationally intensive numerical solutions (Ferrari & Barbiero, 2012; Karian & Dudewitz, 1999). Second, with the GLD percentiles technique “depending on the precise values of [a percentile-based skew function] and [a percentile-based kurtosis function], as many as four solutions may exist in just one region” (Karian & Dudewitz, 2011, p. 180). Likewise, with the Ferrari and Barbiero (2012) technique there is no assurance that the iterative process will converge to the correct solution.

Thus, it would be advantageous to have a method for simulating multivariate nonnormal data based on percentiles. Further, such a method would be most useful if it included unique, closed-form solutions. Noting these advantages, the goal of this presentation is to introduce a univariate and multivariate percentile-based PM transformation for nonnormal data generation and to compare the performance of this method to the conventional (Fleishman, 1978; Headrick & Sawilowsky, 1999; Vale & Maurelli, 1983) and L-moment-based (Headrick, 2011; Headrick & Pant, 2012b; Serfling & Xiao, 2007) methods. The multivariate percentile-based PM is based on the median, inter-decile range, left-right tail-weight ratio (a skew function), and the tail-weight factor (a kurtosis function) (Karian & Dudewicz, 2011, pp. 172–173), as well as Spearman’s rho. Solutions to PM coefficients are obtained in closed-form, and when solutions exist, those solutions are unique. The structure of this article is as follows. First, a summary of the PM transformation for conventional and L-moments is provided. Second, the method of percentiles is introduced, and the equations are developed for a univariate PM transformation through the method of percentiles. These equations are applied in an example application using publicly available education data from the state of Idaho. A multivariate extension is developed using the Spearman correlation, and Monte Carlo simulation studies investigating the ability of the percentile-based PM to model various nonnormal distributions are used to compare the results of the proposed method to PM results obtained using conventional and L-moments. Finally, the discussion and conclusion section shows how the percentile power method may be modified to accommodate specified Pearson correlations instead of Spearman correlations.

PRELIMINARIES FOR THE POWER METHOD TRANSFORMATIONS General Considerations The PM polynomial transformation based on conventional moments, L-moments, or the proposed method of percentiles considered herein can be generally expressed as (Headrick, 2010, pp. 12–13) p(Z) =

m 

ci Z i−1

(1)

i=1

where p(Z) is a polynomial used to perform the transformation on Z, ci is a constant coefficient defining the nature of the transformation, which will be explained in more detail below, and Z is a standard normal random variable with probability density function (PDF) fZ (z) and cumulative distribution function (CDF)FZ (z)   1 fZ (z) = ϕ(z) = (2π )− 2 exp −z2 /2 , (2)

3

METHOD OF PERCENTILES

z FZ (z) = (z) =

ϕ (u) du, −∞ < z < +∞.

(3)

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

−∞

Setting m = 4 (or m = 6) in Equation (1) gives the thirdorder Fleishman (1978) (or fifth-order Headrick, 2002a) class of PM distributions. The shape of the transformed values p(Z) in Equation (1) is contingent on the values of the coefficients ci , which are determined by conventional moment (or L-moment) matching techniques or through the method of percentiles described in the subsequent sections below. In order for Equation (1) to produce a valid PDF requires that the PM transformation be a strictly monotone increasing function as defined in Headrick (2010, Def. 2.5, p.10). This requirement implies that an inverse function exists (p−1 ). As such, the CDF associated with Equation (1) can be expressed as (4) Fp(z) (p(z)) = (z). Differentiating Equation (4) with respect to z will yield the PM PDF as ϕ(z) (5) fp(z) (p(z)) =  p (z) where p (z) is the first derivative of p(z) and p (z) > 0. Equations (4) and (5) are the general forms of the CDF and PDF for all three power methods discussed herein—conventional moment, L-moment, and percentile. Suppose a T -variate distribution of polynomials of the form in Equation (1) with a specified correlation matrix is desired. Define p Zj and p (Zk ) as in Equation (1) where Zj and Zk have Pearson correlation ρj k and standard normal bivariate density       1 −1 −1 fj k = 2π 1 − ρj2k 2 exp − 2 1 − ρj2k  2  zj + zk2 − 2ρj k zj zk (6) The correlation ρj k between Zj and Zk is not preserved subsequent to the transformation in Equation (1) (Headrick & Pant, 2012a). Thus, an intermediate correlation matrix, which is different from the specified correlation structure among the resulting nonnormal variates, is computed first. The intermediate correlation matrix is purposely constructed such that the specified correlation structure will be achieved after applying the transformation in Equation (1). Remark The Fleishman conventional moment, Headrick Lmoment, and the percentile methods discussed in this article use three different measures of dispersion. The Fleishman conventional moment uses the variance that is equal to 1 for the standard normal distribution. The Headrick L-moment uses one half of the coefficient of the mean difference of the absolute value of all possible draws from two identical distributions, which for the standard normal distribution is

√ 1/ π . The percentile methods use the interdecile range that is equal to 2∗ 1.28 . . . for the standard normal distribution. As such, the differences among the solved coefficients based on the parameters for the three different methods can be viewed as a linear transformation to rescale from one measure of dispersion to another. Thus, the resulting nonnormal distributions all have equivalent parameters (not estimators) of their respective skew and kurtosis functions. Presented in the next two subsections are reviews of the conventional moment and L-moment-based families of PM distributions. We include a review of the conventional moment approach for those readers less familiar with data simulation techniques in general. While the L-moment-based family of PM distributions is not widely familiar, we include a review of this technique because it will serve as a useful point of comparison in the Monte Carlo study later in this article.

The Conventional Moment-Based Fleishman Third-Order Power Method The coefficients ci for Equation (1) that determine the shape of a third-order Fleishman (1978) PM are computed using moment-matching that involves the conventional measures of the mean (α1 ), variance (α2 ), skew (α3 ), and excess kurtosis (α4 ), for which the normal distribution has a value of 0. Specifically, the ci are determined by simultaneously solving the following system of equations (e.g., Headrick, 2010, Eqs. 2.18–2.21, p. 15) α1 = 0 = c1 + c3

(7)

α2 = 1 = c22 + 2c32 + 6c2 c4 + 15c42

(8)

α3 =

(9)

8c33

+

6c22 c3

+ 72c2 c3 c4 +

270c3 c42

α4 = 3c24 + 60c22 c32 + 60c34 + 60c23 c4 + 936c2 c32 c4 +630c22 c42 + 4500c32 c42 + 3780c2 c43 + 10395c44 − 3 (10) for specified values of α3 and α4 and where α1 and α2 are standardized to zero and one, respectively. In general, a standardized non-normal third-order conventional moment-based PM distribution will have a valid PDF iff 0 < c2 < 1, 0 < c4 < 0.258199, and c32 − 3c2 c4 < 0 (Headrick, 2010, pp. 16–21). Figures 1–3 give examples of valid PM PDFs with their corresponding conventional parameters of skew, kurtosis, and coefficients. These three distributions all have zero mean, unit variance, and differ in the extent of their skew and kurtosis. Distribution 1 corresponds to a symmetric heavy-tailed distribution, Distribution 2 to a right-skewed heavy-tailed distribution, and Distribution 3 to a standard normal distribution, distributions that have been frequently used in prior research (e.g., Berkovits, Hancock, & Nevitt, 2000; Curran, West, & Finch, 1996; Headrick & Pant, 2012b).

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

4

KORAN, HEADRICK, KUO

FIGURE 1 The power method (PM) PDF of Distribution 1, a symmetric heavy-tailed distribution. The parameters of skew (α3 ) and kurtosis (α4 ) are based on Equations (9) and (10). The parameters of L-skew (τ3 ) and L-kurtosis (τ4 ) are based on Equations (25) and (26). The parameters of left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) are based on Equations (45) and (46).

Suppose it is desired to simulate a T -variate distribution based on conventional PM polynomials. The PM equation for solving intermediate Pearson correlations (rj k ) for specified Pearson correlations (ρj k ) for variables j and k (Headrick, 2010, p. 30, Eq. 2.59, setting c5 and c6 equal to zero) is   ρj k = E[p(Zj )p(Zk )] = cj 1 (ck1 + ck3 ) + ck3 (ck1 + ck3 ) + rj k (cj 2 ck2 + 3cj 4 ck2 + 3cj 2 ck4 + 9cj 4 ck4 ) + rj2k (2cj 3 ck3 ) + rj3k (6cj 4 ck4 ).

(11)

Note that Equation (11) does not assume c1 = −c3 as in Fleishman (1978), Vale and Maurelli (1983), and Headrick and Sawilowsky (1999). We would point out that the purpose of the intermediate Pearson correlation (rj k ) in Equation (11) is to adjust for the effect of the transformation, such that the transformed T -variate distribution will have the specified Pearson correlation (ρj k ). In the next section, univariate Lmoments are first introduced and then the L-moment-based PM is subsequently discussed. L -moments and Probability Weighted Moments The system of L-moments are introduced by considering Y1 , . . . , Yj , . . . , Yn as independent and identically distributed random variables each with continuous PDF fY (y) and CDF FY (y). The observations on these random variables may be ordered from lowest to highest with the order statistics de-

noted as Y1:n ≤ · · · ≤ Yj :n ≤ · · · ≤ Yn:n , where Yj :n is the jth smallest observation from a sample of size n. In this context n should be considered a sample from the distribution, regardless of whether that distribution is representing a population or a sample from a larger population.1 L-moments are defined in terms of either linear combinations of (1) the expectations of order statistics or (2) probability weighted moments (βi ). For the purposes considered herein, the first four L-moments associated with Yj :n are expressed as (Hosking & Wallis, 1997, pp. 20–22)

λ1 = E [Y1:1 ] = β0

(12)

1 (13) E [Y2:2 − Y1:2 ] = 2β1 − β0 2 1 (14) λ3 = E [Y3:3 − 2Y2:3 + Y1:3 ] = 6β2 − 6β1 + β0 3 1 λ4 = E [Y4:4 − 3Y3:4 + 3Y2:4 − Y1:4 ] = 20β3 − 30β2 4 +12β1 − β0 (15) λ2 =

1The applied researcher may find it easier to think in terms of a subsample of size n drawn from a larger sample of size N.

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

METHOD OF PERCENTILES

5

FIGURE 2 The power method (PM) PDF of Distribution 2, a right-skewed heavy-tailed distribution. The parameters of skew (α3 ) and kurtosis (α4 ) are based on Equations (9) and (10). The parameters of L-skew (τ3 ) and L-kurtosis (τ4 ) are based on Equations (25) and (26). The parameters of left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) are based on Equations (45) and (46).

where the probability weighted moments βi are determined from  βi = y {FY (y)}i fY (y) dy (16) where i = 0, . . . , 3. The coefficients associated with βi in Equations (12)–(15) are obtained from shifted orthogonal Legendre polynomials and are computed as shown in Hosking and Wallis (1997, p. 20) or in Hosking (1990). These coefficients are solutions to the Legendre differential equation that are defined on the interval (0, 1) while the polynomials meet an orthogonality criterion (Weisstein, 2014b). The values of λ1 and λ2 are measures of location and scale and are the arithmetic mean and one half the coefficient of mean difference (or Gini’s index of spread), respectively. One half the coefficient of mean difference (Gini’s index) is one half of “the average of the differences of all the possible pairs of variate-values, taken regardless of sign” (Kendall & Stuart, 1977, p. 47). Higher order L-moments are transformed to dimensionless quantities referred to as L-moment ratios defined as τr = λr /λ2 for r ≥ 3, and where τ3 and τ4 are the analogs to the conventional measures of skew and kurtosis. In general, L-moment ratios are bounded in the interval −1 < τr < 1, as is the index of L-skew (τ3 ) where a symmetric distribution implies that all L-moment ratios with odd subscripts are zero. Other smaller boundaries can be found

for more specific cases. For example, the index of L-kurtosis (τ4 ) has the boundary condition for continuous distributions of (e.g., Jones, 2004) 5τ32 − 1 < τ4 < 1. 4 Consider two random variables Yj and Yk with distribution functions F (Yj ) and F (Yk ), respectively. The second L-moments of Yj and Yk can alternatively be expressed as (Serfling & Xiao, 2007)     (17) λ2 Yj = 2Cov Yj , F (Yj ) λ2 (Yk ) = 2Cov (Yk , F (Yk )) .

(18)

Whereas the second L-moments are analogous to variances with conventional moments, the L-moment statistics analogous to covariances are the second L-comoments. The second L-comoments of Yj toward Yk and Yk toward Yj are     (19) λ2 Yj , Yk = 2Cov Yj , F (Yk )     λ2 Yk , Yj = 2Cov Yk , F (Yj ) . (20) As such, the L-correlations of Yj toward Yk and Yk toward Yj (Serfling & Xiao, 2007) are expressed as ηj k =

λ2 (Yj , Yk ) λ2 (Yj )

(21)

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

6

KORAN, HEADRICK, KUO

FIGURE 3 The power method (PM) PDF of Distribution 3, the standard normal distribution. The parameters of skew (α3 ) and kurtosis (α4 ) are based on Equations (9) and (10). The parameters of L-skew (τ3 ) and L-kurtosis (τ4 ) are based on Equations (25) and (26). The parameters of left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) are based on Equations (45) and (46).

ηkj =

λ2 (Yk , Yj ) . λ2 (Yk )

(22)

The L-correlation in Equation (21) (or 22) is bounded such that −1 ≤ ηj k ≤ 1 where a value of ηj k = 1 (ηj k = −1) indicates a strictly increasing (decreasing) monotone relationship between the two variables. In general, we would also note that ηj k = ηkj . That is, the matrix of second Lcomoments will not typically be symmetric as a traditional covariance matrix is always symmetric.

The L-moment-based Headrick Third-Order Power Method The coefficients ci for Equation (1) that determine the shape of a third-order Headrick (2011) PM are computed using a L-moment-matching technique that involves the measures of λ1 , λ2 , τ3 , and τ4 described in the previous section. Specifically, the ci can be determined by simultaneously solving the following system of equations (Headrick, 2011, Eqs. 2.14–2.17) λ1 = 0 = c1 + c3

√ √ (4c2 + 10c4 )/(2 π) λ2 = 1/ π = = 2 2

(23) (24)

√ 2 3c3 τ3 = π √ 20 2(δ1 c2 + δ2 c4 ) 3 τ4 = − 3 2 π 2

(25) (26)

√ ) is the coefficient of mean where = (4c2 + 10c4 )/(2 π√ difference, λ1 = 0 and λ2 = 1/ π are standardized values associated with the unit normal distribution, and the constants δ1 and δ2 in Equation (26) are (Headrick, 2011, Eqs. A.1, A.2)

√ 3 tan−1 2 3π − √ = 0.360451... (27) δ1 = √ 2 4 2

√ 15 tan−1 2 15π 1 − √ + = 1.151128.... (28) δ2 = √ 2 2 8 2 4 One of the advantages that the L-moment-based system of equations in Equations (23)–(26) has over the conventional moment-based system in Equations (7)–(10) is that the solutions to the coefficients are unique whenever a valid PDF exits. As such, given specified values of L-skew (τ3 ) and L-kurtosis (τ4 ), the solutions for the coefficients (ci ) in Equation (1) can be obtained by simply evaluating the following

METHOD OF PERCENTILES

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

equations (Headrick, 2011, Eq. 2.18) π c1 = −c3 = −τ3 3 √ −16δ2 + 2(3 + 2τ4 )π c2 = 8(5δ1 − 2δ2 ) √ 40δ1 − 2(3 + 2τ4 )π . c4 = 20(5δ1 − 2δ2 )

7

parameters in Equations (34)–(37) are defined to have the restrictions (29) (30) (31)

In general, a standardized nonnormal third-order Lmoment-based PM distribution will have a valid PDF iff 0 < c2 < 1, 0 < c4 < 0.40, and c32 − 3c2 c4 < 0 (Headrick, 2011). Figures 1–3 give examples of valid PM PDFs with their corresponding L-moment-based parameters and coefficients. Suppose it is desired to simulate a T -variate distribution with a specified PM L-correlation matrix and where each distribution has its own specified values of τ3 and τ4 . The equation for solving intermediate Pearson correlations (rj k ) for specified L-correlations (ηj k ) for variable j toward variable k (Headrick & Pant, 2012b, p.430, Eq. 22) is

 1 (32) ηj k = rj k cj 2 + 3cj 4 − cj 4 rj2k . 2 Analogously, the equation for solving intermediate Pearson correlations (rj k ) for specified L-correlations (ηj k ) for variable k toward variable j (Headrick & Pant, 2012b, p.430, Eq. 23) is

 1 2 (33) ηkj = rj k ck2 + 3ck4 − ck4 rj k . 2 In the next section, the percentile PM is introduced.

THE PERCENTILE-BASED POWER METHOD Univariate Percentile-Based Power Method The percentiles (θu ) associated with a conventional or Lmoment-based PM PDF can be obtained by making use of the PM CDF in Equation (4) as shown in Figures 1–3. As such, we will define the following location, scale, and shape parameters as in Karian and Dudewicz (2011, pp. 172–173) γ1 = θ0.50

(34)

γ2 = θ0.90 − θ0.10

(35)

γ3 =

θ0.50 − θ0.10 θ0.90 − θ0.50

(36)

γ4 =

θ0.75 − θ0.25 γ2

(37)

where Equations (34)–(37) are the (1) median, (2) inter-decile range, (3) left-right tail-weight ratio (a skew function) and (4) tail-weight factor (a kurtosis function), respectively. The

− ∞ < γ1 < +∞, γ2 ≥ 0, γ3 ≥ 0, 0 ≤ γ4 ≤ 1

(38)

where a symmetric distribution implies that γ3 = 1. The derivation of a percentile-based system of PM PDFs begins by substituting the standard normal distribution percentiles (zu ) into polynomials of the form in Equation (1) and using Equations (34)–(37) gives γ1 = p(z0.50 )

(39)

γ2 = p(z0.90 ) − p(z0.10 )

(40)

γ3 =

p(z0.50 ) − p(z0.10 ) p(z0.90 ) − p(z0.50 )

(41)

γ4 =

p(z0.75 ) − p(z0.25 ) γ2

(42)

where z0.50 = 0, z0.90 = 1.281 . . . , z0.75 = 0.6744 . . . from the standard normal distribution. Note from symmetry that z0.10 = −z0.90 and z0.25 = −z0.75 . The explicit forms of Equations (39)–(42) are γ1 = c1

(43)

γ2 = 2c2 z0.90 + γ3 = 1 − γ4 =

3 2c4 z0.90

2c3 z0.90 2 c2 + c3 z0.90 + 2c4 z0.90

3 2c2 z0.75 + 2c4 z0.75 3 2c2 z0.90 + 2c4 z0.90

(44) (45) (46)

Note that the derivations of Equations (43)–(46) can be found in the Appendix. Simultaneously solving for the coefficients in Equations (43)–(46) gives the convenient closedform expressions (47) c1 = γ1  3  3 γ2 γ4 z0.90 − z0.75 c2 = 3 (48) 3 2z0.90 z0.75 − 2z0.90 z0.75 c3 =

γ2 (1 − γ3 ) 2 2 (1 + γ3 ) z0.90

c4 = −

γ2 (γ4 z0.90 − z0.75 ) 3 3 2z0.90 z0.75 − 2z0.90 z0.75

(49) (50)

Estimates of the parameters γ1 , γ2 , γ3 , γ4 for a PM distribution based on the percentiles p(zu ) in Equations (39)–(42) for a sample of size n can be determined by finding the j and j + 1 integer values, and their corresponding expected values of the order statistics E[p(Z)j :n ] and E[p(Z)j +1:n ], by making use of the following equation (Headrick & Pant, 2012a; Johnson, Kotz, & Balakrishnan, 1994, p. 93) +∞   n! p(z)ϕ(z) {(z)}j −1 E p(z)j :n = (j − 1)! (n − j )! −∞

{1 − (z)}

n−j

dz

(51)

8

KORAN, HEADRICK, KUO

such that     E p(z)j :n ≤ p(zu ) ≤ E p(z)j +1:n and subsequently solve the equation     p (zu ) = (υ) E p(z)j :n + (1 − υ) E p(z)j +1:n

(52)

(53)

for 0 ≤ υ ≤ 1. Thus, an estimate of p (zu ) can then be obtained based on the order statistics of a sample as p (zu ) ∼ = p(Zu ) = (υ) p(z)j :n + (1 − υ) p(z)j +1:n .

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

Illustration with Publicly Available Descriptive Statistics While the derivation of the percentile PM is straightforward, it is also particularly simple to apply. We pause here to illustrate its utility with an example using publicly available statistics from the Idaho Standards Achievement Test (ISAT). It is not unusual for states to publish percentile ranks or scale score to percentile rank conversion tables from which percentiles may be obtained for their statewide standardized educational assessments. This information is made widely available, while access to individually identifiable data is highly restricted. Such is the case with the state of Idaho and the ISAT, for which scale score to percentile rank conversion tables are publicly available (Stoneberg, 2011). To demonstrate the implementation of the percentile PM, we used the scale score to percentile rank conversion tables for the 2011 ISAT mathematics assessment with third graders (Stoneberg, 2011) to obtain the following five estimated percentile values: q(x)0.10 , q(x)0.25 , q(x)0.50 , q(x)0.75 , and q(x)0.90 . These five percentile values are shown on the left in Figure 4. Using the five percentile values given in Figure 4, the estimates of the four location, scale, and shape parameters are computed as follows by applying Equations (34)–(37): Median: γˆ1 = q (x)0.50 = 205.8 Inter-decile range: γˆ2 = q (x)0.90 − q (x)0.10 = 228.1 −188.5 = 39.6 Left-right tail-weight ratio (skew): γˆ3 =

17.3 q (x)0.50 − q (x)0.10 205.8 − 188.5 = = q (x)0.90 − q (x)0.50 228.1 − 205.8 22.3

= 0.78475 . . . Tail-weight factor: γˆ4 =

q (x)0.75 − q (x)0.25 γ2

19.3 216.3 − 197.0 = = 0.48737 . . . = 39.6 39.6 The tail-weight factor γˆ4 is a measure of the kurtosis of the distribution. Note that γˆ4 ≈ 0.4874 is lower than the value of γˆ4 ≈ 0.5263 for the normal distribution, indicating that

this distribution has heavier tails than a normal distribution. As a point of comparison, a t-distribution with 4 degrees of freedom has γˆ4 ≈ 0.4831. Note that the conventional PM cannot be used to simulate a t-distribution with 4 degrees of freedom as conventional kurtosis for this distribution does not exist. The percentile PM coefficients shown in Figure 4 were obtained by substituting the four computed location, scale, and shape parameters as well as the standard normal constants of z0.90 = 1.281 . . . , z0.75 = 0.6744 · · · into Equations (47)–(50): c1 = γˆ1 = 205.8  3  3 − z0.75 γˆ2 γˆ4 z0.90 c2 = 3 3 2z0.90 z0.75 − 2z0.90 z0.75   39.6 0.4874 ∗ 1.2813 − 0.67443 = = 13.8692 2 ∗ 1.2813 ∗ 0.6744 − 2 ∗ 1.281 ∗ 0.67443 c3 =

γˆ2 (1 − γˆ3 ) 39.6 (1 − 0.7848) = = 1.5222 2 (1 2 + 0.7848) ∗ 1.2812 2 (1 + γˆ3 ) z0.90

c4 = − =−

γˆ2 (γˆ4 z0.90 − z0.75 ) 3 3 2z0.90 z0.75 − 2z0.90 z0.75 39.6 (0.4874 ∗ 1.281 − 0.6744) 2 ∗ 1.2813 ∗ 0.6744 − 2 ∗ 1.281 ∗ 0.67443

= 0.9625 The graph in Figure 4 was produced using the following percentiles PM polynomial defined by substituting these coefficients into Equation (1): p(z) = c1 + c2 z + c3 z2 + c4 z3 = 205.8 + 13.8692 ∗ z +1.5222 ∗ z2 + 0.9625 ∗ z3 Subsequently, p(z), p (z), and ϕ(z) were used in the PM PDF in Equation (5). Note that the shape of this distribution was obtained without publicly available measures of skew and kurtosis or the restricted individual data. To produce nonnormal sample data matching this distribution, first generate standard normal variates Z of specified sample size n. Then substitute each of the n generated standard normal values into p(z) to produce n transformed values. The n transformed values will be approximately distributed with the same nonnormal distribution as depicted in Figure 4. The larger the sample size, the more accurate the approximation. Boundary Conditions for Percentile-Based Power Method PDFs The restriction that p (z) > 0 in Equation (5) implies that a set of solved coefficients may not necessarily produce a valid PDF. To determine if a third-order polynomial produces a valid PDF we first set the quadratic equation p (z) = 0 and

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

METHOD OF PERCENTILES

9

FIGURE 4 The power method (PM) PDF of the distribution of third grade 2011 ISAT mathematics assessment scores. The coefficients c1 , . . . , c4 were obtained using the percentiles PM.

subsequently solve for z as z=

−c3 ±

γ4 = 1 2

− 3c2 c4 ) . 3c4

(c32

(54)

Equation (54) is the familiar quadratic formula, where D = c32 − 3c2 c4 is referred to as the discriminant of the quadratic. A set of solved coefficients will produce a valid PDF if the discriminant of the quadratic D = c32 − 3c2 c4 in Equation (54) is negative. Taking the square root of a negative value produces complex roots containing both real (real number system) and imaginary (imaginary number system) parts. When D = c32 − 3c2 c4 is negative, the complex solutions for z must have non-zero imaginary parts. As such, setting c32 = 3c2 c4 will yield the point where D = 0 and thus realvalued solutions exist to the equation p(z) = 0. Standardizing the inter-decile range in Equation (44) to the unit normal distribution (γ2 = 2z0.90 ) and solving for c4 gives 1 − c2 c4 = 2 . z0.90

(55)

Substituting the right-hand side of Equation (55) into 1 Equations (45) and (46) and setting c3 = ±(3c2 c4 ) 2 yields √   12 2 1 − 3 (c2 − 1) c2 ∓ 2 3z (1 − c2 ) c2 /z0.90 γ3 = (56) 1 + 3c2 (c2 − 1)

3 3 z0.75 c2 z0.75 c2 z0.75 + − . 3 3 z0.90 z0.90 z0.90

(57)

Inspection of Equation (56) indicates that for real values of γ3 to exist, then we must have c2 ∈ [0, 1] and thus from 2 ]. Using Equations (56) and Equation (55) c4 ∈ [0, 1/z0.90 (57), the graph of the region for valid third-order percentile PM PDFs is given in Figure 5, along with the minimum and maximum values of γ3 and γ4 . In summary, a valid standardized nonnormal third-order PDF will have the properties of Equation (i) 0 < c2 < 1, (ii) 0 < c4 < 0.608875, and Equation (iii) c32 − 3c2 c4 < 0.

THE MULTIVARIATE PROCEDURE AND MONTE CARLO STUDY We now extend the univariate percentile PM procedure introduced earlier to simulate nonnormal PM distributions with specified percentiles and Spearman correlations. This multivariate procedure is illustrated and assessed with a Monte Carlo simulation. To simulate a T -variate distribution with a specified Spearman correlation (ξj k ) matrix and where each distribution has its own specified values of γ3 and γ4 , we suggest the following steps:

10

KORAN, HEADRICK, KUO TABLE 1 Specified Spearman Correlation Matrix for the Distributions in Figures 1–3

1 2 3

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

FIGURE 5 Boundary region for valid third-order percentile power method PDFs in the left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) plane. Valid PDFs exist in the region inside the boundary, and a vertical reference line at γ3 = 1 indicates symmetric distributions. The lower and upper bounds of γ3 are 0.072 and 13.928, respectively. The lower and upper bounds of γ4 for symmetric distributions are 0.146 and 0.526, respectively.

1. Specify the percentiles for T polynomials of the form in Equation (1) (i.e., p (Z1 ) , . . . , p (ZT )) and obtain the coefficients (c1 , . . . , c4 ) for each polynomial by evaluating Equations (47)–(50) using the specified values of left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) for each distribution, as illustrated with a single distribution in the ISAT example presented in the last section. Further, specify  a T × T matrix of Spearman correlations for p Zj and p (Zk ), where j < k ∈ {1, 2, . . . , T }. 2. Compute the intermediate Pearson correlations rj k by substituting the specified Spearman correlations ξj k from Step 1 and the specified sample size n into   r 1  n−2 6 jk sin−1 + sin−1 ξj k = π n−1 2 n−1   rj k , (58) for variables j and k (Headrick, 2010, p. 114, Eq. 4.34). Numerically solve Equation (58) for the intermediate Pearson correlations rj k . Repeat this step separately for all T (T − 1) /2 pairwise combinations of correlations. 3. Assemble the intermediate correlations rj k into a T × T matrix and decompose this matrix using a Cholesky factorization. (The Cholesky factorization finds the triangular matrix that when multiplied by its transpose produces the given intermediate correlation matrix.) Note that this step requires the intermediate correlation matrix to be positive definite. 4. Use the results from the Cholesky factorization from Step 3 to generate T standard normal variables (Z1 , . . . , ZT ) correlated at the intermediate levels as follows Zj =

j 

aij Vj

where j = 1, . . . , T and where V1 , . . . , Vj , . . . , VT are independent standard normal random variables and

2

3

1 0.80 0.65

1 0.50

1

where aij represents the element in the ith row and jth column of the matrix associated with the Cholesky factorization performed in Step 3. 5. Substitute Z1 , . . . , ZT from Step 4 into the T PM polynomials in Step 1 to generate the nonnormal distributions with the specified percentiles and Spearman correlations. To demonstrate the steps above, and evaluate the multivariate percentile PM, comparisons among the percentiles, L-moment, and conventional product-moment-based procedures are subsequently described. The distributions in Figures 1–3 are used as a basis for comparison using the specified correlation matrix in Table 1 where a variety of moderate to strong correlations are considered. Table 2 gives the solved intermediate correlation matrices for the conventional moment, L-moment, and percentile–based methods. Solutions for the percentile-based method are given for n = 25 and n = 750. All of the matrices of intermediate correlations in Table 2 are positive definite. Cholesky decompositions on the intermediate correlation matrices were then used to create Z1 , . . . , Z3 with the specified intermediate correlations by making use of the formulae given in Equation (59) of Step 4 with T = 3. The values of Z1 , . . . , Z3 were subsequently substituted into equations of the form of Equation (1) to produce p (Z1 ) , . . . , p (Z3 ) for each of the three procedures.

TABLE 2 Intermediate Correlation Matrixes for PM Procedures Conventional PM procedure

1 2 3

1

2

1 0.897 0.750

1 0.580

1 2 3

1

2

1 0.835 0.689

1 0.536

L-moment PM procedure

3

1

2

3

1

1 0.757 0.599

1 0.466

1

Percentile PM procedure, n = 25

(59)

i=1

1

Percentile PM procedure, n = 750 3

1

2

3

1

1 0.814 0.668

1 0.518

1

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

METHOD OF PERCENTILES

A Fortran algorithm was written for each method to generate 25,000 independent sample estimates for the specified parameters of: (1) conventional skew (α3 ) and kurtosis (α4 ); (2) L-skew (τ3 ) and L-kurtosis (τ4 ); and (3) left-right tailweight ratio (γ3 ) and tail-weight factor (γ4 ). All estimates were based on sample sizes of n = 25 and n = 750. The formulae used for computing estimates of α3 and α4 were based on Fisher’s k-statistics [(for a definition see, for example, Kendall & Stuart, 1977, pp. 297–298, or Weisstein, 2014a, Equations (6) and (7)]. These are the formulae currently used by most commercial software packages such as SAS, SPSS, Minitab, and so on, for computing indices of skew and kurtosis (where α3 = 0 and α4 = 0 for the standard normal distribution). The formulae used for computing estimates of τ3 and τ4 were Headrick’s (2011, pp. 5–6) Equations 2.4 and 2.6. The formulae used for computing estimates of γ3 and γ4 were Equations (36)–(37). The estimate for ρj k was based on the usual formula for the Pearson productmoment of correlation statistic, and the estimate for ηj k was computed based on Equation (32). The estimate for ξj k was based on the usual formula for the Spearman rank order correlation statistic. The estimates for ρj k , ηj k , and ξj k were transformed using Fisher’s z transformation. Bias-corrected accelerated bootstrapped median estimates, confidence intervals (CIs), and standard errors were subsequently obtained for the estimates associated with the parameters (α3 , α4 ,τ3 , τ4 , γ3 , γ4 , zρ j k , zη j k , zξ j k ) using 10,000 resamples via the commercial software package Spotfire S+ (TIBCO Software, 2008). The bootstrap results for the estimates of the medians and CIs associated with zρ j k , zη j k , and zξ j k were transformed back to their original metrics (i.e., estimates for ρj k , ηj k , and ξj k ). An index of relative bias (RB) was computed for the estimate (E) as: RB = ((E − P ) /P ) × 100, except when the parameter (P) is zero. [Note that the small amount of bias associated with any bootstrap CI containing a parameter is considered negligible.] The results of the simulation are reported in Tables 3–6. Note that the bootstrap estimates are the median, which is the measure of central tendency of concern with the proposed percentile PM under study. The percentile PM estimates of left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) are clearly superior over their corresponding conventional moment-based counterparts (α3 , α4 ). For example, Table 3 shows that with samples of size n = 25 the estimates of skew (α3 ) and kurtosis (α4 ) for Distribution 2 had relative bias values of −47.50 and −82.40, respectively, whereas the estimates of left-right tailweight ratio (γ3 ) and tail-weight factor (γ4 ) had substantially lower relative bias values of 1.04 and 2.70, respectively. It can also be seen in Table 3 that relative bias values for the percentile PM and the L-moment PM are comparable to one another. Thus, both the percentile PM and the L-moment PM produce estimates with relatively low relative bias. The left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) are also more efficient estimators as their relative standard errors RSE = (standard error/estimate)×100 are consistently

11

smaller in magnitude than the conventional estimators of skew and kurtosis. For example, with sample size n = 25 the estimates of skew (α3 ) and kurtosis (α4 ) for Distribution 2 had RSE values of 0.52 and 0.73, respectively, whereas the estimates of left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) had lower RSE values of 0.43 and 0.25, respectively. This demonstrates that left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) have more precision because they have less variance around their estimates. The estimates of L-skew (τ3 ) and L-kurtosis (τ4 ) had RSE values of 0.43 and 0.31, respectively. These values were comparable to the RSE values for the percentile PM. Table 4 shows that relative bias values overall are lower when n = 750 than when n = 25. Note that for the percentile PM and the L-moment PM the parameter values for left-right tail-weight ratio (γ3 ) and L-skew (τ3 ) are generally within their corresponding confidence intervals of the estimates. Thus, bias for these conditions is considered negligible. Table 4 shows the same pattern of results when n = 750 as found when n = 25. For example, Table 4 shows that the estimates of skew (α3 ) and kurtosis (α4 ) for Distribution 2 had relative bias values of –10.63 and –32.38, respectively, whereas the estimates of left-right tail-weight ratio (γ3 ) and tail-weight factor (γ4 ) had substantially lower relative bias values of 0.04 (negligible) and 0.14, respectively. The Lmoment PM estimates of L-skew (τ3 ) and L-kurtosis (τ4 ) had relative bias values of –0.01 (negligible) and –0.21, respectively, which were comparable to the results achieved under the percentile PM approach. With regard to the estimated correlation parameters, Table 5 shows that with n = 25 the percentile PM approach outperforms the conventional moment approach. Relative bias values for the Pearson product-moment correlation (ρj k ) under the conventional PM approach ranged from 3.43 to 7.52, whereas the relative bias values for the Spearman rank order correlation (ξj k ) under the percentile PM approach ranged from 1.76 to 3.09. It also appears that the L-moment PM approach outperforms the percentile PM approach when n = 25. Relative bias values for the L-correlation for variable j toward variable k (ηj k ) ranged from 0.08 (negligible) to 1.03. The relative bias values for the L-moment PM approach are notably smaller than the other two approaches for small sample sizes. However, Table 6 shows that all relative bias values were negligible for estimated correlation parameters when n = 750. The relative bias values for the percentile PM and the L-moment PM approaches were similar with large sample sizes.

DISCUSSION AND CONCLUSION Researchers in the social and behavioral sciences frequently encounter data distributions that are substantially nonnormal. Further, these data commonly involve small sample sizes with distributions that are heavy-tailed (Micceri, 1989). In the context of sophisticated multivariate techniques, the impact

12

KORAN, HEADRICK, KUO TABLE 3 Estimated Univariate Parameters for Three Power Method Approaches, n = 25

Dist

1 2 3

1 2

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

3

1 2 3

Parameter

Estimate

95% Bootstrap CI

Standard Error

Relative Bias%

α3 α4 α3 α4 α3 α4

=0 = 25 =3 = 21 =0 =0

−0.0223 4.4560 1.5750 3.6960 0.0034 −0.1786

Conventional PM −0.0497, 0.0045 4.4011, 4.5261 1.5579, 1.5911 3.6452, 3.7525 −0.0038, 0.0103 −0.1906, –0.1678

0.013660 0.030200 0.008122 0.027010 0.003626 0.005579

— −82.18 −47.50 −82.40 — —

τ3 τ4 τ3 τ4 τ3 τ4

= 0.0000 = 0.4225 = 0.3130 = 0.3335 = 0.0000 = 0.1226

−0.0030 0.4007 0.3057 0.3143 0.0011 0.1217

L-moment PM −0.0072, 0.0008 0.3989, 0.4027 0.3030, 0.3082 0.3125, 0.3163 −0.0005, 0.0025 0.1206, 0.1228

0.002023 0.000928 0.001307 0.000980 0.000740 0.000565

— −5.15 −2.35 −5.76 — −0.74

γ3 γ4 γ3 γ4 γ3 γ4

= 1.0000 = 0.3105 = 0.3430 = 0.3868 = 1.0000 = 0.5263

1.0050 0.3208 0.3466 0.3972 0.9978 0.5294

Percentiles PM 0.9942, 1.0154 0.3191, 0.3227 0.3438, 0.3497 0.3954, 0.3993 0.9912, 1.0045 0.5279, 0.5310

0.005348 0.000947 0.001485 0.000983 0.003380 0.000801

0.50 3.33 1.04 2.70 −0.22 0.59

Note. α3 = skew. α4 = kurtosis. τ3 = L-skew. τ4 = L-kurtosis. γ3 = left-right tail-weight ratio. γ4 = tail-weight factor. Dashes indicate that the relative bias was not reported because the parameter is equal to 0.

of nonnormality can depend on a number of factors specific to the particular investigation. Thus, in multivariate behavioral research it is becoming more imperative that researchers be able to evaluate the impact of nonnormality by simulating multivariate nonnormal data. Researchers have several options for simulating nonnormal multivariate continuous distributions. Some other conventional product moment-based examples not previously mentioned are techniques based on the GLD (Headrick & Mugdadi, 2006), Burr Distributions (Headrick & Pant, 2010), and the Tukey g-and-h family of distributions (Kowalchuk & Headrick, 2010). However, the conventional PM technique based on Equation (11) has the advantage over these other options because they all require numerical integration to solve for intermediate correlations. Further, the techniques based on the GLD and Burr distributions also require the additional step of transforming the initial multivariate standard normal deviates to zero-one uniform deviates prior to transformation, which can be computationally expensive. One of the limitations associated with the conventional PM technique is that solved intermediate correlations may not exist in the range of [–1, +1]. To demonstrate, suppose it is desired to simulate Distribution 1 and Distribution 3 in Figures 1 and 3 with a specified Pearson correlation of 0.90. Applying Equation (11) would yield a solved intermediate correlation of 1.039 (Headrick & Pant, 2012b, Table 1), an improper value that obviously cannot be used. Most current software would produce an error message; the researcher would need to specify a lower value of the Pearson correla-

tion, such that the solved intermediate correlation is less than or equal to 1.00, or specify different distributions of the two variables to generate the data sets. However, the percentile PM technique presented in this article, in addition to having the advantage of closed form solutions for the coefficients, does not have the limitation of solutions to intermediate correlations to potentially lie outside the range of [−1, +1]. Specifically, suppose again we desired to simulate Distributions 1 and 3 with a specified Spearman correlation of 0.90 for given sample sizes of n = 25. Applying Equation (58) would yield a solved intermediate correlation of 0.926. In short, the proposed percentile PM technique based on the Spearman correlation obviates the upper-boundary limitation associated with the conventional product-moment PM technique based on the Pearson correlation. The proposed percentile method can also be used to generate PM distributions with specified Pearson correlations. Specifically, this can be accomplished by making use of Equation (11) and the general form of the Pearson productmoment coefficient of correlation as follows:       E p Zj p (Zk ) − α1j (α1k )   (60) ρj k = α2j (α2k ) where α1j , α1k , α2j , and α2k are evaluated using Equations (7) and (8) by substituting the percentile PM coefficients ci into Equation (60), which are computed from Equations (47)–(50) and where E[p(Zj )p(Zk )] from Equation (11) also

METHOD OF PERCENTILES

13

TABLE 4 Estimated Univariate Parameters for Three Power Method Approaches, n = 750 Dist

Parameter

Estimate

95% Bootstrap CI

α3 = 0 α4 = 25 α3 = 3 α4 = 21 α3 = 0 α4 = 0

0.0077 16.4600 2.6810 14.2000 −0.0003 −0.0149

1 2 3

1 2

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

3

1 2 3

Standard Error

Relative Bias%

Conventional PM −0.0114, 0.0265 16.3522, 16.5848 2.6704, 2.6907 14.0915, 14.3173 −0.0017, 0.0013 −0.0176, –0.0124

0.009365 0.058980 0.005327 0.059310 0.000740 0.001306

— −34.16 −10.63 −32.38 — —

τ3 τ4 τ3 τ4 τ3 τ4

= 0.0000 = 0.4225 = 0.3130 = 0.3335 = 0.0000 = 0.1226

0.0002 0.4216 0.3130 0.3328 −0.0001 0.1227

L-moment PM −0.0005, 0.0009 0.4212, 0.4219 0.3126, 0.3136 0.3325, 0.3331 −0.0003, 0.0002 0.1225, 0.1228

0.000383 0.000176 0.000251 0.000163 0.000132 0.000083

— −0.21 −0.01 −0.21 — 0.08

γ3 γ4 γ3 γ4 γ3 γ4

= 1.0000 = 0.3105 = 0.3430 = 0.3868 = 1.0000 = 0.5263

1.0000 0.3108 0.3432 0.3873 1.0000 0.5264

Percentiles PM 0.9978, 1.0020 0.3105, 0.3112 0.3426, 0.3438 0.3869, 0.3877 0.9991, 1.0014 0.5261, 0.5267

0.001062 0.000171 0.000308 0.000203 0.000539 0.000159

0.00 0.11 0.04 0.14 0.00 0.02

Note. α3 = skew. α4 = kurtosis. τ3 = L-skew. τ4 = L-kurtosis. γ3 = left-right tail-weight ratio. γ4 = tail-weight factor. Dashes indicate that the relative bias was not reported because the parameter is equal to 0.

uses the percentile PM coefficients ci . Specify the left-hand side of Equation (60) with a Pearson correlation and then solve for the intermediate correlation rj k . However, this PM approach is subject to the same limitation as the conventional Fleishman-Vale-Maurelli PM approach, which may result in solutions to intermediate correlations that may potentially lie outside the range of [−1, +1]. Percentiles are easy to calculate from existing data and common to obtain when data are restricted or unavailable. One of the primary advantages that the percentile PM has over conventional moment and L-moment-based estimators is that the percentile PM can be applied in situations in which

raw data and higher order (third, fourth) moments are unavailable (e.g., when conventional kurtosis does not exist for a t-distribution with df = 4). The results of this study verify prior research showing L-moment-based estimators to be superior to conventional moment-based estimators. Further, the results demonstrate that the multivariate percentile PM estimators are competitive with the L-moment-based estimators. In multivariate applications the percentile PM showed slightly greater bias in the correlation estimators than the L-moment PM. Thus, when data are available, it is recommended that the L-moment PM be preferred in nonnormal multivariate applications. For those multivariate applications

TABLE 5 Estimated Correlation Parameters for Three Power Method Approaches, n = 25 Parameter

Estimate

95% Bootstrap CI

∗ = 0.80 ρ12 ∗ = 0.65 ρ13 ∗ = 0.50 ρ23

0.8275 0.6959 0.5376

0.8258, 0.8290 0.6943, 0.6976 0.5354, 0.5400

η12 = 0.80 η13 = 0.65 η23 = 0.50

0.8023 0.6505 0.5052

0.8006, 0.8041 0.6486, 0.6536 0.5018, 0.5086

ξ12 = 0.80 ξ13 = 0.65 ξ23 = 0.50

0.8141 0.6658 0.5154

0.8123, 0.8146 0.6646, 0.6685 0.5131, 0.5177

Standard Error

Relative Bias%

0.002612 0.001575 0.001595

3.43 7.52 7.07

0.002398 0.002186 0.002283

0.29 0.08 1.03

0.002007 0.001954 0.001809

1.76 2.43 3.09

Conventional PM

L-moment PM

Percentiles PM

Note. ρj k = Pearson product-moment correlation. ηj k = L-correlation for variable j toward variable k. ξj k = Spearman rank order correlation.

14

KORAN, HEADRICK, KUO TABLE 6 Estimated Correlation Parameters for Three Power Method Approaches, n = 750

Parameter

Estimate

95% Bootstrap CI

∗ = 0.80 ρ12 ∗ = 0.65 ρ13 ∗ = 0.50 ρ23

0.8012 0.6546 0.5022

0.8009, 0.8018 0.6542, 0.6549 0.5018, 0.5026

η12 = 0.80 η13 = 0.65 η23 = 0.50

0.7998 0.6502 0.4998

0.7995, 0.8001 0.6498, 0.6506 0.4998, 0.5004

ξ12 = 0.80 ξ13 = 0.65 ξ23 = 0.50

0.8001 0.6502 0.5000

0.8001, 0.8005 0.6499, 0.6506 0.4995, 0.5005

Standard Error

Relative Bias%

Conventional PM 0.000622 0.000330 0.000266

0.15 0.45 0.71

0.000358 0.000352 0.000359

−0.03 0.04 −0.03

0.000338 0.000303 0.000328

0.02 0.03 0.00

L-moment PM

Percentiles PM

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

Note. ρj k = Pearson product-moment correlation. ηj k = L-correlation for variable j toward variable k. ξj k = Spearman rank order correlation.

in which data are not available but percentiles are available, the percentile PM is plausible with either Spearman or Pearson correlations. Another consideration when choosing among PM approaches is the variability associated with the shape parameters (i.e., skew and kurtosis). The percentile PM and L-moment PM results from this study demonstrated lower values of relative standard error than the conventional PM results. Along the lines of Muth´en and Muth´en (2002) and Steiger (2006), researchers may have a continuous data set that they wish to simulate to investigate the statistical properties of their method of analysis. In cases where bias is negligible, researchers may allow the shape parameters of the simulated data sets to vary so long as those values are within the 95% bootstrap confidence interval, as we have presented in our Monte Carlo study results. An example of this type of evaluation using the bootstrap confidence interval to decide whether the variability in the sample estimate of a shape parameter is acceptable can also be found in Headrick (2010, p. 72). In the present article a third-order percentile PM approach was developed and compared with other third-order PM approaches. It may also be possible to develop a fifth-order percentile PM approach and compare this with the other already existing fifth-order conventional and L-moment PM approaches (Headrick, 2002a, 2010, 2011). For researchers interested in simulating variables that are not continuous and unbounded there are a number of options available. First, consider variables that are continuous but perhaps bounded in an interval of attainable scores (e.g., a psychoeducational instrument that has minimum and maximum potential scores). For the researcher applying the power transformation method, invalid data points generated need not be considered. It is acceptable to truncate the simulated data if points beyond certain bounds are considered meaningless. For variables that are not continuous but are ranked, Headrick, Aman, and Beasley (2008) present an alternative power method approach.

However, for variables that are not ranked, the researcher should consider alternative data simulation approaches. For correlated binary variables, there are procedures based on an underlying multivariate normal distribution (Emrich & Piedmonte, 1991) relevant to considerations with polychoric correlations. For binary variables not assumed to arise from an underlying multivariate normal distribution, there are alternative procedures (Headrick, 2002b; Lunn & Davies, 1998; Oman & Zucker, 2001; Qaqish, 2003). Headrick (2002b) presents a particularly simple procedure based on the uniform random number generator. There are also approaches available for simulating multivariate ordinal variables (Demiritas, 2006). Demiritas and Doganay (2012) and Demiritas (2014) examine power method extensions when both binary and continuous variables are under consideration, and Demirtas and Yavuz (2014) present an approach for ordinal and continuous variables. In summary, the proposed percentile PM is an attractive alternative to the traditional conventional moment-based PM. In particular, the percentile PM has distinct advantages when heavy-tailed distributions and small samples are of concern. A SAS program written for the third-order percentile PM is available (see supplemental data) for implementing the univariate and multivariate procedures.

SUPPLEMENTAL MATERIAL Supplemental data for this article can be accessed on the publisher’s website.

ARTICLE INFORMATION Conflict of Interest Disclosures: Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.

METHOD OF PERCENTILES

Ethical Principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data. Funding: This work was not funded.

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

Role of the Funders/Sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication. Acknowledgments: The ideas and opinions expressed herein are those of the authors alone, and endorsement by the authors’ institution is not intended and should not be inferred.

REFERENCES Berkovits, I., Hancock, G. R., & Nevitt, J. (2000). Bootstrap resampling approaches for repeated measure designs: Relative robustness to sphericity and normality violations. Educational and Psychological Measurement, 60, 877–892. doi:10.1177/00131640021970961 Blair, R. C. (1981). A reaction to “Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance.” Review of Educational Research, 51, 499–507. doi:10.3102/00346543051004499 Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi:10.1111/j.20448317.1978.tb00581.x Bradley, J. V. (1982). The insidious L-shaped distribution. Bulletin of the Psychonomic Society, 20, 85–88. doi:10.3758/BF03330089 Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1, 16–29. doi:10.1037/1082-989X.1.1.16 Demirtas, H. (2006). A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation, 76, 1017–1025. doi:10.1080/10629360600569246 Demirtas, H. (2014). Joint generation of binary and nonnormal continuous data. Journal of Biometrics and Biostatistics, 5, 199. doi:10.4172/21556180.1000199 Demirtas, H., & Doganay, B. (2012). Simultaneous generation of binary and normal data with specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–236. doi:10.1080/10543406.2010.521874 Demirtas, H., & Yavuz, Y. (2014). Concurrent generation of ordinal and normal data. Journal of Biopharmaceutical Statistics. Advance online publication. doi:10.1080/10543406.2014.920868 Emrich, L. J., & Piedmonte, M. R. (1991). A method for generating highdimensional multivariate binary variates. The American Statistician, 45, 302–304. doi:10.1080/00031305.1991.10475828 Family Educational Rights and Privacy Act, 20 U.S.C. A. § 1232g et seq. (2000). Ferrari, P. A., & Barbiero, A. (2012). Simulating ordinal data. Multivariate Behavioral Research, 47, 566–589. doi:10.1080/00273171.2012.692630

15

Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521–532. doi:10.1007/BF02293811 Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42, 237–288. doi:10.3102/00346543042003237 Headrick, T. C. (2002a). Fast fifth-order polynomial transforms for generating univariate and multivariate nonnormal distributions. Computational Statistics & Data Analysis, 40, 685–711. doi:10.1016/S01679473(02)00072-5 Headrick, T. C. (2002b). JMASM3: A method for simulating systems of correlated binary data. Journal of Modern Applied Statistical Methods, 1, 195–201. Retrieved from http://digitalcommons.wayne.edu/ jmasm/vol1/iss1/27 Headrick, T. C. (2010). Statistical simulation: Power method polynomials and other transformations. Chapman & Hall/CRC. Headrick, T. C. (2011). A characterization of power method transformations through L-moments. Journal of Probability and Statistics (Vol. 2011). doi:10.1155/2011/497463 Headrick, T. C., Aman, S. Y., & Beasley, T. M. (2008). A method for simulating correlated structures of continuous and ranked data. Communications in Statistics: Simulation and Computation, 37, 602–616. doi:10.1080/03610910701812394 Headrick, T. C., & Mudadi, A. (2006). On simulating multivariate non-normal distributions from the generalized lambda distribution. Computational Statistics & Data Analysis, 50, 3343–3353. doi:10.1016/j.csda.2005.06.010 Headrick, T. C., & Pant, M. D. (2010). On simulating univariate and multivariate Burr Type III and Type XII distributions. Applied Mathematical Sciences, 4, 2207–2240. Headrick, T. C., & Pant, M. D. (2012a). On the order statistics of standard normal-based power method distributions. ISRN Applied Mathematics (Vol. 2012). doi:10.5402/2012/945627 Headrick, T. C., & Pant, M. D. (2012b). Simulating non-normal distributions with specified L-moments and L-correlations. Statistica Neerlandica, 66, 422–441. doi:10.1111/j.1467-9574.2012.00523.x Headrick, T. C., & Sawilowsky, S. S. (1999). Simulating correlated multivariate nonnormal distributions: Extending the Fleishman power method. Psychometrika, 64, 25–35. doi:10.1007/BF02294317 Health Insurance Portability and Accountability Act, 20 U.S.C. A. § 1320 et seq. (1996). Hosking, J. R. (1990). L-moments: Analysis and estimation of distributions using linear combinations of order statistics. Journal of the Royal Statistical Society. Series B (Methodological), 52, 105–124. Hosking, J. R. M., & Wallis, J. R. (1997). Regional frequency analysis: An approach based on L-moments. Cambridge University Press. Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453. doi:10.1037/1082-989X.3.4.424 Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous Univariate Distributions (Vol. 1, 2nd ed.). New York, NY: Wiley. Jones, M. C. (2004). On some expressions for variance, covariance, skewness and L-moments. Journal of Statistical Planning and Inference, 126, 97–106. doi:10.1016/j.jspi.2003.09.001 Karian, Z. A., & Dudewicz, E. J. (1999). Fitting the generalized lambda distribution to data: A method based on percentiles. Communications in Statistics-Simulation and Computation, 28, 793–819. doi:10.1080/03610919908813579 Karian, Z. A., & Dudewicz, E. J. (2011). Handbook of fitting statistical distributions with R. Boca Raton, FL: CRC Press. Kendall, M., & Stewart, A. (1977). The advanced theory of statistics, Vol. 1: Distribution theory (4th ed.).London: Griffin. Kowalchuk, R. K., & Headrick, T. C. (2010). Simulating multivariate g-andh distributions. British Journal of Mathematical and Statistical Psychology, 63, 63–74. doi:10.1348/000711009x423067

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

16

KORAN, HEADRICK, KUO

Lunn, A. D., & Davies, S. J. (1998). A note on generating correlated binary variables. Biometrika, 85, 487–490. doi:10.1093/biomet/ 85.2.487 Mair, P., Satorra, A., & Bentler, P. M. (2012). Generating nonnormal multivariate data using copulas: Applications to SEM. Multivariate Behavioral Research, 47, 547–565. doi:10.1080/00273171.2012.692629 Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166. doi:10.1037/00332909.105.1.156 Muth´en, L. K., & Muth´en, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 9, 599–620. doi:10.1207/S15328007SEM0904 8 Oman, S. D., & Zucker, D. M. (2001). Modelling and generating correlated binary variables. Biometrika, 88, 287–290. doi:10.1093/biomet/ 88.1.287 Pearson, E. S., & Please, N. W. (1975). Relation between the shape of population distribution and the robustness of four simple test statistics. Biometrika, 62, 223–241. doi:10.1093/biomet/62.2.223 Qaqish, B. F. (2003). A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika, 90, 455–463. doi:10.1093/biomet/90.2.455 Ruscio, J., & Kaczetow, W. (2008). Simulating multivariate nonnormal data using an iterative algorithm. Multivariate Behavioral Research, 43, 355–381. doi:10.1080/00273170802285693

Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic look at the robustness and Type II error properties of the t-test to departures from population normality. Psychological Bulletin, 111, 352–360. doi:10.1037/00332909.111.2.352 Serfling, R., & Xiao, P. (2007). A contribution to multivariate L-moments: L-comoment matrices. Journal of Multivariate Analysis, 98, 1765–1781. doi:10.1016/j.jmva.2007.01.008 Steiger, J. H. (2006, October). Things we could have known: Some thoughts on seeing the future and rediscovering the past in data analysis and model selection. Presidential Address, Society for Multivariate Experimental Psychology, Lawrence, KS. Stoneberg, B. (2011, July). Idaho Standards Achievement Tests Scale Score to Percentile Rank Conversion Tables Spring 2011. Retrieved from http:// www.sde.idaho.gov/site/gifted talented/resources manuals docs/ISAT Percentile Ranks Spring 2011.pdf TIBCO Spotfire S+ 8.1 for Windows. [Computer software]. Palo Alto, CA: TIBCO Software. Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465–471. doi:10.1007/BF02293687 Weisstein, E. W. (2014a). k-statistic. MathWorld—A Wolfram Web Resource. Retrieved from http://mathworld.wolfram.com/k-Statistic.html Weisstein, E. W. (2014b). Legendre polynomial. MathWorld—A Wolfram Web Resource. Retrieved from http://mathworld.wolfram.com/ LegendrePolynomial.html

METHOD OF PERCENTILES

17

APPENDIX Derivation of a Percentile-Based System of PM PDFs, Equations (43) to (46) Note that the standard normal distribution percentile z0.50 = 0. Substituting this value into Equation (39) gives γ1 = p (z0.50 = 0) Applying the polynomial of the form in Equation (1) to the expression in Equation (A1) gives     γ1 = c1 + c2 (0) + c3 02 + c4 03 .

(A1)

(A2)

Simplifying Equation (A2) gives γ1 = c1

(A3)

Downloaded by [Southern Illinois University] at 08:35 05 April 2015

Equation (A3) is the same as Equation (43). Note from symmetry of the standard normal distribution that z0.10 = −z0.90 . Substituting this value into Equation (40) gives γ2 = p (z0.90 ) − p (−z0.90 ) . Expanding the expression in Equation (A4) by applying the polynomial of the form in Equation (1) gives  2   3    2   3  γ2 = c1 + c2 (z0.90 ) + c3 z0.90 + c4 z0.90 − c1 + c2 (−z0.90 ) + c3 −z0.90 + c4 −z0.90 .

(A4)

(A5)

Simplifying Equation (A5) gives 3 γ2 = 2c2 z0.90 + 2c4 z0.90

(A6)

Equation (A6) is the same as Equation (44). Having previously noted that z0.50 = 0 and z0.10 = −z0.90 , substituting these values into Equation (41) gives γ3 =

p(z0.50 = 0) − p(−z0.90 ) p(z0.90 ) − p(z0.50 = 0)

Expanding the expression in Equation (A7) by applying the polynomial of the form in Equation (1) gives       2   3  + c4 −z0.90 c1 + c2 (0) + c3 02 + c4 03 − c1 + c2 (−z0.90 ) + c3 −z0.90  2   3      γ3 = . c1 + c2 (z0.90 ) + c3 z0.90 + c4 z0.90 − [c1 + c2 (0) + c3 02 + c4 03 ]

(A7)

(A8)

Simplifying Equation (A8) gives γ3 = 1 −

2c3 z0.90 2 c2 + c3 z0.90 + 2c4 z0.90

(A9)

Equation (A9) is the same as Equation (45). Note from symmetry of the standard normal distribution that z0.25 = −z0.75 . Substituting this value into Equation (42) gives γ4 =

p(z0.75 ) − p(−z0.75 ) γ2

(A10)

Expanding the expression in Equation (A10) by applying the polynomial of the form in Equation (1) and substituting from Equation (A6) gives  2   3    2   3  + c4 z0.75 − c1 + c2 (−z0.75 ) + c3 −z0.75 + c4 −z0.75 c1 + c2 (z0.75 ) + c3 z0.75 γ4 = . (A11) 3 2c2 z0.90 + 2c4 z0.90 Simplifying Equation (A11) gives γ4 = Equation (A12) is the same as Equation (46).

3 2c2 z0.75 + 2c4 z0.75 3 2c2 z0.90 + 2c4 z0.90

(A12)