Handling Missing Data in Ranked Set Sampling

Effects of Missing Data: Ranked Set Sampling vs Simple Random Sampling
Carlos N. Bouza Herrera
Universidad de La Habana

First issue: MISSING OBSERVATIONS and DATA QUALITY IMPROVEMENT

Missing data is a well-recognized problem which arises in statistical inference and data analysis. Statistical methods are affected by the missingness of data. Usually statisticians deal with these problems without giving too much importance to the loss of theoretical properties.

Take a dataset of $n$ entries with $p$ items. Denoting by $(X_1,\dots,X_p)$ the responses of an individual, $x_{ij}$ is the response of the $i$-th individual to item $j$. Then ideally we have the matrix

$$X=\begin{pmatrix}x_{11} & x_{12} & \cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{np}\end{pmatrix}$$

It will have "holes" due to ignorance of the value of $X$ for some individuals on some items! For example, what is the influence of missingness on the usual fitting of a regression equation?

A basic question: Why are the values missing? The statistician may respond: missing data are due to researcher error. But he/she must consider the missingness mechanism and establish whether:

- the values are missing completely at random;
- the missingness reflects researcher bias;
- a risk to the researcher has been perceived;
- missing observations are worse than missing values.

Ideally statisticians understand why each value is missing. But usually they delete observations or variables, by dropping variables and/or dropping observations.

Samplers have devoted more time than other statisticians to considering the effect of missing data on the effectiveness of simple sampling-based inferences.

- The actual sampling process involves the 'selection' of the missing values, as well as of the units.
- So, to complete the process of inference in a justifiable way, we need to take this into account.

Consider a finite population $U$ of size $N$ from which a simple random sample $s$, of size $n$, is drawn with replacement. Full-response surveys are rare situations. In sample surveys it is common that some units are missing at the first measurement attempt. Let the characteristic under study determine a variable $Y$. For each $i\in U$ we can determine its value $Y_i$. When some units do not provide information, the sample is divided into two subsets:

$$s_r=\{i\in s:\ \text{the response } Y_i \text{ is obtained}\},\qquad s_{rn}=\{i\in s:\ \text{the response } Y_i \text{ is not obtained}\}.$$

An estimate obtained from $s_r$ only may be misleading, and estimation is biased. Missing data are present in survey research because:

- an element of the target population $U$ is not included in the survey's sampling frame (non-coverage);
- a sampled element does not participate in the survey (total non-response);
- a unit in the sample fails to provide acceptable responses (item or unit non-response).

Generally surveyors decide to subsample the non-respondents when the response rate is lower than expected and interviewing all of them is too costly. Another reason is that non-response constitutes an important potential source of bias. Subsampling the non-respondents also allows studying the reasons for avoiding responding. Commonly a representative subsample of the non-respondents (the units generating missing data) is taken and used for making inferences about them.

IMPUTATION

Imputation means substituting missing data with plausible values. Some practitioners consider that it solves the missing-data problem.

Rubin (1976, 1987) and Little and Rubin (1987) classified missing-data mechanisms into three types.

1. Missing completely at random (MCAR). This mechanism is characterized by a distribution such that the probability that a value is missing is independent of the values (observed or missing) in the dataset. Hence the observed values of Y are a random result from the set of observed and unobserved values. That is, any sampled unit from the population is representative, and the subsample interviewed is a representative subsample of the selected sample.

2. Missing at random (MAR). The distribution characterizing it is such that whether Y is missing in a unit may depend on some observed values in the dataset, but is independent of any missing data. Then the subsample interviewed is not a representative subsample of those selected, and an appropriate analysis needs to be used to address the bias.

3. Not missing at random (NMAR). If the missing data cannot be considered as generated by either MCAR or MAR, the probability that Y is missing may depend on the missing data themselves. A NMAR mechanism is present when the missing values are systematically different from the observed values, even after conditioning on the observed values. Any statistical procedure is expected to behave inaccurately if the missing-data mechanism is NMAR.
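These three mechanisms can be illustrated with a small simulation (a hedged sketch; the variables, coefficients, and missingness probabilities below are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)             # fully observed covariate
y = 0.8 * x + rng.normal(size=n)   # study variable

# MCAR: every y has the same probability of being missing
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on the observed covariate x
mar_mask = rng.random(n) < 1.0 / (1.0 + np.exp(-x))

# NMAR: missingness depends on the (possibly unobserved) y itself
nmar_mask = rng.random(n) < 1.0 / (1.0 + np.exp(-y))

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("NMAR", nmar_mask)]:
    print(name, round(float(y[~mask].mean()), 3))  # mean of the respondents only
```

Under MCAR the respondent mean stays close to the true mean (0 here); under MAR and NMAR the respondents are a biased subsample, so analyses that ignore the mechanism are biased.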


The usual model: Single imputation

Single imputation is a general method of replacing missing values with values derived ad hoc. The imputed values have the same distribution as the non-missing data. For each sampled unit with missing data, a substitution model uses some of its available non-missing data to form a predictor. Once each missing value is imputed, the estimator uses the completed dataset. This procedure has the advantage of replacing missing data with values whose distribution is like that of the non-missing ones. For an imputation procedure to be valid, it must take into account that an imputed value is only a guess, not the value that would have been observed. It is typically used in the presence of a MAR mechanism.

Ignoring the non-responses in the estimation of the mean

A MCAR mechanism is assumed and the probability $P$ of obtaining a response at a visit is constant. Based on the $k$ respondents,

$$\bar{y}_s=\frac{1}{k}\sum_{i=1}^{k}y_i,\qquad E(\bar{y}_s)=\mu_y,\qquad V(\bar{y}_s)=\frac{\sigma_y^2}{k}.$$

Under this mechanism we can use the mean-substitution method. Define

$$y_i^{*}=\begin{cases}y_i & \text{if } i \text{ responds}\\ \bar{y}_s & \text{if } i \text{ does not respond}\end{cases},\qquad \bar{y}_m^{*}=\frac{1}{n}\sum_{i=1}^{n}y_i^{*}.$$

As $E(\bar{y}_s)=\mu_y$ we have that $E(\bar{y}_m^{*})=\mu_y$. Moreover, since every missing value is replaced by $\bar{y}_s$, the completed-data mean reduces to $\bar{y}_m^{*}=\bar{y}_s$, so

$$V(\bar{y}_m^{*})=\frac{\sigma_y^2}{k}.$$
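A small numerical check of mean substitution (names illustrative): imputing $\bar{y}_s$ for every non-respondent leaves the completed-data mean equal to $\bar{y}_s$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
y = rng.normal(loc=10.0, scale=2.0, size=n)   # full sample of size n
responds = rng.random(n) < 0.7                # MCAR response indicators
k = int(responds.sum())

y_s = y[responds].mean()                      # respondent mean, based on k units
y_star = np.where(responds, y, y_s)           # mean substitution for the n - k gaps
y_m_star = y_star.mean()                      # estimator on the completed dataset

print(k, y_s, y_m_star)                       # the last two values coincide
```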


Another imputation method

The ratio method suggests the naïve point estimator of the population mean

$$\bar{y}_k=\frac{\bar{y}_s}{\bar{x}_s}\,\bar{x},$$

where

$$\bar{x}_s=\frac{1}{k}\sum_{i=1}^{k}x_i\quad\text{and}\quad \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$

is the sample mean of the auxiliary variable based on the whole sample. But the estimator is not unbiased:

$$B\left(\bar{y}_k\right)\cong\left(\frac{1}{k}-\frac{1}{n}\right)\left(C_x^2-\rho C_yC_x\right)\mu_y,$$

$$MSE\left(\bar{y}_k\right)\cong\frac{\sigma_y^2}{k}+\left(\frac{1}{k}-\frac{1}{n}\right)\left(R^2\sigma_x^2-2R\sigma_{xy}\right),$$

where

$$C_Z=\frac{\sigma_Z}{\mu_Z},\ Z=X,Y;\qquad R=\frac{\mu_y}{\mu_x};\qquad \rho=\frac{\sigma_{xy}}{\sigma_x\sigma_y},$$

and $\sigma_{xy}$ is the covariance between $X$ and $Y$. Note that if $k\cong n$ the bias is close to zero.

Ranked Set Sampling

Ranked set sampling (RSS) was first proposed by McIntyre (1952), who used it for estimating the mean of pasture yields. The design appeared as a useful technique for improving the accuracy of the estimation of means. This was affirmed by McIntyre, but a mathematical proof was settled by Takahasi and Wakimoto (1968). An interesting paper is Yanagawa (2000), where an account of Wakimoto's contributions is made. In many situations the statistician needs to combine some control and/or implement some flexibility in the use of a random-based sample. This is common in environmental and medical studies, for example. In these cases the researcher generally has abundant and accurate information on the population units. It is related to the variable of interest Y, and ranking the units using this information is cheap. The RSS procedure is based on the selection of m independent samples, not necessarily of the same size, by using simple random sampling (SRS) with replacement (SRSWR). The sampled units are ranked, and the selection of the units to be evaluated takes into account their order in the m samples.

When SRSWR is used, the usual estimator of the population mean based on the observations is

$$\mu_s=\frac{1}{n}\sum_{i=1}^{n}Y_i,\qquad V(\mu_s)=\frac{1}{n^2}\sum_{i=1}^{n}V(Y_i)=\frac{\sigma^2}{n}.$$

If we base our inferences on the order statistics (os's),

$$V\left(\mu_{(s)}\right)=\frac{1}{n^2}\sum_{i=1}^{n}V\left(Y_{(i)}\right)=\frac{1}{n^2}\sum_{i=1}^{n}\sigma^2_{(i)}.$$

Takahasi and Wakimoto (1968) provided the mathematical theory of RSS and showed that

$$f(y)=\frac{1}{m}\sum_{j=1}^{m}f_{(j:m)}(y),\qquad \mu_Y=\frac{1}{m}\sum_{j=1}^{m}\mu_{Y(j:m)},$$

and

$$V\left(Y_{(j:m)}\right)=\sigma^2_{Y(j:m)}-\Delta^2_{Y(j:m)},\qquad \Delta^2_{Y(j:m)}=\left(\mu_{Y(j:m)}-\mu_Y\right)^2,\ j=1,\dots,m.$$

The theoretical frame that permits the use of the RSS model is based on the following hypotheses. We wish to enumerate the measurable variable Y.

i. The units can be ordered linearly without ties.
ii. Any sample $s\subset U$ of size $m$ can be enumerated.
iii. To identify a unit, to order the units in $s$, and to enumerate them is less costly than to evaluate $\{Y_i,\ i\in s\}$ or to order $U$.

In survey sampling settings it is logical to rank the units based on the values of an auxiliary variable correlated with the variable of interest. The basic RSS procedure is the following:

Step 1: Randomly select m² units from the target population. These units are randomly allocated into m sets, each of size m.
Step 2: The m units of each set are ranked visually or by any inexpensive method, say using X, with respect to the variable of interest Y. From the first set of m units, the smallest ranked unit is measured; from the second set of m units, the second smallest ranked unit is measured. Continue until the mth smallest unit (the largest) is measured from the last set.
Step 3: Repeat the whole process r times (cycles).
Step 4: Evaluate the corresponding units.
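The four steps can be sketched as follows (a minimal illustration; ranking is done on the values themselves, standing in for the cheap auxiliary ranking, and all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def rss_sample(population, m, r):
    """Draw a ranked set sample of n = r*m measured units.

    Each cycle draws m random sets of m units with replacement, ranks each
    set (here by the value itself, standing in for a cheap auxiliary
    ranking), and measures the j-th smallest unit of the j-th set."""
    measured = []
    for _ in range(r):                          # Step 3: r cycles
        for j in range(m):                      # Steps 1-2: j-th set of the cycle
            group = rng.choice(population, size=m, replace=True)
            measured.append(np.sort(group)[j])  # j-th order statistic
    return np.array(measured)                   # Step 4: the evaluated units

population = rng.exponential(scale=1.0, size=100_000)
sample = rss_sample(population, m=5, r=20)      # n = 100 measured units
print(sample.mean())                            # estimates the population mean (close to 1)
```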


The usual estimator of Z, for a variable Z, is 𝜇𝑍

𝑟𝑠𝑠

=

𝑚 𝑗=1

𝑟 𝑖=1

𝑍(𝑖:𝑖)𝑗

𝑛

, 𝑛 = 𝑟𝑚

Noting that for any j, 𝐸 𝑍

𝑖:𝑖 𝑗

= 𝜇𝑍(𝑖) the unbiasedness of this

estimator is easily derived because

𝐸(𝜇𝑍

𝑟𝑠𝑠

)=

𝑚 𝑗=1

𝑟 𝑖=1

𝑛

𝜇𝑍(𝑖)

=

43

𝑚 𝑗=1

𝜇𝑍(𝑖) 𝑚

= 𝜇𝑍

25

The samples $s(j)$ are independent. Hence, the variance of $\mu_{Z\,rss}$ is

$$V\left(\mu_{Z\,rss}\right)=\frac{1}{n^2}\sum_{j=1}^{m}\sum_{i=1}^{r}\sigma^2_{Z(j)}=\frac{1}{rm^2}\sum_{j=1}^{m}\sigma^2_{Z(j)}=\frac{\sigma_Z^2}{n}-\frac{1}{mn}\sum_{j=1}^{m}\Delta^2_{Z(j:m)},$$

with $\Delta^2_{Z(j:m)}=\left(\mu_{Z(j)}-\mu_Z\right)^2$ and $n=rm$. This follows from writing

$$\sigma_Z^2=\frac{1}{m}\sum_{j=1}^{m}\sigma^2_{Z(j)}+\frac{1}{m}\sum_{j=1}^{m}\left(\mu_{Z(j)}-\mu_Z\right)^2.$$

The net gain in accuracy due to the use of RSS is measured by

$$G=\frac{1}{mn}\sum_{j=1}^{m}\Delta^2_{Y(j:m)}.$$
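As a numerical check of this gain, the following sketch compares the Monte Carlo variance of the RSS sample mean with that of an SRS mean based on the same number of measured units (settings are illustrative: Exp(1) values, m = 5, r = 20):

```python
import numpy as np

rng = np.random.default_rng(4)
m, r, reps = 5, 20, 1000
n = m * r

def rss_mean():
    vals = []
    for _ in range(r):
        for j in range(m):
            group = np.sort(rng.exponential(1.0, size=m))
            vals.append(group[j])              # rank j+1 from the j-th set
    return float(np.mean(vals))

rss_means = np.array([rss_mean() for _ in range(reps)])
srs_means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

print("V_srs:", srs_means.var())   # about sigma^2 / n = 1/100
print("V_rss:", rss_means.var())   # smaller, by the sum of squared rank deviations
```

The difference between the two Monte Carlo variances estimates the gain $G$ above.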

Some results on imputation: SRS vs RSS

A discussion of the key issues on the theme appears in Handling Missing Data in Ranked Set Sampling (2013).

Ratio Imputation Procedures

Kadilar and Cingi (2008) considered the case of missing data in estimating the population mean, suggesting the following estimators of the population mean of the study variable Y:

$$\bar{y}_{KC1}=\frac{\bar{y}_s+b\left(\mu_x-\bar{x}\right)}{\bar{x}}\,\mu_x,\qquad \bar{y}_{KC2}=\frac{\bar{y}_s+b\left(\mu_x-\bar{x}_s\right)}{\bar{x}_s}\,\mu_x,\qquad \bar{y}_{KC3}=\frac{\bar{y}_s+b\left(\bar{x}-\bar{x}_s\right)}{\bar{x}_s}\,\bar{x},$$

where $b=\dfrac{s_{xy}}{s_x^2}$.
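A hedged sketch of these three estimators (the function and data are illustrative; $b$ is computed from the respondents as $s_{xy}/s_x^2$, as on the slide):

```python
import numpy as np

def kc_estimators(y, x, responds, mu_x):
    """The three Kadilar-Cingi ratio-type estimators under non-response.

    y, x : full-sample arrays (y used only where responds is True)
    mu_x : known population mean of the auxiliary variable X."""
    y_s = y[responds].mean()        # respondent mean of Y
    x_s = x[responds].mean()        # respondent mean of X
    x_bar = x.mean()                # full-sample mean of X
    xr, yr = x[responds], y[responds]
    b = np.cov(xr, yr, ddof=1)[0, 1] / xr.var(ddof=1)   # b = s_xy / s_x^2

    y_kc1 = (y_s + b * (mu_x - x_bar)) / x_bar * mu_x
    y_kc2 = (y_s + b * (mu_x - x_s)) / x_s * mu_x
    y_kc3 = (y_s + b * (x_bar - x_s)) / x_s * x_bar
    return y_kc1, y_kc2, y_kc3

rng = np.random.default_rng(5)
x = rng.uniform(1.0, 3.0, size=200)
y = 1.5 * x + rng.normal(0.0, 0.3, size=200)   # true mean of Y is 3
responds = rng.random(200) < 0.7               # MCAR non-response
print(kc_estimators(y, x, responds, mu_x=2.0))
```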


Their biases are

$$B\left(\bar{y}_{KC1}\right)\cong\frac{C_x^2\mu_y}{n},\qquad B\left(\bar{y}_{KC2}\right)\cong\frac{C_x^2\mu_y}{k},\qquad B\left(\bar{y}_{KC3}\right)\cong\left(\frac{1}{k}-\frac{1}{n}\right)\rho C_xC_y\mu_y.$$

The mean square errors of these estimators are

$$MSE\left(\bar{y}_{KC1}\right)\cong\frac{\sigma_y^2}{k}+\frac{\left(R^2-B^2\right)\sigma_x^2}{n},\qquad MSE\left(\bar{y}_{KC2}\right)\cong\frac{1}{k}\left(\sigma_y^2-B\sigma_{xy}+R^2\sigma_x^2\right),$$

$$MSE\left(\bar{y}_{KC3}\right)\cong\frac{1}{k}\left(\sigma_y^2+\left(R-B\right)^2\sigma_x^2-2\left(R-B\right)\sigma_{xy}\right),$$

where $B=\dfrac{\sigma_{xy}}{\sigma_x^2}$ is the population analogue of $b$.

The use of RSS and the structure of the estimator allow deriving that

$$\bar{y}_{KC1(RSS)}=\frac{\bar{y}_{(RSS)}+b\left(\bar{X}-\bar{x}_{(RSS)}\right)}{\bar{x}_{(RSS)}}\,\bar{X}.$$

The mean squared error and the gain in accuracy are

$$M\left(\bar{y}_{KC1(RSS)}\right)=\frac{\sigma_Y^2}{nW_1}+\frac{\left(R^2-B^2\right)\sigma_X^2}{n}-\frac{1}{n^2}\left[\frac{1}{W_1}\sum_{j=1}^{m}D_{Y(j)}^2+\left(R^2-B^2\right)\sum_{j=1}^{m}D_{X(j)}^2\right],$$

$$G\left(KC1,RSS\right)=\frac{1}{n^2}\left[\frac{1}{W_1}\sum_{j=1}^{m}D_{Y(j)}^2+\left(R^2-B^2\right)\sum_{j=1}^{m}D_{X(j)}^2\right],$$

where $W_1$ denotes the response rate (so that $k=nW_1$) and $D_{Z(j)}=\mu_{Z(j)}-\mu_Z$, $Z=X,Y$.

Take the estimator

$$\bar{y}_{KC2(RSS)}=\frac{\bar{y}'_{(RSS)}+b\left(\bar{X}-\bar{x}'_{(RSS)}\right)}{\bar{x}'_{(RSS)}}\,\bar{X}.$$

Its MSE is

$$M\left(\bar{y}_{KC2(RSS)}\right)=\frac{\sigma_Y^2+R^2\sigma_X^2-B\sigma_{XY}}{nW_1}-\frac{1}{nmW_1}\left[\sum_{j=1}^{m}D_{Y(j)}^2+R^2\sum_{j=1}^{m}D_{X(j)}^2-B\sum_{j=1}^{m}D_{X(j)Y(j)}\right],$$

and its gain in accuracy is

$$G\left(KC2,RSS\right)=\frac{1}{nmW_1}\left[\sum_{j=1}^{m}D_{Y(j)}^2+R^2\sum_{j=1}^{m}D_{X(j)}^2-B\sum_{j=1}^{m}D_{X(j)Y(j)}\right],$$

where

$$D_{X(j)Y(j)}=\left(\mu_{X(j)}-\mu_X\right)\left(\mu_{Y(j)}-\mu_Y\right).$$

The proposed RSS counterpart of $\bar{y}_{KC3}$ is

$$\bar{y}_{KC3(RSS)}=\frac{\bar{y}'_{(RSS)}+b\left(\bar{x}_{(RSS)}-\bar{x}'_{(RSS)}\right)}{\bar{x}'_{(RSS)}}\,\bar{x}_{(RSS)}.$$

The expectation of its MSE is given, approximately, by

$$M\left(\bar{y}_{KC3(RSS)}\right)\cong\frac{\sigma_Y^2+\left(R-B\right)^2\sigma_X^2-2\left(R-B\right)\sigma_{XY}}{nW_1}-\frac{1}{nmW_1}\left[\sum_{j=1}^{m}D_{Y(j)}^2+\left(R-B\right)^2\sum_{j=1}^{m}D_{X(j)}^2-2\left(R-B\right)\sum_{j=1}^{m}D_{X(j)Y(j)}\right].$$

The corresponding RSS version of the Singh and Horn (2003) estimator is

$$\bar{y}_{SH(RSS)}=(1-\alpha)\,\bar{y}_{(RSS)}\,\frac{\bar{x}_{(RSS)}}{\bar{x}'_{(RSS)}}+\alpha\,\bar{y}_{(RSS)}.$$

Its MSE is

$$M\left(\bar{y}_{SH(RSS)}\right)=M\left(\bar{y}_r\right)+(1-\alpha)^2\left[\frac{\sigma_Y^2-\frac{1}{m}\sum_{j=1}^{m}D_{Y(j)}^2+R^2\left(\sigma_X^2-\frac{1}{m}\sum_{j=1}^{m}D_{X(j)}^2\right)}{nW_1}-\frac{R^2}{nmW_1}\sum_{j=1}^{m}\left(\frac{D_{X(j)}^2}{\mu_X^2}+\frac{D_{Y(j)}^2}{\mu_Y^2}-2\,\frac{D_{X(j)Y(j)}}{\mu_X\mu_Y}\right)\right].$$

The expected bias of (3.27) is approximated by

$$E\left(B_{SH(RSS)}\right)\cong(1-\alpha_0)\,\mu_Y\left[C_X^2-C_XC_Y-\frac{1}{nW_1}\left(\frac{\sum_{j=1}^{m}D_{X(j)}^2}{m\mu_X^2}+\frac{\sum_{j=1}^{m}D_{Y(j)}^2}{m\mu_Y^2}-\frac{2\sum_{j=1}^{m}D_{X(j)}D_{Y(j)}}{m\mu_X\mu_Y}\right)\right].$$

The RSS counterpart of $\bar{y}_{SD}$ is

$$\bar{y}_{SD(RSS)}=\bar{y}_{(RSS)}\left(\frac{\bar{x}}{\bar{x}_{(RSS)}}\right)^{\alpha}.$$

The approximated expected bias of (4.13) is

$$E\left(B\left(\bar{y}_{SD(RSS)}\right)\right)=\frac{\mu_Y}{nW_1}\left[\frac{\alpha(\alpha-1)}{2}\,C_X^2-\alpha\,C_XC_Y\right]-\frac{1}{nW_1}\left[\frac{\alpha(\alpha-1)}{2\mu_X^2}\sum_{j=1}^{m}D_{X(j)}^2-\frac{\alpha}{\mu_X\mu_Y}\sum_{j=1}^{m}D_{X(j)Y(j)}\right],$$

and the approximated expected MSE is given by

$$E\left(MSE\left(\bar{y}_{SD(RSS)}\right)\right)\cong M\left(\bar{y}_r\right)+\frac{(B-R)^2}{nW_1}\left(\sigma_X^2-\frac{1}{m}\sum_{j=1}^{m}D_{X(j)}^2\right)+R^2\,G\left(KC1,RSS\right).$$

An empirical evaluation

We compared, by using Monte Carlo simulation, $\bar{y}_A$ vs $\bar{y}_{A(RSS)}$, for A = KC1, KC2, KC3, SH, SD.

We generated 1 000 samples using N(0,1), U(0,1) and an Exp(1). Non-responses were generated for each sample by generating a Bernoulli variable

$$U=\begin{cases}1 & \text{if the unit responds}\\ 0 & \text{otherwise}\end{cases}$$

with $P=P(U=1)=0.1$.

We computed

$$\tau_A=\frac{\sum_{t=1}^{1000}\left|\bar{Y}-\bar{y}_{A,t}\right|}{1000},\qquad \tau_{A(RSS)}=\frac{\sum_{t=1}^{1000}\left|\bar{Y}-\bar{y}_{A(RSS),t}\right|}{1000},$$

where $\bar{y}_{A,t}$ denotes the estimate obtained in the $t$-th generated sample. The comparison between each imputation procedure and its RSS counterpart was evaluated by computing

$$\varphi_A=\frac{\tau_A}{\tau_{A(RSS)}}.$$
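A minimal sketch of such a comparison (illustrative assumptions: the plain respondent mean is used in place of the imputation estimators, the response probability is set to 0.7, and Exp(1) data are generated):

```python
import numpy as np

rng = np.random.default_rng(6)
m, r, reps, p = 5, 20, 1000, 0.7
n = m * r

def srs_estimate():
    y = rng.exponential(1.0, size=n)
    responds = rng.random(n) < p
    return y[responds].mean() if responds.any() else y.mean()

def rss_estimate():
    vals = []
    for _ in range(r):
        for j in range(m):
            vals.append(np.sort(rng.exponential(1.0, size=m))[j])
    vals = np.array(vals)
    responds = rng.random(n) < p
    return vals[responds].mean() if responds.any() else vals.mean()

# tau: Monte Carlo mean absolute error around the true mean (1 for Exp(1))
tau_srs = np.mean([abs(1.0 - srs_estimate()) for _ in range(reps)])
tau_rss = np.mean([abs(1.0 - rss_estimate()) for _ in range(reps)])
print("phi =", tau_srs / tau_rss)   # values above 1 favor RSS
```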

The results for a sample size n = 100 (m = 5, r = 20) provide an illustration of the gain in accuracy of the imputation procedures when RSS is the sampling design:

        N(0,1)   U(0,1)   Exp(1)
KC1     1.98     2.03     2.83
KC2     1.83     2.10     2.91
KC3     2.03     1.86     2.85
SH      1.72     1.85     1.97
SD      1.77     1.80     1.94

Hence, as expected, RSS behaves better in all the cases. The best results are obtained in the case of exponential variables; KC2 is particularly good. SH is the best alternative when SRS is used, and the worst under U(0,1). SRS has a better behavior in the normal case.