MSC-Clustering and Forward Stepwise Regression for virtual metrology in highly correlated input spaces

PKS Prakash(1), Andrea Schirru(2), Peter Hung(1) and Seán McLoone(1)
(1) National University of Ireland, Maynooth  (2) University of Pavia, Italy
{ [email protected], [email protected], [email protected], [email protected] }

Abstract—Increasingly, semiconductor manufacturers are exploring opportunities for virtual metrology (VM) enabled process monitoring and control as a means of reducing non-value added metrology and achieving ever more demanding wafer fabrication tolerances. However, developing robust, reliable and interpretable VM models can be very challenging due to the highly correlated input space often associated with the underpinning data sets. A particularly pertinent example is etch rate prediction of plasma etch processes from multi-channel optical emission spectroscopy data. This paper proposes a novel input-clustering based forward stepwise regression methodology for VM model building in such highly correlated input spaces. Max Separation Clustering (MSC) is employed as a pre-processing step to identify a reduced set of well-conditioned, representative variables that can then be used as inputs to state-of-the-art model building techniques such as Forward Selection Regression (FSR), Ridge Regression, LASSO and Forward Selection Ridge Regression (FSRR). The methodology is validated on a benchmark semiconductor plasma etch dataset and the results obtained are compared with those achieved when the state-of-the-art approaches are applied directly to the data without the MSC pre-processing step. Significant performance improvements are observed when MSC is combined with FSR (13%) and FSRR (8.5%), but not with Ridge Regression (-1%) or LASSO (-32%). The optimal VM results are obtained using the MSC-FSR and MSC-FSRR generated models.

Fig. 1. Flow diagram of a plasma etch process (vacuum chamber with MF power source, match box, RF grounded electrode, plasma gas exhaust and OES sensing).


Keywords: Virtual metrology, optical emission spectroscopy, plasma etch processes, regression, clustering.

I. INTRODUCTION

As wafer fabrication processes become increasingly complex and feature sizes continue to decrease, process monitoring and advanced process control (APC) are becoming more and more critical to stable, high quality production in semiconductor manufacturing. This, in turn, is leading to an increasing demand for metrology. As feature sizes continue to shrink it is desirable to have real-time metrology for each wafer to facilitate wafer-level APC. However, this is not practical or economical as metrology is a non-value added activity that is generally expensive and time consuming.

Recently, virtual metrology (VM), also referred to as predictive metrology, has gained attention in the semiconductor industry as a means of reducing actual metrology while achieving real-time process visibility at the wafer level [3][15][5]. Virtual metrology involves using mathematical models to predict product quality parameters using only process data collected in-line during process operation. The key enablers for the development of VM solutions are: (1) sensing of the process variables that capture characteristics reflective of the final product quality; and (2) the availability of a representative set of historical data for these variables and the corresponding metrology for the product quality measure of interest. Traditional system identification and machine learning techniques can then be employed to estimate suitable predictive models. In practice, however, for many processes model estimation can be challenging due to the highly correlated input spaces associated with the underpinning data sets. This is especially true for etch rate prediction using optical emission spectroscopy (OES) in plasma etch processes.

Plasma etching is one of the fundamental processes in modern semiconductor manufacturing. It has emerged as an alternative to wet-chemical etching to meet the requirement for more precise control of the resolution and directionality of etch. In a typical plasma etching process (Fig. 1) a mixture of gases such as AlCl3 and Cl2 is introduced into a vacuum chamber and ionized using high power microwave frequencies (MF) to generate a plasma [14]. The plasma is sustained by an RF current carried by an inductive element. The generated plasma is accelerated towards the electrode where it interacts with the masked wafer surface both chemically and mechanically to etch away the exposed surface. The major challenge

978-1-4673-0351-4/12/$31.00 ©2012 IEEE


ASMC 2012

Fig. 2. 2-D graphical example of the sparsity of the LASSO: the contour lines of a quadratic score function (red), whose optimal unconstrained solution is (3.1, 0), meet the LASSO admissible region (black) at (1, 0): a sparse model results.

II. METHODOLOGY

The basic assumption in machine learning is that latent knowledge can be learned from data. Given a training set (x_i ∈ R^{1×p}, y_i ∈ R, i = 1, ..., n) of n samples, let X ∈ R^{n×p} be the matrix of p-variate inputs and Y ∈ R^n be the associated output vector. The goal is to find a function f : R^p → R such that, given a new example {x_new, y_new}, the prediction f(x_new) is close to y_new. In this context, f(x) is called an estimator of y. The basic machine learning technique, Ordinary Least Squares (OLS), can be traced back to Gauss and Legendre [8]; it looks for a linear relationship f(x) = xw by minimizing the following sum of squared residuals with respect to w ∈ R^p:

J_OLS(w) := (1/2) ‖Y − Xw‖² = (1/2) Σ_{i=1}^{n} (y_i − x_i w)²   (1)

Under a statistical framework, this amounts to maximizing the conditional probability p(Y|X) when assuming Y|X ∼ N(Xw, σ²I) or, equivalently, Y = Xw + ε with ε ∼ N(0, σ²I) (that is, i.i.d. Gaussian noise). The optimal coefficient vector w_OLS is

w_OLS = (X′X)^{−1} X′Y   (2)

and does not depend on σ². This scheme suffers from two main drawbacks: (i) for small datasets (n ≈ p) the estimated f(x) may overfit the noisy data or even interpolate the training examples, and (ii) the matrix X′X may be ill-conditioned or even singular.

A. Ridge Regression (RR)

In order to overcome these drawbacks, Ridge Regression (RR, also known as Tikhonov regularization) was proposed in the 1940s [13]: a linear estimator is obtained by minimizing the loss function

J_RR(w) := (1/2) ‖Y − Xw‖² + (λ/2) w′w = J_OLS(w) + (λ/2) w′w   (3)

where λ ∈ R+ is the regularization (hyper)parameter. Under a Bayesian framework, J_RR is a log-posterior distribution, and the term (λ/2) w′w in (3) is related to the prior distribution of w, p(w), assuming w ∼ N(0, λ^{−1}I). The larger λ, the smaller the variance of the estimator, at the cost of introducing some bias; in practical applications, λ is often used as a 'tuning knob' controlling the bias/variance tradeoff, which is typically tuned either via cross-validation or other statistical criteria. The optimal coefficient vector w_RR and the estimator f_RR(x) are

w_RR = (X′X + λI)^{−1} X′Y   (4)

f_RR(x) = x (X′X + λI)^{−1} X′Y   (5)

The numerical stability problems of Eq. (2) are now avoided, because (X′X + λI) has full rank for any λ > 0.
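The closed-form estimators of Eqs. (2), (4) and (5) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the helper names and the toy data are invented for the example, and `solve` is used in place of an explicit matrix inverse for numerical stability.

```python
import numpy as np

def ols_weights(X, Y):
    # Eq. (2): w_OLS = (X'X)^{-1} X'Y, computed via a linear solve
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ridge_weights(X, Y, lam):
    # Eq. (4): w_RR = (X'X + lambda*I)^{-1} X'Y; full rank for any lambda > 0
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def ridge_predict(x_new, X, Y, lam):
    # Eq. (5): f_RR(x) = x (X'X + lambda*I)^{-1} X'Y
    return x_new @ ridge_weights(X, Y, lam)

# Toy illustration (invented data): n = 5 samples, p = 2 inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = X @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=5)
print(ols_weights(X, Y))          # close to [1, -2]
print(ridge_weights(X, Y, 1e-3))  # slightly shrunk towards zero
```

Note that the ridge solution never has a larger norm than the OLS one: in the SVD basis each coefficient is scaled by σ_i²/(σ_i² + λ) ≤ 1, which is the bias/variance trade-off the text describes.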


B. LASSO

It is to be noted that Ridge Regression does not perform model order reduction, since in general all the entries of w_RR will be nonzero. As a matter of fact, when dealing with high dimensional input spaces, overparametrized models are easy to obtain (the so-called 'curse of dimensionality' [2]). In order to overcome this issue, it is useful to penalize w using an ℓ1 norm. The most popular technique employing such a penalty is the LASSO [12], which is solved by finding

w^L = arg min_w J_OLS(w)   (6)

under the constraint

Σ_{j=1}^{p} |w_j| ≤ λ₁   (7)

This formulation allows a sparse solution to be obtained for w (that is, some entries of the selected w are 0, as shown in Fig. 2) if λ₁ is low enough. This extremely convenient property of the LASSO allows for the creation of low-order models even when the input space has high dimension: intuitively, such a model is able to improve the stability of the prediction without sacrificing its precision [10]. The hyperparameter λ₁ again acts as a tuning knob: by lowering λ₁, the number of selected variables decreases, as does the magnitude of the coefficients. Notably, the ℓ1 constrained formulation has no closed-form solution: in order to train a predictor using the LASSO, it is necessary to resort to optimization techniques.

C. Forward Selection Regression (FSR)

A more direct approach to obtaining lower order models is to select a subset of the available variables as regressors. Determining the optimum subset of regressors is computationally infeasible for large input spaces as the number of possibilities grows exponentially with the dimension of the input space. Consequently, pragmatic approaches such as forward selection regression (FSR) are employed in practice to estimate the optimum subset. In FSR variables are added to the model, one at a time, such that at each step the variable added is the one which yields the best improvement in the model [9]. The basic procedure for the algorithm is as follows: initially, X̃ = X and the variable with the maximum contribution to the model is selected as

x̃* = arg min_{x̃ ∈ X̃} { ‖ỹ − ŷ(x̃)‖₂² }   (8)

where x̃ is the selected variable and ŷ(x̃) is the estimate of ỹ obtained by regressing on x̃, that is,

ŷ(x̃) = x̃ (x̃′ỹ) / (x̃′x̃)   (9)

Once x̃* has been identified, its contribution to X̃ is removed before searching for the next regressor, that is,

X̃ = ( I − (x̃* x̃*′) / (x̃*′ x̃*) ) X̃   (10)

The optimum number of regressors to include in the FSR model is determined using statistical criteria such as the F-test or through cross-validation on an independent dataset. Denoting X_FSR as the matrix of FSR selected regressors, the resulting VM model can be represented as

ỹ = X_FSR w_FSR + ε   (11)

with

w_FSR = (X′_FSR X_FSR)^{−1} X′_FSR Y   (12)

D. Forward Selection Ridge Regression (FSRR)

The final FSR solution is equivalent to the application of OLS (Eq. 2) to the reduced set of variables identified by FSR. Therefore, a natural extension to FSR is to apply ridge regression rather than OLS to estimate the model parameters, that is:

w_FSRR = (X′_FSR X_FSR + λI)^{−1} X′_FSR Y   (13)

The resulting model will be referred to as Forward Selection Ridge Regression (FSRR).

E. Max Separation Clustering (MSC)

Max Separation Clustering (MSC) [7] is an unsupervised correlation-based clustering algorithm which divides the input space into distinct clusters, such that the variables within each cluster are highly correlated. MSC works by identifying maxoids to build clusters around, where for a given cluster the maxoid is defined as the object which is furthest from all objects in the previously identified clusters. For a given set of objects X ∈ R^{m×n}, where m is the number of measurements and n is the number of variables, the cluster for a given maxoid m_k is determined as

C_k = { x_i | corr(x_i, m_k) ≥ λ_MSC, x_i ∈ X̃ }   (14)

Here, corr(·) is the Pearson product moment correlation [1] between x_i and maxoid m_k, λ_MSC is the desired correlation threshold and X̃ is the set of currently unclustered objects. The first maxoid m₁ is defined as the larger of x*_i and x*_j, the farthest apart objects in X, that is:

(x*_i, x*_j) = arg max_{x_i, x_j} { dist(x_i, x_j) }   (15)

m₁ = x*_i  s.t.  ‖x*_i‖₂ ≥ ‖x*_j‖₂   (16)

where dist is the ℓ2-norm, i.e., ‖x_i − x_j‖₂, while the maxoid for the kth cluster is given by

m_k = arg max_{x_i ∈ X̃} { dist(x_i, C) }   (17)
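The FSR recursion of Eqs. (8)-(10), followed by the OLS refit of Eq. (12) (or the ridge refit of Eq. (13) for FSRR), can be sketched as below. This is an illustrative NumPy implementation, not the authors' code; it additionally assumes that the target ỹ is deflated alongside X̃ at each step (the standard orthogonalization step, which the paper's Eq. (10) states only for X̃), and the toy data are invented.

```python
import numpy as np

def fsr_select(X, y, k):
    """Greedy forward selection, Eqs. (8)-(10): pick k regressors,
    deflating the remaining columns (and the target) after each pick."""
    Xt, yt = np.array(X, dtype=float), np.array(y, dtype=float)
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(Xt.shape[1]):
            xj = Xt[:, j]
            denom = xj @ xj
            if j in selected or denom < 1e-12:
                continue
            # Eq. (9): one-variable regression of the current residual on x_j
            res = yt - xj * ((xj @ yt) / denom)
            if res @ res < best_rss:
                best_rss, best_j = res @ res, j
        if best_j is None:
            break
        xs = Xt[:, best_j].copy()
        # Eq. (10): project out the contribution of the selected variable
        P = np.eye(Xt.shape[0]) - np.outer(xs, xs) / (xs @ xs)
        Xt, yt = P @ Xt, P @ yt
        selected.append(best_j)
    return selected

def refit(X, y, idx, lam=0.0):
    # Eq. (12) for lam = 0 (FSR); Eq. (13) for lam > 0 (FSRR)
    Xs = X[:, idx]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(len(idx)), Xs.T @ y)

# Toy illustration (invented data): y depends only on columns 0 and 2
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2]
idx = sorted(fsr_select(X, y, 2))
print(idx)               # expects the true support, [0, 2]
print(refit(X, y, idx))  # close to [2, -3]
```

In practice the number of steps k would be chosen by the F-test or cross-validation, as the text notes.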

F. MSC-FSR, MSC-RR, MSC-LASSO and MSC-FSRR

In datasets, such as OES, which are characterised by groups of highly correlated variables, only one variable from each group is needed for model building. In MSC the groups of correlated variables are automatically assigned to distinct clusters, hence the possibility exists to use MSC as a pre-processing step in VM modelling to reduce the dimensionality of the input space. This is achieved by replacing each MSC cluster by a single representative variable such as its centroid, maxoid or medoid. The centroid is the preferred choice as this has the desirable effect of averaging out noise. This noise suppression, coupled with the improvement in conditioning afforded by the elimination of the collinearity in the dataset, has the potential to substantially improve the performance of VM models. In the next section the proposed MSC pre-processing scheme will be evaluated for the four VM modelling algorithms considered in the paper, namely FSR, RR, LASSO and FSRR. The resulting algorithms will be referred to as MSC-FSR, MSC-RR, MSC-LASSO and MSC-FSRR.

TABLE I
VALIDATION AND TEST NMSE FOR VM MODELLING WITH AND WITHOUT MSC PRE-PROCESSING OF THE INPUT SPACE

Method      | Threshold | # of variables | Validation | Test
FSR         | N/A       | 103            | 3.21%      | 4.13%
RR          | N/A       | 8000           | 3.07%      | 3.82%
LASSO       | N/A       | 911            | 3.5%       | 3.91%
FSRR        | N/A       | 103            | 2.98%      | 3.74%
MSC-FSR     | 0.9844    | 131            | 2.85%      | 3.5%
MSC-RR      | 0.9877    | 4888           | 3.26%      | 3.72%
MSC-LASSO   | 0.8965    | 270            | 4.86%      | 4.94%
MSC-FSRR    | 0.9669    | 979            | 2.95%      | 3.2%

Fig. 3. MSC correlation threshold as a function of the number of clusters.
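The MSC pre-processing step of Eqs. (14)-(17), together with the centroid replacement described in Section II-F, can be sketched as follows. This is an illustrative NumPy sketch, not the implementation of [7]; function names and the toy data are invented, and variables are treated as columns of X.

```python
import numpy as np

def msc_clusters(X, threshold):
    """Max Separation Clustering sketch, Eqs. (14)-(17).
    X is (m measurements) x (n variables); returns a list of clusters,
    each a sorted list of column indices."""
    n = X.shape[1]
    # Pairwise Euclidean distances between variables (columns)
    dists = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    C = np.corrcoef(X, rowvar=False)
    # Eqs. (15)-(16): first maxoid is the larger-norm member of the
    # farthest-apart pair of variables
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    maxoid = i if np.linalg.norm(X[:, i]) >= np.linalg.norm(X[:, j]) else j
    unclustered, clusters = set(range(n)), []
    while unclustered:
        # Eq. (14): unclustered variables sufficiently correlated with the maxoid
        cluster = sorted(v for v in unclustered if C[v, maxoid] >= threshold)
        clusters.append(cluster)
        unclustered -= set(cluster)
        if not unclustered:
            break
        # Eq. (17): next maxoid is the unclustered variable farthest from
        # its closest already-clustered variable
        clustered = [v for c in clusters for v in c]
        maxoid = max(unclustered,
                     key=lambda v: min(dists[v, u] for u in clustered))
    return clusters

def centroid_reduce(X, clusters):
    # Replace each cluster by its centroid to obtain the reduced input space
    return np.column_stack([X[:, c].mean(axis=1) for c in clusters])

# Toy illustration (invented data): columns {0,1} and {2,3} are collinear pairs
rng = np.random.default_rng(2)
base = rng.normal(size=(30, 2))
X = np.column_stack([base[:, 0], base[:, 0], base[:, 1], base[:, 1]])
X = X + 0.01 * rng.normal(size=(30, 4))
clusters = msc_clusters(X, 0.95)
print(sorted(map(tuple, clusters)))   # two clusters: (0, 1) and (2, 3)
print(centroid_reduce(X, clusters).shape)
```

Each cluster always contains its own maxoid (self-correlation is 1), so the loop is guaranteed to terminate.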

III. CASE STUDY

The proposed methodology has been applied to a benchmark semiconductor plasma etch data set, consisting of OES signatures and corresponding actual etch rates for a sequence of over 2000 wafers processed on an industrial plasma etch tool. The dataset input space consists of 4 statistical moments (mean, variance, skewness and kurtosis) of the amplitude of the time series for each of 2000 OES wavelengths. For analysis, 50% of the data is used for training, 25% for validation and 25% for testing.

The resulting VM models are compared based on two criteria: (i) the number of variables selected; and (ii) the normalized mean squared error (NMSE) on the test data. Here, NMSE is defined as

NMSE = [ Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − µ̂_y)² ] × 100   (18)

where µ̂_y = (1/n) Σ_{i=1}^{n} y_i. The VM model parameters were estimated using the training set and the model hyperparameters (number of variables, λ and λ₁) were optimised using cross-validation on the validation data set. The MSC correlation threshold λ_MSC was also optimised through cross-validation. The optimal results obtained with each algorithm on both the validation and test datasets are summarized in Table I. Fig. 3 shows the relationship between λ_MSC and the number of clusters obtained for the dataset. For MSC-FSR, the optimal performance is achieved with a correlation threshold of 0.9844. The corresponding validation dataset NMSE as a function of the number
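The NMSE criterion of Eq. (18) is straightforward to compute; the helper below is an illustrative sketch with invented toy values, not the authors' evaluation code.

```python
import numpy as np

def nmse(y, y_hat):
    # Eq. (18): residual sum of squares, normalized by the variation of y
    # about its mean and expressed as a percentage
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    mu = y.mean()
    return 100.0 * np.sum((y - y_hat) ** 2) / np.sum((y - mu) ** 2)

# Sanity checks: a perfect predictor scores 0%, predicting the mean scores 100%
y = np.array([1.0, 2.0, 3.0, 4.0])
print(nmse(y, y))                     # 0.0
print(nmse(y, np.full(4, y.mean())))  # 100.0
```

Values below 100% therefore indicate a model that explains more of the output variation than the trivial mean predictor.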

of cluster centroids (variables) selected by FSR is shown in Fig. 4. For comparison purposes the equivalent graphs for FSR and FSRR are given in Fig. 5, and for LASSO and RR in Fig. 6 and Fig. 7, respectively. The results show that the inclusion of MSC enhances the performance of FSR and FSRR by 13% and 8.5%, respectively, but has a negative effect on LASSO. MSC-FSR and MSC-FSRR outperform all other VM algorithms considered for the problem.

IV. CONCLUSIONS

This paper proposes MSC based clustering as a pre-processing step to enhance the performance of VM model building techniques in scenarios where input spaces are highly correlated. Using MSC, the dimension of the input space is reduced by replacing the MSC clusters with their centroids, leading to a better conditioned set of variables for model building. The approach was evaluated using four VM methods, namely LASSO, Ridge Regression (RR), Forward Selection Regression (FSR) and Forward Selection Ridge Regression (FSRR), on a benchmark dataset. The results show that MSC enhances the performance of FSR and FSRR significantly but has a detrimental effect on LASSO. Further investigation is needed to establish whether these trends generalise to other problems.


Fig. 4. MSC-FSR NMSE performance as a function of the number of cluster centroids selected for a correlation threshold of 0.9844.

Fig. 7. Ridge Regression NMSE results (test NMSE as a function of the regularization parameter ξ).

ACKNOWLEDGEMENTS

The financial support of the Irish Centre for Manufacturing Research and Enterprise Ireland (grant CC/2010/1001) is gratefully acknowledged.

REFERENCES

Fig. 5. FSR and FSRR test data NMSE comparison.

[1] B.N. Asthana, S. Chand, and P.C. Tulsian. Elements of statistics. Management Accountant, page 234, 2012.
[2] R. Bellman and E. Lee. History and development of dynamic programming. IEEE Control Systems Magazine, 4(4):24–28, 2002.
[3] Y.J. Chang, Y. Kang, C.L. Hsu, C.T. Chang, and T.Y. Chan. Virtual metrology technique for semiconductor manufacturing. In Proc. International Joint Conference on Neural Networks (IJCNN'06), pages 5289–5293. IEEE, 2006.
[4] R. Chen, H. Huang, C.J. Spanos, and M. Gatto. Plasma etch modeling using optical emission spectroscopy. J. Vac. Sci. Technol. A, 14.
[5] F.T. Cheng, J.Y.C. Chang, H.C. Huang, C.A. Kao, Y.L. Chen, and J.L. Peng. Benefit model of virtual metrology and integrating AVM into MES. IEEE Transactions on Semiconductor Manufacturing, 24(2):261–272, 2011.
[6] A.C. Diebold. Overview of metrology requirements based on the 1994 national technology roadmap for semiconductors. In Proc. IEEE/SEMI Advanced Semiconductor Manufacturing Conference and Workshop (ASMC 95), pages 50–60. IEEE, 1995.
[7] B. Flynn and S. McLoone. Max separation clustering for feature extraction from optical emission spectroscopy data. IEEE Transactions on Semiconductor Manufacturing, (99):1–1, 2011.
[8] C.F. Gauss. Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Sumtibus F. Perthes et I.H. Besser, 1809.

Fig. 6. LASSO NMSE results (test NMSE as a function of the number of selected regressors).


[9] J. Neter, W. Wasserman, M.H. Kutner, and W. Li. Applied Linear Statistical Models. Irwin, 1996.
[10] I. Ramirez, F. Lecumberry, and G. Sapiro. Universal priors for sparse modeling. In Proc. 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 197–200. IEEE, 2009.
[11] M.P. Splichal and H.M. Anderson. Application of chemometrics to optical emission spectroscopy for plasma monitoring. In Proc. SPIE, volume 1594, pages 189–203, 1992.
[12] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[13] A. Tikhonov. On the stability of inverse problems. In CR (Dokl.) Acad. Sci. URSS, n. Ser., volume 39, pages 176–179, 1943.
[14] W.G.M. van den Hoek and T. Mountsier. A new high density plasma source for void free dielectric gap fill. In Technical Proceedings of the 1994 SEMI Technology Symposium (Chiba, Japan), pages 195–200, 1994.
[15] A. Weber. Virtual metrology and your technology watch list: ten things you should know about this emerging technology. Future Fab International, 22:52–54, 2007.
