Longitudinal Outlier Detection Through Robust Bivariate Boxplots

Marco Riani and Sergio Zani
Istituto di Statistica, Facoltà di Economia, Via Kennedy 6, 43100 Parma, Italy

Introduction

In this paper we show how robust bivariate boxplots (Zani et al., 1998) can be used for the analysis of longitudinal data. The suggested boxplot is made up of an inner region, a robust centroid and an outer contour, and has some advantages over earlier proposals (Goldberg and Iglewicz, 1992). The inner region is found by superimposing a B-spline curve on a convex hull, which is the bivariate equivalent of the interquartile range. The robust centroid is obtained as the arithmetic mean of the observations forming the inner region. Finally, the outer contour is found using a multiple of the distance of the inner curve from the robust centroid, chosen in such a way that, under the additional hypothesis of bivariate normality, the probability that an observation lies outside it is 0.01.
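The construction just described can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the procedure of Zani et al. (1998): the inner region is approximated by peeling convex hulls until one more peel would leave fewer than half of the points, the B-spline smoothing step is skipped, and the multiple for the outer contour is taken from the ratio of the 99% and 50% radii of a bivariate normal distribution, which may differ from the constant used by the authors. All function names and the toy data are hypothetical.

```python
from math import log, sqrt
from statistics import fmean

def cross(o, a, b):
    """Twice the signed area of triangle o-a-b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; vertices are returned counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(poly, p, tol=1e-9):
    """True if p is inside or on a counter-clockwise convex polygon."""
    n = len(poly)
    return all(cross(poly[i], poly[(i + 1) % n], p) >= -tol for i in range(n))

def inner_hull(points, fraction=0.5):
    """Peel hulls; stop when one more peel would leave < fraction of the data."""
    remaining = list(points)
    hull = convex_hull(remaining)
    while True:
        vertices = set(hull)
        peeled = [p for p in remaining if p not in vertices]
        if len(peeled) < fraction * len(points) or len(peeled) < 3:
            return hull
        remaining = peeled
        hull = convex_hull(remaining)

def expand_factor(p_inside=0.99):
    """Radius ratio of the p_inside and 50% contours of a bivariate normal.
    Assumption of this sketch, not necessarily the authors' constant."""
    return sqrt(log(1 - p_inside) / log(0.5))

def bivariate_boxplot(points, p_inside=0.99):
    """Return (inner hull, robust centroid, outer contour) without smoothing."""
    hull = inner_hull(points)
    core = [p for p in points if inside(hull, p)]
    cx = fmean([x for x, _ in core])
    cy = fmean([y for _, y in core])
    k = expand_factor(p_inside)
    outer = [(cx + k * (x - cx), cy + k * (y - cy)) for x, y in hull]
    return hull, (cx, cy), outer

# Toy data (made up for illustration): a 7 x 7 grid plus one distant point.
pts = [(i, j) for i in range(-3, 4) for j in range(-3, 4)] + [(20, 20)]
hull, centroid, outer = bivariate_boxplot(pts)
flagged = [p for p in pts if not inside(outer, p)]
```

Here `flagged` collects the points lying outside the outer contour. Because the centroid is computed only from the points of the inner region, the distant point does not pull it away from the bulk of the data.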


Longitudinal Analysis Through Successive Bivariate Boxplots

We describe the suggested extension by means of an application to a data set on the volatility and medium-term performance of 23 Italian stock investment funds in successive months. For lack of space we present the data for two periods only: the beginning of Jan. and of Mar. 1999 (source: the Italian financial newspaper "Il Sole 24 Ore"; data available on request). Of course, the best funds are those which combine the highest performance with the lowest volatility. The judgement on a fund must therefore consider these two aspects simultaneously, together with their changes over time. Superimposing a bivariate boxplot on the scatter diagram for each period makes it possible to answer all these questions in a completely non-parametric way. Figure 1 shows the robust bivariate boxplots for the two periods. In each plot the bulk of the data, as expected, shows a positive correlation: the highest performance is usually linked with the highest risk. However, many points seem to depart from an imaginary robust regression line that one could draw through the data. The graph clearly shows that the spread of the data is not the same in the different directions, so superimposing a confidence ellipse does not seem appropriate. The outer contour of the bivariate boxplot, by contrast, adapts to the differing spread of the data in the different directions in a non-parametric way and makes it possible to identify the funds with atypical behaviour (those lying outside the outer contour). The clean observations, on the other hand, are those which lie inside the outer contour. Note that by considering several thresholds for the outer contour we could order the global behaviour (performance and volatility) of the funds non-parametrically in each period. For example, the funds located in the lower right-hand corner of the plot (high performance and low volatility) and outside the successive contours (e.g. 75%, 90%, 99%) may be considered increasingly outperforming funds. In Figure 1 a few funds are identified by their number, which lets us monitor the changes in their relative position. In general terms, the joint examination of successive bivariate boxplots in each of the q ≥ 2 periods leads to the definition of a subset of longitudinal clean observations as follows:

Figure 1: Robust bivariate boxplots of stochastic volatility versus medium-term performance for 23 Italian stock investment funds. Left panel: Jan. 99; right panel: Mar. 99 (the symbol "+" denotes the robust centroid).

Definition: We call the longitudinal subset of clean observations the subset formed by the observations which never lie outside the outer contour in any period. The complementary subset is formed by the potential outliers.

The order of outlyingness of each unit can be evaluated simply by counting the number of times it falls outside the outer contour in the q plots. In addition, the definition of the longitudinal subset of clean observations can become a starting point for outlier detection procedures based on forward search methods (Atkinson, 1994; Atkinson and Riani, 1997). The iterative inclusion of the units proceeds as follows: we first select the group of units which fall outside the outer contour only once. We introduce them one at a time into the subset of clean observations, starting from the one with the smallest distance from the centroid, using a generalized metric for asymmetric distributions (Riani and Zani, 1998). We then consider the units which lie outside the outer contour twice and proceed similarly. At each step a number of statistics can be monitored. The suggested approach provides an ordering of the bivariate longitudinal data not belonging to the subset of clean observations, from those closest to the bulk of the data to those furthest from it.
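The bookkeeping behind the definition and the inclusion order can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes that the set of units outside the outer contour is already known for each period, and that a single distance from the centroid is supplied for each potential outlier. The generalized metric of Riani and Zani (1998) is not implemented, and distances are not recomputed after each inclusion as a full forward search would do. The function name, unit labels and distance values are all made up.

```python
from collections import Counter

def longitudinal_ordering(units, outside_by_period, distance):
    """Split units into the clean subset and an ordered list of potential outliers.

    units             : iterable of unit labels
    outside_by_period : one collection per period, holding the units that
                        fall outside the outer contour in that period
    distance          : dict mapping each potential outlier to its distance
                        from the robust centroid (hypothetical input)
    """
    units = list(units)
    counts = Counter()
    for outside in outside_by_period:
        counts.update(set(outside))  # count each unit at most once per period

    # Clean subset: units never outside the outer contour in any period.
    clean = [u for u in units if counts[u] == 0]

    # Potential outliers enter by number of exceedances first,
    # then by increasing distance from the centroid.
    candidates = [u for u in units if counts[u] > 0]
    entry_order = sorted(candidates, key=lambda u: (counts[u], distance[u]))

    outlyingness = {u: counts[u] for u in units}
    return clean, outlyingness, entry_order

# Illustrative input for q = 3 periods (values invented for this sketch).
units = [1, 2, 3, 4, 5, 6]
outside_by_period = [{2, 5}, {5}, {5, 6}]
distance = {2: 3.1, 5: 4.0, 6: 2.4}

clean, outlyingness, entry_order = longitudinal_ordering(
    units, outside_by_period, distance
)
```

With this input, units 1, 3 and 4 form the clean subset. Units 6 and 2 fall outside once and enter in order of distance, followed by unit 5, which falls outside in all three periods.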

References

Atkinson, A. C. (1994), Fast Very Robust Methods for the Detection of Multiple Outliers, Journal of the American Statistical Association, 89, 1329-1339.
Atkinson, A. C. and Riani, M. (1997), Bivariate Boxplots, Multiple Outliers, Multivariate Transformations and Discriminant Analysis: the 1997 Hunter Lecture, Environmetrics, 8, 583-602.
Goldberg, K. M. and Iglewicz, B. (1992), Bivariate Extensions of the Boxplot, Technometrics, 34, 307-320.
Riani, M. and Zani, S. (1998), Generalized Distance Measures for Asymmetric Multivariate Distributions, in Rizzi, A., Vichi, M. and Bock, H.-H. (eds.), Advances in Data Science and Classification, Springer Verlag, Berlin, pp. 503-508.
Zani, S., Riani, M. and Corbellini, A. (1998), Robust Bivariate Boxplots and Multiple Outlier Detection, Computational Statistics and Data Analysis, 28, 257-270.