Statistical Issues in Traffic Accident Modeling


Ziad Sawalha

Research Associate Department of Civil Engineering University of British Columbia Vancouver, B.C., Canada V6T 1Z4

Tarek Sayed Associate Professor Department of Civil Engineering The University of British Columbia 2324 Main Mall Vancouver, BC, Canada V6T 1Z4 Tel: (604) 822-4379 Fax: (604) 822-6901 E-mail: [email protected]

TRB 2003 Annual Meeting CD-ROM

Paper revised from original submittal.

ABSTRACT Accident prediction models are invaluable tools that have many applications in road safety analysis. However, there are certain statistical issues related to accident modeling that either deserve further attention or have not been dealt with adequately in the road safety literature. This paper discusses and illustrates how to deal with two statistical issues related to modeling accidents using Poisson and negative binomial regression. The first issue is that of model building or deciding which explanatory variables to include in an accident prediction model. The study differentiates between applications for which it is advisable to avoid model over-fitting and other applications for which it is desirable to fit the model to the data as closely as possible. It then suggests procedures for developing parsimonious models, i.e. models that are not overfitted, and best-fit models. The second issue discussed in the paper is that of outlier analysis. The study suggests a procedure for the identification and exclusion of extremely influential outliers from the development of Poisson and negative binomial regression models. The procedures suggested for model building and conducting outlier analysis are more straightforward to apply in the case of Poisson regression models due to an added complexity presented by the shape parameter of the negative binomial distribution. The paper, therefore, presents flowcharts detailing the application of the procedures when modeling is carried out using negative binomial regression. The described procedures are then applied in the development of negative binomial accident prediction models for the urban arterials of the cities of Vancouver and Richmond located in the province of British Columbia, Canada.


1. INTRODUCTION Statistical modeling is used to develop accident prediction models (APMs) relating accident occurrence on various road facilities to the traffic and geometric characteristics of these facilities. These models have several applications such as estimation of the safety potential of road entities, identification and ranking of hazardous or accident-prone locations, evaluation of the effectiveness of safety improvement measures, and safety planning.

Currently, generalized linear regression modeling (GLM) is used almost exclusively for the development of APMs since several researchers (e.g. Jovanis and Chang, 1986; Miaou and Lum, 1993) have demonstrated that certain standard conditions under which conventional linear regression modeling is appropriate (Normal error structure and constant error variance) are violated by traffic accident data. Most safety researchers now adopt either a Poisson or a negative binomial error structure in the development of APMs. Several GLM statistical software packages are available for the development of these models. This software allows the modeling of data that follow a wide range of probability distributions belonging to the exponential family (among which are the Poisson and the negative binomial distributions).

The road safety literature is rich with APMs developed by Poisson or negative binomial regression. However, there are certain statistical issues related to accident modeling that either deserve further attention or have not been dealt with adequately in the road safety literature. This paper discusses and illustrates how to deal with two statistical issues related to accident modeling, namely the issue of selecting the explanatory variables of an APM and the issue of outlier analysis. The procedures given for dealing with these issues are then applied in the development of APMs for the urban arterials of the cities of Vancouver and Richmond located in the province of British Columbia, Canada.


2. DATA DESCRIPTION A total of 58 arterials in the cities of Vancouver and Richmond were investigated for the purpose of developing APMs relating the safety of urban arterial sections to their traffic and geometric characteristics. Geometric data were directly collected from the field. The approach to geometric data collection consisted of dividing each arterial into sections between consecutive signalized intersections and gathering field information for each section separately. The geometric data collected consisted of: (a) section length, (b) between-signal number of lanes, (c) number of unsignalized intersections, (d) number of driveways, (e) number of bus stops, (f) number of pedestrian crosswalks, (g) type of median (no median, flush median, or raised-curb median), (h) type of land use (residential, business, or other), and (i) percentage of section length along which parking is allowed.

The data on accident frequencies and traffic volumes along the arterials were obtained from the cities of Vancouver and Richmond and covered the period from 1994 to 1996. Accidents that occurred at signalized intersections were excluded from the accident data used to develop the models. Summary statistics describing the data are provided in Table 1.

3. MODEL DEVELOPMENT The development of the APMs for the urban arterial sections of Vancouver and Richmond was carried out using the GLIM4 statistical software package (Numerical Algorithms Group, 1994). The following elements were necessary for model development: appropriate model form, appropriate error structure, procedure for selecting the model explanatory variables, procedure for outlier analysis, and methods for assessing model goodness of fit. The procedures for selecting the model explanatory variables and conducting outlier analysis are the statistical issues targeted by this paper and are discussed in the following sections.

3.1 Model Form The mathematical form used for any APM should satisfy two conditions. First, it must yield logical results. This means that a) it must not lead to the prediction of a negative number of accidents and b) it must ensure a prediction of zero accident frequency for zero values of the exposure variables, which, for road sections, are section length and annual average daily traffic (AADT).

The second condition that must be satisfied by the model form is that, in order to use generalized linear regression in the modeling procedure, there must exist a known link function that can linearize this form for the purpose of coefficient estimation. These conditions are satisfied by a model form that consists of the product of powers of the exposure measures multiplied by an exponential incorporating the remaining explanatory variables. Such a model form can be linearized by the logarithm link function. Expressed mathematically, the model form that was used is as follows:

\hat{E}(Y) = a_0 \times L^{a_1} \times V^{a_2} \times \exp\left( \sum_j b_j x_j \right)    (1)

where \hat{E}(Y) = predicted accident frequency, L = section length, V = section AADT, x_j = any variable additional to L and V, and a_0, a_1, a_2, b_j = the model parameters.
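To make the model form concrete, the following sketch evaluates equation (1) for a single hypothetical section. The coefficient values and the single additional variable x_1 are illustrative assumptions, not the fitted Vancouver/Richmond parameters.

```python
import numpy as np

# Illustrative coefficients only -- NOT the fitted Vancouver/Richmond values.
a0, a1, a2 = 0.005, 0.8, 0.6   # exposure coefficients
b1 = 0.05                      # coefficient of one hypothetical extra variable

def predicted_accidents(L, V, x1):
    """Equation (1): E(Y) = a0 * L^a1 * V^a2 * exp(b1 * x1)."""
    return a0 * L**a1 * V**a2 * np.exp(b1 * x1)

# Zero exposure (L = 0 or V = 0) yields a zero prediction, as required.
print(predicted_accidents(0.0, 20000, 5.0))
# A 1.5 km section with AADT 20,000 and, say, 5 driveways per km:
print(predicted_accidents(1.5, 20000, 5.0))
```

Note how the multiplicative form guarantees both logical conditions: the prediction can never be negative, and it vanishes whenever either exposure variable is zero.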


3.2 Error Structure As mentioned earlier, the GLM approach to modeling traffic accident occurrence assumes an error structure that is Poisson or negative binomial. Let Y be the random variable that represents the accident frequency at a given location during a specific time period, and let y be a certain realization of Y. The mean of Y, denoted by Λ, is itself a random variable (Kulmala, 1995). For Λ = λ, Y is Poisson distributed with parameter λ:

P(Y = y \mid \Lambda = \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}; \quad E(Y \mid \Lambda = \lambda) = \lambda; \quad Var(Y \mid \Lambda = \lambda) = \lambda    (2a,b,c)

It is the usual practice to assume that the distribution of Λ can be described by a gamma probability density function. Hauer (1997) examined many accident data sets and the empirical evidence he obtained supported the gamma assumption for the distribution of Λ. If Λ is described by a gamma distribution with shape parameter κ and scale parameter κ/µ, then its density function is:

f_\Lambda(\lambda) = \frac{(\kappa/\mu)^{\kappa} \lambda^{\kappa-1} e^{-(\kappa/\mu)\lambda}}{\Gamma(\kappa)}; \quad E(\Lambda) = \mu; \quad Var(\Lambda) = \frac{\mu^2}{\kappa}    (3a,b,c)

and the distribution of Y around E(Λ) = µ is negative binomial (Hinde and Demetrio, 1998; Hauer et al., 1988). Therefore, unconditionally:

P(Y = y) = \frac{\Gamma(\kappa + y)}{\Gamma(\kappa)\, y!} \left( \frac{\kappa}{\kappa + \mu} \right)^{\kappa} \left( \frac{\mu}{\kappa + \mu} \right)^{y}; \quad E(Y) = \mu; \quad Var(Y) = \mu + \frac{\mu^2}{\kappa}    (4a,b,c)

As shown by equations (4b,c), the variance of the accident frequency is generally larger than its expected value reflecting the fact that accident data are generally over-dispersed. The only exception is when κ → ∞, in which case the distribution of Λ is concentrated at a point and the negative binomial distribution becomes identical to the Poisson distribution.
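The gamma-Poisson mixture in equations (2) through (4) can be checked by simulation. The sketch below draws Λ from the gamma distribution of equation (3) and then Y from a Poisson with that mean; the sample moments come out close to equations (4b,c). The parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, kappa = 4.0, 2.0   # E(Y) = mu; Var(Y) should be mu + mu^2/kappa = 12

# Lambda ~ Gamma(shape = kappa, scale = mu/kappa), then Y | Lambda ~ Poisson(Lambda).
lam = rng.gamma(shape=kappa, scale=mu / kappa, size=200_000)
y = rng.poisson(lam)

print(y.mean())   # close to mu = 4
print(y.var())    # close to mu + mu**2/kappa = 12
```

Raising kappa toward infinity shrinks the extra-Poisson variance term mu**2/kappa toward zero, mirroring the collapse of the negative binomial to the Poisson noted above.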


The decision on whether to use a Poisson or negative binomial error structure was based on the following methodology. First, the model parameters are estimated based on a Poisson error structure. Then, the dispersion parameter (σ_d) is calculated as follows:

\sigma_d = \frac{\text{Pearson } \chi^2}{n - p}    (5)

where n is the number of observations, p is the number of model parameters, and Pearson χ² is defined as:

\text{Pearson } \chi^2 = \sum_{i=1}^{n} \frac{[y_i - \hat{E}(Y_i)]^2}{Var(Y_i)}    (6)

where y_i is the observed number of accidents on section i, \hat{E}(Y_i) is the predicted accident frequency for section i as obtained from the APM, and Var(Y_i) is the variance of the accident frequency for section i. The dispersion parameter, σ_d, is noted by McCullagh and Nelder (1989) to be a useful statistic for assessing the amount of variation in the observed data. If σ_d turns out to be significantly greater than 1.0, then the data have greater dispersion than is explained by the Poisson distribution, and a negative binomial regression model is fitted to the data.
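The decision rule in equations (5) and (6) can be sketched as follows. The counts and Poisson fitted values below are made-up numbers used only to show the computation; under the Poisson assumption, Var(Y_i) equals the fitted mean.

```python
import numpy as np

def dispersion_parameter(y, mu, n_params):
    """Equation (5): sigma_d = Pearson chi^2 / (n - p) under a Poisson fit,
    where Var(Y_i) = mu_i for the Poisson distribution (equation (6))."""
    pearson_chi2 = np.sum((y - mu) ** 2 / mu)
    return pearson_chi2 / (len(y) - n_params)

# Toy observed counts and Poisson fitted values (illustrative numbers only).
y  = np.array([2, 0, 5, 1, 9, 3, 0, 7])
mu = np.array([1.8, 1.0, 3.5, 1.2, 4.0, 2.5, 0.9, 4.1])

sigma_d = dispersion_parameter(y, mu, n_params=3)
print(sigma_d)   # well above 1 here, pointing to a negative binomial model
```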

3.3 Assessment of Model Goodness of Fit Several statistical measures can be used to assess the goodness of fit of GLM models. The two statistical measures used are those cited by McCullagh and Nelder (1989) for assessing model goodness of fit. These are a) the Pearson χ² statistic, defined in equation (6), and b) the scaled deviance. The scaled deviance is the likelihood ratio test statistic measuring twice the difference between the maximized log-likelihoods of the studied model and the full or saturated model. The full model has as many parameters as there are observations, so that the model fits the data perfectly. Therefore, the full model, which possesses the maximum log-likelihood achievable under the given data, provides a baseline for assessing the goodness of fit of an intermediate model with p parameters.

McCullagh and Nelder (1989) have shown that if the error structure is Poisson distributed, then the scaled deviance is as follows:

SD = 2 \sum_{i=1}^{n} y_i \ln\left( \frac{y_i}{\hat{E}(Y_i)} \right)    (7)

while if the error structure follows the negative binomial distribution, the scaled deviance is:

SD = 2 \sum_{i=1}^{n} \left[ y_i \ln\left( \frac{y_i}{\hat{E}(Y_i)} \right) - (y_i + \kappa) \ln\left( \frac{y_i + \kappa}{\hat{E}(Y_i) + \kappa} \right) \right]    (8)

Both the scaled deviance and the Pearson χ² have χ² distributions for Normal theory linear models, but they are asymptotically χ² distributed with n − p degrees of freedom for other distributions of the exponential family.
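A minimal sketch of equations (7) and (8) follows, on invented data. Equation (7) is stated for a maximum likelihood Poisson fit, where the fitted values reproduce the total observed count, so the toy fitted values are chosen to satisfy that; terms with y_i = 0 are taken as zero (the limit of x ln x). As κ grows very large, the negative binomial deviance approaches the Poisson one, matching the collapse of the negative binomial to the Poisson noted in Section 3.2.

```python
import numpy as np

def poisson_scaled_deviance(y, mu):
    """Equation (7); terms with y_i = 0 contribute 0 (limit of x ln x)."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    safe_y = np.where(y > 0, y, 1.0)
    return 2.0 * np.where(y > 0, y * np.log(safe_y / mu), 0.0).sum()

def nb_scaled_deviance(y, mu, kappa):
    """Equation (8) for a negative binomial error structure with shape kappa."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    safe_y = np.where(y > 0, y, 1.0)
    t1 = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    t2 = (y + kappa) * np.log((y + kappa) / (mu + kappa))
    return 2.0 * (t1 - t2).sum()

y  = [2, 0, 5, 1, 9]
mu = [1.8, 1.0, 3.5, 1.7, 9.0]   # toy fitted values; sums match, as at a ML fit
print(poisson_scaled_deviance(y, mu))
print(nb_scaled_deviance(y, mu, kappa=1e9))   # nearly the Poisson value
```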

4. SELECTION OF MODEL EXPLANATORY VARIABLES There seems to be a belief among many safety researchers that the more variables in an APM, the better the model. Some researchers have even reported models containing variables with highly insignificant parameters, based on the belief that such variables would still improve model prediction. Such variables are hardly of any value for explaining the variability of the specific accident data used in generating the model, much less of any value for predicting accident frequencies at new locations not used in the model development. Explanatory variables that have statistically significant model parameters, on the other hand, contribute to the explanation of the variability of the accident data, and their inclusion in the model therefore improves its fit to these data. Nevertheless, improvement of a model's fit to the accident data is not enough justification for retaining a variable in the model. This paper presents a detailed analysis of how to select which explanatory variables to include in an APM.

The procedure that is suggested in this paper for selecting the explanatory variables to include in an APM depends on the locations whose safety is to be studied by the model. Safety researchers and road safety authorities are not interested only in the safety study of the limited number of locations used for model building. They want to be able to study the safety of other locations in the same region that have traffic and geometric characteristics similar to those of the locations used to build a model. Various reasons could prevent many locations from being included in the development of a model. For example, accident data might not be available for some of these locations at the time of model development. Another reason might be that some of these locations simply did not exist at that time. Therefore generality is a characteristic that is required in a model that will be used for the safety study of new locations not included in its development.

Model generality requires that a model be developed in accordance with the principle of parsimony, which calls for explaining as much of the variability of the data using the least number of explanatory variables. The idea behind the principle of parsimony is to avoid overfitting. It is tempting to include many variables in a model in an effort to make it fit the observed data as closely as desired. In fact, if a number of statistically significant variables equal to the number of observations can be found, a perfect fit to the data can be achieved by including these variables in the model. However, the question that must be asked is whether a model having a perfect or extremely close fit to a particular set of observations can produce reliable predictions when applied to a different set of locations. The answer is that such a model is unstable and may perform poorly when applied to a new sample drawn from the same population. For the safety study of new locations, more is not better, and it may be worse. The procedure that is proposed in this paper for building parsimonious models is explained in the next section.


On the other hand, if an APM is to be used for studying the safety of the particular set of locations used to develop it, then a more accurate study would result by using a model that fits the accident data as closely as possible. This best-fit model is achieved by including all the available statistically significant explanatory variables. The procedure for building best-fit models is also explained in the next section.

4.1 Procedures for Building Parsimonious and Best-Fit Models The procedure that is suggested in this paper for developing parsimonious APMs is a forward procedure by which the variables are added to a model one by one. The decision on whether a variable should be retained in the model is based on two criteria. The first criterion is whether the t-ratio of its estimated parameter (equal to the parameter estimate divided by its standard error and equivalent to the Wald statistic) is significant at the 95 percent confidence level (or any other level selected by the model developer). The second criterion is whether the addition of the variable to the model causes a significant drop in the scaled deviance at the 95 percent confidence level.

The second criterion represents an analysis of deviance procedure for comparing two nested models. This procedure is equivalent to carrying out a likelihood ratio test to determine whether the model containing the additional variable significantly increases the likelihood of the observed sample of accident data. As mentioned earlier, the scaled deviance is asymptotically χ² distributed with n − p degrees of freedom, and therefore, owing to the reproductive property of the χ² distribution, this second criterion is met if the addition of the variable causes a drop in scaled deviance exceeding χ²_{0.05,1} (equal to 3.84).


However, it is important to note at this point that the analysis of deviance procedure for developing parsimonious models should be conducted in different manners for Poisson and negative binomial regression models. In the case of Poisson regression models, the difference in scaled deviance between the model with p + 1 variables and the model with p variables is equal to the likelihood ratio test statistic, which compares the maximized likelihood functions of the sample of accident data under the two models. This statistic is defined as follows:

LRTS = 2 \ln\left( \frac{L_{p+1}}{L_p} \right)    (9)

where
LRTS = the likelihood ratio test statistic
L_{p+1} = the maximum likelihood of the accident data under the model with p + 1 variables
L_p = the maximum likelihood of the accident data under the model with p variables

The likelihood of the accident data under a certain model is equal to the joint probability of the accident observations. It is therefore a function of the model parameters and the error structure assumed by the model. LRTS is a non-negative statistic, and the minimum value it can have is zero, in which case the additional variable contributes nothing to increase the likelihood of the sample of accident data. LRTS is used to test the null hypothesis H_0: β_{p+1} = 0, where β_{p+1} is the parameter of the additional variable. A statistically significant LRTS leads to the rejection of the null hypothesis and the adoption of the model with p + 1 variables. A statistically insignificant LRTS means failing to reject the null hypothesis and concluding that the model with p variables is satisfactory.
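Since maximized likelihoods are usually reported on the log scale, equation (9) reduces to twice the difference in maximized log-likelihoods. A minimal sketch, with hypothetical log-likelihood values standing in for two nested fits:

```python
def lrts(loglik_p, loglik_p1):
    """Equation (9): LRTS = 2 ln(L_{p+1} / L_p) = 2 (l_{p+1} - l_p)."""
    return 2.0 * (loglik_p1 - loglik_p)

# Hypothetical maximized log-likelihoods from two nested fits:
ll_p, ll_p1 = -142.7, -139.9

stat = lrts(ll_p, ll_p1)
chi2_crit = 3.84          # chi-squared 0.05 critical value, 1 d.f.
keep_variable = stat > chi2_crit
print(stat, keep_variable)
```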

Determining the statistical significance of LRTS requires knowledge of its sampling distribution. Aitkin et al. (1989) state that if two nested models with degrees of freedom df1 and df2 are correct, the sampling distribution of their LRTS is an asymptotic χ² distribution with degrees of freedom equal to (df1 − df2). Therefore, for the LRTS in equation (9) to be statistically significant at the 95 percent confidence level, it has to exceed χ²_{0.05,1} = 3.84. The fact that, in the case of Poisson regression models, the difference in scaled deviance between the model with p + 1 variables and the model with p variables is equal to their LRTS means that this difference can be directly used to assess the usefulness of the additional variable. A drop in scaled deviance exceeding 3.84 can be taken as the basis for choosing the model with p + 1 variables.

In the case of negative binomial regression, models with different variables have different κ values. The scaled deviances of two negative binomial regression models with different values of κ cannot be directly compared (Hinde, 1996). The reason behind this is that the difference in scaled deviance between the two models is not equal to their likelihood ratio test statistic. In this case, conducting a meaningful analysis of deviance procedure for developing parsimonious models requires that the value of κ for the model with p variables be imposed on the model with p + 1 variables. The drop in scaled deviance is then compared with χ²_{0.05,1} in order to assess the usefulness of the additional variable. The procedure for developing parsimonious negative binomial regression models is illustrated in the next section, with the end result being a parsimonious APM for the urban arterial sections of Vancouver and Richmond.
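The fixed-κ comparison can be sketched as follows. Both nested models are evaluated with the scaled deviance of equation (8) at the κ estimated under the p-variable reference model; the fitted values and κ below are invented numbers purely for illustration.

```python
import numpy as np

def nb_scaled_deviance(y, mu, kappa):
    """Equation (8), used here with a common kappa for both nested models."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    safe_y = np.where(y > 0, y, 1.0)
    t1 = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    t2 = (y + kappa) * np.log((y + kappa) / (mu + kappa))
    return 2.0 * (t1 - t2).sum()

# Illustrative fitted values from a p-variable reference model and from the
# (p+1)-variable candidate refitted with kappa fixed at the reference value:
y       = np.array([3, 0, 6, 2, 11, 4])
mu_p    = np.array([2.5, 1.1, 4.0, 2.2, 8.0, 4.2])
mu_p1   = np.array([2.9, 0.6, 5.4, 2.1, 10.2, 3.9])
kappa_p = 2.3   # shape parameter estimated under the reference model

drop = nb_scaled_deviance(y, mu_p, kappa_p) - nb_scaled_deviance(y, mu_p1, kappa_p)
print(drop, drop > 3.84)   # retain the variable only if the drop exceeds 3.84
```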

Now that the procedure for developing parsimonious APMs has been outlined, it is necessary to justify this procedure. In other words, it is necessary to state why the t-test alone is not sufficient for developing parsimonious Poisson and negative binomial regression models. It should be stated at the outset that the t-test and the likelihood ratio test, or analysis of deviance, are completely equivalent for two nested Normal regression linear models differing by one variable. They can both be used to test the null hypothesis H_0: β_{p+1} = 0. This is not the case for the rest of the GLM family models. The following explains why:


1. The use of the t-distribution when parameter estimates are compared with their standard errors is exact for the classical Normal regression linear model, but is otherwise justified only by asymptotic theory (Numerical Algorithms Group, 1994). No general results are known about the adequacy of this approximation for all the other models covered by GLM, so the t-test must be regarded as only a general guide to the accuracy of the estimates.

2. Aitkin et al. (1989) state that an asymptotic χ² distribution for the LRTS of two nested models is not used in the case of Normal regression linear models, since exact results are available. The exact distribution for the LRTS of two nested Normal regression linear models containing p + g and p explanatory variables is the F_{g, n-p-g} distribution with g degrees of freedom in the numerator and n − p − g degrees of freedom in the denominator. Therefore, when g = 1, the distribution of LRTS is F_{1, n-p-1}, which is equal to t²_{n-p-1}. This means that, in the case of Normal regression linear models, the t-test and the likelihood ratio test are completely equivalent. For regression models in which the response variable is not normally distributed, the value of t²_{n-p-1} may be substantially different from the LRTS of two nested models differing by one variable.

The procedure used in this paper for developing APMs that fit the accident data as closely as possible is relatively straightforward. It is also a forward procedure by which variables are added to a model one by one. However, in this case, the only condition for retaining a variable in the model is that the t-ratio of its estimated coefficient is significant at the 95 percent confidence level.


4.2 Illustration of the Procedure for Developing Parsimonious Negative Binomial APMs The development of parsimonious negative binomial APMs is best illustrated by the flowchart in Figure 1. This flowchart shows the detailed sequence of steps necessary to develop such models according to the procedure outlined in the previous section. Following is the definition of the letter symbols used in the flowchart:

i = an integer variable that is used as an index for the explanatory variables.
j = an integer variable that is used to refer to the model number.
κ(j) = shape parameter of the negative binomial error structure under Reference Model(j).
N = an integer variable representing the number of explanatory variables, other than the exposure variables, that are under investigation at any phase of the process.
X(i) = explanatory variable i.
D(i) = drop in scaled deviance resulting from the addition of variable X(i) to a model.
S = a real variable used to store the value of the greatest drop in scaled deviance during a cycle of variable additions to Reference Model(j).

The flowchart in Figure 1 shows that several models are run with and without fixing the value of κ at various phases of the model development process. The default method used by GLIM 4 to determine the vector of model parameters, β, and the shape parameter, κ, is the method of maximum likelihood. This method provides maximum likelihood estimates of β and κ. These are the values of β and κ that maximize the log-likelihood function l(β, κ) given by:

l(\beta, \kappa) = \sum_{i=1}^{n} \left[ \ln \Gamma(y_i + \kappa) - \ln \Gamma(\kappa) - \ln(y_i!) - \kappa \ln\left(1 + \frac{\hat{y}_i}{\kappa}\right) - y_i \ln\left(1 + \frac{\kappa}{\hat{y}_i}\right) \right]    (10)

The log-likelihood function depends on β through the terms ŷ_i = \hat{E}(Y_i), which represent the model predictions or fitted values. The model form most commonly used in accident modeling was given in equation (1), and it specifies the fitted values as ŷ_i = exp(x_i′β), where x_i is the vector of explanatory variables corresponding to the ith observation. The method of maximum likelihood uses the Newton-Raphson technique to obtain β and κ as the solution to the equations ∂l/∂β = 0 and ∂l/∂κ = 0. GLIM 4 also allows the user to pre-specify the value of κ. The value of β that maximizes the log-likelihood function is then calculated using the Newton-Raphson technique with Fisher scoring. The Fisher scoring allows the calculation of β through an iteratively re-weighted least squares (IRLS) procedure.
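Equation (10) can be written directly in code using the log-gamma function, with ln(y_i!) expressed as ln Γ(y_i + 1). A minimal sketch, evaluated at arbitrary illustrative fitted values rather than a converged fit:

```python
import numpy as np
from scipy.special import gammaln

def nb_loglik(y, mu, kappa):
    """Equation (10): negative binomial log-likelihood l(beta, kappa),
    written in terms of the fitted values mu_i = exp(x_i' beta)."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return np.sum(
        gammaln(y + kappa) - gammaln(kappa) - gammaln(y + 1)  # ln(y!) = gammaln(y+1)
        - kappa * np.log1p(mu / kappa)
        - y * np.log1p(kappa / mu)
    )

y  = np.array([2, 0, 5, 1, 9])
mu = np.array([1.8, 1.0, 3.5, 1.7, 9.0])
print(nb_loglik(y, mu, kappa=2.0))
```

Maximizing this function over β (and κ, when it is not fixed) is exactly what the Newton-Raphson and IRLS machinery described above does internally.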

The starting point in the model development process is a basic model containing the exposure variables only. The reason for this start is that any accident model should at least contain the exposure variables since no accidents occur without exposure. The basic model acts as the first reference model. The term “reference model” is used here to denote a model that serves as the base for generating a new model containing one more additional variable. Subsequently, secondary models are developed by adding the rest of the variables individually to the basic model with κ fixed to the value obtained under that model. Since κ is fixed at this stage in the process, it is possible to compare the scaled deviance of the model resulting from the addition of a certain variable to the scaled deviance of the basic model. No further consideration is given to any variable whose parameter has an insignificant t-ratio or that causes an insignificant drop in the scaled deviance obtained under the basic model. The drop in scaled deviance caused by each of the other variables is noted and recorded.

The next stage in the process consists of developing a new reference model containing the exposure variables and that explanatory variable which caused the maximum drop in scaled deviance during the previous stage; this model will have a new κ value. The remaining variables that were not discarded in the previous stage are now individually added to the second reference model with κ fixed to the new value, and a procedure similar to the one in the previous stage is used to obtain the next reference model. The model building process continues in this manner, using the method of maximum likelihood to obtain the reference models and the fixed-κ method to obtain the secondary models. The process terminates when all the explanatory variables have been examined, and the final reference model will be the parsimonious model sought.

4.3 Parsimonious and Best-Fit Models Developed The procedures that were described for developing parsimonious and best-fit accident prediction models were used to develop such models for the urban arterial sections of Vancouver and Richmond. These models are listed in Table 2, and they are models for predicting total accident frequency on a section in a period of 3 yrs. The symbols used in the table are: a) L = section length in km, b) V = annual average daily traffic in vehicles per day, c) UNSD = unsignalized intersection density in intersections per km, f) CROD = crosswalk density in crosswalks per km, g) NL = between-signal number of lanes, h) IRES = indicator variable for residential land use (equal to 1 if land use is residential; 0 otherwise), i) IBUS = indicator variable for business land use (equal to 1 if land use is for business; 0 otherwise), j) IUND = indicator variable for undivided cross section (equal to 1 if section has no median; 0 otherwise).

Both models have values of scaled deviance and Pearson χ² that are significant at the 95% confidence level, indicating that the models have an acceptable fit to the data. The t-ratios of the parameter estimates of both models are also significant at the 95% confidence level. It should be noted that the parsimonious model contains fewer variables than the best-fit model, which was developed using only the significant t-ratio criterion for retaining variables in a model.


5. OUTLIER ANALYSIS Most data sets contain a few unusual or extreme observations called outliers. Outliers occur in a set of data either because they are genuinely different from the rest of the data or because errors took place during data collection and recording. Errors can be detected and corrected, but how should outliers that are genuinely different from the rest of the data be dealt with? This paper proposes excluding outliers from model development only if they have extreme influence on the model equation. Such exclusion relieves the model equation of the sensitivity to a very small percentage of the data and establishes more confidence in its validity. However, it should be emphasized that excluding influential outliers from the development of an accident model is not synonymous with neglecting these outliers or removing them from the database. Rather, they should be investigated to determine what makes them different and whether any information can be extracted from them.

Kulmala (1995) used the leverage statistic to identify outliers that should be excluded from the model development. In the Normal linear regression case, the leverage of data point i is a measure of the distance of the p-dimensional point xi from the centroid of the other variate points in the p-dimensional design space, which is the vector space spanned by the column vectors of an n×p design or model matrix X. The design space will hereafter be referred to as the X-space. High-leverage data points are therefore outliers in the X-space, and they tend to pull the regression hyper-surface towards them because they exist alone in their region of the X-space.

The data point leverage values are the diagonal elements of the n×n hat matrix, which is the matrix that multiplies the vector of observations to yield the vector of predicted values. For Normal linear regression models, the hat matrix is H = X(X′X)^{-1}X′, and the ith diagonal element of H is h_ii = x_i′(X′X)^{-1}x_i. The value of h_ii lies between zero and unity. In the case of Normal linear regression models, a value of unity means that the value of the response predicted by the model is equal to the observed value (Jorgensen, 1993). Thus a large value of h_ii is an indication that the corresponding observation may be influential in determining the position of the fitted model.

The concept of leverage may be extended to other models of the GLM family. For these other models, the hat matrix is based on the IRLS procedure for model fitting. Therefore, H = W^{1/2}X(X′WX)^{-1}X′W^{1/2}, where W is the n×n diagonal matrix of weights:

w_{ii} = \frac{1}{Var(y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2    (11)

where µ_i = the mean of y_i and η_i = x_i′β is the linear predictor associated with y_i. In GLM, µ_i and η_i are related by a link function, which is the log-link for Poisson and negative binomial regression models. Therefore, for these models, µ_i = exp(η_i) and dµ_i/dη_i = exp(x_i′β). The ith diagonal element of H in the case of non-Normal GLM models is h_ii = w_ii x_i′(X′WX)^{-1}x_i.

It should be noted that in the case of non-Normal GLM models a data point i at the extreme of the X-range will not necessarily have high leverage if its weight wii is very small (McCullagh and Nelder, 1989). The sum of the leverage values hii, namely trace(H), is equal to the number of model parameters p. The average value of the leverage is therefore equal to p/n, where n is the number of data points. Many sources (e.g. Hoaglin and Welsch, 1978) consider that a high leverage is one that exceeds 2p/n and that data points with leverage in excess of this value should be subject to further consideration.
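The generalized hat matrix can be computed directly once the IRLS weights are known. The following sketch (simulated data; not the paper's database) fits a Poisson log-link model by IRLS, forms H = W^(1/2)X(X′WX)^(−1)X′W^(1/2) with wii = µi, and flags points whose leverage exceeds the 2p/n threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated count data (illustrative only, not the paper's database).
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5]))).astype(float)

# Fit a Poisson log-link model by IRLS.  For the log link, dmu/deta = mu
# and Var(y_i) = mu_i, so the weight in Eq. (11) reduces to w_ii = mu_i.
beta = np.zeros(p)
for _ in range(50):
    eta = X @ beta
    mu = np.exp(eta)
    XtW = X.T * mu                       # X'W (w_ii = mu_i on the diagonal)
    z = eta + (y - mu) / mu              # IRLS working response
    beta = np.linalg.solve(XtW @ X, XtW @ z)

# Generalized hat matrix H = W^(1/2) X (X'WX)^(-1) X' W^(1/2).
mu = np.exp(X @ beta)
WhX = np.sqrt(mu)[:, None] * X           # W^(1/2) X
H = WhX @ np.linalg.solve(X.T @ (mu[:, None] * X), WhX.T)
h = np.diag(H)

print(round(h.sum(), 6))                 # trace(H) = p, here 2.0
print(np.where(h > 2 * p / n)[0])        # points exceeding the 2p/n rule
```

Note that trace(H) = p holds exactly for any positive diagonal W, so it is a useful sanity check on the leverage computation.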

High-leverage data points have potential for being influential by virtue of their location in the X-space. However, high leverage alone is not sufficient to render a point influential and warrant excluding it from the model development. Influence requires that a point be extreme in its value of the response variable in addition to being an outlier in the X-space. An appropriate measure of influence is Cook's distance, which is defined as follows:


ci = [hii / (p(1 − hii))] (rPS′i)²        (12)

where rPS′i is the studentized Pearson residual of data point i. In the case of Poisson and negative binomial regression models, the studentized residual is given by:

rPS′i = (yi − ŷi) / √[(1 − hii)Var(yi)]        (13)

Thus Cook’s distance is made up of a component that reflects how well or how poorly the model fits the ith observation yi (the studentized residual based on yi − ŷi) and a component that measures how far the data point lies from the rest of the points in the X-space (the leverage hii).

Data points having a high Cook’s distance are influential points that play a significant role in determining the model parameter estimates. Situations in which a relatively small percentage of the data has a significant impact on the model may not be acceptable to model users, who are generally more comfortable assuming that a regression equation is valid when it is not overly sensitive to a few observations. Unfortunately, there is no rule for how high Cook’s distance must be before a data point becomes extremely influential. A procedure is therefore needed to decide, with a high degree of confidence, whether a data point is extremely influential. This paper proposes the following procedure for identifying extremely influential outliers:

1. The data are sorted in descending order of the ci values.
2. Starting with the data point having the largest ci value, data points are removed one by one, and the drop in scaled deviance is observed after each removal.
3. Points causing a significant drop in scaled deviance are extremely influential outliers.
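For the Poisson case, where scaled deviances can be compared directly, the steps above can be sketched as follows. This is a simplified illustration on simulated data with one artificially inflated observation, not the exact implementation used in the paper:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)

def fit_poisson(X, y, iters=50):
    """IRLS fit of a Poisson log-link model; returns the fitted means."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = np.exp(eta)
        XtW = X.T * mu                   # X'W with IRLS weights w_ii = mu_i
        z = eta + (y - mu) / mu          # working response
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return np.exp(X @ beta)

def poisson_deviance(y, mu):
    """Poisson scaled deviance; the y*log(y/mu) term vanishes when y = 0."""
    t = np.where(y > 0, y * np.log(np.where(y > 0, y / mu, 1.0)), 0.0)
    return 2.0 * np.sum(t - (y - mu))

# Simulated counts with one artificially inflated observation.
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.4]))).astype(float)
y[0] += 40                               # plant an influential outlier

mu = fit_poisson(X, y)
XtWX = X.T @ (mu[:, None] * X)
h = mu * np.einsum('ij,jk,ik->i', X, np.linalg.inv(XtWX), X)   # leverages
r = (y - mu) / np.sqrt((1 - h) * mu)     # studentized Pearson residual, Eq. (13)
c = h * r**2 / (p * (1 - h))             # Cook's distance, Eq. (12)

# Step 1: sort by Cook's distance; steps 2-3: remove points one by one and
# test each drop in scaled deviance against the chi-square critical value.
order = np.argsort(c)[::-1]
crit = chi2.ppf(0.95, 1)
mask = np.ones(n, dtype=bool)
D_ref = poisson_deviance(y, mu)
for i in order:
    mask[i] = False
    D_new = poisson_deviance(y[mask], fit_poisson(X[mask], y[mask]))
    if D_ref - D_new <= crit:
        mask[i] = True                   # not extremely influential: stop
        break
    D_ref = D_new                        # point i is an influential outlier

print("removed points:", np.where(~mask)[0])
```

The planted observation dominates the Cook's distance ranking and produces a deviance drop far above the critical value, so it is removed; the screening then stops at the first non-significant drop.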

The same analysis-of-deviance procedure used to develop parsimonious APMs can be used in the identification of extremely influential outliers. As mentioned before, the scaled deviances of two negative binomial regression models with different values of κ cannot be directly compared. Performing outlier analysis in the negative binomial case therefore requires that the value of κ for the model with n data points be imposed upon the model with n − 1 data points; the difference in scaled deviance can then be compared to χ²α,1 in order to assess whether the removed data point is an extremely influential outlier. The following section illustrates the procedure that is proposed by this paper to perform outlier analysis for negative binomial regression models.
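A minimal numerical sketch of this fixed-κ comparison is given below, using a negative binomial model fitted by IRLS on simulated data. For brevity κ is treated as known here; in the paper's procedure, κ(j) is the value estimated for the n-point reference model:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)

def fit_nb(X, y, kappa, iters=100):
    """IRLS fit of an NB log-link model with shape kappa held fixed.
    With Var(y) = mu + mu^2/kappa, the IRLS weight of Eq. (11) is
    w = mu^2/Var(y) = mu*kappa/(mu + kappa).  Returns fitted means."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = np.exp(eta)
        XtW = X.T * (mu * kappa / (mu + kappa))
        z = eta + (y - mu) / mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return np.exp(X @ beta)

def nb_deviance(y, mu, kappa):
    """NB scaled deviance with kappa known (scale = 1)."""
    t = np.where(y > 0, y * np.log(np.where(y > 0, y / mu, 1.0)), 0.0)
    return 2.0 * np.sum(t - (y + kappa) * np.log((y + kappa) / (mu + kappa)))

# Simulated overdispersed counts with one planted suspect point.
n, kappa = 80, 3.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(1.0 + 0.5 * X[:, 1])
y = rng.negative_binomial(kappa, kappa / (kappa + mu_true)).astype(float)
y[0] += 50

# Scaled deviance with all n points, kappa fixed.
D_n = nb_deviance(y, fit_nb(X, y, kappa), kappa)

# Remove the suspect point and refit with THE SAME kappa imposed,
# so the two scaled deviances are directly comparable.
mask = np.arange(n) != 0
D_n1 = nb_deviance(y[mask], fit_nb(X[mask], y[mask], kappa), kappa)

drop = D_n - D_n1
print(f"deviance drop = {drop:.1f}, critical value = {chi2.ppf(0.95, 1):.2f}")
```

A drop exceeding χ²0.05,1 = 3.84 identifies the removed point as an extremely influential outlier.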

5.1 Illustration of the Procedure for the Identification and Removal of Influential Negative Binomial Model Outliers

The flowchart in Figure 2 gives a complete description of the procedure that is proposed by this paper for the identification and removal of extremely influential negative binomial model outliers. Following are the definitions of the letter symbols used in the flowchart:

n = the total number of data points available.
i = an integer variable used as an index for the data points.
j = an integer variable used to refer to the model number.
κ(j) = shape parameter of the negative binomial distribution under Reference Model(j).
D(i) = drop in scaled deviance resulting from the removal of point i.

The term “reference model” is used here to denote a model that serves as the base for generating a new model using one less data point (the identified influential outlier).

The procedure outlined in Figure 2 was employed to identify and remove the extremely influential outliers of the parsimonious model in Table 2. It resulted in the removal of seven data points, and the final parsimonious APM for the urban arterials of Vancouver and Richmond is listed in Table 3.
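The final model in Table 3 can be applied directly to estimate the safety of an arterial section. The sketch below evaluates the Table 3 equation for a hypothetical section; the variable interpretations (L = section length, V = traffic volume, UNSD = unsignalized intersection density, CROD = crosswalk density, NL = number of lanes, IRES = residential land-use indicator) and the input units are assumptions inferred from Table 1, not definitions stated in this excerpt:

```python
import math

def acc_per_3yrs(L, V, UNSD, CROD, NL, IRES):
    """Predicted accidents per 3 years from the Table 3 parsimonious APM.
    Variable meanings and units are inferred from Table 1 (hypothetical):
    L section length, V average daily traffic, UNSD unsignalized
    intersections/km, CROD pedestrian crosswalks/km, NL through lanes,
    IRES = 1 for residential land use and 0 otherwise."""
    return (0.03608 * L**0.7798 * V**0.6337
            * math.exp(0.09445 * UNSD + 0.07944 * CROD
                       + 0.07633 * NL - 0.2174 * IRES))

# Hypothetical 4-lane residential section carrying 20,000 vpd.
print(round(acc_per_3yrs(1.0, 20000, 5, 2, 4, 1), 1))
```

The multiplicative form means each covariate scales the prediction by a constant factor; for example, the residential indicator multiplies the estimate by exp(−0.2174) ≈ 0.80.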


6. SUMMARY

This study presents a detailed discussion of two statistical issues related to accident modeling using Poisson and negative binomial regression: the issue of model building, i.e. selecting from the available safety database which explanatory variables to include in an accident prediction model, and the issue of the identification and removal of extremely influential model outliers. The study suggests procedures for model building and for conducting outlier analysis, and justifies the validity of these procedures. The procedures are straightforward to apply in connection with Poisson regression models. In the case of negative binomial regression models, however, there is an added complexity arising from the fact that the scaled deviances of two negative binomial regression models with different values of the shape parameter κ cannot be directly compared. The study therefore presents flowcharts detailing the application of the procedures in connection with negative binomial regression models. The procedures are then applied in the development of accident prediction models for the urban arterials of the cities of Vancouver and Richmond in the province of British Columbia, Canada.

ACKNOWLEDGEMENTS

Financial support for this study was provided by the Natural Sciences and Engineering Research Council of Canada and the Insurance Corporation of British Columbia.

REFERENCES

Aitkin, M., Anderson, D., Francis, B., Hinde, J., 1989. Statistical Modelling in GLIM. Oxford University Press, New York.


Hauer, E., Ng, J.C.N., Lovell, J., 1988. Estimation of safety at signalized intersections. Transportation Research Record 1185, 48-61.
Hauer, E., 1997. Observational Before-After Studies in Road Safety: Estimating the Effect of Highway and Traffic Engineering Measures on Road Safety. Pergamon Press, Elsevier Science Ltd., Oxford, U.K.
Hinde, J., 1996. Macros for fitting overdispersion models. GLIM Newsletter 26, 10-26.
Hinde, J., Demetrio, C.G.B., 1998. Overdispersion: models and estimation. Computational Statistics & Data Analysis 27, 151-170.
Hoaglin, D.C., Welsch, R.E., 1978. The hat matrix in regression and ANOVA. The American Statistician 32 (1), 17-22.
Jorgensen, B., 1993. The Theory of Linear Models. Chapman and Hall, New York.
Jovanis, P.P., Chang, H.L., 1986. Modeling the relationship of accidents to miles traveled. Transportation Research Record 1068, 42-51.
Kulmala, R., 1995. Safety at Rural Three- and Four-Arm Junctions: Development and Application of Accident Prediction Models. VTT Publications, Espoo, Finland.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models. Chapman and Hall, New York.
Miaou, S., Lum, H., 1993. Modeling vehicle accident and highway geometric design relationships. Accident Analysis & Prevention 25 (6), 689-709.
Numerical Algorithms Group, 1994. The GLIM System Release 4 Manual. The Royal Statistical Society, Oxford, UK.


[Figure 1 flowchart not reproducible from the text extraction.]

Figure 1 Flowchart Demonstrating the Process of Developing a Parsimonious Negative Binomial Regression Model


The flowchart in Figure 2 proceeds as follows:

1. Initialize i = 1 and j = 1. Develop a model using all n data points; this is Reference Model(1).
2. Obtain the Cook's distance of each data point as computer output for Reference Model(1).
3. Arrange the data points in order of decreasing Cook's distance and number them from 1 to n.
4. Remove point(i) and run a model with κ fixed to κ(j).
5. If D(i) is not significant, stop: Reference Model(j) is the final model.
6. Otherwise, set j = j + 1, develop Reference Model(j) using the remaining (n − i) points without fixing κ, set i = i + 1, and return to step 4.

Figure 2 Flowchart Demonstrating the Procedure for Identification and Removal of Negative Binomial Regression Model Outliers


Table 1 Safety Data Base Summary

General Statistics
                                                 Vancouver    Richmond       Total
Number of Street Sections by Median Type
  Raised-Curb Median                                 64           17           81
  Two-Way Left-Turn Lane (TWLTL)                      0           11           11
  Undivided Cross Section                           225           74          299
  Total:                                            289          102          391
Number of Street Sections by Land Use
  Residential                                       193           84          277
  Business                                           92           10          102
  Other                                               4            8           12
  Total:                                            289          102          391
Total Length by Median Type
  Raised-Curb Median                             48.857 km    14.943 km    63.800 km
  Two-Way Left-Turn Lane (TWLTL)                  0 km         9.543 km     9.543 km
  Undivided Cross Section                       181.419 km    65.504 km   246.923 km
  Total:                                        230.276 km    89.990 km   320.266 km

Accident Data (1994-1996)
                                                 Vancouver    Richmond       Total
  Fatalities                                         26           17           43
  Injuries                                         4478         1080         5558
  Property Damage Only (PDO) Accidents            12816         1722        14538
  Total:                                          17320         2819        20139
  PDO Accidents as Percentage of Total             74.0%        61.1%        72.2%

Range of Database Geometric and Traffic Data
                                                           Vancouver     Richmond
Geometric Data
  Number of through traffic lanes                             2-6           2-4
  Section length, meters                                   113-3608      405-2510
  Number of driveways per km (two-way total)                 0-58          0-95
  Number of unsignalized intersections per km (two-way)      0-21          0-9
  Number of pedestrian crosswalks per km (two-way)           0-12          0-9
  Number of bus stops per km (two-way total)                 0-25          0-12
Traffic Data
  Average daily traffic, vpd                              4236-62931    4232-33862


Table 2 Parsimonious and Best-Fit Accident Prediction Models

Parsimonious Model:
Acc per 3 yrs = 0.02907 × L^0.8015 × V^0.6571 × e^(0.09323×UNSD + 0.07663×CROD + 0.09115×NL − 0.2350×IRES)

Scaled Deviance     df     Shape Parameter κ     Pearson χ²     χ²0.05,384
     414.31        384           3.139              413.7          430.7

Variable      Constant      L         V        UNSD      CROD       NL       IRES
Coefficient    0.02907   0.8015    0.6571    0.09323   0.07663   0.09115   -0.2350
t-ratio        5.320    12.265     8.846      9.815     5.443     2.743    -3.188

Best-Fit Model:
Acc per 3 yrs = 0.01841 × L^0.8005 × V^0.6664 × e^(0.08561×UNSD + 0.07846×CROD + 0.08838×NL + 0.01509×BUSD + 0.1414×IUND + 0.01470×DRID·IBUS)

Scaled Deviance     df     Shape Parameter κ     Pearson χ²     χ²0.05,382
     414.34        382           3.210              424.4          428.6

Variable      Constant      L         V        UNSD      CROD       NL       BUSD      IUND    DRID·IBUS
Coefficient    0.01841   0.8005    0.6664    0.08561   0.07846   0.08838   0.01509   0.1414    0.01470
t-ratio        5.854    12.350     8.832      8.788     5.653     2.723     2.041     1.970     3.763

Table 3 Parsimonious Accident Prediction Model After Outlier Analysis

Model:
Acc per 3 yrs = 0.03608 × L^0.7798 × V^0.6337 × e^(0.09445×UNSD + 0.07944×CROD + 0.07633×NL − 0.2174×IRES)

Scaled Deviance     df     Shape Parameter κ     Pearson χ²     χ²0.05,377
     408.73        377           3.547              370.9          423.3

Variable      Constant      L         V        UNSD      CROD       NL       IRES
Coefficient    0.03608   0.7798    0.6337    0.09445   0.07944   0.07633   -0.2174
t-ratio        5.245    12.493     8.957     10.471     5.903     2.401    -3.050
