Metric for Web Accessibility Evaluation

Bambang Parmanto and Xiaoming Zeng*
Department of Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, Pittsburgh, PA 15260. E-mail: {parmanto, xizst9}@pitt.edu

A novel metric for quantitatively measuring the content accessibility of the Web for persons with disabilities is proposed. The metric is based on the Web Content Accessibility Guidelines (WCAG) checkpoints, an internationally accepted standard, that can be tested automatically using computer programs. Problems with current accessibility evaluation and the need for a good Web accessibility metric are discussed. The proposed metric is intended to overcome the deficiencies of the measurements used in current Web accessibility studies, and it meets the requirements of a measurement for scientific research. Examples of large-scale Web accessibility evaluations using the metric are given, covering a comparison of the Web accessibility of top medical journal Web sites and a longitudinal study of a single Web site over time. The validity of the metric was tested using a large number of Web sites with different levels of compliance (rating categories) with the WCAG standard. The metric, which uses a predetermined simple weighting scheme, compares well to the more complex C5.0 machine learning algorithm in separating Web sites into different rating categories.

*Current affiliation: Department of Health Services and Information Management, School of Allied Health Sciences, East Carolina University, Greenville, NC 27858. E-mail: [email protected].

Received February 13, 2004; revised September 22, 2004; accepted September 22, 2004.

© 2005 Wiley Periodicals, Inc. Published online 31 August 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20233

Journal of the American Society for Information Science and Technology, 56(13):1394–1404, 2005.

Introduction

The importance of measuring attributes of known objects in quantitative terms is crucial to advancing the state of science in any field. The Web, as one of the most interesting new objects of research, has generated many metrics to assist scientific investigation (Dhyani, Wee Keong, & Bhowmick, 2002). In this article, we propose a novel metric for measuring the content accessibility of the Web for persons with disabilities. Measuring Web accessibility in precise and quantitative terms is important for many reasons. First, it would enhance our understanding of the Web in general. It would also allow us to measure the current state of Web accessibility and to compare the accessibility of different Web sites, as well as the accessibility of a single Web site at different times. A continuous numerical measure would be preferable to the current dichotomous measure of accessibility. A continuous scale would allow not only a more precise measure of accessibility but would also lend itself to more advanced statistical analysis for evaluating large-scale aggregates of Web sites.

The current practice of evaluating Web accessibility uses a dichotomous method based on absolute compliance with the standard guidelines, known as the Web Content Accessibility Guidelines 1.0 (WCAG), developed by the World Wide Web Consortium (W3C). A Web site is determined to be accessible or inaccessible by evaluating it against the accessibility checkpoints provided by the WCAG. The WCAG contains 14 broadly phrased guidelines that are translated into 91 specific checkpoints explaining how the guidelines should be applied to specific content development scenarios. These checkpoints are organized into three levels of priority: Priority 1 contains 29 checkpoints that must be satisfied; Priority 2 contains 40 checkpoints that should be satisfied; and Priority 3 contains 22 checkpoints that may be satisfied.

Considering the number of checkpoints that a Web site must meet to be considered accessible, it is not surprising that accessibility studies have found most Web sites to be inaccessible. Even complying with the basic accessibility requirements of Priority 1 would be difficult: Any violation of the 29 Priority 1 checkpoints, such as forgetting to designate alternate text for one of the images on the site, renders a Web site inaccessible under this dichotomous measurement. This type of measurement also leads to inaccuracies in accessibility labeling; that is, the majority of Web sites that claim to be fully accessible in fact violate the guidelines with which they are supposed to comply. Our study found that only 8.81% of Web sites that claim AAA conformance (conforming to WCAG Priority 1, 2, and 3 checkpoints) truly comply with all three priorities (see Table 1).

The current accessibility measurement also does not take into account the size and complexity of a Web site. A large Web site with hundreds or thousands of Web pages has a higher chance of violating the checkpoints than a simple Web site with only a handful of pages. An accessibility metric that takes size and complexity into account would allow fair comparison between Web sites or aggregates of Web sites.


Sullivan and Matson (2000) were the first to propose the idea of continuous accessibility measurement: measuring accessibility in terms of "degrees" instead of the dichotomous accessible–inaccessible. However, they did not discuss the detailed calculation of a continuous metric; instead, they ranked Web sites into four accessibility degrees: highly accessible, mostly accessible, partly accessible, and inaccessible. A numerical metric with continuous values would provide better discriminating power and promote a scientific approach to Web accessibility issues. In this article, we propose a novel metric for quantitatively measuring the accessibility of the Web. The metric is developed using the WCAG guidelines as a starting point; more precisely, it is based on the WCAG checkpoints that can be tested automatically using computer programs. The metric is intended as an estimate of accessibility, while real measures of accessibility (automated or otherwise) require additional manual checking and human judgment.

Background and Related Work

Web Accessibility Guidelines

Numerous guidelines have been developed to assist Web designers in making Web sites accessible to persons with disabilities. In the 1990s, Web accessibility information was available from organizations such as the Trace Research and Development Center at the University of Wisconsin and from companies such as IBM. One of the earliest Web content design standards for access by people with disabilities was developed by the City of San Jose, California (Paciello, 2000). In 1997, the Australian standards for accessible Web design were made available to Web page authors (Australian Human Rights & Equal Opportunity Commission, 1997). In the same year, the W3C established the Web Accessibility Initiative (WAI), which published the WCAG 1.0 as its final recommendation in 1999 (WAI, 1999).

Two major specifications serve as normative guidelines for Web content accessibility design: the WCAG and the U.S. Access Board's Electronic and Information Technology Accessibility Standards (known as the Section 508 guidelines). The WCAG is a stable international specification developed through a voluntary industry consensus. The Section 508 guidelines were announced in December 2000, pursuant to the U.S. rulemaking process required by Section 508 of the Rehabilitation Act Amendments of 1998. Both specifications offer checklists that Web developers should follow with regard to content accessibility for people with disabilities. The two specifications largely overlap: Only three of the checkpoints defined in Section 508 are not mentioned in the WCAG. The WCAG has more comprehensive checkpoints than Section 508, and it assigns a priority level to each checkpoint to reflect the severity of a violation.

The WCAG contains 14 broadly phrased guidelines that are translated into 91 specific checkpoints that explain how the guidelines should be applied to specific content development scenarios. These checkpoints are organized into three levels of priority: Priority 1 contains 29 checkpoints that must be satisfied; Priority 2 contains 40 checkpoints that should be satisfied; and Priority 3 contains 22 checkpoints that may be satisfied.

The Web Accessibility Initiative has introduced the WCAG Conformance Logos to further promote accessibility on the Web. Content providers can use these logos on their sites to indicate a claim of conformance to a specific level of the WCAG. The Web Accessibility Initiative expects that the use of these logos on conformant sites will help raise awareness of accessibility issues. The definitions of the conformance levels are as follows (a small sketch after this list makes the resulting dichotomy concrete):

• Conformance level A: All Priority 1 checkpoints are satisfied.
• Conformance level AA: All Priority 1 and Priority 2 checkpoints are satisfied.
• Conformance level AAA: All Priority 1, 2, and 3 checkpoints are satisfied.
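The following minimal Java sketch is ours, not part of the WCAG or any evaluation tool; the class, method, and label names are illustrative. It shows how this rating scheme collapses violation counts into a single label, and why a single Priority 1 violation is enough to fail even level A.

```java
// A minimal sketch of the dichotomous WCAG conformance rating; names are illustrative.
public class WcagConformance {

    /** Returns the conformance level implied by the violation counts per priority. */
    static String conformanceLevel(int p1Violations, int p2Violations, int p3Violations) {
        if (p1Violations > 0) return "not accessible"; // any Priority 1 violation fails level A
        if (p2Violations > 0) return "A";              // Priority 1 clean, Priority 2 violated
        if (p3Violations > 0) return "AA";             // Priorities 1 and 2 clean
        return "AAA";                                  // all checkpoints satisfied
    }

    public static void main(String[] args) {
        // One forgotten alt attribute (a single Priority 1 violation) fails level A:
        System.out.println(conformanceLevel(1, 0, 0)); // not accessible
        System.out.println(conformanceLevel(0, 0, 2)); // AA
    }
}
```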

The Need for an Accessibility Metric

Since the WCAG was adopted by the W3C and Section 508 of the Rehabilitation Act became law, there have been numerous studies of the Web accessibility of various categories of Web sites. These studies used the WCAG as the basis for measuring accessibility and used the automated assessment tool Bobby (Watchfire Corp., 2004) for evaluation. Such studies have usually painted a gloomy picture of the state of accessibility of the Web. A recently completed study of the accessibility of the 30 most popular French Web sites found that none meets conformance level A (Research Institute for Networks and Communications Engineering [RINCE], 2003). A similar study conducted in Ireland found that at least 94% of the 159 Web sites tested failed to meet the minimum accessibility standard (level A), and not one site met the professional-practice accessibility guidelines of levels AA and AAA (McMullin, 2002).

The results of previous studies are often confusing and conflicting. A study of the accessibility of U.S. federal Web sites revealed that only 13.5% of the 148 sites had zero errors (Stower, 2002), indicating that they could be considered level AAA or "Bobby approved." This study generated much publicity (Deaukantas, 2002; Emery, 2002), partly because all U.S. federal Web sites were supposed to have complied with Section 508 of the Rehabilitation Act of 1973 by June 25, 2001. An earlier study conducted by a Brown University researcher found that 37% of U.S. government Web sites are accessible (West, 2001). Yet another study found that only 1% of U.S. federal government Web sites are Bobby approved (Jackson-Sanborn, Odess-Harnish, & Warren, 2002), where approval is defined as meeting Priority 1 (level A) without a user check. All of these studies employed Bobby, an automated accessibility assessment tool, and used the absolute measure of accessibility. The low rate of accessibility among government Web sites makes a good media story, but it is hardly informative for scientific or policy purposes.



We argue that the confusion and conflicting results stem from problems with the metrics used in the studies. The current method of evaluating Web site accessibility relies on a simple rating based on conformance to the priority checkpoints set forth in the WCAG. The current rating system and the so-called Bobby-approved measurement reflect an absolute metric of accessibility: Either a site conforms to all checkpoints or it is considered inaccessible.

To illustrate the problems with the current dichotomous absolute accessibility measures, we evaluated a large sample of Web sites that considered themselves accessible. We selected 449, 374, and 318 Web sites that were self-rated as A, AA, and AAA, respectively, for a total of 1,141 Web sites (see the Gold Standard section for detailed information on the sample). We then added 377 randomly selected Web sites that violate Priority 1 and are therefore inaccessible. We evaluated the accessibility of all 1,518 Web sites to check their conformance to each priority level of the WCAG, using only the checkpoints that can be evaluated automatically by a computer program. The results of the evaluation are presented in Table 1.

It is surprising that even among the Web sites that considered themselves to have a AAA conformance level, only 8.81% are truly AAA. Several previous studies used the AAA criteria or Bobby approval as the criterion for accessibility (Jackson-Sanborn et al., 2002; Stower, 2002). The percentage of Web sites conforming to the AAA criteria is significantly lower for self-declared AA and A Web sites (4.28% and 1.11%, respectively), and for randomly selected Web sites the percentage conforming to the AAA standard approaches zero. The results would have been worse had all 91 checkpoints been checked manually and had all pages of each Web site been evaluated, rather than checking the 25 automatable checkpoints on only the main page of each site.

The table helps explain the results of the previous studies. The Irish study of 159 Web sites (McMullin, 2002), which found a level A conformance-failure rate of 93.7% and failure rates of 100% for both AA and AAA, and the French study (the 30 most-visited Web sites, with a 100% level A conformance-failure rate) are consistent with our results for randomly chosen Web sites. These overly pessimistic results show the weaknesses of the absolute measure of accessibility used in the studies.

TABLE 1. Percentage of Web sites with priority violations based on 25 checkpoints, by self-rated Web site category (number of Web sites).

Conformance   Nonrated (377)   A (449)        AA (347)       AAA (318)
True A        1.59% (6)        72.83% (327)   96.71% (336)   97.26% (309)
True AA       1.59% (6)        7.67% (34)     17.65% (61)    16.35% (52)
True AAA      0% (0)           1.11% (5)      4.28% (15)     8.81% (28)

Because a single checkpoint violation in a priority level renders a Web site inaccessible, only a small percentage of Web sites can be considered accessible. Such results are of little help in shedding light on the state of the accessibility of the Web. A different, better measurement is needed for scientific exploration as well as for policy formulation.

The Need for Automatic Evaluation

The number of unique Web pages was estimated at 2.1 billion as of July 2000, growing at a rate of 7.5 million pages per day (Murray & Moore, 2000). The deep, hidden Web, consisting of pages stored in Web-connected back-end databases, is estimated at 550 billion invisible Web documents (Bergman, 2001). The Web is characterized not only by its sheer enormity but also by its fluidity: Web sites constantly change. The average lifespan of a Web page today is 100 days, as estimated by the Internet Archive Project (Weiss, 2003). A study published in Science found that the prevalence of inactive Web references in prestigious scientific journals is 10% after 15 months (Dellavalle et al., 2003).

Given the nature of the Web, automatic scoring and evaluation is preferable to, and more productive than, manual scoring. Automated Web accessibility evaluation has several advantages over nonautomated evaluation: lower cost, less time to complete the evaluation, greater consistency in the accessibility problems uncovered, a reduced need for accessibility expertise, and the possibility of incorporating accessibility evaluation into the Web development process. Similar arguments have been made for automated usability evaluation (Ivory & Hearst, 2001). There is an even more compelling argument for automatic Web accessibility evaluation: An internationally accepted guideline with detailed checkpoints exists. Automatic scoring allows evaluation to be conducted against a large number of Web sites in a short time and at minimal cost.

Properties of a Good Web Accessibility Metric

To overcome the deficiencies of the current absolute metric, we propose an accessibility metric that satisfies several requirements. First, accessibility must be measured as a quantitative score with a continuous range of values from perfectly accessible to completely inaccessible. A quantitative numerical score allows assessment of change in Web accessibility over time as well as comparison between Web sites or between groups of Web sites. Instead of an absolute measure that categorizes Web sites only as accessible or inaccessible, an assessment using the metric can answer the fundamental scientific question: More or less accessible, compared to what? (Tufte, 1997).

Second, the metric and its range of values must have discriminating power beyond simply accessible and inaccessible. A metric with good discriminating power allows assessment of the rate of change of Web accessibility over time or of a significant difference in accessibility between the Web sites under consideration. An accessibility assessment


using the metric can then answer the fundamental scientific question: At what rate? (Tufte, 1997).

Third, the metric must be fair, taking into account and adjusting for the size and complexity of Web sites. Web sites may range from a single home page to large corporate sites comprising thousands of pages. A metric that takes size and complexity into account allows a fair comparison between Web sites of various sizes.

Fourth, the metric should be scalable enough to support large-scale Web accessibility studies. Large-scale accessibility assessments require a metric that supports aggregation and second-order statistics such as the standard deviation. For a large-scale study, efficiency is paramount.

Finally, the measurement should be normative, meaning that it should be derived from standard Web accessibility guidelines such as the WCAG or Section 508.

The proposed metric is designed to work with an automated accessibility evaluation method. Although the metric is a proxy indicator of Web accessibility, not a real measure of accessibility from the user's experience, it is practical and has many strengths. One of the primary strengths of an automated scoring system is objectivity: It allows objective comparisons between sites, between categories, and between points in time. It also allows large-scale assessment of aggregates of Web sites. Assessing conformance to all WCAG checkpoints potentially requires detailed testing and evaluation of every page of a Web site against each checkpoint by an expert human tester; imagine evaluating 100 Web sites of 1,000 pages each. Large-scale manual evaluation would be time consuming and prohibitively expensive. As McMullin (2002) has argued, it is much preferable to have available some concrete, comprehensive data relating to Web accessibility on a large scale, even if the data are incomplete.

Novel Accessibility Metric: The Web Accessibility Barrier

One of the conclusions we can draw from the literature review is that currently accepted evaluation methods for Web accessibility have two primary weaknesses. First, most of them consider only the absolute number of accessibility violations present on a Web page. Simply counting the number of violations, without considering the number of potential violations (e.g., the number of image elements when checking for images without alternative text), favors pages with simple designs and may underestimate the effort a designer has put into a complex Web site. Second, most evaluations do not present their results as a single integrated measurement score that represents the total accessibility barrier of a Web page or Web site. Instead, the results are mostly presented by category, according to the checkpoints, guidelines, or priorities of the WCAG. Although such a presentation provides a rough outline of the distribution of Web accessibility across different Web sites, it is hard to use a categorical measurement to compare two Web pages. These

weaknesses might explain why the number of Bobby accessibility violations increases when pages are better designed, as one study indicated (Ivory & Hearst, 2002).

Because the WCAG and Section 508 largely overlap, with the WCAG being both more comprehensive and the internationally accepted standard, we used the WCAG as the foundation for the accessibility metric we developed. The number of violations of each checkpoint is a component of our scoring method, called the Web Accessibility Barrier (WAB) score. A Web page with fewer accessibility checkpoint violations (e.g., fewer images lacking an alternative description) is considered to present fewer barriers to people with disabilities and receives a lower WAB score. Because we are interested in automatically evaluating the accessibility of a Web site, checkpoints demanding manual checking are not included in the calculation of the WAB score. For example, conformance to the rule "If you use color to convey information, make sure the information is also represented another way" cannot be verified until a manual check is done. For a list of the Web accessibility rules that need to be checked manually, see the WCAG references (WAI, 1999).

As discussed in the Background section, the WCAG attaches one of three priority levels to each checkpoint based on its impact on accessibility for people with disabilities. In weighting the calculation of the WAB score, we used the priority levels in reverse order: Priority 1 violations are weighted three times more heavily than Priority 3 violations because people with disabilities have the most difficulty accessing Web pages with Priority 1 violations. However, using only the number of violations of the accessibility checkpoints may bias the measurement. For example, one Web page with five "image without alternative text" violations may have 500 image objects embedded in it, while another page with one such violation may contain only a single image. The developer of the first page may have paid considerable attention and devoted great effort to complying with the accessibility specifications, while the developer of the second page may be completely unaware of Web accessibility. Therefore, the number of actual violations of a checkpoint must be normalized against the number of potential violations of that checkpoint. In this example, the actual violations are the image objects without alternative text, and the potential violations include all image objects on the page.

The WAB score of a Web site is the average WAB score of all Web pages within the site. Figure 1 summarizes the calculation of the WAB score as a formula. A lower score means fewer accessibility barriers for people with disabilities, while a higher score indicates more barriers; a score of zero denotes that the Web site violates no accessibility guidelines and should present no accessibility barriers. Theoretically, the WAB formula can be used to calculate WAB scores based on all 91 checkpoints in all WCAG priorities. However, because we focus only on the checkpoints that can be evaluated by an automated system, we used only 25 of the checkpoints: 7 in Priority 1, 13 in Priority 2, and 5 in Priority 3 (see the Appendix for a detailed description of the 25 checkpoints).



$$\mathrm{WAB\ score} = \frac{\sum_{p}\sum_{v}\left(\dfrac{n_v}{N_v}\right)w_v}{N_p}$$

where
  p   : the pages of a Web site
  v   : the violations on a Web page
  n_v : the number of actual violations
  N_v : the number of potential violations
  w_v : the weight of a violation, in inverse proportion to its WCAG priority level
  N_p : the total number of pages checked

FIG. 1. The WAB formula.
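The formula translates directly into code. The Java sketch below is our own minimal rendering of Figure 1, not the authors' Kelvin implementation; the record and method names are illustrative.

```java
import java.util.List;

// A minimal sketch of the WAB formula in Figure 1; not the Kelvin implementation.
public class WabScore {

    /** One checkpoint evaluated on one page: n_v, N_v, and the WCAG priority (1-3). */
    record CheckpointResult(int actual, int potential, int priority) {}

    /** w_v is in inverse proportion to the priority: Priority 1 -> 3, 2 -> 2, 3 -> 1. */
    static int weight(int priority) {
        return 4 - priority;
    }

    /** Sums (n_v / N_v) * w_v over every checkpoint of every page, then divides by N_p. */
    static double wabScore(List<List<CheckpointResult>> pages) {
        double total = 0.0;
        for (List<CheckpointResult> page : pages) {
            for (CheckpointResult c : page) {
                if (c.potential() > 0) { // checkpoints with no potential violations contribute 0
                    total += ((double) c.actual() / c.potential()) * weight(c.priority());
                }
            }
        }
        return total / pages.size(); // N_p: total number of pages checked
    }

    public static void main(String[] args) {
        // One page: 5 of 500 images lack alt text (Priority 1),
        // and its 1 table lacks a summary (Priority 3).
        var page = List.of(new CheckpointResult(5, 500, 1),
                           new CheckpointResult(1, 1, 3));
        System.out.printf("WAB = %.2f%n", wabScore(List.of(page))); // (5/500)*3 + (1/1)*1 = 1.03
    }
}
```

The normalization by N_v is what keeps the 500-image page in the example above from being penalized more heavily than a one-image page with the same violation ratio.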

We have developed a program called Kelvin that implements this accessibility metric. Kelvin is a Java-based program consisting of two main modules: a Web crawler and an accessibility evaluator. The Web crawler is a lightweight automated crawler that follows links to visit Web pages; we did not use other available Web crawlers because many of them are too complex to be easily customized to our specific tasks. The crawler accesses pages at remote Web sites and determines the number of potential violations of the accessibility checkpoints. The accessibility evaluator then checks the potential violations against the 25 WCAG checkpoints and calculates the accessibility score, as illustrated by the sketch below.
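As an illustration of what the evaluator does for a single automatable checkpoint, the following sketch counts actual and potential violations of checkpoint g9 (images without alternative text; see the Appendix). It is our own sketch, not Kelvin's code; it assumes the jsoup HTML parser, and the URL is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch of one automated check (g9: alternative text for all images); not Kelvin's code.
// Requires the org.jsoup:jsoup library.
public class AltTextCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical page; the crawler module would supply the fetched documents.
        Document doc = Jsoup.connect("https://example.org/").get();

        int potential = doc.select("img").size();         // N_v: all <img> elements
        int actual = doc.select("img:not([alt])").size(); // n_v: images with no alt attribute

        double ratio = potential == 0 ? 0.0 : (double) actual / potential;
        System.out.printf("g9: %d of %d images lack alt text (n_v/N_v = %.3f)%n",
                          actual, potential, ratio);
    }
}
```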

Examples of Accessibility Evaluation

In this section we describe the application of the metric to the assessment of Web sites at different granularities: a comparison of the accessibility of different Web sites and a comparison of the accessibility of Web sites at different times. Our previously published study (Zeng & Parmanto, 2004) provides an example of the use of the WAB measure to compare different categories of consumer health information Web sites (education, government, commercial, portal, and community).

Scientific Journal Accessibility

The first example is an accessibility study of top medical journal Web sites. We used the 2001 ranking based on impact factor as reported by the ISI Citation Index (Science Citation Index, 2002) and selected the journals with impact factors higher than 1.0, which yielded 38 of the 112 journals ranked for general medicine. Previous accessibility evaluation studies evaluated only the main page (home page) of a Web site. Home pages, however, are not always a good representation of an entire Web site's accessibility; journal Web sites are a good example of this. Instead of using only the home page, we evaluated the home page and all pages at the second level. The number of pages evaluated for each Web site ranged from 8 to 893. We excluded the Web sites that only allowed access to the main page (three Web sites) and the journals with no Web site (six journals), resulting in 29 journal Web sites for this study. Figure 2 shows the computed score for each journal. The best score, 3.98, was achieved by the American Journal of Medicine; the worst score, 12.55, by Proceedings of the Association of American Physicians. Under the dichotomous measure, all of these Web sites would be considered inaccessible, even at the least stringent conformance level (A).

FIG. 2. The WAB scores of 29 top medical journals (lower score is better).

Study of Web Accessibility Over Time

The second example shows how the metric can be used to conduct a longitudinal study of Web accessibility over time. In this example, we evaluated the accessibility of one Web site to observe how it changed over time: the Web site of the Food and Drug Administration (FDA), a U.S. federal agency (http://www.fda.gov). To conduct this study, we used the Wayback Machine, a service of the Internet Archive and Alexa Internet. The Internet Archive began archiving the rapidly changing Web in 1996 as a preservation effort. By 2001, when the Wayback Machine became available to the public, allowing people to access and use archived versions of stored Web sites, it already held over 100 terabytes of data and was growing at 12 terabytes per month (Yaukey, 2003). This Internet "library of sorts" allowed us to look back and analyze what has happened in Web page design and accessibility over time.

For each archived year, we selected one sample of the full Web site at a specific time. For convenience, we used the first archived instance of each year. If we were unable to use the first instance (because of a Failed Connection, Path Index Error, File Not Found Error, or some other Wayback Machine error), we used the next archived instance for the year, and so on. As a result, we evaluated eight distinct archived instances, representing the Web site from 1997 to 2004. The graph in Figure 3 shows that the estimated accessibility of the FDA Web site worsened from 1998 to 2000 and from 2001 to 2003, as shown by the increase in WAB scores. The scores level off or improve slightly from 1997 to 1998 and from 2000 to 2001. The graph also shows a significant improvement in the accessibility barrier scores from 2003 to 2004.

FIG. 3. The WAB score of the Food and Drug Administration's (FDA) Web site (http://www.fda.gov) over time.

Testing the Validity of the Metric

Reliability of the Metric

Because the measurement uses data acquired directly by automatic machine processing, it involves no subjective judgment or probabilistic variation. The results of

the measurement objectively reflect, to a certain extent, the state of the content accessibility of a Web page. Traditional reliability measurements (interrater, test–retest, parallel forms, and internal consistency) are not applicable to our metric.

Validity of the Metric

Gold standard. To evaluate the validity of the numerical metric, a gold standard has to be employed. A gold standard is a reference standard for the evaluation of a novel diagnostic test; in this study, the novel test is the WAB score. Several candidate measurements could be adopted as the gold standard

in this study. First, we could use persons with disabilities as judges of the accessibility of a Web page. Although ideal, this approach appears impractical: People with disabilities are themselves a very diverse group with regard to the types and levels of their disabilities, and the accessibility requirements of each subgroup are very specific and often conflicting. An extreme example is a text-dominated Web page, which is very accessible to visually impaired people but inaccessible to a person with a learning disability (Bohman, 2003).

A second alternative is a comprehensive evaluation of Web pages following the Web accessibility standards, that is, manual checking. The WAI published a template for comprehensively evaluating the accessibility level of a Web page (WAI, 2002). It involves multiple steps, a variety of tools, and large amounts of manual checking. The overhead of such a measurement is tremendous, and it is prohibitively expensive for large numbers of Web pages.

A third choice of gold standard is to rely on certain types of accreditation. Because the WCAG is designed to serve the broadest spectrum of disabilities, it is a good candidate for a gold standard of accessibility. The WAI has introduced the WCAG Conformance Logos, levels A to AAA, as discussed in the Background section, to further promote accessibility on the Web. Content providers can use these logos on their sites to indicate a claim of conformance to a specific level of the WCAG 1.0. After content providers make their Web pages conform to the WAI checkpoints, they can add a WAI logo to their pages; the level of conformance determines which logo they can use. Because the WAI logos are themselves images embedded in the HTML of a Web page, they have alternative text bound



with them. For example, the alternative text for the WCAG 1.0 level A conformance logo is "Level A Conformance icon, W3C-WAI Web Content Accessibility Guidelines 1.0." By default, a conformance icon refers to a single page unless the scope of the claim explicitly states otherwise.

The logo system from the WAI is a potential candidate for a gold standard measurement. It is comprehensive, covering the broadest range of disabilities, and it is cost-effective. However, it has several drawbacks that may compromise the study's results. The logo system is self-rated and, as discussed in the Background section, this self-rating is not perfect. Even so, the logo system is still an optimal gold standard for the measurement study: Although imperfect, a logo indicates that the content provider has done significant work to remove accessibility barriers from the Web site. Additionally, the public nature of the Web puts pages bearing these logos under constant scrutiny, and anyone can issue a charge or complaint against any noncompliance with the WAI checkpoints.

We used the Google search engine to acquire Web pages that would serve as the gold standard. The pages returned by the search engine were examined to confirm the existence and type of Web accessibility logo they were using. As a negative group, one without WAI logos, we used the home pages of the top 500 Web sites provided by Alexa. The individual pages in the negative group were further examined to confirm the absence of a WAI logo. The sample of Web sites used in this analysis is the same as the one presented in Table 1.

Results

The results of applying the metric to the Web sites collected as the gold standard are presented in Figure 4. The results show that the WAB metric provides a good continuous representation of the Web sites' estimated accessibility. On average, Web sites that considered themselves AAA have better WAB scores than those that considered themselves AA, which in turn have better scores than A sites, which have better scores than nonrated Web sites. The WAB metric thus provides continuous "degrees" of estimated accessibility. The average scores of the AAA, AA, A, and nonrated Web sites are 2.02, 2.74, 4.47, and 10.5, respectively.

Figure 4 also shows a number of Web sites in the rated categories (levels AAA, AA, and A) whose scores are worse than the average score of the nonrated Web sites. These outliers appear at the top of the box-and-whisker plot as dots representing data outside the 95% confidence interval. There were also nonrated Web sites that scored better than the average AAA Web site, shown by the dots below the nonrated box. The figure shows that the WAB metric is capable of separating Web sites by estimated accessibility across different levels of the accessibility spectrum. We tested how well the scores of level AAA, AA, A, and nonrated Web sites are separated from one another.

FIG. 4. The WAB scores of Web sites across different levels of conformance.

The performance of the metric in predicting an individual Web site's category (whether a Web site belongs to AAA, AA, or A) was calculated using the Receiver Operating Characteristic (ROC) curve (Egan, 1975). The ROC curve is more appropriate for this metric validity test than simply testing the mean differences of the scores with a one-way ANOVA (Friedman & Wyatt, 1997). The ROC curve is commonly used to assess the ability of a predictor to discriminate between two possible outcomes. An ROC curve is drawn by connecting the points defined by the true positive fraction (TPF) and the false positive fraction (FPF) at different cutoff points along the measurement. The area under the curve (AUC) reflects the discriminating power of the test: Perfect separation between the two categories yields an AUC of 1.0 and a curve that follows the left and upper edges of the plot, while complete nonseparation yields an AUC of 0.5 and a curve along the diagonal. Another merit of the ROC curve is that a specific cutoff or criterion point with the preferred sensitivity and specificity can be located on the curve. We used the ROCKIT program from the University of Chicago to conduct the ROC analysis (University of Chicago, 2004).

Figure 5 shows the ROC curves drawn from different cutoff points of the WAB score for the gold standard Web sites. The curves measure how well the metric separates adjacent levels of Web site rating categories (nonrated–A, A–AA, and AA–AAA). The separation between nonrated and A is the strongest, while the separation between levels AA and AAA is the weakest. Table 2 shows the AUC of the metric in separating the different categories of accessibility ratings. The results show a clear separation between nonaccessible Web sites and Web sites with a level A rating: The AUC is 0.917, and the separation between nonrated and level A Web sites is highly significant (p-value < 0.0001). The separation between level AAA and nonrated Web sites is even higher, with an AUC of 0.982 and a p-value < 0.0001. Even the weakest separation, between AA and AAA, is significant (p-value < 0.0001).
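For readers who want to reproduce this kind of analysis without ROCKIT, the empirical AUC can be computed directly as a Mann–Whitney statistic: the probability that a randomly chosen site from the better-rated group has a lower WAB score than a randomly chosen site from the other group. The Java sketch below is ours, not the ROCKIT method (ROCKIT fits a binormal model rather than counting pairs), and the scores in main are made-up illustrations, not data from this study.

```java
// Empirical AUC as a Mann-Whitney statistic; a sketch, not the ROCKIT software's method.
public class EmpiricalAuc {

    /** Fraction of (rated, nonrated) pairs in which the rated site scores lower (better). */
    static double auc(double[] rated, double[] nonrated) {
        double wins = 0.0;
        for (double r : rated) {
            for (double n : nonrated) {
                if (r < n) wins += 1.0;       // lower WAB = fewer barriers
                else if (r == n) wins += 0.5; // ties count half
            }
        }
        return wins / ((double) rated.length * nonrated.length);
    }

    public static void main(String[] args) {
        double[] levelA   = {2.1, 3.5, 4.4, 5.0};   // hypothetical WAB scores
        double[] nonrated = {6.2, 9.8, 10.5, 12.0};
        System.out.printf("AUC = %.3f%n", auc(levelA, nonrated)); // 1.000: full separation
    }
}
```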


FIG. 5. Receiver Operating Characteristic (ROC) curve of the WAB score in separating level A from nonrated, level AA from level A, and level AAA from level AA Web sites.

TABLE 2. AUC (Area Under the Curve) of the WAB score in separating AAA, AA, and A from nonrated Web sites.

        Nonrated–A   A–AA    AA–AAA   Nonrated–AA   Nonrated–AAA
AUC     0.917        0.689   0.513    0.972         0.982

Note. All AUC values are significant (p-value < 0.0001), suggesting that all are significantly different from 0.5.

We subsequently used machine learning methods to compare the performance of the simple weighting scheme used in our WAB metric with a more complex decision tree method, the C5.0 machine learning algorithm, which learns from the data set (Quinlan, 1993). The main difference between the WAB score and the machine learning method is that the WAB score uses a simple predetermined weighting that is inversely proportional to the priority level of the violation (a weight of 3 for Priority 1, 2 for Priority 2, and 1 for Priority 3), whereas the machine learning method learns from the data set and assigns optimal weights to each of the 25 individual checkpoints. The purpose of this comparison is to see how good the simple weighting scheme is compared to an optimized complex weighting scheme.

We used Clementine 7.0 from SPSS Inc. (SPSS, 2003) to construct the C5.0 model. The ratios of violations (actual violations over potential violations) for each of the 25 automated checkpoints were used as the input variables of the model; the output variable is the level of WCAG conformance (nonrated, A, AA, or AAA). Unlike the WAB score, the decision tree algorithm was used to separate only two levels of conformance at a time (nonrated–A, A–AA, etc.). We used a threefold cross-validation method to estimate the accuracy of the decision tree model: Approximately two thirds of the Web sites were used to construct the model and one third to test it. The proportion of Web sites in each conformance level was maintained when dividing the data sets during cross-validation. The parameters for decision tree construction were a pruning severity of 0.75 and a minimum of 5 records per child branch. We selected a rule set as the output of the model created by the C5.0 algorithm; the tree depths for all three data sets are two levels. We also employed the ROC curve to evaluate the performance of the decision tree on each data set.

Figure 6 shows the ROC curves drawn from different cutoff points of the complex weighting scores generated by the C5.0 machine learning algorithm for the same Web sites. The C5.0 algorithm performs well in separating the adjacent accessibility categories, especially AA–AAA, compared to the WAB score. Table 3 shows the AUC of C5.0 in separating the different levels of accessibility. As expected, the machine learning performance is better than that of the simple weighting scheme used in the WAB metric.

FIG. 6. The ROC performance of the C5.0 machine learning algorithm in separating level A from nonrated, level AA from level A, and level AAA from level AA Web sites.

TABLE 3. AUC (Area Under the Curve) of the machine learning method (C5.0) in separating AAA, AA, and A from nonrated Web sites.

        Nonrated–A   A–AA    AA–AAA   Nonrated–AA   Nonrated–AAA
AUC     0.962        0.787   0.769    0.983         0.983

Note. All have p-value < 0.001 (p values are one-tail).
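For concreteness, the sketch below shows the shape of the input fed to such a classifier: one row per Web site, with the 25 violation ratios as features and the site's self-rated conformance level as the label. The Java rendering and names are ours; the actual model in this study was built in Clementine, as described above.

```java
import java.util.Arrays;

// Sketch of a training row for the conformance classifier; names are illustrative.
public class FeatureRow {

    /** One Web site: ratios of actual to potential violations, plus its label. */
    record SiteRow(double[] ratios, String label) {}

    static SiteRow toRow(int[] actual, int[] potential, String label) {
        double[] ratios = new double[actual.length];
        for (int i = 0; i < actual.length; i++) {
            ratios[i] = potential[i] == 0 ? 0.0 : (double) actual[i] / potential[i];
        }
        return new SiteRow(ratios, label); // label: "nonrated", "A", "AA", or "AAA"
    }

    public static void main(String[] args) {
        // Two checkpoints shown for brevity; the real rows have 25 entries.
        SiteRow row = toRow(new int[]{5, 1}, new int[]{500, 1}, "A");
        System.out.println(Arrays.toString(row.ratios()) + " -> " + row.label());
    }
}
```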



TABLE 4. Comparison of AUC between the simple weighting of the WAB metric and the machine learning method (C5.0) in separating AAA, AA, and A from nonrated Web sites.

                Nonrated–A   A–AA        AA–AAA      Nonrated–AA   Nonrated–AAA
WAB             0.917        0.689       0.513       0.972         0.982
C5.0            0.962        0.787       0.769       0.983         0.983
Significance    p < 0.001    p < 0.001   p < 0.001   p = 0.0186    p = 0.4467

Table 4 compares the AUC values of the WAB score and the C5.0 method. It shows that the performance of the WAB metric is as good as that of the complex C5.0 method in separating the rated categories from the nonrated one. The differences in AUC values for separating two

of the three (AA–nonrated and AAA–nonrated) are not significant. Although the difference in the AUC values for separating A–nonrated is significant, the performance of the WAB metric there is also excellent (0.917). Significance was computed using a bivariate statistical analysis with the null hypothesis that the two ROC curves are the same; the analysis was conducted with the ROCKIT software. Hanley and McNeil (1982, 1983) provide detailed statistical background on the calculation of AUC and on the significance test for comparing two ROC curves.

The C5.0 algorithm is significantly better than the WAB score in separating AA–AAA. It is also better in separating A–AA, although the performance of the WAB metric is good there as well (0.689). We dissected the decision tree built with the C5.0 algorithm to examine how it differs from the WAB score in assigning weights to the accessibility checkpoints. The WAB score considers all 25 checkpoints, while the C5.0 algorithm selects only a subset of them. The numbers of checkpoints in the WAB score for Priority 2 and Priority 3 are 13 and 5, respectively (a ratio of 13 to 5); the numbers of checkpoints selected by the C5.0 algorithm for Priority 2 and Priority 3 are 4 and 3 (a ratio of 4 to 3). This might explain why the decision tree algorithm performs significantly better in separating AA and AAA: It gives more weight to the Priority 3 checkpoints (the checkpoints that separate AA from AAA). The distribution of checkpoints selected by C5.0 is shown in Table 5.

The results show that the simple weighting method used in the WAB score performs well compared to the more complicated decision tree method on the critical separation tasks (separating the rated categories from the nonrated one). Because the decision tree is more complicated and produces different weighting schemes for different data sets, the simplicity and reliability of the simple weighting scheme make it more attractive.

TABLE 5. Checkpoints used by C5.0 (see Appendix for checkpoint IDs).

                          Nonrated–A (ID)              A–AA (ID)                    AA–AAA (ID)
Priority 1 checkpoints    5 (g9, g10, g2, g39, g240)   5 (g9, g10, g2, g39, g240)   5 (g9, g10, g2, g39, g240)
Priority 2 checkpoints    4 (g104, g265, g269, g271)   4 (g104, g265, g271, g273)   4 (g104, g265, g269, g271)
Priority 3 checkpoints    3 (g31r, g35r, g125r)        3 (g31r, g35r, g125r)        3 (g31r, g35r, g125r)

Limitations of the Metric

The accessibility metric we developed is intended for objective and systematic measurement of the accessibility of the Web. It is not appropriate for checking the accessibility of an individual Web site for the purpose of accessibility repair or remediation. It is also intended as a proxy measure of accessibility, not a real measure of accessibility, which requires manual checking and human judgment. The metric does not take into account the location of a barrier, which could affect the usability of the Web site (the higher a barrier sits in a Web site's hierarchy, the greater its potential impediment to usability). The location of the barrier could be incorporated in future revisions of the metric.

Conclusion

We propose a novel metric for measuring Web accessibility that meets the requirements of a measurement for scientific research. The metric can be used for objective evaluation and for comparing accessibility between different Web sites, between different groups of Web sites, and between Web sites or groups of Web sites at different points in time. This simple metric compares well to the more complex machine learning method. We believe that the availability of an objective metric will open doors to a scientific approach to Web accessibility studies. The study also has important implications for similar automated measurement metrics by showing the feasibility of automated assessment metrics for Web accessibility.

Acknowledgments

This research was supported, in part, by grants #42-60I02013 from the National Telecommunications and Information Administration (NTIA) and #H133A021916 from the National Institute on Disability and Rehabilitation Research (NIDRR). The authors would like to thank Sjarif Ahmad for developing Kelvin, which is used in this research, and Stephanie Hackett for conducting the analysis in the longitudinal study.

References

Australian Human Rights & Equal Opportunity Commission. (1997). Disability standards and guidelines. Retrieved December 2, 2003, from http://www.hreoc.gov.au/disabiltiy_rights/standards/standards.html
Bergman, M.K. (2001). The deep Web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1). Retrieved from http://www.press.umich.edu/jep/07-01/bergman.html
Bohman, P.R. (2003). Visual vs. cognitive disabilities. Retrieved December 8, 2003, from http://www.webaim.org/techniques/articles/vis_vs_cog
Deaukantas, P. (2002). Think tank report: Federal Web sites need better accessibility. Retrieved March 4, 2004, from http://www.gcn.com/vol1_no1/s508/19757-1.html
Dellavalle, R.P., Hester, E.J., Heilig, L.F., Drake, A.L., Kuntzman, J.W., Graber, M., et al. (2003). Information science: Going, going, gone: Lost Internet references. Science, 302(5646), 787–788.
Dhyani, D., Wee Keong, N., & Bhowmick, S.S. (2002). A survey of Web metrics. ACM Computing Surveys, 34(4), 469–503.


Egan, J.P. (1975). Signal detection theory and ROC analysis. New York: Academic Press.
Emery, G.R. (2002). Survey: Agency Web sites make progress, still have far to go. Retrieved March 4, 2004, from http://www.washingtontechnology.com/news/1_1/daily_news/18859-1.html
Friedman, C.P., & Wyatt, J.C. (1997). Evaluation methods in medical informatics. New York: Springer.
Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
Hanley, J.A., & McNeil, B.J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148(3), 839–843.
Ivory, M.Y., & Hearst, M.A. (2001). The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys, 33(4), 470–516.
Ivory, M.Y., & Hearst, M.A. (2002). Improving Web site design. IEEE Internet Computing, 6(2), 56–63.
Jackson-Sanborn, E., Odess-Harnish, K., & Warren, N. (2002). Web site accessibility: A study of six genres. Library Hi Tech, 308–317.
McMullin, B. (2004, July). Users with disability need not apply? Web accessibility in Ireland. First Monday, 9(7). Retrieved from http://www.firstmonday.org/issue9_7/marincu/
Murray, B.H., & Moore, A. (2000). Sizing the Internet. Retrieved December 2, 2003, from http://www.cyveillance.com/web/downloads/Sizing_the_Internet.pdf
Paciello, M.G. (2000). Web accessibility for people with disabilities. Berkeley, CA: CMP Books.
Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Research Initiative for Network and Communication Engineering (RINCE) and Association BrailleNet. (2003). Des sites web français de la vie quotidienne sont inaccessibles aux personnes handicapées [Popular French Web sites are inaccessible to people with disabilities]. Retrieved December 1, 2003, from http://braillenet.org/eval_30sites_France_10sept2003.htm

Science Citation Index. (2002). Philadelphia: Thomson ISI.
SPSS Inc. (2003). Clementine 7.0. Retrieved September 1, 2003, from http://www.spss.com
Stower, G. (2002). The state of federal websites: The pursuit of excellence. Arlington, VA: The PricewaterhouseCoopers Endowment for the Business of Government.
Sullivan, T., & Matson, R. (2000, November). Barriers to use: Usability and content accessibility on the Web's most popular sites. In Proceedings of CUU 2000, ACM Conference on Universal Usability, Arlington, VA.
Tufte, E.R. (1997). Visual explanations: Images and quantities, evidence and narrative. Cheshire, CT: Graphics Press.
University of Chicago. (2004). ROCKIT [Computer program]. Retrieved January 10, 2004, from ftp://random.bsd.uchicago.edu/roc/ibmpc/
Watchfire Corp. (2004). Bobby [Computer program]. Retrieved June 10, 2004, from http://bobby.watchfire.com/
Web Accessibility Initiative (WAI). (1999). Web Content Accessibility Guidelines 1.0. Retrieved August 1, 2003, from http://www.w3.org/TR/WCAG10/
Web Accessibility Initiative (WAI). (2002). Evaluating Web sites for accessibility. Retrieved December 8, 2003, from http://www.w3.org/WAI/eval/Overview.html
Weiss, R. (2003, November 24). On the Web, research work proves ephemeral: Electronic archivists are playing catch-up in trying to keep documents from landing in history's dustbin. Washington Post, p. A08.
West, D.M. (2001). WMRC global e-government survey. Retrieved December 2, 2003, from http://www.insidepolitics.org/egovt01int.html
Yaukey, J. (2003). Archive site preserves earliest Web pages. Retrieved from http://www.gannettonline.com/e/trends/15000566.html
Zeng, X., & Parmanto, B. (2004). Web content accessibility of consumer health information Web sites for people with disabilities: A cross sectional evaluation. Journal of Medical Internet Research, 6(2), e19. Retrieved from http://www.jmir.org/2004/2/e19/

Appendix

Checkpoints used by the WAB score.

WAI Priority   ID     Checkpoint                                                                       Determining the number of potential violations
1              g9     Provide alternative text for all images.                                         All <img> elements
1              g21    Provide alternative text for each APPLET.                                        All <applet> elements
1              g20    Provide alternative content for each OBJECT.                                     All <object> elements
1              g10    Provide alternative text for all image-type buttons in forms.                    All <input type="image"> elements
1              g240   Provide alternative text for all image map hot-spots (AREAs).                    All <area> elements
1              g38    Each FRAME must reference an HTML file.                                          All <frame> elements
1              g39    Give each frame a title.                                                         All <frame> elements
2              g271   Use a public text identifier in a DOCTYPE statement.                             1 (a)
2              g104   Use relative sizing and positioning (% values) rather than absolute (pixels).    All <table>, <th>, <td>, and <frame> elements
2              g2     Nest headings properly.                                                          All heading elements
2              g37    Provide a NOFRAMES section when using FRAMEs.                                    All <frameset> elements
2              g4     Avoid blinking text created with the BLINK element.                              Same as the number of true violations (b)
2              g5     Avoid scrolling text created with the MARQUEE element.                           Same as the number of true violations (b)
2              g33    Do not cause a page to refresh automatically.                                    1 (a)
2              g254   Do not cause a page to redirect to a new URL.                                    1 (a)
2              g269   Make sure event handlers do not require use of a mouse.                          Number of event handlers for both keyboard and mouse
2              g41    Explicitly associate form controls and their labels with the LABEL element.      Number of form elements such as <input>, <select>, and <textarea>
2              g34    Create link phrases that make sense when read out of context.                    Number of <a> elements
2              g265   Do not use the same link phrase more than once when the links point to different URLs.   Number of <a> elements
2              g273   Include a document TITLE.                                                        1 (a)
3              g14    Client-side image map contains a link not presented elsewhere on the page.       Number of <area> elements
3              g125   Identify the language of the text.                                               1 (a)
3              g31    Provide a summary for tables.                                                    Number of <table> elements
3              g109   Include default, place-holding characters in edit boxes and text areas.          Number of <input type="text">, <textarea>, and <select> elements
3              g35    Separate adjacent links with more than white space.                              Number of links

(a) This element appears only once in an accessible Web page.
(b) This element is considered a violation of the WCAG whenever it appears in a Web page.