THE SWEDISH DRIVING-LICENSE TEST A Summary of Studies from the Department of Educational Measurement, Umeå University

Widar Henriksson Anna Sundström Marie Wiberg

Em No 45, 2004

ISSN 1103-2685 ISRN UM-PED-EM--45--SE

INTRODUCTION
THE DRIVER EDUCATION IN SWEDEN
HISTORY OF THE SWEDISH DRIVER EDUCATION AND DRIVING-LICENSE TESTS
CRITERION-REFERENCED AND NORM-REFERENCED TESTS
IMPORTANT ISSUES IN TEST DEVELOPMENT
  Test specification
  Item specifications
  Item format
  Evaluation of items
  Try-out
  Validity
  Reliability
  Parallel test versions
  Standard setting
  Test administration
  Item bank
EMPIRICAL STUDIES OF THE THEORY TEST
  A new curriculum and a new theory test in 1990
    Judgement of items – difficulty
    Parallel test versions
  A theoretical description of the test
    Test specifications
    Item format
    Try-out
    Standard setting in the theory test
  Traffic education in upper secondary school – an experiment
  Analysis of the structure of the curriculum and the theory test
    Judgement of items – the relation between the curricula and the content of the items
  Aspects of assessment in the practical driving-licence test
    A detailed curriculum
    A model for judgement of competencies
  The computerisation of the theory test
  Methods for standard setting
    Standard setting for the theory test used between 1990-1999
    Standard setting for the new theory test introduced in 1999
  Item bank for the theory test
  A sequential approach to the theory test
  Results of the Swedish driving-license test
    Parallel test versions and the relationship between the tests
    Private or professional education
    Validating the results
  Driver education's effect on test performance
  Driver education in the Nordic countries
  Curriculum, driver education and driver testing
    Assessing the quality of the tests – reliability and validity
    Assessment of attitudes and motives
FURTHER RESEARCH

Introduction

Since 1990, the Department of Educational Measurement at Umeå University has been commissioned by the Swedish National Road Administration (SNRA) to study the Swedish driving-license test. Over the past few years several studies have been conducted in order to develop and improve the Swedish driving-license test, the majority of them focusing on the theory test. The aims of this paper were threefold: firstly, to describe the development of driver education and the driving-license test in Sweden during the past century; secondly, to summarize the findings of our research, which relates to important issues in test development; and finally, to make some suggestions for further research.

The driver education in Sweden

The present driver education consists of a theory part and a practical part. Since driver education is voluntary, learner-drivers can choose professional and/or private education. Driver instruction refers to professional education at a driving school, and driving practice refers to lay-instructed driver training. In order to engage in driver instruction or driving practice, the learner-driver needs a Learner's Permit. In September 1993 the age limit for driving practice was lowered from 17½ years to 16 years (SFS 1992:1765). It is common for learner-drivers in Sweden to combine driver instruction with driving practice (Sundström, 2004): the learner-drivers receive intensive driver instruction at the driving school and practise the exercises at home, for example under the supervision of their parents. There are certain criteria that a person has to meet in order to be approved as a lay instructor for a learner-driver: for example, the person has to be at least 24 years old and have held a driving license for a minimum of five years (SFS 1998:488). The driver education reflects the curriculum, which consists of nine main parts (VVFS 1996:168). To determine whether the student has gained enough competence according to the curriculum, a driving-license test is taken. The test consists of two examinations, a theory test and a practical test. Five of the nine parts of the curriculum are tested in the theory test and the remaining four parts are tested in the practical test (see Table 1).

Test-takers have to be 18 years old and pass the theory test before they are allowed to take the practical test.

Table 1. The nine content areas of the driving-license test.

Theory test:
- Vehicle-related knowledge
- Traffic regulations
- Risky situations in traffic
- Limitation of driver abilities
- Special applications and other regulations

Practical test:
- Vehicle-related knowledge
- Manoeuvring the vehicle
- Driving in traffic
- Driving under special conditions

History of the Swedish driver education and driving-license tests

During the past century, the number of vehicles in traffic has increased rapidly, which has been reflected in many new statutes and regulations. Franke, Larsson and Mårdsjö (1995) described the development of the Swedish driver-education system. The growth of motoring created a need to assess drivers' knowledge and abilities. The content of the theory education and the practical driver education, and the knowledge and abilities required to pass the driving-license test, have increased over time. The trend in driver education and the driving-license test is that the focus has shifted from teaching students about the construction and manoeuvring of the vehicle to judgement of risk and the driver's behaviour in traffic.

The first regulation for motor traffic was introduced in 1906. In order to drive a car a person needed a certificate. To obtain the certificate, the person had to demonstrate his or her theoretical knowledge and practical ability to a driving examiner. In 1916 the knowledge required for obtaining a driving license became more extensive: the driver now had to demonstrate his or her knowledge of the construction and management of the vehicle and the most necessary traffic regulations. The requirements for obtaining a driving license became even stricter in 1923, when the driving examiner was required to judge whether the person was suitable as a driver. In 1927, the opinion was that the practical test should be the main part of the driving-license test. The practical test should be conducted in different traffic situations so that the driving examiner could assess the driving skill, presence of mind and judgement of the test-taker (Molander, 1997).

In 1948 the education in driving theory was supplemented with some new parts that dealt with the responsibilities of the driver and accidents in traffic; these new areas were also reflected in the driving-license test. At the time the driving-license test consisted of three examinations: a written test, an oral examination and a practical test. The written test consisted of twenty-five items that were to be completed in fifteen minutes. The purpose of the oral examination was to check the test-taker's understanding of traffic-related problems. The practical test involved at least ten minutes of actual driving where either the test-taker or the driving examiner decided the route. The results of the three tests were considered in the final judgement of the test-taker. There were clear directives on what knowledge was required in order to pass the test. Later, some problems with the practical test were noted: it was found that the difficulty of the practical test varied a great deal depending on when the test was taken (Franke et al., 1995).

In the 1950s the responsibilities of the driver were emphasised to a greater extent than before. This change was based on the opinion that the personality of drivers affected their behaviour in traffic. The purpose of the theory test was to check that the learner-driver had knowledge that improved his or her judgement in traffic. For a long time, the focus of the education in driving theory had been how the vehicle was constructed. Now, consideration of other road users and judgement in traffic were considered the most important parts of the theory test. Even though it was important to improve the judgement of the learner-driver in traffic, the focus was still the practical education. At the end of the 1960s the theory education and the practical education were integrated. It became important that the learner-driver understood the content of the education in driving theory, rather than just learning it (Franke et al., 1995).

In 1971 a new curriculum was introduced and two years later a new differentiated theory test was employed. The theory test was composed of a basic test and one or more supplementary tests. The basic test had to be taken by all test-takers, irrespective of the type of certificate applied for. The supplementary tests were selected according to the type of certificate (motorcycle, car/light truck, heavy truck etc.) applied for. The theory test was a written test that contained 80 multiple-choice items for AB-applicants (car/light truck). The basic test comprised 60 items and the cut-off score was set to 51 (85%).

The supplementary test for AB-applicants consisted of 20 items and the cut-off score was 15 (75%). The scoring model was conjunctive, which means that the test-taker had to pass both theory tests. The items consisted of a question and three options, of which only one was correct. The content of the test was not changed very often, so eventually test-takers came to know many of the items before the test (Franke et al., 1995; Spolander, 1974).

In 1989 the curriculum was changed (Trafiksäkerhetsverket, 1988) and both the practical and the theory test were altered. The practical test was meant to cover the content of the curriculum to a greater extent than before. Five areas of competence (speed, manoeuvring, placement, traffic behaviour and attentiveness) were introduced, and the judgement of traffic situations in the practical test was to be related to these competences. The judgement of the practical test was changed from an assessment where the test-taker obtained a grade on a scale from one to five to an assessment where the test-taker either passed or failed the test. In the theory education, two new content areas were introduced (Trafiksäkerhetsverket, 1988). These areas focused on risky situations in traffic and the limitations of driver abilities. The driver education was extended and the theory education was planned to be more effective. The new objectives of the curriculum had to be covered by the new test, so at the same time as the new curriculum was introduced a new theory test was constructed. When the new curriculum was introduced, it was decided that test-takers had to pass the theory test before they were allowed to take the practical test (Mattsson, 1990).

The new theory test was introduced in January 1990, nine months after the introduction of the curriculum (Mattsson, 1990). The test was administered in six versions and each version consisted of forty items. All items, except for one, were multiple-choice items. The one item with a different format consisted of four descriptions of the meaning of traffic signs that were to be paired with four out of eight pictures of traffic signs. The number of options in the new test was increased from three to four, and the test-takers did not know how many options were correct. In order to get one point on an item the test-taker had to identify all the correct options; a test-taker who answered, say, three out of four options correctly received no point.
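As a minimal sketch of this all-or-nothing scoring rule (the option labels and keyed answers below are hypothetical, chosen for illustration), an item is awarded one point only if the selected set of options exactly matches the keyed set:

```python
def score_item(selected: set[str], keyed: set[str]) -> int:
    """All-or-nothing scoring: one point only if the test-taker
    marks exactly the keyed options, no more and no fewer."""
    return 1 if selected == keyed else 0

# Hypothetical item with options A-D, where B and D are keyed correct.
print(score_item({"B", "D"}, {"B", "D"}))       # 1 point
print(score_item({"B"}, {"B", "D"}))            # 0 points: a correct option missed
print(score_item({"A", "B", "D"}, {"B", "D"}))  # 0 points: an incorrect option marked
```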

This item format is rarely used in other countries, partly because of its inconsistency: the number of correct answers is sometimes known and sometimes unknown to the test-taker. The content areas of the theory test were given different weights according to the curriculum. The weights of the different parts were regulated through the number of items, and the content area that contained the most items was "traffic regulations". In order to pass the theory test, most of the criteria in the curriculum had to be met, and the test-taker was not allowed to be found lacking in any content area. The scoring of the test was both compensatory and conjunctive, which means that the test-taker could pass the test in two ways. One way was to reach a score of 36 out of 40 (90% of the total score) or higher. The other way was to reach the specific cut-off score for each content area and have a total score of 30 or more (Mattsson, 1993).

Table 2. Number of items and cut-off score for the different content areas of the theory test (1990-1999).

Content area                                  Number of items   Cut-off score
Traffic regulations                           14                11 (79%)
Risky situations in traffic                    8                 5 (63%)
Limitation of driver abilities                 8                 5 (63%)
Vehicle-related knowledge                      3                 1 (33%)
Special applications and other regulations     7                 4 (57%)
Total                                         40                30 (75%) (area cut-offs + 4 items correct) or 36/40 (90%)
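To make the combined rule concrete, here is a minimal sketch in Python of the 1990-1999 pass/fail decision as described above and in Table 2 (the dictionary keys and function name are our own, chosen for illustration):

```python
# Area cut-off scores from Table 2 (theory test, 1990-1999).
AREA_CUTOFFS = {
    "traffic_regulations": 11,
    "risky_situations": 5,
    "limitation_of_driver_abilities": 5,
    "vehicle_related_knowledge": 1,
    "special_applications": 4,
}

def passes_theory_test(area_scores: dict[str, int]) -> bool:
    """Combined conjunctive-compensatory rule for the 40-item test:
    pass with 36/40 (90%) overall, or with at least 30/40 (75%)
    provided every content area reaches its own cut-off score."""
    total = sum(area_scores.values())
    if total >= 36:                      # purely compensatory route
        return True
    meets_all_areas = all(
        area_scores[area] >= cutoff for area, cutoff in AREA_CUTOFFS.items()
    )
    return total >= 30 and meets_all_areas   # conjunctive route

# Example: area minimums met (26 points) plus 4 extra items correct = 30.
scores = {"traffic_regulations": 12, "risky_situations": 6,
          "limitation_of_driver_abilities": 6, "vehicle_related_knowledge": 2,
          "special_applications": 4}
print(passes_theory_test(scores))  # True
```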

The curriculum introduced in 1989 was used until 1996, when it was revised to include more environmental aspects (VVFS 1996:168). In June 1999 a new theory test was introduced. The new test had the same content areas as the old test but a different item format (VVFS 1999:32). The new test consists of sixty-five multiple-choice items with only one correct option for each item. Most items have four options, and the items are proportionally distributed over the five content areas with the old theory test as a model, i.e. the relation between the content areas is the same in the new test as in the old one. "Traffic regulations" is still the area that contains the most items.

Five try-out items that do not count toward the score are also included in each test. The cut-off score is set to 52 out of 65 (80%); the basis for this decision was that there should not be any change in the level of difficulty between the old and the new theory test. The scoring model is compensatory: lack of knowledge in some area can be compensated for with greater knowledge in other areas (Wolming, 2000b). There are various methods for standard setting, but the decision to set the cut-off score at 52 was not based on any of them (Wiberg & Henriksson, 2000). Instead, a statistical model based on data for the same test-takers taking both the old and the new theory test was used.

The practical test assesses the four main parts of the curriculum that relate to practical driving. The performance of the test-taker is assessed with respect to five competences (VVFS 1996:168) that are related to the driver's awareness of risks in traffic. The first competence is the driver's speed adaptation in different traffic situations. The second is the driver's ability to manoeuvre the vehicle. The third is the driver's placement of the vehicle in traffic. The fourth is the driver's traffic behaviour, and the fifth is the driver's attentiveness to various situations in traffic. During the practical test different traffic situations are observed. These traffic situations are divided into five types: handling the vehicle, driving in a built-up area, driving in a non-built-up area, a combination of driving in a built-up and a non-built-up area, and driving under special conditions, e.g. darkness and slippery roads. The performance in these situations is related to the five competences mentioned earlier. If the test-taker fails in any competence, the driving examiner notes in what traffic situation the error occurred. One error is sufficient to fail the test-taker.

In the following sections the process of test construction and important issues in test development will be considered. When constructing a test it is important to consider whether the performance of the test-taker is to be compared with the performance of other test-takers or with some external criterion, i.e. whether the test is a criterion- or norm-referenced test.


Criterion-referenced and norm-referenced tests

In general, tests can provide information that aids individual and institutional decision-making. Tests can also be used to describe individual differences and the extent of mastery of basic knowledge and skills. These two general areas of test application lead to two approaches to measurement and, as a consequence, to two kinds of test: norm-referenced tests (NRT) and criterion-referenced tests (CRT). This formal differentiation of two general approaches to test construction and interpretation has its origin in an article by Glaser (1963), which outlines the two approaches. The main difference between them is that a CRT is used to ascertain a test-taker's status with respect to a well-defined criterion, whereas an NRT is used to ascertain a test-taker's status with respect to the performance of other test-takers on that test.

From a more detailed perspective, Popham (1990) also defines two major distinctions between CRT and NRT. The first relates to the criterion, with CRT focusing on a well-defined criterion and a well-defined content domain. The specification and description of the domain, and the concept of domain itself, is expressed in terms of learner behaviours, and the specification of the instructional objectives associated with these behaviours is central in CRT. The criterion is performance with regard to the domain. NRT focuses more on general content or process domains such as vocabulary and reading comprehension. Thus, the difference rests on a tighter, more complete definition of the domain for CRT as compared to NRT. In some cases CRT also includes specification of the performance standard, which may for example take the form of specifying the number of items to be answered correctly or the number of objectives to be mastered.

The other major distinction relates to the interpretation of a test-taker's score: CRT describes the score with respect to a criterion, NRT with respect to the scores of other test-takers. An example of an NRT in Sweden is the SweSAT, which is used for selection to higher education (Andersson, 1999), and an example of a CRT is the national tests that are used as an aid in the grading procedure for teachers in upper secondary school (Lindström, Nyström & Palm, 1996).

A closer look at the theory test, the instructional objectives of the driver education (as they are defined in the curriculum issued by the SNRA) and the interpretation of test scores leads to the conclusion that the theory test can be characterised as a CRT. The curriculum represents the criterion, and the theory test consists of five different parts (Table 1) that are connected to the curriculum. The purpose of the test is to determine whether a test-taker has acquired a certain level of knowledge compared with the defined criterion; standard setting is used to define this level of knowledge (Mattsson, 1993).

Important issues in test development

The first and most important step in test development is to define the purpose of the test, or the nature of the inferences intended from test scores. The measurement literature is filled with breakdowns and classifications of the purposes of tests, and in most cases the focus is on the decision that is made on the basis of the test information (see for example Bloom, Hastings and Madaus, 1971; Mehrens & Lehman, 1991; Gronlund, 1998; Thissen & Wainer, 2001). The theory test is used to make decisions about test-taker performance with reference to an identified curricular domain. Curricular domain is defined here as the skills and knowledge intended or developed as a result of formal, or non-formal, instruction on identifiable curricular content.

Test specification

When the purpose of the test is clarified, the next logical step in test development is to specify important attributes of the test. Test content is, in most cases, the main attribute. Other important attributes include test and item specifications, item format and the design of the whole test, as well as psychometric characteristics, evaluation and selection of items, and standard-setting procedures. These attributes are also dependent on external factors, such as how much testing time is available and how the test can be administered. Millman & Green (1989), for example, distinguished between external contextual factors (for instance, who will be taking the test and how the test will be administered) and internal test attributes (for instance, the desired dimensionality of the content and its distribution among content components, item formats, evaluation of items and desirable psychometric characteristics of both individual items and the whole test).

With reference to internal attributes, Henriksson (1996a) also made a distinction between two kinds of models: a theoretical model, based mainly on judgements but also on statements about, for example, the number of items in the test and the item type, and an empirical model, based on empirical data describing psychometric characteristics of the items as well as of the whole test. The theoretical and empirical models are summarised in the test specifications.

One effective way to ensure adequate representation of items in a test is to develop a two-way grid called a test blueprint or table of specification (Nitko, 1996; Haladyna, 1997). In most cases the two-way grid includes content and the types of mental behaviours required of the test-taker when responding to each item. Haladyna (1999), for example, suggested that all content can be classified as representing one of four categories: fact, concept, principle, or procedure. He also defined five cognitive operations: recall, understanding, prediction, evaluation and problem solving. Another well-known hierarchical system is the taxonomy by Bloom (1956), consisting of six major categories. This hierarchical system has also been elaborated (Anderson et al., 2001). However, the behaviour dimensions should not be too complex, and it can be claimed that the Bloom taxonomy has never achieved any great success as a tool for test construction, maybe because it is too complex. Perhaps the revised model will be a step forward in that respect. But, as Henriksson (1996a) pointed out, the matrix schemes for the composition of a test need not be limited to the dimensions of content and process. More dimensions can be added by considering, for example, the item's reading level, the amount of text and the formation of distractors. Other factors that can be considered are surplus information, the degree of non-verbal information, whether the item is abstract or concrete, and so forth. The effort to create these theoretical attributes and to establish a theoretical model for the test is based on judgements by experts. These added dimensions give more guidance to the test developer and, at the same time, make the model for the whole test more exact.
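As an illustration of such a two-way grid, the sketch below crosses the five content areas of the theory test with Haladyna's five cognitive operations. The row totals happen to match the item counts of the 1990-1999 test (Table 2), but the cell-level split is invented for the example and is not taken from the actual test specifications:

```python
# A test blueprint as a two-way grid: content areas (rows, from Table 1)
# crossed with Haladyna's (1999) cognitive operations (columns).
# The item counts in each cell are hypothetical, for illustration only.
OPERATIONS = ["recall", "understanding", "prediction", "evaluation", "problem solving"]

blueprint = {
    "Traffic regulations":                        [6, 4, 2, 1, 1],
    "Risky situations in traffic":                [1, 2, 3, 1, 1],
    "Limitation of driver abilities":             [2, 3, 2, 1, 0],
    "Vehicle-related knowledge":                  [2, 1, 0, 0, 0],
    "Special applications and other regulations": [3, 2, 1, 1, 0],
}

# Row and column sums show how the test content is distributed.
for area, counts in blueprint.items():
    print(f"{area:45s} {counts} total={sum(counts)}")
col_totals = [sum(col) for col in zip(*blueprint.values())]
print("Items per operation:", dict(zip(OPERATIONS, col_totals)))
```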


Item specifications

There is a dependence between test and item specifications, since the theoretical as well as the empirical model for a certain test is related to the attributes of the items. Therefore, most of the item specifications are outlined when the test specifications are defined. An item specification includes the item format, the item content, descriptions of the problem situations, characteristics of the correct response and, in the case of multiple-choice items, characteristics of the incorrect responses. The use of item specifications is particularly advantageous when a large item pool is to be created and when different item writers will construct the items. If each writer sticks to the item specification, a large number of parallel items can be generated for an objective within a relatively short time (Crocker & Algina, 1986).

Different types of information should be stored for each item. First, information used to access the item from a number of different points of view should be stored. This information usually consists of keywords describing the item content, its curricular content, its behavioural classification and any other salient features, for example the textual and graphical portions of the item. Different kinds of judgements by experts provide this theoretical information (Henriksson, 1996b). Second, psychometric data should be stored, such as item difficulty and item discrimination indices. Third, and of relevance to the theory test, different indices of exposure should be stored for each item: the number of times the item has been used in a given period, the date of its last use, and the identification of the last test version the item appeared in. It should also be noted that the storage of empirical item statistics represents a measurement problem of its own. Under classical test theory, item statistics are group dependent and must therefore be interpreted within the context of the group tested (Linn, 1989). When item response theory (IRT) is used as the basis for empirical item statistics, this disadvantage of group dependence is eliminated, i.e. it is possible to characterise or describe an item independently of any sample of test-takers who might respond to it (see for example Lord, 1980; Hambleton et al., 1991; Thissen & Orlando, 2001).
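A minimal sketch of an item record holding these three kinds of information might look as follows. The field names are our own invention; an operational item bank would of course need a richer schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ItemRecord:
    # 1. Access information, based on expert judgement.
    item_id: str
    keywords: list[str]                       # item content descriptors
    content_area: str                         # curricular classification
    cognitive_operation: str                  # behavioural classification
    # 2. Psychometric data from try-outs and operational use.
    difficulty_p: Optional[float] = None      # proportion answering correctly
    discrimination: Optional[float] = None    # e.g. point-biserial correlation
    # 3. Exposure indices, of particular relevance to the theory test.
    times_used: int = 0
    last_used: Optional[date] = None
    last_test_version: Optional[str] = None

item = ItemRecord(
    item_id="TR-0042",
    keywords=["right of way", "intersection"],
    content_area="Traffic regulations",
    cognitive_operation="understanding",
)
```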


Item format

Generally speaking, the test developer faces the issues of what to measure and how to measure it. For most large-scale testing programmes, test blueprints and cognitive demands specify what to measure. Regarding the question of how to measure, one dilemma facing test developers is the choice of item format. This issue is, according to Rodriguez (2002), significant in a number of ways. One factor is that interpretations vary according to item format; a second is that the cost of scoring open-ended questions can be enormous compared with multiple-choice items. A third factor is that the consequences of using a certain item format may affect instruction in ways that foster, or hinder, the development of the cognitive skills measured by tests. The significance of format selection is also related to validity, either as a unitary construct (Frederiksen & Collins, 1989; Messick, 1989) or as an aspect of consequential validity (Messick, 1994).

In view of these statements, the conclusion is that it is useful to distinguish between what is measured and how it is measured; between substance and form; between content and format. The two are not independent, for form affects substance and, to some extent, substance dictates form. Nevertheless, the emphasis here is on form, on how items are presented: first, a set of attributes of item formats is offered that can serve to classify item types; second, the importance of an item's format is discussed, i.e. its relationship to what is measured and its effect on item parameters (Linn, 1989). The issues surrounding item format selection, and test design more generally, are also critically tied to the nature of the construct being measured. In line with this, Martinez (1999), reviewing the literature on cognition and item format, concluded that no single format is appropriate for all educational purposes. Referring to the driving-license test, we might assert that driving ability can (and should) be measured via a performance test of driving and not a multiple-choice exam, whereas knowledge about driving (procedures, regulations and local laws) can be measured by a multiple-choice exam.

The item format is described in the item specifications. For optimal performance tests (for example the theory test) there is a variety of item formats that could be considered.

The item formats can be divided into two major categories: those that require the test-taker to generate the response and those that provide two or more possible responses and require the test-taker to make a selection. Because the latter can be scored with little subjectivity, they are often called objective test items (Crocker & Algina, 1986). It is also worth mentioning that open-ended questions, i.e. questions for which the test-taker constructs the answer using his or her own words, are often preferred because of a belief that they may more directly measure some cognitive process, or that they may more readily tap a different aspect of the outcome domain. The consequence has been that popular notions of authentic and direct assessment have politicised the item-writing profession (Rodriguez, 2002). This tendency to include less objective formats in tests gives rise to subjectivity, whereas multiple-choice items can be scored with certainty and with objectivity. The crucial question is whether multiple-choice items and open-ended items measure the same cognitive behaviour. Rodriguez (2002, p. 214) briefly formulated his standpoint in the following way: "They do if we write them to do so". In line with the arguments for multiple-choice items, Ebel (1951, 1972) suggested that every aspect of cognitive educational achievement is testable through the multiple-choice format (or true-false items). His conclusion is also that the things measured by these items are determined far more by their content than by their form. Many recent authors refer to the wise advice in Ebel's writing regarding test development and item writing; see for example Carey (1994), Oosterhof (1994), Kubiszyn & Borich (1996), Payne (1997) and McDonald (1999).

Evaluation of items

The problem of deciding which items to use in a test is related to the theoretical and empirical model as well as to the test and item specifications. In short, quality items are desired. Consequently, evaluation and judgement procedures based on theoretical and empirical data are important to weed out flawed items.


An often-used procedure in item construction is that external item writers deliver items, which are then examined and scrutinised by the test developers. This model is used by the SNRA. The result is that in many cases the item writers' proposals have to be changed in one way or another in order to meet the requirements for good items. The test developer, who is an expert in test and item construction, makes these changes and improvements. When this process is finished, item evaluation is the next step.

The term theoretical evaluation is used for the process in which the items are judged against stated and defined criteria. The procedure requires that the items are written, but not necessarily administered to a representative sample of test-takers in a try-out. Common to all methods for theoretical evaluation is that one or more judges evaluate the items against the criteria. A decision must be made about which criterion or criteria should be addressed, and about the priority among those criteria. Techniques and methods for evaluating the judgements must be decided upon as well. This process of judgement can relate to the item per se as well as to the theoretical and empirical model for the test. Henriksson (1996b) defined and described accuracy, difficulty, importance, bias and conformity as assessment criteria. The judgement can also focus on the classification of items according to item parameters. These item parameters are included in the theoretical component of the total model for the test, and in this respect the basic aim of the judgement and evaluation is to obtain indications about the reliability of the classification.

To obtain information about particular items, item analysis is used. Item analysis is the computation of the statistical properties of an item response distribution. Item difficulty (p) is the proportion of test-takers answering the item correctly. Item discrimination is used to assess how well performance on one item relates to some other criterion, e.g. the total test score. Two statistics that are commonly used are the point-biserial correlation (r_pbis) and the biserial correlation coefficient (r_bis) (Crocker & Algina, 1986).
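A minimal sketch of these item statistics on a 0/1 scored response matrix might look like this (the data and variable names are hypothetical; the discrimination computed here is the corrected, item-rest variant of the point-biserial):

```python
import numpy as np

def item_analysis(responses: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Classical item analysis for a matrix of 0/1 item scores,
    shape (test-takers, items). Returns item difficulty p (proportion
    correct) and point-biserial discrimination (correlation between
    each item score and the total score on the remaining items)."""
    p = responses.mean(axis=0)
    n_items = responses.shape[1]
    r_pbis = np.empty(n_items)
    for j in range(n_items):
        rest = responses.sum(axis=1) - responses[:, j]   # corrected total
        r_pbis[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return p, r_pbis

# Hypothetical try-out data: 5 test-takers, 4 items.
data = np.array([[1, 1, 0, 1],
                 [1, 0, 0, 1],
                 [1, 1, 1, 1],
                 [0, 0, 0, 1],
                 [1, 1, 0, 0]])
p, r = item_analysis(data)
print("difficulty:", p)        # e.g. item 1 answered correctly by 80%
print("discrimination:", r)
```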

Try-out

It is important to pre-test items before they are placed in the actual test, since it is difficult to anticipate how an item will work there. Before the try-out is carried out, it is important to describe what information the try-out should result in. It is also important to be aware that some items will probably not be good enough and that several try-outs may be necessary to end up with a collection of good items. If the test is to consist of several parallel test versions, an extensive domain of pre-tested items is required.

When selecting the group for the try-out it is important to consider whether it is representative of the group that takes the actual test. One should also consider the group's motivation to take the test, and its size and availability. There are, of course, many reasons why the apparent difficulty might be expected to change between item try-out and actual testing. One might, for example, expect the test-takers to be more motivated during the actual testing, or there may have been changes in instruction during the intervening period. The try-out can be done separately from the actual test, or in combination with items in the actual test. If the try-out items are part of the actual test, the test-takers can either be informed that they are working with try-out items or not. The advantage of this design is that the try-out is done in the proper group of test-takers and that they are probably fully motivated.

Validity

The traditional approach to validity classifies validity into three different types: content-related evidence of validity, criterion-related evidence of validity and construct-related evidence of validity.

Content-related evidence of validity refers to the extent to which the content of the test items represents the entire body of content, often called the content universe or domain. The basic issue in content validation is representativeness: how adequately does the content of the test represent the entire body of content to which the test user intends to generalise? The word "content" refers, in this context and according to Anastasi (1988), both to the subject matter included in the test and to the cognitive processes that test-takers are expected to apply to it. Hence, in collecting content-related evidence of validity it is necessary to determine what kinds of mental operations are elicited by the problems presented in the test, as well as what subject-matter topics have been included or excluded.

The key ingredient in securing content-related evidence of validity is human judgement.

Criterion-related evidence of validity is based on the extent to which the test score allows inferences about performance on a criterion variable. In this context the criterion is the variable of primary interest. If the information about the criterion is available at the same time as the test information, the validity is called concurrent validity. Concurrent evidence of validity is, for example, frequently used to establish that a new test is an acceptable substitute for a more expensive measure. If the criterion information becomes available only after a certain time, for example a year or more, the validity is called predictive validity. Thus, predictive evidence of validity refers to how well a test predicts or estimates some future performance on a certain criterion. The degree to which scores on the test being validated predict successful performance on the criterion is estimated by a correlation coefficient, called the validity coefficient.

Construct-related evidence of validity refers to the relation between the test score and a theoretical construct, i.e. a measure of a psychological characteristic of interest. Examples of theoretical constructs are intelligence, critical thinking, creativity, introversion, self-esteem, aggressiveness and achievement motivation; reasoning ability, reading comprehension, mathematical reasoning ability and scholastic aptitude are other examples. Such characteristics are referred to as constructs because they are theoretical constructions about the nature of human behaviour. Construct validation is the process of collecting evidence to support the assertion that a test measures the construct it is supposed to measure. Construct-related evidence of validity can seldom be inferred from a single empirical study or from one logical analysis of a measure; rather, judgements of validity must be based on an accumulation of evidence. Construct-related evidence of validity is investigated through rational, analytical, statistical and experimental procedures. The development or use of theory that relates the various elements of the construct under investigation is central: hypotheses based on theory are derived, and predictions are made about how the test scores should relate to specified variables. In a classical article, Cronbach & Meehl (1955) suggested five types of evidence that might be assembled in support of construct validity. These types were also succinctly stated by Helmstadter (1964) and Payne (1997).

Both content-related and criterion-related evidence of validity are used in this process; in that sense, content validation and criterion validation become part of construct validation. This conclusion (i.e. that content-related, criterion-related and construct-related evidence of validity are not separate and independent types of validity, but rather different categories of evidence that are each necessary and cumulative) represents the integrated view of validity. This integrated and unitary view is described, for example, in Messick's (1989) treatment of validity. Recent trends in validation research have also stressed that validity is a unitary concept (see, for example, Wolming, 2000a; Nyström, 2004). Thus, validity-related evidence concerns the extent to which test scores lead to adequate and appropriate inferences, decisions and actions. It concerns evidence for test use and judgement about the potential consequences of score interpretation and use. In a very real sense, validity is not strictly a characteristic of the instrument itself, but of the inferences that are made from the test scores derived from the instrument.

Reliability

When a test is administered, the test user would like some assurance that the test is reliable and that the results could be replicated if the same individuals were tested again under similar conditions (Crocker & Algina, 1986). Reliability refers to the degree to which test scores are free from errors of measurement. There are several procedures for estimating test score reliability.

The alternate form method requires constructing two similar versions of a test and administering both versions to the same group of test-takers. In this case, the errors of measurement that primarily concern test users are those due to differences in the content of the test versions. The correlation coefficient between the two sets of scores is then computed (Crocker & Algina, 1986). If two versions of a test measured exactly the same trait and measured it consistently, the scores of a group of individuals on the two versions would show perfect correlation. The lack of perfect correlation between test versions is due to errors of measurement: the greater the errors of measurement, the lower the correlation (Wainer et al., 1990).

The test-retest method is used to check how consistently test-takers respond to the test at different times. In this situation the measurement errors of primary concern are fluctuations of a test-taker's observed score around the true score because of temporary changes in the test-taker's state. To estimate the test-retest reliability, the test constructor administers the test to a group of test-takers, waits, and re-administers the same test to the same group. The correlation coefficient between the two sets of scores is then estimated.

Internal consistency is an index of both item homogeneity and item quality. In most testing situations the examiner is interested in generalizing from the specific items to a larger content domain. One way to estimate how consistently the performance of the test-takers relates to the domain of items that might have been asked is to determine how consistently the test-takers performed across items, or subsets of items, on a single test version. The internal consistency estimation procedures estimate the correlation between separately scored halves of a test. It is reasonable to think that the correlation between subsets of items provides some information about the extent to which they were constructed according to the same specifications. If test-takers' performance is consistent across subsets of items within a test, the examiner can have some confidence that this performance would generalize to other possible items in the content domain (Crocker & Algina, 1986).

The techniques for estimating reliability mentioned above have been developed largely for norm-referenced measurement. Other techniques have been suggested for criterion-referenced tests. Crocker and Algina (1986) presented some reliability coefficients for criterion-referenced measurement. Wiberg (1999a) found that the statistical techniques used to evaluate reliability in norm-referenced tests can also be used to evaluate reliability in criterion-referenced tests, although the usage and interpretation of the results must be handled with caution. The variation in test scores among test-takers constitutes an important foundation for the statistical techniques estimating reliability in norm-referenced tests, and only when the items in a criterion-referenced test fulfil the assumptions underlying classical test theory is it advisable to use these statistical methods.
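As a minimal sketch of the split-half approach to internal consistency described above (the data are hypothetical; the Spearman-Brown step-up is the standard correction for halving the test length, not a method discussed in this report):

```python
import numpy as np

def split_half_reliability(scores: np.ndarray) -> float:
    """Split-half estimate of internal consistency for a matrix of
    item scores, shape (test-takers, items): correlate the separately
    scored odd and even halves, then apply the Spearman-Brown
    correction to estimate full-length reliability."""
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)   # Spearman-Brown step-up

# Hypothetical 0/1 item scores for 6 test-takers on 6 items.
scores = np.array([[1, 1, 1, 1, 1, 0],
                   [1, 1, 1, 0, 1, 1],
                   [1, 0, 1, 1, 0, 0],
                   [0, 1, 0, 0, 1, 0],
                   [1, 1, 1, 1, 1, 1],
                   [0, 0, 1, 0, 0, 0]])
print(split_half_reliability(scores))
```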


Parallel test versions

If a test has two or more versions and the test-taker's score is used for decisions (which is the case for the theory test), all versions must be parallel. This means that the different versions contain different items but are built to the same test and item specifications and the same models. From the perspective of a test-taker this means that the obtained test result should be the same irrespective of the version that is administered. The need for parallel test versions is motivated by test security and by fairness, and it is a fundamental requirement if repeated test taking is permitted. There are formal and theoretical definitions of parallel test forms (see for example Thissen & Wainer, 2001), and sometimes a distinction is made between parallel, equivalent and alternate forms. However, Hanna (1993), for example, used parallel, equivalent and alternate forms synonymously. Thus, parallel, equivalent or alternate forms¹ have identical weight allocations among topics and mental processes, but the particular test questions differ. Ideally, parallel test versions should have equivalent raw score means, variability, distribution shapes, reliabilities and correlations with other variables. To estimate the reliability between two or more versions of the same test, the alternate form method is used (Crocker & Algina, 1986). If the versions are parallel regarding item difficulty, there is a high correlation between them. When put into practice, however, the construction and evaluation of parallel test versions give rise to a number of problems, and it is necessary to examine the properties two versions should have that would qualify them for interchangeable use. The concept of parallel versions sets the ground for a discussion of the practical problems in constructing two (or more) parallel test versions that we are willing to regard as interchangeable.

¹ The term parallel test versions is used in this report.
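To illustrate how the parallelism criteria listed above might be checked empirically, here is a minimal sketch (the function and data are our own, hypothetical illustration; it assumes the same group of test-takers has taken both versions):

```python
import numpy as np

def compare_versions(scores_a: np.ndarray, scores_b: np.ndarray) -> dict:
    """Compare two test versions taken by the same test-takers:
    roughly parallel versions should show similar means and
    spreads and a high between-version correlation."""
    return {
        "mean_a": scores_a.mean(), "mean_b": scores_b.mean(),
        "sd_a": scores_a.std(ddof=1), "sd_b": scores_b.std(ddof=1),
        "correlation": np.corrcoef(scores_a, scores_b)[0, 1],
    }

# Hypothetical total scores of eight test-takers on versions A and B.
a = np.array([52, 47, 60, 39, 55, 44, 58, 50])
b = np.array([50, 49, 61, 41, 53, 45, 57, 52])
print(compare_versions(a, b))
```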


Standard setting

The idea of standard setting is to find a method that minimises the number of wrong decisions about the test-taker. There are two types of wrong decision: a test-taker who lacks the knowledge passes the test, or a test-taker who has the knowledge fails the test (Berk, 1996). The cut-off score of a test represents the line between confirmed knowledge and a lack of knowledge in a certain area. If the test-taker's total score is equal to or higher than the cut-off score, he or she is judged to have the knowledge measured by the test; if the total score is below the cut-off score, he or she is not (Crocker & Algina, 1986).

There are various methods for standard setting, and which methods are suitable depends on the format of the test (Berk, 1986). The methods can be categorised by their definition of competence. Some methods assume that test-takers either have the knowledge or do not. Other methods view competence as a characteristic that is continuously distributed, so that a test-taker's knowledge can be seen as a value within an interval of this distribution. These latter methods of standard setting can be divided into different groups depending on the amount of judgement in the decision. Jaeger (1989) proposed two main categories based on the performance data of the test-takers: test-centred continuum models, which are mainly based on judgements, and examinee-centred continuum models, which are mainly based on the test-takers' performance on the test. In addition to these there are judgemental continuum models, which are based mainly on judgement. In the last few years a fourth category, "multiple models", has been introduced; it is used for standard setting when the test has multiple item formats or multiple cut-off scores.

It can also be added that there are basically three general methods for applying standards: disjunctive, conjunctive and compensatory (see for example Gulliksen, 1950; Mehrens, 1990; Haladyna & Hess, 1999). In the disjunctive and conjunctive approaches, performance standards are set separately for each individual assessment, for example a subtest. In the compensatory procedure, performance standards are set for a composite or index that reflects a combination of subtest scores.


With the disjunctive model, test-takers are classified as an overall pass if they pass any one of the subtests by which they are assessed. This approach is applied rather seldom and seems most appropriate when the subtests involved in a test battery are parallel versions, or in some other way are believed to measure the same construct. Haladyna & Hess (1999), for example, pointed out that the disjunctive approach is employed in assessment programmes that allow a test-taker to retake a failed test.

With a conjunctive model for decision-making, test-takers are classified as having passed only if they pass each of the subtests by which they are assessed. The conjunctive approach seems most appropriate when the subtests assess different constructs, or different aspects of the same construct, and each aspect is highly valued. Failing a single assessment yields an overall fail, because the content standards measured by each assessment are considered essential to an overall pass. The application of a conjunctive strategy to standard setting results in test-takers being classified into the lowest category attained on any one measure employed.

With a compensatory model, test-takers are classified as pass or fail based on performance standards set for the combination of the separate subtests. Data are combined in a compensatory approach by means of an additive algorithm that allows high scores on some subtests to compensate for low scores on others. The use of a compensatory strategy seems, according to Ryan (2002), appropriate when the composite of the separate subtests has an important substantial meaning, a meaning that is not represented by the subtests taken separately.

A combination of the compensatory and conjunctive models can also be employed. Such an approach sets minimal standards on each subtest, applied in a conjunctive fashion: the test-taker must reach a minimal pass level on each subtest before the compensatory approach is applied and a final rating is determined. This combined conjunctive-compensatory approach sets minimum standards that are necessary on each subtest but not sufficient for the subtests taken together, and it prevents very low levels of performance on one subtest from being balanced by exceptional performance on other subtests (Mehrens, 1990).
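The three general methods, and the combined approach, can be summarised in a few lines of code. This is an illustrative sketch with invented subtest names and cut-off values, not the decision rules of any actual test:

```python
# Illustrative subtest cut-offs and composite cut-off (invented values).
SUBTEST_CUTOFFS = {"subtest_1": 12, "subtest_2": 15}
COMPOSITE_CUTOFF = 30

def disjunctive(scores: dict[str, int]) -> bool:
    """Overall pass if any one subtest is passed."""
    return any(scores[s] >= c for s, c in SUBTEST_CUTOFFS.items())

def conjunctive(scores: dict[str, int]) -> bool:
    """Overall pass only if every subtest is passed."""
    return all(scores[s] >= c for s, c in SUBTEST_CUTOFFS.items())

def compensatory(scores: dict[str, int]) -> bool:
    """Pass if the additive composite reaches its cut-off, letting
    strong subtests compensate for weak ones."""
    return sum(scores.values()) >= COMPOSITE_CUTOFF

def conjunctive_compensatory(scores: dict[str, int]) -> bool:
    """Minimum level on each subtest, then a compensatory composite."""
    return conjunctive(scores) and compensatory(scores)

scores = {"subtest_1": 13, "subtest_2": 18}
print(disjunctive(scores), conjunctive(scores),
      compensatory(scores), conjunctive_compensatory(scores))
```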

Test administration

There are basically three ways of presenting items to the test-takers: paper-and-pencil tests, computerised tests and computerised adaptive tests. In a paper-and-pencil test all test-takers get the same number of items, answered with a pencil on paper. The test is often administered to a large number of test-takers a limited number of times, because of item exposure. The item analysis is mainly done with classical test theory (Wainer et al., 1990).

A computerised test is essentially the same as a paper-and-pencil test; the difference is that it is carried out on a computer, which makes it possible to randomise the order of the items and the options for each test-taker. One advantage of computerised tests is that administration takes less time, since scoring can be done during the test. Another advantage is that test security is increased when there are no paper copies of the test. Computerised tests also open up the possibility of new, innovative types of items (van der Linden & Glas, 2000). Different types of items are created from combinations of item format, response actions, media and interactivity. An example of an item format is a multiple-choice item; a response action could be that the test-taker answers the item with a joystick. The items can contain different media, for example animations and sound, both in the stem and in the options. An example of interactivity is an item that the test-taker answers by marking a text or a point.

It is important to be aware of new measurement errors that can occur in computerised testing. For example, a computerised test could cause problems for test-takers who are not used to working with computers. Another possible source of measurement error is that poor graphics on the computer monitor can result in blurred pictures.

In computerised adaptive tests (CAT) the test-taker receives items of different difficulty depending on how he or she answered the previous items in the test. CAT makes it possible to give test-takers a test that fits their ability (Umar, 1997). Which items are selected for a test-taker also depends on the content and difficulty of the test and on item discrimination. For each response the test-taker gives, the computer program estimates the test-taker's ability and how reliable that estimate is. When the predetermined reliability is achieved the test is finished and the test-taker obtains the final estimate of his or her ability level (Wainer et al., 1990).

Tests based on CAT are often analysed with item response theory (IRT). IRT can be used to describe test-takers, items and the relation between them, and it takes into account that the items in a test can vary in difficulty. There are different IRT models that can be used to create scale points. The one-parameter logistic (or Rasch) model is the simplest model, where only an item difficulty parameter is estimated. The two-parameter logistic model estimates a discrimination parameter in addition to the difficulty parameter. The three-parameter logistic model includes a guessing parameter as well as the discrimination and difficulty parameters (Birnbaum, 1968).

There are three basic assumptions in IRT (Crocker & Algina, 1986). The test has to be unidimensional, which means that all items measure the same trait. The assumption of local independence means that the answer to one item by a randomly picked test-taker is independent of his or her answers to the other items. The third assumption is that the relationship between the proportion of test-takers answering an item correctly and the latent trait can be described by an item characteristic curve for each item. With IRT models we can determine the relationship between a test-taker's score on the test and the latent trait that is assumed to determine the test-taker's result: a test-taker with higher ability is more likely to answer an item correctly than a test-taker with lower ability. If these three conditions are met, test-takers can be compared even if they did not take parallel test versions.
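The three-parameter logistic model mentioned above gives the probability that a test-taker with ability θ answers an item correctly. A minimal sketch, with illustrative parameter values of our own choosing, is:

```python
import math

def irf_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic item response function:
    P(correct | theta) = c + (1 - c) / (1 + exp(-a * (theta - b))),
    where a is discrimination, b is difficulty and c is the guessing
    parameter. Setting c = 0 gives the two-parameter model; fixing
    a as well gives the one-parameter (Rasch) model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Illustrative item: moderate discrimination, average difficulty,
# guessing floor of 0.25 (as for a four-option multiple-choice item).
for theta in (-2, 0, 2):
    print(theta, round(irf_3pl(theta, a=1.2, b=0.0, c=0.25), 3))
```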

Item bank
An item bank is a collection of items that can be used to construct a computerised test or a test based on CAT. An item bank should consist of a large number of pre-tested items so that varied tests can be composed. The items in the bank should have high item discrimination and a low chance of the right answer being selected through guessing. With an item bank it takes less time to construct a test. The items provide most information if they have the same characteristics (Wiberg, 2003a). It is, however, hard to construct such items, but at least we should strive towards that. When the items are stored in an item bank there is also a need for variables for identifying items with certain characteristics in the bank. There is no agreement on how an item bank is defined and it can, according to Umar (1997), be defined differently depending on how it is to be used. No matter how an item bank is defined, the main idea of item banking is associated with the need for making test construction and test development easier, faster and more efficient. But there is one important principle for item banking: only good items are to be stored in the bank.

The number of items that is required in an item bank depends on whether the test is a paper-and-pencil test, a computerised test or a computerised adaptive test. It also depends on how extensive the test is, how often it is administered and whether there are several versions of the test. A paper-and-pencil test has to contain enough items to cover all the content areas in the test. The test should consist of more than one version in order to prevent the test-takers from getting to know the items before the test. In a computerised test it is possible to randomise the order of the items and their options. This makes it harder for the test-takers to remember answers and compare the answers with others. This means that a computerised test can be used longer than a paper-and-pencil test before it has to be changed. If the test is based on CAT the number of items in the test varies for different test-takers. The test should be composed of at least twenty items for each test-taker so that a reliable estimate of the test-taker's ability can be obtained. This means that an item bank has to contain at least one hundred items varying in difficulty level. The item bank should contain enough items so that a test-taker can take the test several times without getting the same items. A test based on CAT can contain 30 to 50 per cent fewer items than a paper-and-pencil test (Bunderson, Inouye & Olsen, 1989).

The items have to be assessed regularly in several respects. One aspect of item development is the try-out of new items. The try-out gives an estimate of the quality of these items and provides an assessment of the items before they are used in an actual test (Wainer et al., 1990). In addition, the test also has to be assessed systematically to ensure that the item parameters do not change between the try-out and the actual test. If the test is administered too frequently, try-out items may become familiar to the test-takers in advance, and then the parameters can change between the try-out and the actual test. There are several methods of making sure that the item parameters do not change between the try-out and the actual test (Bock & Aitkin, 1981; Thissen, 1982; Rigdon & Tsutakawa, 1983; Mislevy, 1984, 1986). It is also important to control the item exposure so that the items stay unknown to the test-takers (Davey & Parshall, 1995). If the test-takers are familiar with the items before the test, the efficiency of the test decreases, since test-takers who lack the knowledge can pass the test by memorising the right answers. A deviant response pattern could also be an indication that the test-takers have knowledge of the correct answers. Deviant response patterns can also occur when the test-taker is guessing. There are several methods for discovering deviant response patterns among test-takers, see for example van Krimpen-Stoop & Meijer (2000).
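
Person-fit statistics are the usual machinery behind such checks. As a rough illustration (not one of the specific methods cited above), the sketch below counts Guttman errors: pairs where a test-taker misses a relatively easy item while answering a harder one correctly. An unusually high count flags a deviant response pattern.

    def guttman_errors(responses, p_values):
        # responses: 0/1 item scores for one test-taker.
        # p_values: proportion correct per item from the try-out; higher = easier.
        # Order the responses from easiest item to hardest.
        ordered = [r for _, r in sorted(zip(p_values, responses), reverse=True)]
        errors = 0
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                # An easy item wrong while a harder item is right: one error.
                if ordered[i] == 0 and ordered[j] == 1:
                    errors += 1
        return errors

    # Hypothetical example: five items, 0/1 scores and their p-values.
    print(guttman_errors([0, 1, 0, 1, 1], [0.9, 0.8, 0.6, 0.4, 0.3]))  # -> 5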

Empirical studies of the theory test

A new curriculum and a new theory test in 1990
As stated previously, the curriculum in Sweden was revised in 1990 and two new content areas were introduced in the theory education (Trafiksäkerhetsverket, 1988). These areas focused on risky situations in traffic and limitation of driver abilities. In order to cover these content areas, a new theory test was introduced. The theory test was administered in six versions that were supposed to be randomly distributed to the test-takers. The test consisted of forty items with four options each. The test-takers did not know how many options were correct. In order to get one point on an item the test-taker had to select all the correct options. Mattsson (1990) examined test-takers' results on the new test compared to test-takers' results on the old test. In addition, Mattsson described the difficulty and the equality of the versions of the test. A first sample was obtained in January/February 1990, when the new test had just been introduced. In order to examine changes over time, another sample was collected in May.

Judgement of items – difficulty
The result of the first sample showed that the proportion of correct answers was unexpectedly low. As mentioned above, the maximum score on the test was forty, and the mean score of the test-takers that took the test in January/February was 26.4. The pass-rate was 24.2 %, which was much lower than the pass-rate for the previous test. According to Mattsson (1990) one explanation for the low pass-rate was that the driver education did not correspond to the new curriculum. To examine whether the results would improve when the teachers became more familiar with the curriculum, the results were compared with the second sample, collected in May. The result from the second sample showed that the mean score and the pass-rate had increased. For the sample collected in May, the mean score was 30.7 and the pass-rate 52.5 %. One conclusion from the study was that there were large differences in the pass-rate between the first and the second sample (see Table 4). There was, however, reason to expect a change in pass-rate once the new test had been administered for a while. As teachers became more familiar with the new curriculum, and as driver instruction and study material were amended in line with the new theory test requirements, the pass-rates were expected to rise.

Parallel test versions
As mentioned above, the test introduced in 1990 was administered in six versions. The idea was that the test versions should be randomly selected, but it did not really work out. Mattsson (1990) showed that version number six was administered less frequently than the other versions (see Table 3). This result indicated that some test stations distributed the versions in numerical order, which meant that test stations with few test-takers administered version 6 less frequently than the other versions.


Table 3. Number of test-takers, male and female, divided according to which of the six different versions of the theory test they took.

Version      1      2      3      4      5      6   Total
Male       330    361    352    320    324    285    1972
Female     267    248    240    260    231    173    1419
Total      597    609    592    580    555    458    3391
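
The observation that the versions were not randomly distributed can be checked with a goodness-of-fit test against the uniform distribution expected under random assignment. The following sketch applies a chi-square test to the version totals in Table 3; the test itself is our illustration, not an analysis reported by Mattsson (1990).

    # Chi-square goodness-of-fit test of the version totals in Table 3.
    counts = [597, 609, 592, 580, 555, 458]   # versions 1-6
    expected = sum(counts) / len(counts)      # 3391 / 6, about 565.2

    chi2 = sum((obs - expected) ** 2 / expected for obs in counts)
    print(f"chi2 = {chi2:.1f} on {len(counts) - 1} degrees of freedom")
    # chi2 is about 27.4; the 5 % critical value with 5 df is 11.07, so the
    # deviation from a uniform distribution is statistically significant.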

The results of the first sample, collected in January/February, showed that the pass-rate varied between the versions. More than 80 % of those who got versions 2 and 5 failed the test, while fewer than 70 % failed versions 1 and 4. The result indicated that it was content area 3 (traffic regulations) and the new area 7 (limitation of driver abilities) that made versions 2 and 5 more difficult than the others. The differences between the versions led to a revision of the test where some items were changed (Mattsson, 1990). When the equality of the versions was examined in the second sample, collected in May, the result showed that there was still a large difference in pass-rate between the different test versions (from 41 to 64 %). There were even larger differences between the versions in the second sample than in the first one (see Table 4). The results from the second sample showed that versions 2 and 5 were no longer the most difficult. Instead it was version number three that was the hardest. A probable explanation for the change in difficulty was, according to Mattsson (1990), the revision of the test that occurred before the second sample was obtained. Thus, one conclusion from the study was that the versions of the test were not parallel (see Table 4), which meant that the test-takers' results depended on which version of the test they had taken.


Table 4. Proportion (%) of test-takers that passed the test, divided according to the different content areas, for two samples collected in January/February (J/F) and May (M).

                       Version 1      Version 2      Version 3        Total
Content area           J/F     M      J/F     M      J/F     M      J/F     M
1                       94    97       86    93       98   100       93    97
3                       55    79       38    64       45    57       47    67
5                       85    95       84    92       80    90       83    92
7                       70    88       56    80       69    75       64    81
9                       52    80       75    85       60    77       62    81
Proportion passed       31    64       18    53       23    41     24.2  52.5

Content area: 1. Vehicle-knowledge, 3. Traffic regulations, 5. Risky situations in traffic, 7. Limitation of driver abilities, 9. Special applications and other regulations.

A theoretical description of the test
Mattsson (1993) conducted a more extensive analysis of the theory test, in which he discussed the test in relation to some methodological aspects of CRT. The study resulted in some suggestions for further improvement and development of the test.

Test specifications
Mattsson found that the curriculum was used by the test developers as the test specification for both the practical test and the theory test. He also concluded that the objectives of the curriculum had been given different weights in the test, according to their judged importance and the amount of information they covered in the curriculum. The weight of the different content areas was adjusted partly by the number of items for each area and partly by the cut-off scores, which differed between the areas. The analysis also showed that the curriculum was very detailed, i.e. there were many objectives to test with only forty items. Mattsson (1993) suggested that the priority of the different objectives should be stated in a test specification, where the item format could also be specified. If the test is to cover the objectives of the curriculum there is a need for several versions of the test. The versions should be parallel so that the test is valid, reliable and fair.


Item format
Since the theory test is administered to many test-takers and distributed over the whole country, it is important that the scoring is fast and objective. At the time of the study by Mattsson (1993), the theory test consisted of forty multiple-choice items, but with a quite unusual design. Traditional multiple-choice items have only one correct option. In the theory test, however, more than one option could be correct, and for some of the items the test-takers were unaware of how many options were correct. In order to get one point for an item, the test-taker had to select all the correct options. Mattsson (1993) also examined the item format of the theory test empirically by comparing two groups. Both groups answered thirty items similar to those in the actual test. One group consisted of students with traffic education and the other group consisted of students without traffic education. With this design, the effects of education should be apparent. The non-educated group answered some items correctly, and the reason for this could be that these items covered general traffic knowledge or had flaws that made the correct option most appealing. The result also showed that some items were too difficult, even for the educated group. These items should be revised. Mattsson also concluded that the item format, which allowed more than one option to be correct, could be confusing for the test-takers and contribute to incorrect answers for some items in the educated group.

Based on the results of the study presented above, Mattsson (1993) made a couple of suggestions for improvement of the item format. He proposed that the scoring should be adjusted to the item format, so that the test-takers got one point for each option they answered correctly. Another possibility was to change the item format so that only one option was correct for each item. Mattsson (1993) also compared two possible ways of scoring the theory test. In the study, the test was viewed as consisting of 160 statements rather than 40 items, each with 4 options. Each correct statement gave one point. This procedure gave credit for items where the test-taker had answered only some of the options correctly. Two scoring models were compared. In the first model the cut-off scores were multiplied by four to account for 160 statements instead of 40 items. In the second model the cut-off scores were adjusted because the division of the test into 160 statements instead of 40 items resulted in more correct answers.

This meant that the proportion of passes increased. In this scoring model, the cut-off score was adjusted so that the same proportion of test-takers passed each content area as had done in the actual test. The result showed that the first model would result in lowered criteria. With the second model, the proportion of test-takers that passed and failed the test would be the same as before. Only 13 % of the test-takers would obtain a different result: some of those who passed the test consisting of 40 items would fail it when it was seen as consisting of 160 statements, and vice versa. This result indicated that the options could be used as individual items.

Try-out
Because of the secrecy of the test and the goal that the theory test should assess the objectives stated in the curriculum, it is important to replace old items with new items continuously. Mattsson (1993) suggested a model for the try-out in which new items are pre-tested in the actual test. The actual test is extended with a few try-out items that do not count towards the score. The test-takers do not know which items are try-out items. This procedure maximises the motivation of the test-takers, and the try-out items are tested on the test-takers who take the actual test.

Standard setting in the theory test
Mattsson (1993) also studied the standard setting in the theory test by using a model proposed by Berk (1986). There are several methods that can be used to set the standards for a test (Berk, 1996). According to Berk (1986) the standard setting can be based on the results of an educated and a non-educated group. The idea is that the educated group should perform better on the test than the non-educated group if the education has been effective and if the test works as intended. Mattsson (1993) compared the results of two groups, one with traffic education and one without. A test consisting of 30 items similar to the items in the theory test was administered to both groups. If the scoring was done with the cut-off scores used in the actual test, sixty-seven per cent of the students in the educated group would pass the test and no one in the non-educated group would pass. The results of the two groups were compared according to the model proposed by Berk (1986), and the result indicated that the cut-off score may lead to wrong decisions, where some test-takers who have the knowledge may fail the test.

However, one could not assume that all test-takers in the educated group had the knowledge measured by the test. If the study had been done using a group that undoubtedly had the required knowledge, a lower cut-off score would probably have been recommended. Mattsson (1993) also made some suggestions for the improvement of the test administration. He suggested that the response sheets should be changed so that they could be scored optically. That would increase the possibility of conducting follow-up studies, since data would be easily obtained. Computerised testing was another solution that Mattsson presented, although an expensive one. Computerised testing has the advantage of automatic scoring. The possibility of creating item banks and of randomising items and options also arises with computerised testing, as does the possibility of computerised adaptive testing, where the response to the previous item determines which item is administered to the test-taker next.
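
The contrasting-groups logic that Berk (1986) describes can be expressed as choosing the cut-off that minimises the classification errors between the two groups. The sketch below does exactly that; the two score distributions are invented and do not come from Mattsson's data.

    # Contrasting-groups standard setting: pick the cut-off that minimises
    # misclassification between an educated group (who should pass) and a
    # non-educated group (who should fail). Scores are hypothetical, on a
    # 30-item test like the one used in the study.
    educated = [24, 26, 27, 22, 28, 25, 29, 21, 27, 26]
    non_educated = [12, 15, 10, 17, 14, 11, 16, 13, 18, 9]

    def errors(cut):
        false_fail = sum(1 for s in educated if s < cut)       # knowledgeable but fail
        false_pass = sum(1 for s in non_educated if s >= cut)  # not knowledgeable but pass
        return false_fail + false_pass

    best_cut = min(range(31), key=errors)
    print("suggested cut-off:", best_cut, "misclassifications:", errors(best_cut))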

Traffic education in upper secondary school – an experiment
Over the years several investigations have proposed that traffic education should be a part of upper secondary school, since the students can then be educated over a long period of time and gain experience of many different traffic situations (see for example SOU 1991:39). An experiment with traffic education was conducted at an upper secondary school in Umeå in 1993. Söderström and Mattsson (1993) evaluated the experiment. The purposes of the evaluation were threefold. The first aim was to describe the students', parents' and teachers' attitudes to the education. The second aim was to study the students' attitudes to drugs and to the behaviour of road users. The third aim was to describe the knowledge acquired by the students through the education. To determine what knowledge the students acquired during the education a theory test was administered. The test consisted of thirty items similar to the items in the theory test administered by the SNRA. The content of the items reflected the curriculum, and the cut-off scores were adjusted in proportion to the reduction of the number of items. The test was administered both in the first week of the education and at the end of the education, i.e. eleven months later.

About fifty per cent of the students left the education before the final test. For the students who completed the education, the content area "traffic regulations" was the most difficult part. Only eighteen per cent of the test-takers passed that part of the test. Compared to the results of students from a driving school this was a low pass-rate, as the pass-rate for students at a driving school was seventy-five per cent. The content area "limitation of driver abilities" was the part of the test where the effect of the education was most obvious. To examine the attitudes to the education a questionnaire was sent to students, parents and teachers. The results showed that the majority of the participants had a positive attitude to the education, although the practical arrangements of the education could be improved. The students' attitudes to drugs and to the behaviour of road users did not seem to be affected by the education, but some other interesting results were observed. The attitudes did not always correspond to the behaviour: even though the students considered it important to wear a helmet when bicycling, ninety-four per cent never wore one. The main conclusion was that the traffic education in upper secondary school worked out well, but that it could still be improved in many ways. Another finding was that the pass-rate on this test was lower than the pass-rate on the actual test. According to Mattsson (1990) an explanation for the low mean scores could be that the students had little motivation to learn traffic theory, since most of them had one or two years left before they were going to take the real theory test.

Analysis of the structure of the curriculum and the theory test

Judgement of items – the relation between the curricula and the content of the items
The theory test is intended to measure the knowledge, attitudes and understanding required of a good driver. The test is closely related to the five main parts of the curriculum for the driver education. The curriculum is quite specific and the educational outcome is stated in behavioural terms.

It states what new drivers are supposed to know, be able to do, understand, argue for etc., and how they should be able to demonstrate these abilities. When the quality of the items of a criterion-referenced test is judged, the most important aspect to consider is the relationship between the content of the items and the objectives of the curriculum (Berk, 1984). The objectives stated in the curriculum are very detailed, and this led to the idea of developing a simple theoretical model for describing the curriculum (Henriksson, Ehn, Wikström & Zolland, 1995). The model that was generated had analysis of variance as a theoretical basis and distinguished between main factors, two-way interactions and three-way interactions (Appendix 1). The model comprised the main factors driver, vehicle and environment. The model also included two types of states: a stationary and a non-stationary state. These five components were combined in various ways to create fifteen categories in the model. Since these fifteen categories are supposed to represent all possible events that are described in the curriculum, as well as in the test, the model could also be validated by applying it to both the theoretical and the practical curricula. The model also made it possible to study what types of events the driving-license test measures, and it could be used as a tool for item construction (Henriksson, Wikström & Zolland, 1995).

In order to validate the new model it was applied to the present curriculum, i.e. each sub-goal in the curriculum was classified according to the model. The theoretical curriculum consisted of 233 goal-descriptions and the practical curriculum consisted of 167 goal-descriptions (TSVFS 1988:43). The result of this process of validation was that about 95 % of the sub-goals in the curriculum could be related to the new model. Certain sub-goals in the curriculum could, however, not be classified. The main reason for this was that they were very broad in scope and very generally described. The process of validation, i.e. the classification of each sub-goal in the present curriculum according to the new model, also resulted in certain reflections about the descriptions in the curriculum. In summary, these reflections were that some of the sub-goals overlap each other in terms of content and that some sub-goals are heterogeneous in their composition. The sub-goals that use sketches and pictures as guidelines limit the options when constructing the test.

The same effect arises when the terms "to state", "to describe", "to explain" and "to understand" are used in the sub-goals. Another discovery was that some of the categories ended up empty (Henriksson, Wikström & Zolland, 1995). One of the intentions behind the development of a new model for the curriculum was that this model could be taken as a reference, or point of departure, when constructing new items. Therefore, the advantages of using the new model were also presented. These advantages were, for example, the possibility of making the curriculum homogeneous instead of heterogeneous, and more general than specific, which should have an impact on the flexibility of the test construction. The model also makes it possible to change items reflecting sub-goals with no interaction into items reflecting courses of events. Such changes will also give the items in the theory test a closer connection with reality. The categories without any classified sub-goals point to new fields relevant for testing. The description of advantages was regarded as a further step in the process of validation of the new model. Thus, the overall and summarised conclusion was that the new model could, so far, be regarded as a promising source of information for everyone concerned about the purpose and content of the theory test. On the other hand, it should also be stated that the process of validation had to include more steps, primarily related to the empirical outcome of the test.

In addition, Zolland and Henriksson (1998) used theoretical and empirical analyses to study whether the six versions of the test were parallel. They also examined the frequency of administration of the different test versions, to see if the versions were randomly administered to the test-takers. Like Mattsson (1990), they found that the versions differed in the number of times they were administered and that version number six was the least frequently administered. The theoretical model for the Swedish driving-license test described earlier (Henriksson, Ehn, Wikström & Zolland, 1995; Henriksson, Wikström & Zolland, 1995) was used to analyse the test. The result indicated that the versions had different content according to the theoretical model. Since more than one option could be correct, one item could relate to several parts of the curriculum. The empirical analysis revealed that the six versions of the test were not equal in difficulty.

The different versions of the test were also not randomly administered to the test-takers, which is not good from a test-theoretical point of view.

In order to illustrate different ways of assessing the structure of the curriculum of the driving-license test, and to clarify the division between the theory test and the practical test, another study was conducted (Zolland, 1999). Since Swedish driver education involves two examinations, a practical test and a theory test, one needs to consider two curricula, two types of courses and two test situations. It is important that there is a balance between the curriculum, the education and the test in each of the two systems. The analysis showed that the theory test focuses on both stationary and non-stationary states, whereas the practical test focuses on non-stationary states. It might be reasonable for the theory test to measure stationary states as well as the non-stationary states that are too dangerous to measure in the practical test, and for the practical test to measure only non-stationary states.

Aspects of assessment in the practical driving-license test

A detailed curriculum
As stated previously, the curriculum for the practical part of the driving-license test is detailed and consists of 167 goal-descriptions (TSVFS 1988:43). In line with the ambition to describe the curriculum in more general terms, the theoretical model (see Appendix 1) was also applied to the curriculum for the practical test. The result showed that ninety-seven per cent of the objectives in the curriculum fitted the new model (Henriksson, Karlsson & Zolland, 1995), i.e. the detailed sub-goals in the curriculum could be related to the theoretical model with its fifteen categories. The result also indicated that, as in the curriculum for the theoretical part, the sub-goals in many cases overlapped. There are, for example, no less than four sub-goals describing how the driver should hold the steering wheel. These sub-goals were located in different parts of the curriculum, but when the theoretical model was applied all of them were categorised as two-way interactions (driver → vehicle) and the overlap was disclosed. Thus, the overall conclusion was that the detailed curriculum could be described in more general terms.


A model for judgement of competencies
Another strategy was used to obtain a more general description of the demands formulated in the detailed curriculum. The practical test, i.e. the test of the capacity of the driver in real driving situations, includes components relating to judgement. This strategy, in contrast to the former one based on the theoretical model (Appendix 1), was based on discussions with, and statements from, a group of experts about competencies that are important for a successful driver and, consequently, important to assess in the practical test (Henriksson, Karlsson & Zolland, 1996a, b). The purpose of these discussions was to arrive at statements and conclusions about indicators of good traffic behaviour. The members of the expert group, eleven participants in all, were not only experienced driver examiners and instructors but also representatives of the authority (SNRA). They were carefully selected because of their competence and their knowledge of the curriculum and of the requirements for a successful driver. The conclusions about important competencies were described in a hierarchical model, presented in Figure 1 below (from the highest level, E, down to the basic level, A):

E. Realistic understanding of oneself and others
D. Communication, adjustment to traffic situations
C. Perception of significant information
B. Application of traffic regulations
A. Vehicle-related knowledge, manoeuvring

Figure 1. Model of important competencies.

The A-level (i.e. vehicle-related knowledge and the ability to manoeuvre the vehicle) is the basic level. The latter ability is a fundamental qualification for being able to control and master the vehicle in different traffic environments.

The B-level (i.e. application of traffic rules and traffic regulations) is the final step of a process that involves transforming theoretical knowledge into practical knowledge that can be applied in real-life traffic situations. The C-level (i.e. the perception level) develops gradually, from an initial stage of transferring real-world situations described in theory to real traffic situations, to a final stage implying consciousness about how to select relevant information in different traffic situations. The D-level (i.e. communication and adjustment to traffic situations) is a stage that implies a high degree of independence: an ability to communicate one's own as well as others' intentions, and an ability to adjust to different environments and different traffic situations. The E-level (i.e. a realistic understanding of oneself and others) implies a high degree of consciousness about the determinants behind different kinds of reactions and actions. It also implies a high degree of preparedness for different kinds of reactions in different traffic situations. It can be added that a satisfactory level of ability at this level is a fundamental qualification for being a successful driver.

If this hierarchical model is related to the previously defined theoretical model with fifteen factors (Appendix 1), the main finding is that there is a certain relation between the two models. Levels A–C in the hierarchical model correspond to two-way interactions and levels D–E correspond to three-way interactions. This means that there is a relation between the hierarchical level in the model of important competencies and the number of factors in the model including interactions. It should be emphasised that this is not a model for the acquisition of competencies, but a model that describes different levels of competence. Good traffic behaviour on a certain level implies, in most cases, good behaviour on the lower levels. It can also be added that the levels are not independent, i.e. there is interaction between levels. From the perspective of judgement in the practical driving-license test, this model is also supposed to indicate to the driver examiner that the observation of good traffic behaviour should be concentrated on the higher rather than the lower levels. This advice can also be seen in the light of the restricted time for observation in the practical test (Henriksson, Karlsson & Zolland, 1996b).

The computerisation of the theory test
In January 1999 the Swedish theory test was computerised. A study was conducted with the purpose of checking whether there were any differences between the results of those who took the same test before and after it was computerised (Wiberg, 1999b). When the test was computerised, both the items and the options were randomised. The test was administered in six versions, and the results showed that there were differences in results for most of the versions. The five content areas of each test version were examined and there were differences in most of the parts. The statistical analysis of the test showed that the proportion of test-takers that passed the test decreased after the computerisation. The analysis of the different versions showed that versions 1, 3 and 5 were affected by the computerisation, whereas version 2 was not affected at all. In version 6, only one content area was affected by the computerisation. An analysis of the different content areas of the test showed that the results in the areas traffic regulations, risky situations in traffic and limitation of driver abilities differed significantly before and after the computerisation in most versions of the test. To obtain information about specific items, an item analysis was conducted. The items of the different versions were analysed and values of the point-biserial correlation, rpbis, which is a measure of the relationship between the score on a single item and the test-taker's score on the test, were obtained. The conclusion from this analysis was that some items worked better after the computerisation. The most likely explanation for this is that the possibility of remembering patterns among the options was reduced by the randomisation of the options. The main conclusion of the study was that the overall effect of the computerisation was positive. Most items seemed to work better in the computerised test, and the possibility of learning answering patterns was reduced by the randomisation of the items and their options.
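
The point-biserial correlation used in this item analysis is straightforward to compute. The sketch below shows the standard formula applied to a made-up response matrix, not to the actual test data.

    import statistics

    def point_biserial(item_scores, total_scores):
        # r_pbis = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are
        # the mean total scores of those who answered the item correctly
        # and incorrectly, s is the standard deviation of the total
        # scores and p is the proportion answering the item correctly.
        right = [t for x, t in zip(item_scores, total_scores) if x == 1]
        wrong = [t for x, t in zip(item_scores, total_scores) if x == 0]
        p = len(right) / len(item_scores)
        s = statistics.pstdev(total_scores)
        return ((statistics.mean(right) - statistics.mean(wrong)) / s
                * (p * (1 - p)) ** 0.5)

    # Hypothetical data: one item's 0/1 scores and ten test-takers' totals.
    item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    totals = [55, 40, 60, 52, 38, 58, 45, 50, 62, 41]
    print(f"rpbis = {point_biserial(item, totals):.2f}")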


Methods for standard setting

Standard setting for the theory test used between 1990 and 1999
As mentioned earlier, there are several ways to set the standards for a test. Wiberg and Henriksson (2000) studied the choice of standards in the theory test that was used between 1990 and 1999. They concluded that there were some problems with the cut-off scores in the test. The test, which consisted of 40 multiple-choice items, could be seen as 4 options x 40 items = 160 statements (St) that could be answered correctly or incorrectly. Table 5 gives an example of the consequences this kind of scoring might have.

Table 5. A theoretical example of the implications of the item format and scoring in the theory test.

Content   Number   Cut-off      Number of St correct, but     Number of St correct, but the
area      of St*   score        the test-taker gets no score  test-taker does not pass the test
3           56     44 (11·4)    42 (14·3)                     56 (14·4, all correct)
5           32     20 (5·4)     24 (8·3)                      32 (8·4, all correct)
7           32     20 (5·4)     24 (8·3)                      32 (8·4, all correct)
1           12      4 (1·4)      9 (3·3)                       9 (3·3, all wrong)
9           28     16 (4·4)     21 (7·3)                      26 (5·4+2·3, 2 wrong)
                   +16
Total      160     120          120                           154

* St is an abbreviation for statements.

According to Table 5, a test-taker who answered 120 statements correctly could either pass the test or get a zero score; the result depends on which statements are answered incorrectly. The example in Table 5 also shows that a test-taker who answered 154 statements correctly (96.25 % of all the statements) might still not pass the test. Apparently there were some problems with the old test, and these problems led to the development of a new test in 1999.
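
The all-or-nothing scoring behind Table 5 is easy to demonstrate. The sketch below contrasts two invented response patterns, both with 120 of the 160 statements correct, under the old rule that an item scores one point only if all four of its statements are answered correctly (the per-area cut-offs are ignored for brevity).

    def items_scored(pattern):
        # pattern: per-item counts of correctly answered statements (0-4).
        # An item scores only when all four statements are correct.
        return sum(1 for correct in pattern if correct == 4)

    # Pattern A: 30 items fully correct, 10 fully wrong -> 30 items scored,
    # i.e. 120 statements correct and the total cut-off of 120 is reached.
    pattern_a = [4] * 30 + [0] * 10
    # Pattern B: 3 of 4 statements correct on every item -> 0 items scored,
    # although 40 * 3 = 120 statements are correct.
    pattern_b = [3] * 40

    for name, pattern in (("A", pattern_a), ("B", pattern_b)):
        print(f"pattern {name}: {sum(pattern)} statements correct, "
              f"{items_scored(pattern)} items scored")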

Standard setting for the new theory test introduced in 1999
The new test has the same content areas as the old test but a different item format (VVFS 1999:32). The test consists of sixty-five multiple-choice items with only one correct option for each item. The scoring model is compensatory, which means that lack of knowledge in some areas can be compensated for by more knowledge in other areas (Wolming, 2000b). Wiberg and Henriksson (2000) conducted a study in order to suggest a suitable method for standard setting in the new theory test. They concluded that two methods of standard setting could be particularly useful for the theory test: informed judgement and iterative Angoff. Both methods are based on a combination of judgement and performance data for the test-takers.

Informed judgement (Yalow & Popham, 1983) is a method where judges analyse the area of competence that the test is measuring. The judgement is also based on the context in which the standard-setting decision is to be made. The next step is to obtain data on the performance of the test-takers in order to set the standards of the test. In the last step of the procedure the judges are asked to recommend a cut-off score. The mean of all the judges' recommendations is computed, and that value is set as the cut-off score for the test. Iterative Angoff (Saunders & Mappus, 1984) is a method of standard setting that is based on both judgement and statistical methods. In the first step of the procedure, the judges each choose a cut-off score and present their suggestions to each other. In the next step, the test-takers' performance is presented to the judges along with descriptive statistics for the cut-off scores chosen earlier. On the basis of that information the judges decide on the final cut-off score together.

The conclusion of the study was that either of these methods would be appropriate for setting the standards of the theory test. However, the cut-off score for the theory test was set to 52 out of 65 (80 %). That decision was based merely on judgement, rather than on any of the models proposed.
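
Both methods end by turning a set of judges' recommendations into a single standard. A minimal sketch of that aggregation step, with invented judge values, could look like this:

    # Final step of informed judgement: the mean of the judges'
    # recommended cut-off scores becomes the cut-off for the test.
    # The recommendations below are hypothetical.
    recommendations = [50, 53, 52, 51, 54]        # one per judge, out of 65
    cut_off = round(sum(recommendations) / len(recommendations))
    print(f"cut-off: {cut_off} of 65 "
          f"({100 * cut_off / 65:.0f} % correct required)")  # 52 of 65, 80 %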

Item bank for the theory test
With a computerised test, the possibility of creating an item bank arises. There are several advantages of item banking. One advantage is that an item bank reduces the amount of time it takes to compose a test. Wiberg (2002) studied how an item bank might be constructed and designed for the theory test.

Since the test is administered to many test-takers all over the country almost every day of the year, the item exposure is high. The items need to be changed continuously or they will become familiar to the test-takers in advance. The idea of an item bank is to facilitate test construction by having a collection of many pre-tested items that can be drawn on when constructing a test. An item bank can be used to compose a computerised test or a test based on CAT. Wiberg (2002) discussed the number of items needed in an item bank for the theory test. The items have to cover all the content areas of the test. Since the theory test can be taken several times, until the test-taker passes, an item bank for the theory test has to be relatively large. The test should also exist in more than one version, so that the test-takers do not become familiar with the items in advance. The number of items required in an item bank for a computerised test is smaller than for a paper-and-pencil test: since the items and the options are randomised, the items can be kept longer. Since the theory test measures a restricted area of knowledge it is difficult to vary the content of the test, and the intense administration threatens the test through greater item exposure than for a test that is administered once or twice every year.

Wiberg (2002) made some suggestions for aspects to consider in the construction of an item bank. It is important to pre-test the items before they are put in the item bank, in order to control the difficulty level and item discrimination. When the items are put in a test, it is important to check that the item parameters do not change between the try-out and the real test. It is also important to control the item exposure, so that the test-takers do not get to know the items before the test, and to check for deviant response patterns that may indicate that test-takers are taking the test in order to memorise the items. Another suggestion was that the items could be categorised in different ways to facilitate the construction of parallel versions of the test.
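
One way to store the identifying variables discussed above is as a structured record per item. The fields below are an assumption about what such a record could contain, not the actual design of the SNRA's bank.

    from dataclasses import dataclass

    @dataclass
    class BankedItem:
        # Hypothetical item-bank record: content classification, try-out
        # statistics and an exposure counter for controlling item exposure.
        item_id: str
        content_area: int        # e.g. 3 = traffic regulations
        difficulty: float        # proportion correct in the try-out
        discrimination: float    # e.g. point-biserial from the try-out
        times_administered: int = 0
        retired: bool = False

    def select_items(bank, area, n, max_exposure=10000):
        # Pick the n least-exposed usable items from one content area.
        usable = [i for i in bank
                  if i.content_area == area and not i.retired
                  and i.times_administered < max_exposure]
        return sorted(usable, key=lambda i: i.times_administered)[:n]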

A sequential approach to the theory test
At present, each version of the theory test consists of a fixed number of items, which means that all test-takers answer all items on their test version. If an item bank is constructed and used, it can support at least two other types of test: tests based on CAT and sequential tests. In a sequential test, items are given to the test-taker until we can conclude that the test-taker either has or lacks the ability being tested.

Wiberg (2003b) studied the difference between tests with a fixed number of items and sequential tests. In sequential testing, the test is terminated as soon as a decision can be made on whether the test-taker should pass or fail. This means that different test-takers will get different numbers of items and that their tests will therefore differ. With a sequential approach the tests will, on the whole, be shorter. With shorter tests each item is less exposed to the test-takers than in a test with a fixed number of items, where all items are given to all test-takers. However, in order to use fewer items in a test we need items that are efficient, in the sense that they give us as much information as possible about which test-takers have the ability and which lack it. Items that give us this information efficiently are labelled optimal items, and they share the same optimal item characteristics. Wiberg (2003b) showed that the optimal item characteristics are: a low chance of being answered correctly through guessing, high item discrimination, and an item difficulty close to the cut-off score. Note that these are the same as in tests with a fixed number of items, except for the item difficulty, which is slightly different. The problem with this approach is that, although we know which characteristics an optimal item should possess, we might not manage to create such items. Therefore, test-takers will probably have to be given tests that contain non-optimal items. Tests with non-optimal items will be longer than tests with only optimal items; how much longer depends on how far the item characteristics are from the optimal item characteristics.
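
A common way to formalise the stopping rule of such a pass/fail test is Wald's sequential probability ratio test. The report does not state which decision rule Wiberg studied, so the sketch below is a generic illustration with invented parameters.

    import math
    import random

    # Sequential probability ratio test (SPRT): keep administering items
    # until the accumulated evidence favours "pass" or "fail".
    P_MASTER, P_NONMASTER = 0.85, 0.60   # P(correct) under each hypothesis
    ALPHA = BETA = 0.05                  # tolerated decision-error rates
    UPPER = math.log((1 - BETA) / ALPHA)   # decide "pass" above this bound
    LOWER = math.log(BETA / (1 - ALPHA))   # decide "fail" below this bound

    def sequential_test(p_true, max_items=65):
        llr = 0.0  # log-likelihood ratio: master vs non-master
        for n in range(1, max_items + 1):
            correct = random.random() < p_true   # simulate one response
            if correct:
                llr += math.log(P_MASTER / P_NONMASTER)
            else:
                llr += math.log((1 - P_MASTER) / (1 - P_NONMASTER))
            if llr >= UPPER:
                return "pass", n
            if llr <= LOWER:
                return "fail", n
        return "undecided", max_items   # fall back to a fixed-length rule

    print(sequential_test(p_true=0.9))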

Results of the Swedish driving-license test
When the new theory test was introduced in 1999, there was a need to assess its effects. There was also a need to study the relationship between the theory test and the practical test. Wolming (2000b) presented a description of the test-takers' results on the new computerised theory test and studied the relationship between the theory test and the practical test. A further purpose of the study was to examine the difference between professionally and privately educated learner-drivers' results on the theory test and the practical test. Since the test-takers had to pass the theory test before they were allowed to take the practical test, only candidates who passed the theory test were included in the sample.

Parallel test versions and the relationship between the tests
The description of the new test showed that there were no significant differences between the versions of the theory test. The analysis of the relationship between the tests showed that test-takers with a high performance on the theory test had a higher pass-rate on the practical test than test-takers with a lower score on the theory test. All parts of the practical test were related to the scores on the theory test, to different degrees. The probability of a test-taker passing the practical test increased with a higher score on the theory test. There were major differences between the various parts of the practical test. "Traffic behaviour" and "attentiveness" were the two parts that each resulted in more failures than the other parts of the test. Of all the parts of the practical test, these also had the strongest correlations with the theory test. The weakest correlation between the theory test and the practical test was found for the "manoeuvring" part.

Private or professional education
About seventy per cent of the candidates in the sample were categorised as learner-drivers with professional education and about thirty per cent as learner-drivers with private education. The categorisation was based on the student's entry to the practical test, i.e. whether the student entered as a learner-driver from a traffic school or as a private learner-driver. This categorisation was not clear-cut, since it is possible that students who entered as private learner-drivers had been taking lessons at a driving school. The comparison between privately and professionally taught learner-drivers showed that the latter scored 1.1 points higher, on average, on the theory test. The differences on the practical test were even larger. The professionally educated learner-drivers made fewer errors than the privately educated learner-drivers on all parts of the practical test except the manoeuvring part. The pass-rate on the practical test was seventy per cent for the professionally educated learner-drivers and fifty-four per cent for the privately educated learner-drivers.


The relationship between performance on the two parts of the practical test (traffic behaviour and attentiveness) and the result on the theory test indicated that the ability to behave well in traffic and cope with traffic situations depends on the knowledge obtained through the education in driving theory. The results of the study also indicated that there is a benefit of professional education, since learner-drivers from traffic schools make fewer errors on the practical test than privately educated learner-drivers.

Validating the results
In order to validate the results of the study conducted by Wolming (2000b), and to study the test-takers' results on the theory test and the practical test further, another study was conducted (Sundström, 2003). The first purpose of the study was to examine the structure of the tests and the performance of the test-takers. A sample of performance data from 6278 test-takers was collected to study the theory test. Compared to the study by Wolming (2000b), the structure of the tests and the performance of the test-takers seemed stable over time. The second aim was to examine the relation between the two tests. Performance data from 1791 test-takers were obtained, and the result showed a moderate correlation between the score on the theory test and the performance on the practical test. The third purpose was to investigate the effect of private and professional education on the test-takers' performance on the tests. Learner-drivers attending professional education performed better on both the theory test and the practical test than learner-drivers attending private education. One problem with the analysis was that the categorisation of learner-drivers was not clear-cut. In order to study the effect of education further, it would be desirable to obtain a more reliable categorisation of learner-drivers. This categorisation could be achieved by administering a questionnaire to a sample of newly qualified drivers. In order to get a more reliable estimate of the relationship between the theory test and the practical test, it would also be desirable to let all test-takers take the practical test irrespective of their result on the theory test.


Driver education's effect on test performance
As shown in previous studies (Wolming, 2000b; Sundström, 2003), the type of driver education seems to affect performance on the theory test and the practical test. However, the categorisation of the test-takers' driver education is unreliable, since the test-takers are categorised as private learners or students from traffic school on the basis of their entry to the test. The first aim of a study conducted by Sundström (2004) was therefore to create an unambiguous categorisation of the test-takers' driver education and to use this categorisation to investigate differences in test performance between the categories. The second aim was to examine the arrangement of the practical education and the education in driving theory for the different categories. The third aim was to study the test-takers' opinions of the theory test and the practical test. In order to categorise the test-takers and investigate the content and arrangement of the education, a questionnaire was designed. The questionnaire was administered to 245 persons who took the practical driving-license test in April 2003, and it resulted in answers from 142 respondents (58 %). The result indicated that the previous categorisation did not fit reality, since only twenty per cent were categorised in the same way in the new model as in the old one. The result also showed that there is a need for professional support in private driver training, since private learners tended to practise some exercises earlier than students from traffic school. The results also suggested that private training is important for practising basic manoeuvres like "manoeuvring" and "changing gear", since these exercises were ranked highly by the private learners. In addition, private training is important for practising more advanced driving, since "roundabouts" were also ranked highly by the respondents. The respondents' opinions of the tests were also examined. Students from traffic school were more satisfied with the content of the theory test and the practical test than the private learners. Unlike the private learners, the majority of the students from traffic school thought that the content of the education corresponded well to the content of the tests. One explanation for these results might be that the content of the professional education makes the students better prepared for the tests than private training alone. Finally, the categories were compared with regard to their test performance. The result showed that traffic-school students, and students who had combined lessons at a traffic school with private training, had somewhat higher pass-rates on the practical tests than students with private training only.

Due to the sample size and the response-rates there were few responses in each category, which made it difficult to make a more detailed comparison of the categories' test performance. In order to compare the test performance of the different categories, the questionnaire could be revised to include only the information necessary for the categorisation of the test-takers. With a less extensive questionnaire the sample size could be increased and the response-rates would probably be better, which would facilitate a comparison of the categories.

Driver education in the Nordic countries
In March 2002 the Department of Educational Measurement arranged a conference on the driving-license tests in the Nordic countries. The focus of the conference was the theory test in Norway, Denmark, Finland and Sweden. The purpose of the conference was to describe both the driver education and the theory test in the Nordic countries, to present research that had been done and future projects, and to give an opportunity to discuss ways for the Nordic countries to collaborate and exchange information. As a result of the conference, Henriksson, Sundström and Wiberg (2002) provided an overview of the driver education in Sweden, Norway, Denmark and Finland. Similarities and differences between the countries' driver education were discussed in the report. One conclusion was that driver education in Finland and Denmark shares many similarities, as does driver education in Norway and Sweden. One example is private driver education: in Sweden and Norway private education is encouraged because of the driving experience that comes with it, whereas in Denmark private driver education is not allowed at all and in Finland this kind of training is quite rare. Denmark and Finland also have a more regulated driver education, in which the learner-drivers have to take a minimum number of driving lessons. In Norway the number of compulsory lessons is smaller than in Denmark and Finland, and in Sweden the learner-drivers do not have to take any lessons at all, except for the risk-education on a skid-pan, before taking the driving-license test. The driver education in the Nordic countries also shares some similarities. In all countries skid-training is a part of the education, the students also have to pass the theory test before they are allowed to take the practical test, and the age of licensing is eighteen in all of the countries.

Concerning the theory test, there were also some differences and similarities between the countries. As mentioned above, the theory test in Sweden is computerised and consists of sixty-five items plus five try-out items. The theory test in Norway is also computerised and consists of forty-five multiple-choice items where more than one option can be correct. The theory test in Finland is carried out on a computer and contains fifty items that have to be answered within a restricted amount of time. In Denmark the theory test is a written test that consists of twenty-five multiple-choice items; pictures of different traffic situations are shown to the test-takers with a projector, the options are read from a tape, and more than one option can be correct. In all countries the test-takers have to answer at least 80 % of the items correctly in order to pass the test. The conference also addressed the need for a platform for discussing the driving-license test in the Nordic countries, and it gave the opportunity to discuss ways for the Nordic countries to collaborate and exchange information about the theory test.

Curriculum, driver education and driver testing
In order to compare the driver education systems in a number of countries, a literature study was conducted (Jonsson, Sundström & Henriksson, 2003). The focus of the study was the driver education systems, which consist of three parts: curriculum, driver education and driver testing. The first purpose of the study was to present the design of driver education programmes in the Nordic countries, Great Britain and Germany. The second purpose was to describe the theory test and the practical test in these countries, and to compare the tests with regard to psychometric criteria. The third aim was to present studies that have focused on the relationship between driver education and performance on the driving-license test. The fourth purpose was to present previously conducted studies aimed at improving the objectives of the driver education as well as the driving-license test. Finally, the fifth purpose was to describe the opportunities for the countries in question to assess the results of driver education.

The results of the study showed that there are three ways of determining whether the objectives of the curriculum have been met: conventional tests, compulsory education, or a combination of both. When the driver education systems in the different countries were compared, three categories of systems were found. The first category contains systems with little or no compulsory education and comprises the driver education systems in Great Britain and Sweden. In these systems private education is allowed. In driver education systems of this kind, where the level of compulsory education is minimal or non-existent, the driving-license test is the only way to verify that the test-takers have acquired the knowledge and abilities specified in the curriculum. The second category consists of systems with some compulsory education as well as private education. The combination of professional and private education might be fruitful, depending on how the education is arranged. Norway, Iceland and Finland are examples of driver education systems within this category. In the third category of systems the formal driver education is compulsory and private education is forbidden. This category comprises the driver education systems in Denmark and Germany. In driver education systems of this kind, the system owner has two ways of verifying that the student has gained the knowledge and abilities stated in the curriculum: through compulsory education and through testing.

Assessing the quality of the tests – reliability and validity
The theory tests and practical tests of the different countries were compared and discussed from a psychometric perspective (Jonsson, Sundström & Henriksson, 2003). As a result of the restricted use of compulsory education in Sweden and Great Britain, the quality of the student evaluation in these countries depends solely on the quality of the tests. Thus, the demands on the tests in terms of reliability and validity are high. In order to obtain a valid measure of the test-takers' knowledge and abilities there has to be agreement between the objectives of the curriculum, the driver education and the driving-license test. For this reason, it is important that the test has content validity, i.e. that the test covers the content of the curriculum. The analysis showed, for example, that the number of items is an important aspect: in order to get a reliable measurement it is important that there are enough items in the test representing each area of the curriculum. Another important issue is the time aspect.

In order to get a reliable measurement it is important that the test-takers have enough time to complete the test, if time is not meant to be the factor that discriminates between them. There should be a relationship between the number of items and the time-limit for the test, and the time-limit should allow all or nearly all test-takers to complete the test. The amount of time should also depend on the characteristics of the items. The item format was also addressed when comparing the theory tests in the different countries. All countries have multiple-choice items in their tests, but the number of correct options varies. The use of items having only one correct option gives an advantage, since such questions only measure one aspect and are thus more straightforward for the test-taker (Jonsson, Sundström & Henriksson, 2003). Finally, the analysis of the theory tests showed that there are two strategies for administering the test: by paper and pencil or by computer. One advantage of computerised tests is that the security of the test is increased when there are no paper copies of the test. Another advantage is that the try-out procedure is made easier. Through the try-out, which is unique to the Swedish theory test, the system owner is given the opportunity to obtain valuable information about new items before they are inserted in an ordinary test. The try-out items can be integrated into the ordinary test, making it easier to anticipate their quality. Computerised testing also offers other advantages: the items and their options can be randomised, and the administration of the test takes less time since the scoring can be done during the test. One disadvantage of computerised tests is that some test-takers are not used to working with computers, and this inexperience might be reflected in their test results.

When considering the reliability and validity of a practical test, there are other methods that can be used to maintain the standard of the test. One method that helps ensure the equality of different assessments is the use of standardised test routes. However, if standardised routes are used, they need to be continuously updated in order to ensure that the test-takers do not know the route in advance. If they do, the validity of the test is, of course, negatively affected. The advantage of tests where the examiner decides the route is that the content can be more varied. In addition, the validity of such an assessment is probably higher than that of an assessment with standardised test routes, since the test-takers are observed in the actual traffic environment.

When considering the reliability and validity of a practical test, there are other methods that can be used to maintain the standard of the test. One method that helps ensure the equality of different assessments is the use of standardised test routes. However, if standardised routes are used, they need to be continuously updated to ensure that the test-takers do not know the route in advance. If they do, the validity of the test is, of course, negatively affected. The advantage of tests where the examiner decides the route is that the content can be more varied. In addition, the validity of such an assessment is probably higher than that of an assessment with standardised test routes, since the test-takers are observed in the actual traffic environment. One problem with this kind of assessment is to obtain equality between different assessments, so that a student is judged in the same way by all examiners. In this case it is important that the examiners are trained to assess in the same way. Another problem is that greater demands are placed on the examiner in terms of choosing a route that covers the content of the curriculum. In these cases it is also important that the time for the driving test is long enough to ensure that the content stated in the curriculum is covered.

In most of the countries in the study, the assessment of the practical test is based on a holistic impression of the test-taker's performance. However, the criteria for judgement differ between the countries: some countries focus on certain competences during the test, while others focus on the errors made by the test-taker. Moreover, the length of the testing time is of great importance for the reliability and validity of the evaluation, since the examiner needs sufficient time to obtain a good basis for the decision.

Another important aspect of the quality of the test is that the test-taker's result should be independent of which driver examiner assesses the test-taker. There are several possible ways to check the reliability of a test, and it is of great importance for the quality of the test that these evaluations are made. In order to facilitate agreement between examiners it is important that there are clearly defined guidelines for the assessment of the test-takers, so that the examiners assess the test-takers on the same basis.
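One common way to check that results are independent of the examiner is to let two examiners judge the same test performances and compute an agreement statistic. A minimal sketch using Cohen's kappa with hypothetical pass/fail judgements; the report itself does not prescribe a particular statistic:

    def cohens_kappa(ratings_a, ratings_b):
        """Cohen's kappa for two raters judging the same performances:
        agreement beyond what would be expected by chance."""
        assert len(ratings_a) == len(ratings_b)
        n = len(ratings_a)
        categories = set(ratings_a) | set(ratings_b)

        # Observed agreement: proportion of identical judgements.
        p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

        # Expected agreement if the two raters judged independently.
        p_expected = sum(
            (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
        )
        return (p_observed - p_expected) / (1 - p_expected)

    # Hypothetical example: two examiners assess the same 10 practical tests.
    a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
    b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
    print(round(cohens_kappa(a, b), 2))  # -> 0.52

A kappa near 1 indicates that the examiners apply the guidelines in the same way; a low value signals that training or clearer assessment criteria are needed.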

Assessment of attitudes and motives

The purpose of the driving-license test is to determine whether the student has acquired the knowledge and abilities specified in the curriculum. Today, when the curricula in many countries emphasise the students' attitudes towards driving, driver education systems with no compulsory education encounter a major problem, since it is difficult to obtain a reliable measurement of the students' attitudes through driver testing. Therefore, the system owner might need to consider other ways of evaluating the attitudes of the test-taker. One way of evaluating the attitudes is through compulsory education, where qualified teachers indirectly assess the student.

Further research

The research on the driving-license test conducted by the Department of Educational Measurement over the past twelve years has contributed significantly to the development and improvement of the theory test. Still, further improvements to the theory test and the practical test can be made. Based on the results and conclusions of the studies conducted within the project, some suggestions for further studies can be made.

Previous studies (Wolming, 2000b; Sundström, 2003) indicate that there is a relationship between the theory test and the practical test, in the sense that better performance on the theory test is related to fewer errors on the practical test. One methodological problem with measuring the relationship between the tests is that the range is restricted, since the test-takers have to pass the theory test before they are allowed to take the practical test. In order to solve the restricted-range problem and to explore the actual relationship between the tests, it would be desirable to let test-takers take the practical test regardless of their result on the theory test.
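A statistical alternative, sometimes used when such a design is not feasible, is to correct the observed correlation for range restriction (Thorndike's Case 2 correction, a standard classical test theory result). A minimal sketch with purely hypothetical figures, not an analysis from the studies cited above:

    import math

    def correct_for_range_restriction(r, sd_unrestricted, sd_restricted):
        """Thorndike's Case 2 correction: estimates the correlation in the
        full population from the correlation observed in a group selected
        on the first variable (here: those who passed the theory test)."""
        u = sd_unrestricted / sd_restricted
        return (r * u) / math.sqrt(1 + r * r * (u * u - 1))

    # Hypothetical figures: observed correlation 0.20 among those admitted
    # to the practical test; the theory-test score spread in the full
    # population is 1.5 times the spread in the selected group.
    print(round(correct_for_range_restriction(0.20, 1.5, 1.0), 2))  # -> 0.29

The correction rests on strong assumptions (linearity and homoscedasticity), which is why letting test-takers take the practical test regardless of their theory-test result would give a more direct answer.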

Studies have also been conducted to examine the effect that the type of driver education has on performance on the theory test and the practical test. The first studies (Wolming, 2000b; Sundström, 2003) indicated that students with professional education performed better on both the theory test and the practical test than private learner-drivers. In these studies the categorisation of students as professional or private learner-drivers was based on their entry to the driving-license test. In order to obtain a more reliable categorisation of privately and professionally educated learner-drivers, a questionnaire was developed and administered to 245 people who had recently taken the practical test (Sundström, 2004). Based on the information from the questionnaire, the respondents were assigned to four driver-education categories. The first category contained private learner-drivers and the second category contained professional learner-drivers. The third category comprised students who combined private training with fewer than eleven hours of professional instruction, and the fourth category contained students who combined private training with at least eleven hours of professional instruction. Due to the sample size and the response rate there were few respondents in each category, and it was therefore difficult to compare the categories with regard to test performance. In order to study the test performance of the different categories, the questionnaire could be revised to include only the questions necessary for the categorisation. With a less extensive questionnaire, a sample large enough to allow reliable comparisons of test performance between the categories could be drawn.

For many years the focus of driver education and driver testing has been vehicle-related knowledge, manoeuvring and driving in traffic (Jonsson, Sundström & Henriksson, 2003). Lately, some countries have shifted the focus from these more mechanical aspects towards the role of the driver and aspects such as self-evaluation, attitudes and motives. For instance, one element of a proposed new curriculum for driver education in Sweden is to judge the student's capacity for self-evaluation. An important question in relation to this is how the students' capacity for self-evaluation should be assessed. In the field of psychological testing many self-reports and self-assessments are used, and from this perspective it would be possible to measure these kinds of aspects. However, the driving-license test differs from most psychological tests in that it is used for examination. Therefore, it could be difficult to obtain a reliable measurement of the students' capacity for self-evaluation. In order to investigate the possibilities of assessing aspects like self-evaluation, further research is needed.


References

Anastasi, A. (1988). Psychological testing. Sixth Edition. New York: Macmillan.
Anderson, L.W., Krathwohl, D.R., Airasian, P.W., Cruikshank, K.A., Mayer, R.E., Pintrich, P.R., Raths, J., & Wittrock, M. (2001). A taxonomy for learning, teaching, and assessing. A revision of Bloom's taxonomy of educational objectives. New York: Longman.
Andersson, K. (Ed.). (1999). Högskoleprovet. Konstruktion, resultat och erfarenheter. [The Swedish Scholastic Aptitude Test. Construction, results and findings] (Pedagogiska mätningar, Nr 153). Umeå universitet: Enheten för pedagogiska mätningar.
Berk, R.A. (Ed.). (1984). A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.
Berk, R.A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56, 137-172.
Berk, R.A. (1996). Standard setting: the next generation (where few psychometricians have gone before!). Applied Measurement in Education, 9(3), 215-235.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bloom, B.S. (Ed.). (1956). Taxonomy of educational objectives, handbook 1: The cognitive domain. New York: McKay.
Bloom, B.S., Hastings, J.T., & Madaus, G.F. (1971). Handbook on formative and summative evaluation of student learning. New York: McGraw-Hill.

Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: an application of an EM algorithm. Psychometrika, 46, 443-459.
Bunderson, C.V., Inouye, D.K., & Olsen, J.B. (1989). The four generations of computerized educational measurement. In R.L. Linn (Ed.), Educational measurement. Third Edition. New York: American Council on Education and Macmillan.
Carey, L.M. (1994). Measuring and evaluating school learning. Needham Heights, MA: Allyn & Bacon.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart & Winston.
Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Davey, T., & Parshall, C.G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA.
Ebel, R.L. (1951). Writing the test item. In E.F. Lindquist (Ed.), Educational measurement. First Edition. Washington, DC: American Council on Education.
Ebel, R.L. (1972). Essentials of educational measurement. Second Edition. Englewood Cliffs, NJ: Prentice Hall.
Franke, A., Larsson, L., & Mårdsjö, A-C. (1995). Förarutbildningssystemet i Sverige. Delrapport 1. En historisk beskrivning av förarutbildningssystemet i Sverige. [The Swedish driver education system in a historical perspective] (Rapport Nr 1995:16). Göteborgs universitet: Institutionen för pedagogik.
Frederiksen, J.R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18(9), 27-32.

Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: some questions. American Psychologist, 18, 519-521.
Gronlund, N.E. (1998). Assessment of student achievement. Sixth Edition. Boston: Allyn and Bacon.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Haladyna, T.M. (1997). Writing test items to measure higher level thinking. Needham Heights, MA: Allyn & Bacon.
Haladyna, T.M. (1999). Developing and validating multiple-choice test items. Mahwah, NJ: Lawrence Erlbaum Associates.
Haladyna, T.M., & Hess, R.K. (1999). Conjunctive and compensatory standard-setting models in high-stakes testing. Educational Assessment, 6(2), 129-153.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. London: Sage Publications.
Hanna, G.S. (1993). Better teaching through better measurement. Orlando: Harcourt Brace Jovanovich.
Helmstadter, G.C. (1964). Principles of psychological measurement. New York: Appleton-Century-Crofts.
Henriksson, W. (1996a). The process of test construction. Jamaica: Primary Education Improvement Project.
Henriksson, W. (1996b). The process of item construction. Jamaica: Primary Education Improvement Project.
Henriksson, W., Ehn, P., Wikström, P., & Zolland, A. (1995). Models for the theoretical and practical parts of the Swedish driver's license test. In C. Stage & W. Henriksson, Notes from the Third International SweSAT Conference, Umeå, May 27-30 (Educational Measurement No 16). Umeå University: Department of Educational Measurement.

Henriksson, W., Karlsson, S., & Zolland, A. (1995). Klassificering av innehåll i körkortsprovets körprov utifrån en teoretisk modell. [Classification of the content of the practical driving-license test on the basis of a theoretical model] Paper presented at a Swedish National Road Association Conference in Vaxholm, 14-15 September 1995.
Henriksson, W., Karlsson, S., & Zolland, A. (1996a). En expertgrupps åsikter om förarkompetens som grund för en modell. [A group of experts and their opinions about driver ability as a basis for a model] Paper presented at a Swedish National Road Association Conference in Sjudarhöjden, 28-29 March 1996.
Henriksson, W., Karlsson, S., & Zolland, A. (1996b). En expertgrupps reflektioner kring bedömningsproblematik vid körkortsprovets körprov. [A group of experts and their reflections on the problem of judgement in the practical driving-license test] Paper presented at a Swedish National Road Association Conference in Sjudarhöjden, 28-29 March 1996.
Henriksson, W., Sundström, A., & Wiberg, M. (2002). Körkortsprovet i ett nordiskt perspektiv. [The theory test in the Nordic countries] (Pedagogiska mätningar, Nr 174). Umeå universitet: Enheten för pedagogiska mätningar.
Henriksson, W., Wikström, P., & Zolland, A. (1995). Modell för körkortsprovets teoriprov. Modellprövning och reflektioner. [Model for the theory test] (Pedagogiska mätningar, Nr 103). Umeå universitet: Enheten för pedagogiska mätningar.
Jaeger, R.M. (1989). Certification of student competence. In R.L. Linn (Ed.), Educational measurement. Third Edition. New York: American Council on Education and Macmillan.

Jonsson, H., Sundström, A., & Henriksson, W. (2003). Curriculum, driver education and driver testing. A comparative study of the driver education systems in some European countries (Educational Measurement No 44). Umeå University: Department of Educational Measurement.
Kubiszyn, T., & Borich, G. (1996). Educational testing and measurement. Fifth Edition. Glenview, IL: Scott Foresman.
Lindström, J-O., Nyström, P., & Palm, T. (1996). Nationellt kursprov i matematik, kurs A, C och E, vt 1996. Resultat och kommentarer. [The national test in mathematics for courses A, C and E. Results and comments] (Pedagogiska mätningar, Nr 118). Umeå universitet: Enheten för pedagogiska mätningar.
Linn, R.L. (Ed.). (1989). Educational measurement. Third Edition. New York: American Council on Education and Macmillan.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Martinez, M.E. (1999). Cognition and the question of test format. Educational Psychologist, 34(4), 207-218.
Mattsson, H. (1990). Nytt teoriprov 1990. Statistisk beskrivning av körkortsprovet våren 1990. [Statistical description of the new theory test in 1990] (Pedagogiska mätningar, Nr 38). Umeå universitet: Enheten för pedagogiska mätningar.
Mattsson, H. (1993). Körkortsutbildningens teoriprov. Provet i ett forskningsperspektiv och olika utvecklingsmöjligheter. [The theory test in a measurement-theoretical perspective] (Pedagogiska mätningar, Nr 71). Umeå universitet: Enheten för pedagogiska mätningar.

McDonald, R.P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.
Mehrens, W.A. (1990). Combining evaluation data from multiple sources. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers. Newbury Park, CA: Sage.
Mehrens, W.A., & Lehmann, I.J. (1991). Measurement and evaluation in education and psychology. Fourth Edition. New York: Holt, Rinehart and Winston.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement. Third Edition. New York: American Council on Education and Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validity of performance assessment. Educational Researcher, 23, 13-24.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R.L. Linn (Ed.), Educational measurement. Third Edition. New York: American Council on Education and Macmillan.
Mislevy, R.J. (1984). Estimating latent distributions. Psychometrika, 49, 359-381.
Mislevy, R.J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Molander, M. (1997). Automobilbesiktningsmannen. En kort historik om fordonskontroll och förarprövning. [The car examiner. A short history of vehicle inspection and driver examination] (Publikation 1997:65). Borlänge: Vägverket.
Nitko, A.J. (1996). Educational assessment of students. Englewood Cliffs, NJ: Prentice Hall.

Nyström, P. (2004). Rätt mätt på prov. Om validering av bedömningar i skolan. [Validation of educational assessments] Doctoral dissertation, Faculty of Social Sciences, Umeå University.

Oosterhof, A. (1994). Classroom applications of educational measurement. Second Edition. New York: Macmillan.
Payne, D.A. (1997). Applied educational assessment. New York: Wadsworth Publishing Company.
Popham, W.J. (1990). Modern educational measurement: A practitioner's perspective. Boston: Allyn and Bacon.
Rigdon, S.E., & Tsutakawa, R.K. (1983). Parameter estimation in latent trait models. Psychometrika, 48, 567-574.
Rodriguez, M.C. (2002). Choosing an item format. In G. Tindal & T.M. Haladyna (Eds.), Large-scale assessment programs for all students. London: Lawrence Erlbaum Associates.
Ryan, J.M. (2002). Issues, strategies, and procedures for applying standards when multiple measures are employed. In G. Tindal & T.M. Haladyna (Eds.), Large-scale assessment programs for all students. London: Lawrence Erlbaum Associates.
Saunders, J.C., & Mappus, L.L. (1984). Accuracy and consistency of expert judges in setting passing scores on criterion-referenced tests: The South Carolina experience. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.
SFS 1992:1765. Förordning (1992:1765) om ändring i förordningen (1989:942) om ändring i körkortsförordningen (1977:722). [Regulation (1992:1765): alteration of regulation (1989:942) – alteration in the driving-license regulation] Stockholm: Sveriges Riksdag.
SFS 1998:488. Körkortslag. [Driving-license act] Stockholm: Näringsdepartementet.

SOU 1991:39. (1991). Säkrare förare: slutbetänkande av kommittén Körkort 2000. [Safer drivers. Final report from the committee "Driving-license 2000"] Stockholm: Kommunikationsdepartementet.
Spolander, K. (1974). Skriftliga differentierade förarprov. Uppföljning och analys av förarprovens egenskaper år 1973. [Analysis and follow-up of the theory test in 1973] (Rapport Nr 46). Stockholm: Statens väg- och trafikinstitut.
Sundström, A. (2003). Den svenska förarprövningen. Sambandet mellan kunskapsprovet och körprovet, provens struktur samt körkortsutbildningens betydelse. [A study of the relationship between the theory test and the practical test, the structure of the tests, and the effect of driver education on test performance] (Pedagogiska mätningar, Nr 183). Umeå universitet: Enheten för pedagogiska mätningar.
Sundström, A. (2004). Övningskörning privat och på trafikskola. En enkätstudie om körkortsutbildningens betydelse för provresultatet. [Private and professional driver education – a study of the effect of driver education on test performance] (Pedagogiska mätningar, Nr 190). Umeå universitet: Enheten för pedagogiska mätningar.
Söderström, T., & Mattsson, H. (1993). Trafikantutbildning i gymnasieskolan. Studie kring försök med frivillig trafikantutbildning vid Dragonskolan i Umeå. [Traffic education on a voluntary basis in upper secondary school] (Pedagogiska mätningar, Nr 78). Umeå universitet: Enheten för pedagogiska mätningar.
Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175-186.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Lawrence Erlbaum Associates.

Thissen, D., & Wainer, H. (2001). An overview of test scoring. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Lawrence Erlbaum Associates.
Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Lawrence Erlbaum Associates.
Trafiksäkerhetsverket. (1988). Förarutbildning för körkort. Preliminär kursplan/måldokument för behörighet B. [Driver education. Preliminary curriculum, class B vehicles] Borlänge: Trafiksäkerhetsverket.
TSVFS 1988:43. Trafiksäkerhetsverkets föreskrifter om kursplaner, behörighet B. [National Swedish Road Safety Office. Curriculum for driver education, class B vehicles] Borlänge: Vägverket.
Umar, J. (1997). Item banking. In J.P. Keeves (Ed.), Educational research, methodology, and measurement: An international handbook. Cambridge: Elsevier Science.
Van der Linden, W.J., & Glas, C.A.W. (Eds.). (2000). Computerized adaptive testing: Theory and practice. Dordrecht: Kluwer Academic Publishers.

Van Krimpen-Stoop, E.M., & Meijer, R.R. (2000). Detecting person misfit in adaptive testing using statistical process control techniques. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: Theory and practice. Dordrecht: Kluwer Academic Publishers.
Wainer, H., & Thissen, D. (2001). True score theory: The traditional method. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Lawrence Erlbaum Associates.
Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wiberg, M. (1999a). Målrelaterade och normrelaterade prov – en teoretisk granskning av vilka statistiska tekniker som kan användas för att beskriva uppgifternas kvalitet och provens reliabilitet. [Criterion-referenced and norm-referenced tests – a theoretical comparison] (Pedagogiska mätningar, Nr 150). Umeå universitet: Enheten för pedagogiska mätningar.
Wiberg, M. (1999b). Datoriseringen av teoriprovet. En beskrivning av effekter utifrån ett antal statistiska indikatorer. [The computerization of the theory test] (Pedagogiska mätningar, Nr 158). Umeå universitet: Enheten för pedagogiska mätningar.
Wiberg, M. (2002). Uppgiftsbank för körkortsprovets teoretiska prov. Relationen mellan utformningen, exponeringen och provtypen. [Item bank for the theory test] (Pedagogiska mätningar, Nr 173). Umeå universitet: Enheten för pedagogiska mätningar.
Wiberg, M. (2003a). An optimal design approach to criterion-referenced computerized tests. Journal of Educational and Behavioral Statistics, 28(2), 97-110.
Wiberg, M. (2003b). Computerized achievement tests – sequential and fixed length tests. Doctoral thesis, Umeå University.
Wiberg, M., & Henriksson, W. (2000). Metoder för kravgränssättning. En teoretisk granskning samt diskussion av lämplig metod för ett målrelaterat certifieringsprov av typ körkortsprovets teoriprov. [Methods for standard setting] (Pedagogiska mätningar, Nr 165). Umeå universitet: Enheten för pedagogiska mätningar.

Wolming, S. (2000a). Validering av urval. [Validation of selection] Umeå universitet: Pedagogiska institutionen.
Wolming, S. (2000b). Förarprövningens struktur och resultat. En studie av relationen mellan kunskapsprov och körprov samt utbildningsbakgrundens betydelse. [The structure and results of the driving-license examination] (Pedagogiska mätningar, Nr 166). Umeå universitet: Enheten för pedagogiska mätningar.
VVFS 1996:168. Vägverkets författningssamling. Vägverkets föreskrifter om kursplaner, behörighet B. [Regulations concerning the curriculum, class B] Borlänge: Vägverket.
VVFS 1999:32. Vägverkets författningssamling. Vägverkets föreskrifter om ändring i föreskrifterna (VVFS 1998:53) om förarprov, behörighet B. [Alteration of the regulations concerning the driving-license test, class B] Borlänge: Vägverket.
Yalow, E.S., & Popham, W.J. (1983). Appraising the preprofessional skills test for the State of Texas (Report No. 5). Culver City, CA: IOX Assessment Associates.
Zolland, A. (1999). Analys av körkortsprovets kursplansstruktur. [Analysis of the structure of the curriculum for the driving-license test] (Pedagogiska mätningar, Nr 157). Umeå universitet: Enheten för pedagogiska mätningar.
Zolland, A., & Henriksson, W. (1998). Analys av det teoretiska körkortsprovet utifrån modeller och statistiska data. [Analysis of the theory test – models and statistical data] (Pedagogiska mätningar, Nr 134). Umeå universitet: Enheten för pedagogiska mätningar.

Appendix 1

THE THEORETICAL MODEL
______________________________________________

Main factors
* Driver
* Vehicle
* Environment

2-way interactions
Driver → Vehicle
Driver → Environment
Vehicle → Driver
Vehicle → Environment
Environment → Driver
Environment → Vehicle

3-way interactions
Driver → Vehicle → Environment
Driver → Environment → Vehicle
Vehicle → Environment → Driver
Vehicle → Driver → Environment
Environment → Driver → Vehicle
Environment → Vehicle → Driver
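The interaction structure above is simply the set of all ordered sequences of two and three distinct main factors. A minimal sketch (an illustration added here, not part of the original model description) that enumerates exactly the six 2-way and six 3-way interactions listed:

    from itertools import permutations

    FACTORS = ("Driver", "Vehicle", "Environment")

    # Ordered 2-way (6) and 3-way (6) interactions of the model.
    for r in (2, 3):
        for sequence in permutations(FACTORS, r):
            print(" -> ".join(sequence))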

EDUCATIONAL MEASUREMENT
Reports already published in the series

EM No 1. SELECTION TO HIGHER EDUCATION IN SWEDEN. Ingemar Wedman
EM No 2. PREDICTION OF ACADEMIC SUCCESS IN A PERSPECTIVE OF CRITERION-RELATED AND CONSTRUCT VALIDITY. Widar Henriksson, Ingemar Wedman
EM No 3. ITEM BIAS WITH RESPECT TO GENDER INTERPRETED IN THE LIGHT OF PROBLEM-SOLVING STRATEGIES. Anita Wester
EM No 4. AVERAGE SCHOOL MARKS AND RESULTS ON THE SWESAT. Christina Stage
EM No 5. THE PROBLEM OF REPEATED TEST TAKING AND THE SweSAT. Widar Henriksson
EM No 6. COACHING FOR COMPLEX ITEM FORMATS IN THE SweSAT. Widar Henriksson
EM No 7. GENDER DIFFERENCES ON THE SweSAT. A Review of Studies since 1975. Christina Stage
EM No 8. EFFECTS OF REPEATED TEST TAKING ON THE SWEDISH SCHOLASTIC APTITUDE TEST (SweSAT). Widar Henriksson, Ingemar Wedman

1994
EM No 9. NOTES FROM THE FIRST INTERNATIONAL SweSAT CONFERENCE. May 23-25, 1993. Ingemar Wedman, Christina Stage
EM No 10. NOTES FROM THE SECOND INTERNATIONAL SweSAT CONFERENCE. New Orleans, April 2, 1994. Widar Henriksson, Sten Henrysson, Christina Stage, Ingemar Wedman and Anita Wester
EM No 11. USE OF ASSESSMENT OUTCOMES IN SELECTING CANDIDATES FOR SECONDARY AND TERTIARY EDUCATION: A COMPARISON. Christina Stage
EM No 12. GENDER DIFFERENCES IN TESTING. DIF analyses using the Mantel-Haenszel technique on three subtests in the Swedish SAT. Anita Wester

1995
EM No 13. REPEATED TEST TAKING AND THE SweSAT. Widar Henriksson
EM No 14. AMBITIONS AND ATTITUDES TOWARD STUDIES AND STUDY RESULTS. Interviews with students of the Business Administration study program in Umeå, Sweden. Anita Wester
EM No 15. EXPERIENCES WITH THE SWEDISH SCHOLASTIC APTITUDE TEST. Christina Stage
EM No 16. NOTES FROM THE THIRD INTERNATIONAL SweSAT CONFERENCE. Umeå, May 27-30, 1995. Christina Stage, Widar Henriksson
EM No 17. THE COMPLEXITY OF DATA SUFFICIENCY ITEMS. Widar Henriksson
EM No 18. STUDY SUCCESS IN HIGHER EDUCATION. A comparison of students admitted on the basis of GPA and SweSAT-scores with and without credits for work experience. Widar Henriksson, Simon Wolming

1996
EM No 19. AN ATTEMPT TO FIT IRT MODELS TO THE DS SUBTEST IN THE SweSAT. Christina Stage
EM No 20. NOTES FROM THE FOURTH INTERNATIONAL SweSAT CONFERENCE. New York, April 7, 1996. Christina Stage

1997
EM No 21. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SWESAT. A study of the DTM subtest. Christina Stage
EM No 22. ITEM FORMAT AND GENDER DIFFERENCES IN MATHEMATICS AND SCIENCE. A study on item format and gender differences in performance based on TIMSS data. Anita Wester, Widar Henriksson
EM No 23. DO MALES AND FEMALES WITH IDENTICAL TEST SCORES SOLVE TEST ITEMS IN THE SAME WAY? Christina Stage
EM No 24. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SweSAT. A Study of the ERC Subtest. Christina Stage
EM No 25. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SweSAT. A Study of the READ Subtest. Christina Stage
EM No 26. THE APPLICABILITY OF ITEM RESPONSE MODELS TO THE SweSAT. A Study of the WORD Subtest. Christina Stage
EM No 27. DIFFERENTIAL ITEM FUNCTIONING (DIF) IN RELATION TO ITEM CONTENT. A study of three subtests in the SweSAT with focus on gender. Anita Wester
EM No 28. NOTES FROM THE FIFTH INTERNATIONAL SWESAT CONFERENCE. Umeå, May 31 - June 2, 1997. Christina Stage

1998
EM No 29. A Comparison Between Item Analysis Based on Item Response Theory and on Classical Test Theory. A Study of the SweSAT Subtest WORD. Christina Stage
EM No 30. A Comparison Between Item Analysis Based on Item Response Theory and on Classical Test Theory. A Study of the SweSAT Subtest ERC. Christina Stage
EM No 31. NOTES FROM THE SIXTH INTERNATIONAL SWESAT CONFERENCE. San Diego, April 12, 1998. Christina Stage

1999
EM No 32. NONEQUIVALENT GROUPS IRT OBSERVED SCORE EQUATING. Its Applicability and Appropriateness for the Swedish Scholastic Aptitude Test. Wilco H.M. Emons
EM No 33. A Comparison Between Item Analysis Based on Item Response Theory and on Classical Test Theory. A Study of the SweSAT Subtest READ. Christina Stage
EM No 34. Predicting Gender Differences in WORD Items. A Comparison of Item Response Theory and Classical Test Theory. Christina Stage
EM No 35. NOTES FROM THE SEVENTH INTERNATIONAL SWESAT CONFERENCE. Umeå, June 3-5, 1999. Christina Stage

2000
EM No 36. TRENDS IN ASSESSMENT. Notes from the First International SweMaS Symposium, Umeå, May 17, 2000. Jan-Olof Lindström (Ed)
EM No 37. NOTES FROM THE EIGHTH INTERNATIONAL SWESAT CONFERENCE. New Orleans, April 7, 2000. Christina Stage

2001
EM No 38. NOTES FROM THE SECOND INTERNATIONAL SWEMAS CONFERENCE. Umeå, May 15-16, 2001. Jan-Olof Lindström (Ed)
EM No 39. PERFORMANCE AND AUTHENTIC ASSESSMENT, REALISTIC AND REAL LIFE TASKS: A CONCEPTUAL ANALYSIS OF THE LITERATURE. Torulf Palm
EM No 40. NOTES FROM THE NINTH INTERNATIONAL SWESAT CONFERENCE. Umeå, June 4-6, 2001. Christina Stage
EM No 41. THE EFFECTS OF REPEATED TEST TAKING IN RELATION TO THE TEST TAKER AND THE RULES FOR SELECTION TO HIGHER EDUCATION IN SWEDEN. Widar Henriksson, Birgitta Törnkvist
EM No 42. CLASSICAL TEST THEORY OR ITEM RESPONSE THEORY: THE SWEDISH EXPERIENCE. Christina Stage
EM No 43. THE SWEDISH NATIONAL COURSE TESTS IN MATHEMATICS. Jan-Olof Lindström
EM No 44. CURRICULUM, DRIVER EDUCATION AND DRIVER TESTING. A comparative study of the driver education systems in some European countries. Henrik Jonsson, Anna Sundström, Widar Henriksson