
13.4. Criteria of good tests

It is always the tester’s task to provide the best solution to a particular testing problem. What all solutions have in common, however, is that every test or testing system has to fulfil the following requirements:

- consistently provide accurate measures of precisely the abilities in which we are interested (validity and reliability)

- have a beneficial effect on teaching, in those cases where the test is likely to influence teaching (washback)

- be economical in terms of time and money (practicality) (Hughes, 2003)

13.4.1. Validity

Validity always refers to the degree to which the gathered empirical evidence supports the adequacy and appropriateness of the inferences that are made from the scores (Bachman, 1990). It means that the interpretations and uses that we make of test scores are to be valid; in other words, a test is said to be valid to the extent that it measures what it is supposed to measure (Alderson, 1995). If the test is not valid for the purpose for which it was designed, the scores do not mean what they are supposed to mean. There are different types of validity, which in reality are different ‘methods’ of assessing validity (Bachman, 1990). Three main types/aspects of validity can be distinguished: internal, external and construct validity.

Internal validity relates to the perceived content of the test and its perceived effect. It has three aspects: face validity, content validity and response validity.

- Face validity refers to the surface credibility or public acceptability of the test. It involves intuitive judgements about the test content made by so-called ‘lay’ people, who are involved in the testing process but who are not experts in testing, such as non-expert users, students, their teachers and administrators. If test takers accept the test as face valid, they are more likely to perform to the best of their ability on it. Data on face validity can be collected by interviewing test takers and students, or by asking them to complete questionnaires about their feelings about, or reactions to, the test they have taken.

- Content validity shows whether the test contains a representative sample of the language skills and structures it is meant to cover. It can be checked by comparing the test content with the syllabus or curriculum, or by rating test items and texts following a precise list of criteria. A further alternative is interviewing teachers of a range of academic subjects, or administering a questionnaire survey in which respondents are asked to make judgements about the texts and tasks.

- Response validity can be checked by gathering information on how test takers respond to the test items. It can be collected by asking learners and test takers to report how they responded to each item and what their test-taking behaviour was, because the reasoning and the processes they follow when solving the items give important indications of what the test is actually testing.

External validity relates to procedures which compare students’ test scores with measures of their ability taken from outside the test. It has two types: concurrent and predictive validity.

- Concurrent validation involves comparing the test scores of the candidates with some other measure for the same candidates taken at roughly the same time as the test. This measure can be expressed numerically with statistical methods by correlating students’ test results with their scores gained on other tests, with teachers’ rankings, or with the students’ own ratings of their language ability in the form of self-assessment.

- Predictive validation involves comparing the students’ test scores with some other external measure taken some time after the test has been administered. This measure can also be expressed numerically with statistical methods by correlating students’ test results with their scores gained on other tests taken some time later (e.g. correlating entrance test results with scores on final tests); a minimal sketch of this kind of correlation follows the list.
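Both types of external validation reduce, in practice, to computing a correlation coefficient between two sets of scores for the same candidates. The sketch below is one way to do this in Python; all scores and names are invented for illustration.

```python
# A minimal sketch of concurrent/predictive validation: correlate two
# sets of scores for the same candidates. All data are invented.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Concurrent validation: the new test against another measure taken at
# roughly the same time (another test, teachers' rankings, self-assessment).
new_test = [72, 65, 88, 54, 91, 60]
other_measure = [70, 68, 85, 50, 94, 63]
print(f"concurrent validity coefficient: {pearson(new_test, other_measure):.2f}")

# Predictive validation is the same computation, except other_measure is
# collected some time later (e.g. final test scores after an entrance test).
```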

Construct validity shows to what extent the test is based upon its underlying theory, that is, how well test performance can be interpreted as a meaningful measure of some characteristic or quality. The term ‘construct’ refers to a psychological construct, a theoretical concept about a kind of language behaviour that the test makers want to measure. It can be regarded as a kind of attribute or ability of people which is assumed to be reflected in test performance. Construct validity thus refers to the extent to which performance on tests is consistent with the predictions that test makers make based on a theory of abilities, or constructs (Bachman, 1990). It can be assessed by correlating different test components (sub-tests) with each other, or by complex statistics combining internal and external validation.
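As one possible illustration of the internal route, the sketch below correlates invented sub-test scores with each other; moderate-to-high positive correlations would be consistent with the sub-tests reflecting one underlying ability. The sub-test names, scores and the use of numpy are assumptions for this example.

```python
import numpy as np

# Rows are sub-tests, columns are candidates; all scores are invented.
scores = np.array([
    [34, 28, 40, 22, 38],   # reading sub-test
    [30, 25, 39, 20, 36],   # listening sub-test
    [32, 30, 41, 24, 35],   # grammar sub-test
])

# np.corrcoef treats each row as a variable and returns the matrix of
# pairwise Pearson correlations between the sub-tests.
print(np.round(np.corrcoef(scores), 2))
```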

13.4.2. Reliability

Reliability refers to the consistency with which a test can be scored, that is, consistency from person to person, time to time or place to place. It means that tests are to be constructed, administered and scored in such a way that the scores obtained on a test on a particular occasion are likely to be very similar to those which would have been obtained if it had been administered to the same students, with the same ability, at a different time (Hughes, 1991). There are two components of test reliability:

- the reliability of the scores on the performance of candidates from occasion to occasion, which can be ensured by careful test construction and administration

- the reliability of scoring

13.4.2.1. Reliability of scoring

Reliability of scoring can be achieved more easily with objectively scored tests (e.g. tests of reading and listening comprehension), in which scoring does not require the scorer’s personal judgement of correctness, because the test items can be marked as simply right or wrong. Scorer reliability is especially important in the case of subjectively scored tests (i.e. tests of writing and speaking skills): because these cannot be assessed on a right-or-wrong basis, assessment requires a judgement on the part of the scorers. There are two aspects of scorer reliability: intra-rater reliability and inter-rater reliability.

- Intra-rater reliability is achieved if the same scorer gives the same set of oral performances or written texts the same scores on two different occasions. It can be measured by means of a correlation coefficient.

- Inter-rater reliability refers to the degree of consistency of scores given by two or more scorers to the same set of oral performances or written texts. Both checks reduce to correlating two sets of scores, as in the sketch after this list.
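A minimal sketch of both checks, assuming scipy is available and using invented essay scores:

```python
from scipy.stats import pearsonr

# Intra-rater check: one rater scores the same ten essays on two occasions.
occasion_1 = [14, 11, 17, 9, 15, 12, 18, 10, 13, 16]
occasion_2 = [15, 11, 16, 10, 15, 13, 18, 9, 12, 16]
r, _ = pearsonr(occasion_1, occasion_2)
print(f"intra-rater reliability: r = {r:.2f}")

# Inter-rater check: two raters score the same set of essays; the same
# coefficient shows how consistently they rank the performances.
rater_a = occasion_1
rater_b = [13, 12, 16, 10, 14, 13, 17, 9, 14, 15]
r, _ = pearsonr(rater_a, rater_b)
print(f"inter-rater reliability: r = {r:.2f}")
```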

The reliability of a test can be quantified in the form of a reliability coefficient. It can be worked out by comparing two sets of test scores. These two sets can be obtained by administering the same test to the same group of test takers twice (test-retest method), or by splitting the test into two equivalent halves, giving separate scores for the two halves and then correlating the scores (split-half method). The more similar the two sets of scores are, the more reliable the test is said to be (Alderson, 1995).
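As a concrete illustration of the split-half method just described, the sketch below splits invented item-level scores into odd and even halves and correlates the half-scores. The Spearman-Brown step is a standard companion adjustment for estimating full-test reliability from a half-test correlation, not something the text itself prescribes.

```python
import numpy as np

# Rows: candidates; columns: dichotomously scored items (1 = correct).
# All item data are invented for illustration.
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

odd_half = items[:, 0::2].sum(axis=1)    # each candidate's score on odd items
even_half = items[:, 1::2].sum(axis=1)   # each candidate's score on even items
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimate the reliability of the full-length
# test from the correlation between its two halves.
reliability = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected reliability = {reliability:.2f}")
```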