###### [Listening to: ‘Don’t Become The Thing You Hated’ by Destroyer; ‘Dancing and Blood’ by Low]

Statistics are widely misused in language testing. I think the use of statistics in language testing is increasing and, consequently, so is the misuse. Part of the reason, I think, is that influential people *explicitly encourage* this misuse.

## Sometimes nothing is better than something

Here is a paradigmatic case. Bachman (1990), in *Fundamental Considerations in Language Testing*, still one of the most frequently (and favourably) cited language testing texts, states that “in order for a test score to be valid, it must be reliable” and that the

> investigation of reliability is concerned with answering the question, ‘How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?’ and with minimizing the effects of these factors on test scores. (p. 160)

He then goes on to discuss three approaches to this investigation: classical test theory (CTT, or CTS, ‘classical true score’), generalisability theory (G-theory) and item response theory (IRT). Of CTT/CTS, Bachman says the following.

> A major limitation to CTS theory is that it does not provide a very satisfactory basis for predicting how a given individual will perform on a given item. There are two reasons for this. First, CTS theory makes no assumptions about how an individual’s level of ability affects the way he performs on a test. Second, the only information that is available for predicting an individual’s performance on a given item is the index of difficulty, *p*, which is simply the proportion of individuals in a group that responded correctly to the item. Thus, the only information available in predicting how an individual will answer an item is the average performance of a group on this item. However, it is quite obvious that an individual’s level of ability must also be considered in predicting how she will perform on a given item; an individual with a high level of ability would clearly be expected to perform better on a difficult item that measures that ability than would a person with a relatively low level of the same ability.
>
> Because of this and other limitations of CTS theory (and G-theory as well), psychometricians have developed a number of mathematical models for relating an individual’s test performance to that individual’s level ability. … Such models are generally referred to as ‘item response’ models, and the general theory upon which they are based is called ‘item response’ theory (IRT). (pp. 203)
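To make the contrast in that passage concrete, here is a minimal Python sketch. The response data, the ability values and the item difficulty are all invented for illustration, and the one-parameter logistic (Rasch) model is used only as one familiar example of an item response model; the passage quoted above does not single out any particular model.

```python
import math


def item_difficulty(responses):
    """CTS index of difficulty p: the proportion of the group answering correctly."""
    return sum(responses) / len(responses)


def rasch_probability(ability, difficulty):
    """Rasch (one-parameter logistic) model: probability of a correct response,
    given a person's ability and an item's difficulty on the same logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical scored responses (1 = correct, 0 = incorrect) of ten test takers to one item.
responses = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

# Under CTS, this single group-level figure is all we have for predicting
# any individual's performance on the item.
p = item_difficulty(responses)

# Under the Rasch model, the prediction depends on the individual's ability.
# The item difficulty (1.0) and the two abilities (-1.0, 2.0) are assumed values.
low_ability_prediction = rasch_probability(ability=-1.0, difficulty=1.0)
high_ability_prediction = rasch_probability(ability=2.0, difficulty=1.0)

print(f"CTS p for everyone:        {p:.2f}")
print(f"Rasch, low-ability person:  {low_ability_prediction:.2f}")
print(f"Rasch, high-ability person: {high_ability_prediction:.2f}")
```

The sketch only shows what each approach takes as input. Whether the ability-dependent predictions mean anything still depends on the assumptions discussed next.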

Bachman states that IRT offers several advantages over CTT and G-theory, but that two key assumptions must be met: that all items “measure a single, or *unidimensional* ability or trait, … that the items form a *unidimensional* scale of measurement” (p. 203), and that all items are *locally independent*, “[t]hat is, we assume that an individual’s response to a given test item does not depend upon how he responds to other items that are of equal difficulty” (p. 11).

> However, from what we know about the nature of language, it is clear that virtually every instance of authentic language use involves several abilities. … If language test scores reflect several abilities, and are thus not unidimensional, and if authentic test tasks are, by definition, interrelated, to what extent are current measurement models appropriate for analysing and interpreting them? (pp. 11-12)

One way to deal with this problem would be to design *inauthentic* tests, but a) I don’t think this is desirable; b) it’s not clear that this would necessarily solve the problem altogether; and c) we still have no way of knowing whether the items are actually unidimensional and locally independent. In other words, we can never know whether the assumptions are truly met, but in many language testing situations we can be fairly confident that they are not.

So, Bachman has acknowledged that there are serious impediments to the practical application of all three approaches – CTT/CTS, G-theory and IRT. In response, he states that “in situations that may not permit the use of G-theory or IRT, estimating reliability through classical approaches is a far better alternative than failing to investigate the effects of measurement error at all because these more powerful approaches are not possible” (p. 209). As I read it, Bachman is saying here that:

- IRT is the best approach but we can’t use it in most situations because the assumptions are not likely to be met.
- G-theory is the second best approach but we can’t use it in most situations because the assumptions are not likely to be met.
- The assumptions of CTT/CTS are not likely to be met in most situations but we should use it anyway because it’s better than nothing.

This is not a coherent argument and is further undermined by Bachman’s statement that “the time and resources that go into investigating reliability and minimizing measurement error must be justified by the amount of information gained and the usefulness of that information” (p. 209). In my experience, the application of CTT/CTS takes up a lot of time and resources and the ‘information’ generated is by definition useless: the assumptions of the model or theory used to generate it are not met; consequently, CTT/CTS cannot be used to interpret it; and, if we can’t use CTT/CTS to interpret the test data, we are left wondering how we *can* interpret it.
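For readers who have not seen what a classical reliability estimate involves, here is a minimal Python sketch. It uses Cronbach’s alpha and the standard error of measurement, which are common CTT/CTS statistics but are not named in the passages quoted above, so treat the choice as an assumption. The score matrix is invented.

```python
import math
from statistics import stdev, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: one row per test taker, one column per item."""
    k = len(scores[0])  # number of items
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def standard_error_of_measurement(scores, reliability):
    """SEM: the standard deviation of total scores multiplied by sqrt(1 - reliability)."""
    total_sd = stdev([sum(row) for row in scores])
    return total_sd * math.sqrt(1 - reliability)


# Hypothetical data: five test takers, four dichotomously scored items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
sem = standard_error_of_measurement(scores, alpha)
print(f"alpha = {alpha:.2f}, SEM = {sem:.2f}")
```

The arithmetic produces a number for any score matrix. Nothing in the calculation checks whether the assumptions that license reading it as ‘reliability’ (for example, that measurement errors are uncorrelated across items) hold for the test in question.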

In this case, it appears to me that Bachman was knowingly encouraging misuse in the interests of pragmatism. From what I’ve seen and heard over the last couple of years, Bachman’s recommendation that an inappropriate psychometric theory is better than none at all has been widely taken up, and this misguided practice is spreading rapidly. Who benefits from the increased use of inappropriate psychometric techniques in language testing? The test takers? I doubt it.

## How to identify and resist the inappropriate use of psychometric theory

If someone refers to confidence intervals, reliability coefficients, standard deviations, statistical significance, etc., ask them the following questions.

### 1. What does that statistic mean?

### 2. For that statistic to have that meaning, what assumptions must be met?

### 3. To what extent are these assumptions met in your specific context?

### 4. If the assumptions are not met, what does the statistic mean?

The person’s answer to these questions will indicate whether they:

- Have a good grasp of what the statistic means, assuming the assumptions are met
- Have a good grasp of the limitations to the application of the statistic to real life contexts
- Have considered whether the statistic is actually appropriate to their specific context
- Take the theory seriously.

I personally don’t think it’s too much to ask that people who are going to use statistics give some thought to these questions first.