Reliability relates to the data or evidence that we collect from students. For data to be reliable it must be consistent in the way that it measures a specified set of behaviours. This consistency needs to be present for each individual marker as well as across markers.
A good analogy would be to imagine that your student performed a task and that you collected data using a set of rubrics. Now imagine that you were able to erase both the student’s memory of doing the task and your memory of collecting the data, and that the student then performed the task again while you again used your rubrics to collect evidence. If the rubrics are reliable, your coding will be consistent: there will be no variation in the behaviours you identify against the rubric. This is called intra-rater reliability.
Alternatively, a student performs a task and is assessed by a marker. The student’s memory of performing the task is erased. The student performs the task again but this time is assessed by a different marker. If the rubrics are reliable there will be no variation between markers. This is called inter-rater reliability.
Different factors interfere with the reliability of rubrics. Collectively this interference is known as ‘noise’. The guidelines for writing quality criteria have been designed to minimise noise and therefore increase your chances of designing reliable rubrics.
Reliability is a technical term that indicates how much ‘noise’ or interference is involved in gathering observations. For statistical measurement purposes it is expressed as a number between 0 and 1: a value of 1.0 indicates that there are no sources of interference, while a value of 0 indicates that the observations are all noise. Clinical psychological tests must have less than 5% noise; there are no equivalent standards for education.
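As an illustrative sketch (not part of the source text), the 0-to-1 reliability scale described above can be made concrete by computing an agreement coefficient between two markers. Cohen’s kappa is one widely used measure of inter-rater reliability for categorical rubric codes: it takes raw percent agreement and corrects it for the agreement the two markers would reach by chance. The marker codes below are entirely hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical codes on the same students."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of students both raters coded identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Kappa = how far observed agreement exceeds chance, as a share of the
    # maximum possible excess. 1.0 means perfect agreement (no noise).
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two markers applying the same rubric to ten students.
marker_1 = ["high", "high", "mid", "low", "mid", "high", "low", "mid", "mid", "high"]
marker_2 = ["high", "mid",  "mid", "low", "mid", "high", "low", "mid", "low", "high"]

kappa = cohens_kappa(marker_1, marker_2)
print(f"kappa = {kappa:.2f}")
```

Here the two markers agree on 8 of 10 students (raw agreement 0.80), but because some of that agreement is expected by chance, kappa comes out lower, at about 0.70, which is the kind of gap between raw consistency and measured reliability that the ‘noise’ discussion above is getting at.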
Validity refers to the use and interpretation of the evidence collected, as opposed to the assessment method or task per se. It asks the question: ‘Can the evidence collected be legitimately used for the purpose for which it is intended?’ One way to think about it might be to imagine that you had been teaching students how to do a cartwheel and you wanted to see if they could now physically perform a cartwheel. You might be able to design a written test that could reliably collect data about the students’ knowledge of the mechanics of a cartwheel, but that test will not be a valid measure of the students’ ability to actually perform a cartwheel. In this way validity is inextricably related to the purpose of the task and the construct being assessed. When we can confidently say that the evidence collected allows us to make inferences about student learning we have construct validity.
To check that the data that you collect from your rubrics will be interpretable against the intended construct it is important to explicitly check that the behaviours that you describe in each rubric correspond to the ideas that you are trying to assess. Curriculum documents, learning progressions and/or teacher-made constructs can be useful tools in checking for validity.
Gareis and Grant (2008), in Teacher-Made Assessments: How to Connect Curriculum, Instruction, and Student Learning (p. 33), provide a wonderful graphical depiction of the relationship between reliability (the shots) and validity (the target).
You will notice that you can have a reliable task that is not valid. For the purpose of rubric design this means that the rubrics can be well written and generate consistent data but that the data produced does not relate to the idea that we were trying to assess. They miss the target. Unreliable tasks produce invalid inferences. This means that if your rubrics are inconsistent it will be impossible to make a valid inference about student learning.