Test Reliability and Validity


Test Reliability

A test is considered reliable if it measures what it is supposed to measure consistently, both over time and within itself (internal reliability). Consistency over time is test-retest reliability: each time the test is given, it should yield the same results, or very nearly so. Because of learning effects (people remember their responses when taking a test more than once), many tests use parallel forms. Of course, it is important to demonstrate that the forms are similar enough to yield the same results each time they are given. This similarity is expressed as a correlation coefficient that indicates the stability of scores from one administration to the next.
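
As a minimal sketch of how that stability coefficient might be computed, here is the test-retest case in Python; the scores are invented for illustration:

```python
import numpy as np

# Hypothetical scores for 10 students on two administrations of the same test.
# These numbers are invented for illustration only.
first_administration = np.array([82, 75, 91, 68, 88, 79, 95, 72, 84, 77])
second_administration = np.array([80, 78, 93, 70, 85, 81, 94, 74, 86, 75])

# The Pearson correlation between the two administrations estimates
# test-retest reliability; values near 1.0 indicate stable scores.
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: r = {r:.2f}")
```

The same computation applies to parallel forms: substitute scores on form A and form B for the two administrations.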

Internal Consistency
Consider internal consistency reliability. This can be operationally defined, and measured, by evaluating the degree to which a test's items agree with one another in assessing the same construct or concept. How do we do this scientifically? With numbers, of course! One clever method looks at the split-half reliability of a test by forming two sets of items from a test purporting to assess the same thing. Suppose you have a dozen items all designed to assess long division of four-place by two-place numbers (2314/44, 4765/12, and so on). Give the test to your 50 students, then randomly split the 12 items into two sets of six problems. Total each student's score on each half (300 problems per set across the class) and compute the correlation between the two half-test scores across students, as sketched below. The correlation will be high only if all 12 questions are assessing the same concept, documenting the internal reliability of the original 12-item test.
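
A sketch of that split-half procedure, using simulated scores; the Spearman-Brown correction at the end is a standard step (not mentioned above) that adjusts the half-test correlation upward to estimate the reliability of the full-length test:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 50 students x 12 division items, scored 0 (wrong) or 1 (right).
# A shared ability term makes the items correlate, as they would if all
# 12 items really measured the same long-division skill.
ability = rng.normal(size=(50, 1))
scores = (ability + rng.normal(size=(50, 12)) > 0).astype(int)

# Randomly split the 12 items into two sets of six and total each half.
items = rng.permutation(12)
half_a = scores[:, items[:6]].sum(axis=1)
half_b = scores[:, items[6:]].sum(axis=1)

# Correlate the two half-test totals across the 50 students.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction: estimate the reliability of the full
# 12-item test from the correlation between its two 6-item halves.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```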

Alternatively, you could compute the average inter-item correlation as a measure of internal consistency reliability. Take all the items on the test that address the same construct and correlate each and every pair combination. The average will be high if all 12 test questions are assessing the same concept; if it is low, you should throw out the idea that you are assessing four-place by two-place division knowledge in your students. You get the idea: this approach has worked for many years to assess whether a test really measures, internally, what its author says it measures. A sketch follows.
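
A matching sketch for the average inter-item correlation, on the same kind of simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated 0/1 scores for 50 students on 12 items; as before, a shared
# ability term makes items that measure the same skill correlate.
ability = rng.normal(size=(50, 1))
scores = (ability + rng.normal(size=(50, 12)) > 0).astype(int)

# Correlate every pair of items, then average the unique off-diagonal entries.
corr = np.corrcoef(scores.T)            # 12 x 12 inter-item correlation matrix
pairs = corr[np.triu_indices(12, k=1)]  # the 66 unique item pairs
print(f"average inter-item correlation = {pairs.mean():.2f}")
```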

Inter-rater Reliability
Another type of reliability is called inter-rater reliability, which assesses how well different people using the same test get the same results. This highlights a principal difference between the fields: psychiatry relies mostly on clinical interviewing, while psychology relies more on testing to reach diagnostic findings. Psychiatrists do have structured interviews, but many rely on personal experience, so different psychiatrists use different interview techniques that produce different results among patients, increasing the likelihood that their ultimate diagnoses will differ.
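
Inter-rater agreement is commonly quantified with Cohen's kappa, a standard statistic (not named above) that corrects raw percent agreement for the agreement two raters would reach by chance. A minimal sketch with invented ratings:

```python
import numpy as np

# Invented diagnostic calls (1 = disorder present, 0 = absent) from two
# clinicians applying the same test to the same 12 patients.
rater_a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
rater_b = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1])

observed = np.mean(rater_a == rater_b)  # raw percent agreement

# Chance agreement: probability both say "yes" plus probability both
# say "no", if each rater's calls were independent of the other's.
chance = (rater_a.mean() * rater_b.mean()
          + (1 - rater_a.mean()) * (1 - rater_b.mean()))

kappa = (observed - chance) / (1 - chance)
print(f"observed agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```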

The DSM introduced (in its third iteration, I believe) something called "intensional definitions," in which every diagnosis was deliberately described by specified criteria, in order to improve inter-rater reliability. You could still use different interview styles, but they should always cover the same criterion points when diagnosing. We have inherited this method in DSM-5, where a patient often needs to meet at least 3 of 5 criteria in one category as well as 2 of 4 in another to reach all the criteria for a diagnosis. This clever approach allows some flexibility to accommodate different interview styles while still requiring agreement, which improves inter-rater reliability. It has immensely improved the consistency of diagnosing among different doctors, but you will still find psychiatrists relying on, and referring to, psychologists for norm-referenced testing for finer diagnostic differentiation. This division of labor works well in this author's experience of 40+ years working beside psychiatrists in outpatient settings and for them as medical directors on inpatient units. We will discuss and differentiate cut-off from normed test interpretation methods in a section below.
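
The "at least m of n criteria" logic is easy to state in code. A hypothetical sketch (the categories and thresholds below are invented, not drawn from any actual DSM-5 diagnosis):

```python
# Hypothetical polythetic rule: at least 3 of 5 criteria met in category A
# AND at least 2 of 4 met in category B. Not an actual DSM-5 diagnosis.
def meets_diagnosis(category_a: list[bool], category_b: list[bool]) -> bool:
    return sum(category_a) >= 3 and sum(category_b) >= 2

# Two interviewers may cover the criteria in different orders and styles,
# but the same patient findings yield the same conclusion.
findings_a = [True, True, False, True, False]  # 3 of 5 met
findings_b = [False, True, True, False]        # 2 of 4 met
print(meets_diagnosis(findings_a, findings_b))  # True
```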

Test Validity
An important concept to recognize is that for a test to be valid (i.e., accurate at what it is intended to measure), it must be reliable. The test needs to yield the same results each time it is given (test-retest reliability) and when given by different examiners (inter-rater reliability). If a test for identifying who is likely to be the best varsity high school basketball player yields different results each time it is given, or differs among the assistant coaches using it, then it cannot be a valid measure of best-player potential. However, the converse is not necessarily true: a reliable test is not automatically valid.

Construct Validity
If a panel of experts agrees that a test's items are germane to the purpose of the test, that engenders a form of validity about the construction of the test. This is often a starting point.

Sampling Validity
If a panel of experts (or some other means) draws a sample from a large body of knowledge or information, the panel can determine whether the sample adequately represents the larger body. Should they find that students who pass a limited test can also pass a more global test, then maybe the bigger test isn't necessary to give at the end of the year. Perhaps a quiz would suffice, saving time, money, and effort.

Formative Validity
You can study how well a test predicts an outcome that is useful to an area of interest. For example, a test that detects an area of under-preparedness in a school's curriculum can be judged accurate if more intensive teaching in that area improves students' performance over time. This can ultimately be combined with reliability evaluation methods such as test-retest methodology.

Parallel Form Validity
There are many kinds and types of validity; consider, for example, parallel form reliability, which assures that two forms of the same test are measuring the same thing. The two forms should correlate well when given, ideally, to the same group of subjects. If the same subjects can't take both forms (say, for fear of their memorizing answers), then the two forms can be given to two very similar groups, matched on what we call stratification variables. Typical stratifications are made on a sample's race, age, sex, and socio-economic level, as sketched below.
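
A toy sketch of forming two stratified, matched groups; the subjects and strata are invented, and a real norming study would use more variables and far larger samples:

```python
import random
from collections import defaultdict

random.seed(42)

# Invented subjects tagged with two stratification variables (sex, age band).
subjects = [
    ("s01", "F", "16-17"), ("s02", "M", "16-17"), ("s03", "F", "14-15"),
    ("s04", "M", "14-15"), ("s05", "F", "16-17"), ("s06", "M", "16-17"),
    ("s07", "F", "14-15"), ("s08", "M", "14-15"),
]

# Group subjects by stratum, then deal each stratum alternately to the
# two groups so both end up with the same sex/age composition.
strata = defaultdict(list)
for sid, sex, age in subjects:
    strata[(sex, age)].append(sid)

group_a, group_b = [], []
for members in strata.values():
    random.shuffle(members)
    for i, sid in enumerate(members):
        (group_a if i % 2 == 0 else group_b).append(sid)

print("Form A group:", sorted(group_a))
print("Form B group:", sorted(group_b))
```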

The more varied the methods we use to demonstrate a test's reliability, the closer we come to determining its validity.

Scientifically Determined Validity
A test can be very reliable, but not valid. Later I'll describe my test designed to separate the best from the worst high school varsity basketball players. Unfortunately, even though it is perfectly consistent and identifies 100% of the HS juniors and seniors most likely to succeed as basketball players, it is not a valid measure. How can this be? Why am I not rich from a test that is always right in determining who will be the best player?

Well, a test needs to be reliable to be valid, but reliability is not enough; if the test doesn't measure what it purports to measure (differentiating the best from the worst players), then it is no good. As you'll learn, I haven't sold many copies of this test. It fails to distinguish the best players from the worst because it over-detects good players and under-detects poor players: while detecting 100% of good players, it fails to screen out any (0%) of the bad players. So it does only half of what it claims to do. A test with excellent sensitivity and good specificity would be considered to have scientifically determined validity.
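
Stated numerically, the basketball test's failure looks like this; a sketch with invented outcomes matching the description above (every player, good or bad, gets flagged as good):

```python
import numpy as np

# Invented ground truth: 1 = truly good player, 0 = truly bad player.
truth = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# The flawed test flags everyone as a good player.
predicted = np.ones_like(truth)

sensitivity = np.mean(predicted[truth == 1] == 1)  # good players detected
specificity = np.mean(predicted[truth == 0] == 0)  # bad players screened out

print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
# -> sensitivity = 100%, specificity = 0%: the test does only half its job.
```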

Avoid Face Validity
Determining test validity can be done in multiple ways. The basketball player test, you may have noticed, indirectly references sensitivity/specificity, which is a good method of figuring out whether a test is valid: place the score used as a cut-off at just the right spot, identifying as many good players as possible while choosing as few bad players as possible for the team. We'll delve into sensitivity/specificity in the next section; a brief preview follows.
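
One common way to place the cut-off is to sweep candidate scores and keep the one that best balances the two rates, here via Youden's J (sensitivity + specificity - 1); all scores are invented:

```python
import numpy as np

# Invented screening scores: good players tend to score higher.
good_scores = np.array([72, 78, 81, 85, 88, 90, 93])
bad_scores = np.array([55, 60, 64, 68, 71, 75, 79])

best_cut, best_j = None, -1.0
for cut in range(50, 100):
    sensitivity = np.mean(good_scores >= cut)  # good players kept
    specificity = np.mean(bad_scores < cut)    # bad players excluded
    j = sensitivity + specificity - 1          # Youden's J statistic
    if j > best_j:
        best_cut, best_j = cut, j

print(f"best cut-off = {best_cut}, Youden's J = {best_j:.2f}")
```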

A method of determining validity that is not very good is relying on a test's "face validity," i.e., what one's gut says is true. Sometimes you have no choice: art investors have to rely on "art appreciation," which differs greatly from person to person. That's why art can shoot up in value when an artist becomes more popular; the fact that more people appreciate someone's art lends a higher face validity to its value. The same goes for stock buying. Face validity can be accurate, but it comes with no proof, and when wrong it can have disastrous results. Consider ex-Attorney General William Barr, who reported that intensive investigation by his own Justice Department showed no widespread voter fraud, yet continues to personally believe and espouse that voter fraud exists at a high level because it makes "common sense."

A.G. Barr is using the face validity of his life-long experience to support his common-sense test; but his failure, as a public servant, to avoid personal bias leads him to ignore the practical "evidence" produced by his own Justice Department, which specifically invalidates his personal test. Such a double standard also fails to meet an ethics test for an attorney. Snow falls and you can make snowballs, which suggests face validity for the argument that there is no global warming; yet 97% of scientists agree that global climate change is real. If you choose to rely on your gut feeling and experience (an N=1 situation) in diagnosing NCD rather than use reliable methods, the validity of your findings will always be in question, and you are probably working for the wrong organization. Limbic Resources, Inc. adheres to APA ethics stating that we should use the best data and valid methods in our work.

You'll read a lot about "evidence-based" medicine nowadays. This is ancient history in psychology, which has used this idea ever since the scientist-practitioner model evolved as the basis for training clinical psychologists. Before WWII most psychologists were academics, until the VA asked psychologists to join psychiatrists and social workers in providing treatment to the returning veterans who were overwhelming the system. The differences between the scientist and practitioner models were reconciled within the APA at a conference sponsored by the U.S. Veterans Administration and the National Institute of Mental Health in Boulder, CO in 1949. In a nutshell, psychologists folded an appreciation for, and use of, the scientific method into their clinical work, and the dissertation continued as a means of learning statistics and scientific methodology to be applied in clinical work, with the expectation that practitioners would contribute to the science through publications and teaching. Evidence-based medicine is nothing more than going beyond face validity. When Dr. Love "proved" to the old, male faculty at Harvard Medical School that her mortality rate for breast lumpectomy matched theirs for "better get it all" mastectomy, it shocked them and changed a medical model that was, and still is, filled with nonscientific "common sense" causing great harm.
