Reviewing the validity and reliability of comprehensive language assessments: Or, which test is best?

Many factors influence which tests we SLPs use when we evaluate language skills. What tests do I have? What will capture the skill gaps that I see in this client? What test will show progress after intervention? What test do I not hate giving? Just kidding… Evidence-based practice requires us to also ask: which tests have been shown empirically to be good tests – meaning that they actually measure what we think they are measuring, and do so consistently? This study looks at this last question by taking a systematic approach to finding and evaluating evidence of reliability and validity for 15 language tests. Importantly, the authors looked at evidence from peer-reviewed papers in addition to the stuff in the front of the test manuals. The tests they selected were all recent (20 years old at most), diagnostic, comprehensive spoken language assessments normed on monolingual English-speaking children between 4 and 12 years old. Check out Tables 5 and 6 for the full lowdown on what tests they included and excluded, respectively.**

Reliability and Validity

Bear with us for a brief journey back in time to your grad (or even undergrad) Assessment class. Imagine a hazy pink dream sequence and harp music if it helps. This study looked into six dimensions of reliability (how stable and consistent the test scores are) and validity (whether the test measures what it claims to be measuring). Let’s take a moment and remember what these actually mean:

  1. Internal consistency – Do you get similar answers to similar questions?
  2. Reliability – Can you repeat the test and get the same score?
  3. Measurement error – How much might the score you measure vary from the “true score?”
  4. Content validity – Is the test actually measuring all of the content it’s supposed to be? Think of a final exam covering the entire semester.
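  5. Structural validity – Do the test's items and subtests hang together the way the underlying skill is supposed to be organized? Think of an IQ test whose subtest scores should group into the same factors that the theory of intelligence predicts.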
  6. Hypothesis Testing – Can you make predictions based on some theory, and have them come out in the results of the test? Think of correlations between scores on two similar tests.
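
If the "measurement error" idea in item 3 feels abstract, here is a minimal sketch of how it is usually turned into a number. It assumes the classical test theory formula for the standard error of measurement (SEM = SD × √(1 − reliability)), which is not something the review itself spells out. The function names and the numbers (a standard score of 85, a scale SD of 15, a reliability of 0.90) are made up for illustration and do not come from any of the tests in the review.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(observed, sd, reliability, z=1.96):
    """Band around an observed score; z=1.96 gives roughly a 95% interval."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical values, for illustration only (not taken from any test manual).
low, high = confidence_interval(observed=85, sd=15, reliability=0.90)
print(f"95% interval: {low:.1f} to {high:.1f}")  # 95% interval: 75.7 to 94.3
```

With these made-up numbers, a measured score of 85 could plausibly reflect a "true score" anywhere from about 76 to 94. The lower the reliability, the wider that band gets, which is why weak reliability evidence matters when a cutoff score is on the line.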

Check out Table 9 for a summary of the level of evidence the authors found in each area for the 15 targeted assessments. Because of issues with study methodologies, the authors found no compelling evidence of internal consistency, measurement error, or structural validity in ANY of the tests. Yikes. If there’s a test you give regularly, or one you have concerns about, it’s worth knowing specific strengths and weaknesses of that test.

So… which tests have the best evidence base?

“Whilst all assessments were identified as having notable limitations, four assessments: ALL, CELF-5, CELF:P-2, and PLS-5 were identified as currently having better evidence of reliability and validity. These four assessments are suggested for diagnostic use, provided they suit the purpose of the assessment process and are appropriate for the population being assessed.”

A few things to keep in mind

  • The authors are clear that, “…it should be noted that where evidence is reported as lacking, it does not mean that these assessments are not valid or reliable, but rather that further research is required to determine psychometric quality.”
  • As always, consider where the evidence is coming from. Most of the sources for reliability and validity data are the test manuals themselves. (And when the authors found independent sources of evidence, they didn’t always agree with the manuals.) The stuff in the manual is NOT peer reviewed, and you can only see it after you pony up for the test. This is not to say that it’s necessarily bad science, but we always want converging evidence from independent sources when possible.
  • ALL of the tests the authors looked at were found to have “limitations with regards to evidence of psychometric quality.” Meaning, there’s still a lot of work to be done. In the meantime, keep following best practices for evaluations. Don’t base a diagnosis or eligibility decision on a single test, and use other evaluation tools (language samples, dynamic assessment, interviews, RTI… all that good stuff) in addition to standardized testing.

**Important note. This review did NOT look at assessments published since 2014. This includes the CASL-2 and the TILLS.

Denman, D., Speyer, R., Munro, N., Pearce, W. M., Chen, Y., & Cordier, R. (2017). Psychometric Properties of Language Assessments for Children Aged 4-12 Years: A Systematic Review. Frontiers in Psychology, 8, 1515. doi: 10.3389/fpsyg.2017.01515