Jason M. Szanyi and Katarina Guttmannova*
This past term, in Ricci v. DeStefano, the Supreme Court reshaped employment discrimination litigation. In a decision that garnered significant attention both for its potential impact on the future of Title VII of the Civil Rights Act of 1964 and for its ties to then-Supreme Court nominee Sonia Sotomayor, the Court held that the City of New Haven had violated Title VII by disregarding the results of a promotional examination for the City’s firefighters. Because white firefighters generally had outperformed minority candidates on that examination, only white firefighters would have moved up the ranks within the New Haven fire department.
The decision attracted substantial criticism as well as praise, and the case is guaranteed to be the subject of ongoing critical commentary. Future scholarly discussions will inevitably focus on how the Court either departed from or remained true to its past precedents in employment discrimination cases under Title VII. These pieces will take on broad issues, such as the impact that Ricci will have on employment discrimination litigation, the future of Title VII, and the fate of disparate treatment and disparate impact claims under the Fourteenth Amendment’s Equal Protection Clause—all worthy areas of discussion.
Rather than entering the academic debate on these broad topics, we have one focused and forward-looking purpose: to help lawyers understand how statistics can and should guide litigation under the Supreme Court’s most recent precedent. The Ricci precedent requires that an employer have a “strong basis in evidence” for its fear that it will face disparate impact liability before taking remedial measures to avoid it. Essentially, before employers can invalidate or disregard results based on racially disparate outcomes, they must generate tangible proof either that their tests or assessments did not measure what they should have measured, or that the results could have been analyzed in a way that would have led to less discriminatory, but equally valid, results. We argue that litigators can meet that burden through statistical techniques that are specifically geared toward gauging how a test or performance assessment may be biased in favor of one group or against another. These statistical concepts may be foreign to practicing attorneys, who would understandably shy away from them because of their apparent complexity, yet they may provide an opportunity for meeting the evidentiary burden established by the Supreme Court this past term.
The first part of this article provides a brief synopsis of the Ricci decision and the Court’s key holding. We then identify how the Ricci decision presents the opportunity for practicing lawyers to demonstrate a “strong basis in evidence,” particularly in light of the case’s explicit references to the role of statistical evidence in Title VII cases. We conclude by identifying the key statistical concepts that can help guide attorneys in making or challenging a claim under the Supreme Court’s most recent precedent and by proposing a collaboration that would facilitate the appropriate use of these statistical techniques in future litigation.
I. The Ricci Decision
In Ricci, a group of firefighters from the City of New Haven challenged the City’s refusal to certify the results of a promotional examination on which white candidates for promotion had outperformed African-American and Hispanic candidates. The City, fearing that minority candidates would sue on the basis of disparate impact liability if only white candidates moved up in the ranks, decided to discard the examination results. The Ricci plaintiffs, composed of seventeen white firefighters and one Hispanic firefighter, alleged that the City’s action had violated both Title VII of the Civil Rights Act of 1964 and the Fourteenth Amendment’s Equal Protection Clause. Justice Anthony Kennedy, writing for a five-justice majority, held that the City’s refusal to certify the examination results in order to avoid only white firefighters securing promotions was a violation of Title VII, which prohibits disparate treatment on the basis of race.
The majority held that before an employer could engage in intentional discrimination with the stated purpose of avoiding the use of a test that could lead to disparate impact, the employer must have a “strong basis in evidence” to support its belief that disparate impact liability would follow without such action. This is the standard that the Court had applied in earlier cases brought under the Fourteenth Amendment’s Equal Protection Clause. For example, in Wygant v. Jackson Board of Education, the Court concluded that “[e]videntiary support for the conclusion that remedial action is warranted becomes crucial when the remedial program is challenged in court by non-minority employees.”
In that context, the standard struck a balance between the competing goals of eliminating segregation and discrimination through the use of remedial measures on the one hand and eliminating any form of race-based discrimination by the government, including the use of remedial measures, on the other. In Ricci, the Court imported this standard to reconcile what it found to be the similarly competing prohibitions on disparate treatment and disparate impact in Title VII. Significantly, Ricci is now the key reference point for any employer covered by Title VII that wishes to remedy what it perceives as disparate impact through action designed to offset or balance such an outcome.
II. Identifying a “Strong Basis in Evidence”
The Ricci case turned on the City’s failure to meet its burden under the Court’s “strong basis in evidence” standard. Although it is unclear what exactly constitutes a strong basis in evidence that an employer will face disparate impact liability under the Court’s newly minted precedent, there are some guidelines from the case itself. The Ricci decision makes clear that the mere fear of litigation based on the disparate impact of a test or assessment is insufficient to trigger an employer’s ability to intentionally discriminate on the basis of race under Title VII, even when an employer aims to correct what it perceives as an unfair or undesirable outcome. However, the Court’s statement affords little guidance in contemporary Title VII litigation, as municipalities and their lawyers are well aware of the need to be ready to justify their actions with tangible evidence.
Indeed, in Ricci itself, the City of New Haven did not simply assert an unsupported fear of litigation as the basis for its actions. For example, Justice Ginsburg’s dissent noted that the City employed testing techniques that were outside of the mainstream, having relied on oral and written examinations when other jurisdictions used real-world simulations. Moreover, the New Haven officials were faced with striking numerical disparities. For example, the pass rates for African-American and Hispanic candidates for captain were approximately one half of those for white candidates for the position.
The cases argued in the wake of the Ricci decision will be those in which an employer covered by Title VII amasses some evidence to justify its use of intentional discrimination to avoid disparate impact liability. In this type of case, two questions are key: (1) What type of evidence is required to constitute a “strong basis in evidence”? and (2) How much evidence must a plaintiff or defendant gather to support her position? The majority opinion does not answer these questions, but in rejecting the City of New Haven’s evidence, the Court did give some important signals as to where future defendants may fall short.
The majority opinion presented two alternative avenues for an employer to make a sufficient showing under the “strong basis” standard. The Court held that for it to find that the City’s actions were justified, the City would have to make one of two showings: either that the promotional examination “[was] not job related and consistent with business necessity,” or that “there existed an equally valid, less-discriminatory alternative that served the City’s needs but that the City refused to adopt.” With respect to the former ground, the Court stated that there was “no genuine dispute that the examinations were job-related and consistent with business necessity.” In particular, the majority rejected the City of New Haven’s arguments that the examination did not test the appropriate skills. The Court placed fault with the City for not requesting a more detailed report of the examination’s validity, even though such a report had been available. That report could potentially have served as evidence to support the City’s arguments.
With respect to the latter ground, the Court rejected the City’s claim that there was an alternative way of calculating the results of the promotional exam that would have both benefitted African-American candidates and been an equally valid assessment of the skills relevant to promotion. Specifically, the Court noted that the City neither produced evidence that the existing methodology for evaluating the exam results was arbitrary, nor pinpointed a specific alternative formula for generating less discriminatory but equally valid results. Indeed, the Court explicitly stated that the City “[could not] create a genuine issue of fact based on a few stray . . . statements in the record” made by the psychologist who was hired by the City to help develop the exam. In short, the Court required more than just an abstract statement of what may or may not have been possible; an actual alternative way of analyzing the results was a must.
III. Using Statistics to Make (or Break) A Claim In A Post-Ricci World
Ricci’s take-home message is that employers looking to use intentional discrimination as a way of avoiding disparate impact liability face a high hurdle before taking such action; the Supreme Court has made clear that anything short of very persuasive evidence will fail to satisfy the “strong basis in evidence” standard. Yet the Ricci decision itself does not present litigants with a roadmap of how to sustain their burden under this precedent.
We outline two statistical concepts that can help lawyers focus and articulate their arguments in a post-Ricci world: measurement validity and test bias or measurement invariance. We have selected these two areas because they best capture the overall concerns raised by the Court when it articulated its “strong basis” standard. Further, encouraging litigants on both sides of future cases to present their evidence and arguments in these terms will provide a common, coherent basis for generating case law in the lower federal courts under the Ricci precedent.
A. Measurement Validity
Under Ricci, the first way of demonstrating a strong basis in evidence is to show that an assessment or examination “[is] not job related and consistent with business necessity.” From a statistical standpoint, that determination requires an answer to the following question: “Does the test measure what it intends to measure?” Thus, one must ensure that an instrument measures skills that are relevant to the assessed attribute; this concept is generally known as construct validity. One must also establish criterion-related validity, which refers to the extent to which the assessment instrument estimates the desired criterion or external attribute in a concurrent or predictive manner. For example, in the case of the New Haven firefighters, a valid instrument would measure the skills that would relate to the potential to excel in a supervisory capacity, and those with high skills and potential would score high on the given test (and vice versa). Furthermore, to demonstrate validity, individual test questions or assessments must actually tap into the skills that are relevant to the performance of the job. For example, Justice Ginsburg’s dissent noted that most other jurisdictions had moved away from pencil and paper examinations because real-time simulations provided a better sense of how candidates would actually perform under real-life conditions. However, the majority largely ignored this fact in its analysis. Thus, proof that a locality uses an outdated or non-mainstream assessment is insufficient to generate a strong basis in evidence under the Court’s new precedent. However, a systematic statistical analysis aimed at assessing measurement validity may meet the Court’s standard.
Suppose a city wanted to examine the construct validity and reliability of its promotional examination. Validity addresses the content of assessment, whereas reliability refers to its precision. In this situation, a suitable approach might involve a multitrait-multimethod (MTMM) study. Ricci can serve as a useful illustration. In that case, the City of New Haven was presumably evaluating two or more traits related to a firefighter’s excellence in a supervisory capacity—for example, the ability to make sound decisions in high stress situations and the ability to direct resources where they are most needed. If the City assessed these hypothetical latent traits using two or more methods, such as a paper and pencil questionnaire, real-life simulations, and direct observation, it could have utilized an MTMM evaluation. One of the primary goals of the MTMM study is to parse out the effects of the hypothetical latent trait or ability that the test is designed to measure from the effects of the different methods employed in the assessment (i.e., to distinguish between variability in scores due to differences in the relevant skills, and the variability in scores due to the testing method).
An MTMM study also allows researchers to evaluate the construct validity of a set of assessments that use different methods. Specifically, researchers can look at both convergent and discriminant validity, both of which are components of construct validity. Convergent validity taps into how well measures of the same latent trait or ability that are related based on theory ultimately relate to each other based on the data. The observed agreement between those measures is an indicator of convergent validity. Discriminant validity serves a different function, aiming to identify the extent to which individual measures discriminate between theoretically different traits or abilities.
A preliminary approach to evaluating these two components of construct validity would involve examining the correlation matrix of scores from all variables—essentially, a table that lists correlations between each item on the assessment and every other item. High correlation among variables assessing the same latent trait or ability that were obtained by different methods would provide evidence of convergent validity; a relatively low correlation among variables assessing different constructs with different methods would suggest evidence of discriminant validity. In other words, that evidence could support the claim that the measurements not only tap into the same trait or ability, but that they also successfully distinguish between the desired trait and other potential skills. In addition to a basic examination of a correlation matrix, more advanced statistical procedures such as confirmatory factor analysis are available to answer questions related to the validity and reliability of multiple constructs and assessment methods.
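To make the correlation-matrix approach concrete, the following sketch simulates a miniature MTMM design under invented assumptions: two hypothetical latent traits (decision-making and resource allocation), each measured by a written exam and a simulation. None of the numbers come from the Ricci record; the sketch only illustrates the Campbell-and-Fiske pattern in which convergent correlations (same trait, different methods) should exceed discriminant correlations (different traits, different methods).

```python
import numpy as np

# Hypothetical MTMM sketch. All data are simulated for illustration;
# the trait names and effect sizes are assumptions, not Ricci facts.
rng = np.random.default_rng(0)
n = 1000

# Two latent traits, modestly correlated across candidates.
trait_dm = rng.normal(size=n)                       # decision-making
trait_ra = 0.3 * trait_dm + rng.normal(size=n)      # resource allocation

# Method effects shared by every measure taken with that method.
method_written = rng.normal(scale=0.5, size=n)
method_sim = rng.normal(scale=0.5, size=n)

# Each observed score = latent trait + method effect + random error.
scores = np.column_stack([
    trait_dm + method_written + rng.normal(scale=0.5, size=n),  # DM, written
    trait_dm + method_sim + rng.normal(scale=0.5, size=n),      # DM, simulation
    trait_ra + method_written + rng.normal(scale=0.5, size=n),  # RA, written
    trait_ra + method_sim + rng.normal(scale=0.5, size=n),      # RA, simulation
])

r = np.corrcoef(scores, rowvar=False)

# Convergent validity: same trait measured by different methods.
convergent = (r[0, 1] + r[2, 3]) / 2
# Discriminant validity: different traits measured by different methods.
discriminant = (r[0, 3] + r[1, 2]) / 2

print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
assert convergent > discriminant  # the pattern Campbell & Fiske look for
```

In a real MTMM analysis, the full correlation matrix, not just these two averages, would be inspected, and confirmatory factor analysis would be used to separate trait variance from method variance formally.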
These are certainly not the only ways to assess measurement validity. However, they are relevant examples of a systematic approach that could have strengthened the arguments of either side, particularly because the importance of measurement validity was something on which the majority and dissent both agreed. The majority explicitly faulted the City of New Haven for failing to request a more detailed report on the test’s relation to skills relevant to promotion—a criticism that rang hollow according to Justice Ginsburg, who argued that the consultants who generated the instrument were not equipped to provide that information. Although the majority and dissent disagreed about the City’s ability to obtain this information and the importance of that fact to the City’s case, both sides recognized the relevance of that information to such a showing. Therefore, these statistical concepts should be at the center of the parties’ arguments in future cases. Further, litigators should remember that it is not the test itself, but the interpretation of test results that concerns validation efforts. In other words, although a test generates data in the form of individual responses, it is the soundness of the inferences drawn from those responses that matters most.
B. Test Bias or Measurement Invariance
The second way for an employer to demonstrate a strong basis in evidence is to show that “there existed an equally valid, less-discriminatory alternative that served the City’s needs but that the City refused to adopt.” Under this standard, an employer needs to muster tangible proof of bias in favor of a certain group before it contemplates abandoning existing results and resorting to an alternative assessment. A first step in making such a showing would be to demonstrate that the existing instrument does discriminate in favor of or against a certain group on a basis that is irrelevant to the task at hand (i.e., unrelated to the ability or characteristic that the instrument is intended to assess). In statistics, this concept is known as measurement or test bias. Measurement invariance represents the absence of this type of measurement bias.
The concept of measurement invariance recognizes that in addition to one’s achievement in a given subject or one’s knowledge of a specific area, many other factors can influence a person’s performance on a test of achievement—factors that do not have a direct bearing on the skills assessed. Some of these factors may represent a random measurement error such as those related to idiosyncrasies of the testing process (e.g., a test taker’s fatigue on a given day). The amount of random measurement error is inversely related to the reliability of the assessment instrument. Other factors may be more systematic and include the test taker’s general anxiety level or experience in taking a test in a given format. Furthermore, some of these factors may be related to an individual’s group membership or demographic characteristics, such as socio-economic status or ethnic or racial background. In other words, while random error is related to measurement precision, systematic measurement error is related to measurement bias or nonequivalence. In statistics, measurement invariance or equivalence is achieved when individuals with equal ability have equal test scores regardless of the group or groups to which these individuals happen to belong.
In Ricci, measurement invariance (or conversely measurement bias) appeared to be a significant issue for the promotional examinations, yet neither party addressed it in any systematic way. For example, the majority cited a statistician who had asserted that “most of the literature on firefighters shows that the different [racial] groups perform the job differently.” Further, it is possible that additional research using the New Haven data, if it were made publicly available, would confirm what numerous research studies on the performance of ethnic minority test-takers have revealed—namely, that factors such as negative stereotype threat (rooted in the experience of prejudice and discrimination), not lower ability, influenced the test performance of the African-American firefighters.
Assuming that the City of New Haven (or another employer) wanted to determine if stereotype threat or other factors unrelated to the assessed ability could explain disparities in test results, it would have several statistical tools at its disposal. Historically, researchers examined test bias using what are known as invariant predictive models. In those models, a criterion, such as occupational achievement or success, was regressed on the test scores, meaning that researchers looked at how one’s occupational achievement or success would vary depending on his or her test scores. Test bias was established when there was evidence of a difference in the regression function (i.e., a different predictive relationship among the assessed variables) for members of different groups, such as males and females or different ethnic groups. However, in the past few years, statisticians and psychometricians have highlighted the measurement invariance model as a more viable and appealing alternative to invariant predictive models. The measurement invariance model focuses not on the predictive relationships between the observed test scores and the future outcome, but instead on the relationship between the test scores and the latent attribute that the test is designed to assess. In the case of Ricci, that latent attribute was the ability to serve as a leader within the ranks of the fire department.
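The invariant predictive model can be sketched in a few lines. The data below are simulated, and the bias term is an assumption built in for illustration: the hypothetical test understates one group’s ability, so the within-group regressions of the criterion on test scores diverge in their intercepts even though the underlying ability-performance relationship is identical for both groups.

```python
import numpy as np

# Illustrative sketch of the classical invariant-prediction check: regress a
# job-performance criterion on test scores separately for two groups and
# compare the fitted regression lines. All data and the size of the built-in
# bias are simulated assumptions.
rng = np.random.default_rng(1)
n = 2000

ability = rng.normal(size=n)
group = rng.integers(0, 2, size=n)   # 0 and 1 are hypothetical group labels

# A biased test: equal ability yields systematically lower scores for group 1.
score = ability - 0.5 * group + rng.normal(scale=0.3, size=n)
# The criterion (later job performance) depends on ability alone.
criterion = ability + rng.normal(scale=0.3, size=n)

# Fit criterion ~ score within each group (np.polyfit returns slope first).
slope0, intercept0 = np.polyfit(score[group == 0], criterion[group == 0], 1)
slope1, intercept1 = np.polyfit(score[group == 1], criterion[group == 1], 1)

print(f"group 0: criterion = {slope0:.2f} * score + {intercept0:.2f}")
print(f"group 1: criterion = {slope1:.2f} * score + {intercept1:.2f}")

# Differing regression functions (here, the intercepts) signal test bias:
# at the same test score, group 1's expected performance is higher because
# the test understates that group's ability.
assert intercept1 - intercept0 > 0.3
```

Note that in this simulation the slopes are essentially equal and only the intercepts differ; real data can show differences in either, and either kind of departure counts as evidence against invariant prediction.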
The technical details of measurement invariance analyses are beyond the scope of this editorial. However, we mention two general approaches to assessing measurement invariance for illustration. One technique is embedded within the structural equation modeling framework and utilizes confirmatory factor analytic techniques. The general idea behind the analysis is to compare a series of models specifying the relationship between the test items and the latent attribute: in one set of models the relationships are set to not vary by group, and in the other the relationships are freely estimated based on the observed data. The statistical fit of the two models is then compared. That permits researchers to identify whether the data and instrument depart from the goal of measurement invariance across groups.
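The constrained-versus-free comparison at the heart of this approach can be illustrated with a deliberately simplified stand-in. A real invariance test compares multi-group confirmatory factor models, which requires specialized software; the sketch below substitutes a one-predictor regression so that the model-comparison logic fits in a few lines. All data and parameter values are invented.

```python
import numpy as np

# Simplified illustration of invariance testing by model comparison:
# fit a model whose parameters are constrained to be equal across groups,
# fit one whose parameters are free, and ask whether freeing them improves
# fit more than chance would predict.
rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)               # stand-in for the latent attribute
group = rng.integers(0, 2, size=n)

# Non-invariant data: the item relates to the attribute differently by group.
slope = np.where(group == 0, 1.0, 0.6)
y = slope * x + rng.normal(scale=0.5, size=n)

def sse(xs, ys):
    """Residual sum of squares from a one-predictor least-squares fit."""
    b, a = np.polyfit(xs, ys, 1)
    return float(np.sum((ys - (b * xs + a)) ** 2))

# Constrained model: one set of parameters for everyone.
sse_constrained = sse(x, y)
# Free model: parameters estimated separately within each group.
sse_free = sse(x[group == 0], y[group == 0]) + sse(x[group == 1], y[group == 1])

# F-style comparison (2 extra parameters in the free model; n - 4 residual df).
f_stat = ((sse_constrained - sse_free) / 2) / (sse_free / (n - 4))
print(f"F = {f_stat:.1f}")
assert f_stat > 10  # a large value flags a departure from invariance
```

In the factor-analytic setting, the same logic is applied with a chi-square difference test between the constrained and freely estimated multi-group models rather than this regression-based F statistic.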
Another technique, based on Item Response Theory (IRT), can examine differential item functioning to assess measurement invariance. When using IRT, researchers identify a set of item parameters that are unique for each item (such as the item difficulty and discrimination, or the extent to which the specific item discriminates between individuals of varying ability). Based on the observed item responses (given the item parameters), the underlying latent trait or ability is estimated for each test taker. This parametrization allows researchers to examine whether certain groups demonstrate a different probability of giving a correct response on a test, even when they share the same ability or skill level. These two general approaches are analytically distinct, and deciding on and actually performing either one requires consultation with a statistician or psychometrician trained in these techniques; however, they are both tools that can address the concerns discussed above.
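The IRT approach can be sketched directly from the two-parameter logistic (2PL) model, in which the probability of a correct response is P(θ) = 1 / (1 + e^(−a(θ − b))), with θ the test taker’s latent ability, a the item’s discrimination, and b its difficulty. The parameter values below are invented; the sketch shows how an item exhibiting differential item functioning gives candidates of identical ability different chances of answering correctly.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response for ability theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = 0.0                          # a candidate of exactly average ability

# An invariant item: the same parameters apply to every group.
p_all = p_correct(theta, a=1.2, b=0.0)

# A DIF item: effectively harder for group B at equal ability
# (hypothetical parameter values chosen for illustration).
p_group_a = p_correct(theta, a=1.2, b=-0.2)
p_group_b = p_correct(theta, a=1.2, b=0.4)

print(f"equal ability, unequal success: {p_group_a:.2f} vs {p_group_b:.2f}")
assert p_group_a > p_group_b         # the item, not ability, drives the gap
```

In practice the item parameters are estimated from the observed response data for each group rather than assumed, and formal DIF tests ask whether the group-specific parameter estimates differ by more than sampling error allows.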
Neither party in Ricci utilized these tools to justify or reject the need to take remedial measures on the basis of race in light of the significant disparities in the exam results. Evidence from this type of analysis would square particularly well with the majority’s concerns in Ricci. To support this claim, it is important to remember the basis on which the Court rejected the City’s claim that it could have analyzed the examination results in a way that would not have generated the same racial disparities. The Court rejected the City’s claim not because it thought that such a showing would be insufficient, but because the City had not provided any concrete evidence that such an alternative existed.
In short, a statistical analysis that demonstrated a lack of measurement invariance—that is, that the test produced systematically biased results by favoring members of a certain group over another of equal ability—could provide the tangible basis for the need to consider less discriminatory assessment alternatives. Lawyers who articulated their arguments in these terms and provided the statistical analyses to support them would make a viable case that the strong basis in evidence test had been satisfied.
For practicing lawyers in the employment discrimination field, Ricci is a watershed moment for Title VII. The case is certain to have a significant impact on employers, who now face a higher hurdle when taking actions on the basis of race to avoid disparate impact liability. It is unclear where the Court will ultimately set the bar with respect to the type and quantum of proof required under the Ricci standard; indeed, any such prediction is difficult due to the fact-dependent nature of these cases. Yet, in elaborating these statistical concepts, we have identified the tools that we believe best track the majority’s concerns and equip attorneys with a vocabulary for making their case in future litigation.
Although these concepts hold definite value for future disputes under this recent precedent, they hold their greatest value for employers at an early stage: when designing and testing prospective tests or assessments. The Ricci majority itself, although rejecting the City of New Haven’s subsequent attempt to correct for what it perceived to be an undesirable result, recognized that “Title VII does not prohibit an employer from considering, before administering a test or practice, how to design that test or practice in order to provide a fair opportunity for all individuals, regardless of their race.”
In light of the high bar set by the Supreme Court in Ricci, employers looking to establish equitable and appropriate assessments should turn to these statistical tools before administering those instruments. Colleges and universities throughout the country have professionals who are trained in these techniques and who could help ensure their proper application. Such collaboration would help to achieve the result that the City of New Haven intended, but without requiring the drastic remedy of throwing out the examination results of 118 applicants—and without generating the threat of future litigation.
*Jason M. Szanyi, J.D., 2009, Harvard Law School. Mr. Szanyi is a Skadden Fellow at the Public Defender Service for the District of Columbia and the Center for Children’s Law and Policy.
Katarina Guttmannova, Ph.D., 2004, University of Montana, Missoula. Dr. Guttmannova is a Research Scientist at the Social Development Research Group, School of Social Work at the University of Washington.
 Ricci v. DeStefano, 129 S. Ct. 2658 (2009).
 Sotomayor was a member of the Second Circuit panel that summarily affirmed the district court opinion in the case. The district court had held that the defendant’s attempt to remedy the disparate impact of its promotional exam by disregarding the results “[was] not equivalent to an intent to discriminate against non-minority applicants.” Ricci v. DeStefano, 554 F. Supp. 2d 142, 158 (D. Conn. 2006), aff’d 530 F.3d 87 (2d Cir. 2008) (per curiam).
 See, e.g., David A. Drachsler, Assessing the Practical Repercussions of Ricci, ACS Blog, July 27, 2009, http://www.acslaw.org/node/13829 (last visited Nov. 13, 2009).
 See, e.g., Richard A. Epstein, Ricci vs. DeStefano: Getting Back to First Principles of Affirmative Action, Forbes, June 29, 2009, http://www.forbes.com/2009/06/29/ricci-destefano-new-haven-supreme-court-affirmative-action-opinions-columnists-firefighters.html (last visited Nov. 13, 2009).
 Of the 77 candidates who took the examination, 43 were white, 19 were African-American, and 15 were Hispanic. Only 34 individuals passed the examination, 25 of whom were white, 6 of whom were African-American, and 3 of whom were Hispanic. The top 10 passing candidates were all white. Ricci, 129 S. Ct. at 2666.
 The Court explicitly declined to consider the firefighters’ constitutional claims, having found a statutory basis for its decision. Ricci, 129 S. Ct. at 2681. However, many of the same issues raised by the Ricci decision are likely to arise in a constitutional context as well—a decision that Justice Scalia noted would necessarily follow from this case. See Ricci, 129 S. Ct. at 2681–82 (Scalia, J., concurring) (noting that “[the] resolution of this dispute merely postpones the evil day on which the Court will have to confront the question: Whether, or to what extent, are the disparate-impact provisions of Title VII of the Civil Rights Act of 1964 consistent with the Constitution’s guarantee of equal protection?”).
 Intentional discrimination, as explained by the Court, involves consciously making a decision on the basis of race. In Ricci, this involved “invalidat[ing test results] in sole reliance on race-based statistics” that indicated a disparate outcome. Ricci, 129 S. Ct. at 2676.
 Ricci, 129 S. Ct. at 2664.
 Wygant v. Jackson Board of Education, 476 U.S. 267, 277 (1986).
 Title VII’s provisions apply to all private employers, state and local governments, and education institutions that employ fifteen or more individuals, as well as other entities such as labor unions and employment agencies. See 42 U.S.C. §§ 2000e(b)–2000e(e) (2008).
 Ricci, 129 S. Ct. at 2690 (Ginsburg, J., dissenting) (arguing that the majority “ignore[d] substantial evidence of multiple flaws in the tests New Haven used”).
 Id. at 2705.
 Id. at 2692.
 Id. at 2678.
 Id. at 2679.
 Id. at 2681.
 Although these arguments would help lawyers establish the strong basis in evidence that Ricci requires, they would also be useful in countering such claims. In this way, these concepts are meant to ensure that prospective plaintiffs and defendants are speaking the same language when it comes to analyzing a given test or assessment.
 We do not focus on the issue of weighting here. Weighting, while related to the general issues of validity, reliability and test bias, is too case-specific for this summary.
 Ricci, 129 S. Ct. at 2678.
The following works provide a useful and accessible introduction to the concept of measurement validity. See, e.g., American Educational Research Association, American Psychological Association & the National Council on Measurement in Education, Standards For Educational and Psychological Testing (1999); Herman Aguinis & Marlene A. Smith, Understanding the Impact of Test Validity and Bias on Selection Errors and Adverse Impact in Human Resource Selection, 60 Personnel Psychol. 165 (2007); Michael T. Kane, Current Concerns in Validity Theory, 38 J. Educ. Measurement 319 (2001).
 See Ricci, 129 S. Ct. at 2705 (citing a 1996 study that indicated that two-thirds of surveyed municipal employers used simulations as part of their promotional schemes).
 This approach was originally developed by and is described in one of the most-cited methodological works in the history of psychology. See Donald T. Campbell & Donald W. Fiske, Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix, 56 Psychol. Bulletin 56 (1959). For a practical application of the idea and method, see, e.g., Herbert W. Marsh & David Grayson, Latent Variable Models of Multitrait-Multimethod Data, Structural Equation Modeling 177-98 (Rick H. Hoyle ed. 1989); Kenneth A. Bollen, Measurement Models: The Relation Between Latent and Observed Variables, Structural Equations with Latent Variables 179-225 (Kenneth A. Bollen ed. 1989). For an overview of the historical connections and theoretical background on construct validation in clinical psychology, see Milton E. Strauss & Gregory T. Smith, Construct Validity: Advances in Theory and Methodology, 5 Ann. Rev. Clinical Psychol. 1 (2009).
 For additional practical examples and extensions of the MTMM approach, see, e.g., David A. Kenny & Deborah A. Kashy, Analysis of the Multitrait-Multimethod Matrix by Confirmatory Factor Analysis, 112 Psychol. Bulletin 165 (1992); Charles S. Reichardt & S.C. Coleman, The Criteria for Convergent and Discriminant Validity in A Multitrait-Multimethod Matrix, 30 Multivariate Behavioral Research 513 (1995); Charles E. Lance, Carrie L. Noble & Steven E. Scullen, A Critique of the Correlated Trait—Correlated Method and Correlated Uniqueness Model for Multitrait-Multimethod Data, 7 Psychol. Methods 228 (2002).
 Ricci, 129 S. Ct. at 2679.
 Ricci, 129 S. Ct. at 2707 (Ginsburg, J., dissenting) (arguing the report would have “merely summarized the steps that [the consultants] took methodologically speaking, and would not have established the exams’ reliability”) (quotation marks omitted).
 Id. at 2678.
 For an introduction to measurement invariance and its application, see, e.g., Robert J. Vandenberg & Charles E. Lance, A Review and Synthesis of the Measurement Invariance Literature: Suggestions, Practices, and Recommendations for Organizational Research, 3 Org. Research Methods 4 (2000); Steven P. Reise, Keith F. Widaman & Robin H. Pugh, Confirmatory Factor Analysis and Item Response Theory: Two Approaches for Exploring Measurement Invariance, 114 Psychol. Bulletin 552 (1993).
 For a more extensive exploration of this issue, see, e.g., Katarina Guttmannova, Jason M. Szanyi & Philip W. Cali, Internalizing and Externalizing Behavior Problem Scores: Cross-ethnic and Longitudinal Measurement Invariance of the Behavior Problem Index, 68 Educ. & Psychol. Measurement 676, 679–80 (2008). For additional examples of how researchers have examined measurement invariance in achievement tests and other assessments, see Shudong Wang & Hong Jiao, Construct Equivalence Across Grades in a Vertical Scale for a K-12 Large-Scale Reading Assessment, 69 Educ. & Psychol. Measurement 760 (2009) (investigating the measurement invariance of a large-scale reading comprehension assessment program across elementary and secondary grades); Niels G. Waller et al., Using IRT to Separate Measurement Bias from True Group Differences on Homogeneous and Heterogeneous Scales: An Illustration with the MMPI, 5 Psychol. Methods 125 (2000) (using Item Response Theory to demonstrate the difference between measurement bias and true group differences on a popular personality test frequently used to assess adult psychopathology).
 Ricci, 129 S. Ct. at 2669.
 For an introduction to the concept of stereotype threat, see Claude M. Steele, A Threat in the Air: How Stereotypes Shape Intellectual Identity and Performance, 52 Am. Psychologist 613 (1997). Two social psychologists maintain a website that outlines the concept of stereotype threat in greater detail and includes the latest research studies on the topic: Reducing Stereotype Threat, http://www.reducingstereotypethreat.org (last visited Jan. 6, 2009).
 For a review of this work and extensions of this approach, see Denny Borsboom, Jan-Willem Romeijn & Jelte M. Wicherts, Measurement Invariance Versus Selection Invariance: Is Fair Selection Possible?, 13 Psychol. Methods 75 (2008); Roger E. Millsap, Invariance in Measurement and Prediction, Revisited, 72 Psychometrika 461, 461 (2007) (noting that “[t]he body of work on invariance in measurement and prediction has yet to have much impact on measurement practice”).
 For additional explanation of the details of this methodology, see John L. Horn & J.J. McArdle, A Practical and Theoretical Guide to Measurement Invariance in Aging Research, 18 Experimental Aging Research 117 (1992); Robert J. Vandenberg, Toward a Further Understanding of an Improvement in Measurement Invariance Methods and Procedures, 5 Organizational Research Methods 139 (2002); Vandenberg & Lance, supra note 28.
 For a good introduction to IRT methods, see Susan E. Embretson & Stephen P. Reise, Item Response Theory for Psychologists (2000).
 Steven P. Reise et al., Confirmatory Factor Analysis and Item Response Theory: Two Approaches for Exploring Measurement Invariance, 114 Psychol. Bulletin 552 (1993).
Ricci v. DeStefano, 129 S. Ct. 2658, 2679 (2009).
 Ricci, 129 S. Ct. at 2677 (emphasis added).