
Item Response Theory (IRT) Explained

Reading time: approx. 7 minutes · Category: Science & Psychometrics

Behind every serious adaptive test lies a mathematical theory: Item Response Theory, or IRT for short. It describes how a person's response to a question (an "item") relates to their underlying characteristic (their "trait" or "latent attribute"). This sounds complex – but it isn't, once you work through it step by step.

What Is a "Latent Attribute"?

A latent attribute is a characteristic that cannot be observed directly, but can only be inferred from behaviour and responses. Personality traits like empathy, analytical thinking, or risk-taking are classic latent attributes – you can't measure them like a blood test, but you can observe how someone reacts in specific situations.

That is exactly what a personality test does: it asks questions about situations, preferences, and behaviours to draw conclusions about these latent attributes. IRT formalises this process mathematically.

The Core Idea: The Item Characteristic Curve

At its heart, IRT describes for every question an Item Characteristic Curve (ICC). This curve shows how likely it is that a person with a given trait level will give a particular answer.

Imagine we are measuring "Analytical Thinking" on a scale from −3 (very low) to +3 (very high). For the question "I prefer to analyse problems logically rather than intuitively", the curve looks roughly like this:

📐 The S-shaped curve: This item characteristic curve typically has an S-shape (sigmoid curve). The steeper the curve, the better the question distinguishes between different trait levels.
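This S-shape is the logistic (sigmoid) function. As a rough sketch, assuming a simple two-parameter logistic curve (parameter names `a` and `b` follow the section below; the specific values here are illustrative, not Traitora's actual item parameters):

```python
import math

def icc(theta, a=1.5, b=0.0):
    """Item characteristic curve (2PL logistic).

    theta: trait level (roughly -3 .. +3)
    a: discrimination (slope of the curve)
    b: difficulty (trait level where the probability crosses 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A steeper curve (larger a) separates trait levels more sharply:
for theta in (-3, -1, 0, 1, 3):
    print(f"theta={theta:+d}  shallow (a=0.5): {icc(theta, a=0.5):.2f}"
          f"  steep (a=2.0): {icc(theta, a=2.0):.2f}")
```

At `theta = b` the curve always passes through 0.5; the discrimination `a` only controls how quickly it rises on either side.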

The Three Parameters of an IRT Question

In the classic 3-parameter IRT model (3PL), every question has three characteristic properties:

⚖️

Difficulty (b)

The difficulty parameter indicates at which trait level the probability of agreement sits at 50%. A "difficult" question is only endorsed by people with very high trait levels.

🎯

Discrimination (a)

Discrimination describes how well the question distinguishes between different trait levels. High discrimination means a steep curve – the question is highly informative in the statistical sense.

🎲

Guessing (c)

This parameter indicates the probability of a particular response even at very low trait levels – it sets the lower asymptote of the curve. In personality tests this parameter is often close to zero.
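The three parameters combine in the standard 3PL formula, P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))). A minimal sketch (the example values are illustrative):

```python
import math

def p_3pl(theta, a, b, c):
    """3PL model: probability of endorsement at trait level theta.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    As theta -> -infinity the probability approaches c, not zero.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability sits halfway between c and 1:
p = p_3pl(theta=0.5, a=1.2, b=0.5, c=0.2)  # -> 0.6
```

Note that with c > 0, the point where the probability reaches exactly 50% shifts; with c ≈ 0 (typical for personality items) the model reduces to the two-parameter case.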

How Does IRT Estimate a Trait Value?

When you answer several questions, the system has a probability statement for each answer: "How likely would this response be at a given trait level?" IRT combines all of this information using a procedure called Maximum Likelihood Estimation (MLE) or Bayesian estimation.

Simply put: the system finds the trait value that best explains why you answered exactly the way you did. With each additional answer the estimate becomes more precise – uncertainty shrinks.
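The "best explanation" idea can be sketched as a maximum likelihood search over candidate trait values. This is a deliberately simple grid search under an assumed 2PL model with hypothetical item parameters, not Traitora's actual estimator:

```python
import math

def p_2pl(theta, a, b):
    """Probability of endorsing an item under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items, responses):
    """Grid-search MLE: return the trait value on [-3, +3] that
    maximises the log-likelihood of the observed 0/1 responses."""
    grid = [t / 100.0 for t in range(-300, 301)]

    def loglik(theta):
        ll = 0.0
        for (a, b), x in zip(items, responses):
            p = p_2pl(theta, a, b)
            ll += math.log(p) if x == 1 else math.log(1.0 - p)
        return ll

    return max(grid, key=loglik)

# Hypothetical (a, b) pairs for three items, in rising difficulty:
items = [(1.5, -1.0), (1.2, 0.0), (1.8, 1.0)]
# Endorsing the easy and medium items but not the hard one suggests
# a moderately positive trait value:
theta_hat = estimate_theta(items, [1, 1, 0])
```

With more responses, the likelihood function peaks more sharply around one value, which is exactly the "uncertainty shrinks" effect described above.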

IRT vs. Classical Test Theory (CTT)

For a long time Classical Test Theory (CTT) dominated psychology. In CTT a test score is simply the sum of raw points – the more "correct" or agreeing answers, the higher the score. This has significant drawbacks:

| Criterion | Classical Test Theory | Item Response Theory |
| --- | --- | --- |
| Item dependency | Result depends strongly on the specific questions asked | Result is independent of the specific question selection |
| Sample independence | Item parameters depend on the sample | Item parameters are sample-independent (given good calibration) |
| Measurement error | One uniform standard error for everyone | An individual standard error per person |
| Adaptive testing | Barely feasible | Ideal foundation for CAT |
| Efficiency | Everyone gets the same number of questions | Minimal questions for maximum precision |

What Is Computerized Adaptive Testing (CAT)?

The combination of IRT and computers enables Computerized Adaptive Testing (CAT) – precisely what Traitora implements. A computer can calculate in milliseconds which question would yield the highest information gain next. Without computers this would be unthinkable.
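"Information gain" has a precise meaning here: for a 2PL item, the Fisher information at trait level θ is I(θ) = a² · P(θ) · (1 − P(θ)), which peaks where the item's difficulty matches the current estimate. A minimal sketch of the greedy CAT selection rule, with a made-up item bank:

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta_hat, item_bank):
    """Greedy CAT rule: ask the item that is most informative
    at the current trait estimate."""
    return max(item_bank, key=lambda ab: item_information(theta_hat, *ab))

# Hypothetical bank of (a, b) pairs: an easy, a medium, a hard item.
bank = [(1.0, -2.0), (1.5, 0.0), (2.0, 2.5)]
next_item = pick_next_item(0.1, bank)  # the item with b near theta wins
```

In a real CAT system the estimate and the selection alternate: answer, re-estimate θ, pick the next most informative unasked item, repeat until the standard error is small enough.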

CAT has been used in educational research since the 1970s. Well-known applications include the GMAT (Graduate Management Admission Test), the GRE (Graduate Record Examinations), and the TOEFL (Test of English as a Foreign Language). Traitora brings this technology to the field of personality psychology.

How Well Does IRT Work for Personality Tests?

IRT was originally developed for achievement tests (e.g. school exams), where there are correct and incorrect answers. In personality tests there are no "correct" answers – every answer reveals something about the respondent's personality.

For this purpose Traitora uses a polytomous IRT model (specifically the Graded Response Model), which is suited to multiple ordered response categories with no clear "correct" answer. Each answer option carries weightings for different traits, and the system calculates which trait constellation most likely matches your overall answer pattern.
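In the Graded Response Model, each item has one discrimination and a set of increasing threshold parameters, one per boundary between adjacent categories; the probability of answering in a given category is the difference between two adjacent cumulative curves. A sketch with a hypothetical 4-category (Likert-style) item – the parameter values are illustrative, not Traitora's:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grm_category_probs(theta, a, thresholds):
    """Graded Response Model category probabilities.

    thresholds: increasing boundary difficulties b_1 < b_2 < ...
    P(X >= k) = sigmoid(a * (theta - b_k)); each category's
    probability is the difference of adjacent cumulative curves.
    """
    cum = [1.0] + [sigmoid(a * (theta - b)) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# 4 ordered categories with boundaries at -1.0, 0.0, +1.5:
probs = grm_category_probs(theta=0.3, a=1.4, thresholds=[-1.0, 0.0, 1.5])
```

The category probabilities always sum to 1, and as θ rises, mass shifts from the lower categories toward the higher ones – which is why every answer, not just an "agree", is informative.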

Fairness and Comparability

An important advantage of IRT is fairness across different groups. Because IRT-based tests account for individual item parameters, results from different people can be compared directly – even if they did not answer the same questions. This is not possible in classical test theory.

This principle is called Measurement Invariance: the underlying trait values can be compared across different groups without distortion from different question sets.

🔬 Scientific background: IRT traces back to work by Georg Rasch (1960) and Frederic Lord (1952). The 3-parameter model now in widespread use was formalised in Lord and Novick (1968), with key contributions by Allan Birnbaum. Modern applications use Bayesian extensions for even more robust estimates.