Contact centre quality calibration
A QA scorecard that produces different scores depending on who does the assessment is not measuring quality — it is measuring assessor opinion. Calibration aligns assessor judgements so the score reflects the contact, not the assessor.
Why calibration is non-negotiable
Performance management fairness
If an agent's QA score varies by 10–15 percentage points (pp) depending on which assessor reviews their calls, the score cannot fairly support a performance management decision. An agent who consistently scores 75% with Assessor A and 88% with Assessor B has either a genuine performance problem or an assessor consistency problem, and without calibration it is impossible to determine which.
Agent trust in the QA programme
Agents who receive different scores for contacts that feel similar in quality lose trust in the QA programme as a development tool. An agent who is coached on a call scored 62% by one assessor, then scores 84% on a similar call reviewed by another, rationally concludes that the score is arbitrary.
Comparison validity across teams
If Team A's calls are primarily reviewed by a strict assessor and Team B's by a lenient assessor, the gap between the two teams' average QA scores reflects assessor strictness as much as real quality differences. Cross-team benchmarking is meaningless without inter-rater reliability.
Legal defensibility in regulated sectors
In operations regulated by the Financial Conduct Authority (FCA), QA evidence used to support a disciplinary decision or regulatory reporting must be consistent. If an FCA investigation found that assessors had scored identical behaviour differently, that finding would undermine the reliability of the entire QA programme.
The calibration session: step-by-step
Select 2–4 calls for the session
Choose calls that represent different contact types and difficulty levels — including at least one call with an ambiguous QA element (a judgment call that assessors are likely to score differently). Avoid selecting calls that are clearly excellent or clearly poor — calibration is most valuable for the middle ground.
Assessors score independently before the session
Each assessor reviews and scores the selected calls before the calibration meeting, without seeing other assessors' scores. Independent scoring is essential — if assessors hear others' scores before forming their own, the calibration data is contaminated by anchoring bias.
Scores are revealed simultaneously at the start of the session
Display all assessors' scores for each element simultaneously, for example in a shared spreadsheet or on a QA system screen. Note which elements show a wide spread (more than 5pp between the highest and lowest score) and which are consistent. Consistent elements do not need discussion; focus session time on divergent elements.
Discuss divergent scores — not to force consensus, but to understand rationale
For each element where scores diverge, each assessor explains their rationale. The goal is not to reach a single agreed score through social pressure — it is to understand whether different assessors applied the same criteria or whether the scorecard criteria are ambiguous. Ambiguous criteria should be clarified in the scorecard notes after the session.
Agree the reference score for each divergent element
After discussion, the calibration lead (typically the QA manager or senior assessor) confirms the reference score for each element — the score that will be used as the standard for future similar assessments. This is not an average of the assessors' scores — it is the score the calibration lead determines best reflects the criteria applied consistently.
Document the calibration session and update the QA standards notes
Record which elements were discussed, what the disagreement was, and what the agreed standard is. Update the QA standards notes (the expanded guidance on scorecard criteria) to reflect any clarifications made during the session. Assessors who missed the session should read the standards notes before their next assessment batch.
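The "reveal and focus on divergent elements" step can be sketched in a few lines of Python. This is a minimal illustration, not part of any particular QA system: the function names, the data layout (assessor name mapped to per-element scores in percent), and the sample scores are all assumptions made for the example. The 5pp threshold is the one given in the session steps above.

```python
from typing import Dict, List

# Threshold taken from the session guidance: a spread above 5pp is "divergent".
DIVERGENCE_THRESHOLD_PP = 5.0


def element_spreads(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return the max-minus-min spread per scorecard element.

    `scores` maps assessor name -> {element name -> score in percent}.
    Only elements scored by every assessor are included.
    """
    assessors = list(scores.values())
    if not assessors:
        return {}
    shared = set(assessors[0])
    for per_element in assessors[1:]:
        shared &= set(per_element)
    return {
        element: max(a[element] for a in assessors) - min(a[element] for a in assessors)
        for element in sorted(shared)
    }


def divergent_elements(
    scores: Dict[str, Dict[str, float]],
    threshold_pp: float = DIVERGENCE_THRESHOLD_PP,
) -> List[str]:
    """List elements whose spread exceeds the threshold, widest spread first."""
    spreads = element_spreads(scores)
    flagged = [e for e, s in spreads.items() if s > threshold_pp]
    return sorted(flagged, key=lambda e: spreads[e], reverse=True)


# Hypothetical independent scores for one call (illustrative data only).
call_scores = {
    "assessor_a": {"greeting": 100.0, "verification": 80.0, "resolution": 70.0},
    "assessor_b": {"greeting": 100.0, "verification": 60.0, "resolution": 75.0},
    "assessor_c": {"greeting": 95.0, "verification": 90.0, "resolution": 72.0},
}
print(divergent_elements(call_scores))  # ['verification']
```

In this sample, "greeting" and "resolution" each have a spread of exactly 5pp, so they are not flagged; only "verification" (30pp) would go on the discussion agenda.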
Measuring inter-rater reliability
Inter-rater reliability (IRR) is the statistical measure of how consistently two or more assessors score the same contact. There are several approaches; the simplest and most practical for contact centre use is the absolute agreement percentage — the proportion of assessor pairs that score within an acceptable range of each other.
| IRR band (difference in total score between assessors) | Assessment | Action |
|---|---|---|
| Within ±3pp | Excellent: high reliability | No action required. Reduce calibration frequency if sustained for 3+ months. |
| Within ±5pp | Acceptable: meets standard | Continue existing calibration cadence. Note elements where scores differed for targeted discussion. |
| More than 5pp, up to 10pp | Below standard: calibration required | Increase calibration frequency. Identify which scorecard elements are producing the differences. Clarify criteria in standards notes. |
| More than 10pp | Poor: scores not comparable across assessors | Suspend cross-assessor benchmarking and performance decisions based on QA scores until calibration is achieved. Run intensive calibration (weekly) and retrain assessors on ambiguous elements. |
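As a rough sketch, the agreement percentage and the bands in the table could be computed as follows. The function names and sample totals are hypothetical. The code also makes two interpretive assumptions that the table does not settle: "within ±3pp" is read as a gap of at most 3pp between two assessors' total scores, and a gap of exactly 5pp is counted as acceptable.

```python
from itertools import combinations
from typing import Dict


def pairwise_agreement(total_scores: Dict[str, float], tolerance_pp: float = 5.0) -> float:
    """Share of assessor pairs whose total scores differ by at most tolerance_pp."""
    pairs = list(combinations(total_scores.values(), 2))
    if not pairs:
        raise ValueError("need at least two assessors")
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance_pp)
    return agreeing / len(pairs)


def reliability_band(gap_pp: float) -> str:
    """Map the largest gap between assessors' totals to a band from the table.

    Assumption: boundary values (3, 5, 10) fall into the better band.
    """
    if gap_pp <= 3:
        return "excellent"
    if gap_pp <= 5:
        return "acceptable"
    if gap_pp <= 10:
        return "below standard"
    return "poor"


# Hypothetical total scores (percent) for one calibration call.
totals = {"assessor_a": 78.0, "assessor_b": 84.0, "assessor_c": 80.0}
gap = max(totals.values()) - min(totals.values())

print(round(pairwise_agreement(totals), 2))  # 0.67 (2 of 3 pairs within 5pp)
print(reliability_band(gap))                 # below standard (gap is 6pp)
```

Using the largest gap to pick the band is a deliberately conservative choice: one outlying assessor is enough to move the whole group into a lower band, which is usually the signal a calibration lead wants to see.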
Quality calibration questions
How often should contact centres run QA calibration sessions?
At minimum, monthly for teams with active QA programmes. For teams where QA scores are used in performance management, fortnightly calibration is recommended. Sessions involve all active QA assessors plus team leaders who use QA scores in coaching. A session takes 60–90 minutes and covers 2–4 calls, focusing discussion on elements where scores diverge by more than 5pp. New assessors should attend calibration more frequently until their inter-rater reliability is consistently within ±5pp of the reference assessor. Frequency can be reduced once the team achieves reliable consistency but should never drop below monthly.
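One way to track when a new assessor can move to the standard cadence is to compare their total scores with the reference assessor's over recent calibrated calls. The sketch below assumes that "consistently" means the last three calibrated calls; that number, the function name, and the sample data are illustrative assumptions, not a rule from this guide.

```python
from typing import Sequence


def is_consistently_calibrated(
    assessor_totals: Sequence[float],
    reference_totals: Sequence[float],
    tolerance_pp: float = 5.0,
    required_calls: int = 3,  # assumption: "consistently" = last 3 calibrated calls
) -> bool:
    """True if the assessor was within tolerance_pp of the reference
    on each of the most recent `required_calls` calibrated calls.

    Both sequences hold total scores in percent, oldest call first.
    """
    if required_calls < 1:
        raise ValueError("required_calls must be at least 1")
    if len(assessor_totals) != len(reference_totals):
        raise ValueError("score lists must be the same length")
    gaps = [abs(a - r) for a, r in zip(assessor_totals, reference_totals)]
    recent = gaps[-required_calls:]
    # Not enough history yet counts as "not consistently calibrated".
    return len(recent) == required_calls and all(g <= tolerance_pp for g in recent)


# Hypothetical history: an early 10pp miss, then three calls within 2pp.
new_assessor = [70.0, 82.0, 79.0, 85.0]
reference = [80.0, 80.0, 81.0, 83.0]
print(is_consistently_calibrated(new_assessor, reference))  # True
```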