Contact centre quality calibration

A QA scorecard that produces different scores depending on who does the assessment is not measuring quality — it is measuring assessor opinion. Calibration aligns assessor judgements so the score reflects the contact, not the assessor.

Why calibration is non-negotiable

Performance management fairness

If an agent's QA score varies by 10–15 percentage points (pp) depending on which assessor reviews their calls, the score cannot fairly support a performance management decision. An agent who consistently scores 75% with Assessor A and 88% with Assessor B may have a genuine performance problem or may simply be subject to inconsistent assessment; without calibration, it is impossible to determine which.

Agent trust in the QA programme

Agents who receive different scores for contacts that feel similar in quality lose trust in the QA programme as a development tool. An agent who is coached on a call scored 62% by one assessor, then scores 84% on a similar call reviewed by another, rationally concludes that the score is arbitrary.

Comparison validity across teams

If Team A's calls are primarily reviewed by a strict assessor and Team B's by a lenient assessor, the gap between the two teams' average QA scores reflects assessor calibration gaps as much as quality differences. Cross-team benchmarking is meaningless without inter-rater reliability.

Legal defensibility in regulated sectors

In operations regulated by the Financial Conduct Authority (FCA), QA evidence used to support a disciplinary decision or regulatory reporting must be consistent. An FCA investigation that finds assessors scored identical behaviour differently would undermine the credibility of the entire QA programme as evidence.

The calibration session: step-by-step

1. Select 2–4 calls for the session

Choose calls that represent different contact types and difficulty levels, including at least one call with an ambiguous QA element (a judgement call that assessors are likely to score differently). Avoid selecting calls that are clearly excellent or clearly poor: calibration is most valuable for the middle ground.

2. Assessors score independently before the session

Each assessor reviews and scores the selected calls before the calibration meeting, without seeing other assessors' scores. Independent scoring is essential — if assessors hear others' scores before forming their own, the calibration data is contaminated by anchoring bias.

3. Scores are revealed simultaneously at the start of the session

Display all assessors' scores for each element simultaneously (for example, in a shared spreadsheet or on a QA system screen). Note which elements show a wide spread (more than 5pp between the highest and lowest score) and which are consistent. Consistent elements do not need discussion; focus session time on divergent elements.
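The spread check in this step can be sketched in a few lines of Python. This is an illustrative sketch only: the assessor names, element names, data layout, and the function names `element_spreads` and `divergent_elements` are all hypothetical, and "spread" is assumed to mean the highest score minus the lowest score for an element.

```python
# Hypothetical data: each assessor's score (0-100) per scorecard element for one call.
scores = {
    "assessor_a": {"greeting": 100, "empathy": 80, "compliance": 70},
    "assessor_b": {"greeting": 100, "empathy": 60, "compliance": 72},
    "assessor_c": {"greeting": 96, "empathy": 75, "compliance": 70},
}


def element_spreads(scores):
    """Return {element: highest score - lowest score} across all assessors."""
    elements = next(iter(scores.values())).keys()
    spreads = {}
    for element in elements:
        values = [assessor_scores[element] for assessor_scores in scores.values()]
        spreads[element] = max(values) - min(values)
    return spreads


def divergent_elements(scores, threshold_pp=5):
    """List elements whose spread exceeds the threshold (assumed: strictly greater)."""
    spreads = element_spreads(scores)
    return sorted(e for e, spread in spreads.items() if spread > threshold_pp)


print(element_spreads(scores))
print(divergent_elements(scores))
```

With this sample data only "empathy" exceeds the 5pp threshold, so it would be the one element put forward for discussion.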

4. Discuss divergent scores: not to force consensus, but to understand rationale

For each element where scores diverge, each assessor explains their rationale. The goal is not to reach a single agreed score through social pressure — it is to understand whether different assessors applied the same criteria or whether the scorecard criteria are ambiguous. Ambiguous criteria should be clarified in the scorecard notes after the session.

5. Agree the reference score for each divergent element

After discussion, the calibration lead (typically the QA manager or senior assessor) confirms the reference score for each element — the score that will be used as the standard for future similar assessments. This is not an average of the assessors' scores — it is the score the calibration lead determines best reflects the criteria applied consistently.

6. Document the calibration session and update the QA standards notes

Record which elements were discussed, what the disagreement was, and what the agreed standard is. Update the QA standards notes (the expanded guidance on scorecard criteria) to reflect any clarifications made during the session. Assessors who missed the session should read the standards notes before their next assessment batch.

Measuring inter-rater reliability

Inter-rater reliability (IRR) is the statistical measure of how consistently two or more assessors score the same contact. There are several approaches; the simplest and most practical for contact centre use is the absolute agreement percentage — the proportion of assessor pairs that score within an acceptable range of each other.
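As a rough illustration, the absolute agreement percentage could be computed as below. The function name `agreement_percentage`, the sample totals, and the 5pp default tolerance are assumptions made for this sketch; the "acceptable range" should be whatever the QA programme defines.

```python
from itertools import combinations


def agreement_percentage(total_scores, tolerance_pp=5):
    """Percentage of assessor pairs whose total scores differ by at most tolerance_pp."""
    pairs = list(combinations(total_scores.values(), 2))
    if not pairs:
        raise ValueError("at least two assessors are needed to form a pair")
    within = sum(1 for a, b in pairs if abs(a - b) <= tolerance_pp)
    return 100 * within / len(pairs)


# Hypothetical total scores (%) given by three assessors to the same call.
totals = {"assessor_a": 78, "assessor_b": 82, "assessor_c": 90}

# Pairs: (78, 82) differ by 4, (78, 90) by 12, (82, 90) by 8.
print(round(agreement_percentage(totals), 1))
```

Here only one of the three pairs falls within 5pp, so the agreement percentage is about 33%.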

| IRR threshold (total score variance) | Assessment | Action |
| --- | --- | --- |
| Within ±3pp | Excellent: high reliability | No action required. Reduce calibration frequency if sustained for 3+ months. |
| Within ±5pp | Acceptable: meets standard | Continue existing calibration cadence. Note elements where variance occurred for targeted discussion. |
| 5–10pp variance | Below standard: calibration required | Increase calibration frequency. Identify which scorecard elements are producing variance. Clarify criteria in standards notes. |
| Above 10pp variance | Poor: scores not comparable across assessors | Suspend cross-assessor benchmarking and performance decisions based on QA scores until calibration is achieved. Run intensive calibration (weekly) and retrain assessors on ambiguous elements. |

Track IRR by scorecard element, not just total score: two assessors may produce the same total score through different element scores, one scoring empathy high and compliance low, the other the reverse. Element-level IRR tracking identifies which specific criteria are inconsistently applied, so the calibration discussion targets the right elements rather than the total score.
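The following sketch shows how a total score can hide element-level disagreement. It is illustrative only: the function names, the sample scores, and the unweighted mean used for the total are assumptions, as is the handling of band boundaries (a gap of exactly 5pp is treated as acceptable and exactly 10pp as below standard, which the thresholds above leave open).

```python
def irr_band(gap_pp):
    """Map an absolute score gap (pp) to an assessment band; boundary handling is assumed."""
    if gap_pp <= 3:
        return "excellent"
    if gap_pp <= 5:
        return "acceptable"
    if gap_pp <= 10:
        return "below standard"
    return "poor"


def total_score(element_scores):
    """Unweighted mean of element scores (an assumption; real scorecards may weight elements)."""
    return sum(element_scores.values()) / len(element_scores)


# Hypothetical scores from two assessors for the same call.
assessor_a = {"empathy": 90, "compliance": 60}
assessor_b = {"empathy": 60, "compliance": 90}

total_gap = abs(total_score(assessor_a) - total_score(assessor_b))
element_bands = {
    element: irr_band(abs(assessor_a[element] - assessor_b[element]))
    for element in assessor_a
}

print(irr_band(total_gap))  # band for the total score
print(element_bands)        # band for each element
```

In this example the totals are identical, so the total score lands in the top band, while both elements differ by 30pp and fall in the lowest band.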

Quality calibration questions

How often should contact centres run QA calibration sessions?

At minimum, monthly for teams with active QA programmes. For teams where QA scores are used in performance management, fortnightly calibration is recommended. Sessions involve all active QA assessors plus team leaders who use QA scores in coaching. A session takes 60–90 minutes and covers 2–4 calls, focusing discussion on elements where scores diverge by more than 5pp. New assessors should attend calibration more frequently until their inter-rater reliability is consistently within ±5pp of the reference assessor. Frequency can be reduced once the team achieves reliable consistency but should never drop below monthly.
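One way to make the new-assessor rule concrete is sketched below. The guide does not define "consistently", so the requirement of three consecutive sessions within tolerance is purely an assumption for illustration, as are the function name and the sample scores.

```python
def is_consistently_calibrated(new_scores, reference_scores,
                               tolerance_pp=5, required_sessions=3):
    """True if the most recent sessions all fall within tolerance of the reference.

    required_sessions=3 is an assumed definition of "consistently".
    """
    if len(new_scores) != len(reference_scores):
        raise ValueError("each new-assessor score needs a matching reference score")
    gaps = [abs(new - ref) for new, ref in zip(new_scores, reference_scores)]
    recent = gaps[-required_sessions:]
    return len(recent) == required_sessions and all(g <= tolerance_pp for g in recent)


# Hypothetical total scores (%) per calibration session, oldest first.
new_assessor = [70, 80, 84, 79]
reference = [82, 83, 80, 81]

# Gaps are 12, 3, 4, 2: the last three sessions are all within 5pp.
print(is_consistently_calibrated(new_assessor, reference))
```

A new assessor with fewer than the required number of sessions is treated as not yet calibrated, which keeps them on the more frequent cadence by default.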
