
Speech analytics in contact centres

Q: What is speech analytics in a contact centre?

Speech analytics in a contact centre refers to technology that analyses call recordings to extract structured data: which words and phrases were used, how long silences lasted, whether specific regulatory disclosures were made, what emotions were detectable in tone, and whether agents followed required scripts or processes. There are two main technical approaches: phonetic search (which searches audio directly by sound pattern — fast but less accurate) and transcription-based analysis (which converts speech to text first, then analyses the text — more accurate but slower and more expensive). Both produce structured data from previously unstructured audio, enabling organisations to analyse 100% of calls rather than the 1–3% sampled manually for QA.

Q: What is auto-QA and how accurate is it?

Auto-QA (automated quality assessment) uses speech analytics to score call recordings against a quality framework without human review.

It can reliably detect whether specific words or phrases were said (script adherence, required disclosures, prohibited phrases), how long holds and silences lasted, and call outcome metadata (whether a resolution code was entered or a complaint flag raised). It cannot reliably detect agent empathy and tone in nuanced human terms, whether the explanation given was actually accurate and helpful, or whether the resolution was genuinely good for the customer rather than merely technically correct.

Auto-QA accuracy for objective items (script adherence, disclosure compliance) is typically 85–95%. Accuracy for subjective quality judgements (customer experience quality, tone appropriateness) is typically 60–80%: useful as a flag, not as a definitive assessment. Best practice is to use auto-QA to score 100% of calls on objective criteria, and to use manual QA on a sample (8–10 calls per agent per month) for subjective quality dimensions.
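The objective checks described above (required phrases present, prohibited phrases absent) can be sketched as pattern matching over a transcript. This is a minimal illustration only: the checklist items, the phrases, and the names ChecklistItem and score_transcript are hypothetical, and a production system would also need to handle transcription errors and paraphrased wording.

```python
import re
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str
    pattern: str       # regular expression searched in the transcript
    must_appear: bool  # True = required phrase, False = prohibited phrase


# Hypothetical checklist; the phrases are illustrative, not regulatory wording.
CHECKLIST = [
    ChecklistItem("recording_disclosure", r"calls? (may be|are) recorded", True),
    ChecklistItem("identity_check", r"confirm your (date of birth|postcode)", True),
    ChecklistItem("no_guarantee_language", r"\b(guaranteed|risk[- ]free)\b", False),
]


def score_transcript(transcript: str, checklist=CHECKLIST) -> dict:
    """Return a pass/fail result per objective item for one call transcript."""
    text = transcript.lower()
    results = {}
    for item in checklist:
        found = re.search(item.pattern, text) is not None
        # Required phrases pass when found; prohibited phrases pass when absent.
        results[item.name] = found if item.must_appear else not found
    return results


example = (
    "Thanks for calling. Calls may be recorded for training. "
    "Can you confirm your postcode? This product is risk-free."
)
scores = score_transcript(example)
print(scores)
```

In this example the two required items pass and the prohibited-phrase item fails, which is the kind of result that would be routed to a supervisor for review.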

Manual QA reviews 1–3% of calls. Speech analytics reviews 100%. The technology converts unstructured audio into structured operational data: which phrases were used, how long holds lasted, whether a disclosure was made. When applied to the right problems, the ROI is strong. When bought on vendor promises about emotion detection, it is not.

Note on legal jurisdiction

This guide describes UK GDPR and data protection obligations as they apply to contact centres operating in Great Britain. Data protection law varies by jurisdiction. Always verify the requirements applicable to your operation with your Data Protection Officer or legal counsel before changing data handling practices. This guide is for operational context, not legal advice.

Phonetic search vs. transcription-based analysis

Phonetic search

How it works: Searches the audio recording directly by sound pattern. The system does not first convert speech to text — it looks for phoneme sequences that match the search term.

Accuracy: 85–92% for common English phrases in clear audio. Drops significantly with accents, background noise, or technical vocabulary.

Speed: Fast. Can search 1,000 hours of audio in minutes because there is no transcription step.

Best for: High-volume keyword and phrase search (specific word detection, prohibited phrase monitoring, competitor name mentions).

Limitation: Cannot handle context. A search for 'happy' matches both 'not happy' and 'very happy', because the system matches sound patterns and has no sentence-level understanding.

Transcription-based analysis

How it works: Converts speech to text first (using ASR — Automatic Speech Recognition), then analyses the text using NLP. Produces a full text transcript that can be searched, categorised, and fed to LLM-based models.

Accuracy: 92–98% for clear UK English audio on modern ASR models (for example Google, Amazon Transcribe, Azure Cognitive Services). Lower for accents, cross-talk, or poor audio quality.

Speed: Slower than phonetic — transcription takes processing time. Batch or near-real-time rather than instant.

Best for: Auto-QA scoring, sentiment analysis, topic categorisation, emerging theme detection, integration with generative AI for call summarisation.

Limitation: Accuracy drops significantly with poor audio quality or strong regional accents. Transcription errors compound in downstream analysis.
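To illustrate how transcription errors compound, consider an exact-match search for a multi-word disclosure phrase. The sketch below assumes each word is transcribed correctly with the same probability, independently of the others. That is a simplification (real ASR errors tend to cluster), and the function name and the 8-word phrase length are illustrative assumptions.

```python
def phrase_exact_match_probability(word_accuracy: float, phrase_length: int) -> float:
    """Probability that every word in a phrase is transcribed correctly,
    assuming independent per-word errors (a simplifying assumption)."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if phrase_length < 1:
        raise ValueError("phrase_length must be at least 1")
    return word_accuracy ** phrase_length


# Hypothetical 8-word disclosure phrase at three per-word accuracy levels.
for accuracy in (0.98, 0.95, 0.92):
    p = phrase_exact_match_probability(accuracy, 8)
    print(f"{accuracy:.0%} word accuracy -> {p:.1%} chance the phrase is intact")
```

Under these assumptions, at 95% word accuracy an 8-word phrase comes through intact roughly two times in three, which is why fuzzy or partial phrase matching is usually preferable to exact matching on transcripts.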

Use case matrix: what speech analytics can and cannot do

Use case	Technical approach	Realistic accuracy	Operational value
Mandatory disclosure detection (FCA, GDPR, TCF)	Phonetic or transcription keyword search	90–96% on standard phrases	High — compliance monitoring at 100% coverage vs. 1-3% manual sample; regulatory evidence on demand
Prohibited phrase detection ('guaranteed', 'risk-free')	Phonetic search	85–92%	High — risk phrase alerts for coaching and compliance before they become FCA findings
Silence and hold detection (AHT analysis)	Audio signal analysis (no transcription needed)	95–99%	Very high — silence patterns reveal hold abuse, system navigation delays, knowledge gaps. Directly actionable for AHT reduction
Call categorisation by topic	Transcription + NLP topic modelling	80–90% for top 10 topics	High — replaces manual wrap code entry; reduces ACW; more accurate categorisation of contact types
Auto-QA scoring (objective criteria)	Transcription + checklist matching	85–95%	High — covers 100% of contacts for objective items (disclosure, script adherence, resolution code)
Emotion/sentiment detection (customer distress, frustration)	Transcription + acoustic analysis	60–80% — tone and text combined	Medium — useful as a flag for supervisory review; not reliable enough for standalone performance assessment
Agent empathy and tone quality scoring	Transcription + LLM evaluation	60–75% alignment with manual QA	Medium — directional signal only; manual QA still required for nuanced quality judgements
Competitor name and churn intent detection	Phonetic or transcription keyword search	85–95%	Medium-high — feeds save team routing and competitive intelligence

Silence detection as an AHT diagnostic tool

Silence analysis breaks AHT into components that reveal specific causes

Hold silence

Agent placed the call on hold deliberately. If average hold duration exceeds 2 minutes for a specific contact type, investigate: is the knowledge base inadequate? Is an approval needed? Is the system slow?

Action: Process redesign, knowledge improvement, system performance review.

Dead air / mutual silence

Neither agent nor customer is speaking. Common during agent desktop navigation between systems, especially on legacy multi-application desktops. Silence longer than 30 seconds indicates system friction.

Action: Desktop simplification, application consolidation, faster navigation training.

Agent monologue (no customer response)

Agent speaking for >2 minutes without customer interruption. May indicate agent is reading from a script without checking comprehension, or customer is disengaged.

Action: Coaching — pacing, comprehension checks, dialogue structure.

Long ACW silence (post-call)

Recording ends but the ACW code has not been entered: the agent is navigating post-call admin. A gap of more than 90 seconds after call end indicates system or process friction.

Action: ACW process simplification; ACW system access review.
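The four silence types and their thresholds can be expressed as a simple rule table. The sketch below is hypothetical: the Segment type, the flag_segments function, and the assumption that segments arrive already labelled by type are all illustrative. It also applies the hold threshold per segment, whereas the guidance above refers to the average hold duration per contact type.

```python
from dataclasses import dataclass

# Thresholds in seconds, taken from the silence types described above.
THRESHOLDS_S = {
    "hold": 120,
    "dead_air": 30,
    "agent_monologue": 120,
    "acw": 90,
}

# Suggested follow-up per silence type, summarised from the actions above.
ACTIONS = {
    "hold": "Process redesign, knowledge improvement, system performance review",
    "dead_air": "Desktop simplification, application consolidation, navigation training",
    "agent_monologue": "Coaching: pacing, comprehension checks, dialogue structure",
    "acw": "ACW process simplification; ACW system access review",
}


@dataclass
class Segment:
    kind: str          # one of the keys in THRESHOLDS_S
    duration_s: float  # length of the segment in seconds


def flag_segments(segments):
    """Return (kind, duration, action) for segments that exceed their threshold."""
    return [
        (s.kind, s.duration_s, ACTIONS[s.kind])
        for s in segments
        if s.duration_s > THRESHOLDS_S[s.kind]
    ]


call = [Segment("dead_air", 45), Segment("hold", 60), Segment("acw", 95)]
flags = flag_segments(call)
for kind, duration, action in flags:
    print(f"{kind}: {duration:.0f}s -> {action}")
```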

Silence analysis requires no transcription: it is audio signal analysis. This means it works with 100% of call recordings at very low cost compared to full transcription. Start with silence analysis before investing in transcription-based analytics; the AHT diagnostic value is immediate and the cost is a fraction of a full speech analytics deployment.
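As an illustration of why no transcription is needed, silence can be found by measuring the energy of short audio frames and looking for runs below a loudness threshold. This is a minimal sketch, not a description of any vendor's method: the function name, the 20 ms frame size, the -40 dB threshold, and the 1-second minimum are assumptions, and real recordings would also need separation of agent and customer channels and handling of hold music.

```python
import math


def find_silences(samples, sample_rate, frame_ms=20, threshold_db=-40.0,
                  min_silence_s=1.0):
    """Return (start_s, end_s) pairs for runs of low-energy frames.

    samples: mono audio as floats in the range [-1.0, 1.0].
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    silences = []
    run_start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        t = i * frame_len / sample_rate
        if level_db < threshold_db:
            if run_start is None:
                run_start = t  # a quiet run begins at this frame
        elif run_start is not None:
            if t - run_start >= min_silence_s:
                silences.append((run_start, t))
            run_start = None
    # Close a quiet run that lasts until the end of the recording.
    if run_start is not None:
        end = n_frames * frame_len / sample_rate
        if end - run_start >= min_silence_s:
            silences.append((run_start, end))
    return silences


# Synthetic test signal at 8 kHz: 1 s of tone, 2 s of silence, 1 s of tone.
RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE)]
signal = tone + [0.0] * (2 * RATE) + tone
silences = find_silences(signal, RATE)
print(silences)
```

The detected runs can then be compared against the duration thresholds for each silence type.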

Realistic ROI: where speech analytics pays and where it does not

High-ROI applications

✓ Compliance monitoring at 100% coverage: avoids the cost of regulatory fines and remediation arising from issues that manual sampling misses
✓ AHT reduction via silence analysis: identifying specific silence types and addressing root causes typically delivers 5–15% AHT reduction
✓ Auto-categorisation replacing wrap codes: reduces ACW and improves data quality for forecasting
✓ Churn intent detection feeding save team routing: incremental save revenue vs. cost of analytics licence

Overstated in vendor pitches

✗ Emotion detection as a performance management tool: accuracy too low; legal challenge risk (Equality Act); GDPR special category data concerns
✗ Auto-QA replacing manual QA entirely: 60–75% alignment on subjective criteria means 25–40% of auto-scores disagree with a human reviewer. Keep manual QA for nuanced quality dimensions
✗ 100% of contacts auto-scored for agent performance review: auto-QA at this scope requires significant calibration effort and creates industrial relations risk if not validated by human review
✗ Speech analytics as a cost-reduction tool in isolation: it finds problems; fixing them requires operational change effort that is separate from the analytics licence cost
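The alignment figure quoted above can be checked in your own operation by scoring the same calls with both auto-QA and manual QA and measuring how often they agree. The sketch below is a hypothetical calibration check on pass/fail results; the function name and the sample data are illustrative, not measured results.

```python
def agreement_rate(auto_scores, manual_scores):
    """Share of calls where auto-QA and manual QA gave the same pass/fail result."""
    if len(auto_scores) != len(manual_scores):
        raise ValueError("both score lists must cover the same calls")
    if not auto_scores:
        raise ValueError("at least one scored call is required")
    matches = sum(a == m for a, m in zip(auto_scores, manual_scores))
    return matches / len(auto_scores)


# Illustrative data: ten calls scored on one subjective criterion by both methods.
auto = [True, True, False, True, False, True, True, False, True, True]
manual = [True, False, False, True, True, True, True, False, False, True]

rate = agreement_rate(auto, manual)
print(f"Agreement: {rate:.0%}, disagreement: {1 - rate:.0%}")
```

With this sample data the two methods agree on 7 of 10 calls, so 3 in 10 auto-scores would have been wrong if used without human validation.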

Speech analytics questions

What is speech analytics in a contact centre?

Technology that analyses call recordings to extract structured data: words/phrases used, silence duration, regulatory disclosures made, tone/emotion signals, topic categorisation. There are two approaches: phonetic search (searches audio directly by sound pattern — fast but less accurate) and transcription-based (converts speech to text first, then analyses — more accurate but slower). Both enable analysis of 100% of calls vs. 1-3% manual QA sampling.

What is auto-QA and how accurate is it?

Auto-QA uses speech analytics to score call recordings against a quality framework without human review. Accuracy for objective items (script adherence, disclosure compliance): 85–95%. Accuracy for subjective quality (empathy, tone, customer experience quality): 60–80%. Best practice: auto-QA on 100% of calls for objective criteria; manual QA on 8–10 calls per agent per month for subjective dimensions. Do not use auto-QA as a standalone performance assessment tool.