Contact centre disaster recovery & BCP

Q: What is a business continuity plan for a contact centre?

A contact centre BCP is a documented, tested plan for how the operation continues to handle customer contacts when its normal operating conditions are disrupted. It defines: what triggers a BCP response (system outage, site evacuation, major incident); tiered response levels and who declares each; minimum viable staffing (the lowest agent count that maintains an acceptable service level or regulatory compliance); failover routing (directing contacts to an alternative site or IVR/digital channel); work-from-home activation process; and escalation and communication protocols. A BCP that exists as a document but has never been tested is not a BCP — it is a plan for what someone hoped would happen.

Q: What counts as a contact centre disaster for BCP purposes?

For BCP purposes, a contact centre disaster is any event that prevents the operation from delivering its minimum service level from its normal site and systems. Common triggers: (1) Technology failure — telephony/ACD platform outage, CRM unavailability, network loss affecting >20% of agents, data centre failure. (2) Site loss — building evacuation (fire alarm, flood, structural), loss of utilities (power, heating, air), public health incident requiring closure. (3) Mass absence — acute illness event causing >30–40% of rostered staff to be unavailable simultaneously; civil emergency restricting movement. (4) Supplier failure — outsource partner or BPO declaring a major incident; cloud provider regional outage. A service level miss caused by a forecast error alone is not typically classified as a disaster — it is an operational failure managed through intraday flex.

Q: What is minimum viable operations in a contact centre?

Minimum viable operations (MVO) is the lowest staffing level at which a contact centre can legally and reputationally continue to operate. For regulated contact centres (FCA-supervised, Ofcom-licensed), MVO is set partly by regulatory obligation: FCA requires that financial advice and complaint contacts continue to be handled even under degraded conditions. For contact centres without regulatory minimum obligations, MVO is the Erlang C minimum needed to hold queue growth — below this floor, the queue compounds faster than agents can clear it. MVO is not a comfortable service level target — it is the floor below which you would be better off routing all contacts to IVR, digital self-service, or callback than trying to staff an inbound queue.

Peak staffing covers unplanned volume surges. Disaster recovery covers something different — not too many contacts, but too few agents. A site failure, system outage, or mass absence event collapses supply while demand often increases. The BCP is the bridge between normal operations and minimum viable service.

What triggers a BCP response?

💻 Technology failure

· Telephony/ACD platform outage
· CRM or case management unavailability
· Network loss affecting 20%+ of agents
· Data centre failure or cloud regional outage
· Telephony fraud event requiring number takedown

🏢 Site loss

· Building evacuation (fire, structural, flood)
· Loss of utilities (power, heating, air conditioning)
· Public health incident requiring closure
· Access restriction (police cordon, civil emergency)
· Long-term infrastructure failure (burst pipe, HVAC failure)

🏥 Mass absence

· Acute illness event (30%+ of roster unwell simultaneously)
· Transport disruption preventing staff from reaching site
· Industrial action
· Civil emergency restricting movement
· Extreme weather event (requires WFH as alternative)

🔗 Supplier / partner failure

· Outsource/BPO partner declaring a major incident
· Managed service provider failure (telephony, CRM)
· SaaS platform outage (workforce management software)
· Third-party IVR/bot provider outage affecting routing

Not a disaster: A service level miss caused by a forecast error, unexpectedly high volume on a single day, or a scheduled agent shortage is an operational failure, managed through intraday flex (Tier 1). It does not trigger BCP. Declaring a false BCP event erodes trust in the process and burns BCP resources (WFH equipment, BPO flex capacity) on situations that did not warrant them.

The tiered BCP response model

Tier 1: Degraded operations

Response within 15–30 minutes

Trigger: System slowdown, partial outage, 10–20% agent impact, one-team absence spike
Agent impact: Up to 20% of rostered agents unavailable or unproductive
Declared by: Real-time analyst or duty operations manager
WFM response: Intraday flex: cancel discretionary activities, pull breaks, offer emergency overtime to available agents. No BCP declaration needed.

Tier 2: Partial BCP activation

Activation within 1–2 hours

Trigger: Core system outage, site partial evacuation, 20–50% agent impact, supply-side emergency
Agent impact: 20–50% of rostered agents unavailable or unable to take contacts
Declared by: Operations manager or site director
WFM response: Activate WFH failover for affected agents. Route overflow to alternative site or BPO partner. Prioritise regulatory-obligatory contact types. Implement callback and queue messaging.

Tier 3: Full BCP activation

Full activation within 4 hours; regulatory notification within 24 hours if applicable

Trigger: Site loss, full platform failure, mass absence event (30%+), civil emergency
Agent impact: 50%+ of rostered agents unavailable; site cannot be used
Declared by: Head of Operations or Director
WFM response: Full WFH or alternative site activation. All non-essential contacts deflected to IVR/digital. MVO staffing deployed. Senior management and comms team engaged. Regulatory notification where required (FCA, Ofcom).
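The agent-impact bands in the tier model can be turned into a first-pass classifier. This is a minimal sketch, not a full decision procedure: the names `Tier` and `classify_tier` are illustrative, the handling of the exact 20% and 50% boundaries is an assumption (the bands overlap at those points), and qualitative triggers such as a platform failure still need a human declaration.

```python
from enum import Enum


class Tier(Enum):
    TIER_1 = "Degraded operations"
    TIER_2 = "Partial BCP activation"
    TIER_3 = "Full BCP activation"


def classify_tier(unavailable_agents: int, rostered_agents: int,
                  site_usable: bool = True) -> Tier:
    """Suggest a tier from agent impact alone (hypothetical helper).

    Assumption: exactly 20% counts as Tier 1 ("up to 20%") and
    exactly 50% counts as Tier 3 ("50%+").
    """
    if rostered_agents <= 0:
        raise ValueError("rostered_agents must be positive")
    if not site_usable:
        # "Site cannot be used" is a Tier 3 condition regardless of headcount.
        return Tier.TIER_3
    impact = unavailable_agents / rostered_agents
    if impact >= 0.50:
        return Tier.TIER_3
    if impact > 0.20:
        return Tier.TIER_2
    return Tier.TIER_1


if __name__ == "__main__":
    for lost in (15, 35, 60):
        print(lost, classify_tier(lost, 100).value)
```

The output is a suggestion for the person authorised to declare the tier, not a replacement for that declaration.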

Work-from-home failover

Post-2020, WFH capability is the primary BCP mitigation for most contact centres. It eliminates site-dependency risk and provides staffing continuity during site loss events. But WFH failover only works if it is tested and pre-provisioned:

WFH BCP activation checklist

Pre-provisioning (done before any incident)

✓ All agents have a tested home working setup (headset, laptop/thin client, sufficient broadband)
✓ PSTN/SIP softphone configuration tested from home network
✓ WFH tech access confirmed for each agent: CRM, WFM adherence, telephony, wrap/ACW
✓ Security controls confirmed: VPN or ZTNA, no call recording gaps, PCI DSS pause-and-resume functional from home
✓ Tested at least annually; untested WFH capability should be treated as unavailable for BCP purposes

Activation steps (during incident)

✓ Declare tier level and notify agents via out-of-band communication (SMS/WhatsApp, not email or Teams on the affected system)
✓ Route telephony to WFH-capable agents via ACD re-routing or DDI transfer
✓ Confirm agent-by-agent connectivity; do not assume 100% WFH yield, and plan for 75–85% actual availability
✓ Update WFM schedule to reflect WFH-capable vs. unavailable agents
✓ Set modified service level targets and communicate to stakeholders
✓ Define expected duration (hours, days, or extended), since each triggers different staffing and equipment decisions

Plan for 75–85% WFH yield, not 100%. In a real activation, some agents will have broadband issues, equipment failures, or personal circumstances preventing WFH. Building a WFH BCP that assumes 100% conversion from site to home is systematically over-optimistic. Use 80% as the planning assumption unless empirically validated otherwise.
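The yield assumption is simple arithmetic, but writing it down makes the planning range explicit. A minimal sketch, with hypothetical function names and integer percentages to avoid rounding surprises:

```python
def expected_wfh_agents(wfh_capable: int, yield_pct: int = 80) -> int:
    """Agents expected to actually log in from home, rounded down.

    yield_pct defaults to the 80% planning assumption.
    """
    if not 0 <= yield_pct <= 100:
        raise ValueError("yield_pct must be between 0 and 100")
    return wfh_capable * yield_pct // 100


def wfh_planning_range(wfh_capable: int) -> dict[int, int]:
    """Expected headcount at the 75%, 80% and 85% yield points."""
    return {pct: expected_wfh_agents(wfh_capable, pct) for pct in (75, 80, 85)}


if __name__ == "__main__":
    print(wfh_planning_range(120))
```

For 120 WFH-capable agents this gives 90, 96 and 102 agents at 75%, 80% and 85% yield, so the plan should be sized against 96 rather than 120.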

Minimum viable operations (MVO)

Worked example: setting MVO for a 100-agent regulated financial services contact centre

Regulatory minimum

FCA requires that complaints and advised sales contacts are not abandoned and meet minimum handling standards. Estimated minimum: 12 agents to handle obligatory contact types at legally compliant levels.

Queue floor (Erlang C)

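At normal volume (300 calls/hr, 6 min AHT), the offered load is 300 × 6 / 60 = 30 Erlangs. Erlang C only gives a stable queue when the agent count exceeds the offered load, so the minimum to prevent unbounded queue growth is 31 agents. At 30 or fewer, the queue has no spare capacity to recover and compounds at every interval.

MVO determination

MVO = max(regulatory minimum, queue floor) = max(12, 31) = 31 agents. At this level, SL will be poor but the queue will not compound indefinitely. Contact types not covered by a regulatory obligation can be deflected to IVR callback.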
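The MVO determination can be reproduced in a few lines. This is a sketch under stated assumptions: the function names are illustrative, and the queue floor uses the strict Erlang C stability condition (agents must exceed the offered load in Erlangs), so for a 30 Erlang load it returns 31.

```python
import math


def offered_load_erlangs(calls_per_hour: float, aht_minutes: float) -> float:
    """Offered load in Erlangs: arrival rate multiplied by handle time in hours."""
    return calls_per_hour * aht_minutes / 60.0


def queue_floor(calls_per_hour: float, aht_minutes: float) -> int:
    """Smallest agent count strictly above the offered load."""
    load = offered_load_erlangs(calls_per_hour, aht_minutes)
    return math.floor(load) + 1


def minimum_viable_operations(regulatory_minimum: int,
                              calls_per_hour: float,
                              aht_minutes: float) -> int:
    """MVO = max(regulatory minimum, queue floor)."""
    return max(regulatory_minimum, queue_floor(calls_per_hour, aht_minutes))


if __name__ == "__main__":
    # Worked example inputs: 12-agent regulatory minimum, 300 calls/hr, 6 min AHT.
    print(offered_load_erlangs(300, 6))
    print(minimum_viable_operations(12, 300, 6))
```

If the regulatory minimum were higher than the queue floor (for example at very low volume), the same function would return the regulatory figure instead.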

Define MVO before you need it

Determining MVO during an active incident under time pressure produces the wrong number. Calculate it in advance using your normal Erlang C inputs, your regulatory obligations, and your maximum tolerable queue growth rate. Document it. Review it when volume or SL targets change.

MVO is not a service level target — it is a floor

Operating at MVO means accepting that service level will be poor. The purpose is to prevent the situation from becoming permanently unrecoverable (infinite queue spiral, regulatory breach, reputational catastrophe) while the BCP incident is resolved. Communicate the degraded SL to stakeholders explicitly rather than hoping it goes unnoticed.
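To put a number on "poor", the standard Erlang C formula can estimate service level just above the queue floor. The sketch below assumes no abandonment and exponentially distributed handle times, which is what Erlang C itself assumes; the function names are illustrative, and returning 0.0 for an unstable queue is a convention chosen here.

```python
import math


def erlang_c_wait_probability(agents: int, load: float) -> float:
    """Probability that a contact has to queue (Erlang C)."""
    if agents <= load:
        return 1.0  # unstable queue: every contact waits
    blocking = 1.0  # Erlang B with zero agents
    for n in range(1, agents + 1):
        # Standard Erlang B recursion, numerically stable for large n.
        blocking = load * blocking / (n + load * blocking)
    occupancy = load / agents
    return blocking / (1.0 - occupancy * (1.0 - blocking))


def service_level(agents: int, calls_per_hour: float,
                  aht_seconds: float, target_seconds: float) -> float:
    """Fraction of contacts answered within target_seconds."""
    load = calls_per_hour * aht_seconds / 3600.0
    if agents <= load:
        return 0.0  # convention for this sketch: no steady state exists
    wait_probability = erlang_c_wait_probability(agents, load)
    decay = math.exp(-(agents - load) * target_seconds / aht_seconds)
    return 1.0 - wait_probability * decay


if __name__ == "__main__":
    for agents in (31, 33, 36):
        sl = service_level(agents, 300, 360, 20)
        print(f"{agents} agents: {sl:.0%} answered in 20s")
```

With the worked example's inputs (300 calls/hr, 6 min AHT), one agent above the 30 Erlang load leaves service level far below an 80% in 20s target, which is the figure to communicate to stakeholders in advance.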

Below MVO, deflection is better than staffing

If available agents fall below MVO, continuing to staff the inbound queue and accumulating compounding backlog is worse than routing all contacts to a callback, digital channel, or IVR message explaining the disruption. A managed callback experience is less damaging than an hours-long wait with agents that cannot clear the queue.
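A rough way to see why deflection wins below the floor is to compare arrivals with handling capacity. This is a simplified sketch with hypothetical function names: it ignores abandonment and repeat callers, so real backlog growth will differ.

```python
def backlog_growth_per_hour(agents: int, calls_per_hour: float,
                            aht_minutes: float) -> float:
    """Contacts added to the queue each hour when arrivals exceed capacity."""
    capacity_per_hour = agents * 60.0 / aht_minutes
    return max(0.0, calls_per_hour - capacity_per_hour)


def should_deflect(available_agents: int, mvo: int) -> bool:
    """True when staffing is below MVO and the inbound queue should be deflected."""
    return available_agents < mvo


if __name__ == "__main__":
    print(backlog_growth_per_hour(25, 300, 6))
```

With 25 agents against 300 calls/hr at 6 min AHT, capacity is 250 calls/hr, so the backlog grows by about 50 contacts every hour the queue stays open.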

WFM planning under degraded conditions

A BCP activation changes the WFM inputs and constraints. Normal planning assumptions do not apply during an incident:

Forecasting

Normal: Rolling 4–8 week forecast using historical patterns
Degraded (BCP): Real-time contact volume only; historical patterns irrelevant during incident. Estimate volume based on live ACD data and expected incident duration. Abandon weekly forecast cycle for the duration.

Scheduling

Normal: Shift schedule published 2–4 weeks in advance
Degraded (BCP): 24-hour emergency scheduling based on available agents (WFH-confirmed, alternative site, BPO overflow). Daily reforecast of available supply. Priority rostering for highest-skill, highest-priority contact handlers.

Shrinkage

Normal: Planned shrinkage of 30–40% absorbed in scheduled headcount
Degraded (BCP): Cancel all discretionary activities (training, meetings, coaching). Emergency overtime authorised. Annual leave requests suspended or reverted. Target: 15–20% shrinkage during incident.

Service level targets

Normal: Standard SLA (e.g., 80% in 20s for voice, 90% in 24h for email)
Degraded (BCP): Reduced SL targets agreed in advance with stakeholders. Regulatory minimum maintained. Non-critical SLA suspended for duration. Customer communications updated to set expectations.
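The shrinkage change is where most of the recoverable capacity comes from during an incident. A minimal sketch of the arithmetic, using a hypothetical helper and integer percentages:

```python
def productive_agents(rostered: int, shrinkage_pct: int) -> int:
    """Agents available for contacts after shrinkage, rounded down."""
    if not 0 <= shrinkage_pct <= 100:
        raise ValueError("shrinkage_pct must be between 0 and 100")
    return rostered * (100 - shrinkage_pct) // 100


if __name__ == "__main__":
    available = 60  # illustrative: agents still reachable during the incident
    print(productive_agents(available, 35))  # normal planned shrinkage
    print(productive_agents(available, 15))  # BCP shrinkage target
```

For an illustrative 60 reachable agents, cutting shrinkage from 35% to 15% lifts productive headcount from 39 to 51 without adding a single person to the roster.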

Testing your BCP without a real incident

Tabletop exercise

Quarterly

Walk through a scenario (e.g., "telephony outage at 09:00 on a Monday") with operations, WFM, and IT in a meeting room. No systems activated — pure discussion. Tests whether people know the plan and whether the plan is coherent. The most common finding: the BCP contact tree has outdated numbers.

Partial WFH drill

2× per year

Select one team and activate WFH for a half-day without a real incident. Tests: connectivity from home, ACD routing, CRM access, and WFM monitoring from home. Measures actual WFH yield, showing whether the 80% planning assumption is realistic.

Full site failover test

Annually

Full site closure simulation with all agents working from home or alternative site for one full business day. Tests the entire BCP at scale. Run on a lower-volume day (e.g., a Saturday in a low-peak period). Requires stakeholder sign-off and customer communication.

Post-incident review

After every real activation

Document what the plan said should happen, what actually happened, and the gap. Every real incident is a BCP test. The gap is the plan update. Operations that skip post-incident reviews are planning to fail again the same way.

BCP and disaster recovery questions

What is a business continuity plan for a contact centre?

A documented, tested plan for how the contact centre continues handling customer contacts when normal operating conditions are disrupted. It defines: what triggers BCP response (system outage, site loss, mass absence); tiered response levels; minimum viable staffing; failover routing; WFH activation; and escalation/communication protocols. A BCP that has never been tested should be treated as unavailable.

What counts as a contact centre disaster for BCP purposes?

Any event preventing the operation from delivering minimum service level from its normal site and systems: technology failure (ACD outage, CRM unavailability), site loss (evacuation, utility failure), mass absence (30%+ of roster simultaneously unavailable), or supplier failure (BPO partner major incident, cloud regional outage). A service level miss from forecast error is not a disaster — it is an operational failure managed through intraday flex.

What is minimum viable operations in a contact centre?

MVO is the lowest staffing level at which the contact centre can legally and operationally continue. For regulated contact centres (FCA, Ofcom), it is set partly by regulatory obligation. For others, it is the Erlang C minimum needed to prevent infinite queue growth. Below MVO, deflecting contacts to callback or digital self-service is better than maintaining an understaffed inbound queue that compounds indefinitely.