
Forecast data cleansing

Q: Why do you need to cleanse historical data before forecasting contact volume?

Because a forecasting model cannot tell the difference between genuine demand and a one-off anomaly — it learns a pattern from whatever data you give it. If last year's history contains a day when a system outage suppressed volume to near zero, an untreated model will learn that the corresponding day/period is naturally quiet and under-forecast it this year. If the history contains a one-off spike from a viral news story or a billing error, the model will expect that spike to recur and over-forecast. Data gaps (missing intervals) can be read as zero demand; double-counted intervals (from a reporting glitch or a system migration) inflate the apparent baseline. None of these are real, repeatable demand patterns, but the model treats them as signal unless they are cleansed first. Cleansing means identifying each anomaly and either removing it, or replacing it with a representative value (e.g. the same interval from a comparable normal week), so that the model trains only on genuine, repeatable demand. It is the least glamorous step in forecasting and the one that most determines whether the output can be trusted — a sophisticated model trained on dirty data produces confidently wrong forecasts.

A forecasting model cannot tell real demand from a one-off anomaly — it learns from whatever you feed it. An outage day, a one-off media spike, or a data gap, left in the history, teaches the model a pattern that is not real. Cleansing the history is the unglamorous step that decides whether the forecast can be trusted.

Garbage in, confidently-wrong out

The danger of dirty data is not that the model breaks. It is that the model produces a forecast that looks perfectly reasonable but is built on patterns that are not real. A sophisticated forecasting method trained on uncleansed history will confidently reproduce an outage day as a recurring quiet period, or a one-off spike as a seasonal peak. The error is invisible in the output and only shows up as an unexplained miss when actual volume doesn't match the forecast. Cleansing is upstream insurance: the most valuable forecasting work often happens before the model runs, in deciding which history is genuine signal and which is noise to remove.

Five anomaly types and how to treat them

System outage / downtime

What it looks like

An interval or day where volume drops to near zero because the phone system, website, or a dependent service was down, often followed by a spike once service recovers: customers couldn't get through, then re-contacted afterward.

How to treat it

Replace the suppressed period with a representative value from a comparable normal day/week. Also treat the post-recovery spike (pent-up re-contacts) as anomalous — it is not a recurring pattern. Flag the date so it is excluded from future baselines.
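The replacement step can be sketched as below. This is a minimal illustration, not a prescribed method: it assumes history is held as a dict of date to a list of interval volumes, and that "comparable" means the same weekday in earlier weeks. The function name, the four-week lookback, the use of the median, and the sample figures are all assumptions made for the example.

```python
from datetime import date, timedelta
from statistics import median

def representative_day(history, target, flagged, weeks=4):
    """Per-interval medians from the same weekday in up to `weeks` earlier
    weeks, skipping dates that are missing or flagged as anomalous."""
    donors = []
    for k in range(1, weeks + 1):
        d = target - timedelta(weeks=k)
        if d in history and d not in flagged:
            donors.append(history[d])
    if not donors:
        raise ValueError(f"no clean comparable day found for {target}")
    # zip(*donors) groups the same interval position across the donor days
    return [median(vals) for vals in zip(*donors)]

# Hypothetical data: four intervals per day, same weekday over four weeks
history = {
    date(2024, 3, 4): [40, 55, 60, 45],
    date(2024, 3, 11): [42, 57, 62, 47],
    date(2024, 3, 18): [44, 53, 58, 43],
    date(2024, 3, 25): [2, 0, 1, 130],  # outage, then post-recovery spike
}
flagged = {date(2024, 3, 25)}  # keep the flag so this date never acts as a donor

cleansed = dict(history)
cleansed[date(2024, 3, 25)] = representative_day(history, date(2024, 3, 25), flagged)
```

Replacing the whole day treats the suppressed intervals and the post-recovery spike together, and the flagged set keeps the outage date out of later baselines.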

One-off event spike

What it looks like

A sharp, isolated volume spike from a non-recurring cause — a viral news story, a product recall, a billing error affecting many customers, a marketing send that won't repeat.

How to treat it

Cap or replace the spike with a normal value if the event will not recur. If similar events recur predictably (e.g. an annual sale), keep them but model them as a separate event multiplier rather than as part of the baseline trend.
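One way to cap an isolated spike is sketched below. It is an assumption-laden example, not the guide's own rule: the ceiling is set at the median plus three times the median absolute deviation (MAD), a common robust threshold, and the function name, the factor k, and the sample volumes are hypothetical.

```python
from statistics import median

def cap_spike(values, k=3.0):
    """Cap any value above median + k * MAD (median absolute deviation)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    ceiling = med + k * mad
    return [min(v, ceiling) for v in values]

# Hypothetical daily volumes; 2600 is a one-off spike that will not recur
daily = [1000, 1040, 980, 1020, 2600, 1010, 990]
capped = cap_spike(daily)
```

Capping keeps the day in the series at a plausible level. If the event is expected to recur, leave the value in place and handle it as an explicit event instead.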

Data gap / missing intervals

What it looks like

Intervals or days with no data at all — a feed failure, a logging gap, a system migration window. Easily mistaken for zero demand by a model that reads absence as zero.

How to treat it

Do not let missing data be read as zero — that teaches the model a false low. Either exclude the gap from training, or interpolate using the same interval from comparable normal periods. Document which periods were interpolated.
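The distinction between "missing" and "zero" can be made explicit in the data itself. The sketch below assumes missing intervals are stored as None (never 0) and fills them from the same interval position in comparable weeks. The function name, the use of the mean, and the sample data are illustrative assumptions.

```python
from statistics import mean

def fill_gaps(target_week, comparable_weeks):
    """Fill None entries using the mean of the same interval position in
    comparable weeks. Returns (filled_series, interpolated_indexes)."""
    filled, interpolated = [], []
    for i, v in enumerate(target_week):
        if v is not None:
            filled.append(v)
            continue
        donors = [w[i] for w in comparable_weeks if w[i] is not None]
        if donors:
            filled.append(mean(donors))
            interpolated.append(i)  # record it so the fill can be documented
        else:
            filled.append(None)     # no donor: leave the gap and exclude it from training
    return filled, interpolated

# Hypothetical data: None marks a feed failure; the 0 is a genuine zero
week = [120, None, None, 0, 95]
others = [[118, 130, 140, 2, 97], [122, 134, 136, 0, 93]]
filled, interpolated = fill_gaps(week, others)
```

The returned index list is what goes into the adjustment log, so interpolated periods stay distinguishable from observed ones.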

Double-counting / data duplication

What it looks like

Volume that is inflated because the same contacts were counted twice — a reporting glitch, overlapping queues being summed, or a system migration where both old and new systems logged the same period.

How to treat it

Identify and de-duplicate at source. Double-counting is insidious because the inflated figures look plausible — they just quietly raise the baseline, leading to systematic over-staffing. Reconcile total volume against an independent source (e.g. billing or CRM contact counts) to catch it.
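A minimal sketch of both checks follows. It assumes each contact record carries an identifier shared across the overlapping systems (here a hypothetical contact_id field) and that an independent total, such as a CRM count, is available. The 5% tolerance and the sample records are arbitrary illustrations.

```python
def dedupe_contacts(records):
    """Keep the first record seen for each contact_id."""
    seen, unique = set(), []
    for r in records:
        if r["contact_id"] not in seen:
            seen.add(r["contact_id"])
            unique.append(r)
    return unique

def reconciles(reported_total, independent_total, tolerance=0.05):
    """True when the two totals agree within a relative tolerance."""
    if independent_total == 0:
        return reported_total == 0
    return abs(reported_total - independent_total) / independent_total <= tolerance

# Hypothetical migration window: both systems logged contact C2
old_system = [{"contact_id": "C1"}, {"contact_id": "C2"}]
new_system = [{"contact_id": "C2"}, {"contact_id": "C3"}]
combined = old_system + new_system
unique = dedupe_contacts(combined)

crm_total = 3  # independent count, assumed for the example
before = reconciles(len(combined), crm_total)  # inflated total fails the check
after = reconciles(len(unique), crm_total)     # de-duplicated total passes
```

The reconciliation check is the part that catches double-counting in the first place, since the inflated series looks plausible on its own.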

Known calendar effects mistaken for trend

What it looks like

Bank holidays, seasonal closures, or a leap-year/payday effect appearing as if they were part of the underlying trend, because they fall in the training window.

How to treat it

Don't cleanse these away — they are real and recurring — but model them explicitly (as holiday flags or event multipliers) rather than letting them blur into the trend. Cleansing is for non-recurring noise; recurring calendar effects should be captured, not deleted.
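As a rough illustration of modelling a calendar effect explicitly, the sketch below keeps holiday days out of the baseline and captures them as a multiplier instead. The flag list, function name, and volumes are assumptions. A real model would usually estimate separate multipliers per holiday type.

```python
from statistics import mean

def baseline_and_multiplier(volumes, is_holiday):
    """Baseline from non-holiday days only, plus the ratio of average
    holiday volume to that baseline."""
    normal = [v for v, h in zip(volumes, is_holiday) if not h]
    holiday = [v for v, h in zip(volumes, is_holiday) if h]
    if not normal or not holiday:
        raise ValueError("need at least one holiday and one non-holiday day")
    baseline = mean(normal)
    return baseline, mean(holiday) / baseline

# Hypothetical daily volumes with one bank holiday
volumes = [1000, 1020, 980, 400, 1000]
flags = [False, False, False, True, False]
baseline, multiplier = baseline_and_multiplier(volumes, flags)

# Forecast for a future bank holiday: baseline scaled by the multiplier
holiday_forecast = baseline * multiplier
```

The holiday stays in the data as a real, recurring effect, but it no longer drags the baseline down for ordinary days.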

Cleansing principles

Remove non-recurring noise; keep recurring signal

The test for whether to cleanse something: will it happen again on a predictable basis? A one-off outage or viral spike won't — cleanse it. A bank holiday or annual sale will — keep it, but model it explicitly as an event rather than letting it blur into the trend.

Replace, don't just delete

Deleting an anomalous interval leaves a gap the model may read as zero. Better to replace the anomaly with a representative value — the same interval from a comparable normal week — so the series stays continuous and the model trains on a plausible value.

Document every adjustment

Keep a log of what was cleansed, when, and why. An undocumented adjustment is indistinguishable from a data error later, and makes the forecast impossible to audit. The log is also how you build institutional knowledge of recurring anomaly sources.
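An adjustment log needs very little structure to be useful. The sketch below shows one possible record layout written out as CSV. The field names and sample entry are assumptions for illustration, not a required schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date

@dataclass(frozen=True)
class Adjustment:
    period: str            # interval or date that was changed
    anomaly_type: str      # e.g. "outage", "one-off spike", "data gap"
    action: str            # e.g. "replaced", "capped", "excluded"
    original_value: float
    cleansed_value: float
    reason: str            # why it was judged non-recurring
    adjusted_on: date      # when the adjustment was made

def to_csv(adjustments):
    """Serialise adjustments to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(Adjustment)])
    writer.writeheader()
    for a in adjustments:
        writer.writerow(asdict(a))
    return buf.getvalue()

# Hypothetical entry
log = [Adjustment("2024-03-25", "outage", "replaced", 133, 202,
                  "telephony platform down", date(2024, 4, 2))]
csv_text = to_csv(log)
```

Storing the original value alongside the cleansed one means any adjustment can be reversed or re-examined later.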

Cleanse, but don't over-smooth

Real demand is genuinely variable — not every spike is an anomaly. Over-aggressive cleansing removes real signal and produces a forecast that is too smooth to capture genuine peaks. Cleanse identifiable, explainable anomalies; leave unexplained-but-plausible variation alone.
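One way to guard against over-smoothing is to require two conditions before cleansing: the value is statistically unusual, and there is a documented cause. The sketch below assumes a simple detector (relative deviation from the median above 25%) and a hand-maintained lookup of known causes. Both are hypothetical choices for the example.

```python
from statistics import median

def flag_unusual(values, threshold=0.25):
    """Indexes whose relative deviation from the median exceeds threshold."""
    med = median(values)
    if med == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - med) / med > threshold]

def select_for_cleansing(values, known_causes, threshold=0.25):
    """Only cleanse points that are unusual AND have a documented cause."""
    return [i for i in flag_unusual(values, threshold) if i in known_causes]

# Hypothetical daily volumes: index 4 is explained, index 7 is not
daily = [1000, 1040, 980, 1020, 2600, 1010, 990, 1300]
known_causes = {4: "billing error"}

unusual = flag_unusual(daily)
to_cleanse = select_for_cleansing(daily, known_causes)
```

Here the 1300 day is flagged as unusual but left alone, because nothing explains it and it may be genuine demand.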
