Every number you see on WaterWatch comes from one place: Thames Water's own Event Duration Monitoring sensor network. We don't estimate, extrapolate, or fill gaps. We don't apply severity scores or editorial weightings. When a sensor records a discharge, we show it. When it doesn't, we don't invent one. This sounds obvious — but in the current landscape of environmental data reporting, it's rarer than it should be.
This article explains exactly how our data pipeline works, what we do to verify it, and where the boundaries of our knowledge honestly lie. Trust shouldn't be asked for — it should be earned, and it should be specific.
Where the data actually comes from
Thames Water operates a network of Event Duration Monitoring (EDM) sensors at over 700 permitted combined sewer overflow (CSO) sites across the Thames region. These sensors detect when sewage begins discharging into waterways and when it stops. The same data is submitted to regulators including the Environment Agency: it's not a courtesy dataset, it's a regulatory record.
Thames Water exposes this data through a public API. WaterWatch polls that API every 15 minutes. We capture every transition — every Start and Stop event — and store it with the precise timestamp Thames Water recorded. Nothing more, nothing less.
WaterWatch reads directly from api.thameswater.co.uk/opendata/v2 — the same endpoint Thames Water makes available to the public and regulators. We are a presentation layer, not an interpretation layer.
The timezone problem we fixed
Early in WaterWatch's development, we discovered a significant data integrity issue — one that affects any system ingesting from Thames Water's API without careful handling. Thames Water returns timestamps in naive local time: a string like 2024-07-15T14:30:00 with no timezone suffix. During British Summer Time (BST, UTC+1), this means the actual UTC time is 13:30 — but naively treating the string as UTC stores it an hour late.
This bug — one that could easily go undetected — caused historical records to be stored with incorrect timestamps, and in some cases created phantom duplicate records. When we identified it, we didn't patch around it. We built a full remediation system that audited every single one of our 150,000+ historical records against Thames Water's API, corrected every affected timestamp, and verified the fixes programmatically.
The corrected ingestion pipeline has been live since March 2026. Every new record is converted from Thames Water's naive local time to UTC before storage. The conversion uses the IANA timezone database via JavaScript's Intl.DateTimeFormat API — meaning BST and GMT transitions are handled precisely, including edge cases at the DST boundary.
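The conversion described above can be sketched as follows. This is a minimal illustration, not WaterWatch's actual ingestion code; the function name is ours. The technique is standard: treat the naive wall-clock string as a first guess, ask Intl.DateTimeFormat (backed by the IANA database) what that instant reads as in Europe/London, and the difference between the two is the UTC offset to subtract.

```typescript
// Convert a naive Europe/London timestamp (no offset suffix, as Thames
// Water returns it) into a true UTC Date. Illustrative sketch only.
function naiveLondonToUtc(naive: string): Date {
  // First guess: pretend the wall-clock string is already UTC.
  const guess = new Date(naive + "Z");

  // Ask the IANA tz database (via Intl) how that instant reads in London.
  const dtf = new Intl.DateTimeFormat("en-GB", {
    timeZone: "Europe/London",
    year: "numeric", month: "2-digit", day: "2-digit",
    hour: "2-digit", minute: "2-digit", second: "2-digit",
    hour12: false,
  });
  const parts = Object.fromEntries(
    dtf.formatToParts(guess).map(p => [p.type, p.value])
  ) as Record<string, string>;

  // Rebuild the London rendering as a UTC instant. (% 24 guards against
  // engines that print midnight as "24" with hour12: false.)
  const asLondon = Date.UTC(
    Number(parts.year), Number(parts.month) - 1, Number(parts.day),
    Number(parts.hour) % 24, Number(parts.minute), Number(parts.second)
  );

  // The gap between the guess and its London rendering is the BST/GMT
  // offset; subtract it to recover the real UTC instant.
  const offsetMs = asLondon - guess.getTime();
  return new Date(guess.getTime() - offsetMs);
}
```

During BST, `naiveLondonToUtc("2024-07-15T14:30:00")` yields 13:30 UTC, which is exactly the error case described above; during GMT the input passes through unchanged.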
What we show and what we don't
A lot of environmental data platforms apply their own scoring, classification, or severity ratings on top of raw sensor data. WaterWatch doesn't. We surface exactly three things: whether a site is discharging right now, when it started and stopped discharging historically, and for how long each episode lasted. Duration is calculated from the raw timestamps — we don't weight it, cap it, or smooth it.
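To make "we don't weight it, cap it, or smooth it" concrete: duration really is just subtraction on the stored UTC timestamps. A sketch (function name ours, not WaterWatch's actual code):

```typescript
// Duration is the raw difference between a Stop and Start timestamp,
// both stored in UTC. No weighting, capping, or smoothing is applied.
function episodeDurationHours(startUtc: string, stopUtc: string): number {
  const ms = Date.parse(stopUtc) - Date.parse(startUtc);
  return ms / 3_600_000; // milliseconds per hour
}
```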
> We surface sensor limitations rather than hiding them. If a site was offline for three months, you'll see that gap. We won't fill it with estimates.
Where sensors are offline, whether because of maintenance, failure, or periods where Thames Water's API returned no data, WaterWatch shows those gaps explicitly as offline periods. They appear as distinct events in site histories. This is a deliberate choice: a gap in data is not the same as a gap in discharges, and we will not present them as equivalent.
How our pipeline works
Understanding the architecture helps you understand where our data can and can't be trusted. Here's the full chain:
| Stage | What happens | Frequency |
|---|---|---|
| TW API poll | AWS Lambda fetches live discharge status for all 700+ sites | Every 15 min |
| Transition detection | Compares current status to previous — records Start/Stop events when status changes | Every 15 min |
| Timezone conversion | All TW naive local timestamps converted to UTC before any storage | Every write |
| Live cache | Cloudflare D1 database stores current site statuses for fast map rendering | Continuous |
| Historical store | Supabase PostgreSQL holds all discharge events since April 2022 | Continuous |
| Integrity spot-check | Automated Lambda samples sites and compares DB records against live TW API | Continuous |
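The transition-detection stage in the table above can be sketched like this. The types and function names are hypothetical illustrations, not Thames Water's actual schema or WaterWatch's production code:

```typescript
// Hypothetical types for the transition-detection stage.
type SiteStatus = "discharging" | "not_discharging" | "offline";

interface TransitionEvent {
  siteId: string;
  kind: "Start" | "Stop";
  at: string; // UTC ISO timestamp of the poll that observed the change
}

// Compare the current poll against the previous one and emit Start/Stop
// events only where the status actually changed.
function detectTransitions(
  previous: Map<string, SiteStatus>,
  current: Map<string, SiteStatus>,
  now: string
): TransitionEvent[] {
  const events: TransitionEvent[] = [];
  for (const [siteId, status] of current) {
    const prev = previous.get(siteId);
    if (prev === undefined || prev === status) continue;
    // Offline transitions become explicit gap records, never synthetic
    // Start/Stop events: a data gap is not a discharge gap.
    if (status === "offline" || prev === "offline") continue;
    if (status === "discharging") {
      events.push({ siteId, kind: "Start", at: now });
    } else {
      events.push({ siteId, kind: "Stop", at: now });
    }
  }
  return events;
}
```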
The spot-check system runs continuously in the background, sampling a random set of sites on every run and comparing our stored records against what Thames Water's API currently returns. If anything is missing, we're alerted within the hour. This isn't a quarterly audit — it's permanent, automated vigilance.
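In outline, the spot-check works like this. All helper signatures here are hypothetical stand-ins for the Lambda's real internals; the shape of the check (sample, fetch both sides, diff, flag) is what the sketch demonstrates:

```typescript
// Sample a random subset of sites and report any events present in the
// live Thames Water API but missing from our store. Illustrative only.
async function spotCheck(
  siteIds: string[],
  sampleSize: number,
  fetchApiEvents: (id: string) => Promise<string[]>, // hypothetical helper
  fetchDbEvents: (id: string) => Promise<string[]>   // hypothetical helper
): Promise<string[]> {
  const discrepancies: string[] = [];
  // Crude random sample without replacement; fine for a spot-check.
  const sample = [...siteIds].sort(() => Math.random() - 0.5).slice(0, sampleSize);
  for (const siteId of sample) {
    const [apiEvents, dbEvents] = await Promise.all([
      fetchApiEvents(siteId),
      fetchDbEvents(siteId),
    ]);
    const stored = new Set(dbEvents);
    const missing = apiEvents.filter(ts => !stored.has(ts));
    if (missing.length > 0) {
      discrepancies.push(`${siteId}: ${missing.length} event(s) in TW API missing from store`);
    }
  }
  return discrepancies; // a non-empty result would trigger an alert
}
```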
What we don't claim
Honesty about limitations is as important as accuracy in what we show. Here is what WaterWatch explicitly does not claim:
| We don't claim | Why |
|---|---|
| Real-time to the second | Our ingestion runs every 15 minutes. A discharge that begins and ends within a 15-minute window may not be captured. |
| Complete pre-2022 history | Thames Water's API does not reliably serve data older than April 2022. We don't backfill with estimates. |
| Volume or severity | EDM sensors record duration, not volume. We don't convert hours into litres — that calculation requires flow rate data we don't have. |
| Environmental impact | We record what was discharged and for how long. Ecological impact depends on factors — dilution, river flow, species sensitivity — outside our data. |
| Sensor accuracy | We present what Thames Water's sensors record. If a sensor malfunctions, the data we show reflects that malfunction. |
Why this matters
In 2023, the Environmental Audit Committee found that EDM sensor coverage across English water companies was incomplete and that data quality varied significantly. Thames Water has since invested in expanding its network. WaterWatch's role is not to adjudicate on that quality — it's to present it faithfully and let users draw their own conclusions.
We built WaterWatch because the gap between raw data and public understanding of sewage discharges was too large. Media coverage often cited figures that were difficult to trace to primary sources. Academic analyses were paywalled or required specialist knowledge to interpret. The data was public in theory — but not in practice.
A platform that claims to close that gap has an obligation to be rigorously honest about its own limitations. That means showing sensor gaps, not hiding them. It means publishing our methodology, not summarising it. It means fixing data integrity bugs publicly and completely, not patching them quietly.
All discharge events shown on WaterWatch are derived directly from Thames Water's Event Duration Monitoring (EDM) API. Timestamps are stored in UTC. No editorial weighting, severity scoring, or volume estimation is applied. Offline periods are surfaced as distinct events. Historical data begins April 2022.
If you find a discrepancy — a site showing the wrong duration, a record that doesn't match Thames Water's own data portal, anything that looks wrong — email us. Data quality is not a launch feature. It's an ongoing responsibility.
How the architecture is built
WaterWatch runs across three layers: an AWS Lambda that ingests from Thames Water every 15 minutes, a Cloudflare Worker that serves the live map and site data, and a Supabase PostgreSQL database that stores the full historical record. These are deliberately separate concerns — the ingestion layer can fail without taking down the public interface, and the Worker can be updated without touching the historical store.
What happens if Thames Water's API goes down?
This is a question worth answering precisely rather than vaguely. The short answer: nothing breaks immediately, and full correction happens within 15 minutes of the API coming back up.
The live ingestion Lambda polls Thames Water every 15 minutes. If a poll fails — whether because Thames Water's API is down, rate-limiting requests, or returning malformed data — the Lambda logs the error and exits. The Cloudflare D1 cache retains the last known status of all 700+ sites. The public map continues to serve data.
- **0–15 minutes of outage:** No visible impact. The D1 cache serves last known status.
- **15+ minutes:** The status page timestamp shows data age, so users can see it's stale.
- **On API recovery:** The next Lambda invocation succeeds. Within 15 minutes, all sites are updated and any missed transitions are captured.
- **Full correction:** Complete within one ingestion cycle (≤15 minutes) of the API recovering.
If something goes wrong in our pipeline, how long does correction take?
Different failure modes have different correction windows. Here's the honest breakdown:
| Failure type | Detection time | Correction time |
|---|---|---|
| Lambda ingestion failure (single cycle) | Immediate — CloudWatch alarm | ≤15 min on recovery |
| Timestamp conversion bug (new code) | Minutes — spot-check Lambda samples sites continuously | Fix deployed + remediation run: typically 4–8 hours for all 547 sites |
| Missing records (ingestion gap) | Minutes — spot-check flags missing records against live TW API | Manual catch-up run: 1–2 hours |
| D1 cache stale (Worker cron failure) | 5 min — /status endpoint timestamp goes stale | Next successful cron: ≤5 min |
| Thames Water API data correction | Hours — spot-check detects mismatch | Next remediation run: 4–8 hours |
Sources: Thames Water EDM API (api.thameswater.co.uk/opendata/v2) · Environment Agency Event Duration Monitoring data · Water Industry Act 1991, Schedule 22 (EDM obligations)