Mean time to recovery (MTTR) measures how long it takes to restore service after a failed deployment. It captures your team’s ability to detect, respond to, and fix production incidents caused by bad deployments.

What Periscope tracks

Periscope calculates MTTR by finding failure-to-success deployment pairs for the same service. The dashboard shows:
  • Percentiles — p50, p75, and p95 recovery times (in minutes)
  • Average recovery time
  • Individual incidents showing the failed deployment, the recovery deployment, and the time between them

DORA benchmarks

| Level | Benchmark |
| --- | --- |
| Elite | Less than 1 hour |
| High | Less than 1 day |
| Medium | 1 day to 1 week |
| Low | More than 1 week |
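
As a rough illustration, the benchmark levels above can be expressed as a small classifier. This is a sketch, not Periscope's implementation: the function name `dora_level` is hypothetical, and treating exactly one week as Medium is an assumption, since the benchmarks do not say which level owns that boundary. The benchmarks also do not say which statistic (p50, average, or another) should be compared against them, so pass whichever recovery time you want to classify.

```python
from datetime import timedelta


def dora_level(recovery: timedelta) -> str:
    """Map a recovery time onto the DORA benchmark levels listed above.

    Assumption: a recovery time of exactly 1 week counts as Medium.
    """
    if recovery < timedelta(hours=1):
        return "Elite"
    if recovery < timedelta(days=1):
        return "High"
    if recovery <= timedelta(weeks=1):
        return "Medium"
    return "Low"


# Periscope reports recovery times in minutes, so convert before classifying.
print(dora_level(timedelta(minutes=45)))
print(dora_level(timedelta(minutes=3 * 24 * 60)))
```

The first call falls under one hour and the second falls between one day and one week.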

How it is calculated

For each service, Periscope orders deployments by completedAt and finds pairs where:
  1. A deployment has status: "failure"
  2. The next deployment for the same service has status: "success"
The recovery time is success_deployment.completedAt - failure_deployment.completedAt. Percentiles are computed across all incident pairs in the selected time range.
MTTR requires the service field in your deployment payload to correctly pair failures with recoveries. Without it, Periscope cannot determine which successful deployment “recovered” from which failure. MTTR also requires completedAt timestamps on both deployments.
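
The pairing rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Periscope's actual code: the helper names `parse_ts` and `recovery_minutes` are hypothetical, timestamps are assumed to be ISO 8601 strings, and the rule is read literally, so a failure only forms a pair when the very next deployment for the same service succeeded. Deployments without `service` or `completedAt` are skipped because they cannot be paired.

```python
from datetime import datetime
from itertools import groupby


def parse_ts(value: str) -> datetime:
    # Assumption: ISO 8601 timestamps; a trailing "Z" is normalised for older Pythons.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def recovery_minutes(deployments: list[dict]) -> list[float]:
    """Return one recovery time (in minutes) per failure-to-success pair."""
    # Only deployments with both a service and a completedAt can be paired.
    usable = [d for d in deployments if d.get("service") and d.get("completedAt")]
    # Order by service, then by completion time within each service.
    usable.sort(key=lambda d: (d["service"], parse_ts(d["completedAt"])))

    times = []
    for _, group in groupby(usable, key=lambda d: d["service"]):
        ordered = list(group)
        # Compare each deployment with the next one for the same service.
        for prev, nxt in zip(ordered, ordered[1:]):
            if prev.get("status") == "failure" and nxt.get("status") == "success":
                delta = parse_ts(nxt["completedAt"]) - parse_ts(prev["completedAt"])
                times.append(delta.total_seconds() / 60)
    return times


sample = [
    {"service": "api", "status": "failure", "completedAt": "2024-05-01T10:00:00Z"},
    {"service": "api", "status": "success", "completedAt": "2024-05-01T10:45:00Z"},
    {"service": "web", "status": "failure", "completedAt": "2024-05-01T09:00:00Z"},
    {"service": "web", "status": "failure", "completedAt": "2024-05-01T09:30:00Z"},
    {"service": "web", "status": "success", "completedAt": "2024-05-01T11:30:00Z"},
    {"status": "failure", "completedAt": "2024-05-01T12:00:00Z"},  # no service: skipped
]
print(recovery_minutes(sample))
```

In the sample data, the `web` service fails twice in a row. Under the literal reading used here, only the second failure pairs with the recovery, so the measured time starts at 09:30 rather than 09:00.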

Interpreting the data

  • Low MTTR means your team can quickly detect and fix production issues. This often comes from good monitoring, automated rollback, and small deployment batches.
  • High MTTR suggests slow incident detection, slow CI/CD pipelines for hotfixes, or complex rollback procedures.
  • A large gap between p50 and p95 means most incidents are resolved quickly but some take much longer; investigate those outliers.
  • Decreasing MTTR is a sign of improving operational maturity, even if change failure rate stays constant.
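
To make the p50 versus p95 comparison concrete, here is a minimal sketch that summarises a list of recovery times and picks out the slowest incidents. It assumes the nearest-rank percentile method, which may differ from the method Periscope uses, and the names `nearest_rank` and `tail_report` are hypothetical.

```python
import math


def nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumption; other methods interpolate)."""
    if not values:
        raise ValueError("at least one recovery time is required")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tail_report(recovery_minutes: list[float]) -> dict:
    """Summarise the gap between typical and worst-case recovery times."""
    p50 = nearest_rank(recovery_minutes, 50)
    p95 = nearest_rank(recovery_minutes, 95)
    return {
        "p50": p50,
        "p95": p95,
        # How many times slower the tail is than the typical incident.
        "ratio": p95 / p50 if p50 > 0 else float("inf"),
        # Incidents at or beyond p95 are the outliers worth investigating.
        "slowest": [m for m in recovery_minutes if m >= p95],
    }


report = tail_report([12, 15, 18, 20, 22, 25, 30, 35, 40, 480])
print(report)
```

In this made-up sample, nine incidents recover within 40 minutes and one takes 8 hours, so the single slow incident dominates p95 while p50 stays low.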

Reducing MTTR

  • Implement automated rollback on health check failures
  • Ensure your CI/CD pipeline supports fast hotfix deployments
  • Use feature flags to disable problematic features without redeploying
  • Invest in monitoring and alerting to reduce detection time
  • Keep deployments small so the blast radius is limited and the fix is easier to identify

MCP tool

Query MTTR from your AI coding assistant:
get_mttr(time_range: "30d")
Returns p50, p75, p95 (in minutes), average, incident count, and details for up to 10 recent incidents.

Change failure rate

Change failure rate (CFR) measures how often deployments fail. MTTR measures how quickly you recover from those failures.