Infrastructure that worked at 10 customers breaks at 100, and rebuilds at 1,000 are expensive. This checklist describes the SRE-grade discipline a growth-stage SaaS needs to scale without rewrites: an SLO-driven observability stack, infrastructure as code, security baselines aligned to CIS Benchmarks and NIST 800-53, and the operational cadences that turn fragility into reliability. Customers running this discipline reach 99.95% measured uptime within 60 days of onboarding and hold sev-1 MTTR under 15 minutes.
Enterprise customers write uptime into MSAs. SOC 2 reports document Availability as a Trust Services Criterion. Investors ask about platform reliability in due diligence. The cost of a serious outage is no longer just refunded MRR; it is renewal risk and a hole in the next round's narrative.
The Google SRE book opens with the saying "Hope is not a strategy," and the operating model it frames follows from that: manage error budgets, not heroics. The implication is operational: every customer-facing service has a Service Level Objective; every alert is tied to that SLO; every page is accountable to a measurable threshold. The discipline scales linearly; alert fatigue does not.
Define SLOs per customer-facing service, not per server. A typical pattern:
The error budget — what the SLO permits — drives operational decisions. When the budget is healthy, ship faster. When the budget is depleted, slow down and stabilize.
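The arithmetic behind an error budget is small enough to sketch. The example below is a minimal illustration, not a prescribed implementation: the service name, the 99.95% target, the 30-day window, and the "freeze when the budget is spent" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class Slo:
    name: str              # hypothetical service name
    target: float          # assumed target, e.g. 0.9995 for 99.95% availability
    window_days: int = 30  # assumed rolling window

    def budget_minutes(self) -> float:
        """Total unavailability the SLO permits over the window."""
        return (1.0 - self.target) * self.window_days * MINUTES_PER_DAY

    def remaining_fraction(self, bad_minutes: float) -> float:
        """Share of the budget still unspent (negative once overspent)."""
        return 1.0 - bad_minutes / self.budget_minutes()


def release_policy(slo: Slo, bad_minutes: float, freeze_below: float = 0.0) -> str:
    """Illustrative rule: keep shipping while budget remains, otherwise stabilize."""
    if slo.remaining_fraction(bad_minutes) > freeze_below:
        return "ship"
    return "stabilize"


if __name__ == "__main__":
    slo = Slo(name="checkout-api", target=0.9995)
    print(f"{slo.name}: {slo.budget_minutes():.1f} budget minutes per {slo.window_days} days")
    print(release_policy(slo, bad_minutes=5.0))
    print(release_policy(slo, bad_minutes=30.0))
```

Under these assumptions a 99.95% SLO leaves roughly 21.6 minutes of budget per 30 days, which is why a single long incident can consume the whole month.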
The exact tools matter less than the contract: traces, metrics, and logs in one queryable surface; alerts wired to error budgets; pages routed to on-call rotations tied to service ownership.
Application instrumentation through OpenTelemetry avoids vendor lock-in at the SDK layer. Distributed traces flow into Datadog APM or Grafana Tempo; metrics into Datadog Metrics or Prometheus; logs into Datadog Logs or Loki. The combination matters less than the discipline of consistent instrumentation across services.
Burn-rate alerts page on the velocity of error budget consumption, not raw error counts. A 14.4x burn rate sustained for 1 hour consumes 2% of a 30-day budget and, left unchecked, exhausts the entire budget in about 50 hours; that is page-the-on-call territory. A 3x burn rate over 6 hours is a notification, not a page.
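The burn-rate arithmetic can be checked with a few lines of code. This is a sketch under stated assumptions: a 30-day SLO window, the two thresholds named in the text (14.4x over 1 hour, 3x over 6 hours), and hypothetical function names. Real alerting rules usually also pair each long window with a shorter confirmation window, which is omitted here.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the window's budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return slo_window_days * 24 / rate


def classify(rate_1h: float, rate_6h: float) -> str:
    """Map the two burn rates to an action, using the thresholds from the text."""
    if rate_1h >= 14.4:
        return "page"
    if rate_6h >= 3.0:
        return "notify"
    return "ok"


if __name__ == "__main__":
    print(f"14.4x for 1h spends {budget_consumed(14.4, 1):.1%} of the budget")
    print(f"14.4x exhausts the budget in {hours_to_exhaustion(14.4):.0f} hours")
    print(f"3x for 6h spends {budget_consumed(3.0, 6):.1%} of the budget")
```

Expressing thresholds as "fraction of budget spent per window" makes them portable across services with different SLO targets.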
Every customer-facing service has a dashboard with: traffic, error rate, latency p50/p95/p99, saturation. The "four golden signals" from the Google SRE book remain the canonical starting set.
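As an illustration of what such a dashboard computes, the sketch below derives the four signals from a batch of request records. The `Request` shape, the choice of HTTP 5xx as "error", the nearest-rank percentile method, and passing saturation in as an externally measured gauge are all assumptions made for the example; a production stack would compute these in the metrics backend instead.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    n = len(sorted_values)
    rank = max(1, -(-p * n // 100))  # integer ceiling of p * n / 100
    return sorted_values[rank - 1]


def golden_signals(requests: List[Request], window_seconds: float,
                   saturation: float) -> Dict[str, float]:
    """Summarize traffic, errors, latency, and saturation for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "saturation": saturation,  # e.g. CPU or connection-pool utilization
    }
```

Tail percentiles are reported alongside the median because a healthy p50 can hide a p99 that is already breaching the SLO.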
PagerDuty rotations are tied to service ownership through tag-based routing. The on-call SRE acknowledges within 5 minutes; sev-1 incidents have a documented runbook linked from the alert; escalation chains are documented and tested.
The line between professional infrastructure operations and ad-hoc click-ops is whether the production environment can be reconstructed from version control.
Terraform with remote state in S3 plus DynamoDB locking is the operational baseline. Every AWS resource — VPC, subnet, security group, IAM role, RDS instance, S3 bucket, ALB, ASG, ECS service — is defined in code and reviewed through pull requests.
Reusable modules are organized per resource family (network, data, compute, security) and combined in environment-specific compositions. The same module deploys to dev, staging, and production with environment-specific variable overrides, never as separate copies of similar code.
Production state is isolated from non-production, and state locking is enforced. Drift detection runs nightly via terraform plan in CI; non-zero drift opens a ticket. Manual changes in the AWS console after week two of an engagement should be the exception, not the norm.
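A nightly drift check can be a thin wrapper around the plan command. The sketch below relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The function names and the working-directory argument are hypothetical, and opening the ticket is left to whatever CI step consumes the result.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Translate the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "in_sync"  # live infrastructure matches the code
    if code == 2:
        return "drift"    # plan succeeded and found differences
    return "error"        # exit code 1 (or anything unexpected): plan failed


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in `workdir` and classify the outcome."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)


if __name__ == "__main__":
    import sys
    status = check_drift(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(status)
    # Non-zero exit lets the CI job fail and trigger the ticketing step.
    sys.exit(0 if status == "in_sync" else 1)
```

Treating "error" separately from "drift" matters in practice: an expired credential should not be filed as a drift ticket.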
Security is encoded in the same Terraform modules that provision the resources, not added afterward.
The CIS Benchmarks for AWS provide concrete, testable hardening controls per service. NIST 800-53 defines the broader control catalog that underpins FedRAMP baselines and is commonly mapped to the SOC 2 Trust Services Criteria. Mapping CIS to NIST and to the AWS Well-Architected Framework's Security Pillar gives one set of controls that satisfies multiple compliance regimes.
Controls evidenced in code are continuously verified by the same code. AWS Config conformance packs (or third-party GRC platforms) generate the audit evidence automatically. Quarterly threat-model review with engineering leads catches design-level gaps the rule engines miss.
Resilience patterns must match workload tier. Stateless customer-facing services run multi-AZ active-active with automatic failover; stateful services use cross-region read replicas and documented promotion runbooks.
Quarterly tabletops and annual full DR exercises produce dated evidence; per-component functional tests (read-replica promotion, ASG cross-region, S3 cross-region restore) run semi-annually. Without dated evidence, the DR program is a PDF.
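Dated evidence is easy to check mechanically. The sketch below flags exercises whose last recorded run is older than its cadence allows; the exercise names and the day counts used to approximate "quarterly", "semi-annually", and "annually" are assumptions for illustration, and a missing record is treated as stale.

```python
from datetime import date, timedelta
from typing import Dict, List

# Assumed maximum age per exercise type, approximating the cadences in the text.
MAX_AGE = {
    "tabletop": timedelta(days=92),           # quarterly
    "component_test": timedelta(days=183),    # semi-annual
    "full_dr_exercise": timedelta(days=366),  # annual
}


def stale_exercises(last_run: Dict[str, date], today: date) -> List[str]:
    """Return exercise types with no evidence or evidence older than allowed."""
    stale = []
    for kind, max_age in MAX_AGE.items():
        ran = last_run.get(kind)
        if ran is None or today - ran > max_age:
            stale.append(kind)
    return sorted(stale)
```

For example, with a tabletop run on 2025-05-01, a component test on 2024-11-01, and no full exercise on record, a check dated 2025-06-30 would report the component test and the full DR exercise as stale.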
Infrastructure rigor at growth-stage SaaS scale is a discipline most companies do not need to carry as headcount until they are well past Series B. We bring 10+ years of SRE experience across AWS, Azure, GCP, and hybrid environments, and we hand off a Terraform-managed environment with documented runbooks when you do hire your own team. The same Terraform modules that provision the infrastructure also generate the SOC 2, HIPAA, and ISO 27001 evidence your auditor needs — because our roots are in audit and compliance, not just engineering.