Matt S
Platform engineer at Fortem · 8 min read

What Does DevOps Automation Miss Beyond CI/CD?

Your CI/CD pipeline builds, tests, and deploys. Your on-call rotation handles incidents. What happens between deploy and incident? Who keeps track of the environments? Who keeps the AWS bill from spiraling? Who gives developers the power to do their jobs without pinging the platform team? This is the ops automation gap — and every team with 10+ environments eventually discovers it.

TL;DR
  • CI/CD automates deployment, not operations. Between deploy and incident, there's a gap no pipeline touches.
  • Five things every team eventually discovers they need: scheduling, self-service, cost tracking, cloning, and orphan detection.
  • A 2-person platform team spends 30–50% of their week on these five gaps: Slack messages, Excel sheets, one-off Terraform modules.
  • Building all five from scratch takes 16–40 weeks. Buying a platform fills them in one onboarding.

CI/CD is deployment automation. It's not operations.

CI/CD automates build, test, and deploy — not the 95% of a service lifecycle that follows: scheduling, self-service, cost tracking, cloning, and orphan detection.

CI/CD covers the deploy pipeline: build, test, deploy, rollback. When the pipeline is green, a new version is running. When it's red, on-call gets paged. This model works — every modern team has it.

What it doesn't cover: environment scheduling, developer self-service, cost tracking, cloning, waste detection. These aren't bugs in CI/CD. They're operations problems that live in a different category. One is a pipeline. The other is a control plane.

[Figure: The ops automation gap. CI/CD pipeline → Deploy → ? → Operations]
Key insight

CI/CD automated the deploy button. Nobody automated what happens after — the day-to-day of managing environments at scale. The deploy pipeline is solved. Operations is still manual.

CI/CD pipelines automate the build, test, and deployment phases of the software delivery lifecycle. Ongoing operational responsibilities — environment management, cost visibility, and access control — are outside the scope of a deployment pipeline.

— Paraphrased from AWS Well-Architected DevOps Guidance

Gap 1: Environment scheduling

Non-prod ECS environments run 168 hrs/week but are used ~55; scheduling them off nights and weekends cuts compute spend by 60–70%, the largest single cost lever available.

There are 168 hours in a week. Your team works 40–50 of them. Your dev and staging environments run all 168 — billing by the second through nights, weekends, and holidays. Nobody needs a dev environment at 3am on Sunday.
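The arithmetic behind that estimate fits in a few lines. This is a minimal sketch, assuming the ~55 used hours per week quoted above and that compute is billed only while the environment runs; `idle_fraction` is a hypothetical helper name, not part of any tool.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def idle_fraction(used_hours: float, total_hours: float = HOURS_PER_WEEK) -> float:
    """Share of the week an always-on environment runs without being used."""
    if not 0 <= used_hours <= total_hours:
        raise ValueError("used_hours must be between 0 and total_hours")
    return (total_hours - used_hours) / total_hours


print(f"{idle_fraction(55):.0%}")  # prints 67%
```

Actual savings depend on what keeps billing while tasks are stopped (load balancers, databases, storage), so treat the result as an upper bound on the compute portion only.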

The DIY build: Lambda functions + EventBridge rules per environment. Each needs a separate cron, a per-timezone configuration, and an override mechanism for ad-hoc work. At 30 environments, that's 30 Lambda functions, each one a deployment artifact to maintain, monitor, and debug when it silently stops working.

Platform team cost: 2–4 weeks to build, ongoing maintenance as environments are added. The tools exist — EventBridge and Lambda are free or near-free — but the integration and maintenance burden scales linearly with fleet size. The full mechanics of ECS environment scheduling show why this breaks past ten environments.
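To make the three moving parts concrete (schedule, timezone, override), here is a rough sketch of the decision each scheduler has to make. Everything in it is an assumption for illustration: the `Schedule` fields, the 08:00–19:00 weekday window, and the fixed UTC offset, which ignores daylight saving time. A real scheduler would use IANA time zones.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Schedule:
    utc_offset_hours: int                      # fixed offset; ignores DST
    start_hour: int = 8                        # local hour the environment comes up
    stop_hour: int = 19                        # local hour it shuts down
    override_until: Optional[datetime] = None  # keep running until this UTC instant


def should_run(schedule: Schedule, now_utc: datetime) -> bool:
    """Return True if the environment should be up at now_utc."""
    # An active override wins over the regular schedule.
    if schedule.override_until is not None and now_utc < schedule.override_until:
        return True
    local = now_utc.astimezone(timezone(timedelta(hours=schedule.utc_offset_hours)))
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and schedule.start_hour <= local.hour < schedule.stop_hour


sunday_3am_local = datetime(2024, 1, 7, 2, 0, tzinfo=timezone.utc)  # 03:00 at UTC+1
print(should_run(Schedule(utc_offset_hours=1), sunday_3am_local))   # prints False
```

The logic itself is small. The maintenance cost comes from running one copy of it per environment and noticing when a copy stops firing.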

Gap 2: Developer self-service

Developer self-service means scoped per-environment RBAC — restart staging, tail logs, flip a feature flag — without AWS Console access or a platform-team Slack ping.

“Can you restart staging?” — the Slack message every platform engineer receives at least twice a week. Developers can deploy to production via CI. They can't restart staging without you. The AWS Console is too dangerous. The IAM policy for “restart this one service” is too granular to hand out.

The DIY build: a web UI with RBAC per environment, backed by AWS API Gateway and scoped IAM policies. You need role-per-env mappings, a way to audit who did what, and a mechanism to revoke access. The IAM part alone — managing policies per service per environment — is the reason most teams give up and hand out broad AWS Console access.
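To show what "policies per service per environment" means in practice, here is a sketch of one such policy, generated in Python. The function name, region, account ID, cluster, and service names are all hypothetical; the ARN layout and the `ecs:UpdateService` action (a forced new deployment is the usual way to restart an ECS service) are standard AWS.

```python
import json


def restart_policy(region: str, account_id: str, cluster: str, service: str) -> dict:
    """IAM policy document that allows restarting exactly one ECS service."""
    service_arn = f"arn:aws:ecs:{region}:{account_id}:service/{cluster}/{service}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RestartOneService",
                "Effect": "Allow",
                # UpdateService with forceNewDeployment restarts the service's tasks.
                "Action": ["ecs:UpdateService", "ecs:DescribeServices"],
                "Resource": service_arn,
            }
        ],
    }


# Hypothetical values, for illustration only.
print(json.dumps(restart_policy("eu-west-1", "123456789012", "staging", "api"), indent=2))
```

One document like this per service per environment is manageable at five services. At 18 services across 30 environments it is several hundred policies to generate, attach, audit, and revoke.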

Platform team cost: 4–8 weeks to build a minimum viable self-service portal. Until it exists, the team spends 2–5 hours per week on restart and status requests that the portal would handle. The pattern of developers blocked from restarting staging repeats on every team that hasn't solved scoped access.

Gap 3: Fleet visibility and cost tracking

AWS Cost Explorer shows one aggregate Fargate line item; per-environment cost attribution requires tagging every resource, waiting 24 hrs for propagation, and rebuilding monthly.

AWS Cost Explorer shows aggregate spend. It doesn't show per-environment cost. You can't answer “how much does staging cost vs dev this month?” without a spreadsheet and 24 hours of tag propagation delay.

The DIY build: tag every resource with an Environment key. Activate cost allocation tags in Billing. Wait 24 hours for tags to propagate. Export a CSV. Build a spreadsheet. Repeat monthly. This works, for a while. As the fleet grows, the spreadsheet gets abandoned, the CTO asks “why is AWS up 30% this quarter?” and nobody can answer without a day of Cost Explorer archaeology.
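The spreadsheet step amounts to a group-by on the tag column. A minimal sketch, assuming a simplified CSV with hypothetical `Environment` and `Cost` columns; a real billing export has a different, wider layout.

```python
import csv
import io
from collections import defaultdict


def cost_by_environment(csv_text: str, tag_column: str = "Environment",
                        cost_column: str = "Cost") -> dict:
    """Sum cost per environment tag value; rows without a tag are grouped together."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        env = (row.get(tag_column) or "").strip() or "untagged"
        totals[env] += float(row[cost_column])
    return {env: round(total, 2) for env, total in totals.items()}


# Made-up sample data, for illustration only.
SAMPLE = """Service,Environment,Cost
ECS,staging,120.50
RDS,staging,80.00
ECS,dev,60.25
ECS,,15.00
"""

print(cost_by_environment(SAMPLE))
```

The `untagged` bucket is the part worth watching: every resource created without the tag lands there, and that bucket tends to grow as the fleet does.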

Key insight

By the time you see the numbers in the spreadsheet, the money is spent. Real-time per-environment cost visibility — not aggregate billing data — is what lets platform teams react before the quarter ends.

Gap 4: Environment cloning

Cloning a production environment manually — ALB, RDS, 15 services, SSM params — takes 12 steps and 2–4 hours; a template-based clone reduces that to under 30 seconds.

“QA needs an isolated copy of staging to test the compliance flow.” This request lands in the platform team's Slack every few weeks. The environment has 18 services, 4 databases, networking rules, and a dozen environment variables. Building a copy means: write a new Terraform module, override variables, pray you didn't miss a dependency.

The DIY build: Terraform module that parameterizes every service, database, and network rule from a source environment. This is a 300–400 line module that needs to stay in sync with the source. Every time the source changes, the clone module drifts.
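One way to picture the drift problem: a clone derived from the source definition at clone time cannot fall out of sync, while a hand-maintained copy can. The sketch below illustrates the idea in Python rather than Terraform. The dictionary layout (`name`, `services`, `variables`) and the `clone_environment` helper are assumptions made up for this example.

```python
import copy
from typing import Optional


def clone_environment(source: dict, new_name: str,
                      overrides: Optional[dict] = None) -> dict:
    """Derive a clone definition from a source environment definition."""
    clone = copy.deepcopy(source)  # never mutate the source definition
    old_name = source["name"]
    clone["name"] = new_name
    # Re-prefix service names so the clone cannot collide with the source.
    clone["services"] = [
        {**svc, "name": svc["name"].replace(old_name, new_name, 1)}
        for svc in clone["services"]
    ]
    # Clone-specific variables win over inherited ones.
    clone["variables"] = {**clone.get("variables", {}), **(overrides or {})}
    return clone


staging = {
    "name": "staging",
    "services": [{"name": "staging-api", "cpu": 512}],
    "variables": {"LOG_LEVEL": "info"},
}
qa = clone_environment(staging, "qa-compliance", {"LOG_LEVEL": "debug"})
print(qa["services"][0]["name"])  # prints qa-compliance-api
```

Databases, networking rules, and secrets need the same treatment and are where most of the real work sits; this only shows the shape of the approach.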

Platform team cost: 4–8 hours per clone request. At 3 clones per month, that's 12–24 hours of engineering time. Plus the cost of the days QA waits for the environment to be ready.

Gap 5: Orphaned environment detection

At 15 ECS environments, 2–3 are typically orphaned — no recent deploys, no active owner — each billing $200–400/mo for compute that serves zero production traffic.

Every team has them. An environment was spun up for a demo six months ago. A PR preview for a feature branch that was merged and abandoned. A hackathon project that shipped and was forgotten. Nobody deploys to these environments. Nobody owns them. They bill — quietly, every month.

The DIY build: pull the last deployment timestamp per environment. Cross-reference with the team directory. Environments with no deploy in 30+ days and no active owner go on a review list. The platform team reviews, confirms abandonment, and deletes the infrastructure.
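The review-list rule above can be sketched as a simple filter. The `Environment` record, the `review_list` helper, and the sample data are hypothetical; where the deploy timestamps and the staff list come from is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Set


@dataclass
class Environment:
    name: str
    last_deploy: datetime
    owner: Optional[str]  # None when nobody is recorded as owner


def review_list(envs: List[Environment], active_staff: Set[str], now: datetime,
                max_idle_days: int = 30) -> List[str]:
    """Names of environments with no recent deploy and no active owner."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        env.name
        for env in envs
        if env.last_deploy < cutoff and env.owner not in active_staff
    ]


now = datetime(2024, 6, 1)
fleet = [
    Environment("demo", datetime(2023, 12, 1), "alice"),      # owner has left
    Environment("staging", datetime(2024, 5, 30), "bob"),     # recently deployed
    Environment("hackathon", datetime(2024, 1, 15), None),    # no owner recorded
    Environment("old-but-owned", datetime(2024, 1, 1), "bob"),
]
print(review_list(fleet, active_staff={"bob"}, now=now))  # prints ['demo', 'hackathon']
```

The output is a list for a human to review, not a delete list. Confirming abandonment stays a manual step, as described above.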

Key insight

In a fleet of 30+ environments, most teams find 2–5 orphaned environments when they look seriously. At $200–400/month each, that's $400–$2,000/month — or $4,800–$24,000/year — for compute serving zero requests. A one-time audit catches it. Without it, the environments keep billing forever.
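Those ranges are just the low and high ends multiplied through. A quick check, using only the figures quoted above (`orphan_cost_range` is a hypothetical helper):

```python
def orphan_cost_range(count_range, monthly_cost_range):
    """Return ((monthly_low, monthly_high), (yearly_low, yearly_high)) in dollars."""
    low = count_range[0] * monthly_cost_range[0]
    high = count_range[1] * monthly_cost_range[1]
    return (low, high), (low * 12, high * 12)


monthly, yearly = orphan_cost_range((2, 5), (200, 400))
print(monthly, yearly)  # prints (400, 2000) (4800, 24000)
```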

Building vs buying the ops layer

Building all five ops gaps in-house with AWS-native tools takes 16–40 engineer-weeks and adds 20–30% annual maintenance overhead as the fleet grows beyond 10 environments.

Gap             DIY weeks   Fortem
Scheduling      2–4         Built-in, per-timezone
Self-service    4–8         RBAC by environment
Cost tracking   2–3         Live per-env cost
Cloning         3–6         Clone in a few clicks
Orphans         1–2         Last deploy + owner visible

16–40 weeks · $90–220k in labor

Platform engineer at $180–220k/yr loaded (~$3,500–4,200/week, Glassdoor + Levels.fyi, 2026). Maintenance adds 20–30% annually as the fleet grows.

A fair objection: with AI coding tools — Claude, Copilot, Codex — the build time drops. A skilled engineer using current LLMs could ship these five tools in 6–16 weeks, not 16–40. That changes the cost equation. What it doesn't change: the maintenance burden, the integration surface (IAM, EventBridge, Cost Explorer APIs), and the fact that you're building internal tools while your competitors ship product. The gap isn't only about time. It's about focus.

Key insight

A platform team spending 30–50% of their week on these five gaps isn't building product. They're maintaining internal tools — the same tools every team builds, differently, from scratch. CI/CD automated the deploy button. Ops needs the same treatment.

Fortem ships all five ops gaps — scheduling, self-service, cost tracking, cloning, and orphan detection — as a single control plane for ECS Fargate fleets. Seven-day onboarding, no Terraform changes required.

Book a 20-min call →

If you read this, you might also want to know

How do I know which of the five gaps is costing us the most?

Start with scheduling: it's the biggest cost lever for most teams. Run the Fortem Fleet Audit (or the AWS CLI version in this article) to see how many hours your dev/staging environments ran last week vs how many were used. If that number is above 100 hrs/week per environment, scheduling alone will pay back any tooling cost.

Can a single platform engineer build all five gaps in-house?

Each gap takes roughly 2–4 weeks to build and 2–4 hours/month to maintain. Five gaps = 10–20 weeks of build time and 10–20 hrs/month ongoing. That's a significant chunk of one engineer's capacity — before accounting for on-call interruptions and feature work.

Does this apply if we're on EC2 launch type instead of Fargate?

Yes — the five gaps (scheduling, self-service, cost tracking, cloning, orphan detection) are the same on EC2. Scheduling uses Auto Scaling desired count instead of ECS service desired count. Cost tracking is harder on EC2 because instances aren't per-environment. Fargate makes per-environment cost math cleaner.
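The difference in the scheduling call can be sketched as follows. The helper only builds the request and does not call AWS, so it runs without credentials. `stop_request` and the `target` keys are hypothetical; `update_service` and `update_auto_scaling_group` are the boto3 method names for the ECS and Auto Scaling clients.

```python
def stop_request(launch_type: str, target: dict) -> tuple:
    """Return (boto3 client name, method name, kwargs) that scales an environment to zero."""
    if launch_type == "FARGATE":
        return ("ecs", "update_service", {
            "cluster": target["cluster"],
            "service": target["service"],
            "desiredCount": 0,
        })
    if launch_type == "EC2":
        return ("autoscaling", "update_auto_scaling_group", {
            "AutoScalingGroupName": target["asg_name"],
            "MinSize": 0,  # desired capacity cannot go below the group minimum
            "DesiredCapacity": 0,
        })
    raise ValueError(f"unsupported launch type: {launch_type}")


client, method, kwargs = stop_request("EC2", {"asg_name": "dev-ecs-asg"})
print(client, method, kwargs)
```

On EC2 the ECS services usually need scaling down as well, otherwise the scheduler keeps trying to place tasks on instances that no longer exist. This sketch covers only the instance side.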


We automated the ops layer.

Fortem fills all five gaps — scheduling, self-service, cost tracking, cloning, and orphan detection — in a 7-day onboarding. No Terraform changes. No internal tool maintenance.
