Module 7: Strategic Roadmapping - The 12-Month Plan
Duration: 60 minutes
Learning Objectives
By the end of this module, you will be able to:
- Translate technical capacity metrics into executive-level business language
- Build a 12-month capacity roadmap with quarterly milestones
- Forecast infrastructure budgets using Pod Velocity models (Module 2)
- Recommend Reserved Instance (RI) or Savings Plan commitments with risk analysis
- Present a 3-minute capacity plan pitch to leadership
- Identify capacity-related risks and mitigation strategies
From Metrics to Strategy
Throughout this workshop, you’ve worked with technical metrics:
- Pod Velocity (deployments/month)
- maxPods configuration (250 vs 500)
- etcd database size (GB)
- CPU request vs usage gaps (%)
- HPA scaling patterns
Executives don’t care about these details. They care about:
- Cost: "How much will infrastructure cost next year?"
- Risk: "What happens if we under-provision? Over-provision?"
- Growth: "Can we handle 2× customer growth in Q2?"
- Commitments: "Should we buy Reserved Instances?"
Your job as a platform engineer is to translate capacity planning into strategic recommendations that leadership can act on.
The Translation Framework
| Technical Metric | Business Metric | Executive Question |
|---|---|---|
| Pod Velocity: +50 services/quarter | Growth Rate: 150% service count increase/year | "Can our platform scale with product roadmap?" |
| Current: 42% CPU allocated, 12% used | Waste: 30% unused capacity = $18K/month | "Where can we cut costs without risk?" |
| etcd at 3.8GB, limit 8GB | Cluster at 48% of maximum safe size | "When do we need to split the cluster?" |
| HPA scaled to max during Black Friday | Emergency capacity cost: $800/hour | "Should we pre-provision for peak events?" |
| Average CPU request: 200m per pod | Standard service footprint: 0.2 cores | "What’s our unit economics (cost per service)?" |
The Golden Rule of Executive Communication: Lead with the recommendation, then justify with data.
❌ Bad: "etcd is at 3.8GB and the limit is 8GB, so we’re at 48% capacity, and based on Pod Velocity of 50 services/quarter…"
✅ Good: "We need to plan a cluster split in Q3 2026 (cost: $15K). Here’s why: [show data]"
Executives make decisions. Give them options, not information dumps.
The 12-Month Strategic Roadmap Template
A capacity roadmap answers five questions:
- Where are we today? (Module 1 baseline)
- Where are we going? (Module 2 forecasting)
- What could go wrong? (Risk analysis)
- What actions do we take? (Quarterly milestones)
- What does it cost? (Budget forecast)
Part 1: Current State (Baseline)
From Module 1, you should have:
Baseline (as of April 2026):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Clusters: 3 (prod-east, prod-west, dev)
Worker Nodes: 24 total (8 per prod cluster, 8 dev)
Total Pods: 1,247
CPU Allocated: 102 cores (42% of 243 allocatable)
CPU Used: 18.3 cores (7.6% utilization)
Waste: 83.7 cores unused (34% of allocatable)
Monthly Cost: $50,000 (all infrastructure)
Cost per Core: $50,000 / 243 = $206/core/month
Waste Cost: 83.7 cores × $206 = $17,242/month
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
If you don’t have exact numbers: Use order-of-magnitude estimates. Precision creates false confidence. Round to the nearest 10% for strategic planning.
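The baseline arithmetic above can be reproduced with a short script. This is a minimal sketch, not part of the workshop tooling: the `Baseline` class and its field names are hypothetical, and the inputs are the example figures from the block above. Because it does not round the cost per core to $206 before multiplying, the waste cost comes out near $17.2K instead of exactly $17,242.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """Hypothetical container for the Part 1 baseline figures."""
    allocatable_cores: float  # total allocatable CPU across worker nodes
    allocated_cores: float    # sum of CPU requests
    used_cores: float         # observed CPU usage
    monthly_cost: float       # all infrastructure, USD per month

    @property
    def cost_per_core(self) -> float:
        return self.monthly_cost / self.allocatable_cores

    @property
    def waste_cores(self) -> float:
        # Requested but unused CPU
        return self.allocated_cores - self.used_cores

    @property
    def waste_cost(self) -> float:
        return self.waste_cores * self.cost_per_core


baseline = Baseline(allocatable_cores=243, allocated_cores=102,
                    used_cores=18.3, monthly_cost=50_000)

print(f"Allocated:     {baseline.allocated_cores / baseline.allocatable_cores:.0%}")
print(f"Cost per core: ${baseline.cost_per_core:,.0f}/month")
print(f"Waste:         {baseline.waste_cores:.1f} cores, "
      f"${baseline.waste_cost:,.0f}/month")
```

Swapping in your own four inputs is enough to regenerate the baseline block for a different cluster.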
Part 2: Growth Forecast (Pod Velocity Model)
From Module 2, apply the Pod Velocity formula:
Pod Velocity (30-day observation):
- New deployments: +15 services in March 2026
- Average replicas: 3 per service
- Average CPU request: 200m per pod
Quarterly Projection (Q2 2026):
- New services: 15/month × 3 months = 45 services
- New pods: 45 services × 3 replicas = 135 pods
- New CPU demand: 135 pods × 0.2 cores = 27 cores
Year-End Projection (Q2-Q4 2026):
- 9 months × 15 services/month = 135 new services
- 135 services × 3 replicas = 405 new pods
- 405 pods × 0.2 cores = 81 cores
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current: 102 cores allocated
Year-End: 102 + 81 = 183 cores needed
Growth: +79% in 9 months
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Caveats in the Model: This assumes:
* Linear growth (no seasonal spikes)
* No major product launches
* Average service size remains 200m CPU
Adjust for known events. Forecasting is not fortune-telling. Build in scenario planning (best case, worst case, likely case).
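The projection and the scenario planning it calls for can be sketched in a few lines. This is an illustration under stated assumptions: `project_cores` is a hypothetical helper, the "likely" rate of 15 services/month comes from the example above, the "worst" rate of 20 is the faster-growth case used later in the pitch, and the "best" rate of 10 is an invented placeholder.

```python
def project_cores(services_per_month: float, months: int,
                  replicas: int = 3, cpu_per_pod: float = 0.2) -> float:
    """Linear Pod Velocity projection: new CPU cores requested after `months`."""
    new_services = services_per_month * months
    new_pods = new_services * replicas
    return new_pods * cpu_per_pod


CURRENT_ALLOCATED = 102  # cores allocated today (Part 1 baseline)

# Services added per month in each scenario; "best" is a made-up placeholder.
scenarios = {"best": 10, "likely": 15, "worst": 20}

for name, rate in scenarios.items():
    added = project_cores(rate, months=9)
    total = CURRENT_ALLOCATED + added
    growth = added / CURRENT_ALLOCATED
    print(f"{name:>6}: +{added:.0f} cores -> {total:.0f} total ({growth:+.0%})")
```

The "likely" row reproduces the 81-core, 183-core, +79% figures above; the other two rows give the range to present alongside it.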
Part 3: Quarterly Milestones
Break the year into actionable quarters:
| Quarter | Key Actions | Capacity Additions | Cost Impact |
|---|---|---|---|
| Q2 2026 (Apr-Jun) | Right-size workloads (Module 3) | +0 nodes (optimization phase) | -$5,000/month (waste reduction) |
| Q3 2026 (Jul-Sep) | Add 4 worker nodes to prod-east | +4 nodes (prod-east) | +$3,200/month (4 nodes) |
| Q4 2026 (Oct-Dec) | Split prod-east into prod-east-1 and prod-east-2 | +12 nodes (split + Black Friday buffer) | +$6,000/month baseline |
| Q1 2027 (Jan-Mar) | Scale down Black Friday capacity | +0 nodes (back to baseline) | +$0 (net neutral) |
|
How to Build Milestones:
|
Part 4: Budget Forecast
Translate node additions into dollars:
2026 Infrastructure Budget Forecast
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Baseline (Current): $50,000/month × 12 = $600,000/year
Additions by Quarter:
Q2: -$5,000/month (optimization) × 3 months = -$15,000
Q3: +$3,200/month (4 nodes) × 3 months = +$9,600
Q4: +$6,000/month (12 nodes) × 3 months = +$18,000
Q4: +$12,800 (Black Friday temporary capacity, 8 nodes × $200/hour × 8 hours)
Subtotal: $600,000 - $15,000 + $9,600 + $18,000 + $12,800 = $625,400
Contingency (10%): +$62,540
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total 2026 Budget: $687,940
Year-over-Year: +14.7% vs 2025
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Always Include Contingency: 10-20% buffer for:
* Unexpected product launches
* Higher-than-forecasted growth
* Emergency capacity (Module 6 incidents)
* Price increases from cloud provider
Under-budgeting is worse than over-budgeting. If you ask for $600K and spend $700K, you’ve lost credibility. If you ask for $700K and spend $650K, you’re a hero.
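The budget roll-up can be checked with a small script. This is a sketch under the same assumptions as the forecast above: the function and variable names are hypothetical, and each quarterly change is counted for three months only, exactly as in the forecast. If a change persists into later quarters (for example, Q2 savings that continue through December), raise its month count accordingly.

```python
BASELINE_MONTHLY = 50_000  # USD per month, from the Part 1 baseline

# (label, monthly delta in USD, months counted), mirroring the forecast above
recurring = [
    ("Q2 optimization", -5_000, 3),
    ("Q3 +4 nodes", 3_200, 3),
    ("Q4 +12 nodes", 6_000, 3),
]

# (label, one-time cost in USD): 8 nodes x $200/hour x 8 hours
one_time = [("Black Friday temporary capacity", 8 * 200 * 8)]


def annual_budget(baseline_monthly, recurring, one_time, contingency=0.10):
    """Return (subtotal, contingency reserve, total) for a 12-month budget."""
    subtotal = baseline_monthly * 12
    subtotal += sum(delta * months for _, delta, months in recurring)
    subtotal += sum(cost for _, cost in one_time)
    reserve = subtotal * contingency
    return subtotal, reserve, subtotal + reserve


subtotal, reserve, total = annual_budget(BASELINE_MONTHLY, recurring, one_time)
print(f"Subtotal:    ${subtotal:,.0f}")
print(f"Contingency: ${reserve:,.0f}")
print(f"Total:       ${total:,.0f}")
```

With the inputs shown, this reproduces the $625,400 subtotal and the $687,940 total. Passing `contingency=0.20` shows the upper end of the recommended buffer.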
Part 5: Commitment Strategy (Reserved Instances / Savings Plans)
Cloud providers offer discounts for capacity commitments:
| Commitment Type | Discount | Lock-In Period | Best For |
|---|---|---|---|
| On-Demand | 0% (baseline) | None | Temporary capacity, testing |
| 1-Year Reserved | 30-40% | 1 year | Stable baseline capacity |
| 3-Year Reserved | 50-60% | 3 years | Well-established platforms |
| Savings Plans | 20-30% | 1-3 years | Flexible (can change instance types) |
Commitment Recommendation:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Baseline (24 nodes × $200/month): $4,800/month on-demand
Proposed:
- 16 nodes (67%) on 1-Year Reserved: $2,144/month ($3,200 on-demand, less 33% discount)
- 8 nodes (33%) on-demand: $1,600/month (no discount)
Total Monthly: $3,744/month (vs $4,800 all on-demand)
Savings: 16 nodes × $66/month (33%) = $1,056/month saved
Annual Impact: $1,056 × 12 = $12,672 saved
ROI if we over-committed:
- Risk: What if demand shrinks and we only use 12 of the 16 reserved nodes?
- Penalty: We pay for 16 RIs even if we only use 12 = at most $800/month wasted (4 idle RIs valued at the $200 list price)
- Break-even: Savings ($1,056) > Risk ($800) → Still worth it
Recommendation: Commit to 16 nodes (1-Year RI), keep 8 on-demand for flexibility
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The Commitment Trap: Many organizations over-commit because "60% discount sounds great!" Reality check:
If you commit to 30 nodes but only use 20:
* You save 60% on 20 nodes: 20 × $200 × 0.6 = $2,400 discount
* You still pay the discounted rate on 10 unused RIs: 10 × $200 × 0.4 = $800 wasted
* Net savings: $2,400 - $800 = $1,600
vs. If you commit to 15 nodes and use all 20:
* You save 60% on 15 nodes: 15 × $200 × 0.6 = $1,800 discount
* You pay full price on 5 nodes: 5 × $200 × 1.0 = $1,000 on-demand
* Net savings: $1,800 - $0 = $1,800
Under-committing is safer than over-committing. As a rule of thumb, commit to 60-80% of baseline. Never commit to 100% of forecasted capacity; you will regret it.
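To compare commitment levels against different usage outcomes, a small model helps. This is a minimal sketch with hypothetical function names. It assumes a flat $200/node/month on-demand price, that reserved nodes are billed at the discounted rate whether or not they are used, and that any usage above the commitment is billed on-demand. Real provider billing has more moving parts.

```python
def monthly_cost(committed: int, used: int,
                 on_demand_price: float = 200.0, discount: float = 0.60) -> float:
    """Monthly bill: all committed nodes at the RI rate, overflow on-demand."""
    ri_price = on_demand_price * (1 - discount)
    overflow = max(used - committed, 0)  # nodes beyond the commitment
    return committed * ri_price + overflow * on_demand_price


def net_savings(committed: int, used: int,
                on_demand_price: float = 200.0, discount: float = 0.60) -> float:
    """Savings versus running every used node on-demand."""
    all_on_demand = used * on_demand_price
    return all_on_demand - monthly_cost(committed, used, on_demand_price, discount)


# Sweep commitment levels for a month in which 20 nodes are actually used.
for committed in (10, 15, 20, 25, 30):
    print(f"commit {committed:>2}, use 20: "
          f"net savings ${net_savings(committed, 20):,.0f}/month")
```

Running the sweep for a pessimistic and an optimistic usage figure shows how quickly savings erode once the commitment exceeds actual usage, which is the case for staying below 100% of forecast.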
Risk Assessment Framework
Every capacity plan has risks. Your job is to identify them, quantify them, and propose mitigations.
Risk Matrix Template
| Risk | Likelihood | Impact | Cost if it Happens | Mitigation |
|---|---|---|---|---|
| Growth faster than forecasted (+50%) | Medium | High | $15K/month emergency capacity | Monitor Pod Velocity monthly; set alert at +20% variance |
| Growth slower than forecasted (-30%) | Low | Medium | $5K/month wasted RIs | Only commit to 60% of baseline (under-commit strategy) |
| Major product launch (unplanned) | Medium | High | $30K one-time spike | Require capacity review in product planning process |
| etcd limit hit before cluster split | Low | Critical | 12-hour outage, $500K revenue loss | Set alert at 5GB; force cluster split planning |
| Black Friday capacity insufficient | Low | High | $100K revenue loss per hour | Pre-provision 2× baseline in Q4; run load test in October |
| Key platform engineer leaves | High | Medium | 3-month delay in optimization work | Document capacity playbooks; cross-train 2nd engineer |
How to Quantify Risk Impact:
Revenue Loss (for customer-facing outages): hourly revenue × outage hours × share of platform affected
Example:
* Company revenue: $50M/year
* Hourly revenue: $50M / 365 / 24 = $5,708/hour
* 2-hour outage affecting 50% of platform: $5,708 × 2 × 0.5 = $5,708
Waste Cost (for over-provisioning): excess cores × cost per core per month × months
Example:
* Over-provisioned by 20 cores
* Cost: $200/core/month
* Duration: 6 months
* Waste: 20 × $200 × 6 = $24,000
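Both risk calculations are simple enough to script so that every row in the risk matrix is priced the same way. This is a sketch with hypothetical function names. It assumes revenue is spread evenly across the year, as the example does, which will understate the cost of an outage during a peak event.

```python
def revenue_loss(annual_revenue: float, outage_hours: float,
                 fraction_affected: float) -> float:
    """Estimated revenue lost to an outage, assuming evenly spread revenue."""
    hourly_revenue = annual_revenue / 365 / 24
    return hourly_revenue * outage_hours * fraction_affected


def waste_cost(excess_cores: float, cost_per_core_month: float,
               months: float) -> float:
    """Cost of capacity that was provisioned but not needed."""
    return excess_cores * cost_per_core_month * months


outage = revenue_loss(annual_revenue=50_000_000, outage_hours=2,
                      fraction_affected=0.5)
waste = waste_cost(excess_cores=20, cost_per_core_month=200, months=6)

print(f"Outage impact: ${outage:,.0f}")
print(f"Waste cost:    ${waste:,.0f}")
```

For peak-season risks, substitute an hourly revenue figure measured during the event in place of the annual average.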
Lab 7: The Executive Pitch
In this final lab, you’ll create a 12-month capacity roadmap for your organization and present it in a 3-minute executive pitch.
Step 1: Gather Your Data
Run the capacity-roadmap-generator.sh script. It queries Prometheus directly to pull
live values for pod count, worker nodes, CPU allocation and usage, etcd database size,
and pod velocity — the same metrics you collected across Modules 1–5.
bash ~/examples/module-07/capacity-roadmap-generator.sh
The script writes two files:
- ~/capacity-roadmap-data.txt: a plain-text data summary (all five module metrics)
- ~/12-month-capacity-roadmap.md: a filled-in roadmap template with your actual cluster numbers
|
Override defaults with environment variables:
Set |
══════════════════════════════════════════════════
Module 7 — Capacity Roadmap Generator
══════════════════════════════════════════════════
[INFO] Namespace : capacity-workshop
[INFO] Window : 30 days
[INFO] Node CPU : 8 cores (allocatable)
[INFO] Node cost : $200/month
[INFO] Step 1/7 — Locating Prometheus route …
[OK] Prometheus : https://prometheus-k8s-openshift-monitoring.apps...
[INFO] Step 3/7 — Gathering cluster baseline …
[OK] Total pods : 297
[OK] Worker nodes : 3
[OK] Allocatable CPU : 23.5 cores
[OK] CPU allocated (req) : 9.8 cores (41.7% of allocatable)
[OK] CPU used (actual) : 2.9 cores (12.3% of allocatable)
[OK] Waste (alloc - used) : 6.9 cores (29.4%)
[INFO] Step 4/7 — Calculating pod velocity (30-day window) …
[OK] Pods started (30d) : 142
[OK] Pod velocity : 4.73 pods/day
[INFO] Step 5/7 — Checking etcd database size …
[OK] etcd DB size : 0.12 GB (1.5% of 8GB limit)
[INFO] Step 7/7 — Generating 12-month capacity roadmap …
Roadmap written to /home/lab-user/12-month-capacity-roadmap.md
[OK] 12-month-capacity-roadmap.md → /home/lab-user/12-month-capacity-roadmap.md
Step 2: Review the Generated Roadmap
The generator from Step 1 already wrote ~/12-month-capacity-roadmap.md with your live
cluster data. Open it and replace the placeholder fields ([Your Name]) with your details:
cat ~/12-month-capacity-roadmap.md
|
Customize the template before your pitch:
|
|
For reference, the full template structure is shown below. Your generated file will have the same sections but populated with live numbers from your cluster.
|
Step 3: Practice Your 3-Minute Pitch
You’ll present this roadmap to leadership. Structure it like this:
Slide 1 (30 seconds): The Ask
"I’m requesting approval for $688K in infrastructure spend for 2026, a 15% increase over last year. This will support 79% growth in compute demand while cutting waste by $5K per month through optimization."
Slide 2 (45 seconds): The Why
"Our platform is growing fast: we’re adding 15 new services per month. At that pace, we’ll hit capacity limits in Q3 without action. We also have a Black Friday risk—last year’s spike showed we need 2× baseline capacity for 8 hours."
Slide 3 (45 seconds): The Plan
"The plan has four phases: Q2 - Optimize existing workloads, save $5K/month. Q3 - Add 4 nodes to prod-east as we approach 80% capacity. Q4 - Split the cluster to avoid etcd limits and pre-provision for Black Friday. Q1 2027 - Scale down and stabilize.
We’ll also buy 1-Year Reserved Instances for 67% of our baseline, saving $12.7K annually."
Slide 4 (30 seconds): The Risks
"The biggest risk is faster-than-forecasted growth. If we add 20 services/month instead of 15, we’ll need emergency capacity in Q4 costing $15K/month. We’re mitigating this with monthly monitoring and quarterly reviews."
Slide 5 (30 seconds): Next Steps
"I need three things: 1. Budget approval by April 15 so we can start Q2 optimization. 2. Headcount approval for +1 Platform Engineer in Q3 to support the cluster split. 3. Monthly 30-minute capacity reviews with this group to track actuals vs forecast.
Questions?"
The 3-Minute Rule: Executives have limited attention spans. Your pitch must:
✅ Start with the recommendation ("approve $688K")
✅ Explain why in business terms ("support 79% growth")
✅ Show you’ve thought about risks ("if growth is faster…")
✅ End with clear next steps ("budget approval by April 15")
❌ Don’t:
* Start with "Let me give you some background…" (skip to the ask)
* Use jargon ("etcd database size", "kubelet maxPods") (translate to business impact)
* Present without a recommendation ("here are three options, what do you think?") (pick one, defend it)
You’re the expert. They’re counting on you to make a recommendation, not just present data.
Step 4: Peer Review Exercise
|
Facilitator: Split the room into groups of 3-4 Instructions:
Scoring Rubric (1-5 scale):
Feedback Format:
Total time: 30 minutes per group |
Step 5: Export the Roadmap
Display your roadmap and copy it to take home:
cat ~/12-month-capacity-roadmap.md
The file is saved at ~/12-month-capacity-roadmap.md. Copy the contents and paste into
Google Docs, Confluence, or your organization’s documentation system.
|
Taking the roadmap further: If you have
|
|
Back at Your Organization: After this workshop:
Don’t wait for perfection. A rough roadmap this month is better than a perfect one in six months. |
Module 7 Summary: From Tactics to Strategy
You’ve now completed the full capacity planning lifecycle:
✅ Module 1: Established baseline (where are we today?)
✅ Module 2: Built forecasting model (where are we going?)
✅ Module 3: Right-sized workloads (developer perspective)
✅ Module 4: Understood infrastructure limits (density, etcd, split vs grow)
✅ Module 5: Enabled observability (multi-cluster visibility)
✅ Module 6: Survived a chaos simulation (decision-making under pressure)
✅ Module 7: Created a strategic roadmap (executive communication)
The Maturity Model Revisited
Where is your organization on the capacity planning maturity curve?
Level 1 - Reactive (Firefighting):
* No forecasting, just react to outages
* Add nodes when the cluster is already full
* No budget planning, just "emergency POs"
Level 2 - Tactical (Short-Term):
* Quarterly capacity reviews
* Add nodes based on current trends (linear extrapolation)
* Budget planning exists but is often wrong
Level 3 - Operational (Data-Driven) ← This Workshop Gets You Here:
* Monthly capacity reviews with Pod Velocity tracking
* Right-sizing workloads to reduce waste
* 12-month roadmap with quarterly milestones
* RHACM Observability for fleet visibility
Level 4 - Strategic (Predictive):
* Automated forecasting integrated with product roadmap
* Capacity planning drives architecture decisions (split vs grow)
* FinOps team partnership (chargeback, showback, commitment optimization)
* Self-service capacity dashboards for app teams
|
Most Platform Teams Are at Level 1-2 This workshop moved you to Level 3 in one day. Getting to Level 4 takes 6-12 months:
Start small: Pick one cluster, run the baseline audit (Module 1), and build the Pod Velocity model (Module 2). That’s 80% of the value. |
Key Takeaways
- Executives care about cost, risk, and growth, not technical metrics like etcd size
- Lead with the recommendation, then justify with data
- A 12-month roadmap should have quarterly milestones and a contingency budget
- Under-commit on Reserved Instances (60-80% of baseline) to avoid waste
- Risk analysis should quantify impact in dollars and propose mitigations
- Capacity planning is a collaboration between platform, app teams, finance, and leadership
Next Module
In Module 8: AI-Assisted Capacity Operations, you will use OpenShift Lightspeed — an AI assistant embedded in the OCP web console — to operationalise everything you built in Modules 1–7. Lightspeed will help you write the PromQL queries from Module 2, debug the QoS issues from Module 3, reason about node density from Module 4, and translate your metrics into the executive language you practised in this module.
Next Steps
- Within 7 days: Run the Module 1 baseline audit on your production clusters
- Within 30 days: Build the Module 2 Pod Velocity dashboard in your RHACM Grafana
- Within 60 days: Complete a 12-month capacity roadmap and present to leadership
- Within 90 days: Implement Module 3 right-sizing for your top 20 over-provisioned workloads
|
Share Your Success: After implementing these practices, share your results:
Teaching others reinforces your own learning and builds capacity planning muscle across your organization. |
Additional Resources
- Why platform teams need a capacity planning discipline, not just more dashboards: Red Hat thought leadership post on translating technical metrics into executive roadmaps.
- OpenShift capacity planning: from reactive firefighting to predictive forecasting. Foundational explainer on planning horizons and the Pod Velocity Model.
- Red Hat Docs: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html/scalability_and_performance/index
- RHACM Observability: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/observability/index
- FinOps Foundation: https://www.finops.org/ (cloud cost optimization best practices)
- CNCF Cost Optimization: https://tag-runtime.cncf.io/wgs/cost-optimization/
Feedback
We’d love to hear how this workshop helped you:
- What worked well?
- What was confusing?
- What would you add/change?
- Did you implement these practices at your organization? What happened?
Share feedback: [Insert feedback form URL or email]
Thank you for attending! Now go forth and plan some capacity. 🚀