Module 1: The Planning Horizon & Baselines
Duration: 60 minutes
Learning Objectives
By the end of this module, you will be able to:
- Differentiate between Tactical (0-3 mo), Operational (3-12 mo), and Strategic (1-3 yr) planning
- Audit a current OpenShift cluster to establish a capacity baseline
- Identify the gap between allocated resources and actual consumption
- Understand why capacity planning starts with "knowing where you are today"
Understanding Planning Horizons
Most teams fail at capacity planning because they don’t know their current utilization versus allocation. Before you can forecast where you’re going, you must establish where you are today.
The Three Planning Horizons
- Tactical Planning (0-3 months): Focus on immediate needs and firefighting. Questions like "Can we handle this deployment tomorrow?" or "Do we need to add nodes this week?"
- Operational Planning (3-12 months): Quarterly resource allocation, budget cycles, and predictable growth patterns. This is where most platform teams operate.
- Strategic Planning (1-3 years): Multi-year infrastructure investments, commitment purchases (Reserved Instances, Savings Plans), and architectural decisions.
The FinOps Lifecycle: Capacity planning maps directly to the FinOps framework's Inform, Optimize, and Operate phases.
The Cost of Ignorance
Consider this real-world scenario:
- A team requests CPU/Memory for 1000 pods
- Kubernetes reserves that capacity on nodes
- Actual CPU usage is only 20% of requested
- Result: 80% of infrastructure spend is wasted on unused capacity
The gap between "Allocated" (what developers asked for) and "Consumed" (what the workloads actually use) is where millions of dollars in cloud spend disappear.
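To make that concrete, here is a quick back-of-the-envelope sketch of the 1000-pod scenario above. The per-pod request of one full core is an assumption for illustration; swap in your own numbers.
awk 'BEGIN {
  requested = 1000 * 1.0           # assumption: 1000 pods, each requesting 1 full core
  used      = requested * 0.20     # only 20% of the requested CPU is actually consumed
  idle      = requested - used
  printf "Requested: %d cores, Used: %d cores, Idle but paid for: %d cores (%.0f%% waste)\n", requested, used, idle, idle / requested * 100
}'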
Lab 1: The 90-Day Audit
All terminal commands in this lab run on your student cluster. If your SSH session ended, reconnect before continuing:
SSH password:
In this hands-on lab, you will audit the capacity baseline of your OpenShift cluster, comparing requested resources versus actual usage.
Step 1: Check Cluster Nodes
First, let’s see what nodes are available in the cluster:
oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-107.us-east-2.compute.internal Ready control-plane,master 5d v1.28.5+4bc5c35
ip-10-0-145-209.us-east-2.compute.internal Ready worker 5d v1.28.5+4bc5c35
ip-10-0-155-132.us-east-2.compute.internal Ready worker 5d v1.28.5+4bc5c35
ip-10-0-174-87.us-east-2.compute.internal Ready worker 5d v1.28.5+4bc5c35
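If you also want the raw capacity figures for each node before looking at usage, a custom-columns query is a quick sketch (the column names are arbitrary; the JSONPath expressions reference standard Node status fields):
# Capacity vs. allocatable, straight from each Node object
oc get nodes -o custom-columns=NAME:.metadata.name,CPU_CAPACITY:.status.capacity.cpu,CPU_ALLOCATABLE:.status.allocatable.cpu,MEMORY_ALLOCATABLE:.status.allocatable.memory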
Step 2: Compare Allocation vs. Consumption
Now let's see what's actually being consumed, so we can compare it against what's been requested (allocated):
oc adm top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-0-145-209.us-east-2.compute.internal 423m 5% 4201Mi 13%
ip-10-0-155-132.us-east-2.compute.internal 891m 11% 5932Mi 19%
ip-10-0-174-87.us-east-2.compute.internal 512m 6% 3845Mi 12%
What does this tell us? The worker nodes are running at only 5-11% CPU and 12-19% memory utilization, far below the capacity that has been reserved for them. The next steps quantify how much of that reservation is actually needed.
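To reduce that table to a single headline number, you can average the CPU% column (field 3 of the oc adm top nodes output); a quick sketch:
# Average CPU utilization across all nodes (strip the % sign, then average field 3)
oc adm top nodes --no-headers | awk '{gsub("%", "", $3); sum += $3; n++} END {printf "Average node CPU utilization: %.1f%%\n", sum / n}'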
Step 3: Examine Node Allocatable Capacity
Let's dive deeper into the worker nodes to see how much of their allocatable capacity has already been requested:
oc describe node -l node-role.kubernetes.io/worker | grep -A 10 "Allocated resources"
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3420m (42%) 0 (0%)
memory 8192Mi (26%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Key Finding: This node has 42% of its CPU requested but is only consuming around 11%. This means the scheduler treats the node as nearly half full and will stop placing pods long before the hardware is actually busy, because scheduling decisions are based on requests, not real-time usage.
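To see the same "Allocated resources" summary for every worker at a glance, a small loop over the worker nodes works; this sketch uses only commands already shown in this step:
# Print the CPU and memory request/limit summary for each worker node
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "=== ${node} ==="
  oc describe "${node}" | grep -A 6 "Allocated resources" | grep -E "cpu|memory"
done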
Step 4: Identify High-Variance Namespaces
Let's find which namespaces have the largest gap between requests and actual usage. Start by counting how many pods in each namespace define resource requests at all:
oc get pods -A -o json | jq -r '
.items[] |
select(any(.spec.containers[]; .resources.requests != null)) |
.metadata.namespace
' | sort | uniq -c | sort -rn | head -10
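To rank namespaces by how much CPU they have requested in total, which is a useful proxy for where the biggest gaps can hide, here is a rough jq sketch; it normalizes both millicore ("500m") and whole-core ("1") request values to millicores:
# Total requested CPU (millicores) per namespace, largest first
oc get pods -A -o json | jq -r '
  [.items[] | {
     ns: .metadata.namespace,
     cpu: ([.spec.containers[].resources.requests.cpu // "0"
            | if test("m$") then (sub("m$";"") | tonumber) else (tonumber * 1000) end] | add)
   }]
  | group_by(.ns)
  | map({ns: .[0].ns, requested_m: (map(.cpu) | add)})
  | sort_by(-.requested_m)
  | .[:10][]
  | "\(.ns)\t\(.requested_m)m"
' | column -t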
Now let’s check the capacity-workshop namespace specifically:
oc adm top pods -n capacity-workshop
NAME CPU(cores) MEMORY(bytes)
besteffort-app-5598f76d5b-9jtz4 1m 12Mi
burstable-app-9b598b884-m2h5w 2m 24Mi
critical-app-6c4456cfc5-bj4zs 15m 48Mi
critical-app-6c4456cfc5-ln9pp 14m 47Mi
guaranteed-app-bc4d9685d-jpbhm 50m 128Mi
load-generator-67567c576-hdvkj 102m 64Mi
Now compare this to what was requested:
oc get pods -n capacity-workshop -o json | jq -r '
.items[] |
"\(.metadata.name)\t\(.spec.containers[0].resources.requests.cpu // "none")\t\(.spec.containers[0].resources.requests.memory // "none")"
' | column -t
besteffort-app-5598f76d5b-9jtz4 none none
burstable-app-9b598b884-m2h5w 100m 128Mi
critical-app-6c4456cfc5-bj4zs 200m 256Mi
critical-app-6c4456cfc5-ln9pp 200m 256Mi
guaranteed-app-bc4d9685d-jpbhm 200m 256Mi
load-generator-67567c576-hdvkj 100m 128Mi
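To see usage and requests side by side without eyeballing two tables, a simple loop can join them. This sketch assumes single-container pods, which holds for the workshop namespace:
# CPU usage vs. CPU request for each pod in capacity-workshop
oc adm top pods -n capacity-workshop --no-headers | while read -r name cpu mem; do
  req=$(oc get pod "${name}" -n capacity-workshop -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
  printf "%s\tusage=%s\trequest=%s\n" "${name}" "${cpu}" "${req:-none}"
done | column -t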
Analysis: The besteffort-app pod sets no requests at all, while the burstable, critical, and guaranteed pods are using only about 2-25% of the CPU they requested (for example, critical-app consumes ~15m against a 200m request). Only the load-generator comes close to using what it asked for.
Financial Impact: If this pattern exists across 100 namespaces, you're wasting 85-90% of your infrastructure budget.
Step 5: Calculate Cluster-Wide Waste
Let’s quantify the total waste across the cluster:
echo "=== Cluster-Wide Capacity Summary ==="
echo ""
echo "Total Allocatable CPU (all workers):"
oc get nodes -l node-role.kubernetes.io/worker -o json | \
jq -r '[.items[].status.allocatable.cpu | gsub("m";"") | tonumber] | add / 1000' | \
xargs printf "%.2f cores\n"
echo ""
echo "Total Requested CPU:"
oc get pods -A -o json | \
jq -r '[.items[].spec.containers[].resources.requests.cpu // "0" | gsub("m";"") | tonumber] | add / 1000' | \
xargs printf "%.2f cores\n"
echo ""
echo "Total Consumed CPU (current):"
oc adm top nodes --no-headers | awk '{sum+=$2} END {print sum/1000 " cores"}'
=== Cluster-Wide Capacity Summary ===
Total Allocatable CPU (all workers):
24.00 cores
Total Requested CPU:
10.50 cores (43.75% of allocatable)
Total Consumed CPU (current):
1.83 cores (7.6% of allocatable)
The Baseline Verdict: Only about 44% of the cluster's allocatable CPU has been requested, and under 8% is actually being consumed. The difference between requested and consumed capacity (roughly 8.7 cores, or about 36% of the cluster) is reserved but idle. If this cluster costs $50,000/month, you're wasting approximately $18,000/month on unused reservations.
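You can turn those three numbers into a dollar figure with a few lines of arithmetic. The $50,000/month cluster cost is the assumption from the verdict above; substitute your own bill:
# Rough monthly waste estimate from the audit numbers above
awk 'BEGIN {
  allocatable = 24.00    # total allocatable worker CPU (cores)
  requested   = 10.50    # CPU requested by workloads (cores)
  consumed    = 1.83     # CPU actually consumed right now (cores)
  monthly     = 50000    # assumed monthly cluster cost (USD)
  idle = requested - consumed
  printf "Reserved but idle: %.2f cores (%.1f%% of the cluster) => ~$%.0f/month\n", idle, idle / allocatable * 100, idle / allocatable * monthly
}'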
Lab 1 Summary: Your Baseline
You’ve now completed the 90-Day Audit. You should have discovered:
✅ Total allocatable cluster capacity
✅ Percentage of capacity requested by workloads
✅ Percentage of capacity actually consumed
✅ Gap between requests and usage (the "waste zone")
✅ Namespaces with the highest variance
Facilitator Note: This is the moment to address developer friction. Developers often see resource requests as "bureaucratic overhead" or feel pressured to over-request "just in case." Emphasize: accurate requests are not a restriction; they are a contract with the scheduler. Without them, the scheduler is blind, and we'll prove the consequences of that blindness in Module 3.
Key Takeaways
- Planning starts with baselining: you can't forecast if you don't know your current state
- The gap between allocated and consumed resources is where money disappears
- Tactical, Operational, and Strategic planning require different time horizons
- Most infrastructure waste comes from inaccurate resource requests, not infrastructure overhead
Next Steps
In Module 2: Mathematics of Forecasting, we’ll take this baseline and build predictive models using Pod Velocity and PromQL to forecast future capacity needs.
Pre-Module 2 Homework (Optional): If you have access to your own production clusters, run the same audit commands and bring your findings to Module 2. We'll use real data to build your forecasting model.
Further reading
- OpenShift capacity planning: from reactive firefighting to predictive forecasting — Red Hat blog post covering planning horizons, the allocated vs. consumed gap, and the Pod Velocity Model introduced in Modules 1 and 2.