Module 1: The Planning Horizon & Baselines

Duration: 60 minutes

Learning Objectives

By the end of this module, you will be able to:

  • Differentiate between Tactical (0-3 mo), Operational (3-12 mo), and Strategic (1-3 yr) planning

  • Audit a current OpenShift cluster to establish a capacity baseline

  • Identify the gap between allocated resources and actual consumption

  • Understand why capacity planning starts with "knowing where you are today"

Understanding Planning Horizons

Most teams fail at capacity planning because they don’t know their current utilization versus allocation. Before you can forecast where you’re going, you must establish where you are today.

The Three Planning Horizons

Tactical Planning (0-3 months)

Focus on immediate needs and firefighting. Questions like "Can we handle this deployment tomorrow?" or "Do we need to add nodes this week?"

Operational Planning (3-12 months)

Quarterly resource allocation, budget cycles, and predictable growth patterns. This is where most platform teams operate.

Strategic Planning (1-3 years)

Multi-year infrastructure investments, commitment purchases (Reserved Instances, Savings Plans), and architectural decisions.

The FinOps Lifecycle

Capacity planning maps directly to the FinOps framework:

  1. Inform - Establish baseline (this module)

  2. Optimize - Right-size workloads (Module 3)

  3. Operate - Build dashboards & forecasts (Modules 2 & 5)

The Cost of Ignorance

Consider this real-world scenario:

  • A team requests CPU/Memory for 1000 pods

  • Kubernetes reserves that capacity on nodes

  • Actual CPU usage is only 20% of requested

  • Result: 80% of that reserved capacity sits idle, yet you still pay for the nodes sized to hold it

The gap between "Allocated" (what developers asked for) and "Consumed" (what hardware actually uses) is where millions of dollars in cloud spend disappear.
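To make that arithmetic concrete, here is a throwaway calculation using made-up totals (500 cores requested, 100 cores actually used, i.e. the 20% utilization from the scenario above). The figures are illustrative, not measurements:

requested_cores=500   # hypothetical total CPU requested by the 1000 pods
used_cores=100        # hypothetical actual usage (20% of requested)

awk -v req="$requested_cores" -v used="$used_cores" 'BEGIN {
  idle = req - used
  printf "Idle reservation: %d cores (%.0f%% of the capacity you pay for)\n", idle, idle / req * 100
}'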

Lab 1: The 90-Day Audit

All terminal commands in this lab run on your student cluster. If your SSH session ended, reconnect before continuing:

ssh lab-user@<bastion-host>

When prompted, enter the SSH password provided for your lab environment.

In this hands-on lab, you will audit the capacity baseline of your OpenShift cluster, comparing requested resources versus actual usage.

Step 1: Check Cluster Nodes

First, let’s see what nodes are available in the cluster:

oc get nodes
Sample Output
NAME                                         STATUS   ROLES                  AGE   VERSION
ip-10-0-128-107.us-east-2.compute.internal   Ready    control-plane,master   5d    v1.28.5+4bc5c35
ip-10-0-145-209.us-east-2.compute.internal   Ready    worker                 5d    v1.28.5+4bc5c35
ip-10-0-155-132.us-east-2.compute.internal   Ready    worker                 5d    v1.28.5+4bc5c35
ip-10-0-174-87.us-east-2.compute.internal    Ready    worker                 5d    v1.28.5+4bc5c35
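It is also worth recording each worker's raw capacity and, on cloud-provisioned clusters, its instance type, since that is what ultimately drives the bill. One quick way, assuming the standard node.kubernetes.io/instance-type label is present on your nodes:

oc get nodes -l node-role.kubernetes.io/worker \
  -o custom-columns='NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type'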

Step 2: Compare Allocation vs. Consumption

Now let’s compare what’s requested (allocated) versus what’s actually consumed:

oc adm top nodes
Sample Output
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-145-209.us-east-2.compute.internal   423m         5%     4201Mi          13%
ip-10-0-155-132.us-east-2.compute.internal   891m         11%    5932Mi          19%
ip-10-0-174-87.us-east-2.compute.internal    512m         6%     3845Mi          12%

What does this tell us?

  • Worker nodes are consuming 5-11% CPU

  • Memory utilization is 12-19%

  • These nodes likely have much higher requests than actual usage
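On larger clusters it helps to sort this view so the hottest and coldest nodes stand out. Recent oc/kubectl releases support a sort flag on top (if yours does not, sort -k2 -nr on the output gives a similar ranking):

oc adm top nodes --sort-by=cpu       # highest CPU consumers first
oc adm top nodes --sort-by=memory    # highest memory consumers first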

Step 3: Examine Node Allocatable Capacity

Let’s dive deeper to see how much of each worker node’s allocatable capacity has already been requested:

oc describe node -l node-role.kubernetes.io/worker | grep -A 10 "Allocated resources"
Sample Output
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                3420m (42%)   0 (0%)
  memory             8192Mi (26%)  0 (0%)
  ephemeral-storage  0 (0%)        0 (0%)

Key Finding: This node has 42% CPU requested but is only consuming 11%.

This means:

  • 31% of CPU capacity is reserved but unused

  • The scheduler can’t use this capacity for new pods

  • You’re paying for hardware that sits idle
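One limitation of the grep above: it prints one "Allocated resources" block per worker but drops the node names, so you cannot tell which block belongs to which node. A small loop like this labels each block (a sketch; widen the -A window if your nodes list more resource rows):

for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "== ${node#node/} =="
  oc describe "$node" | grep -A 8 "Allocated resources" | grep -E "cpu|memory"
done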

Step 4: Identify High-Variance Namespaces

Let’s start narrowing down where the gap between requests and usage lives. As a first pass, list which namespaces have the most pods that declare resource requests:

oc get pods -A -o json | jq -r '
  .items[] |
  select(any(.spec.containers[]; .resources.requests != null)) |
  .metadata.namespace
' | sort | uniq -c | sort -rn | head -10
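Counting pods with requests is only a rough proxy. To rank namespaces by how much CPU they have actually reserved, a jq sketch like the following works; it assumes CPU requests are written either as whole cores (e.g. "1") or millicores (e.g. "250m"):

oc get pods -A -o json | jq -r '
  [ .items[]
    | { ns: .metadata.namespace,
        cpu_m: ([ .spec.containers[].resources.requests.cpu // "0"
                  | if test("m$") then (sub("m$";"") | tonumber)
                    else (tonumber * 1000) end ] | add) } ]
  | group_by(.ns)
  | map({ ns: .[0].ns, cpu_m: (map(.cpu_m) | add) })
  | sort_by(-.cpu_m)
  | .[:10][]
  | "\(.ns)\t\(.cpu_m)m requested"
'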

Now let’s check the capacity-workshop namespace specifically:

oc adm top pods -n capacity-workshop
Sample Output
NAME                              CPU(cores)   MEMORY(bytes)
besteffort-app-5598f76d5b-9jtz4   1m           12Mi
burstable-app-9b598b884-m2h5w     2m           24Mi
critical-app-6c4456cfc5-bj4zs     15m          48Mi
critical-app-6c4456cfc5-ln9pp     14m          47Mi
guaranteed-app-bc4d9685d-jpbhm    50m          128Mi
load-generator-67567c576-hdvkj    102m         64Mi

Now compare this to what was requested:

oc get pods -n capacity-workshop -o json | jq -r '
  .items[] |
  "\(.metadata.name)\t\(.spec.containers[0].resources.requests.cpu // "none")\t\(.spec.containers[0].resources.requests.memory // "none")"
' | column -t
Sample Output
besteffort-app-5598f76d5b-9jtz4    none    none
burstable-app-9b598b884-m2h5w      100m    128Mi
critical-app-6c4456cfc5-bj4zs      200m    256Mi
critical-app-6c4456cfc5-ln9pp      200m    256Mi
guaranteed-app-bc4d9685d-jpbhm     200m    256Mi
load-generator-67567c576-hdvkj     100m    128Mi

Analysis:

  • critical-app requested 200m CPU but uses only 15m (~7.5% utilization)

  • burstable-app requested 100m CPU but uses only 2m (~2% utilization)

  • besteffort-app has NO requests - dangerous! (We’ll fix this in Module 3)

Financial Impact: If this pattern exists across 100 namespaces, you’re wasting 85-90% of your infrastructure budget.
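You can join the two views above into a single utilization table with a short loop. This is a sketch that assumes CPU requests in the namespace are written in millicores (e.g. 200m), as they are in this lab:

ns=capacity-workshop
oc adm top pods -n "$ns" --no-headers | while read -r pod cpu mem; do
  req=$(oc get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
  used_m=${cpu%m}
  if [ -n "$req" ]; then
    req_m=${req%m}
    printf "%-35s requested=%-6s used=%-6s utilization=%s%%\n" \
      "$pod" "$req" "$cpu" "$(( used_m * 100 / req_m ))"
  else
    printf "%-35s requested=none   used=%s\n" "$pod" "$cpu"
  fi
done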

Step 5: Calculate Cluster-Wide Waste

Let’s quantify the total waste across the cluster:

echo "=== Cluster-Wide Capacity Summary ==="
echo ""
echo "Total Allocatable CPU (all workers):"
oc get nodes -l node-role.kubernetes.io/worker -o json | \
  jq -r '[.items[].status.allocatable.cpu | gsub("m";"") | tonumber] | add / 1000' | \
  xargs printf "%.2f cores\n"

echo ""
echo "Total Requested CPU:"
oc get pods -A -o json | \
  jq -r '[.items[].spec.containers[].resources.requests.cpu // "0" | gsub("m";"") | tonumber] | add / 1000' | \
  xargs printf "%.2f cores\n"

echo ""
echo "Total Consumed CPU (current):"
oc adm top nodes --no-headers | awk '{sum+=$2} END {print sum/1000 " cores"}'
Sample Output
=== Cluster-Wide Capacity Summary ===

Total Allocatable CPU (all workers):
24.00 cores

Total Requested CPU:
10.50 cores

Total Consumed CPU (current):
1.83 cores

The Baseline Verdict:

  • 43.75% of cluster capacity is reserved by requests

  • Only 7.6% is actually being used

  • 36% of capacity is "zombie allocation" - reserved but never consumed

If this cluster costs $50,000/month, you’re wasting approximately $18,000/month on unused reservations.
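That dollar figure is simple arithmetic: the zombie-allocation percentage times the monthly bill. A quick check with the numbers from this lab and an assumed $50,000/month cluster cost:

requested_pct=43.75    # 10.50 / 24.00 cores, from the audit above
consumed_pct=7.6       # 1.83 / 24.00 cores, from the audit above
monthly_cost=50000     # assumed monthly cluster cost in USD

awk -v r="$requested_pct" -v c="$consumed_pct" -v m="$monthly_cost" 'BEGIN {
  zombie = r - c
  printf "Zombie allocation: %.1f%% of capacity, roughly $%.0f/month in unused reservations\n", zombie, zombie / 100 * m
}'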

Lab 1 Summary: Your Baseline

You’ve now completed the 90-Day Audit. You should have discovered:

✅ Total allocatable cluster capacity

✅ Percentage of capacity requested by workloads

✅ Percentage of capacity actually consumed

✅ Gap between requests and usage (the "waste zone")

✅ Namespaces with the highest variance

Facilitator Note:

This is the moment to address developer friction. Developers often see resource requests as "bureaucratic overhead" or feel pressured to over-request "just in case."

Emphasize: Accurate requests are not a restriction - they’re a contract with the scheduler. Without them, the scheduler is blind, leading to:

  • Pod evictions under pressure

  • Noisy neighbor problems

  • Unpredictable performance

We’ll prove this in Module 3.

Key Takeaways

  • Planning starts with baselining - You can’t forecast if you don’t know your current state

  • The gap between allocated and consumed resources is where money disappears

  • Tactical, Operational, and Strategic planning require different time horizons

  • Most infrastructure waste comes from inaccurate resource requests, not infrastructure overhead

Next Steps

In Module 2: Mathematics of Forecasting, we’ll take this baseline and build predictive models using Pod Velocity and PromQL to forecast future capacity needs.

Pre-Module 2 Homework (Optional):

If you have access to your own production clusters, run the same audit commands and bring your findings to Module 2. We’ll use real data to build your forecasting model.
