Module 4: Infrastructure Track - Fleet Architecture & Sizing
Duration: 90 minutes
Learning Objectives
By the end of this module, you will be able to:
- Calculate minimum worker node counts using maxPods mathematics
- Understand OpenShift control plane limits (etcd constraints)
- Determine when to split clusters versus grow them (the monolith vs federation decision)
- Tune kubeletConfig to increase node density from 250 to 500+ pods per node
- Measure the impact of density changes on node memory overhead and control plane performance
The Mathematics of Node Density
As clusters grow from hundreds to thousands of nodes, new bottlenecks emerge that aren’t visible at smaller scales. The most critical constraint is how many pods can run on a single worker node.
The Default: 250 Pods Per Node
OpenShift sets a default limit of 250 pods per node via the kubelet configuration. This seems like a large number, but it’s surprisingly easy to hit:
Scenario: Microservices architecture with 3 replicas per service
100 services × 3 replicas = 300 pods
300 pods ÷ 250 pods/node = 1.2 nodes minimum
But production needs redundancy headroom (a factor of 2.0 here, matching the formula below), so:
1.2 × 2.0 = 2.4 → 3 worker nodes minimum
The Pod Density Formula:

Minimum Nodes = ceil( (Total Pods × Redundancy Factor) ÷ maxPods per node )

Where:
- Total Pods = sum of all deployment replicas across the cluster
- maxPods per node = kubelet configuration (default 250, tunable to 500+)
- Redundancy Factor = typically 1.5-2.0 for production (allows for node failures and rolling updates)

Example:
- 1000 total pods
- 250 maxPods/node
- 2.0 redundancy factor
- Minimum Nodes = ceil( (1000 × 2.0) ÷ 250 ) = 8 worker nodes
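If you want to sanity-check the arithmetic from a terminal, here is a minimal bash sketch of the same formula, using the hypothetical values from the example above (bash only does integer math, so use a whole-number redundancy factor):

# Node-count estimate: ceil((total_pods * redundancy) / max_pods_per_node)
TOTAL_PODS=1000
MAX_PODS_PER_NODE=250
REDUNDANCY=2
echo $(( (TOTAL_PODS * REDUNDANCY + MAX_PODS_PER_NODE - 1) / MAX_PODS_PER_NODE ))   # prints 8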
Why 250? The Memory Overhead Problem
Each pod consumes memory beyond its container requests for:
- Kubelet tracking structures (~2KB per pod)
- CNI plugin state (Open vSwitch flows, iptables rules)
- Container runtime metadata (CRI-O)
- Linux kernel structures (cgroups, namespaces)

At 250 pods per node, this overhead is approximately:
250 pods × ~10MB overhead = ~2.5GB node memory reserved for Kubernetes itself
The Hidden Tax: On a 32GB worker node:
- Application pods see ~29GB allocatable memory (32GB - 3GB reserved for system)
- At 250 pods, ~2.5GB of that is Kubernetes overhead
- Effective capacity for app workloads: ~26.5GB (~17% overhead)

If you increase maxPods to 500:
- 500 pods × ~10MB = ~5GB Kubernetes overhead
- Effective capacity: ~24GB (~25% overhead)

The trade-off: Higher density = more efficient node usage, but less memory per pod on average.
When to Increase maxPods (500+)
Increase maxPods beyond 250 when:
✅ You have many small pods (10-50MB memory each) - e.g., serverless functions, edge workloads
✅ Your nodes have 64GB+ memory (overhead percentage decreases)
✅ You're optimizing for node count reduction (cost savings on bare-metal)
✅ Control plane can handle the API request load (see etcd limits below)
Do NOT increase maxPods if:
❌ Pods average >500MB memory (you won't fit 500 pods anyway)
❌ Cluster has >5000 total pods (etcd limit - see next section)
❌ Control plane nodes are undersized (<8 cores, <32GB RAM)
The etcd 8GB Database Limit
OpenShift’s control plane stores all cluster state in etcd, a distributed key-value store. The hard limit is:
etcd database maximum size: 8GB

Once etcd reaches 8GB, the cluster enters a degraded state:
- API requests slow down (300ms → 3000ms response times)
- New pod creations fail intermittently
- kubectl/oc commands timeout
- etcd requires defragmentation and compaction
What Consumes etcd Space?
The primary consumers are:
| Object Type | Typical Size | Impact at Scale |
|---|---|---|
| Pods | ~4KB per pod | 10,000 pods = 40MB |
| Events | ~1KB per event | High churn apps generate 100K+ events/day = 100MB/day |
| ConfigMaps/Secrets | Variable (1KB-1MB) | Large ConfigMaps (>100KB) are dangerous |
| Custom Resources | Variable | Operators creating 1000s of CRs can consume GB |
The Event Explosion Problem: A single failing CrashLoopBackOff pod generates roughly 8.6MB of events per day (thousands of back-off events at ~1KB each), so:

100 failing pods = 860MB/day of etcd growth from events alone!

Solution: OpenShift automatically prunes events older than 3 hours, but high-churn environments need monitoring.
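To see whether event churn is a problem on your own cluster, one quick check is to count Warning events grouped by reason — BackOff or FailedScheduling dominating the list is a churn signal:

oc get events -A --field-selector type=Warning -o json | jq -r '.items[].reason' | sort | uniq -c | sort -rn | head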
The 10,000 Pod Ceiling
Based on Red Hat’s official support limits:
Maximum supported cluster size (OpenShift 4.14+):
- 500 worker nodes
- 10,000 pods total
- 120,000 total objects in etcd
How to Monitor etcd Usage:

Alert Thresholds:
- 4GB (50%) - Start planning capacity relief
- 6GB (75%) - Urgent action required
- 7GB (87.5%) - Critical, begin cluster split planning
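One way to read the current database size against these thresholds is to query etcd from one of its pods (a sketch assuming the default openshift-etcd labels and the etcdctl sidecar container that ships with OpenShift 4; requires cluster-admin):

ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}')
oc exec -n openshift-etcd -c etcdctl "$ETCD_POD" -- etcdctl endpoint status -w table
# The DB SIZE column shows how close you are to the 8GB limit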
The Monolith vs Federation Decision
When you hit etcd limits or operational complexity, you face a choice:
Option 1: Grow the Cluster (Scale Up)
Approach: Add more worker nodes, tune maxPods, optimize object count
Pros:
- Simpler operations (one cluster to manage)
- Shared resource pool (better bin-packing efficiency)
- Centralized observability

Cons:
- Blast radius increases (outage affects everyone)
- etcd limits eventually force a split anyway
- Harder to isolate noisy neighbors
Best For: Organizations with <5000 pods, strong platform team, low multi-tenancy requirements
Option 2: Split the Cluster (Scale Out)
Approach: Create multiple clusters, federate management via RHACM
Pros:
- Fault isolation (one cluster failure doesn't take down all apps)
- Separate etcd databases (10,000 pod limit per cluster → 30,000 pods across 3 clusters)
- Multi-tenancy enforcement (different teams get different clusters)

Cons:
- Operational overhead (3 clusters = 3× control planes to patch/upgrade)
- Resource fragmentation (each cluster needs spare capacity, can't share)
- Cross-cluster networking complexity (if apps need to talk across clusters)
Best For: Organizations with >10,000 pods, strong multi-tenancy requirements, or geographic distribution needs
The Hybrid Approach: RHACM Fleet Management
Red Hat Advanced Cluster Management (RHACM) enables the "best of both worlds":
- Multiple clusters for isolation and scale
- Centralized policy management (governance, security, config)
- Unified observability (one Grafana dashboard across all clusters)
- Application portability (ArgoCD ApplicationSets deploy to multiple clusters)
The RHACM Fleet Model:

Management Overhead:
- Hub cluster: 16 core / 64GB RAM control plane
- Each managed cluster: Standard 3-node control plane

Capacity Planning Impact: You now forecast per-cluster, then aggregate at the hub level. Module 5 covers this in detail.
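As a concrete illustration of the application-portability point, a minimal ArgoCD ApplicationSet sketch (names and repo URL are placeholders, not part of this workshop) that fans one app out to every cluster registered with the GitOps instance could look like this:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: capacity-dashboard
  namespace: openshift-gitops
spec:
  generators:
  - clusters: {}                      # one Application per registered cluster
  template:
    metadata:
      name: 'capacity-dashboard-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/org/capacity-dashboard.git   # placeholder repo
        targetRevision: main
        path: manifests
      destination:
        server: '{{server}}'          # filled in per cluster by the generator
        namespace: capacity-dashboard
      syncPolicy:
        automated: {}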
Lab 4: The Density Game
All terminal commands in this lab run on your student cluster. If your SSH session ended, reconnect before continuing.

SSH password:
In this hands-on lab, you’ll modify the kubeletConfig to increase maxPods from 250 to 500, then observe the impact on node memory overhead and control plane performance during a mass-scheduling event.
Single-User Mode: Namespace-Scoped Permissions

You have namespace-scoped permissions on the hub cluster. Some operations in this lab require cluster-admin access.

For learning purposes, this module will demonstrate the commands and expected output. In a production environment or multi-user workshop with dedicated SNO clusters, you would have the necessary permissions to execute these changes. You can still complete the conceptual exercises and calculations. Skip hands-on steps that require cluster-admin permissions.
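If you are unsure which category you fall into, a quick self-check with standard RBAC queries (no admin rights needed to ask) is:

# "yes" means you can run the KubeletConfig steps yourself; "no" means read along instead
oc auth can-i create kubeletconfigs.machineconfiguration.openshift.io
oc auth can-i get machineconfigpools.machineconfiguration.openshift.io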
Part 1: Baseline - Check Current Density
First, let’s see the current maxPods setting:
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.pods}{"\n"}{end}'
ip-10-0-145-209.us-east-2.compute.internal 250
ip-10-0-155-132.us-east-2.compute.internal 250
ip-10-0-174-87.us-east-2.compute.internal 250
Now check how many pods are currently running per node:
oc get pods -A -o json | jq -r '.items[] | .spec.nodeName' | sort | uniq -c | sort -rn | head -5
42 ip-10-0-145-209.us-east-2.compute.internal
38 ip-10-0-155-132.us-east-2.compute.internal
35 ip-10-0-174-87.us-east-2.compute.internal
Current State:
- maxPods = 250
- Actual pods per node = 35-42 (~16% utilization)
- Plenty of headroom for growth

Question for Discussion: If we have 100 nodes at 16% pod density utilization, are we over-provisioned?
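One way to put a number on that discussion question for your own cluster is to compare running pods against the summed pod capacity of all nodes (a rough cluster-wide utilization figure that ignores per-node imbalance):

RUNNING=$(oc get pods -A --no-headers --field-selector=status.phase=Running | wc -l)
CAPACITY=$(oc get nodes -o jsonpath='{range .items[*]}{.status.capacity.pods}{"\n"}{end}' | awk '{sum+=$1} END {print sum}')
echo "$RUNNING running pods out of $CAPACITY schedulable pod slots"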
Part 2: Check Node Memory Overhead
Let’s see how much memory is consumed by Kubernetes overhead (not app pods):
oc adm top nodes -l node-role.kubernetes.io/worker | awk '{print $1 "\t" $5}'
NAME MEMORY%
ip-10-0-145-209.us-east-2.compute.internal 13%
ip-10-0-155-132.us-east-2.compute.internal 19%
ip-10-0-174-87.us-east-2.compute.internal 12%
Now let’s calculate the overhead:
# Get allocatable memory for a worker node
oc get node -l node-role.kubernetes.io/worker -o json | jq -r '.items[0].status.allocatable.memory'
30988400Ki
Converting Ki to GiB: divide by 1,048,576 (1024 × 1024). Here, 30988400Ki ≈ 29.6GiB allocatable.
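If you prefer to let the shell do the conversion, a one-liner along these lines works (it assumes the allocatable value is reported with a Ki suffix, as in the output above):

oc get node -l node-role.kubernetes.io/worker -o json \
  | jq -r '.items[0].status.allocatable.memory' \
  | sed 's/Ki$//' \
  | awk '{printf "%.1f GiB allocatable\n", $1/1048576}'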
Memory Overhead Calculation: On a 32GB node (e.g., ip-10-0-145-209 at ~13% memory usage ≈ 4.2GB), subtracting the ~3GB system reservation leaves roughly 1.3GB attributable to Kubernetes per-pod overhead.

Overhead per pod: 1.3GB ÷ 42 pods ≈ 31MB per pod

(This is higher than the theoretical 10MB because it includes CNI and CRI-O state.)
Part 3: Increase maxPods to 500
First, confirm which MachineConfigPool owns your nodes — this determines the selector to use:
oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT
master rendered-master-abc123 True False False 3 3 3
worker rendered-worker-def456 True False False 0 0 0
The worker pool shows MACHINECOUNT: 0 because this is a compact cluster — all nodes carry both master and worker roles but are assigned to the master MachineConfigPool. Apply the KubeletConfig targeting master:
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods-500
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    maxPods: 500
EOF
kubeletconfig.machineconfiguration.openshift.io/set-max-pods-500 created
SNO Clusters

Single Node OpenShift places its single node in the master MachineConfigPool, so the same master-targeted KubeletConfig shown above applies there without modification.
Standard Multi-Node Clusters (dedicated workers)

On a standard cluster with separate worker nodes, target the worker pool instead by selecting the pools.operator.machineconfiguration.openshift.io/worker: "" label in the machineConfigPoolSelector.

Only worker nodes are rebooted — control-plane nodes are unaffected.
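For reference, the worker-pool variant differs only in the name and selector label — a sketch with the same maxPods value would be:

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods-500-workers
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    maxPods: 500
EOF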
What Just Happened? The Machine Config Operator (MCO) will:
- Render a new MachineConfig containing the updated kubelet configuration
- Cordon and drain one node at a time
- Apply the new kubelet config and reboot the node
- Uncordon the node and move on to the next

Expected Duration: 10-15 minutes per node (30-45 minutes total for a 3-node compact cluster)

Watch the rollout:
Monitor the Machine Config Pool update:
oc get mcp master -w
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT
master rendered-master-abc123 True False False 3 3 3
master rendered-master-def456 False True False 3 2 1
master rendered-master-def456 False True False 3 1 2
master rendered-master-def456 True False False 3 3 3
Press Ctrl+C once UPDATED changes back to True.
Now verify maxPods increased:
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.pods}{"\n"}{end}'
ip-10-0-28-47.us-east-2.compute.internal 500
ip-10-0-58-247.us-east-2.compute.internal 500
ip-10-0-95-86.us-east-2.compute.internal 500
Success Criteria:
- All nodes now show a pod capacity of 500 in the output above

If any node shows DEGRADED=True, check the MachineConfigPool status and the Machine Config Operator logs (see below).
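A reasonable first pass at troubleshooting a degraded pool (standard MCO locations, no lab-specific names assumed) is:

# Why is the pool degraded?
oc describe mcp master | grep -i -A 10 'degraded'
# What is the Machine Config Operator complaining about?
oc -n openshift-machine-config-operator logs deployment/machine-config-controller --tail=50
# Any node stuck cordoned mid-rollout?
oc get nodes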
Part 4: The Mass-Scheduling Event
Now we’ll simulate a "Black Friday deployment" where you need to scale from 50 pods to 500 pods in under 2 minutes.
First, check current control plane API response times:
time oc get pods -A --no-headers | wc -l
127
real 0m0.342s ← Baseline: 342ms to list all pods
Now deploy the mass-scheduling test:
oc create deployment density-test --image=registry.access.redhat.com/ubi9/ubi-micro:latest --replicas=400 -n capacity-workshop -- sleep infinity
oc set resources deployment density-test -n capacity-workshop --requests=cpu=10m,memory=16Mi --limits=cpu=50m,memory=32Mi
Watch the scheduling progress:
watch -n 2 'oc get deployment density-test -n capacity-workshop && echo "" && oc get pods -n capacity-workshop --field-selector=status.phase!=Running | wc -l'
NAME READY UP-TO-DATE AVAILABLE AGE
density-test 287/400 400 287 1m
113 ← Pods still pending/creating
What You're Observing: During mass-scheduling, the control plane is processing:
- Hundreds of scheduler binding decisions
- An API write (and etcd write) for each new pod object
- A flood of status updates from kubelets as containers start
- CNI IP address allocations for every pod

If the cluster was already near its maxPods limits, some pods would remain Pending with a FailedScheduling event similar to: 0/3 nodes are available: 3 Too many pods.
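If you want to watch for that failure mode during the test, these two commands (run in another terminal) list any Pending pods and the most recent FailedScheduling events:

oc get pods -n capacity-workshop --field-selector=status.phase=Pending | head
oc get events -n capacity-workshop --field-selector=reason=FailedScheduling --sort-by=.lastTimestamp | tail -5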
Once the deployment is fully ready, test API response time again:
time oc get pods -A --no-headers | wc -l
527
real 0m1.245s ← Degraded: 1245ms (3.6× slower than baseline)
Performance Impact Analysis:

Before mass-scheduling:
- 127 pods across cluster
- API response: 342ms

After mass-scheduling:
- 527 pods across cluster (4.1× increase)
- API response: 1245ms (3.6× slower)

Why?
- etcd query time increases with object count
- API server has to filter/sort more results
- Network serialization overhead (more data to send)

Is this acceptable? For most use cases, yes. But if you're building a platform for 10,000+ pods, you need to:
- Scale control plane nodes vertically (more CPU/RAM)
- Implement API priority and fairness (limit low-priority requests)
- Consider cluster federation (split the load)
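On the API priority and fairness point, a minimal FlowSchema sketch (illustrative names; the API version is v1beta3 on OpenShift 4.14-era clusters and v1 on newer ones) that pushes list/watch traffic from a test service account into the built-in workload-low priority level looks like this:

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: deprioritize-density-test
spec:
  priorityLevelConfiguration:
    name: workload-low                # built-in low-priority level
  matchingPrecedence: 8000            # high number = checked after most built-in schemas
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: default                 # illustrative subject
        namespace: capacity-workshop
    resourceRules:
    - verbs: ["list", "watch"]
      apiGroups: [""]
      resources: ["pods"]
      namespaces: ["*"]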
Part 5: Check Memory Overhead Impact
Now let’s see if memory overhead increased with the higher pod count:
oc adm top nodes -l node-role.kubernetes.io/worker
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-0-145-209.us-east-2.compute.internal 423m 5% 6201Mi 21%
ip-10-0-155-132.us-east-2.compute.internal 512m 6% 7932Mi 27%
ip-10-0-174-87.us-east-2.compute.internal 445m 5% 5845Mi 20%
Compare to Part 2 (before the test):
- Before: 13-19% memory usage at 35-42 pods/node
- After: 20-27% memory usage at 150-180 pods/node
Calculate overhead per pod:
NODE=$(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}')
MEM=$(oc adm top nodes "$NODE" --no-headers | awk '{print $4, $5}')
POD_COUNT=$(oc get pods -A --field-selector spec.nodeName="$NODE" --no-headers | wc -l)
echo "Node: $NODE"
echo "Memory usage: $MEM"
echo "Pod count: $POD_COUNT pods"
Node: ip-10-0-28-47.us-east-2.compute.internal
Memory usage: 7932Mi 27%
Pod count: 176 pods
Updated Overhead Calculation:

Overhead per pod: 532Mi ÷ 176 pods ≈ 3MB per pod

Why is this lower than Part 2? The density-test pods are tiny (16Mi each) and mostly idle, so they're not generating much CNI/CRI-O state. In production with active workloads, expect 10-30MB overhead per pod.

Key Finding: Kubernetes overhead is NOT linear with pod count—it depends on pod activity (network connections, volume mounts, etc.).
Part 6: Clean Up
Remove the density test deployment:
oc delete deployment density-test -n capacity-workshop
Optionally, revert maxPods back to 250 (if this was just a test):
oc delete kubeletconfig set-max-pods-500
Rolling Back maxPods: This will trigger another MCO rollout (10-15 minutes per node). Only do this if:
- You are finished experimenting and want the cluster back on the documented 250-pod default
- Your nodes are memory-constrained and the extra overhead headroom matters

For production: Leave maxPods at 500 if your nodes have 64GB+ RAM and you have many small pods.
Lab 4 Summary: Density Decisions
You’ve now experienced:
✅ maxPods configuration via KubeletConfig CRD
✅ Machine Config Operator rollout behavior
✅ Mass-scheduling impact on control plane API performance
✅ Memory overhead measurement at different pod densities
✅ Trade-offs between node density and operational complexity
The Density Decision Matrix
Use this matrix to choose your maxPods setting:
| Pod Count | Avg Pod Size | Node RAM | Recommended maxPods |
|---|---|---|---|
| <5000 | <100MB | 32GB | 250 (default) |
| <5000 | <50MB | 64GB | 500 |
| >5000 | <100MB | 64GB | 500 (and plan cluster split at 8000 pods) |
| >5000 | >200MB | 64GB | 250 (you won't hit 500 anyway due to memory) |
The Infrastructure Social Contract: Platform teams often resist increasing maxPods because a denser node means a bigger blast radius when it fails and more load on the control plane. Developers often push for higher maxPods because fewer, denser nodes mean lower infrastructure cost and less scheduling friction for small workloads.

The resolution: Accurate capacity planning requires collaboration between infra and dev teams. This lab proves the impact is measurable and manageable.
Key Takeaways
- maxPods default (250) is conservative but safe for most workloads
- Increasing to 500 makes sense for small pods (10-50MB) on large nodes (64GB+)
- Memory overhead is 10-30MB per pod depending on workload activity
- etcd 8GB limit forces cluster split at ~10,000 pods regardless of maxPods tuning
- Control plane API performance degrades with pod count—plan for 3-5× slower responses at 10K pods
- RHACM fleet management enables scaling beyond single-cluster limits
Next Steps
In Module 5: Fleet Observability with RHACM, you’ll enable Multi-Cluster Observability, customize metric allowlists, and build a unified capacity dashboard that shows pod density, etcd usage, and node requirements across your entire fleet.
The maxPods tuning from this module directly impacts the density metrics you’ll visualize in Module 5’s "God’s-Eye Dashboard."