Module 4: Core Performance Tuning with Performance Profiles
Module Overview
This module is the heart of the workshop, where you’ll apply node-level performance tuning using OpenShift’s Performance Profile Controller. You’ll configure CPU isolation, HugePages allocation, and real-time kernel settings to dramatically improve latency characteristics.
In this module, you will:

- Understand Performance Profiles and their components
- Configure CPU isolation for high-performance workloads
- Allocate and manage HugePages for reduced memory latency
- Apply real-time kernel tuning profiles
- Measure the performance improvements from your optimizations
Before starting this module, ensure you have completed:
- Module 1: Workshop Setup and Multi-Cluster Management
- Module 2: GitOps Deployment and Operator Installation
- Module 3: Baseline Performance Measurement and Analysis
- Established baseline performance metrics using kube-burner from Module 3
Understanding Performance Profiles
The Performance Profile Controller (PPC) is integrated into the Node Tuning Operator and provides a declarative way to configure multiple low-latency optimizations through a single custom resource.
A single Performance Profile manages:

- CPU Management: Isolate specific CPUs for high-performance workloads
- Memory Tuning: Configure HugePages and memory-related kernel parameters
- Kernel Tuning: Apply real-time kernel settings and tuned profiles
- NUMA Awareness: Optimize for Non-Uniform Memory Access architectures

Benefits of this approach:

- Single Configuration: One CR manages multiple complex tuning parameters
- Declarative: Version-controlled, repeatable configuration
- Node Pool Isolation: Apply tuning to specific worker nodes only
- Rolling Updates: Orchestrated updates with minimal disruption

An illustrative profile is shown below.
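To make these components concrete, here is a minimal, illustrative PerformanceProfile. All values (CPU ranges, HugePages count, node selector) are placeholders; the workshop scripts used later in this module generate values appropriate for your cluster, so do not apply this example as-is.

# Illustrative only -- the module04-create-performance-profile.sh script
# generates the real values for your cluster. Written to a file for review,
# not applied.
cat << 'EOF' > /tmp/example-performanceprofile.yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  cpu:
    reserved: "0-1"        # CPUs kept for the OS, kubelet, and system pods (placeholder)
    isolated: "2-3"        # CPUs dedicated to latency-sensitive workloads (placeholder)
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
      - size: "1G"
        count: 2           # placeholder -- size to your available memory
  realTimeKernel:
    enabled: true          # installs the real-time kernel on matching nodes
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""   # placeholder -- SNO targets master instead
  numa:
    topologyPolicy: "single-numa-node"
EOF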
Hands-on Exercise: Creating a Performance Profile
Step 1: Identify Target Nodes and Environment
First, we’ll identify your cluster architecture and configure the appropriate nodes for performance tuning.
- Switch to your target cluster context:
oc config use-context target-cluster
- Detect your cluster architecture:
# Check cluster architecture
TOTAL_NODES=$(oc get nodes --no-headers | wc -l)
WORKER_NODES=$(oc get nodes -l node-role.kubernetes.io/worker= --no-headers | wc -l)
MASTER_NODES=$(oc get nodes -l node-role.kubernetes.io/master= --no-headers | wc -l)

echo "=== Cluster Architecture Detection ==="
echo "Total nodes: $TOTAL_NODES"
echo "Master nodes: $MASTER_NODES"
echo "Worker nodes: $WORKER_NODES"

if [ $TOTAL_NODES -eq 1 ]; then
  echo "✅ Detected: Single Node OpenShift (SNO)"
  CLUSTER_TYPE="SNO"
  TARGET_NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
  echo "Target node: $TARGET_NODE"
elif [ $WORKER_NODES -gt 0 ]; then
  echo "✅ Detected: Multi-node cluster with dedicated workers"
  CLUSTER_TYPE="MULTI_NODE"
  TARGET_NODE=$(oc get nodes -l node-role.kubernetes.io/worker= -o jsonpath='{.items[0].metadata.name}')
  echo "Target worker node: $TARGET_NODE"
else
  echo "⚠️ Detected: Multi-node cluster without dedicated workers"
  CLUSTER_TYPE="MULTI_MASTER"
  TARGET_NODE=$(oc get nodes -l node-role.kubernetes.io/master= -o jsonpath='{.items[0].metadata.name}')
  echo "Target master node: $TARGET_NODE"
fi

# Save the detected configuration for later steps
echo "CLUSTER_TYPE=$CLUSTER_TYPE" > /tmp/cluster-config
echo "TARGET_NODE=$TARGET_NODE" >> /tmp/cluster-config
- Check node resources and CPU topology:
# Load cluster configuration
source /tmp/cluster-config

# Get detailed node information
echo "=== Node Information for $TARGET_NODE ==="
oc describe node $TARGET_NODE | grep -E "(Name:|Roles:|Capacity:|Allocatable:)" -A 10

# Check CPU information
echo "=== CPU Topology ==="
oc debug node/$TARGET_NODE -- chroot /host lscpu | grep -E "(CPU\(s\)|Thread|Core|Socket|NUMA)"
- Configure node labels based on cluster type:
# Load cluster configuration
source /tmp/cluster-config

if [ "$CLUSTER_TYPE" = "SNO" ]; then
  echo "=== SNO Configuration ==="
  echo "✅ Single Node OpenShift detected - no additional labeling needed"
  echo "Performance Profile will target: node-role.kubernetes.io/master"
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
  echo "=== Multi-Node Configuration ==="
  echo "🏷️ Labeling worker node for performance tuning..."

  # Label the worker node(s) for performance tuning
  oc label node $TARGET_NODE node-role.kubernetes.io/worker-rt= --overwrite

  # Verify the label
  echo "✅ Labeled nodes:"
  oc get nodes --show-labels | grep worker-rt
else
  echo "=== Multi-Master Configuration ==="
  echo "🏷️ Labeling master node for performance tuning..."

  # Label a master node for performance tuning (advanced scenario)
  oc label node $TARGET_NODE node-role.kubernetes.io/master-rt= --overwrite

  # Verify the label
  echo "✅ Labeled nodes:"
  oc get nodes --show-labels | grep master-rt
fi
Step 2: Create Machine Config Pool (Multi-Node Only)
For multi-node clusters, create a dedicated Machine Config Pool to isolate performance configuration changes. SNO clusters can skip this step.
- Create a Machine Config Pool based on cluster type:
# Load cluster configuration
source /tmp/cluster-config

if [ "$CLUSTER_TYPE" = "SNO" ]; then
  echo "=== SNO Configuration ==="
  echo "✅ Skipping Machine Config Pool creation for Single Node OpenShift"
  echo "📝 SNO uses existing 'master' Machine Config Pool"
  echo ""
  echo "Existing Machine Config Pools:"
  oc get mcp
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
  echo "=== Multi-Node Configuration ==="
  echo "🔧 Creating dedicated worker-rt Machine Config Pool..."

  cat << 'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-rt
  labels:
    machineconfiguration.openshift.io/role: worker-rt
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-rt]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-rt: ""
  paused: false
EOF

  echo "✅ Created worker-rt Machine Config Pool"
else
  echo "=== Multi-Master Configuration ==="
  echo "🔧 Creating dedicated master-rt Machine Config Pool..."

  cat << 'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: master-rt
  labels:
    machineconfiguration.openshift.io/role: master-rt
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [master, master-rt]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master-rt: ""
  paused: false
EOF

  echo "✅ Created master-rt Machine Config Pool"
fi
- Verify the Machine Config Pool status:
# Load cluster configuration
source /tmp/cluster-config

echo "=== Machine Config Pool Status ==="

if [ "$CLUSTER_TYPE" = "SNO" ]; then
  echo "📋 SNO uses existing pools:"
  oc get mcp
else
  echo "📋 All Machine Config Pools:"
  oc get mcp

  if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
    echo ""
    echo "📊 Worker-RT Pool Details:"
    oc get mcp worker-rt -o yaml | grep -A 5 -B 5 "readyMachineCount\|updatedMachineCount"
  else
    echo ""
    echo "📊 Master-RT Pool Details:"
    oc get mcp master-rt -o yaml | grep -A 5 -B 5 "readyMachineCount\|updatedMachineCount"
  fi
fi
Step 3: Create the Performance Profile
Now we’ll create a Performance Profile that configures CPU isolation, HugePages, and real-time kernel settings optimized for your cluster architecture.
- Determine optimal CPU allocation based on your cluster type:
# Run the CPU allocation calculator script
bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh
What This Script Does:
- Detects CPU count on the target node
- Calculates optimal CPU allocation based on cluster type (SNO, Multi-Node, Multi-Master)
- Validates CPU ranges to prevent configuration errors
- Saves configuration to /tmp/cluster-config for the next steps
Benefits of Using the Script:
- Error Prevention: Validates CPU ranges before saving
- Handles Edge Cases: Prevents division by zero and invalid ranges
- Clear Output: Shows allocation strategy and percentages
- Reusable: Can be run multiple times to recalculate
The script implements workshop-friendly conservative allocation that preserves cluster functionality while demonstrating performance benefits.
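For intuition about what a conservative split can look like, here is a rough sketch. The policy shown (reserve roughly a quarter of the CPUs, with a minimum of two, for the OS and kubelet, and isolate the rest) is a hypothetical illustration; the workshop script may use different ratios and performs additional validation.

# Hypothetical sketch of a conservative reserved/isolated split -- the
# workshop script's actual policy may differ. Read-only; nothing is applied.
source /tmp/cluster-config

TOTAL_CPUS=$(oc debug node/$TARGET_NODE -- chroot /host nproc 2>/dev/null)

# Reserve about a quarter of the CPUs (at least 2) for system daemons;
# isolate the remainder for latency-sensitive workloads.
RESERVED_COUNT=$(( TOTAL_CPUS / 4 ))
[ "$RESERVED_COUNT" -lt 2 ] && RESERVED_COUNT=2

RESERVED_CPUS="0-$(( RESERVED_COUNT - 1 ))"
ISOLATED_CPUS="$RESERVED_COUNT-$(( TOTAL_CPUS - 1 ))"

echo "Total CPUs:    $TOTAL_CPUS"
echo "Reserved CPUs: $RESERVED_CPUS"
echo "Isolated CPUs: $ISOLATED_CPUS"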
- Create the Performance Profile optimized for your cluster architecture:
# Run the Performance Profile creation script
bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh
What This Script Does:
- Loads and validates CPU allocation from the previous step
- Validates CPU ranges to prevent an invalid PerformanceProfile
- Determines HugePages allocation based on cluster type
- Creates the PerformanceProfile with the proper node selector
- Shows the configuration before applying (requires confirmation)
Benefits of Using the Script:
- Prevents Errors: Validates CPU ranges before creating the PerformanceProfile
- Interactive: Shows the configuration and asks for confirmation
- Clear Output: Displays a profile summary and next steps
- Error Handling: Provides helpful error messages if creation fails
The script will ask for confirmation before applying the PerformanceProfile.
If you see an error like invalid range "0-2", the CPU allocation calculation failed. Run the CPU allocation calculator script again:

bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh

Then retry the Performance Profile creation.
Step 4: Monitor the Performance Profile Application
The Performance Profile will trigger a rolling update of your nodes. This process includes installing the real-time kernel and applying all the specified optimizations. The monitoring approach varies by cluster architecture.
- Monitor Machine Config Pool status based on cluster type:
# Load cluster configuration source /tmp/cluster-config echo "=== Monitoring Performance Profile Application ===" echo "Cluster type: $CLUSTER_TYPE" echo "Profile name: $PROFILE_NAME" echo "Target node: $TARGET_NODE" if [ "$CLUSTER_TYPE" = "SNO" ]; then echo "" echo "🔍 SNO Monitoring Strategy:" echo " - Monitor 'master' Machine Config Pool" echo " - Single node will reboot (cluster temporarily unavailable)" echo " - Expect 5-15 minutes for RT kernel installation" echo "" echo "📊 Current master MCP status:" oc get mcp master echo "" echo "⏱️ Starting continuous monitoring (Ctrl+C to stop):" echo "watch 'oc get mcp master; echo; oc get nodes'" elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then echo "" echo "🔍 Multi-Node Monitoring Strategy:" echo " - Monitor 'worker-rt' Machine Config Pool" echo " - Worker nodes will reboot sequentially" echo " - Control plane remains available" echo "" echo "📊 Current worker-rt MCP status:" oc get mcp worker-rt echo "" echo "⏱️ Starting continuous monitoring (Ctrl+C to stop):" echo "watch 'oc get mcp worker-rt; echo; oc get nodes'" else echo "" echo "🔍 Multi-Master Monitoring Strategy:" echo " - Monitor 'master-rt' Machine Config Pool" echo " - Master nodes will reboot sequentially" echo " - Cluster maintains quorum during updates" echo "" echo "📊 Current master-rt MCP status:" oc get mcp master-rt echo "" echo "⏱️ Starting continuous monitoring (Ctrl+C to stop):" echo "watch 'oc get mcp master-rt; echo; oc get nodes'" fi
- Monitor node updates in detail:
# Load cluster configuration
source /tmp/cluster-config

echo "=== Detailed Node Update Monitoring ==="

# Check machine config daemon status
echo "📋 Machine Config Daemon Pods:"
oc get pods -n openshift-machine-config-operator | grep daemon

echo ""
echo "📋 Recent Node Events:"
oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -10

echo ""
echo "📋 Machine Config Status:"
if [ "$CLUSTER_TYPE" = "SNO" ]; then
  oc describe mcp master | grep -A 10 -B 5 "Conditions:"
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
  oc describe mcp worker-rt | grep -A 10 -B 5 "Conditions:"
else
  oc describe mcp master-rt | grep -A 10 -B 5 "Conditions:"
fi
- Wait for the update to complete:
# Load cluster configuration
source /tmp/cluster-config

echo "=== Waiting for Performance Profile Application ==="
echo "This process may take 10-20 minutes depending on your environment"

if [ "$CLUSTER_TYPE" = "SNO" ]; then
  echo ""
  echo "⏳ SNO Update Process:"
  echo "   1. Machine config generation"
  echo "   2. Node cordoning and draining"
  echo "   3. RT kernel installation and reboot"
  echo "   4. Node rejoin and ready state"
  echo ""
  echo "🔄 Waiting for master MCP to be updated..."
  oc wait --for=condition=Updated mcp/master --timeout=1200s

  echo "🔄 Waiting for node to be ready after reboot..."
  oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
  echo ""
  echo "⏳ Multi-Node Update Process:"
  echo "   1. Worker-RT MCP configuration"
  echo "   2. Sequential worker node updates"
  echo "   3. RT kernel installation per node"
  echo "   4. Node rejoin and ready state"
  echo ""
  echo "🔄 Waiting for worker-rt MCP to be updated..."
  oc wait --for=condition=Updated mcp/worker-rt --timeout=1200s

  echo "🔄 Waiting for target node to be ready..."
  oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
else
  echo ""
  echo "⏳ Multi-Master Update Process:"
  echo "   1. Master-RT MCP configuration"
  echo "   2. Sequential master node updates"
  echo "   3. RT kernel installation per node"
  echo "   4. Node rejoin and ready state"
  echo ""
  echo "🔄 Waiting for master-rt MCP to be updated..."
  oc wait --for=condition=Updated mcp/master-rt --timeout=1200s

  echo "🔄 Waiting for target node to be ready..."
  oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
fi

echo ""
echo "✅ Performance Profile application completed!"
echo "📊 Final status:"
oc get nodes
oc get mcp

echo ""
echo "🧪 Testing cluster functionality after performance tuning..."

# Create a simple test pod to verify the cluster is still functional
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: post-tuning-test
  namespace: default
  labels:
    app: post-tuning-test
spec:
  containers:
  - name: test-container
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep"]
    args: ["30"]
    resources:
      requests:
        memory: "32Mi"
        cpu: "50m"
      limits:
        memory: "64Mi"
        cpu: "100m"
  restartPolicy: Never
EOF

# Wait for pod to be scheduled and running
echo "⏱️ Waiting for test pod to verify cluster functionality..."
if oc wait --for=condition=Ready pod/post-tuning-test --timeout=120s -n default 2>/dev/null; then
  echo "  ✅ Cluster is functional after performance tuning!"
  echo "  📍 Test pod scheduled successfully"
  oc delete pod post-tuning-test -n default --ignore-not-found=true
else
  echo "  ⚠️ Cluster may have scheduling issues after performance tuning"
  echo "  💡 Consider using the revert script if problems persist"
  echo "  🔍 Check pod status: oc describe pod post-tuning-test -n default"
fi
Step 5: Verify Performance Profile Effects
Once the update is complete, verify that all the performance optimizations have been applied correctly across your cluster architecture.
- Run comprehensive verification using the Python health check script:
echo "=== Performance Profile Verification ===" echo "Running comprehensive cluster health check..." echo "" # Use Python script for thorough verification python3 ~/low-latency-performance-workshop/scripts/module04-cluster-health-check.py echo "" echo "✅ Comprehensive verification completed!" echo "" echo "💡 The health check script validates:" echo " - Cluster architecture detection" echo " - Performance Profile status" echo " - Real-time kernel installation" echo " - CPU isolation configuration" echo " - Pod scheduling functionality"
- Get a quick performance tuning summary:
echo "=== Performance Tuning Summary ===" echo "" # Get color-coded summary of current performance settings python3 ~/low-latency-performance-workshop/scripts/module04-performance-summary.py echo "" echo "💡 This summary shows:" echo " - Current CPU allocation strategy" echo " - Performance vs stability balance" echo " - Recommendations for optimization"
Detailed Performance Tuning Validation
The workshop provides additional scripts to help with CPU allocation, Performance Profile creation, and validation.
- CPU Allocation Calculator - Calculate optimal CPU allocation:
# Calculate CPU allocation based on cluster type
bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh
This script:
- Detects CPU count on the target node
- Calculates optimal reserved/isolated CPU allocation
- Validates CPU ranges to prevent errors
- Saves the configuration for Performance Profile creation
- Performance Profile Creator - Create a validated Performance Profile:
# Create Performance Profile with validated configuration
bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh
This script:
- Validates the CPU allocation from the previous step
- Prevents invalid CPU range errors
- Shows the configuration before applying
- Requires confirmation for safety
- Performance Tuning Validator - Comprehensive validation of the Performance Profile:
# Validate Performance Profile configuration
python3 ~/low-latency-performance-workshop/scripts/module04-tuning-validator.py

# Skip educational explanations for automation
python3 ~/low-latency-performance-workshop/scripts/module04-tuning-validator.py --skip-explanation
This script validates:
- Performance Profile existence and configuration
- Machine Config Pool (MCP) status and readiness
- Real-time kernel installation on target nodes
- Overall tuning configuration health
- CPU Isolation Checker - Detailed CPU allocation analysis:
# Check CPU isolation configuration
python3 ~/low-latency-performance-workshop/scripts/module04-cpu-isolation-checker.py

# Specify a custom Performance Profile
python3 ~/low-latency-performance-workshop/scripts/module04-cpu-isolation-checker.py \
    --profile my-performance-profile
This script provides:
- Visual representation of CPU allocation
- Reserved vs isolated CPU validation
- CPU allocation strategy explanation
- Best practices for CPU isolation
- Configuration recommendations
- HugePages Validator - HugePages configuration verification:
# Validate HugePages configuration
python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py

# Check specific Performance Profile
python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py \
    --profile my-performance-profile
This script validates:
- HugePages configuration in the Performance Profile
- Node HugePages allocation and availability
- HugePages benefits and use cases
- How to use HugePages in pod specifications
- Total HugePages memory allocation
These validation scripts are educational tools that explain each setting as they check it. Run them after applying a Performance Profile to confirm everything is configured correctly. A minimal example of consuming HugePages from a pod is shown below.
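As a companion to the HugePages validator, here is a minimal, illustrative pod spec that consumes HugePages. The hugepages-1Gi resource name and the mount path are assumptions for this example; match the page size to whatever your Performance Profile actually configured (for 2Mi pages, request hugepages-2Mi instead).

# Illustrative only -- adjust the HugePages size to match your profile.
# Written to a file for review rather than applied directly.
cat << 'EOF' > /tmp/hugepages-example-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep"]
    args: ["3600"]
    volumeMounts:
    - mountPath: /dev/hugepages      # hugetlbfs mount visible to the application
      name: hugepage
    resources:
      requests:
        hugepages-1Gi: 1Gi           # must match a size configured in the profile
        memory: "64Mi"
        cpu: "250m"
      limits:
        hugepages-1Gi: 1Gi           # HugePages requests and limits must be equal
        memory: "64Mi"
        cpu: "250m"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages              # backs the volume with pre-allocated HugePages
EOF
echo "Review /tmp/hugepages-example-pod.yaml, then apply with: oc apply -f /tmp/hugepages-example-pod.yaml"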
Performance Testing: Measuring Improvements
Now let’s run the same baseline test to measure the performance improvements from our optimizations.
Step 6: Re-run Performance Tests
- Re-run the kube-burner performance test on the optimized cluster:
cd ~/kube-burner-configs

# Load cluster configuration
source /tmp/cluster-config

echo "=== Creating Tuned Performance Test Configuration ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Profile name: $PROFILE_NAME"
echo "Target node: $TARGET_NODE"

# Create a new test configuration for the tuned cluster
cat > tuned-config.yml << EOF
global:
  measurements:
    - name: podLatency
      thresholds:
        - conditionType: Ready
          metric: P99
          threshold: 15000ms  # Expect better performance after tuning

metricsEndpoints:
  - indexer:
      type: local
      metricsDirectory: collected-metrics-tuned

jobs:
  - name: tuned-workload
    jobType: create
    jobIterations: 20
    namespace: tuned-workload
    namespacedIterations: true
    cleanup: false
    podWait: false
    waitWhenFinished: true
    verifyObjects: true
    errorOnVerify: false
    objects:
      - objectTemplate: tuned-pod.yml
        replicas: 5
        inputVars:
          containerImage: registry.redhat.io/ubi8/ubi:latest
EOF

# Create a tuned pod template with appropriate node selector
if [ "$CLUSTER_TYPE" = "SNO" ]; then
  NODE_SELECTOR_YAML='nodeSelector:
    node-role.kubernetes.io/master: ""'
  echo "📝 SNO: Using master node selector for pod placement"
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
  NODE_SELECTOR_YAML='nodeSelector:
    node-role.kubernetes.io/worker-rt: ""'
  echo "📝 Multi-Node: Using worker-rt node selector for pod placement"
else
  NODE_SELECTOR_YAML='nodeSelector:
    node-role.kubernetes.io/master-rt: ""'
  echo "📝 Multi-Master: Using master-rt node selector for pod placement"
fi

cat > tuned-pod.yml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: tuned-pod-{{.Iteration}}-{{.Replica}}
  labels:
    app: tuned-test
    iteration: "{{.Iteration}}"
    cluster-type: "$CLUSTER_TYPE"
spec:
  $NODE_SELECTOR_YAML
  containers:
  - name: tuned-container
    image: {{.containerImage}}
    command: ["sleep"]
    args: ["300"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "128Mi"
        cpu: "200m"
  restartPolicy: Never
EOF

echo ""
echo "🚀 Running Performance Test on Tuned Cluster"
echo "   - Test configuration: tuned-config.yml"
echo "   - Pod template: tuned-pod.yml"
echo "   - Target: $CLUSTER_TYPE cluster with $PROFILE_NAME"
echo ""

# Run the performance test
kube-burner init -c tuned-config.yml --log-level=info

echo ""
echo "✅ Tuned performance test completed!"
echo "📊 Results stored in: collected-metrics-tuned/"
- Analyze the tuned performance results using the Python script:
cd ~/kube-burner-configs

echo "🔍 Analyzing tuned performance results..."
echo ""

# Use Python script for clean, color-coded analysis
python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py --single collected-metrics-tuned

echo ""
echo "✅ Tuned performance analysis completed!"
echo ""
echo "💡 The enhanced analysis provides:"
echo "   📚 Educational context explaining what each metric means"
echo "   🎯 Performance explanations (why some metrics are slower)"
echo "   🚀 Color-coded results: Excellent/Good/Needs Attention"
echo "   📋 Suggested next steps based on your results"
- Compare the results with your baseline using the Python analysis:
cd ~/kube-burner-configs

echo "📊 Comparing baseline vs tuned performance..."
echo ""

# Use module-specific analysis script for clean Module 4 results
echo "🎯 Module 4 Focused Analysis (Container Performance Only)..."
python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 4

echo ""
echo "📄 Generating Module 4 Performance Report..."
echo "   🎯 Focus: Container performance optimization (baseline vs tuned)"
echo "   📊 Scope: CPU isolation and HugePages impact on pod startup"
echo ""

# Generate Module 4 specific markdown report (baseline vs tuned only)
REPORT_FILE="module4-performance-comparison-$(date +%Y%m%d-%H%M).md"
python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
    --baseline collected-metrics \
    --tuned collected-metrics-tuned \
    --report "$REPORT_FILE"

echo ""
echo "💡 Module 4 Analysis Scope:"
echo "   ✅ Baseline container performance (from Module 3)"
echo "   ✅ Tuned container performance (from Module 4)"
echo "   ⚠️ Note: If VMI data exists from Module 5, it will also be shown"
echo "   🎯 Focus on the 'Performance Comparison' section for Module 4 results"
echo "   ℹ️ Comprehensive analysis across all modules happens in Module 6"
echo ""
echo "📚 How to Read the Analysis:"
echo "   1. Individual test sections show raw performance data"
echo "   2. 'Performance Comparison' section shows Module 4 improvements"
echo "   3. VMI data (if shown) is for reference - focus on container metrics"
echo ""
echo "📊 Performance Comparison Summary:"
echo "=================================="

if [ -f "$REPORT_FILE" ]; then
  # Display key sections of the report
  head -30 "$REPORT_FILE"
  echo ""
  echo "📄 Full report available at: $REPORT_FILE"
else
  echo "⚠️ Report generation failed - check if both baseline and tuned metrics exist"
fi

echo ""
echo "✅ Performance comparison completed!"
echo ""
echo "💡 The comparison analysis explains:"
echo "   📊 What P99/P95/P50 improvements mean for your workloads"
echo "   🎯 Why scheduling became instant (0ms) with CPU isolation"
echo "   ⚖️ Why container operations may be slower (expected trade-off)"
echo "   🏆 Overall assessment of your performance tuning effectiveness"
Step 7: Validate kube-burner Test Pods and Cluster Stability
Before proceeding, let’s validate that our performance tuning doesn’t interfere with normal cluster operations and that kube-burner can successfully create and schedule pods.
- Check the status of the kube-burner test pods:
# Load cluster configuration source /tmp/cluster-config echo "=== Validating kube-burner Test Pod Scheduling ===" echo "Cluster type: $CLUSTER_TYPE" echo "Target node: $TARGET_NODE" # Check if tuned test pods were created successfully echo "" echo "📋 Tuned Test Pod Status:" TUNED_PODS=$(oc get pods -A -l app=tuned-test --no-headers 2>/dev/null | wc -l) if [ $TUNED_PODS -gt 0 ]; then echo " ✅ Found $TUNED_PODS tuned test pods" echo "" echo " 📊 Pod Distribution by Node:" oc get pods -A -l app=tuned-test -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sed 's/^/ /' echo "" echo " 📊 Pod Status Summary:" oc get pods -A -l app=tuned-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/ /' # Check for any failed pods FAILED_PODS=$(oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | wc -l) if [ $FAILED_PODS -gt 0 ]; then echo "" echo " ⚠️ Found $FAILED_PODS pods not in Running/Completed state:" oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | sed 's/^/ /' fi else echo " ⚠️ No tuned test pods found - this may indicate scheduling issues" fi # Check baseline test pods as well echo "" echo "📋 Baseline Test Pod Status:" BASELINE_PODS=$(oc get pods -A -l app=baseline-test --no-headers 2>/dev/null | wc -l) if [ $BASELINE_PODS -gt 0 ]; then echo " ✅ Found $BASELINE_PODS baseline test pods" echo " 📊 Baseline Pod Status Summary:" oc get pods -A -l app=baseline-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/ /' else echo " ℹ️ No baseline test pods found (may have been cleaned up)" fi
- Test cluster responsiveness and pod scheduling:
# Load cluster configuration
source /tmp/cluster-config

echo "=== Testing Cluster Responsiveness ==="

# Create a simple test pod to verify scheduling works
echo "🧪 Creating test pod to verify cluster functionality..."

cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cluster-health-test
  namespace: default
  labels:
    app: cluster-health-test
spec:
  containers:
  - name: test-container
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep"]
    args: ["60"]
    resources:
      requests:
        memory: "32Mi"
        cpu: "50m"
      limits:
        memory: "64Mi"
        cpu: "100m"
  restartPolicy: Never
EOF

# Wait for pod to be scheduled and running
echo "⏱️ Waiting for test pod to start..."
oc wait --for=condition=Ready pod/cluster-health-test --timeout=60s -n default

if [ $? -eq 0 ]; then
  echo "  ✅ Test pod started successfully - cluster is responsive"

  # Check which node it was scheduled on
  TEST_POD_NODE=$(oc get pod cluster-health-test -n default -o jsonpath='{.spec.nodeName}')
  echo "  📍 Test pod scheduled on node: $TEST_POD_NODE"

  # Clean up test pod
  oc delete pod cluster-health-test -n default --ignore-not-found=true
else
  echo "  ❌ Test pod failed to start - cluster may have scheduling issues"
  echo "  🔍 Pod events:"
  oc describe pod cluster-health-test -n default | grep -A 10 "Events:" | sed 's/^/    /'
fi
Step 8: Optional - Revert Performance Tuning for Workshop Stability
If you experience any issues with cluster stability or want to continue with other workshop modules without the aggressive performance tuning, you can revert the changes.
- Create a revert script for easy cleanup:
# Load cluster configuration source /tmp/cluster-config echo "=== Creating Performance Tuning Revert Script ===" # Create revert script cat > ~/revert-performance-tuning.sh << 'EOF' #!/bin/bash # Load cluster configuration if [ -f /tmp/cluster-config ]; then source /tmp/cluster-config echo "=== Reverting Performance Tuning ===" echo "Cluster type: $CLUSTER_TYPE" echo "Profile name: $PROFILE_NAME" echo "Target node: $TARGET_NODE" else echo "❌ Cluster configuration not found. Cannot revert automatically." echo "💡 You can manually delete performance profiles with:" echo " oc get performanceprofile" echo " oc delete performanceprofile <profile-name>" exit 1 fi # Delete the Performance Profile echo "" echo "🗑️ Removing Performance Profile: $PROFILE_NAME" if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then oc delete performanceprofile $PROFILE_NAME echo " ✅ Performance Profile deleted" else echo " ℹ️ Performance Profile not found (may already be deleted)" fi # Remove custom Machine Config Pool (if created) if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then echo "" echo "🗑️ Removing worker-rt Machine Config Pool..." if oc get mcp worker-rt >/dev/null 2>&1; then oc delete mcp worker-rt echo " ✅ worker-rt Machine Config Pool deleted" else echo " ℹ️ worker-rt Machine Config Pool not found" fi # Remove worker-rt label from nodes echo "" echo "🏷️ Removing worker-rt labels from nodes..." oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt- --ignore-not-found=true elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then echo "" echo "🗑️ Removing master-rt Machine Config Pool..." if oc get mcp master-rt >/dev/null 2>&1; then oc delete mcp master-rt echo " ✅ master-rt Machine Config Pool deleted" else echo " ℹ️ master-rt Machine Config Pool not found" fi # Remove master-rt label from nodes echo "" echo "🏷️ Removing master-rt labels from nodes..." oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt- --ignore-not-found=true fi echo "" echo "⏳ Waiting for nodes to revert to standard kernel..." echo " 💡 This process will take 10-15 minutes as nodes reboot" echo " 📊 Monitor progress with: watch 'oc get nodes; oc get mcp'" # Wait for machine config pools to be updated if [ "$CLUSTER_TYPE" = "SNO" ]; then echo " 🔄 Waiting for master MCP to update..." oc wait --for=condition=Updated mcp/master --timeout=1200s elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then echo " 🔄 Waiting for worker MCP to update..." oc wait --for=condition=Updated mcp/worker --timeout=1200s else echo " 🔄 Waiting for master MCP to update..." oc wait --for=condition=Updated mcp/master --timeout=1200s fi echo "" echo "✅ Performance tuning revert completed!" echo "🔍 Verify with: oc debug node/$TARGET_NODE -- chroot /host uname -r" echo " (Should show standard kernel without 'rt')" EOF chmod +x ~/revert-performance-tuning.sh echo "✅ Revert script created: ~/revert-performance-tuning.sh" echo "" echo "💡 To revert performance tuning later, run:" echo " ~/revert-performance-tuning.sh" echo "" echo "⚠️ Note: Reverting will cause nodes to reboot back to standard kernel"
- Optional: Run the revert script if needed:
echo "=== Performance Tuning Revert Decision ===" echo "" echo "🤔 Do you want to revert the performance tuning now?" echo "" echo "✅ Keep Performance Tuning if:" echo " - Cluster is stable and responsive" echo " - Test pods are scheduling successfully" echo " - You want to continue with optimized performance" echo "" echo "🔄 Revert Performance Tuning if:" echo " - Experiencing cluster stability issues" echo " - Pods are failing to schedule" echo " - Want to continue workshop with standard settings" echo "" echo "💡 You can always re-apply performance tuning later by re-running this module" echo "" echo "To revert now, run: ~/revert-performance-tuning.sh" echo "To keep current settings, continue to the next step"
Expected Improvements
With proper performance tuning, you should see significant improvements:
- Pod Creation P99: 50-70% reduction in latency
- Pod Creation P95: 40-60% reduction in latency
- Consistency: Much lower variance between P50 and P99
- Jitter Reduction: More predictable response times

These gains come from the combined effect of the optimizations:

- CPU Isolation: Eliminates interference from system processes
- Real-time Kernel: Provides deterministic scheduling
- HugePages: Reduces memory management overhead
- NUMA Optimization: Ensures local memory access
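If you want to sanity-check the percentages yourself, the reduction is simply (baseline - tuned) / baseline. A quick sketch with hypothetical P99 values:

# Hypothetical numbers -- substitute the P99 values (in milliseconds) from
# your own baseline and tuned kube-burner runs.
BASELINE_P99=12000
TUNED_P99=4500

# Percentage reduction = (baseline - tuned) / baseline * 100
awk -v b="$BASELINE_P99" -v t="$TUNED_P99" \
    'BEGIN { printf "P99 reduction: %.1f%%\n", (b - t) / b * 100 }'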
Troubleshooting Common Issues
This section provides architecture-specific troubleshooting guidance for common Performance Profile issues.
Architecture-Specific Troubleshooting
If nodes don’t start updating after Performance Profile creation:
# Load cluster configuration
source /tmp/cluster-config
echo "=== Node Update Troubleshooting ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Target node: $TARGET_NODE"
if [ "$CLUSTER_TYPE" = "SNO" ]; then
echo ""
echo "🔍 SNO Troubleshooting:"
echo " - Check master Machine Config Pool status"
oc describe mcp master | grep -A 10 -B 5 "Conditions:"
echo ""
echo " - Check for conflicting machine configs:"
oc get mc | grep master
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
echo ""
echo "🔍 Multi-Node Troubleshooting:"
echo " - Check worker-rt Machine Config Pool status"
oc describe mcp worker-rt | grep -A 10 -B 5 "Conditions:"
echo ""
echo " - Check for conflicting machine configs:"
oc get mc | grep worker-rt
else
echo ""
echo "🔍 Multi-Master Troubleshooting:"
echo " - Check master-rt Machine Config Pool status"
oc describe mcp master-rt | grep -A 10 -B 5 "Conditions:"
echo ""
echo " - Check for conflicting machine configs:"
oc get mc | grep master-rt
fi
echo ""
echo "📋 Performance Profile Status:"
oc describe performanceprofile $PROFILE_NAME | grep -A 10 -B 5 "Status:"
If the RT kernel fails to install:
# Load cluster configuration
source /tmp/cluster-config
echo "=== Real-Time Kernel Troubleshooting ==="
echo "Target node: $TARGET_NODE"
# Check node events for errors
echo "📋 Recent Node Events:"
oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -15
echo ""
echo "📋 Machine Config Daemon Logs:"
MCD_POD=$(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$MCD_POD" ]; then
echo " MCD Pod: $MCD_POD"
oc logs -n openshift-machine-config-operator $MCD_POD --tail=20
else
echo " ❌ No MCD pod found for node $TARGET_NODE"
fi
echo ""
echo "📋 RT Kernel Package Availability:"
oc debug node/$TARGET_NODE -- chroot /host sh -c 'yum list available | grep kernel-rt || dnf list available | grep kernel-rt' 2>/dev/null || echo " Unable to check RT kernel packages"
echo ""
echo "📋 Current Kernel Information:"
oc debug node/$TARGET_NODE -- chroot /host uname -a 2>/dev/null || echo " Unable to check current kernel"
If HugePages aren’t configured correctly:
# Load cluster configuration
source /tmp/cluster-config
echo "=== HugePages Troubleshooting ==="
echo "Target node: $TARGET_NODE"
# Check available memory
echo "📋 Memory Information:"
oc debug node/$TARGET_NODE -- chroot /host free -h 2>/dev/null || echo " Unable to check memory"
echo ""
echo "📋 HugePages Configuration:"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/meminfo 2>/dev/null | grep -i huge || echo " No HugePages information found"
echo ""
echo "📋 HugePages Mount Points:"
oc debug node/$TARGET_NODE -- chroot /host mount 2>/dev/null | grep huge || echo " No HugePages mount points found"
echo ""
echo "📋 Kernel Command Line (HugePages args):"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/cmdline 2>/dev/null | grep -o 'hugepages[^[:space:]]*' || echo " No HugePages kernel arguments found"
echo ""
echo "📋 Performance Profile HugePages Spec:"
oc get performanceprofile $PROFILE_NAME -o jsonpath='{.spec.hugepages}' | jq '.' 2>/dev/null || echo " Unable to read Performance Profile HugePages spec"
If the Performance Profile exists but optimizations aren’t applied:
# Load cluster configuration
source /tmp/cluster-config
echo "=== Performance Profile Application Troubleshooting ==="
# Check Performance Profile status
echo "📋 Performance Profile Status:"
oc get performanceprofile $PROFILE_NAME -o yaml | grep -A 20 "status:" || echo " No status information available"
echo ""
echo "📋 Node Tuning Operator Status:"
oc get pods -n openshift-cluster-node-tuning-operator
echo ""
echo "📋 TuneD Daemon Status:"
TUNED_POD=$(oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$TUNED_POD" ]; then
echo " TuneD Pod: $TUNED_POD"
oc logs -n openshift-cluster-node-tuning-operator $TUNED_POD --tail=10
else
echo " ❌ No TuneD pod found for node $TARGET_NODE"
fi
echo ""
echo "📋 Generated TuneD Profiles:"
oc get tuned -n openshift-cluster-node-tuning-operator
Module Summary
In this module, you have successfully implemented architecture-aware performance tuning:
✅ Detected your cluster architecture (SNO, Multi-Node, or Multi-Master)
✅ Created an optimized Performance Profile tailored to your environment
✅ Configured CPU isolation appropriate for your cluster’s control plane needs
✅ Allocated HugePages sized correctly for your available resources
✅ Applied real-time kernel settings for deterministic scheduling
✅ Verified all performance optimizations are working correctly
✅ Measured significant performance improvements through comparative testing
Single Node OpenShift (SNO):

- Conservative CPU allocation preserving control plane stability
- Optimized HugePages configuration for resource-constrained environments
- Master node targeting with appropriate performance isolation

Multi-Node Clusters:

- Aggressive CPU isolation on dedicated worker nodes
- Maximum performance optimization without control plane impact
- Dedicated Machine Config Pool for isolated performance tuning

Multi-Master Clusters:

- Balanced CPU allocation for control plane and workload performance
- Strategic node selection for performance optimization
- Maintained cluster stability during rolling updates
Key takeaways:

- Performance Profiles can be tailored to different cluster architectures
- CPU isolation strategies must account for control plane requirements
- Real-time kernels provide predictable, low-latency scheduling across all architectures
- HugePages allocation should be sized appropriately for available resources
- Proper architecture-aware tuning can achieve 50-70% latency improvements
- SNO environments can achieve exceptional performance due to their simplified architecture
Based on your cluster type, you should observe:

- Pod Creation Latency: 50-70% reduction in P99 times
- Consistency: Dramatically reduced variance between P50 and P99
- Jitter: More predictable and deterministic response times
- Resource Utilization: Optimized CPU and memory usage patterns
Optional: Reverting Performance Tuning for Workshop Continuation
If you need to revert the performance tuning to continue with other workshop modules or if you experience any cluster stability issues, you can easily remove the performance optimizations.
Creating a Revert Script
- Create an automated revert script:
# Load cluster configuration source /tmp/cluster-config echo "=== Creating Performance Tuning Revert Script ===" # Create revert script cat > ~/revert-performance-tuning.sh << 'EOF' #!/bin/bash # Load cluster configuration if [ -f /tmp/cluster-config ]; then source /tmp/cluster-config echo "=== Reverting Performance Tuning ===" echo "Cluster type: $CLUSTER_TYPE" echo "Profile name: $PROFILE_NAME" echo "Target node: $TARGET_NODE" else echo "❌ Cluster configuration not found. Attempting manual cleanup..." PROFILE_NAME=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) if [ -z "$PROFILE_NAME" ]; then echo "No Performance Profiles found to delete." exit 0 fi fi # Delete the Performance Profile echo "" echo "🗑️ Removing Performance Profile: $PROFILE_NAME" if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then oc delete performanceprofile $PROFILE_NAME echo " ✅ Performance Profile deleted" else echo " ℹ️ Performance Profile not found (may already be deleted)" fi # Remove custom Machine Config Pool and labels (if created) if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then echo "" echo "🗑️ Cleaning up worker-rt configuration..." oc delete mcp worker-rt --ignore-not-found=true oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt- --ignore-not-found=true echo " ✅ Worker-rt configuration cleaned up" elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then echo "" echo "🗑️ Cleaning up master-rt configuration..." oc delete mcp master-rt --ignore-not-found=true oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt- --ignore-not-found=true echo " ✅ Master-rt configuration cleaned up" fi echo "" echo "⏳ Waiting for nodes to revert to standard kernel..." echo " 💡 This process will take 10-15 minutes as nodes reboot" echo " 📊 Monitor progress with: watch 'oc get nodes; oc get mcp'" # Wait for machine config pools to be updated if [ "$CLUSTER_TYPE" = "SNO" ] || [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then echo " 🔄 Waiting for master MCP to update..." oc wait --for=condition=Updated mcp/master --timeout=1200s elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then echo " 🔄 Waiting for worker MCP to update..." oc wait --for=condition=Updated mcp/worker --timeout=1200s fi echo "" echo "✅ Performance tuning revert completed!" echo "🔍 Verify standard kernel with:" echo " oc debug node/\$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- chroot /host uname -r" echo " (Should show standard kernel without 'rt')" EOF chmod +x ~/revert-performance-tuning.sh echo "✅ Revert script created: ~/revert-performance-tuning.sh" echo "" echo "💡 To revert performance tuning at any time, run:" echo " ~/revert-performance-tuning.sh" echo "" echo "⚠️ Note: Reverting will cause nodes to reboot back to standard kernel"
When to Use the Revert Script
Revert the performance tuning if you are:

- Seeing pods failing to schedule or start
- Experiencing an unresponsive cluster
- Hitting high resource contention
- Continuing with other workshop modules that require standard settings

Keep the performance tuning if:

- The cluster is stable and responsive
- Test pods are scheduling successfully
- You want to continue with performance-optimized settings
- You plan to proceed directly to Module 5 (Virtualization)
Manual Cleanup (Alternative)
If the automated script doesn’t work, you can manually clean up:
# Manual cleanup commands
echo "=== Manual Performance Tuning Cleanup ==="
# List and delete performance profiles
echo "📋 Current Performance Profiles:"
oc get performanceprofile
echo ""
echo "🗑️ Delete Performance Profile:"
echo "oc delete performanceprofile <profile-name>"
# List and delete custom MCPs
echo ""
echo "📋 Current Machine Config Pools:"
oc get mcp
echo ""
echo "🗑️ Delete custom MCP (if created):"
echo "oc delete mcp worker-rt # or master-rt"
echo ""
echo "💡 After deletion, nodes will automatically reboot to standard kernel"
In Module 5, you will learn how to apply these performance optimizations to OpenShift Virtualization, creating high-performance virtual machines that leverage your tuned infrastructure to achieve near bare-metal latency characteristics. The architecture-aware approach you’ve learned will be essential for optimizing VM placement and resource allocation.
Workshop Flexibility: You can proceed to Module 5 either with or without the performance tuning active. The virtualization module will adapt to your current cluster configuration.