Module 4: Core Performance Tuning with Performance Profiles
Module Overview
This module is the heart of the workshop, where you’ll apply node-level performance tuning using OpenShift’s Performance Profile Controller. You’ll configure CPU isolation, HugePages allocation, and real-time kernel settings to dramatically improve latency characteristics.
- Understand Performance Profiles and their components
- Configure CPU isolation for high-performance workloads
- Allocate and manage HugePages for reduced memory latency
- Apply real-time kernel tuning profiles
- Measure the performance improvements from your optimizations
Before starting this module, ensure you have completed:
- Module 1: Low-Latency Performance Fundamentals
- Module 2: Environment Setup and Verification
- Module 3: Baseline Performance Measurement and Analysis
- Established baseline performance metrics using kube-burner from Module 3
Understanding Performance Profiles
The Performance Profile Controller (PPC) is integrated into the Node Tuning Operator and provides a declarative way to configure multiple low-latency optimizations through a single custom resource.
- CPU Management: Isolate specific CPUs for high-performance workloads
- Memory Tuning: Configure HugePages and memory-related kernel parameters
- Kernel Tuning: Apply real-time kernel settings and tuned profiles
- NUMA Awareness: Optimize for Non-Uniform Memory Access architectures

- Single Configuration: One CR manages multiple complex tuning parameters
- Declarative: Version-controlled, repeatable configuration
- Node Pool Isolation: Apply tuning to specific worker nodes only
- Rolling Updates: Orchestrated updates with minimal disruption
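These capability areas map onto fields of a single custom resource. The sketch below is illustrative only: every value (profile name, CPU ranges, HugePages count, node selector) is an example, not the configuration the workshop scripts generate in the steps that follow.

```yaml
# Illustrative PerformanceProfile sketch - all values are examples.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-profile
spec:
  cpu:
    reserved: "0-1"        # CPUs kept for system daemons and kubelet
    isolated: "2-7"        # CPUs dedicated to latency-sensitive workloads
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
      - size: "1G"
        count: 4
  realTimeKernel:
    enabled: false         # RT kernel requires bare-metal instances
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/master: ""
```

A single `oc apply` of this CR drives CPU isolation, HugePages, NUMA policy, and kernel selection together, which is what makes the configuration version-controllable and repeatable.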
Hands-on Exercise: Creating a Performance Profile
Step 1: Identify Target Nodes and Environment
First, we’ll identify your cluster architecture and configure the appropriate nodes for performance tuning.
- Verify cluster access (you should already be connected via bastion):

```shell
# Verify you're connected to the SNO cluster
oc whoami
oc get nodes
```

- Get your SNO node information:

```shell
# Get the SNO node name
TARGET_NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
echo "Target node: $TARGET_NODE"

# Get detailed node information
echo "=== Node Information ==="
oc describe node $TARGET_NODE | grep -E "(Name:|Roles:|Capacity:|Allocatable:)" -A 10

# Check CPU information
echo "=== CPU Topology ==="
oc debug node/$TARGET_NODE -- chroot /host lscpu | grep -E "(CPU\(s\)|Thread|Core|Socket|NUMA)"
```
SNO Configuration Note: Since this is a Single Node OpenShift cluster, no additional node labeling is needed. The Performance Profile will automatically target the master node (which also acts as the worker in SNO).
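If you want to capture a topology value from the `lscpu` output for later arithmetic, a short `awk` filter works. This is a hypothetical helper with a hardcoded sample input, not part of the workshop scripts; on a live node you would pipe in the real `lscpu` output from the `oc debug` command above.

```shell
# Hypothetical: extract SMT threads-per-core from lscpu-style output.
# The sample input below stands in for real `lscpu` output.
LSCPU_SAMPLE='Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1'

THREADS_PER_CORE=$(printf '%s\n' "$LSCPU_SAMPLE" | awk -F: '/Thread\(s\) per core/ { gsub(/ /, "", $2); print $2 }')
echo "Threads per core: $THREADS_PER_CORE"
```

A threads-per-core value of 2 means SMT (hyper-threading) is on, which matters later when deciding how many logical CPUs to isolate.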
Step 2: Verify Machine Config Pool (SNO)
For SNO clusters, the default master Machine Config Pool is used. No additional pool creation is needed.
- Verify Machine Config Pool status:

```shell
# SNO uses the default master pool
echo "=== Machine Config Pools ==="
oc get mcp

echo ""
echo "Master Pool Details:"
oc get mcp master -o yaml | grep -E "(readyMachineCount|updatedMachineCount|machineCount)" | head -3
```
SNO Machine Config Pool: Single Node OpenShift uses the built-in master Machine Config Pool; no custom pool needs to be created.
Step 3: Create the Performance Profile
Now we’ll create a Performance Profile that configures CPU isolation, HugePages, and real-time kernel settings optimized for your cluster architecture.
Real-Time (RT) Kernel Requirements and Cost Considerations: The Linux Real-Time kernel extension (kernel-rt) requires bare-metal AWS instances, which cost substantially more than virtualized instances (compare instance pricing in the us-east-2 region). By default, the Performance Profile script therefore creates a profile WITHOUT the RT kernel enabled. That configuration still provides CPU isolation, HugePages, and NUMA tuning, demonstrating 80% of low-latency concepts at a fraction of the cost, making it ideal for workshops. If your cluster is deployed on bare-metal instances and you want to enable the RT kernel, the script will automatically detect your instance type and validate RT kernel requirements. If you attempt to enable the RT kernel on a virtualized instance, the script will fail with a clear error message explaining the requirements.
SNO-Specific Considerations: For Single Node OpenShift (SNO) clusters, applying a Performance Profile will cause the entire cluster to reboot, making the API server unavailable for 10-20 minutes. If your node gets stuck after the reboot, see the troubleshooting section at the end of this module.
- Determine optimal CPU allocation based on your cluster type:

```shell
# Run the CPU allocation calculator script
bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh
```

What This Script Does:

- Detects CPU count on the target node
- Calculates optimal CPU allocation based on cluster type (SNO, Multi-Node, Multi-Master)
- Validates CPU ranges to prevent configuration errors
- Saves configuration to /tmp/cluster-config for next steps
Benefits of Using the Script:
- Error Prevention: Validates CPU ranges before saving
- Handles Edge Cases: Prevents division by zero and invalid ranges
- Clear Output: Shows allocation strategy and percentages
- Reusable: Can be run multiple times to recalculate
The script implements workshop-friendly conservative allocation that preserves cluster functionality while demonstrating performance benefits.
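As a rough illustration of what a conservative split looks like, the sketch below reserves about a quarter of the CPUs (with a floor of 2) for the system and isolates the rest. The 25% ratio and 2-CPU floor are assumptions for illustration, not the workshop script's exact policy.

```shell
# Hypothetical conservative allocation: reserve ~25% of CPUs (minimum 2)
# for system daemons, isolate the remainder for workloads.
TOTAL_CPUS=8   # example value; a real run would read this from the node

RESERVED_COUNT=$(( TOTAL_CPUS / 4 ))
if [ "$RESERVED_COUNT" -lt 2 ]; then
    RESERVED_COUNT=2   # never starve the control plane
fi

RESERVED_CPUS="0-$(( RESERVED_COUNT - 1 ))"
ISOLATED_CPUS="${RESERVED_COUNT}-$(( TOTAL_CPUS - 1 ))"

echo "reserved=$RESERVED_CPUS isolated=$ISOLATED_CPUS"   # → reserved=0-1 isolated=2-7
```

Note that the reserved and isolated sets must together cover every CPU on the node and must not overlap; that is exactly the kind of invariant the calculator script validates before saving.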
- Create the Performance Profile optimized for your cluster architecture:

```shell
# Run the Performance Profile creation script
bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh
```

What This Script Does:
- Loads and validates CPU allocation from previous step
- Detects instance type (metal vs virtualized)
- Validates RT kernel requirements if enabled
- Validates CPU ranges to prevent an invalid PerformanceProfile
- Determines HugePages allocation based on cluster type
- Creates PerformanceProfile with proper node selector
- Conditionally enables RT kernel based on instance type and configuration
- Shows configuration before applying (requires confirmation)
Benefits of Using the Script:
- Prevents Errors: Validates CPU ranges and RT kernel requirements before creating the PerformanceProfile
- Smart Defaults: RT kernel disabled by default (works on all instance types)
- Instance Detection: Automatically detects bare-metal vs virtualized instances
- Interactive: Shows configuration and asks for confirmation
- Clear Output: Displays profile summary and next steps
- Error Handling: Provides helpful error messages if creation fails
The script will ask for confirmation before applying the PerformanceProfile.
RT Kernel Configuration:
- Default: RT kernel is disabled (works on all instance types)
- To Enable: Set ENABLE_RT_KERNEL=true or use the --enable-rt-kernel flag
- Validation: The script will fail with a clear error if the RT kernel is requested on non-metal instances
If you see an error like invalid range "0—2", this means the CPU allocation calculation failed. Run the CPU allocation calculator script again:

```shell
bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh
```

Then retry the Performance Profile creation.
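The kind of validation that catches a bad range can be sketched in a few lines of bash. The regex below illustrates cpuset-range syntax checking (digits, optional hyphen-digits, comma-separated groups); it is an illustration, not the workshop script's actual code. Note that an em-dash, which can sneak in via copy-paste, fails this check exactly as in the error above.

```shell
# Hypothetical cpuset-range validator: "0-2", "0-2,4-7", "3" are valid;
# anything with a non-ASCII dash or stray characters is rejected.
validate_cpu_range() {
    [[ "$1" =~ ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$ ]]
}

validate_cpu_range "0-2"     && echo "0-2: valid"
validate_cpu_range "0-2,4-7" && echo "0-2,4-7: valid"
validate_cpu_range "0—2"     || echo "0—2: invalid (em-dash instead of hyphen)"
```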
Step 4: Monitor the Performance Profile Application
The Performance Profile will trigger a rolling update of your nodes. This process includes applying CPU isolation, HugePages, NUMA tuning, and (if enabled) installing the real-time kernel. The monitoring approach varies by cluster architecture.
Node Reboot Required: The Performance Profile applies kernel boot parameters (CPU isolation, HugePages, NUMA tuning), which always require a node reboot, regardless of the RT kernel setting. The Machine Config Pool (MCP) will automatically trigger the reboot.
- Monitor Machine Config Pool status:

```shell
echo "Current master MCP status:"
oc get mcp master

echo ""
echo "Starting continuous monitoring (Ctrl+C to stop):"
echo "watch 'oc get mcp master; echo; oc get nodes'"
```
- Monitor node updates in detail:

```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== Detailed Node Update Monitoring ==="

# Check machine config daemon status
echo "Machine Config Daemon Pods:"
oc get pods -n openshift-machine-config-operator | grep daemon

echo ""
echo "Recent Node Events:"
oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -10

echo ""
echo "Machine Config Status:"
oc describe mcp master | grep -A 10 -B 5 "Conditions:"
```
- Wait for the update to complete:

```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== Waiting for Performance Profile Application ==="
echo "This process may take 5-20 minutes depending on your configuration"
echo ""
echo "SNO Update Process:"
echo "  1. Machine config generation"
echo "  2. Node cordoning and draining"
if [ "${ENABLE_RT_KERNEL:-false}" = "true" ]; then
    echo "  3. RT kernel installation and reboot (RT kernel enabled)"
    echo "  4. Node rejoin and ready state"
else
    echo "  3. Kernel parameter updates (no reboot required)"
    echo "  4. Node ready state"
fi

echo ""
echo "Waiting for master MCP to be updated..."
oc wait --for=condition=Updated mcp/master --timeout=1200s

# Only wait for reboot if RT kernel is enabled
if [ "${ENABLE_RT_KERNEL:-false}" = "true" ]; then
    echo "Waiting for node to be ready after reboot..."
    oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
else
    echo "Node update complete (no reboot required)"
fi

echo ""
echo "Performance Profile application completed!"
echo "Final status:"
oc get nodes
oc get mcp

echo ""
echo "Testing cluster functionality after performance tuning..."

# Create a simple test pod to verify the cluster is still functional
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: post-tuning-test
  namespace: default
  labels:
    app: post-tuning-test
spec:
  containers:
  - name: test-container
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep"]
    args: ["30"]
    resources:
      requests:
        memory: "32Mi"
        cpu: "50m"
      limits:
        memory: "64Mi"
        cpu: "100m"
  restartPolicy: Never
EOF

# Wait for pod to be scheduled and running
echo "Waiting for test pod to verify cluster functionality..."
if oc wait --for=condition=Ready pod/post-tuning-test --timeout=120s -n default 2>/dev/null; then
    echo "  Cluster is functional after performance tuning!"
    echo "  Test pod scheduled successfully"
    oc delete pod post-tuning-test -n default --ignore-not-found=true
else
    echo "  Cluster may have scheduling issues after performance tuning"
    echo "  Consider using the revert script if problems persist"
    echo "  Check pod status: oc describe pod post-tuning-test -n default"
fi
```
Step 5: Verify Performance Profile Effects
Once the update is complete, verify that all the performance optimizations have been applied correctly across your cluster architecture.
- Comprehensive verification using the Python health check script:

```shell
echo "=== Performance Profile Verification ==="
echo "Running comprehensive cluster health check..."
echo ""

# Use Python script for thorough verification
python3 ~/low-latency-performance-workshop/scripts/module04-cluster-health-check.py

echo ""
echo "Comprehensive verification completed!"
echo ""
echo "The health check script validates:"
echo "  - Cluster architecture detection"
echo "  - Performance Profile status"
echo "  - Real-time kernel installation"
echo "  - CPU isolation configuration"
echo "  - Pod scheduling functionality"
```
- Get a quick performance tuning summary:

```shell
echo "=== Performance Tuning Summary ==="
echo ""

# Get color-coded summary of current performance settings
python3 ~/low-latency-performance-workshop/scripts/module04-performance-summary.py

echo ""
echo "This summary shows:"
echo "  - Current CPU allocation strategy"
echo "  - Performance vs stability balance"
echo "  - Recommendations for optimization"
```
Detailed Performance Tuning Validation
The workshop provides additional scripts to help with CPU allocation, Performance Profile creation, and validation.
- CPU Allocation Calculator - Calculate optimal CPU allocation:

```shell
# Calculate CPU allocation based on cluster type
bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh
```

This script:

- Detects CPU count on target node
- Calculates optimal reserved/isolated CPU allocation
- Validates CPU ranges to prevent errors
- Saves configuration for Performance Profile creation
- Performance Profile Creator - Create validated Performance Profile:

```shell
# Create Performance Profile with validated configuration
bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh
```

This script:

- Validates CPU allocation from previous step
- Prevents invalid CPU range errors
- Shows configuration before applying
- Requires confirmation for safety
- Performance Tuning Validator - Comprehensive validation of Performance Profile:

```shell
# Validate Performance Profile configuration
python3 ~/low-latency-performance-workshop/scripts/module04-tuning-validator.py
```

This script validates:

- Performance Profile existence and configuration
- Machine Config Pool (MCP) status and readiness
- Real-Time kernel installation on target nodes
- Overall tuning configuration health
- CPU Isolation Checker - Detailed CPU allocation analysis:

```shell
# Check CPU isolation configuration
python3 ~/low-latency-performance-workshop/scripts/module04-cpu-isolation-checker.py
```

This script provides:

- Visual representation of CPU allocation
- Reserved vs isolated CPU validation
- CPU allocation strategy explanation
- Best practices for CPU isolation
- Configuration recommendations
- HugePages Validator - HugePages configuration verification:

```shell
# Validate HugePages configuration
python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py

# Check a specific Performance Profile
python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py \
    --profile my-performance-profile
```

This script validates:

- HugePages configuration in the Performance Profile
- Node HugePages allocation and availability
- HugePages benefits and use cases
- How to use HugePages in pod specifications
- Total HugePages memory allocation
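For reference, a pod consumes pre-allocated HugePages through its resource requests and limits. This uses the standard Kubernetes `hugepages-<size>` resource and `emptyDir` with `medium: HugePages`; the sizes and amounts below are example values.

```yaml
# Example pod requesting 2Mi HugePages (illustrative sizes).
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep", "3600"]
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 100Mi
        memory: 64Mi
      limits:
        hugepages-2Mi: 100Mi   # HugePages limits must equal requests
        memory: 64Mi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

The pod will only schedule on a node whose Performance Profile (or other boot-time configuration) has actually pre-allocated enough HugePages of the requested size.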
These validation scripts are educational tools that help you understand how each optimization is applied. Run them after applying Performance Profiles to ensure everything is configured correctly.
Performance Testing: Measuring Improvements
Now let’s run the same baseline test to measure the performance improvements from our optimizations.
Step 6: Re-run Performance Tests
- Re-run the kube-burner performance test on the optimized cluster:

```shell
cd ~/kube-burner-configs

# Load cluster configuration
source /tmp/cluster-config

echo "=== Creating Tuned Performance Test Configuration ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Profile name: $PROFILE_NAME"
echo "Target node: $TARGET_NODE"

# Create a new test configuration for the tuned cluster
cat > tuned-config.yml << EOF
global:
  measurements:
    - name: podLatency
      thresholds:
        - conditionType: Ready
          metric: P99
          threshold: 15000ms  # Expect better performance after tuning

metricsEndpoints:
  - indexer:
      type: local
      metricsDirectory: collected-metrics-tuned

jobs:
  - name: tuned-workload
    jobType: create
    jobIterations: 20
    namespace: tuned-workload
    namespacedIterations: true
    cleanup: false
    podWait: false
    waitWhenFinished: true
    verifyObjects: true
    errorOnVerify: false
    objects:
      - objectTemplate: tuned-pod.yml
        replicas: 5
        inputVars:
          containerImage: registry.redhat.io/ubi8/ubi:latest
EOF

# Create a tuned pod template with the appropriate node selector
if [ "$CLUSTER_TYPE" = "SNO" ]; then
    NODE_SELECTOR_YAML='nodeSelector:
    node-role.kubernetes.io/master: ""'
    echo "SNO: Using master node selector for pod placement"
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
    NODE_SELECTOR_YAML='nodeSelector:
    node-role.kubernetes.io/worker-rt: ""'
    echo "Multi-Node: Using worker-rt node selector for pod placement"
else
    NODE_SELECTOR_YAML='nodeSelector:
    node-role.kubernetes.io/master-rt: ""'
    echo "Multi-Master: Using master-rt node selector for pod placement"
fi

cat > tuned-pod.yml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: tuned-pod-{{.Iteration}}-{{.Replica}}
  labels:
    app: tuned-test
    iteration: "{{.Iteration}}"
    cluster-type: "$CLUSTER_TYPE"
spec:
  $NODE_SELECTOR_YAML
  containers:
  - name: tuned-container
    image: {{.containerImage}}
    command: ["sleep"]
    args: ["300"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "128Mi"
        cpu: "200m"
  restartPolicy: Never
EOF

echo ""
echo "Running Performance Test on Tuned Cluster"
echo "  - Test configuration: tuned-config.yml"
echo "  - Pod template: tuned-pod.yml"
echo "  - Target: $CLUSTER_TYPE cluster with $PROFILE_NAME"
echo ""

# Run the performance test
kube-burner init -c tuned-config.yml --log-level=info

echo ""
echo "Tuned performance test completed!"
echo "Results stored in: collected-metrics-tuned/"
```
- Analyze the tuned performance results using the Python script:

```shell
cd ~/kube-burner-configs

echo "Analyzing tuned performance results..."
echo ""

# Use Python script for clean, color-coded analysis
python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py --single collected-metrics-tuned

echo ""
echo "Tuned performance analysis completed!"
echo ""
echo "The enhanced analysis provides:"
echo "  - Educational context explaining what each metric means"
echo "  - Performance explanations (why some metrics are slower)"
echo "  - Color-coded results: Excellent/Good/Needs Attention"
echo "  - Suggested next steps based on your results"
```
- Compare results with your baseline using the Python analysis:

```shell
cd ~/kube-burner-configs

echo "Comparing baseline vs tuned performance..."
echo ""

# Use module-specific analysis script for clean Module 4 results
echo "Module 4 Focused Analysis (Container Performance Only)..."
python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 4

echo ""
echo "Generating Module 4 Performance Report..."
echo "  Focus: Container performance optimization (baseline vs tuned)"
echo "  Scope: CPU isolation and HugePages impact on pod startup"
echo ""

# Generate Module 4 specific markdown report (baseline vs tuned only)
REPORT_FILE="module4-performance-comparison-$(date +%Y%m%d-%H%M).md"
python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
    --baseline collected-metrics \
    --tuned collected-metrics-tuned \
    --report "$REPORT_FILE"

echo ""
echo "Module 4 Analysis Scope:"
echo "  - Baseline container performance (from Module 3)"
echo "  - Tuned container performance (from Module 4)"
echo "  - Note: If VMI data exists from Module 5, it will also be shown"
echo "  - Focus on the 'Performance Comparison' section for Module 4 results"
echo "  - Comprehensive analysis across all modules happens in Module 6"
echo ""
echo "How to Read the Analysis:"
echo "  1. Individual test sections show raw performance data"
echo "  2. 'Performance Comparison' section shows Module 4 improvements"
echo "  3. VMI data (if shown) is for reference - focus on container metrics"
echo ""
echo "Performance Comparison Summary:"
echo "=================================="
if [ -f "$REPORT_FILE" ]; then
    # Display key sections of the report
    head -30 "$REPORT_FILE"
    echo ""
    echo "Full report available at: $REPORT_FILE"
else
    echo "Report generation failed - check if both baseline and tuned metrics exist"
fi

echo ""
echo "Performance comparison completed!"
echo ""
echo "The comparison analysis explains:"
echo "  - What P99/P95/P50 improvements mean for your workloads"
echo "  - Why scheduling became instant (0ms) with CPU isolation"
echo "  - Why container operations may be slower (expected trade-off)"
echo "  - Overall assessment of your performance tuning effectiveness"
```
Step 7: Validate kube-burner Test Pods and Cluster Stability
Before proceeding, let’s validate that our performance tuning doesn’t interfere with normal cluster operations and that kube-burner can successfully create and schedule pods.
- Check the status of kube-burner test pods:

```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== Validating kube-burner Test Pod Scheduling ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Target node: $TARGET_NODE"

# Check if tuned test pods were created successfully
echo ""
echo "Tuned Test Pod Status:"
TUNED_PODS=$(oc get pods -A -l app=tuned-test --no-headers 2>/dev/null | wc -l)
if [ $TUNED_PODS -gt 0 ]; then
    echo "  Found $TUNED_PODS tuned test pods"
    echo ""
    echo "  Pod Distribution by Node:"
    oc get pods -A -l app=tuned-test -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sed 's/^/    /'
    echo ""
    echo "  Pod Status Summary:"
    oc get pods -A -l app=tuned-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/    /'

    # Check for any failed pods
    FAILED_PODS=$(oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | wc -l)
    if [ $FAILED_PODS -gt 0 ]; then
        echo ""
        echo "  Found $FAILED_PODS pods not in Running/Completed state:"
        oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | sed 's/^/    /'
    fi
else
    echo "  No tuned test pods found - this may indicate scheduling issues"
fi

# Check baseline test pods as well
echo ""
echo "Baseline Test Pod Status:"
BASELINE_PODS=$(oc get pods -A -l app=baseline-test --no-headers 2>/dev/null | wc -l)
if [ $BASELINE_PODS -gt 0 ]; then
    echo "  Found $BASELINE_PODS baseline test pods"
    echo "  Baseline Pod Status Summary:"
    oc get pods -A -l app=baseline-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/    /'
else
    echo "  No baseline test pods found (may have been cleaned up)"
fi
```
- Test cluster responsiveness and pod scheduling:

```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== Testing Cluster Responsiveness ==="

# Create a simple test pod to verify scheduling works
echo "Creating test pod to verify cluster functionality..."
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cluster-health-test
  namespace: default
  labels:
    app: cluster-health-test
spec:
  containers:
  - name: test-container
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep"]
    args: ["60"]
    resources:
      requests:
        memory: "32Mi"
        cpu: "50m"
      limits:
        memory: "64Mi"
        cpu: "100m"
  restartPolicy: Never
EOF

# Wait for pod to be scheduled and running
echo "Waiting for test pod to start..."
oc wait --for=condition=Ready pod/cluster-health-test --timeout=60s -n default

if [ $? -eq 0 ]; then
    echo "  Test pod started successfully - cluster is responsive"

    # Check which node it was scheduled on
    TEST_POD_NODE=$(oc get pod cluster-health-test -n default -o jsonpath='{.spec.nodeName}')
    echo "  Test pod scheduled on node: $TEST_POD_NODE"

    # Clean up test pod
    oc delete pod cluster-health-test -n default --ignore-not-found=true
else
    echo "  Test pod failed to start - cluster may have scheduling issues"
    echo "  Pod events:"
    oc describe pod cluster-health-test -n default | grep -A 10 "Events:" | sed 's/^/    /'
fi
```
Step 8: Optional - Revert Performance Tuning for Workshop Stability
If you experience any issues with cluster stability or want to continue with other workshop modules without the aggressive performance tuning, you can revert the changes.
- Create a revert script for easy cleanup:

```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== Creating Performance Tuning Revert Script ==="

# Create revert script
cat > ~/revert-performance-tuning.sh << 'EOF'
#!/bin/bash

# Load cluster configuration
if [ -f /tmp/cluster-config ]; then
    source /tmp/cluster-config
    echo "=== Reverting Performance Tuning ==="
    echo "Cluster type: $CLUSTER_TYPE"
    echo "Profile name: $PROFILE_NAME"
    echo "Target node: $TARGET_NODE"
else
    echo "Cluster configuration not found. Cannot revert automatically."
    echo "You can manually delete performance profiles with:"
    echo "  oc get performanceprofile"
    echo "  oc delete performanceprofile <profile-name>"
    exit 1
fi

# Delete the Performance Profile
echo ""
echo "Removing Performance Profile: $PROFILE_NAME"
if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then
    oc delete performanceprofile $PROFILE_NAME
    echo "  Performance Profile deleted"
else
    echo "  Performance Profile not found (may already be deleted)"
fi

# Remove custom Machine Config Pool (if created)
if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
    echo ""
    echo "Removing worker-rt Machine Config Pool..."
    if oc get mcp worker-rt >/dev/null 2>&1; then
        oc delete mcp worker-rt
        echo "  worker-rt Machine Config Pool deleted"
    else
        echo "  worker-rt Machine Config Pool not found"
    fi

    # Remove worker-rt label from nodes
    echo ""
    echo "Removing worker-rt labels from nodes..."
    oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt- --ignore-not-found=true
elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
    echo ""
    echo "Removing master-rt Machine Config Pool..."
    if oc get mcp master-rt >/dev/null 2>&1; then
        oc delete mcp master-rt
        echo "  master-rt Machine Config Pool deleted"
    else
        echo "  master-rt Machine Config Pool not found"
    fi

    # Remove master-rt label from nodes
    echo ""
    echo "Removing master-rt labels from nodes..."
    oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt- --ignore-not-found=true
fi

echo ""
echo "Waiting for nodes to revert to standard kernel..."
echo "  This process will take 10-15 minutes as nodes reboot"
echo "  Monitor progress with: watch 'oc get nodes; oc get mcp'"

# Wait for machine config pools to be updated
if [ "$CLUSTER_TYPE" = "SNO" ]; then
    echo "  Waiting for master MCP to update..."
    oc wait --for=condition=Updated mcp/master --timeout=1200s
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
    echo "  Waiting for worker MCP to update..."
    oc wait --for=condition=Updated mcp/worker --timeout=1200s
else
    echo "  Waiting for master MCP to update..."
    oc wait --for=condition=Updated mcp/master --timeout=1200s
fi

echo ""
echo "Performance tuning revert completed!"
echo "Verify with: oc debug node/$TARGET_NODE -- chroot /host uname -r"
echo "  (Should show standard kernel without 'rt')"
EOF

chmod +x ~/revert-performance-tuning.sh

echo "Revert script created: ~/revert-performance-tuning.sh"
echo ""
echo "To revert performance tuning later, run:"
echo "  ~/revert-performance-tuning.sh"
echo ""
echo "Note: Reverting will cause nodes to reboot back to the standard kernel"
```
- Optional - Run the revert script if needed:

```shell
echo "=== Performance Tuning Revert Decision ==="
echo ""
echo "Do you want to revert the performance tuning now?"
echo ""
echo "Keep Performance Tuning if:"
echo "  - Cluster is stable and responsive"
echo "  - Test pods are scheduling successfully"
echo "  - You want to continue with optimized performance"
echo ""
echo "Revert Performance Tuning if:"
echo "  - Experiencing cluster stability issues"
echo "  - Pods are failing to schedule"
echo "  - You want to continue the workshop with standard settings"
echo ""
echo "You can always re-apply performance tuning later by re-running this module"
echo ""
echo "To revert now, run: ~/revert-performance-tuning.sh"
echo "To keep current settings, continue to the next step"
```
Expected Improvements
With proper performance tuning, you should see significant improvements:
- Pod Creation P99: 50-70% reduction in latency
- Pod Creation P95: 40-60% reduction in latency
- Consistency: Much lower variance between P50 and P99
- Jitter Reduction: More predictable response times

- CPU Isolation: Eliminates interference from system processes
- Real-time Kernel: Provides deterministic scheduling
- HugePages: Reduces memory management overhead
- NUMA Optimization: Ensures local memory access
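To quantify your own improvement, the percent reduction is simply (baseline minus tuned) divided by baseline. A minimal sketch with made-up example numbers; substitute the P99 values from your baseline and tuned reports:

```shell
# Percent reduction in P99 pod-ready latency.
# Both input values are illustrative examples, not workshop results.
BASELINE_P99_MS=12000
TUNED_P99_MS=4800

REDUCTION_PCT=$(( (BASELINE_P99_MS - TUNED_P99_MS) * 100 / BASELINE_P99_MS ))
echo "P99 latency reduced by ${REDUCTION_PCT}%"   # → P99 latency reduced by 60%
```

A result in the 50-70% range for P99 would match the expectation stated above; a much smaller reduction suggests re-checking the Performance Profile with the validation scripts.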
Troubleshooting Common Issues
This section provides architecture-specific troubleshooting guidance for common Performance Profile issues.
Architecture-Specific Troubleshooting
If nodes don’t start updating after Performance Profile creation:
```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== Node Update Troubleshooting ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Target node: $TARGET_NODE"

if [ "$CLUSTER_TYPE" = "SNO" ]; then
    echo ""
    echo "SNO Troubleshooting:"
    echo " - Check master Machine Config Pool status"
    oc describe mcp master | grep -A 10 -B 5 "Conditions:"
    echo ""
    echo " - Check for conflicting machine configs:"
    oc get mc | grep master
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
    echo ""
    echo "Multi-Node Troubleshooting:"
    echo " - Check worker-rt Machine Config Pool status"
    oc describe mcp worker-rt | grep -A 10 -B 5 "Conditions:"
    echo ""
    echo " - Check for conflicting machine configs:"
    oc get mc | grep worker-rt
else
    echo ""
    echo "Multi-Master Troubleshooting:"
    echo " - Check master-rt Machine Config Pool status"
    oc describe mcp master-rt | grep -A 10 -B 5 "Conditions:"
    echo ""
    echo " - Check for conflicting machine configs:"
    oc get mc | grep master-rt
fi

echo ""
echo "Performance Profile Status:"
oc describe performanceprofile $PROFILE_NAME | grep -A 10 -B 5 "Status:"
```
If the RT kernel fails to install:
```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== Real-Time Kernel Troubleshooting ==="
echo "Target node: $TARGET_NODE"

# Check node events for errors
echo "Recent Node Events:"
oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -15

echo ""
echo "Machine Config Daemon Logs:"
MCD_POD=$(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$MCD_POD" ]; then
    echo " MCD Pod: $MCD_POD"
    oc logs -n openshift-machine-config-operator $MCD_POD --tail=20
else
    echo " No MCD pod found for node $TARGET_NODE"
fi

echo ""
echo "RT Kernel Package Availability:"
oc debug node/$TARGET_NODE -- chroot /host sh -c 'yum list available | grep kernel-rt || dnf list available | grep kernel-rt' 2>/dev/null || echo " Unable to check RT kernel packages"

echo ""
echo "Current Kernel Information:"
oc debug node/$TARGET_NODE -- chroot /host uname -a 2>/dev/null || echo " Unable to check current kernel"
```
If HugePages aren’t configured correctly:
```shell
# Load cluster configuration
source /tmp/cluster-config

echo "=== HugePages Troubleshooting ==="
echo "Target node: $TARGET_NODE"

# Check available memory
echo "Memory Information:"
oc debug node/$TARGET_NODE -- chroot /host free -h 2>/dev/null || echo " Unable to check memory"

echo ""
echo "HugePages Configuration:"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/meminfo 2>/dev/null | grep -i huge || echo " No HugePages information found"

echo ""
echo "HugePages Mount Points:"
oc debug node/$TARGET_NODE -- chroot /host mount 2>/dev/null | grep huge || echo " No HugePages mount points found"

echo ""
echo "Kernel Command Line (HugePages args):"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/cmdline 2>/dev/null | grep -o 'hugepages[^[:space:]]*' || echo " No HugePages kernel arguments found"

echo ""
echo "Performance Profile HugePages Spec:"
oc get performanceprofile $PROFILE_NAME -o jsonpath='{.spec.hugepages}' | jq '.' 2>/dev/null || echo " Unable to read Performance Profile HugePages spec"
```
If the node reboots but the API server doesn’t come back up (SNO) or node remains NotReady:
# Load cluster configuration
source /tmp/cluster-config
echo "=== Node Stuck After Reboot Troubleshooting ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Target node: $TARGET_NODE"
echo "Instance type: ${INSTANCE_TYPE:-unknown}"
# Check node status
echo ""
echo "๐ Node Status:"
oc get node $TARGET_NODE -o wide 2>/dev/null || echo " โ Cannot reach API server (node may be rebooting)"
# Check AWS instance status (if AWS CLI available)
if command -v aws &> /dev/null; then
echo ""
echo "๐ AWS Instance Status:"
INSTANCE_ID=$(aws ec2 describe-instances \
--filters "Name=private-ip-address,Values=$(oc get node $TARGET_NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' 2>/dev/null || echo 'unknown')" \
--query 'Reservations[*].Instances[*].[InstanceId,State.Name]' \
--output text 2>/dev/null | head -1)
if [ -n "$INSTANCE_ID" ]; then
echo " Instance: $INSTANCE_ID"
aws ec2 describe-instance-status --instance-ids $INSTANCE_ID \
--query 'InstanceStatuses[*].[SystemStatus.Status,InstanceStatus.Status]' \
--output table 2>/dev/null || echo " Unable to check AWS status"
fi
fi
echo ""
echo "Common Causes:"
echo "  1. CPU isolation too aggressive - not enough CPUs for control plane"
echo "  2. nosmt parameter on virtualized instance - halves available CPUs"
echo "  3. Insufficient memory after HugePages allocation"
echo ""
echo "Recovery Options:"
echo "  1. Wait 20-30 minutes - node may still be applying configuration"
echo "  2. Reboot instance: aws ec2 reboot-instances --instance-ids <instance-id>"
echo "  3. Delete Performance Profile and recreate with more conservative settings:"
echo "     oc delete performanceprofile $PROFILE_NAME"
echo "     # Then re-run CPU allocation calculator and profile creator"
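The first cause — over-aggressive CPU isolation — is worth quantifying before recreating the profile. The sketch below shows one conservative sizing heuristic (reserve 25% of CPUs for the control plane, with a floor of 4); the exact split and the 4-CPU floor are illustrative assumptions, not an official recommendation:

```shell
# Hypothetical sizing helper: substitute TOTAL_CPUS with `nproc` from the target node
TOTAL_CPUS=16

RESERVED=$(( TOTAL_CPUS / 4 ))         # reserve 25% for the control plane ...
[ "$RESERVED" -lt 4 ] && RESERVED=4    # ... but never fewer than 4 CPUs

# Print the cpuset ranges in the format the Performance Profile expects
echo "reserved: 0-$(( RESERVED - 1 ))"
echo "isolated: ${RESERVED}-$(( TOTAL_CPUS - 1 ))"
```

For a 16-CPU SNO node this yields `reserved: 0-3` and `isolated: 4-15`, leaving the control plane enough headroom to survive the post-profile reboot.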
If the Performance Profile exists but optimizations aren’t applied:
# Load cluster configuration
source /tmp/cluster-config
echo "=== Performance Profile Application Troubleshooting ==="
# Check Performance Profile status
echo "Performance Profile Status:"
oc get performanceprofile $PROFILE_NAME -o yaml | grep -A 20 "status:" || echo "  No status information available"
echo ""
echo "Node Tuning Operator Status:"
oc get pods -n openshift-cluster-node-tuning-operator
echo ""
echo "TuneD Daemon Status:"
TUNED_POD=$(oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$TUNED_POD" ]; then
  echo "  TuneD Pod: $TUNED_POD"
  oc logs -n openshift-cluster-node-tuning-operator $TUNED_POD --tail=10
else
  echo "  No TuneD pod found for node $TARGET_NODE"
fi
echo ""
echo "Generated TuneD Profiles:"
oc get tuned -n openshift-cluster-node-tuning-operator
Module Summary
In this module, you have successfully implemented architecture-aware performance tuning:
* Detected your cluster architecture (SNO, Multi-Node, or Multi-Master)
* Created an optimized Performance Profile tailored to your environment
* Configured CPU isolation appropriate for your cluster’s control plane needs
* Allocated HugePages sized correctly for your available resources
* Applied real-time kernel settings for deterministic scheduling
* Verified all performance optimizations are working correctly
* Measured significant performance improvements through comparative testing
Single Node OpenShift (SNO):
* Conservative CPU allocation preserving control plane stability
* Optimized HugePages configuration for resource-constrained environments
* Master node targeting with appropriate performance isolation
Multi-Node Clusters:
* Aggressive CPU isolation on dedicated worker nodes
* Maximum performance optimization without control plane impact
* Dedicated Machine Config Pool for isolated performance tuning
Multi-Master Clusters:
* Balanced CPU allocation for control plane and workload performance
* Strategic node selection for performance optimization
* Maintained cluster stability during rolling updates
Key Takeaways
* Performance Profiles adapt automatically to different cluster architectures
* CPU isolation strategies must account for control plane requirements
* Real-time kernels provide predictable, low-latency scheduling across all architectures
* HugePages allocation should be sized appropriately for available resources
* Proper architecture-aware tuning can achieve 50-70% latency improvements
* SNO environments can achieve exceptional performance due to simplified architecture
Based on your cluster type, you should observe:
* Pod Creation Latency: 50-70% reduction in P99 times
* Consistency: Dramatically reduced variance between P50 and P99
* Jitter: More predictable and deterministic response times
* Resource Utilization: Optimized CPU and memory usage patterns
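To turn your kube-burner numbers into a percentage, compare the baseline and tuned P99 latencies directly. The values below are hypothetical placeholders — substitute the baseline measurement from Module 3 and your post-tuning result:

```shell
# Hypothetical measurements -- substitute your own kube-burner results
BASELINE_P99_MS=2400   # P99 pod-creation latency before tuning (Module 3)
TUNED_P99_MS=900       # P99 pod-creation latency after tuning

# Relative improvement: (baseline - tuned) / baseline * 100
awk -v b="$BASELINE_P99_MS" -v t="$TUNED_P99_MS" \
    'BEGIN { printf "P99 improvement: %.1f%%\n", (b - t) / b * 100 }'
```

For these placeholder values the output is `P99 improvement: 62.5%`, which falls in the 50-70% range expected for a well-tuned profile.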
Optional: Reverting Performance Tuning for Workshop Continuation
If you need to revert the performance tuning to continue with other workshop modules or if you experience any cluster stability issues, you can easily remove the performance optimizations.
Creating a Revert Script
-
Create an automated revert script:
# Load cluster configuration
source /tmp/cluster-config

echo "=== Creating Performance Tuning Revert Script ==="

# Create revert script
cat > ~/revert-performance-tuning.sh << 'EOF'
#!/bin/bash

# Load cluster configuration
if [ -f /tmp/cluster-config ]; then
  source /tmp/cluster-config
  echo "=== Reverting Performance Tuning ==="
  echo "Cluster type: $CLUSTER_TYPE"
  echo "Profile name: $PROFILE_NAME"
  echo "Target node: $TARGET_NODE"
else
  echo "Cluster configuration not found. Attempting manual cleanup..."
  PROFILE_NAME=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  if [ -z "$PROFILE_NAME" ]; then
    echo "No Performance Profiles found to delete."
    exit 0
  fi
fi

# Delete the Performance Profile
echo ""
echo "Removing Performance Profile: $PROFILE_NAME"
if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then
  oc delete performanceprofile $PROFILE_NAME
  echo "  Performance Profile deleted"
else
  echo "  Performance Profile not found (may already be deleted)"
fi

# Remove custom Machine Config Pool and labels (if created)
if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
  echo ""
  echo "Cleaning up worker-rt configuration..."
  oc delete mcp worker-rt --ignore-not-found=true
  oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt-
  echo "  Worker-rt configuration cleaned up"
elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
  echo ""
  echo "Cleaning up master-rt configuration..."
  oc delete mcp master-rt --ignore-not-found=true
  oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt-
  echo "  Master-rt configuration cleaned up"
fi

echo ""
echo "Waiting for nodes to revert to standard kernel..."
echo "  This process will take 10-15 minutes as nodes reboot"
echo "  Monitor progress with: watch 'oc get nodes; oc get mcp'"

# Wait for machine config pools to be updated
if [ "$CLUSTER_TYPE" = "SNO" ] || [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
  echo "  Waiting for master MCP to update..."
  oc wait --for=condition=Updated mcp/master --timeout=1200s
elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
  echo "  Waiting for worker MCP to update..."
  oc wait --for=condition=Updated mcp/worker --timeout=1200s
fi

echo ""
echo "Performance tuning revert completed!"
echo "Verify standard kernel with:"
echo "  oc debug node/\$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- chroot /host uname -r"
echo "  (Should show standard kernel without 'rt')"
EOF

chmod +x ~/revert-performance-tuning.sh

echo "Revert script created: ~/revert-performance-tuning.sh"
echo ""
echo "To revert performance tuning at any time, run:"
echo "  ~/revert-performance-tuning.sh"
echo ""
echo "Note: reverting will cause nodes to reboot back to the standard kernel"
When to Use the Revert Script
Revert the performance tuning if you observe any of the following:
-
Pods failing to schedule or start
-
Cluster becoming unresponsive
-
High resource contention
-
Need to continue with other workshop modules that require standard settings
Keep the tuning in place if:
-
Cluster is stable and responsive
-
Test pods are scheduling successfully
-
You want to continue with performance-optimized settings
-
Planning to proceed directly to Module 5 (Virtualization)
Manual Cleanup (Alternative)
If the automated script doesn’t work, you can manually clean up:
# Manual cleanup commands
echo "=== Manual Performance Tuning Cleanup ==="
# List and delete performance profiles
echo "Current Performance Profiles:"
oc get performanceprofile
echo ""
echo "Delete Performance Profile:"
echo "oc delete performanceprofile <profile-name>"
# List and delete custom MCPs
echo ""
echo "Current Machine Config Pools:"
oc get mcp
echo ""
echo "Delete custom MCP (if created):"
echo "oc delete mcp worker-rt # or master-rt"
echo ""
echo "After deletion, nodes will automatically reboot to the standard kernel"
In Module 5, you will learn how to apply these performance optimizations to OpenShift Virtualization, creating high-performance virtual machines that leverage your tuned infrastructure to achieve near bare-metal latency characteristics. The architecture-aware approach you’ve learned will be essential for optimizing VM placement and resource allocation.
Workshop Flexibility: You can proceed to Module 5 either with or without the performance tuning active. The virtualization module will adapt to your current cluster configuration.