Module 4: Core Performance Tuning with Performance Profiles

Module Overview

This module is the heart of the workshop, where you’ll apply node-level performance tuning using OpenShift’s Performance Profile Controller. You’ll configure CPU isolation, HugePages allocation, and real-time kernel settings to dramatically improve latency characteristics.

Learning Objectives
  • Understand Performance Profiles and their components

  • Configure CPU isolation for high-performance workloads

  • Allocate and manage HugePages for reduced memory latency

  • Apply real-time kernel tuning profiles

  • Measure the performance improvements from your optimizations

Prerequisites

Before starting this module, ensure you have completed:

  • Module 1: Workshop Setup and Multi-Cluster Management

  • Module 2: GitOps Deployment and Operator Installation

  • Module 3: Baseline Performance Measurement and Analysis

  • Established baseline performance metrics using kube-burner from Module 3

Understanding Performance Profiles

The Performance Profile Controller (PPC) is integrated into the Node Tuning Operator and provides a declarative way to configure multiple low-latency optimizations through a single custom resource.

Key Components
  • CPU Management: Isolate specific CPUs for high-performance workloads

  • Memory Tuning: Configure HugePages and memory-related kernel parameters

  • Kernel Tuning: Apply real-time kernel settings and tuned profiles

  • NUMA Awareness: Optimize for Non-Uniform Memory Access architectures

Performance Profile Benefits
  • Single Configuration: One CR manages multiple complex tuning parameters

  • Declarative: Version-controlled, repeatable configuration

  • Node Pool Isolation: Apply tuning to specific worker nodes only

  • Rolling Updates: Orchestrated updates with minimal disruption
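
To make this concrete, the snippet below shows the general shape of a PerformanceProfile CR. The values are illustrative only; the workshop scripts in the following steps compute the actual CPU ranges, HugePages counts, and node selector for your cluster.

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performance-profile
spec:
  cpu:
    reserved: "0-1"              # CPUs kept for the OS, kubelet, and system daemons
    isolated: "2-7"              # CPUs dedicated to latency-sensitive workloads
  hugepages:
    defaultHugepagesSize: "2M"
    pages:
      - size: "2M"
        count: 512               # 512 x 2MiB = 1GiB of HugePages
  realTimeKernel:
    enabled: true                # install and boot the real-time kernel
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""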

Hands-on Exercise: Creating a Performance Profile

Step 1: Identify Target Nodes and Environment

First, we’ll identify your cluster architecture and configure the appropriate nodes for performance tuning.

  1. Switch to your target cluster context:

    oc config use-context target-cluster
  2. Detect your cluster architecture:

    # Check cluster architecture
    TOTAL_NODES=$(oc get nodes --no-headers | wc -l)
    WORKER_NODES=$(oc get nodes -l node-role.kubernetes.io/worker= --no-headers | wc -l)
    MASTER_NODES=$(oc get nodes -l node-role.kubernetes.io/master= --no-headers | wc -l)
    
    echo "=== Cluster Architecture Detection ==="
    echo "Total nodes: $TOTAL_NODES"
    echo "Master nodes: $MASTER_NODES"
    echo "Worker nodes: $WORKER_NODES"
    
    if [ $TOTAL_NODES -eq 1 ]; then
        echo "✅ Detected: Single Node OpenShift (SNO)"
        CLUSTER_TYPE="SNO"
        TARGET_NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
        echo "Target node: $TARGET_NODE"
    elif [ $WORKER_NODES -gt 0 ]; then
        echo "✅ Detected: Multi-node cluster with dedicated workers"
        CLUSTER_TYPE="MULTI_NODE"
        TARGET_NODE=$(oc get nodes -l node-role.kubernetes.io/worker= -o jsonpath='{.items[0].metadata.name}')
        echo "Target worker node: $TARGET_NODE"
    else
        echo "⚠️  Detected: Multi-node cluster without dedicated workers"
        CLUSTER_TYPE="MULTI_MASTER"
        TARGET_NODE=$(oc get nodes -l node-role.kubernetes.io/master= -o jsonpath='{.items[0].metadata.name}')
        echo "Target master node: $TARGET_NODE"
    fi
    
    echo "CLUSTER_TYPE=$CLUSTER_TYPE" > /tmp/cluster-config
    echo "TARGET_NODE=$TARGET_NODE" >> /tmp/cluster-config
  3. Check node resources and CPU topology:

    # Load cluster configuration
    source /tmp/cluster-config
    
    # Get detailed node information
    echo "=== Node Information for $TARGET_NODE ==="
    oc describe node $TARGET_NODE | grep -E "(Name:|Roles:|Capacity:|Allocatable:)" -A 10
    
    # Check CPU information
    echo "=== CPU Topology ==="
    oc debug node/$TARGET_NODE -- chroot /host lscpu | grep -E "(CPU\(s\)|Thread|Core|Socket|NUMA)"
  4. Configure node labels based on cluster type:

    # Load cluster configuration
    source /tmp/cluster-config
    
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        echo "=== SNO Configuration ==="
        echo "✅ Single Node OpenShift detected - no additional labeling needed"
        echo "Performance Profile will target: node-role.kubernetes.io/master"
    
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo "=== Multi-Node Configuration ==="
        echo "🏷️  Labeling worker node for performance tuning..."
    
        # Label the worker node(s) for performance tuning
        oc label node $TARGET_NODE node-role.kubernetes.io/worker-rt= --overwrite
    
        # Verify the label
        echo "✅ Labeled nodes:"
        oc get nodes --show-labels | grep worker-rt
    
    else
        echo "=== Multi-Master Configuration ==="
        echo "🏷️  Labeling master node for performance tuning..."
    
        # Label a master node for performance tuning (advanced scenario)
        oc label node $TARGET_NODE node-role.kubernetes.io/master-rt= --overwrite
    
        # Verify the label
        echo "✅ Labeled nodes:"
        oc get nodes --show-labels | grep master-rt
    fi

Step 2: Create Machine Config Pool (Multi-Node Only)

For multi-node clusters, create a dedicated Machine Config Pool to isolate performance configuration changes. SNO clusters can skip this step.

  1. Create Machine Config Pool based on cluster type:

    # Load cluster configuration
    source /tmp/cluster-config
    
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        echo "=== SNO Configuration ==="
        echo "✅ Skipping Machine Config Pool creation for Single Node OpenShift"
        echo "📝 SNO uses existing 'master' Machine Config Pool"
        echo ""
        echo "Existing Machine Config Pools:"
        oc get mcp
    
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo "=== Multi-Node Configuration ==="
        echo "🔧 Creating dedicated worker-rt Machine Config Pool..."
    
        cat << 'EOF' | oc apply -f -
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker-rt
      labels:
        machineconfiguration.openshift.io/role: worker-rt
    spec:
      machineConfigSelector:
        matchExpressions:
        - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-rt]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker-rt: ""
      paused: false
    EOF
    
        echo "✅ Created worker-rt Machine Config Pool"
    
    else
        echo "=== Multi-Master Configuration ==="
        echo "🔧 Creating dedicated master-rt Machine Config Pool..."
    
        cat << 'EOF' | oc apply -f -
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: master-rt
      labels:
        machineconfiguration.openshift.io/role: master-rt
    spec:
      machineConfigSelector:
        matchExpressions:
        - {key: machineconfiguration.openshift.io/role, operator: In, values: [master, master-rt]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/master-rt: ""
      paused: false
    EOF
    
        echo "✅ Created master-rt Machine Config Pool"
    fi
  2. Verify Machine Config Pool status:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Machine Config Pool Status ==="
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        echo "📋 SNO uses existing pools:"
        oc get mcp
    else
        echo "📋 All Machine Config Pools:"
        oc get mcp
    
        if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
            echo ""
            echo "📊 Worker-RT Pool Details:"
            oc get mcp worker-rt -o yaml | grep -A 5 -B 5 "readyMachineCount\|updatedMachineCount"
        else
            echo ""
            echo "📊 Master-RT Pool Details:"
            oc get mcp master-rt -o yaml | grep -A 5 -B 5 "readyMachineCount\|updatedMachineCount"
        fi
    fi

Step 3: Create the Performance Profile

Now we’ll create a Performance Profile that configures CPU isolation, HugePages, and real-time kernel settings optimized for your cluster architecture.

  1. Determine optimal CPU allocation based on your cluster type:

    # Run the CPU allocation calculator script
    bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh

    What This Script Does:

    • Detects CPU count on the target node

    • Calculates optimal CPU allocation based on cluster type (SNO, Multi-Node, Multi-Master)

    • Validates CPU ranges to prevent configuration errors

    • Saves configuration to /tmp/cluster-config for next steps

    Benefits of Using the Script:

    • Error Prevention: Validates CPU ranges before saving

    • Handles Edge Cases: Prevents division by zero and invalid ranges

    • Clear Output: Shows allocation strategy and percentages

    • Reusable: Can be run multiple times to recalculate

    The script implements workshop-friendly conservative allocation that preserves cluster functionality while demonstrating performance benefits.
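
    For reference, this is a simplified sketch of the kind of calculation the script performs. It assumes a single NUMA node and at least four CPUs; the actual script adds range validation, edge-case handling, and per-cluster-type tuning.

    # Sketch only - illustrates the allocation idea, not the full script
    source /tmp/cluster-config
    CPU_COUNT=$(oc debug node/$TARGET_NODE -- chroot /host nproc 2>/dev/null)

    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        # Conservative: keep roughly half the CPUs for the control plane
        RESERVED_COUNT=$(( CPU_COUNT / 2 ))
    else
        # Dedicated workers: a small fixed reservation for system daemons
        RESERVED_COUNT=2
    fi

    RESERVED_CPUS="0-$(( RESERVED_COUNT - 1 ))"
    ISOLATED_CPUS="${RESERVED_COUNT}-$(( CPU_COUNT - 1 ))"
    echo "Reserved CPUs: $RESERVED_CPUS"
    echo "Isolated CPUs: $ISOLATED_CPUS"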

  2. Create the Performance Profile optimized for your cluster architecture:

    # Run the Performance Profile creation script
    bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh

    What This Script Does:

    • Loads and validates CPU allocation from previous step

    • Validates CPU ranges to prevent invalid PerformanceProfile

    • Determines HugePages allocation based on cluster type

    • Creates PerformanceProfile with proper node selector

    • Shows configuration before applying (requires confirmation)

    Benefits of Using the Script:

    • Prevents Errors: Validates CPU ranges before creating PerformanceProfile

    • Interactive: Shows configuration and asks for confirmation

    • Clear Output: Displays profile summary and next steps

    • Error Handling: Provides helpful error messages if creation fails

    The script will ask for confirmation before applying the PerformanceProfile.

    If you see an error like invalid range "0-2", this means the CPU allocation calculation failed. Run the CPU allocation calculator script again:

    bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh

    Then retry the Performance Profile creation.
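
    Once the profile has been applied, you can inspect what was created and what the Node Tuning Operator rendered from it. The exact names of the rendered objects vary between versions; the grep below is just a convenient filter.

    # Show the PerformanceProfile created by the script
    oc get performanceprofile

    # Review the full spec (CPU ranges, HugePages, RT kernel flag, node selector)
    oc get performanceprofile -o yaml

    # The Node Tuning Operator renders MachineConfigs and Tuned profiles from the CR
    oc get mc | grep -i performance
    oc get tuned -n openshift-cluster-node-tuning-operator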

Step 4: Monitor the Performance Profile Application

The Performance Profile will trigger a rolling update of your nodes. This process includes installing the real-time kernel and applying all the specified optimizations. The monitoring approach varies by cluster architecture.

  1. Monitor Machine Config Pool status based on cluster type:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Monitoring Performance Profile Application ==="
    echo "Cluster type: $CLUSTER_TYPE"
    echo "Profile name: $PROFILE_NAME"
    echo "Target node: $TARGET_NODE"
    
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        echo ""
        echo "🔍 SNO Monitoring Strategy:"
        echo "   - Monitor 'master' Machine Config Pool"
        echo "   - Single node will reboot (cluster temporarily unavailable)"
        echo "   - Expect 5-15 minutes for RT kernel installation"
        echo ""
        echo "📊 Current master MCP status:"
        oc get mcp master
        echo ""
        echo "⏱️  Starting continuous monitoring (Ctrl+C to stop):"
        echo "watch 'oc get mcp master; echo; oc get nodes'"
    
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo ""
        echo "🔍 Multi-Node Monitoring Strategy:"
        echo "   - Monitor 'worker-rt' Machine Config Pool"
        echo "   - Worker nodes will reboot sequentially"
        echo "   - Control plane remains available"
        echo ""
        echo "📊 Current worker-rt MCP status:"
        oc get mcp worker-rt
        echo ""
        echo "⏱️  Starting continuous monitoring (Ctrl+C to stop):"
        echo "watch 'oc get mcp worker-rt; echo; oc get nodes'"
    
    else
        echo ""
        echo "🔍 Multi-Master Monitoring Strategy:"
        echo "   - Monitor 'master-rt' Machine Config Pool"
        echo "   - Master nodes will reboot sequentially"
        echo "   - Cluster maintains quorum during updates"
        echo ""
        echo "📊 Current master-rt MCP status:"
        oc get mcp master-rt
        echo ""
        echo "⏱️  Starting continuous monitoring (Ctrl+C to stop):"
        echo "watch 'oc get mcp master-rt; echo; oc get nodes'"
    fi
  2. Monitor node updates in detail:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Detailed Node Update Monitoring ==="
    
    # Check machine config daemon status
    echo "📋 Machine Config Daemon Pods:"
    oc get pods -n openshift-machine-config-operator | grep daemon
    
    echo ""
    echo "📋 Recent Node Events:"
    oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -10
    
    echo ""
    echo "📋 Machine Config Status:"
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        oc describe mcp master | grep -A 10 -B 5 "Conditions:"
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        oc describe mcp worker-rt | grep -A 10 -B 5 "Conditions:"
    else
        oc describe mcp master-rt | grep -A 10 -B 5 "Conditions:"
    fi
  3. Wait for the update to complete:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Waiting for Performance Profile Application ==="
    echo "This process may take 10-20 minutes depending on your environment"
    
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        echo ""
        echo "⏳ SNO Update Process:"
        echo "   1. Machine config generation"
        echo "   2. Node cordoning and draining"
        echo "   3. RT kernel installation and reboot"
        echo "   4. Node rejoin and ready state"
        echo ""
        echo "🔄 Waiting for master MCP to be updated..."
        oc wait --for=condition=Updated mcp/master --timeout=1200s
    
        echo "🔄 Waiting for node to be ready after reboot..."
        oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
    
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo ""
        echo "⏳ Multi-Node Update Process:"
        echo "   1. Worker-RT MCP configuration"
        echo "   2. Sequential worker node updates"
        echo "   3. RT kernel installation per node"
        echo "   4. Node rejoin and ready state"
        echo ""
        echo "🔄 Waiting for worker-rt MCP to be updated..."
        oc wait --for=condition=Updated mcp/worker-rt --timeout=1200s
    
        echo "🔄 Waiting for target node to be ready..."
        oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
    
    else
        echo ""
        echo "⏳ Multi-Master Update Process:"
        echo "   1. Master-RT MCP configuration"
        echo "   2. Sequential master node updates"
        echo "   3. RT kernel installation per node"
        echo "   4. Node rejoin and ready state"
        echo ""
        echo "🔄 Waiting for master-rt MCP to be updated..."
        oc wait --for=condition=Updated mcp/master-rt --timeout=1200s
    
        echo "🔄 Waiting for target node to be ready..."
        oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
    fi
    
    echo ""
    echo "✅ Performance Profile application completed!"
    echo "📊 Final status:"
    oc get nodes
    oc get mcp
    
    echo ""
    echo "🧪 Testing cluster functionality after performance tuning..."
    
    # Create a simple test pod to verify the cluster is still functional
    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: post-tuning-test
      namespace: default
      labels:
        app: post-tuning-test
    spec:
      containers:
      - name: test-container
        image: registry.redhat.io/ubi8/ubi-minimal:latest
        command: ["sleep"]
        args: ["30"]
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "64Mi"
            cpu: "100m"
      restartPolicy: Never
    EOF
    
    # Wait for pod to be scheduled and running
    echo "⏱️  Waiting for test pod to verify cluster functionality..."
    if oc wait --for=condition=Ready pod/post-tuning-test --timeout=120s -n default 2>/dev/null; then
        echo "   ✅ Cluster is functional after performance tuning!"
        echo "   📍 Test pod scheduled successfully"
        oc delete pod post-tuning-test -n default --ignore-not-found=true
    else
        echo "   ⚠️  Cluster may have scheduling issues after performance tuning"
        echo "   💡 Consider using the revert script if problems persist"
        echo "   🔍 Check pod status: oc describe pod post-tuning-test -n default"
    fi

Step 5: Verify Performance Profile Effects

Once the update is complete, verify that all the performance optimizations have been applied correctly across your cluster architecture.

  1. Comprehensive verification using Python health check script:

    echo "=== Performance Profile Verification ==="
    echo "Running comprehensive cluster health check..."
    echo ""
    
    # Use Python script for thorough verification
    python3 ~/low-latency-performance-workshop/scripts/module04-cluster-health-check.py
    
    echo ""
    echo "✅ Comprehensive verification completed!"
    echo ""
    echo "💡 The health check script validates:"
    echo "   - Cluster architecture detection"
    echo "   - Performance Profile status"
    echo "   - Real-time kernel installation"
    echo "   - CPU isolation configuration"
    echo "   - Pod scheduling functionality"
  2. Get a quick performance tuning summary:

    echo "=== Performance Tuning Summary ==="
    echo ""
    
    # Get color-coded summary of current performance settings
    python3 ~/low-latency-performance-workshop/scripts/module04-performance-summary.py
    
    echo ""
    echo "💡 This summary shows:"
    echo "   - Current CPU allocation strategy"
    echo "   - Performance vs stability balance"
    echo "   - Recommendations for optimization"

Detailed Performance Tuning Validation

The workshop provides additional scripts to help with CPU allocation, Performance Profile creation, and validation.

  1. CPU Allocation Calculator - Calculate optimal CPU allocation:

    # Calculate CPU allocation based on cluster type
    bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh

    This script:

    • Detects CPU count on target node

    • Calculates optimal reserved/isolated CPU allocation

    • Validates CPU ranges to prevent errors

    • Saves configuration for Performance Profile creation

  2. Performance Profile Creator - Create validated Performance Profile:

    # Create Performance Profile with validated configuration
    bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh

    This script:

    • Validates CPU allocation from previous step

    • Prevents invalid CPU range errors

    • Shows configuration before applying

    • Requires confirmation for safety

  3. Performance Tuning Validator - Comprehensive validation of Performance Profile:

    # Validate Performance Profile configuration
    python3 ~/low-latency-performance-workshop/scripts/module04-tuning-validator.py
    
    # Skip educational explanations for automation
    python3 ~/low-latency-performance-workshop/scripts/module04-tuning-validator.py --skip-explanation

    This script validates:

    • Performance Profile existence and configuration

    • Machine Config Pool (MCP) status and readiness

    • Real-Time kernel installation on target nodes

    • Overall tuning configuration health

  4. CPU Isolation Checker - Detailed CPU allocation analysis:

    # Check CPU isolation configuration
    python3 ~/low-latency-performance-workshop/scripts/module04-cpu-isolation-checker.py
    
    # Specify a custom Performance Profile
    python3 ~/low-latency-performance-workshop/scripts/module04-cpu-isolation-checker.py \
        --profile my-performance-profile

    This script provides:

    • Visual representation of CPU allocation

    • Reserved vs isolated CPU validation

    • CPU allocation strategy explanation

    • Best practices for CPU isolation

    • Configuration recommendations

  5. HugePages Validator - HugePages configuration verification:

    # Validate HugePages configuration
    python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py
    
    # Check specific Performance Profile
    python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py \
        --profile my-performance-profile

    This script validates:

    • HugePages configuration in Performance Profile

    • Node HugePages allocation and availability

    • HugePages benefits and use cases

    • How to use HugePages in pod specifications (see the example pod spec below)

    • Total HugePages memory allocation

These validation scripts are educational tools that help you understand:

  • What each performance tuning component does

  • How to verify configurations are correct

  • Best practices for low-latency tuning

  • Troubleshooting common issues

Run them after applying Performance Profiles to ensure everything is configured correctly.
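
For example, a workload consumes the HugePages configured by the Performance Profile by requesting them explicitly in its pod spec. This is a minimal sketch assuming 2Mi pages were configured; HugePages requests must equal their limits, and a regular memory request is still required.

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep", "3600"]
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
        hugepages-2Mi: "128Mi"   # must equal the limit
      limits:
        memory: "128Mi"
        cpu: "100m"
        hugepages-2Mi: "128Mi"
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages          # backs the mount with HugePages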

Performance Testing: Measuring Improvements

Now let’s repeat the Module 3 baseline workload against the tuned cluster to measure the improvements from our optimizations.

Step 6: Re-run Performance Tests

  1. Re-run the kube-burner performance test on the optimized cluster:

    cd ~/kube-burner-configs
    
    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Creating Tuned Performance Test Configuration ==="
    echo "Cluster type: $CLUSTER_TYPE"
    echo "Profile name: $PROFILE_NAME"
    echo "Target node: $TARGET_NODE"
    
    # Create a new test configuration for the tuned cluster
    cat > tuned-config.yml << EOF
    global:
      measurements:
        - name: podLatency
          thresholds:
            - conditionType: Ready
              metric: P99
              threshold: 15000ms  # Expect better performance after tuning
    
    metricsEndpoints:
      - indexer:
          type: local
          metricsDirectory: collected-metrics-tuned
    
    jobs:
      - name: tuned-workload
        jobType: create
        jobIterations: 20
        namespace: tuned-workload
        namespacedIterations: true
        cleanup: false
        podWait: false
        waitWhenFinished: true
        verifyObjects: true
        errorOnVerify: false
        objects:
          - objectTemplate: tuned-pod.yml
            replicas: 5
            inputVars:
              containerImage: registry.redhat.io/ubi8/ubi:latest
    EOF
    
    # Create a tuned pod template with appropriate node selector
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        NODE_SELECTOR_YAML='nodeSelector:
        node-role.kubernetes.io/master: ""'
        echo "📝 SNO: Using master node selector for pod placement"
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        NODE_SELECTOR_YAML='nodeSelector:
        node-role.kubernetes.io/worker-rt: ""'
        echo "📝 Multi-Node: Using worker-rt node selector for pod placement"
    else
        NODE_SELECTOR_YAML='nodeSelector:
        node-role.kubernetes.io/master-rt: ""'
        echo "📝 Multi-Master: Using master-rt node selector for pod placement"
    fi
    
    cat > tuned-pod.yml << EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: tuned-pod-{{.Iteration}}-{{.Replica}}
      labels:
        app: tuned-test
        iteration: "{{.Iteration}}"
        cluster-type: "$CLUSTER_TYPE"
    spec:
      $NODE_SELECTOR_YAML
      containers:
      - name: tuned-container
        image: {{.containerImage}}
        command: ["sleep"]
        args: ["300"]
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
      restartPolicy: Never
    EOF
    
    echo ""
    echo "🚀 Running Performance Test on Tuned Cluster"
    echo "   - Test configuration: tuned-config.yml"
    echo "   - Pod template: tuned-pod.yml"
    echo "   - Target: $CLUSTER_TYPE cluster with $PROFILE_NAME"
    echo ""
    
    # Run the performance test
    kube-burner init -c tuned-config.yml --log-level=info
    
    echo ""
    echo "✅ Tuned performance test completed!"
    echo "📊 Results stored in: collected-metrics-tuned/"
  2. Analyze the tuned performance results using Python script:

    cd ~/kube-burner-configs
    
    echo "🔍 Analyzing tuned performance results..."
    echo ""
    
    # Use Python script for clean, color-coded analysis
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py --single collected-metrics-tuned
    
    echo ""
    echo "✅ Tuned performance analysis completed!"
    echo ""
    echo "💡 The enhanced analysis provides:"
    echo "   � Educational context explaining what each metric means"
    echo "   🎯 Performance explanations (why some metrics are slower)"
    echo "   🚀 Color-coded results: Excellent/Good/Needs Attention"
    echo "   📋 Suggested next steps based on your results"
  3. Compare results with your baseline using Python analysis:

    cd ~/kube-burner-configs
    
    echo "📊 Comparing baseline vs tuned performance..."
    echo ""
    
    # Use module-specific analysis script for clean Module 4 results
    echo "🎯 Module 4 Focused Analysis (Container Performance Only)..."
    python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 4
    
    echo ""
    echo "📄 Generating Module 4 Performance Report..."
    echo "   🎯 Focus: Container performance optimization (baseline vs tuned)"
    echo "   📊 Scope: CPU isolation and HugePages impact on pod startup"
    echo ""
    
    # Generate Module 4 specific markdown report (baseline vs tuned only)
    REPORT_FILE="module4-performance-comparison-$(date +%Y%m%d-%H%M).md"
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
        --baseline collected-metrics \
        --tuned collected-metrics-tuned \
        --report "$REPORT_FILE"
    
    echo ""
    echo "💡 Module 4 Analysis Scope:"
    echo "   ✅ Baseline container performance (from Module 3)"
    echo "   ✅ Tuned container performance (from Module 4)"
    echo "   ⚠️  Note: If VMI data exists from Module 5, it will also be shown"
    echo "   🎯 Focus on the 'Performance Comparison' section for Module 4 results"
    echo "   ℹ️  Comprehensive analysis across all modules happens in Module 6"
    
    echo ""
    echo "📚 How to Read the Analysis:"
    echo "   1. Individual test sections show raw performance data"
    echo "   2. 'Performance Comparison' section shows Module 4 improvements"
    echo "   3. VMI data (if shown) is for reference - focus on container metrics"
    
    echo ""
    echo "📊 Performance Comparison Summary:"
    echo "=================================="
    if [ -f "$REPORT_FILE" ]; then
        # Display key sections of the report
        head -30 "$REPORT_FILE"
        echo ""
        echo "📄 Full report available at: $REPORT_FILE"
    else
        echo "⚠️  Report generation failed - check if both baseline and tuned metrics exist"
    fi
    
    echo ""
    echo "✅ Performance comparison completed!"
    echo ""
    echo "💡 The comparison analysis explains:"
    echo "   � What P99/P95/P50 improvements mean for your workloads"
    echo "   🎯 Why scheduling became instant (0ms) with CPU isolation"
    echo "   ⚖️ Why container operations may be slower (expected trade-off)"
    echo "   🏆 Overall assessment of your performance tuning effectiveness"

Step 7: Validate kube-burner Test Pods and Cluster Stability

Before proceeding, let’s validate that our performance tuning doesn’t interfere with normal cluster operations and that kube-burner can successfully create and schedule pods.

  1. Check the status of kube-burner test pods:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Validating kube-burner Test Pod Scheduling ==="
    echo "Cluster type: $CLUSTER_TYPE"
    echo "Target node: $TARGET_NODE"
    
    # Check if tuned test pods were created successfully
    echo ""
    echo "📋 Tuned Test Pod Status:"
    TUNED_PODS=$(oc get pods -A -l app=tuned-test --no-headers 2>/dev/null | wc -l)
    if [ $TUNED_PODS -gt 0 ]; then
        echo "   ✅ Found $TUNED_PODS tuned test pods"
        echo ""
        echo "   📊 Pod Distribution by Node:"
        oc get pods -A -l app=tuned-test -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sed 's/^/      /'
    
        echo ""
        echo "   📊 Pod Status Summary:"
        oc get pods -A -l app=tuned-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/      /'
    
        # Check for any failed pods
        FAILED_PODS=$(oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | wc -l)
        if [ $FAILED_PODS -gt 0 ]; then
            echo ""
            echo "   ⚠️  Found $FAILED_PODS pods not in Running/Completed state:"
            oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | sed 's/^/      /'
        fi
    else
        echo "   ⚠️  No tuned test pods found - this may indicate scheduling issues"
    fi
    
    # Check baseline test pods as well
    echo ""
    echo "📋 Baseline Test Pod Status:"
    BASELINE_PODS=$(oc get pods -A -l app=baseline-test --no-headers 2>/dev/null | wc -l)
    if [ $BASELINE_PODS -gt 0 ]; then
        echo "   ✅ Found $BASELINE_PODS baseline test pods"
        echo "   📊 Baseline Pod Status Summary:"
        oc get pods -A -l app=baseline-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/      /'
    else
        echo "   ℹ️  No baseline test pods found (may have been cleaned up)"
    fi
  2. Test cluster responsiveness and pod scheduling:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Testing Cluster Responsiveness ==="
    
    # Create a simple test pod to verify scheduling works
    echo "🧪 Creating test pod to verify cluster functionality..."
    
    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cluster-health-test
      namespace: default
      labels:
        app: cluster-health-test
    spec:
      containers:
      - name: test-container
        image: registry.redhat.io/ubi8/ubi-minimal:latest
        command: ["sleep"]
        args: ["60"]
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "64Mi"
            cpu: "100m"
      restartPolicy: Never
    EOF
    
    # Wait for pod to be scheduled and running
    echo "⏱️  Waiting for test pod to start..."
    oc wait --for=condition=Ready pod/cluster-health-test --timeout=60s -n default
    
    if [ $? -eq 0 ]; then
        echo "   ✅ Test pod started successfully - cluster is responsive"
    
        # Check which node it was scheduled on
        TEST_POD_NODE=$(oc get pod cluster-health-test -n default -o jsonpath='{.spec.nodeName}')
        echo "   📍 Test pod scheduled on node: $TEST_POD_NODE"
    
        # Clean up test pod
        oc delete pod cluster-health-test -n default --ignore-not-found=true
    else
        echo "   ❌ Test pod failed to start - cluster may have scheduling issues"
        echo "   🔍 Pod events:"
        oc describe pod cluster-health-test -n default | grep -A 10 "Events:" | sed 's/^/      /'
    fi

Step 8: Optional - Revert Performance Tuning for Workshop Stability

If you experience any issues with cluster stability or want to continue with other workshop modules without the aggressive performance tuning, you can revert the changes.

  1. Create a revert script for easy cleanup:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Creating Performance Tuning Revert Script ==="
    
    # Create revert script
    cat > ~/revert-performance-tuning.sh << 'EOF'
    #!/bin/bash
    
    # Load cluster configuration
    if [ -f /tmp/cluster-config ]; then
        source /tmp/cluster-config
        echo "=== Reverting Performance Tuning ==="
        echo "Cluster type: $CLUSTER_TYPE"
        echo "Profile name: $PROFILE_NAME"
        echo "Target node: $TARGET_NODE"
    else
        echo "❌ Cluster configuration not found. Cannot revert automatically."
        echo "💡 You can manually delete performance profiles with:"
        echo "   oc get performanceprofile"
        echo "   oc delete performanceprofile <profile-name>"
        exit 1
    fi
    
    # Delete the Performance Profile
    echo ""
    echo "🗑️  Removing Performance Profile: $PROFILE_NAME"
    if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then
        oc delete performanceprofile $PROFILE_NAME
        echo "   ✅ Performance Profile deleted"
    else
        echo "   ℹ️  Performance Profile not found (may already be deleted)"
    fi
    
    # Remove custom Machine Config Pool (if created)
    if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo ""
        echo "🗑️  Removing worker-rt Machine Config Pool..."
        if oc get mcp worker-rt >/dev/null 2>&1; then
            oc delete mcp worker-rt
            echo "   ✅ worker-rt Machine Config Pool deleted"
        else
            echo "   ℹ️  worker-rt Machine Config Pool not found"
        fi
    
        # Remove worker-rt label from nodes
        echo ""
        echo "🏷️  Removing worker-rt labels from nodes..."
    oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt-
    
    elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
        echo ""
        echo "🗑️  Removing master-rt Machine Config Pool..."
        if oc get mcp master-rt >/dev/null 2>&1; then
            oc delete mcp master-rt
            echo "   ✅ master-rt Machine Config Pool deleted"
        else
            echo "   ℹ️  master-rt Machine Config Pool not found"
        fi
    
        # Remove master-rt label from nodes
        echo ""
        echo "🏷️  Removing master-rt labels from nodes..."
        oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt-
    fi
    
    echo ""
    echo "⏳ Waiting for nodes to revert to standard kernel..."
    echo "   💡 This process will take 10-15 minutes as nodes reboot"
    echo "   📊 Monitor progress with: watch 'oc get nodes; oc get mcp'"
    
    # Wait for machine config pools to be updated
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        echo "   🔄 Waiting for master MCP to update..."
        oc wait --for=condition=Updated mcp/master --timeout=1200s
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo "   🔄 Waiting for worker MCP to update..."
        oc wait --for=condition=Updated mcp/worker --timeout=1200s
    else
        echo "   🔄 Waiting for master MCP to update..."
        oc wait --for=condition=Updated mcp/master --timeout=1200s
    fi
    
    echo ""
    echo "✅ Performance tuning revert completed!"
    echo "🔍 Verify with: oc debug node/$TARGET_NODE -- chroot /host uname -r"
    echo "   (Should show standard kernel without 'rt')"
    
    EOF
    
    chmod +x ~/revert-performance-tuning.sh
    
    echo "✅ Revert script created: ~/revert-performance-tuning.sh"
    echo ""
    echo "💡 To revert performance tuning later, run:"
    echo "   ~/revert-performance-tuning.sh"
    echo ""
    echo "⚠️  Note: Reverting will cause nodes to reboot back to standard kernel"
  2. Optional - Run the revert script if needed:

    echo "=== Performance Tuning Revert Decision ==="
    echo ""
    echo "🤔 Do you want to revert the performance tuning now?"
    echo ""
    echo "✅ Keep Performance Tuning if:"
    echo "   - Cluster is stable and responsive"
    echo "   - Test pods are scheduling successfully"
    echo "   - You want to continue with optimized performance"
    echo ""
    echo "🔄 Revert Performance Tuning if:"
    echo "   - Experiencing cluster stability issues"
    echo "   - Pods are failing to schedule"
    echo "   - Want to continue workshop with standard settings"
    echo ""
    echo "💡 You can always re-apply performance tuning later by re-running this module"
    echo ""
    echo "To revert now, run: ~/revert-performance-tuning.sh"
    echo "To keep current settings, continue to the next step"

Expected Improvements

With proper performance tuning, you should see significant improvements:

Typical Improvements
  • Pod Creation P99: 50-70% reduction in latency

  • Pod Creation P95: 40-60% reduction in latency

  • Consistency: Much lower variance between P50 and P99

  • Jitter Reduction: More predictable response times

Performance Factors
  • CPU Isolation: Eliminates interference from system processes (see the pinned-CPU pod sketch below)

  • Real-time Kernel: Provides deterministic scheduling

  • HugePages: Reduces memory management overhead

  • NUMA Optimization: Ensures local memory access

Troubleshooting Common Issues

This section provides architecture-specific troubleshooting guidance for common Performance Profile issues.

Architecture-Specific Troubleshooting

Node Not Updating

If nodes don’t start updating after Performance Profile creation:

# Load cluster configuration
source /tmp/cluster-config

echo "=== Node Update Troubleshooting ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Target node: $TARGET_NODE"

if [ "$CLUSTER_TYPE" = "SNO" ]; then
    echo ""
    echo "🔍 SNO Troubleshooting:"
    echo "   - Check master Machine Config Pool status"
    oc describe mcp master | grep -A 10 -B 5 "Conditions:"

    echo ""
    echo "   - Check for conflicting machine configs:"
    oc get mc | grep master

elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
    echo ""
    echo "🔍 Multi-Node Troubleshooting:"
    echo "   - Check worker-rt Machine Config Pool status"
    oc describe mcp worker-rt | grep -A 10 -B 5 "Conditions:"

    echo ""
    echo "   - Check for conflicting machine configs:"
    oc get mc | grep worker-rt

else
    echo ""
    echo "🔍 Multi-Master Troubleshooting:"
    echo "   - Check master-rt Machine Config Pool status"
    oc describe mcp master-rt | grep -A 10 -B 5 "Conditions:"

    echo ""
    echo "   - Check for conflicting machine configs:"
    oc get mc | grep master-rt
fi

echo ""
echo "📋 Performance Profile Status:"
oc describe performanceprofile $PROFILE_NAME | grep -A 10 -B 5 "Status:"

Real-time Kernel Issues

If the RT kernel fails to install:

# Load cluster configuration
source /tmp/cluster-config

echo "=== Real-Time Kernel Troubleshooting ==="
echo "Target node: $TARGET_NODE"

# Check node events for errors
echo "📋 Recent Node Events:"
oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -15

echo ""
echo "📋 Machine Config Daemon Logs:"
MCD_POD=$(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$MCD_POD" ]; then
    echo "   MCD Pod: $MCD_POD"
    oc logs -n openshift-machine-config-operator $MCD_POD --tail=20
else
    echo "   ❌ No MCD pod found for node $TARGET_NODE"
fi

echo ""
echo "📋 RT Kernel Package Availability:"
oc debug node/$TARGET_NODE -- chroot /host sh -c 'yum list available | grep kernel-rt || dnf list available | grep kernel-rt' 2>/dev/null || echo "   Unable to check RT kernel packages"

echo ""
echo "📋 Current Kernel Information:"
oc debug node/$TARGET_NODE -- chroot /host uname -a 2>/dev/null || echo "   Unable to check current kernel"

HugePages Not Allocated

If HugePages aren’t configured correctly:

# Load cluster configuration
source /tmp/cluster-config

echo "=== HugePages Troubleshooting ==="
echo "Target node: $TARGET_NODE"

# Check available memory
echo "📋 Memory Information:"
oc debug node/$TARGET_NODE -- chroot /host free -h 2>/dev/null || echo "   Unable to check memory"

echo ""
echo "📋 HugePages Configuration:"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/meminfo 2>/dev/null | grep -i huge || echo "   No HugePages information found"

echo ""
echo "📋 HugePages Mount Points:"
oc debug node/$TARGET_NODE -- chroot /host mount 2>/dev/null | grep huge || echo "   No HugePages mount points found"

echo ""
echo "📋 Kernel Command Line (HugePages args):"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/cmdline 2>/dev/null | grep -o 'hugepages[^[:space:]]*' || echo "   No HugePages kernel arguments found"

echo ""
echo "📋 Performance Profile HugePages Spec:"
oc get performanceprofile $PROFILE_NAME -o jsonpath='{.spec.hugepages}' | jq '.' 2>/dev/null || echo "   Unable to read Performance Profile HugePages spec"

Performance Profile Not Applied

If the Performance Profile exists but optimizations aren’t applied:

# Load cluster configuration
source /tmp/cluster-config

echo "=== Performance Profile Application Troubleshooting ==="

# Check Performance Profile status
echo "📋 Performance Profile Status:"
oc get performanceprofile $PROFILE_NAME -o yaml | grep -A 20 "status:" || echo "   No status information available"

echo ""
echo "📋 Node Tuning Operator Status:"
oc get pods -n openshift-cluster-node-tuning-operator

echo ""
echo "📋 TuneD Daemon Status:"
TUNED_POD=$(oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$TUNED_POD" ]; then
    echo "   TuneD Pod: $TUNED_POD"
    oc logs -n openshift-cluster-node-tuning-operator $TUNED_POD --tail=10
else
    echo "   ❌ No TuneD pod found for node $TARGET_NODE"
fi

echo ""
echo "📋 Generated TuneD Profiles:"
oc get tuned -n openshift-cluster-node-tuning-operator

Module Summary

In this module, you have successfully implemented architecture-aware performance tuning:

  • Detected your cluster architecture (SNO, Multi-Node, or Multi-Master)

  • Created an optimized Performance Profile tailored to your environment

  • Configured CPU isolation appropriate for your cluster’s control plane needs

  • Allocated HugePages sized correctly for your available resources

  • Applied real-time kernel settings for deterministic scheduling

  • Verified all performance optimizations are working correctly

  • Measured significant performance improvements through comparative testing

Architecture-Specific Achievements

Single Node OpenShift (SNO):

  • Conservative CPU allocation preserving control plane stability

  • Optimized HugePages configuration for resource-constrained environments

  • Master node targeting with appropriate performance isolation

Multi-Node Clusters:

  • Aggressive CPU isolation on dedicated worker nodes

  • Maximum performance optimization without control plane impact

  • Dedicated Machine Config Pool for isolated performance tuning

Multi-Master Clusters:

  • Balanced CPU allocation for control plane and workload performance

  • Strategic node selection for performance optimization

  • Maintained cluster stability during rolling updates

Key Takeaways
  • Performance Profiles adapt automatically to different cluster architectures

  • CPU isolation strategies must account for control plane requirements

  • Real-time kernels provide predictable, low-latency scheduling across all architectures

  • HugePages allocation should be sized appropriately for available resources

  • Proper architecture-aware tuning can achieve 50-70% latency improvements

  • SNO environments can achieve exceptional performance due to simplified architecture

Performance Impact Summary

Based on your cluster type, you should observe:

  • Pod Creation Latency: 50-70% reduction in P99 times

  • Consistency: Dramatically reduced variance between P50 and P99

  • Jitter: More predictable and deterministic response times

  • Resource Utilization: Optimized CPU and memory usage patterns

Optional: Reverting Performance Tuning for Workshop Continuation

If you need to revert the performance tuning to continue with other workshop modules or if you experience any cluster stability issues, you can easily remove the performance optimizations.

Creating a Revert Script

  1. Create an automated revert script:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Creating Performance Tuning Revert Script ==="
    
    # Create revert script
    cat > ~/revert-performance-tuning.sh << 'EOF'
    #!/bin/bash
    
    # Load cluster configuration
    if [ -f /tmp/cluster-config ]; then
        source /tmp/cluster-config
        echo "=== Reverting Performance Tuning ==="
        echo "Cluster type: $CLUSTER_TYPE"
        echo "Profile name: $PROFILE_NAME"
        echo "Target node: $TARGET_NODE"
    else
        echo "❌ Cluster configuration not found. Attempting manual cleanup..."
        PROFILE_NAME=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
        if [ -z "$PROFILE_NAME" ]; then
            echo "No Performance Profiles found to delete."
            exit 0
        fi
    fi
    
    # Delete the Performance Profile
    echo ""
    echo "🗑️  Removing Performance Profile: $PROFILE_NAME"
    if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then
        oc delete performanceprofile $PROFILE_NAME
        echo "   ✅ Performance Profile deleted"
    else
        echo "   ℹ️  Performance Profile not found (may already be deleted)"
    fi
    
    # Remove custom Machine Config Pool and labels (if created)
    if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo ""
        echo "🗑️  Cleaning up worker-rt configuration..."
        oc delete mcp worker-rt --ignore-not-found=true
        oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt-
        echo "   ✅ Worker-rt configuration cleaned up"
    
    elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
        echo ""
        echo "🗑️  Cleaning up master-rt configuration..."
        oc delete mcp master-rt --ignore-not-found=true
        oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt-
        echo "   ✅ Master-rt configuration cleaned up"
    fi
    
    echo ""
    echo "⏳ Waiting for nodes to revert to standard kernel..."
    echo "   💡 This process will take 10-15 minutes as nodes reboot"
    echo "   📊 Monitor progress with: watch 'oc get nodes; oc get mcp'"
    
    # Wait for machine config pools to be updated
    if [ "$CLUSTER_TYPE" = "SNO" ] || [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
        echo "   🔄 Waiting for master MCP to update..."
        oc wait --for=condition=Updated mcp/master --timeout=1200s
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo "   🔄 Waiting for worker MCP to update..."
        oc wait --for=condition=Updated mcp/worker --timeout=1200s
    fi
    
    echo ""
    echo "✅ Performance tuning revert completed!"
    echo "🔍 Verify standard kernel with:"
    echo "   oc debug node/\$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- chroot /host uname -r"
    echo "   (Should show standard kernel without 'rt')"
    
    EOF
    
    chmod +x ~/revert-performance-tuning.sh
    
    echo "✅ Revert script created: ~/revert-performance-tuning.sh"
    echo ""
    echo "💡 To revert performance tuning at any time, run:"
    echo "   ~/revert-performance-tuning.sh"
    echo ""
    echo "⚠️  Note: Reverting will cause nodes to reboot back to standard kernel"

When to Use the Revert Script

Revert Performance Tuning if you experience:
  • Pods failing to schedule or start

  • Cluster becoming unresponsive

  • High resource contention

  • Need to continue with other workshop modules that require standard settings

Keep Performance Tuning if:
  • Cluster is stable and responsive

  • Test pods are scheduling successfully

  • You want to continue with performance-optimized settings

  • Planning to proceed directly to Module 5 (Virtualization)

Manual Cleanup (Alternative)

If the automated script doesn’t work, you can manually clean up:

# Manual cleanup commands
echo "=== Manual Performance Tuning Cleanup ==="

# List and delete performance profiles
echo "📋 Current Performance Profiles:"
oc get performanceprofile

echo ""
echo "🗑️  Delete Performance Profile:"
echo "oc delete performanceprofile <profile-name>"

# List and delete custom MCPs
echo ""
echo "📋 Current Machine Config Pools:"
oc get mcp

echo ""
echo "🗑️  Delete custom MCP (if created):"
echo "oc delete mcp worker-rt  # or master-rt"

echo ""
echo "💡 After deletion, nodes will automatically reboot to standard kernel"

Next Steps

In Module 5, you will learn how to apply these performance optimizations to OpenShift Virtualization, creating high-performance virtual machines that leverage your tuned infrastructure to achieve near bare-metal latency characteristics. The architecture-aware approach you’ve learned will be essential for optimizing VM placement and resource allocation.

Workshop Flexibility: You can proceed to Module 5 either with or without the performance tuning active. The virtualization module will adapt to your current cluster configuration.