Module 4: Core Performance Tuning with Performance Profiles

Module Overview

This module is the heart of the workshop, where you’ll apply node-level performance tuning using OpenShift’s Performance Profile Controller. You’ll configure CPU isolation, HugePages allocation, and real-time kernel settings to dramatically improve latency characteristics.

Learning Objectives
  • Understand Performance Profiles and their components

  • Configure CPU isolation for high-performance workloads

  • Allocate and manage HugePages for reduced memory latency

  • Apply real-time kernel tuning profiles

  • Measure the performance improvements from your optimizations

Prerequisites

Before starting this module, ensure you have completed:

  • Module 1: Low-Latency Performance Fundamentals

  • Module 2: Environment Setup and Verification

  • Module 3: Baseline Performance Measurement and Analysis

  • Established baseline performance metrics using kube-burner from Module 3

Understanding Performance Profiles

The Performance Profile Controller (PPC) is integrated into the Node Tuning Operator and provides a declarative way to configure multiple low-latency optimizations through a single custom resource.

Key Components
  • CPU Management: Isolate specific CPUs for high-performance workloads

  • Memory Tuning: Configure HugePages and memory-related kernel parameters

  • Kernel Tuning: Apply real-time kernel settings and tuned profiles

  • NUMA Awareness: Optimize for Non-Uniform Memory Access architectures

Performance Profile Benefits
  • Single Configuration: One CR manages multiple complex tuning parameters

  • Declarative: Version-controlled, repeatable configuration

  • Node Pool Isolation: Apply tuning to specific worker nodes only

  • Rolling Updates: Orchestrated updates with minimal disruption
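These components and benefits come together in a single custom resource. Below is a minimal sketch of a PerformanceProfile; the profile name, CPU ranges, HugePages count, and node selector are illustrative placeholders, not the values the workshop scripts generate:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-profile          # illustrative name
spec:
  cpu:
    reserved: "0-3"              # housekeeping: kubelet, system daemons
    isolated: "4-15"             # dedicated to latency-sensitive workloads
  hugepages:
    defaultHugepagesSize: 2M
    pages:
      - size: 2M
        count: 1024              # 2 GiB of HugePages
  realTimeKernel:
    enabled: false               # requires bare-metal instances when true
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""
```

Applying a CR like this hands all of the node-level tuning to the Node Tuning Operator, which generates the machine configs and tuned profiles on your behalf.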

Hands-on Exercise: Creating a Performance Profile

Step 1: Identify Target Nodes and Environment

First, we’ll identify your cluster architecture and configure the appropriate nodes for performance tuning.

  1. Verify cluster access (you should already be connected via bastion):

    # Verify you're connected to the SNO cluster
    oc whoami
    oc get nodes
  2. Get your SNO node information:

    # Get the SNO node name
    TARGET_NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
    echo "Target node: $TARGET_NODE"
    
    # Get detailed node information
    echo "=== Node Information ==="
    oc describe node $TARGET_NODE | grep -E "(Name:|Roles:|Capacity:|Allocatable:)" -A 10
    
    # Check CPU information
    echo "=== CPU Topology ==="
    oc debug node/$TARGET_NODE -- chroot /host lscpu | grep -E "(CPU\(s\)|Thread|Core|Socket|NUMA)"

SNO Configuration Note

Since this is a Single Node OpenShift cluster, no additional node labeling is needed. The Performance Profile will automatically target the master node (which also acts as the worker in SNO).

Step 2: Verify Machine Config Pool (SNO)

For SNO clusters, the default master Machine Config Pool is used. No additional pool creation is needed.

  1. Verify Machine Config Pool status:

    # SNO uses the default master pool
    echo "=== Machine Config Pools ==="
    oc get mcp
    
    echo ""
    echo "📊 Master Pool Details:"
    oc get mcp master -o yaml | grep -E "(readyMachineCount|updatedMachineCount|machineCount)" | head -3

SNO Machine Config Pool

Single Node OpenShift uses the built-in master Machine Config Pool. The Performance Profile will automatically apply to this pool, so no additional configuration is needed.

Step 3: Create the Performance Profile

Now we’ll create a Performance Profile that configures CPU isolation, HugePages, and real-time kernel settings optimized for your cluster architecture.

Real-Time (RT) Kernel Requirements and Cost Considerations

The Linux Real-Time kernel extension (kernel-rt) provides deterministic scheduling and preemption capabilities essential for ultra-low latency workloads. However, it has specific infrastructure requirements:

AWS Instance Requirements:
* RT kernel requires bare-metal EC2 instances (instance types ending in .metal, e.g., m5zn.metal, c5.metal)
* RT kernel cannot run on virtualized instances (e.g., m5.4xlarge, c5.xlarge)
* This is because RT kernel needs direct hardware access for precise timing control

Cost Comparison (us-east-2 region):
* Virtualized instance (m5.4xlarge): ~$0.77/hour
* Bare-metal instance (m5zn.metal): ~$3.96/hour (5x more expensive)

Workshop Recommendation:

By default, the Performance Profile script creates a profile WITHOUT RT kernel enabled. This configuration still provides:
* ✅ CPU isolation (isolcpus)
* ✅ HugePages allocation
* ✅ NUMA topology optimization
* ✅ Kernel tuning parameters

This covers roughly 80% of the low-latency tuning concepts at a fraction of the cost, making it ideal for workshop use.
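As a rough illustration of how a HugePages budget might be sized (a hypothetical rule of thumb, not the workshop script's actual policy), dedicating about 10% of node memory to 2Mi pages works out as:

```shell
# Hypothetical sizing rule (assumption, not the workshop script's policy):
# dedicate roughly 10% of node memory to 2Mi HugePages, rounding down.
hugepages_2mi_count() {
    local mem_mib=$1
    echo $(( mem_mib / 10 / 2 ))
}

hugepages_2mi_count 65536   # 64 GiB node -> 3276 pages (about 6.4 GiB)
```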

To Enable RT Kernel (if using bare-metal instances):

If your cluster is deployed on bare-metal instances and you want to enable RT kernel:

# Enable RT kernel via environment variable
ENABLE_RT_KERNEL=true bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh

# Or use the command-line flag
bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh --enable-rt-kernel

The script will automatically detect your instance type and validate RT kernel requirements. If you attempt to enable RT kernel on a virtualized instance, the script will fail with a clear error message explaining the requirements.
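The bare-metal check reduces to a suffix test on the instance type; a simplified sketch of that validation (assumed logic, not the script's exact code):

```shell
# Simplified sketch of the bare-metal validation (assumed logic): RT kernel
# is only permitted when the AWS instance type ends in ".metal".
is_bare_metal() {
    [[ $1 == *.metal ]]
}

is_bare_metal "m5zn.metal" && echo "RT kernel supported"
is_bare_metal "m5.4xlarge" || echo "RT kernel not supported (virtualized instance)"
```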

SNO-Specific Considerations:

For Single Node OpenShift (SNO) clusters, applying a Performance Profile will cause the entire cluster to reboot, making the API server unavailable for 10-20 minutes.

Automatic Optimizations for Virtualized Instances:
* The scripts automatically detect virtualized instances (e.g., m5.4xlarge)
* For virtualized instances, the nosmt kernel parameter is automatically skipped to preserve hyperthreading
* CPU allocation is automatically adjusted to be more conservative, reserving more CPUs for the control plane
* This ensures the API server can start successfully after reboot

If Your Node Gets Stuck After Reboot:
* Wait 20-30 minutes - configuration may still be applying
* Check AWS instance status: aws ec2 describe-instance-status --instance-ids <id>
* If node doesn’t recover, delete the Performance Profile and recreate with more conservative settings
* See the "Troubleshooting Common Issues" section for detailed recovery steps

  1. Determine optimal CPU allocation based on your cluster type:

    # Run the CPU allocation calculator script
    bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh

    What This Script Does:

    • Detects CPU count on the target node

    • Calculates optimal CPU allocation based on cluster type (SNO, Multi-Node, Multi-Master)

    • Validates CPU ranges to prevent configuration errors

    • Saves configuration to /tmp/cluster-config for next steps

    Benefits of Using the Script:

    • Error Prevention: Validates CPU ranges before saving

    • Handles Edge Cases: Prevents division by zero and invalid ranges

    • Clear Output: Shows allocation strategy and percentages

    • Reusable: Can be run multiple times to recalculate

    The script implements workshop-friendly conservative allocation that preserves cluster functionality while demonstrating performance benefits.
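The kind of conservative split the calculator performs can be sketched as follows (a hypothetical SNO policy for illustration, not the script's exact algorithm): reserve half the CPUs, with a minimum of four, for the control plane and kubelet, and isolate the rest.

```shell
# Hypothetical sketch of a conservative SNO split (assumed policy, not the
# calculator script's exact algorithm): reserve half the CPUs (minimum 4)
# for the control plane and kubelet, isolate the rest for workloads.
calc_allocation() {
    local total=$1
    local reserved_count=$(( total / 2 ))
    if (( reserved_count < 4 )); then
        reserved_count=4
    fi
    local isolated_start=$reserved_count
    local isolated_end=$(( total - 1 ))
    if (( isolated_start > isolated_end )); then
        echo "error: no CPUs left to isolate" >&2
        return 1
    fi
    echo "reserved=0-$(( reserved_count - 1 )) isolated=${isolated_start}-${isolated_end}"
}

calc_allocation 16   # reserved=0-7 isolated=8-15
```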

  2. Create the Performance Profile optimized for your cluster architecture:

    # Run the Performance Profile creation script
    bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh

    What This Script Does:

    • Loads and validates CPU allocation from previous step

    • Detects instance type (metal vs virtualized)

    • Validates RT kernel requirements if enabled

    • Validates CPU ranges to prevent invalid PerformanceProfile

    • Determines HugePages allocation based on cluster type

    • Creates PerformanceProfile with proper node selector

    • Conditionally enables RT kernel based on instance type and configuration

    • Shows configuration before applying (requires confirmation)

    Benefits of Using the Script:

    • Prevents Errors: Validates CPU ranges and RT kernel requirements before creating PerformanceProfile

    • Smart Defaults: RT kernel disabled by default (works on all instance types)

    • Instance Detection: Automatically detects bare-metal vs virtualized instances

    • Interactive: Shows configuration and asks for confirmation

    • Clear Output: Displays profile summary and next steps

    • Error Handling: Provides helpful error messages if creation fails

    The script will ask for confirmation before applying the PerformanceProfile.

    RT Kernel Configuration:

    • Default: RT kernel is disabled (works on all instance types)

    • To Enable: Set ENABLE_RT_KERNEL=true or use --enable-rt-kernel flag

    • Validation: Script will fail with clear error if RT kernel is requested on non-metal instances

    If you see an error like invalid range "0-2", this means the CPU allocation calculation failed. Run the CPU allocation calculator script again:

    bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh

    Then retry the Performance Profile creation.
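The validation that catches this kind of error boils down to a sanity check on the range string; a hypothetical version (the real script's checks may differ):

```shell
# Hypothetical sanity check on a CPU range string (the script's real
# validation may differ): endpoints must be numeric and ordered.
valid_cpu_range() {
    [[ $1 =~ ^([0-9]+)-([0-9]+)$ ]] || return 1
    (( BASH_REMATCH[1] <= BASH_REMATCH[2] ))
}

valid_cpu_range "4-15" && echo "ok"
valid_cpu_range "0—2" || echo "rejected (em-dash instead of a hyphen)"
```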

Step 4: Monitor the Performance Profile Application

The Performance Profile will trigger a rolling update of your nodes. This process includes applying CPU isolation, HugePages, NUMA tuning, and (if enabled) installing the real-time kernel. The monitoring approach varies by cluster architecture.

Node Reboot Required:

The Performance Profile applies kernel boot parameters (CPU isolation, HugePages, NUMA tuning) which always require a node reboot, regardless of RT kernel setting. The Machine Config Pool (MCP) will automatically trigger the reboot.

Expected Timeline:
* SNO Clusters: API server will be unavailable for 10-20 minutes during reboot
* Multi-Node Clusters: Rolling update, one node at a time (10-20 minutes per node)
* RT Kernel Enabled: Additional 5-10 minutes for RT kernel installation

SNO-Specific Considerations:
* The entire cluster (API server) will be unavailable during the reboot
* Plan for 15-30 minutes of downtime
* Monitor the node status: oc get nodes (will show NotReady during reboot)
* After reboot, wait for the node to show Ready status before proceeding

Virtualized Instance Optimization:
* For virtualized AWS instances (e.g., m5.4xlarge), the nosmt kernel parameter is automatically skipped
* This preserves hyperthreading and ensures sufficient CPUs remain for the control plane
* CPU allocation is automatically adjusted to be more conservative for virtualized instances

  1. Monitor Machine Config Pool status:

    echo "📊 Current master MCP status:"
    oc get mcp master
    echo ""
    echo "⏱️  To monitor continuously (Ctrl+C to stop), run:"
    echo "watch 'oc get mcp master; echo; oc get nodes'"
  2. Monitor node updates in detail:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Detailed Node Update Monitoring ==="
    
    # Check machine config daemon status
    echo "📋 Machine Config Daemon Pods:"
    oc get pods -n openshift-machine-config-operator | grep daemon
    
    echo ""
    echo "📋 Recent Node Events:"
    oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -10
    
    echo ""
    echo "📋 Machine Config Status:"
    oc describe mcp master | grep -A 10 -B 5 "Conditions:"
  3. Wait for the update to complete:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Waiting for Performance Profile Application ==="
    echo "This process may take 10-20 minutes depending on your configuration"
    
    echo ""
    echo "⏳ SNO Update Process:"
    echo "   1. Machine config generation"
    echo "   2. Node cordoning and draining"
    echo "   3. Kernel argument updates and node reboot"
    if [ "${ENABLE_RT_KERNEL:-false}" = "true" ]; then
        echo "      (includes RT kernel installation)"
    fi
    echo "   4. Node rejoin and ready state"
    echo ""
    echo "🔄 Waiting for master MCP to be updated..."
    oc wait --for=condition=Updated mcp/master --timeout=1200s
    
    # Kernel boot parameters always trigger a reboot, so wait for the node
    echo "🔄 Waiting for node to be ready after reboot..."
    oc wait --for=condition=Ready node/$TARGET_NODE --timeout=600s
    
    echo ""
    echo "✅ Performance Profile application completed!"
    echo "📊 Final status:"
    oc get nodes
    oc get mcp
    
    echo ""
    echo "🧪 Testing cluster functionality after performance tuning..."
    
    # Create a simple test pod to verify the cluster is still functional
    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: post-tuning-test
      namespace: default
      labels:
        app: post-tuning-test
    spec:
      containers:
      - name: test-container
        image: registry.redhat.io/ubi8/ubi-minimal:latest
        command: ["sleep"]
        args: ["30"]
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "64Mi"
            cpu: "100m"
      restartPolicy: Never
    EOF
    
    # Wait for pod to be scheduled and running
    echo "โฑ๏ธ  Waiting for test pod to verify cluster functionality..."
    if oc wait --for=condition=Ready pod/post-tuning-test --timeout=120s -n default 2>/dev/null; then
        echo "   โœ… Cluster is functional after performance tuning!"
        echo "   ๐Ÿ“ Test pod scheduled successfully"
        oc delete pod post-tuning-test -n default --ignore-not-found=true
    else
        echo "   โš ๏ธ  Cluster may have scheduling issues after performance tuning"
        echo "   ๐Ÿ’ก Consider using the revert script if problems persist"
        echo "   ๐Ÿ” Check pod status: oc describe pod post-tuning-test -n default"
    fi

Step 5: Verify Performance Profile Effects

Once the update is complete, verify that all the performance optimizations have been applied correctly across your cluster architecture.

  1. Comprehensive verification using Python health check script:

    echo "=== Performance Profile Verification ==="
    echo "Running comprehensive cluster health check..."
    echo ""
    
    # Use Python script for thorough verification
    python3 ~/low-latency-performance-workshop/scripts/module04-cluster-health-check.py
    
    echo ""
    echo "✅ Comprehensive verification completed!"
    echo ""
    echo "💡 The health check script validates:"
    echo "   - Cluster architecture detection"
    echo "   - Performance Profile status"
    echo "   - Real-time kernel installation"
    echo "   - CPU isolation configuration"
    echo "   - Pod scheduling functionality"
  2. Get a quick performance tuning summary:

    echo "=== Performance Tuning Summary ==="
    echo ""
    
    # Get color-coded summary of current performance settings
    python3 ~/low-latency-performance-workshop/scripts/module04-performance-summary.py
    
    echo ""
    echo "💡 This summary shows:"
    echo "   - Current CPU allocation strategy"
    echo "   - Performance vs stability balance"
    echo "   - Recommendations for optimization"

Detailed Performance Tuning Validation

The workshop provides additional scripts to help with CPU allocation, Performance Profile creation, and validation.

  1. CPU Allocation Calculator - Calculate optimal CPU allocation:

    # Calculate CPU allocation based on cluster type
    bash ~/low-latency-performance-workshop/scripts/module04-calculate-cpu-allocation.sh

    This script:

    • Detects CPU count on target node

    • Calculates optimal reserved/isolated CPU allocation

    • Validates CPU ranges to prevent errors

    • Saves configuration for Performance Profile creation

  2. Performance Profile Creator - Create validated Performance Profile:

    # Create Performance Profile with validated configuration
    bash ~/low-latency-performance-workshop/scripts/module04-create-performance-profile.sh

    This script:

    • Validates CPU allocation from previous step

    • Prevents invalid CPU range errors

    • Shows configuration before applying

    • Requires confirmation for safety

  3. Performance Tuning Validator - Comprehensive validation of Performance Profile:

    # Validate Performance Profile configuration
    python3 ~/low-latency-performance-workshop/scripts/module04-tuning-validator.py

    This script validates:

    • Performance Profile existence and configuration

    • Machine Config Pool (MCP) status and readiness

    • Real-Time kernel installation on target nodes

    • Overall tuning configuration health

  4. CPU Isolation Checker - Detailed CPU allocation analysis:

    # Check CPU isolation configuration
    python3 ~/low-latency-performance-workshop/scripts/module04-cpu-isolation-checker.py

    This script provides:

    • Visual representation of CPU allocation

    • Reserved vs isolated CPU validation

    • CPU allocation strategy explanation

    • Best practices for CPU isolation

    • Configuration recommendations

  5. HugePages Validator - HugePages configuration verification:

    # Validate HugePages configuration
    python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py
    
    # Check specific Performance Profile
    python3 ~/low-latency-performance-workshop/scripts/module04-hugepages-validator.py \
        --profile my-performance-profile

    This script validates:

    • HugePages configuration in Performance Profile

    • Node HugePages allocation and availability

    • HugePages benefits and use cases

    • How to use HugePages in pod specifications

    • Total HugePages memory allocation
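For reference, a pod requests HugePages as an extended resource; a minimal sketch follows (the pod name and sizes are illustrative), noting that Kubernetes requires HugePages requests and limits to be equal:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo            # illustrative name
spec:
  containers:
  - name: app
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["sleep"]
    args: ["300"]
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
    resources:
      requests:
        memory: "64Mi"
        hugepages-2Mi: "128Mi"
      limits:
        memory: "64Mi"
        hugepages-2Mi: "128Mi"    # must equal the request for HugePages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

The emptyDir volume with medium: HugePages is only needed when the application maps HugePages-backed files directly; the resource request alone reserves the pages.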

These validation scripts are educational tools that help you understand:

  • What each performance tuning component does

  • How to verify configurations are correct

  • Best practices for low-latency tuning

  • Troubleshooting common issues

Run them after applying Performance Profiles to ensure everything is configured correctly.
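Tools like the CPU isolation checker have to expand kernel-style CPU lists (e.g. "0-3,8-11,14"); a small sketch of that parsing (a hypothetical helper, not code from the workshop scripts):

```shell
# Hypothetical helper (not from the workshop scripts): expand a
# kernel-style CPU list such as "0-3,8-11,14" and count the CPUs it names.
count_cpus() {
    local list=$1 total=0 part start end
    local -a parts
    IFS=',' read -ra parts <<< "$list"
    for part in "${parts[@]}"; do
        if [[ $part == *-* ]]; then
            start=${part%-*}
            end=${part#*-}
            total=$(( total + end - start + 1 ))
        else
            total=$(( total + 1 ))
        fi
    done
    echo "$total"
}

count_cpus "0-3,8-11,14"   # 9
```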

Performance Testing: Measuring Improvements

Now let’s run the same baseline test to measure the performance improvements from our optimizations.

Step 6: Re-run Performance Tests

  1. Re-run the kube-burner performance test on the optimized cluster:

    cd ~/kube-burner-configs
    
    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Creating Tuned Performance Test Configuration ==="
    echo "Cluster type: $CLUSTER_TYPE"
    echo "Profile name: $PROFILE_NAME"
    echo "Target node: $TARGET_NODE"
    
    # Create a new test configuration for the tuned cluster
    cat > tuned-config.yml << EOF
    global:
      measurements:
        - name: podLatency
          thresholds:
            - conditionType: Ready
              metric: P99
              threshold: 15000ms  # Expect better performance after tuning
    
    metricsEndpoints:
      - indexer:
          type: local
          metricsDirectory: collected-metrics-tuned
    
    jobs:
      - name: tuned-workload
        jobType: create
        jobIterations: 20
        namespace: tuned-workload
        namespacedIterations: true
        cleanup: false
        podWait: false
        waitWhenFinished: true
        verifyObjects: true
        errorOnVerify: false
        objects:
          - objectTemplate: tuned-pod.yml
            replicas: 5
            inputVars:
              containerImage: registry.redhat.io/ubi8/ubi:latest
    EOF
    
    # Create a tuned pod template with appropriate node selector
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        NODE_SELECTOR_YAML='nodeSelector:
        node-role.kubernetes.io/master: ""'
        echo "๐Ÿ“ SNO: Using master node selector for pod placement"
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        NODE_SELECTOR_YAML='nodeSelector:
        node-role.kubernetes.io/worker-rt: ""'
        echo "๐Ÿ“ Multi-Node: Using worker-rt node selector for pod placement"
    else
        NODE_SELECTOR_YAML='nodeSelector:
        node-role.kubernetes.io/master-rt: ""'
        echo "๐Ÿ“ Multi-Master: Using master-rt node selector for pod placement"
    fi
    
    cat > tuned-pod.yml << EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: tuned-pod-{{.Iteration}}-{{.Replica}}
      labels:
        app: tuned-test
        iteration: "{{.Iteration}}"
        cluster-type: "$CLUSTER_TYPE"
    spec:
      $NODE_SELECTOR_YAML
      containers:
      - name: tuned-container
        image: {{.containerImage}}
        command: ["sleep"]
        args: ["300"]
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
      restartPolicy: Never
    EOF
    
    echo ""
    echo "🚀 Running Performance Test on Tuned Cluster"
    echo "   - Test configuration: tuned-config.yml"
    echo "   - Pod template: tuned-pod.yml"
    echo "   - Target: $CLUSTER_TYPE cluster with $PROFILE_NAME"
    echo ""
    
    # Run the performance test
    kube-burner init -c tuned-config.yml --log-level=info
    
    echo ""
    echo "✅ Tuned performance test completed!"
    echo "📊 Results stored in: collected-metrics-tuned/"
  2. Analyze the tuned performance results using Python script:

    cd ~/kube-burner-configs
    
    echo "๐Ÿ” Analyzing tuned performance results..."
    echo ""
    
    # Use Python script for clean, color-coded analysis
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py --single collected-metrics-tuned
    
    echo ""
    echo "✅ Tuned performance analysis completed!"
    echo ""
    echo "💡 The enhanced analysis provides:"
    echo "   📚 Educational context explaining what each metric means"
    echo "   🎯 Performance explanations (why some metrics are slower)"
    echo "   🚀 Color-coded results: Excellent/Good/Needs Attention"
    echo "   📋 Suggested next steps based on your results"
  3. Compare results with your baseline using Python analysis:

    cd ~/kube-burner-configs
    
    echo "📊 Comparing baseline vs tuned performance..."
    echo ""
    
    # Use module-specific analysis script for clean Module 4 results
    echo "🎯 Module 4 Focused Analysis (Container Performance Only)..."
    python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 4
    
    echo ""
    echo "📄 Generating Module 4 Performance Report..."
    echo "   🎯 Focus: Container performance optimization (baseline vs tuned)"
    echo "   📊 Scope: CPU isolation and HugePages impact on pod startup"
    echo ""
    
    # Generate Module 4 specific markdown report (baseline vs tuned only)
    REPORT_FILE="module4-performance-comparison-$(date +%Y%m%d-%H%M).md"
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
        --baseline collected-metrics \
        --tuned collected-metrics-tuned \
        --report "$REPORT_FILE"
    
    echo ""
    echo "💡 Module 4 Analysis Scope:"
    echo "   ✅ Baseline container performance (from Module 3)"
    echo "   ✅ Tuned container performance (from Module 4)"
    echo "   ⚠️  Note: If VMI data exists from Module 5, it will also be shown"
    echo "   🎯 Focus on the 'Performance Comparison' section for Module 4 results"
    echo "   ℹ️  Comprehensive analysis across all modules happens in Module 6"
    
    echo ""
    echo "📚 How to Read the Analysis:"
    echo "   1. Individual test sections show raw performance data"
    echo "   2. 'Performance Comparison' section shows Module 4 improvements"
    echo "   3. VMI data (if shown) is for reference - focus on container metrics"
    
    echo ""
    echo "📊 Performance Comparison Summary:"
    echo "=================================="
    if [ -f "$REPORT_FILE" ]; then
        # Display key sections of the report
        head -30 "$REPORT_FILE"
        echo ""
        echo "📄 Full report available at: $REPORT_FILE"
    else
        echo "โš ๏ธ  Report generation failed - check if both baseline and tuned metrics exist"
    fi
    
    echo ""
    echo "✅ Performance comparison completed!"
    echo ""
    echo "💡 The comparison analysis explains:"
    echo "   📊 What P99/P95/P50 improvements mean for your workloads"
    echo "   🎯 Why scheduling became instant (0ms) with CPU isolation"
    echo "   ⚖️ Why container operations may be slower (expected trade-off)"
    echo "   🏆 Overall assessment of your performance tuning effectiveness"

Step 7: Validate kube-burner Test Pods and Cluster Stability

Before proceeding, let’s validate that our performance tuning doesn’t interfere with normal cluster operations and that kube-burner can successfully create and schedule pods.

  1. Check the status of kube-burner test pods:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Validating kube-burner Test Pod Scheduling ==="
    echo "Cluster type: $CLUSTER_TYPE"
    echo "Target node: $TARGET_NODE"
    
    # Check if tuned test pods were created successfully
    echo ""
    echo "📋 Tuned Test Pod Status:"
    TUNED_PODS=$(oc get pods -A -l app=tuned-test --no-headers 2>/dev/null | wc -l)
    if [ $TUNED_PODS -gt 0 ]; then
        echo "   ✅ Found $TUNED_PODS tuned test pods"
        echo ""
        echo "   📊 Pod Distribution by Node:"
        oc get pods -A -l app=tuned-test -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sed 's/^/      /'

        echo ""
        echo "   📊 Pod Status Summary:"
        oc get pods -A -l app=tuned-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/      /'
    
        # Check for any failed pods
        FAILED_PODS=$(oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | wc -l)
        if [ $FAILED_PODS -gt 0 ]; then
            echo ""
            echo "   โš ๏ธ  Found $FAILED_PODS pods not in Running/Completed state:"
            oc get pods -A -l app=tuned-test --no-headers | grep -v "Running\|Completed" | sed 's/^/      /'
        fi
    else
        echo "   โš ๏ธ  No tuned test pods found - this may indicate scheduling issues"
    fi
    
    # Check baseline test pods as well
    echo ""
    echo "📋 Baseline Test Pod Status:"
    BASELINE_PODS=$(oc get pods -A -l app=baseline-test --no-headers 2>/dev/null | wc -l)
    if [ $BASELINE_PODS -gt 0 ]; then
        echo "   ✅ Found $BASELINE_PODS baseline test pods"
        echo "   📊 Baseline Pod Status Summary:"
        oc get pods -A -l app=baseline-test --no-headers | awk '{print $4}' | sort | uniq -c | sed 's/^/      /'
    else
        echo "   ℹ️  No baseline test pods found (may have been cleaned up)"
    fi
  2. Test cluster responsiveness and pod scheduling:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Testing Cluster Responsiveness ==="
    
    # Create a simple test pod to verify scheduling works
    echo "🧪 Creating test pod to verify cluster functionality..."
    
    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cluster-health-test
      namespace: default
      labels:
        app: cluster-health-test
    spec:
      containers:
      - name: test-container
        image: registry.redhat.io/ubi8/ubi-minimal:latest
        command: ["sleep"]
        args: ["60"]
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "64Mi"
            cpu: "100m"
      restartPolicy: Never
    EOF
    
    # Wait for pod to be scheduled and running
    echo "โฑ๏ธ  Waiting for test pod to start..."
    oc wait --for=condition=Ready pod/cluster-health-test --timeout=60s -n default
    
    if [ $? -eq 0 ]; then
        echo "   ✅ Test pod started successfully - cluster is responsive"

        # Check which node it was scheduled on
        TEST_POD_NODE=$(oc get pod cluster-health-test -n default -o jsonpath='{.spec.nodeName}')
        echo "   📍 Test pod scheduled on node: $TEST_POD_NODE"
    
        # Clean up test pod
        oc delete pod cluster-health-test -n default --ignore-not-found=true
    else
        echo "   โŒ Test pod failed to start - cluster may have scheduling issues"
        echo "   ๐Ÿ” Pod events:"
        oc describe pod cluster-health-test -n default | grep -A 10 "Events:" | sed 's/^/      /'
    fi

Step 8: Optional - Revert Performance Tuning for Workshop Stability

If you experience any issues with cluster stability or want to continue with other workshop modules without the aggressive performance tuning, you can revert the changes.

  1. Create a revert script for easy cleanup:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Creating Performance Tuning Revert Script ==="
    
    # Create revert script
    cat > ~/revert-performance-tuning.sh << 'EOF'
    #!/bin/bash
    
    # Load cluster configuration
    if [ -f /tmp/cluster-config ]; then
        source /tmp/cluster-config
        echo "=== Reverting Performance Tuning ==="
        echo "Cluster type: $CLUSTER_TYPE"
        echo "Profile name: $PROFILE_NAME"
        echo "Target node: $TARGET_NODE"
    else
        echo "โŒ Cluster configuration not found. Cannot revert automatically."
        echo "๐Ÿ’ก You can manually delete performance profiles with:"
        echo "   oc get performanceprofile"
        echo "   oc delete performanceprofile <profile-name>"
        exit 1
    fi
    
    # Delete the Performance Profile
    echo ""
    echo "๐Ÿ—‘๏ธ  Removing Performance Profile: $PROFILE_NAME"
    if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then
        oc delete performanceprofile $PROFILE_NAME
        echo "   โœ… Performance Profile deleted"
    else
        echo "   โ„น๏ธ  Performance Profile not found (may already be deleted)"
    fi
    
    # Remove custom Machine Config Pool (if created)
    if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo ""
        echo "๐Ÿ—‘๏ธ  Removing worker-rt Machine Config Pool..."
        if oc get mcp worker-rt >/dev/null 2>&1; then
            oc delete mcp worker-rt
            echo "   โœ… worker-rt Machine Config Pool deleted"
        else
            echo "   โ„น๏ธ  worker-rt Machine Config Pool not found"
        fi
    
        # Remove worker-rt label from nodes
        echo ""
        echo "๐Ÿท๏ธ  Removing worker-rt labels from nodes..."
        oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt- --ignore-not-found=true
    
    elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
        echo ""
        echo "๐Ÿ—‘๏ธ  Removing master-rt Machine Config Pool..."
        if oc get mcp master-rt >/dev/null 2>&1; then
            oc delete mcp master-rt
            echo "   โœ… master-rt Machine Config Pool deleted"
        else
            echo "   โ„น๏ธ  master-rt Machine Config Pool not found"
        fi
    
        # Remove master-rt label from nodes
        echo ""
        echo "๐Ÿท๏ธ  Removing master-rt labels from nodes..."
        oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt- --ignore-not-found=true
    fi
    
    echo ""
    echo "โณ Waiting for nodes to revert to standard kernel..."
    echo "   ๐Ÿ’ก This process will take 10-15 minutes as nodes reboot"
    echo "   ๐Ÿ“Š Monitor progress with: watch 'oc get nodes; oc get mcp'"
    
    # Wait for machine config pools to be updated
    if [ "$CLUSTER_TYPE" = "SNO" ]; then
        echo "   🔄 Waiting for master MCP to update..."
        oc wait --for=condition=Updated mcp/master --timeout=1200s
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo "   🔄 Waiting for worker MCP to update..."
        oc wait --for=condition=Updated mcp/worker --timeout=1200s
    else
        echo "   🔄 Waiting for master MCP to update..."
        oc wait --for=condition=Updated mcp/master --timeout=1200s
    fi
    
    echo ""
    echo "โœ… Performance tuning revert completed!"
    echo "๐Ÿ” Verify with: oc debug node/$TARGET_NODE -- chroot /host uname -r"
    echo "   (Should show standard kernel without 'rt')"
    
    EOF
    
    chmod +x ~/revert-performance-tuning.sh
    
    echo "โœ… Revert script created: ~/revert-performance-tuning.sh"
    echo ""
    echo "๐Ÿ’ก To revert performance tuning later, run:"
    echo "   ~/revert-performance-tuning.sh"
    echo ""
    echo "โš ๏ธ  Note: Reverting will cause nodes to reboot back to standard kernel"
  2. Optional: Run the revert script if needed:

    echo "=== Performance Tuning Revert Decision ==="
    echo ""
    echo "🤔 Do you want to revert the performance tuning now?"
    echo ""
    echo "✅ Keep the performance tuning if:"
    echo "   - The cluster is stable and responsive"
    echo "   - Test pods are scheduling successfully"
    echo "   - You want to continue with optimized performance"
    echo ""
    echo "🔄 Revert the performance tuning if:"
    echo "   - You are experiencing cluster stability issues"
    echo "   - Pods are failing to schedule"
    echo "   - You want to continue the workshop with standard settings"
    echo ""
    echo "💡 You can always re-apply performance tuning later by re-running this module"
    echo ""
    echo "To revert now, run: ~/revert-performance-tuning.sh"
    echo "To keep current settings, continue to the next step"

Expected Improvements

With proper performance tuning, you should see significant improvements:

Typical Improvements
  • Pod Creation P99: 50-70% reduction in latency

  • Pod Creation P95: 40-60% reduction in latency

  • Consistency: Much lower variance between P50 and P99

  • Jitter Reduction: More predictable response times
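
As a quick sanity check, you can quantify the improvement from your own numbers. The values below are placeholders standing in for a Module 3 baseline and a post-tuning kube-burner run; substitute your measured P99 latencies:

```shell
# Placeholder values -- substitute the P99 numbers from your own kube-burner runs
BASELINE_P99=5200   # ms, baseline pod-creation P99 (Module 3)
TUNED_P99=1800      # ms, post-tuning pod-creation P99

# Percent reduction = (baseline - tuned) / baseline * 100
REDUCTION=$(awk -v b="$BASELINE_P99" -v t="$TUNED_P99" \
    'BEGIN { printf "%.1f", (b - t) / b * 100 }')

echo "Pod creation P99: ${BASELINE_P99}ms -> ${TUNED_P99}ms (${REDUCTION}% reduction)"
# Prints: Pod creation P99: 5200ms -> 1800ms (65.4% reduction)
```

A result in the 50-70% band matches the typical improvements listed above; a much smaller delta suggests the profile did not fully apply (see Troubleshooting below).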

Performance Factors
  • CPU Isolation: Eliminates interference from system processes

  • Real-time Kernel: Provides deterministic scheduling

  • HugePages: Reduces memory management overhead

  • NUMA Optimization: Ensures local memory access
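
Several of these factors can be spot-checked from the command line. As one illustration, verifying the real-time kernel reduces to inspecting the kernel release string; the helper name and the sample release below are illustrative, not part of the workshop tooling:

```shell
# Illustrative helper: succeed if a kernel release string is a real-time kernel.
# RT kernels carry ".rt" in the release string.
is_rt_kernel() {
    case "$1" in
        *.rt*) return 0 ;;
        *)     return 1 ;;
    esac
}

# On your tuned node you would feed it the live value, e.g.:
#   is_rt_kernel "$(oc debug node/$TARGET_NODE -- chroot /host uname -r 2>/dev/null)"
if is_rt_kernel "5.14.0-284.41.1.rt14.328.el9_2.x86_64"; then   # sample RT release
    echo "RT kernel detected"
else
    echo "standard kernel"
fi
```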

Troubleshooting Common Issues

This section provides architecture-specific troubleshooting guidance for common Performance Profile issues.

Architecture-Specific Troubleshooting

Node Not Updating

If nodes don’t start updating after Performance Profile creation:

# Load cluster configuration
source /tmp/cluster-config

echo "=== Node Update Troubleshooting ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Target node: $TARGET_NODE"

if [ "$CLUSTER_TYPE" = "SNO" ]; then
    echo ""
    echo "๐Ÿ” SNO Troubleshooting:"
    echo "   - Check master Machine Config Pool status"
    oc describe mcp master | grep -A 10 -B 5 "Conditions:"

    echo ""
    echo "   - Check for conflicting machine configs:"
    oc get mc | grep master

elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
    echo ""
    echo "๐Ÿ” Multi-Node Troubleshooting:"
    echo "   - Check worker-rt Machine Config Pool status"
    oc describe mcp worker-rt | grep -A 10 -B 5 "Conditions:"

    echo ""
    echo "   - Check for conflicting machine configs:"
    oc get mc | grep worker-rt

else
    echo ""
    echo "๐Ÿ” Multi-Master Troubleshooting:"
    echo "   - Check master-rt Machine Config Pool status"
    oc describe mcp master-rt | grep -A 10 -B 5 "Conditions:"

    echo ""
    echo "   - Check for conflicting machine configs:"
    oc get mc | grep master-rt
fi

echo ""
echo "๐Ÿ“‹ Performance Profile Status:"
oc describe performanceprofile $PROFILE_NAME | grep -A 10 -B 5 "Status:"

Real-time Kernel Issues

If the RT kernel fails to install:

# Load cluster configuration
source /tmp/cluster-config

echo "=== Real-Time Kernel Troubleshooting ==="
echo "Target node: $TARGET_NODE"

# Check node events for errors
echo "๐Ÿ“‹ Recent Node Events:"
oc get events --sort-by='.lastTimestamp' --field-selector involvedObject.name=$TARGET_NODE | tail -15

echo ""
echo "๐Ÿ“‹ Machine Config Daemon Logs:"
MCD_POD=$(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$MCD_POD" ]; then
    echo "   MCD Pod: $MCD_POD"
    oc logs -n openshift-machine-config-operator $MCD_POD --tail=20
else
    echo "   โŒ No MCD pod found for node $TARGET_NODE"
fi

echo ""
echo "๐Ÿ“‹ RT Kernel Package Availability:"
oc debug node/$TARGET_NODE -- chroot /host sh -c 'yum list available | grep kernel-rt || dnf list available | grep kernel-rt' 2>/dev/null || echo "   Unable to check RT kernel packages"

echo ""
echo "๐Ÿ“‹ Current Kernel Information:"
oc debug node/$TARGET_NODE -- chroot /host uname -a 2>/dev/null || echo "   Unable to check current kernel"

HugePages Not Allocated

If HugePages aren’t configured correctly:

# Load cluster configuration
source /tmp/cluster-config

echo "=== HugePages Troubleshooting ==="
echo "Target node: $TARGET_NODE"

# Check available memory
echo "๐Ÿ“‹ Memory Information:"
oc debug node/$TARGET_NODE -- chroot /host free -h 2>/dev/null || echo "   Unable to check memory"

echo ""
echo "๐Ÿ“‹ HugePages Configuration:"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/meminfo 2>/dev/null | grep -i huge || echo "   No HugePages information found"

echo ""
echo "๐Ÿ“‹ HugePages Mount Points:"
oc debug node/$TARGET_NODE -- chroot /host mount 2>/dev/null | grep huge || echo "   No HugePages mount points found"

echo ""
echo "๐Ÿ“‹ Kernel Command Line (HugePages args):"
oc debug node/$TARGET_NODE -- chroot /host cat /proc/cmdline 2>/dev/null | grep -o 'hugepages[^[:space:]]*' || echo "   No HugePages kernel arguments found"

echo ""
echo "๐Ÿ“‹ Performance Profile HugePages Spec:"
oc get performanceprofile $PROFILE_NAME -o jsonpath='{.spec.hugepages}' | jq '.' 2>/dev/null || echo "   Unable to read Performance Profile HugePages spec"

Node Stuck After Reboot (API Server Not Starting)

If the node reboots but the API server doesn’t come back up (SNO) or node remains NotReady:

# Load cluster configuration
source /tmp/cluster-config

echo "=== Node Stuck After Reboot Troubleshooting ==="
echo "Cluster type: $CLUSTER_TYPE"
echo "Target node: $TARGET_NODE"
echo "Instance type: ${INSTANCE_TYPE:-unknown}"

# Check node status
echo ""
echo "๐Ÿ“‹ Node Status:"
oc get node $TARGET_NODE -o wide 2>/dev/null || echo "   โŒ Cannot reach API server (node may be rebooting)"

# Check AWS instance status (if AWS CLI available)
if command -v aws &> /dev/null; then
    echo ""
    echo "๐Ÿ“‹ AWS Instance Status:"
    INSTANCE_ID=$(aws ec2 describe-instances \
        --filters "Name=private-ip-address,Values=$(oc get node $TARGET_NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' 2>/dev/null || echo 'unknown')" \
        --query 'Reservations[*].Instances[*].[InstanceId,State.Name]' \
        --output text 2>/dev/null | head -1)
    if [ -n "$INSTANCE_ID" ]; then
        echo "   Instance: $INSTANCE_ID"
        aws ec2 describe-instance-status --instance-ids $INSTANCE_ID \
            --query 'InstanceStatuses[*].[SystemStatus.Status,InstanceStatus.Status]' \
            --output table 2>/dev/null || echo "   Unable to check AWS status"
    fi
fi

echo ""
echo "๐Ÿ’ก Common Causes:"
echo "   1. CPU isolation too aggressive - not enough CPUs for control plane"
echo "   2. nosmt parameter on virtualized instance - halves available CPUs"
echo "   3. Insufficient memory after HugePages allocation"
echo ""
echo "๐Ÿ”ง Recovery Options:"
echo "   1. Wait 20-30 minutes - node may still be applying configuration"
echo "   2. Reboot instance: aws ec2 reboot-instances --instance-ids <instance-id>"
echo "   3. Delete Performance Profile and recreate with more conservative settings:"
echo "      oc delete performanceprofile $PROFILE_NAME"
echo "      # Then re-run CPU allocation calculator and profile creator"

Performance Profile Not Applied

If the Performance Profile exists but optimizations aren’t applied:

# Load cluster configuration
source /tmp/cluster-config

echo "=== Performance Profile Application Troubleshooting ==="

# Check Performance Profile status
echo "๐Ÿ“‹ Performance Profile Status:"
oc get performanceprofile $PROFILE_NAME -o yaml | grep -A 20 "status:" || echo "   No status information available"

echo ""
echo "๐Ÿ“‹ Node Tuning Operator Status:"
oc get pods -n openshift-cluster-node-tuning-operator

echo ""
echo "๐Ÿ“‹ TuneD Daemon Status:"
TUNED_POD=$(oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned --field-selector spec.nodeName=$TARGET_NODE -o jsonpath='{.items[0].metadata.name}')
if [ -n "$TUNED_POD" ]; then
    echo "   TuneD Pod: $TUNED_POD"
    oc logs -n openshift-cluster-node-tuning-operator $TUNED_POD --tail=10
else
    echo "   โŒ No TuneD pod found for node $TARGET_NODE"
fi

echo ""
echo "๐Ÿ“‹ Generated TuneD Profiles:"
oc get tuned -n openshift-cluster-node-tuning-operator

Module Summary

In this module, you have successfully implemented architecture-aware performance tuning:

✅ Detected your cluster architecture (SNO, Multi-Node, or Multi-Master)
✅ Created an optimized Performance Profile tailored to your environment
✅ Configured CPU isolation appropriate for your cluster’s control plane needs
✅ Allocated HugePages sized correctly for your available resources
✅ Applied real-time kernel settings for deterministic scheduling
✅ Verified all performance optimizations are working correctly
✅ Measured significant performance improvements through comparative testing

Architecture-Specific Achievements

Single Node OpenShift (SNO):
* Conservative CPU allocation preserving control plane stability
* Optimized HugePages configuration for resource-constrained environments
* Master node targeting with appropriate performance isolation

Multi-Node Clusters:
* Aggressive CPU isolation on dedicated worker nodes
* Maximum performance optimization without control plane impact
* Dedicated Machine Config Pool for isolated performance tuning

Multi-Master Clusters:
* Balanced CPU allocation for control plane and workload performance
* Strategic node selection for performance optimization
* Maintained cluster stability during rolling updates

Key Takeaways
* Performance Profiles adapt automatically to different cluster architectures
* CPU isolation strategies must account for control plane requirements
* Real-time kernels provide predictable, low-latency scheduling across all architectures
* HugePages allocation should be sized appropriately for available resources
* Proper architecture-aware tuning can achieve 50-70% latency improvements
* SNO environments can achieve exceptional performance due to simplified architecture
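
For reference, the components listed in these takeaways come together in a single custom resource. The sketch below writes an example Performance Profile for review; all values (CPU ranges, HugePages count, node selector) are placeholders, so use the numbers your CPU allocation calculator produced rather than applying this file blindly:

```shell
# Write an illustrative Performance Profile for review -- placeholder values only
cat > /tmp/example-performance-profile.yaml <<'EOF'
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performance-profile
spec:
  cpu:
    reserved: "0-1"        # kept for the OS and control plane
    isolated: "2-7"        # dedicated to latency-sensitive workloads
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
      - size: "1G"
        count: 4
  realTimeKernel:
    enabled: true
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""
EOF

echo "Review with: less /tmp/example-performance-profile.yaml"
echo "Apply with:  oc apply -f /tmp/example-performance-profile.yaml"
```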

Performance Impact Summary

Based on your cluster type, you should observe:
* Pod Creation Latency: 50-70% reduction in P99 times
* Consistency: Dramatically reduced variance between P50 and P99
* Jitter: More predictable and deterministic response times
* Resource Utilization: Optimized CPU and memory usage patterns
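
To put a number on the P50-to-P99 variance claim, you can compute the spread directly from a kube-burner podLatency result. The JSON below is sample data whose layout mirrors the podLatencyQuantilesMeasurement output; check the field names against your own result files:

```shell
# Sample data standing in for a kube-burner podLatency quantiles file --
# verify field names against your actual measurement output
cat > /tmp/sample-podlatency.json <<'EOF'
[
  {"quantileName": "PodScheduled", "P50": 1000, "P99": 1200},
  {"quantileName": "Ready",        "P50": 1800, "P99": 5200}
]
EOF

# Spread between P50 and P99 for the Ready quantile: a smaller spread
# after tuning means more deterministic pod startup
jq -r '.[] | select(.quantileName=="Ready")
       | "Ready P50=\(.P50)ms P99=\(.P99)ms spread=\(.P99 - .P50)ms"' \
    /tmp/sample-podlatency.json
# Prints: Ready P50=1800ms P99=5200ms spread=3400ms
```

Run the same computation on your before and after result files; the post-tuning spread should shrink dramatically.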

Optional: Reverting Performance Tuning for Workshop Continuation

If you need to revert the performance tuning to continue with other workshop modules or if you experience any cluster stability issues, you can easily remove the performance optimizations.

Creating a Revert Script

  1. Create an automated revert script:

    # Load cluster configuration
    source /tmp/cluster-config
    
    echo "=== Creating Performance Tuning Revert Script ==="
    
    # Create revert script
    cat > ~/revert-performance-tuning.sh << 'EOF'
    #!/bin/bash
    
    # Load cluster configuration
    if [ -f /tmp/cluster-config ]; then
        source /tmp/cluster-config
        echo "=== Reverting Performance Tuning ==="
        echo "Cluster type: $CLUSTER_TYPE"
        echo "Profile name: $PROFILE_NAME"
        echo "Target node: $TARGET_NODE"
    else
        echo "โŒ Cluster configuration not found. Attempting manual cleanup..."
        PROFILE_NAME=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
        if [ -z "$PROFILE_NAME" ]; then
            echo "No Performance Profiles found to delete."
            exit 0
        fi
    fi
    
    # Delete the Performance Profile
    echo ""
    echo "๐Ÿ—‘๏ธ  Removing Performance Profile: $PROFILE_NAME"
    if oc get performanceprofile $PROFILE_NAME >/dev/null 2>&1; then
        oc delete performanceprofile $PROFILE_NAME
        echo "   โœ… Performance Profile deleted"
    else
        echo "   โ„น๏ธ  Performance Profile not found (may already be deleted)"
    fi
    
    # Remove custom Machine Config Pool and labels (if created)
    if [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo ""
        echo "๐Ÿ—‘๏ธ  Cleaning up worker-rt configuration..."
        oc delete mcp worker-rt --ignore-not-found=true
        oc label nodes -l node-role.kubernetes.io/worker-rt node-role.kubernetes.io/worker-rt- --ignore-not-found=true
        echo "   โœ… Worker-rt configuration cleaned up"
    
    elif [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
        echo ""
        echo "๐Ÿ—‘๏ธ  Cleaning up master-rt configuration..."
        oc delete mcp master-rt --ignore-not-found=true
        oc label nodes -l node-role.kubernetes.io/master-rt node-role.kubernetes.io/master-rt- --ignore-not-found=true
        echo "   โœ… Master-rt configuration cleaned up"
    fi
    
    echo ""
    echo "โณ Waiting for nodes to revert to standard kernel..."
    echo "   ๐Ÿ’ก This process will take 10-15 minutes as nodes reboot"
    echo "   ๐Ÿ“Š Monitor progress with: watch 'oc get nodes; oc get mcp'"
    
    # Wait for machine config pools to be updated
    if [ "$CLUSTER_TYPE" = "SNO" ] || [ "$CLUSTER_TYPE" = "MULTI_MASTER" ]; then
        echo "   ๐Ÿ”„ Waiting for master MCP to update..."
        oc wait --for=condition=Updated mcp/master --timeout=1200s
    elif [ "$CLUSTER_TYPE" = "MULTI_NODE" ]; then
        echo "   ๐Ÿ”„ Waiting for worker MCP to update..."
        oc wait --for=condition=Updated mcp/worker --timeout=1200s
    fi
    
    echo ""
    echo "โœ… Performance tuning revert completed!"
    echo "๐Ÿ” Verify standard kernel with:"
    echo "   oc debug node/\$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- chroot /host uname -r"
    echo "   (Should show standard kernel without 'rt')"
    
    EOF
    
    chmod +x ~/revert-performance-tuning.sh
    
    echo "โœ… Revert script created: ~/revert-performance-tuning.sh"
    echo ""
    echo "๐Ÿ’ก To revert performance tuning at any time, run:"
    echo "   ~/revert-performance-tuning.sh"
    echo ""
    echo "โš ๏ธ  Note: Reverting will cause nodes to reboot back to standard kernel"

When to Use the Revert Script

Revert Performance Tuning if you experience:
  • Pods failing to schedule or start

  • Cluster becoming unresponsive

  • High resource contention

  • Need to continue with other workshop modules that require standard settings

Keep Performance Tuning if:
  • Cluster is stable and responsive

  • Test pods are scheduling successfully

  • You want to continue with performance-optimized settings

  • Planning to proceed directly to Module 5 (Virtualization)
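
The keep-or-revert decision above can be boiled down to two signals: node readiness and pods stuck in Pending. The helper below is an illustrative sketch (the function name and decision rule are ours, not part of the workshop tooling):

```shell
# Illustrative decision helper: recommend KEEP or REVERT from the two
# signals matching the criteria above
recommend_action() {
    local node_ready="$1" pending_pods="$2"
    if [ "$node_ready" = "True" ] && [ "$pending_pods" -eq 0 ]; then
        echo "KEEP"
    else
        echo "REVERT"
    fi
}

# On a live cluster the inputs could come from:
#   node_ready=$(oc get node "$TARGET_NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
#   pending_pods=$(oc get pods -A --field-selector=status.phase=Pending --no-headers 2>/dev/null | wc -l)
recommend_action "True" 0     # stable cluster, nothing Pending -> KEEP
recommend_action "False" 3    # NotReady node with stuck pods   -> REVERT
```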

Manual Cleanup (Alternative)

If the automated script doesn’t work, you can manually clean up:

# Manual cleanup commands
echo "=== Manual Performance Tuning Cleanup ==="

# List and delete performance profiles
echo "๐Ÿ“‹ Current Performance Profiles:"
oc get performanceprofile

echo ""
echo "๐Ÿ—‘๏ธ  Delete Performance Profile:"
echo "oc delete performanceprofile <profile-name>"

# List and delete custom MCPs
echo ""
echo "๐Ÿ“‹ Current Machine Config Pools:"
oc get mcp

echo ""
echo "๐Ÿ—‘๏ธ  Delete custom MCP (if created):"
echo "oc delete mcp worker-rt  # or master-rt"

echo ""
echo "๐Ÿ’ก After deletion, nodes will automatically reboot to standard kernel"
Next Steps

In Module 5, you will learn how to apply these performance optimizations to OpenShift Virtualization, creating high-performance virtual machines that leverage your tuned infrastructure to achieve near bare-metal latency characteristics. The architecture-aware approach you’ve learned will be essential for optimizing VM placement and resource allocation.

Workshop Flexibility: You can proceed to Module 5 either with or without the performance tuning active. The virtualization module will adapt to your current cluster configuration.