Module 6: Monitoring and Validation

Module Overview

This module focuses on comprehensive monitoring and validation of low-latency performance optimizations. You’ll learn to set up monitoring dashboards, create performance alerts, validate optimizations across the entire stack, and implement continuous performance testing to ensure sustained performance improvements.

Prerequisites

  • Completed Module 5 (Low-latency Virtualization)

  • Performance baselines established from Module 3

  • Optional: Performance tuning from Module 4 (for enhanced monitoring)

  • VMI and network policy testing from Module 5

Module 4 Dependency

This module can validate performance with or without Module 4 optimizations:

  • With Module 4: Monitor optimized performance and validate improvements

  • Without Module 4: Monitor baseline performance and identify optimization opportunities

The monitoring setup adapts to your cluster’s current configuration.

Key Learning Objectives

  • Set up comprehensive performance monitoring with Prometheus and Grafana

  • Create alerting for performance regressions and threshold violations

  • Validate optimizations across containers, VMs, and networking

  • Implement continuous performance testing workflows

  • Analyze long-term performance trends and capacity planning

  • Troubleshoot performance issues using monitoring data

Workshop Performance Journey Recap

Before diving into monitoring, let’s recap the performance improvements achieved:

| Module | Focus | Key Metrics | Performance Achievement |
|--------|-------|-------------|-------------------------|
| Module 3 | Baseline Testing | Pod startup latency | 5.4s average (baseline established) |
| Module 4 | Performance Tuning (Optional) | Tuned pod latency | 50-70% improvement with CPU isolation (if applied) |
| Module 5 | Virtualization | VMI startup + Network policies | VMI: 60-90s (varies by tuning), NetPol: <10s |
| Module 6 | Monitoring (current) | End-to-end validation | Sustained performance assurance |

Performance Expectations by Configuration

  • Without Module 4 tuning: Baseline container performance (~5.4s), VMI startup 90-150s

  • With Module 4 tuning: Improved container performance (~2-3s), VMI startup 60-90s

  • Network policies: 5-20s depending on cluster configuration and tuning

Performance Monitoring Setup

Cluster Configuration Assessment

Before setting up monitoring, let’s assess the current cluster configuration to understand what performance characteristics to monitor.

  1. Detect current performance optimizations:

    echo "🔍 Assessing Current Cluster Performance Configuration..."
    echo ""
    
    # Check for Performance Profiles (Module 4)
    PERF_PROFILES=$(oc get performanceprofile --no-headers 2>/dev/null | wc -l)
    if [ "$PERF_PROFILES" -gt 0 ]; then
        echo "✅ Performance Profiles Found: $PERF_PROFILES"
        echo "   📊 Expected Performance: Optimized (50-70% improvement)"
        echo "   đŸŽ¯ Monitoring Focus: Validate tuning effectiveness"
        CLUSTER_TUNING="OPTIMIZED"
        oc get performanceprofile -o custom-columns=NAME:.metadata.name,ISOLATED:.spec.cpu.isolated,RESERVED:.spec.cpu.reserved
    else
        echo "â„šī¸  No Performance Profiles Found"
        echo "   📊 Expected Performance: Baseline"
        echo "   đŸŽ¯ Monitoring Focus: Identify optimization opportunities"
        CLUSTER_TUNING="BASELINE"
    fi
    
    echo ""
    
    # Check for VMI test results (Module 5)
    VMI_METRICS_DIR="$HOME/kube-burner-configs/collected-metrics-vmi"
    if [ -d "$VMI_METRICS_DIR" ] && [ -f "$VMI_METRICS_DIR/vmiLatencyMeasurement-vmi-latency-test.json" ]; then
        echo "✅ VMI Performance Data Available"
        echo "   📊 VMI tests completed in Module 5"
        VMI_TESTING="COMPLETED"
    else
        echo "ℹ️  No VMI Performance Data Found"
        echo "   📊 VMI testing may not have been completed"
        VMI_TESTING="PENDING"
    fi
    
    echo ""
    
    # Check for baseline metrics (Module 3)
    BASELINE_METRICS_DIR="$HOME/kube-burner-configs/collected-metrics"
    if [ -d "$BASELINE_METRICS_DIR" ]; then
        echo "✅ Baseline Performance Data Available"
        echo "   📊 Baseline established in Module 3"
        BASELINE_TESTING="COMPLETED"
    else
        echo "⚠️  No Baseline Performance Data Found"
        echo "   📊 Baseline testing may not have been completed"
        BASELINE_TESTING="MISSING"
    fi
    
    echo ""
    echo "đŸŽ¯ Monitoring Configuration Summary:"
    echo "   Cluster Tuning: $CLUSTER_TUNING"
    echo "   VMI Testing: $VMI_TESTING"
    echo "   Baseline Data: $BASELINE_TESTING"
    
    # Save configuration for later use
    echo "CLUSTER_TUNING=$CLUSTER_TUNING" > /tmp/monitoring-config
    echo "VMI_TESTING=$VMI_TESTING" >> /tmp/monitoring-config
    echo "BASELINE_TESTING=$BASELINE_TESTING" >> /tmp/monitoring-config

OpenShift Built-in Monitoring Stack

OpenShift includes a comprehensive monitoring stack based on Prometheus and Alertmanager; recent releases surface dashboards in the web console rather than shipping a bundled Grafana, though a separately deployed Grafana instance can still be used for custom dashboards. Let’s configure the stack for low-latency performance monitoring.

  1. Verify the monitoring stack is available:

    # Check monitoring operator status
    oc get clusteroperator monitoring
    
    # Verify Prometheus is running
    oc get pods -n openshift-monitoring | grep prometheus
    
    # Check for a Grafana route (present only if Grafana was deployed separately;
    # recent OpenShift releases provide monitoring dashboards in the web console instead)
    oc get route -n openshift-monitoring | grep grafana
  2. Enable user workload monitoring for custom metrics:

    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        enableUserWorkload: true
        prometheusK8s:
          retention: 7d
          volumeClaimTemplate:
            spec:
              storageClassName: gp3-csi
              resources:
                requests:
                  storage: 20Gi
        alertmanagerMain:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3-csi
              resources:
                requests:
                  storage: 2Gi
    EOF
  3. Configure user workload monitoring:

    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        prometheus:
          retention: 7d
          resources:
            requests:
              cpu: 200m
              memory: 2Gi
          volumeClaimTemplate:
            spec:
              storageClassName: gp3-csi
              resources:
                requests:
                  storage: 10Gi
    EOF
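
After applying both ConfigMaps, it is worth confirming that the user workload monitoring components actually start before moving on; a quick check, assuming default component names:

    # Watch the user workload monitoring pods come up (prometheus-user-workload, thanos-ruler-user-workload)
    oc get pods -n openshift-user-workload-monitoring
    
    # Confirm the cluster monitoring ConfigMap was accepted
    oc get configmap cluster-monitoring-config -n openshift-monitoring -o yaml | head -20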

Performance Metrics Collection

  1. Create a ServiceMonitor for kube-burner metrics:

    cat << EOF | oc apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: performance-testing-metrics
      namespace: openshift-monitoring
      labels:
        app: performance-testing
    spec:
      selector:
        matchLabels:
          app: kube-burner-metrics
      endpoints:
      - port: metrics
        interval: 30s
        path: /metrics
    EOF
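
Once metrics are flowing, PromQL queries can also be spot-checked from the command line rather than the console. A sketch using the cluster's Thanos Querier route (assumes your user has permission to query cluster monitoring):

    # Query pod startup P99 directly from the Thanos Querier API
    TOKEN=$(oc whoami -t)
    THANOS=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
    curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${THANOS}/api/v1/query" \
      --data-urlencode 'query=histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))' | jq .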

Custom Performance Dashboards

  1. Create a comprehensive performance dashboard:

    cd ~/kube-burner-configs
    
    cat << EOF > performance-dashboard.json
    {
      "dashboard": {
        "id": null,
        "title": "Low-Latency Performance Workshop",
        "tags": ["performance", "low-latency", "workshop"],
        "timezone": "browser",
        "panels": [
          {
            "id": 1,
            "title": "Pod Startup Latency (P99)",
            "type": "stat",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))",
                "legendFormat": "Pod P99 Latency"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "s",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": null},
                    {"color": "yellow", "value": 10},
                    {"color": "red", "value": 30}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "id": 2,
            "title": "VMI Startup Latency (P99)",
            "type": "stat",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, sum(rate(kubevirt_vmi_phase_transition_time_seconds_bucket{phase=\"Running\"}[5m])) by (le))",
                "legendFormat": "VMI P99 Latency"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "unit": "s",
                "thresholds": {
                  "steps": [
                    {"color": "green", "value": null},
                    {"color": "yellow", "value": 30},
                    {"color": "red", "value": 45}
                  ]
                }
              }
            },
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "id": 3,
            "title": "CPU Isolation Effectiveness",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(node_cpu_seconds_total{mode=\"idle\"}[5m]) * 100",
                "legendFormat": "CPU {{cpu}} Idle %"
              }
            ],
            "yAxes": [
              {"min": 0, "max": 100, "unit": "percent"}
            ],
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
          }
        ],
        "time": {"from": "now-1h", "to": "now"},
        "refresh": "30s"
      }
    }
    EOF
    
    # Note: This dashboard would be imported via Grafana UI or API
    echo "📊 Performance dashboard configuration created"
    echo "Import this dashboard into Grafana for visualization"

Performance Alerting Rules

  1. Create alerting rules for performance regressions:

    cat << EOF | oc apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: low-latency-performance-alerts
      namespace: openshift-monitoring
      labels:
        prometheus: kube-prometheus
        role: alert-rules
    spec:
      groups:
      - name: low-latency-performance
        rules:
        - alert: PodStartupLatencyHigh
          expr: histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le)) > 30
          for: 2m
          labels:
            severity: warning
            component: pod-startup
          annotations:
            summary: "Pod startup latency is too high"
            description: "Pod P99 startup latency is {{ \$value }}s, exceeding 30s threshold"
    
        - alert: VMIStartupLatencyHigh
          expr: histogram_quantile(0.99, sum(rate(kubevirt_vmi_phase_transition_time_seconds_bucket{phase="Running"}[5m])) by (le)) > 45
          for: 2m
          labels:
            severity: warning
            component: vmi-startup
          annotations:
            summary: "VMI startup latency is too high"
            description: "VMI P99 startup latency is {{ \$value }}s, exceeding 45s threshold"
    
        - alert: CPUIsolationBreach
          expr: rate(node_cpu_seconds_total{mode!="idle",cpu=~"2|3"}[5m]) > 0.1
          for: 1m
          labels:
            severity: critical
            component: cpu-isolation
          annotations:
            summary: "CPU isolation breach detected"
            description: "Isolated CPU {{ \$labels.cpu }} showing {{ \$value }} utilization"
    
        - alert: HugePagesExhausted
          expr: (node_memory_HugePages_Total - node_memory_HugePages_Free) / node_memory_HugePages_Total > 0.9
          for: 1m
          labels:
            severity: warning
            component: hugepages
          annotations:
            summary: "HugePages utilization is high"
            description: "HugePages utilization is {{ \$value | humanizePercentage }} on node {{ \$labels.instance }}"
    EOF
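
To confirm the rules are loaded and see which alerts (if any) are currently firing, the Alertmanager API can be queried directly; a sketch, assuming your user can view Alertmanager:

    # Verify the PrometheusRule object exists
    oc get prometheusrule low-latency-performance-alerts -n openshift-monitoring
    
    # List currently firing alerts from Alertmanager
    TOKEN=$(oc whoami -t)
    AM=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{.spec.host}')
    curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${AM}/api/v2/alerts" | jq -r '.[].labels.alertname'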

Validation Tools and Techniques

Comprehensive Performance Validation

Now let’s run comprehensive validation tests to verify all optimizations are working correctly.

  1. Create a comprehensive validation test suite:

    cd ~/kube-burner-configs
    
    cat << EOF > comprehensive-validation.sh
    #!/bin/bash
    
    # Comprehensive Performance Validation Suite
    # Tests all optimizations from Modules 3-5
    
    set -euo pipefail
    
    # Colors for output
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    BLUE='\033[0;34m'
    NC='\033[0m'
    
    log_info() { echo -e "\${BLUE}[INFO]\${NC} \$1"; }
    log_success() { echo -e "\${GREEN}[SUCCESS]\${NC} \$1"; }
    log_warning() { echo -e "\${YELLOW}[WARNING]\${NC} \$1"; }
    log_error() { echo -e "\${RED}[ERROR]\${NC} \$1"; }
    
    # Initialize result variables so 'set -u' does not abort the script if a test fails or is skipped
    BASELINE_P99=0
    TUNED_P99=0
    VMI_P99=0
    NETPOL_P99=0
    
    echo "🚀 Starting Comprehensive Performance Validation"
    echo "=============================================="
    
    # Test 1: Baseline Pod Performance
    log_info "Test 1: Validating baseline pod performance..."
    if kube-burner init -c baseline-config.yml --log-level=warn; then
        BASELINE_P99=\$(find collected-metrics/ -name "*podLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99' 2>/dev/null || echo "0")
        if (( \$(echo "\$BASELINE_P99 < 30000" | bc -l) )); then
            log_success "Baseline pod P99: \${BASELINE_P99}ms (< 30s threshold)"
        else
            log_warning "Baseline pod P99: \${BASELINE_P99}ms (exceeds 30s threshold)"
        fi
    else
        log_error "Baseline test failed"
    fi
    
    # Test 2: Performance-Tuned Pod Performance
    log_info "Test 2: Validating performance-tuned pod performance..."
    if kube-burner init -c tuned-config.yml --log-level=warn; then
        TUNED_P99=\$(find collected-metrics-tuned/ -name "*podLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99' 2>/dev/null || echo "0")
        if (( \$(echo "\$TUNED_P99 < 15000" | bc -l) )); then
            log_success "Tuned pod P99: \${TUNED_P99}ms (< 15s threshold)"
        else
            log_warning "Tuned pod P99: \${TUNED_P99}ms (exceeds 15s threshold)"
        fi
    
        # Calculate improvement
        if [[ "\$BASELINE_P99" != "0" && "\$TUNED_P99" != "0" ]]; then
            IMPROVEMENT=\$(echo "scale=1; (\$BASELINE_P99 - \$TUNED_P99) / \$BASELINE_P99 * 100" | bc -l)
            log_info "Performance improvement: \${IMPROVEMENT}% faster"
        fi
    else
        log_error "Tuned pod test failed"
    fi
    
    # Test 3: VMI Performance
    log_info "Test 3: Validating VMI performance..."
    if kube-burner init -c vmi-latency-config.yml --log-level=warn; then
        VMI_P99=\$(find collected-metrics-vmi/ -name "*vmiLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "VMIRunning") | .P99' 2>/dev/null || echo "0")
        if (( \$(echo "\$VMI_P99 < 45000" | bc -l) )); then
            log_success "VMI P99: \${VMI_P99}ms (< 45s threshold)"
        else
            log_warning "VMI P99: \${VMI_P99}ms (exceeds 45s threshold)"
        fi
    else
        log_error "VMI test failed"
    fi
    
    # Test 4: Network Policy Performance
    log_info "Test 4: Validating network policy performance..."
    if kube-burner init -c network-policy-latency-config.yml --log-level=warn; then
        NETPOL_P99=\$(find collected-metrics-netpol/ -name "*netpolLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99' 2>/dev/null || echo "0")
        if (( \$(echo "\$NETPOL_P99 < 5000" | bc -l) )); then
            log_success "Network Policy P99: \${NETPOL_P99}ms (< 5s threshold)"
        else
            log_warning "Network Policy P99: \${NETPOL_P99}ms (exceeds 5s threshold)"
        fi
    else
        log_error "Network policy test failed"
    fi
    
    # Generate comprehensive report
    log_info "Generating comprehensive validation report..."
    # Use a distinct heredoc delimiter so this does not terminate the outer heredoc that writes the script
    cat > comprehensive-validation-report-\$(date +%Y%m%d).md << REPORT_EOF
    # Comprehensive Performance Validation Report - \$(date)
    
    ## Test Results Summary
    
    | Test Type | P99 Latency | Threshold | Status |
    |-----------|-------------|-----------|--------|
    | **Baseline Pods** | \${BASELINE_P99}ms | < 30,000ms | \$(if (( \$(echo "\$BASELINE_P99 < 30000" | bc -l) )); then echo "✅ PASS"; else echo "❌ FAIL"; fi) |
    | **Tuned Pods** | \${TUNED_P99}ms | < 15,000ms | \$(if (( \$(echo "\$TUNED_P99 < 15000" | bc -l) )); then echo "✅ PASS"; else echo "❌ FAIL"; fi) |
    | **VMI Startup** | \${VMI_P99}ms | < 45,000ms | \$(if (( \$(echo "\$VMI_P99 < 45000" | bc -l) )); then echo "✅ PASS"; else echo "❌ FAIL"; fi) |
    | **Network Policy** | \${NETPOL_P99}ms | < 5,000ms | \$(if (( \$(echo "\$NETPOL_P99 < 5000" | bc -l) )); then echo "✅ PASS"; else echo "❌ FAIL"; fi) |
    
    ## Performance Improvements
    
    - **Pod Performance**: \$(if [[ "\$BASELINE_P99" != "0" && "\$TUNED_P99" != "0" ]]; then echo "scale=1; (\$BASELINE_P99 - \$TUNED_P99) / \$BASELINE_P99 * 100" | bc -l | sed 's/$/% improvement/'; else echo "TBD"; fi)
    - **VM vs Container**: \$(if [[ "\$VMI_P99" != "0" && "\$TUNED_P99" != "0" ]]; then echo "scale=1; \$VMI_P99 / \$TUNED_P99" | bc -l | sed 's/$/x slower (expected)/'; else echo "TBD"; fi)
    
    ## Validation Status
    
    \$(if (( \$(echo "\$BASELINE_P99 < 30000 && \$TUNED_P99 < 15000 && \$VMI_P99 < 45000 && \$NETPOL_P99 < 5000" | bc -l) )); then echo "🎉 **ALL TESTS PASSED** - Low-latency optimizations are working correctly!"; else echo "⚠️ **SOME TESTS FAILED** - Review individual test results above"; fi)
    
    ## Next Steps
    
    1. Monitor performance trends over time
    2. Set up alerting for performance regressions
    3. Implement continuous performance testing
    4. Fine-tune thresholds based on workload requirements
    REPORT_EOF
    
    log_success "Comprehensive validation completed!"
    echo ""
    echo "📊 Validation Report:"
    echo "===================="
    cat comprehensive-validation-report-\$(date +%Y%m%d).md
    
    EOF
    
    chmod +x comprehensive-validation.sh
  2. Run the comprehensive validation suite:

    cd ~/kube-burner-configs
    
    # Execute the comprehensive validation
    echo "🚀 Running comprehensive performance validation..."
    ./comprehensive-validation.sh
    
    # The script will:
    # 1. Run all test types (baseline, tuned, VMI, network policy)
    # 2. Validate against performance thresholds
    # 3. Calculate performance improvements
    # 4. Generate a comprehensive report

Continuous Performance Testing

  1. Set up automated performance testing with a cron job:

    cat << EOF | oc apply -f -
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: performance-validation
      namespace: default
    spec:
      schedule: "0 2 * * *"  # Run daily at 2 AM
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: performance-test
                image: quay.io/cloud-bulldozer/kube-burner:latest
                command: ["/bin/bash"]
                args:
                - -c
                - |
                  cd /workspace
                  git clone https://github.com/tosin2013/low-latency-performance-workshop.git
                  cd low-latency-performance-workshop/kube-burner-configs
                  ./comprehensive-validation.sh
                  # Send results to monitoring system
                  curl -X POST http://webhook-receiver:8080/performance-results \
                    -H "Content-Type: application/json" \
                    -d @comprehensive-validation-report-\$(date +%Y%m%d).md
                volumeMounts:
                - name: workspace
                  mountPath: /workspace
              volumes:
              - name: workspace
                emptyDir: {}
              restartPolicy: OnFailure
    EOF
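
As written, the Job runs with the namespace's default ServiceAccount, which will not have permission to create the namespaces and workloads kube-burner needs (and the image must contain bash, git, and curl for the inline script). One deliberately permissive, workshop-only sketch is to give the job its own ServiceAccount; tighten the role binding for production use:

    # Create a dedicated ServiceAccount and grant it broad access (workshop convenience only)
    oc create serviceaccount performance-validation -n default
    oc adm policy add-cluster-role-to-user cluster-admin -z performance-validation -n default
    
    # Then reference it in the CronJob:
    #   spec.jobTemplate.spec.template.spec.serviceAccountName: performance-validation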

Performance Regression Detection

  1. Create a performance regression detection script:

    cat << EOF > detect-regressions.sh
    #!/bin/bash
    
    # Performance Regression Detection
    # Compares current performance against historical baselines
    
    set -euo pipefail
    
    BASELINE_FILE="performance-baseline.json"
    CURRENT_RESULTS="comprehensive-validation-report-\$(date +%Y%m%d).md"
    
    # Extract current metrics
    CURRENT_POD_P99=\$(grep "Baseline Pods" \$CURRENT_RESULTS | grep -o '[0-9.]*ms' | head -1 | sed 's/ms//')
    CURRENT_VMI_P99=\$(grep "VMI Startup" \$CURRENT_RESULTS | grep -o '[0-9.]*ms' | head -1 | sed 's/ms//')
    
    # Load baseline (create if doesn't exist)
    if [[ ! -f \$BASELINE_FILE ]]; then
        echo "Creating initial baseline..."
        cat > \$BASELINE_FILE << EOB
    {
      "pod_p99": \$CURRENT_POD_P99,
      "vmi_p99": \$CURRENT_VMI_P99,
      "timestamp": "\$(date -Iseconds)"
    }
    EOB
        echo "✅ Baseline created with current values"
        exit 0
    fi
    
    # Compare against baseline
    BASELINE_POD_P99=\$(jq -r '.pod_p99' \$BASELINE_FILE)
    BASELINE_VMI_P99=\$(jq -r '.vmi_p99' \$BASELINE_FILE)
    
    # Calculate regression percentage (>10% slower = regression)
    POD_REGRESSION=\$(echo "scale=2; (\$CURRENT_POD_P99 - \$BASELINE_POD_P99) / \$BASELINE_POD_P99 * 100" | bc -l)
    VMI_REGRESSION=\$(echo "scale=2; (\$CURRENT_VMI_P99 - \$BASELINE_VMI_P99) / \$BASELINE_VMI_P99 * 100" | bc -l)
    
    echo "📊 Performance Regression Analysis"
    echo "================================="
    echo "Pod P99: \${CURRENT_POD_P99}ms (baseline: \${BASELINE_POD_P99}ms) - \${POD_REGRESSION}% change"
    echo "VMI P99: \${CURRENT_VMI_P99}ms (baseline: \${BASELINE_VMI_P99}ms) - \${VMI_REGRESSION}% change"
    
    # Alert on significant regressions
    if (( \$(echo "\$POD_REGRESSION > 10" | bc -l) )) || (( \$(echo "\$VMI_REGRESSION > 10" | bc -l) )); then
        echo "🚨 PERFORMANCE REGRESSION DETECTED!"
        echo "Consider investigating recent changes or system issues"
        exit 1
    else
        echo "✅ No significant performance regression detected"
    fi
    EOF
    
    chmod +x detect-regressions.sh
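
Run the script from the directory that holds the validation reports; the first run simply records the current values as the baseline:

    cd ~/kube-burner-configs
    ./detect-regressions.sh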

Best Practices for Production Monitoring

Monitoring Strategy Best Practices

  1. Establish Clear Baselines:

    • Document performance baselines for all workload types

    • Update baselines after infrastructure changes

    • Use statistical methods (P50, P95, P99), not just averages (see the example queries after this list)

  2. Implement Layered Monitoring:

    • Infrastructure: Node resources, network, storage

    • Platform: Kubernetes/OpenShift metrics

    • Application: Workload-specific performance metrics

    • Business: End-user experience metrics

  3. Set Meaningful Thresholds:

    • Base thresholds on business requirements, not arbitrary values

    • Use different thresholds for different workload types

    • Implement both warning and critical alert levels
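
For example, pod startup latency can be summarized as percentiles instead of an average; a sketch that queries the Thanos Querier route for P50/P95/P99 (the same expressions can be pasted into the console under Observe → Metrics):

    # Compare P50/P95/P99 pod startup latency rather than relying on an average
    TOKEN=$(oc whoami -t)
    THANOS=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
    for Q in 0.50 0.95 0.99; do
        echo -n "P${Q}: "
        curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${THANOS}/api/v1/query" \
          --data-urlencode "query=histogram_quantile(${Q}, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))" \
          | jq -r '.data.result[0].value[1] // "no data"'
    done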

Alerting Best Practices

| Alert Type | Purpose | Example Threshold |
|------------|---------|-------------------|
| Performance Regression | Detect degradation over time | >20% increase in P99 latency |
| Threshold Breach | Immediate performance issues | Pod startup >30s, VMI startup >45s |
| Resource Exhaustion | Prevent capacity issues | HugePages >90% utilized |
| Configuration Drift | Ensure optimization integrity | CPU isolation breach detected |

Continuous Improvement Process

  1. Weekly Performance Reviews:

    • Analyze performance trends

    • Identify optimization opportunities

    • Review alert effectiveness

  2. Monthly Capacity Planning:

    • Assess resource utilization trends

    • Plan for growth and scaling

    • Update performance baselines

  3. Quarterly Optimization Cycles:

    • Test new performance features

    • Validate optimization effectiveness

    • Update monitoring and alerting

Troubleshooting Performance Issues

Systematic Troubleshooting Approach

  1. Identify the Scope:

    • Is it affecting all workloads or specific types?

    • When did the issue start?

    • What changed recently?

  2. Gather Data:

    • Check monitoring dashboards

    • Review recent alerts

    • Examine system logs

  3. Isolate the Cause:

    • Test with minimal workloads

    • Compare against known-good baselines

    • Use performance profiling tools

  4. Implement and Validate Fix:

    • Apply targeted fixes

    • Validate with performance tests

    • Monitor for sustained improvement

Common Performance Issues and Solutions

| Issue | Symptoms | Solution |
|-------|----------|----------|
| CPU Contention | High latency, CPU isolation breaches | Review CPU allocation, check for noisy neighbors |
| Memory Pressure | Increased swap usage, OOM kills | Verify HugePages configuration, check memory limits |
| Network Bottlenecks | High network latency, packet drops | Review network policies, check SR-IOV configuration |
| Storage Performance | High I/O wait times, slow disk operations | Optimize storage classes, check disk utilization |
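
A few read-only commands help confirm which of these conditions you are hitting (replace <node> with an actual worker node name):

    # Resource pressure at a glance
    oc adm top nodes
    
    # HugePages allocation and usage on a specific node
    oc debug node/<node> -- chroot /host grep -i huge /proc/meminfo
    
    # Confirm the isolated CPU set is still in place
    oc debug node/<node> -- chroot /host cat /sys/devices/system/cpu/isolated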

Performance Testing in CI/CD

  1. Integrate performance tests in pipelines:

    • Run lightweight performance tests on every deployment

    • Execute comprehensive tests on release candidates

    • Block deployments that fail performance thresholds (see the sketch after this list)

  2. Automated performance reporting:

    • Generate performance reports for each build

    • Track performance trends over time

    • Alert on performance regressions
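
A lightweight gate can reuse the same kube-burner configuration and thresholds introduced earlier in this module. A sketch of such a pipeline step, assuming kube-burner, jq, and bc are available in the CI image:

    set -euo pipefail
    
    # Run the baseline workload and extract the Ready P99 latency (milliseconds)
    kube-burner init -c baseline-config.yml --log-level=warn
    P99=$(find collected-metrics/ -name "*podLatencyQuantilesMeasurement*" -type f | head -1 \
          | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99')
    
    # Fail the pipeline if the threshold is exceeded
    if (( $(echo "${P99} > 30000" | bc -l) )); then
        echo "Performance gate FAILED: pod P99 ${P99}ms exceeds the 30s threshold"
        exit 1
    fi
    echo "Performance gate passed: pod P99 ${P99}ms"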

Documentation and Knowledge Management

  1. Maintain Performance Runbooks:

    • Document common performance issues and solutions

    • Include step-by-step troubleshooting guides

    • Keep contact information for escalation

  2. Performance Baseline Documentation:

    • Document expected performance for each workload type

    • Include configuration details and optimization settings

    • Update after significant changes

Hands-on Exercise: Complete Validation

Exercise Objectives

  • Execute comprehensive performance validation across all workshop modules

  • Set up monitoring and alerting for production readiness

  • Implement continuous performance testing workflows

    1. Run the comprehensive workshop validator:

      # Validate entire workshop completion
      echo "🔍 Running Comprehensive Workshop Validation..."
      python3 ~/low-latency-performance-workshop/scripts/module06-comprehensive-validator.py
      
      # Generate detailed validation report
      python3 ~/low-latency-performance-workshop/scripts/module06-comprehensive-validator.py --report
      
      # Specify custom metrics directory if needed
      python3 ~/low-latency-performance-workshop/scripts/module06-comprehensive-validator.py \
          --metrics-dir ~/kube-burner-configs \
          --report

      This validator checks:

  • Module 3: Baseline performance metrics collected

  • Module 4: Performance tuning applied (optional)

  • Module 5: VMI and network testing completed

  • Cluster health and configuration

  • Overall performance improvements

    2. Generate workshop summary:

      # Generate comprehensive workshop summary
      echo "🎉 Generating Workshop Summary..."
      python3 ~/low-latency-performance-workshop/scripts/module06-workshop-summary.py
      
      # Save summary to file
      python3 ~/low-latency-performance-workshop/scripts/module06-workshop-summary.py --no-color \
          > workshop-summary-$(date +%Y%m%d).txt

      The summary includes:

  • Module overview and learning journey

  • Performance improvement statistics

  • Key technologies and techniques

  • Best practices learned

  • Production readiness checklist

  • Next steps and resources

    3. Detect performance regressions:

      # Save current metrics as baseline (first time)
      echo "💾 Saving Performance Baseline..."
      python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py \
          --save-baseline
      
      # Later, detect regressions against baseline
      echo "🔍 Detecting Performance Regressions..."
      python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py
      
      # Use custom threshold (default: 10%)
      python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py \
          --threshold 15
      
      # Specify custom baseline file
      python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py \
          --baseline /path/to/baseline.json

      The regression detector:

  • Compares current vs historical performance

  • Alerts on regressions exceeding threshold

  • Analyzes P50, P95, P99 percentiles

  • Saves/loads performance baselines

  • Provides actionable recommendations

Recommended Workflow:

  1. Run module06-comprehensive-validator.py to verify all modules are complete

  2. Run module06-performance-regression-detector.py --save-baseline to save current performance

  3. After any changes, run the regression detector to check for performance degradation

  4. Run module06-workshop-summary.py to generate final workshop report

These scripts work together to provide complete workshop validation and ongoing performance monitoring.
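
Put together, the recommended workflow looks like this (paths assume the workshop repository is cloned to your home directory, as in earlier modules):

    SCRIPTS=~/low-latency-performance-workshop/scripts
    
    # 1. Verify all modules are complete
    python3 "${SCRIPTS}/module06-comprehensive-validator.py" --report
    
    # 2. Record the current performance as the baseline
    python3 "${SCRIPTS}/module06-performance-regression-detector.py" --save-baseline
    
    # 3. After any change, check for regressions (default threshold: 10%)
    python3 "${SCRIPTS}/module06-performance-regression-detector.py"
    
    # 4. Generate the final workshop report
    python3 "${SCRIPTS}/module06-workshop-summary.py" --no-color > workshop-summary-$(date +%Y%m%d).txt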

  1. Execute the comprehensive validation (optional bash scripts):

    cd ~/kube-burner-configs
    
    # Run the complete validation suite
    echo "🚀 Executing comprehensive performance validation..."
    ./comprehensive-validation.sh
    
    # Check for performance regressions
    echo "🔍 Checking for performance regressions..."
    ./detect-regressions.sh
    
    # View the final validation report
    echo "📊 Final Validation Report:"
    echo "=========================="
    cat comprehensive-validation-report-$(date +%Y%m%d).md
  2. Generate comprehensive Python-based analysis:

    cd ~/kube-burner-configs
    
    # Load monitoring configuration from earlier assessment
    source /tmp/monitoring-config 2>/dev/null || {
        echo "âš ī¸  Monitoring config not found, detecting current state..."
        CLUSTER_TUNING="UNKNOWN"
        VMI_TESTING="UNKNOWN"
        BASELINE_TESTING="UNKNOWN"
    }
    
    echo "🎓 Generating Final Workshop Performance Analysis..."
    echo "Configuration: Tuning=$CLUSTER_TUNING, VMI=$VMI_TESTING, Baseline=$BASELINE_TESTING"
    echo ""
    
    # Generate comprehensive Module 6 analysis
    echo "đŸŽ¯ Module 6 Comprehensive Analysis (All Workshop Data)..."
    python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 6
    
    echo ""
    echo "📄 Generating Final Workshop Report..."
    REPORT_FILE="final_workshop_analysis_$(date +%Y%m%d-%H%M).md"
    
    # Generate comprehensive report with all available data
    if [ -d "collected-metrics" ] && [ -d "collected-metrics-tuned" ] && [ -d "collected-metrics-vmi" ]; then
        echo "📊 Full workshop report: Baseline + Tuned + VMI"
        python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
            --baseline collected-metrics \
            --tuned collected-metrics-tuned \
            --vmi collected-metrics-vmi \
            --report "$REPORT_FILE"
    elif [ -d "collected-metrics" ] && [ -d "collected-metrics-vmi" ]; then
        echo "📊 Baseline + VMI report"
        python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
            --baseline collected-metrics \
            --vmi collected-metrics-vmi \
            --report "$REPORT_FILE"
    elif [ -d "collected-metrics-vmi" ]; then
        echo "📊 VMI-only report"
        python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
            --single collected-metrics-vmi \
            --report "$REPORT_FILE"
    else
        echo "❌ No performance metrics found for comprehensive analysis"
        echo "   Please ensure you've completed the performance tests from previous modules"
    fi
    
    echo ""
    echo "📊 Performance Summary:"
    python3 ~/low-latency-performance-workshop/scripts/module04-performance-summary.py
    
    echo ""
    echo "📄 Generated Analysis Reports:"
    ls -la *analysis*.md 2>/dev/null || echo "No analysis reports found"
  3. Verify monitoring and alerting setup:

    # Check that alerting rules are active
    oc get prometheusrule low-latency-performance-alerts -n openshift-monitoring
    
    # Verify monitoring stack is healthy
    oc get pods -n openshift-monitoring | grep -E "(prometheus|grafana|alertmanager)"
    
    # Test alert firing (optional - creates temporary high latency)
    echo "Testing alert system (creates temporary load)..."
    oc run test-load --image=busybox --restart=Never -- sleep 300
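    
    # When finished, clean up the temporary test pod so it does not linger
    oc delete pod test-load --ignore-not-found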

Module Summary

This module completed the low-latency performance workshop by implementing comprehensive monitoring and validation:

  • ✅ Set up performance monitoring with Prometheus, Grafana, and custom dashboards

  • ✅ Created alerting rules for performance regressions and threshold violations

  • ✅ Implemented comprehensive validation testing all optimizations from Modules 3-5

  • ✅ Established continuous testing with automated performance validation

  • ✅ Built regression detection to maintain performance over time

  • ✅ Documented best practices for production monitoring and troubleshooting

Workshop Performance Journey - Final Results

| Module | Focus | Target Metrics | Achievement Status |
|--------|-------|----------------|--------------------|
| Module 3 | Baseline Testing | Pod startup: Establish baseline | ✅ 5.4s average baseline established |
| Module 4 | Performance Tuning (Optional) | Pod startup: <3s P99 (50-70% improvement if applied) | ✅ CPU isolation and HugePages (if completed) |
| Module 5 | Virtualization | VMI startup: <45s P99, NetPol: <5s P99 | ✅ Low-latency VMs and network testing |
| Module 6 | Monitoring & Validation | End-to-end validation and monitoring | ✅ Complete monitoring and validation suite |

Key Performance Achievements

Based on the comprehensive validation, the workshop demonstrates:

  1. Container Performance: Significant improvement in pod startup latency through CPU isolation and HugePages

  2. VM Performance: Optimized virtual machine startup times with dedicated resources

  3. Network Performance: Validated network policy enforcement latency

  4. Monitoring Maturity: Production-ready monitoring, alerting, and continuous validation

Production Readiness Checklist

✅ Performance Baselines: Established and documented
✅ Monitoring Stack: Prometheus, Grafana, and Alertmanager configured
✅ Alerting Rules: Performance regression and threshold alerts active
✅ Continuous Testing: Automated validation and regression detection
✅ Documentation: Runbooks and troubleshooting guides available
✅ Best Practices: Implemented monitoring and optimization strategies

Knowledge Check

  1. What are the key components of a comprehensive performance monitoring strategy?

  2. How do you detect performance regressions in a production environment?

  3. What thresholds should be set for pod startup, VMI startup, and network policy latency?

  4. How would you troubleshoot a sudden increase in container startup latency?

Advanced Topics for Further Learning

  • Multi-cluster Performance Monitoring: Scaling monitoring across multiple OpenShift clusters

  • Advanced Performance Profiling: Using tools like perf, flamegraphs, and eBPF for deep analysis

  • Predictive Performance Analytics: Using machine learning for capacity planning and anomaly detection

  • Custom Performance Metrics: Developing application-specific performance measurements

Workshop Conclusion

Congratulations! You have successfully completed the Low-Latency Performance Workshop for OpenShift 4.19.

You now have the knowledge and tools to:

  • Establish performance baselines using modern testing tools

  • Apply low-latency optimizations with Performance Profiles and CPU isolation

  • Optimize virtual machines for high-performance workloads

  • Monitor and validate performance improvements in production

  • Maintain performance through continuous testing and alerting

The skills and techniques learned in this workshop are directly applicable to production environments requiring deterministic, low-latency performance.

Next Steps

Continue your low-latency performance journey by:

  1. Applying these techniques to your production workloads

  2. Customizing monitoring for your specific application requirements

  3. Exploring advanced features like SR-IOV networking and real-time kernels

  4. Contributing back to the community with your performance optimization experiences

Thank you for participating in the Low-Latency Performance Workshop!