Module 6: Monitoring and Validation
Module Overview
This module focuses on comprehensive monitoring and validation of low-latency performance optimizations. You’ll learn to set up monitoring dashboards, create performance alerts, validate optimizations across the entire stack, and implement continuous performance testing to ensure sustained performance improvements.
Prerequisites
- Completed Module 5 (Low-Latency Virtualization)
- Performance baselines established in Module 3
- Optional: Performance tuning from Module 4 (for enhanced monitoring)
- VMI and network policy testing from Module 5
Module 4 Dependency: This module can validate performance with or without Module 4 optimizations; the monitoring setup adapts to your cluster's current configuration.
Key Learning Objectives
- Set up comprehensive performance monitoring with Prometheus and Grafana
- Create alerting for performance regressions and threshold violations
- Validate optimizations across containers, VMs, and networking
- Implement continuous performance testing workflows
- Analyze long-term performance trends and capacity planning
- Troubleshoot performance issues using monitoring data
Workshop Performance Journey Recap
Before diving into monitoring, let’s recap the performance improvements achieved:
| Module | Focus | Key Metrics | Performance Achievement |
|---|---|---|---|
| Module 3 | Baseline Testing | Pod startup latency | 5.4s average (baseline established) |
| Module 4 | Performance Tuning (Optional) | Tuned pod latency | 50-70% improvement with CPU isolation (if applied) |
| Module 5 | Virtualization | VMI startup + network policies | VMI: 60-90s (varies by tuning), NetPol: <10s |
| Module 6 | Monitoring (current) | End-to-end validation | Sustained performance assurance |
Performance Expectations by Configuration
Expected results depend on whether Module 4 tuning was applied: baseline clusters should match the Module 3 numbers above, while clusters with a Performance Profile should show the 50-70% pod startup improvement and the faster end of the VMI startup range noted in the recap table.
Performance Monitoring Setup
Cluster Configuration Assessment
Before setting up monitoring, let’s assess the current cluster configuration to understand what performance characteristics to monitor.
- Detect current performance optimizations:

```bash
echo "Assessing current cluster performance configuration..."
echo ""

# Check for Performance Profiles (Module 4)
PERF_PROFILES=$(oc get performanceprofile --no-headers 2>/dev/null | wc -l)
if [ "$PERF_PROFILES" -gt 0 ]; then
    echo "[OK] Performance Profiles found: $PERF_PROFILES"
    echo "     Expected performance: optimized (50-70% improvement)"
    echo "     Monitoring focus: validate tuning effectiveness"
    CLUSTER_TUNING="OPTIMIZED"
    oc get performanceprofile -o custom-columns=NAME:.metadata.name,ISOLATED:.spec.cpu.isolated,RESERVED:.spec.cpu.reserved
else
    echo "[INFO] No Performance Profiles found"
    echo "       Expected performance: baseline"
    echo "       Monitoring focus: identify optimization opportunities"
    CLUSTER_TUNING="BASELINE"
fi
echo ""

# Check for VMI test results (Module 5)
# Note: leave ~ unquoted so the shell expands it to the home directory
VMI_METRICS_DIR=~/kube-burner-configs/collected-metrics-vmi
if [ -d "$VMI_METRICS_DIR" ] && [ -f "$VMI_METRICS_DIR/vmiLatencyMeasurement-vmi-latency-test.json" ]; then
    echo "[OK] VMI performance data available (VMI tests completed in Module 5)"
    VMI_TESTING="COMPLETED"
else
    echo "[INFO] No VMI performance data found (VMI testing may not have been completed)"
    VMI_TESTING="PENDING"
fi
echo ""

# Check for baseline metrics (Module 3)
BASELINE_METRICS_DIR=~/kube-burner-configs/collected-metrics
if [ -d "$BASELINE_METRICS_DIR" ]; then
    echo "[OK] Baseline performance data available (established in Module 3)"
    BASELINE_TESTING="COMPLETED"
else
    echo "[WARN] No baseline performance data found (baseline testing may not have been completed)"
    BASELINE_TESTING="MISSING"
fi
echo ""

echo "Monitoring configuration summary:"
echo "  Cluster Tuning: $CLUSTER_TUNING"
echo "  VMI Testing:    $VMI_TESTING"
echo "  Baseline Data:  $BASELINE_TESTING"

# Save configuration for later use
echo "CLUSTER_TUNING=$CLUSTER_TUNING" > /tmp/monitoring-config
echo "VMI_TESTING=$VMI_TESTING" >> /tmp/monitoring-config
echo "BASELINE_TESTING=$BASELINE_TESTING" >> /tmp/monitoring-config
```
OpenShift Built-in Monitoring Stack
OpenShift includes a comprehensive monitoring stack based on Prometheus, Grafana, and Alertmanager. Let’s configure it for low-latency performance monitoring.
- Verify the monitoring stack is available:

```bash
# Check monitoring operator status
oc get clusteroperator monitoring

# Verify Prometheus is running
oc get pods -n openshift-monitoring | grep prometheus

# Check Grafana availability (recent OpenShift releases no longer bundle Grafana
# with the monitoring stack, so this may return nothing)
oc get route -n openshift-monitoring | grep grafana
```
- Enable user workload monitoring for custom metrics:

```bash
cat << EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 7d
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi
          resources:
            requests:
              storage: 20Gi
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi
          resources:
            requests:
              storage: 2Gi
EOF
```
- Configure user workload monitoring:

```bash
cat << EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 7d
      resources:
        requests:
          cpu: 200m
          memory: 2Gi
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi
          resources:
            requests:
              storage: 10Gi
EOF
```
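The ConfigMap changes above take a minute or two to reconcile. A quick way to confirm that the user workload monitoring components actually started and that the requested volumes were bound (standard oc checks, output will vary by cluster):

```bash
# User workload monitoring components should appear in their own namespace
oc get pods -n openshift-user-workload-monitoring

# Core monitoring pods should remain healthy after the config change
oc get pods -n openshift-monitoring | grep -E "prometheus-k8s|alertmanager-main"

# The PVCs requested in the ConfigMaps should be Bound
oc get pvc -n openshift-monitoring
```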
Performance Metrics Collection
- Create a ServiceMonitor for kube-burner metrics:

```bash
cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: performance-testing-metrics
  namespace: openshift-monitoring
  labels:
    app: performance-testing
spec:
  selector:
    matchLabels:
      app: kube-burner-metrics
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
EOF
```
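Once the ServiceMonitor is applied, it is worth spot-checking that the histograms used throughout this module are queryable. One way is to query the cluster's thanos-querier route directly; this assumes your user is allowed to query cluster metrics (for example via the cluster-monitoring-view role):

```bash
# Resolve the metrics query endpoint and grab an auth token
THANOS=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)

# Query the pod startup P99 used by the dashboards and alerts below
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$THANOS/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))' | jq .
```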
Custom Performance Dashboards
- Create a comprehensive performance dashboard:

```bash
cd ~/kube-burner-configs

cat << EOF > performance-dashboard.json
{
  "dashboard": {
    "id": null,
    "title": "Low-Latency Performance Workshop",
    "tags": ["performance", "low-latency", "workshop"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Pod Startup Latency (P99)",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "Pod P99 Latency"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 10},
                {"color": "red", "value": 30}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "VMI Startup Latency (P99)",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(kubevirt_vmi_phase_transition_time_seconds_bucket{phase=\"Running\"}[5m])) by (le))",
            "legendFormat": "VMI P99 Latency"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 30},
                {"color": "red", "value": 45}
              ]
            }
          }
        },
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "CPU Isolation Effectiveness",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_cpu_seconds_total{mode=\"idle\"}[5m]) * 100",
            "legendFormat": "CPU {{cpu}} Idle %"
          }
        ],
        "yAxes": [
          {"min": 0, "max": 100, "unit": "percent"}
        ],
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}
      }
    ],
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}
EOF

# Note: this dashboard is imported via the Grafana UI or API
echo "Performance dashboard configuration created"
echo "Import this dashboard into Grafana for visualization"
```
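If you have a Grafana instance available (recent OpenShift releases no longer bundle Grafana with the monitoring stack, so this may be a user-deployed instance), the file can be imported through Grafana's HTTP API instead of the UI. The URL and token below are placeholders you would replace:

```bash
# Hypothetical Grafana endpoint and API token - replace with your own values
GRAFANA_URL="https://grafana.example.com"
GRAFANA_TOKEN="<api-token>"

# performance-dashboard.json already wraps the panels in a top-level "dashboard"
# key, which is the shape the /api/dashboards/db import endpoint expects
curl -sk -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @performance-dashboard.json | jq .
```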
Performance Alerting Rules
- Create alerting rules for performance regressions:

```bash
cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: low-latency-performance-alerts
  namespace: openshift-monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: low-latency-performance
    rules:
    - alert: PodStartupLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le)) > 30
      for: 2m
      labels:
        severity: warning
        component: pod-startup
      annotations:
        summary: "Pod startup latency is too high"
        description: "Pod P99 startup latency is {{ \$value }}s, exceeding the 30s threshold"
    - alert: VMIStartupLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(kubevirt_vmi_phase_transition_time_seconds_bucket{phase="Running"}[5m])) by (le)) > 45
      for: 2m
      labels:
        severity: warning
        component: vmi-startup
      annotations:
        summary: "VMI startup latency is too high"
        description: "VMI P99 startup latency is {{ \$value }}s, exceeding the 45s threshold"
    - alert: CPUIsolationBreach
      expr: rate(node_cpu_seconds_total{mode!="idle",cpu=~"2|3"}[5m]) > 0.1
      for: 1m
      labels:
        severity: critical
        component: cpu-isolation
      annotations:
        summary: "CPU isolation breach detected"
        description: "Isolated CPU {{ \$labels.cpu }} showing {{ \$value }} utilization"
    - alert: HugePagesExhausted
      expr: (node_memory_HugePages_Total - node_memory_HugePages_Free) / node_memory_HugePages_Total > 0.9
      for: 1m
      labels:
        severity: warning
        component: hugepages
      annotations:
        summary: "HugePages utilization is high"
        description: "HugePages utilization is {{ \$value | humanizePercentage }} on node {{ \$labels.instance }}"
EOF
```
Validation Tools and Techniques
Comprehensive Performance Validation
Now let’s run comprehensive validation tests to verify all optimizations are working correctly.
- Create a comprehensive validation test suite:

```bash
cd ~/kube-burner-configs

cat << EOF > comprehensive-validation.sh
#!/bin/bash
# Comprehensive Performance Validation Suite
# Tests all optimizations from Modules 3-5

set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

log_info()    { echo -e "\${BLUE}[INFO]\${NC} \$1"; }
log_success() { echo -e "\${GREEN}[SUCCESS]\${NC} \$1"; }
log_warning() { echo -e "\${YELLOW}[WARNING]\${NC} \$1"; }
log_error()   { echo -e "\${RED}[ERROR]\${NC} \$1"; }

# Initialize results so the report still renders if an individual test fails
BASELINE_P99=0
TUNED_P99=0
VMI_P99=0
NETPOL_P99=0

echo "Starting Comprehensive Performance Validation"
echo "=============================================="

# Test 1: Baseline pod performance
log_info "Test 1: Validating baseline pod performance..."
if kube-burner init -c baseline-config.yml --log-level=warn; then
    BASELINE_P99=\$(find collected-metrics/ -name "*podLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99' 2>/dev/null || echo "0")
    if (( \$(echo "\$BASELINE_P99 < 30000" | bc -l) )); then
        log_success "Baseline pod P99: \${BASELINE_P99}ms (< 30s threshold)"
    else
        log_warning "Baseline pod P99: \${BASELINE_P99}ms (exceeds 30s threshold)"
    fi
else
    log_error "Baseline test failed"
fi

# Test 2: Performance-tuned pod performance
log_info "Test 2: Validating performance-tuned pod performance..."
if kube-burner init -c tuned-config.yml --log-level=warn; then
    TUNED_P99=\$(find collected-metrics-tuned/ -name "*podLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99' 2>/dev/null || echo "0")
    if (( \$(echo "\$TUNED_P99 < 15000" | bc -l) )); then
        log_success "Tuned pod P99: \${TUNED_P99}ms (< 15s threshold)"
    else
        log_warning "Tuned pod P99: \${TUNED_P99}ms (exceeds 15s threshold)"
    fi
    # Calculate improvement
    if [[ "\$BASELINE_P99" != "0" && "\$TUNED_P99" != "0" ]]; then
        IMPROVEMENT=\$(echo "scale=1; (\$BASELINE_P99 - \$TUNED_P99) / \$BASELINE_P99 * 100" | bc -l)
        log_info "Performance improvement: \${IMPROVEMENT}% faster"
    fi
else
    log_error "Tuned pod test failed"
fi

# Test 3: VMI performance
log_info "Test 3: Validating VMI performance..."
if kube-burner init -c vmi-latency-config.yml --log-level=warn; then
    VMI_P99=\$(find collected-metrics-vmi/ -name "*vmiLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "VMIRunning") | .P99' 2>/dev/null || echo "0")
    if (( \$(echo "\$VMI_P99 < 45000" | bc -l) )); then
        log_success "VMI P99: \${VMI_P99}ms (< 45s threshold)"
    else
        log_warning "VMI P99: \${VMI_P99}ms (exceeds 45s threshold)"
    fi
else
    log_error "VMI test failed"
fi

# Test 4: Network policy performance
log_info "Test 4: Validating network policy performance..."
if kube-burner init -c network-policy-latency-config.yml --log-level=warn; then
    NETPOL_P99=\$(find collected-metrics-netpol/ -name "*netpolLatencyQuantilesMeasurement*" -type f | head -1 | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99' 2>/dev/null || echo "0")
    if (( \$(echo "\$NETPOL_P99 < 5000" | bc -l) )); then
        log_success "Network Policy P99: \${NETPOL_P99}ms (< 5s threshold)"
    else
        log_warning "Network Policy P99: \${NETPOL_P99}ms (exceeds 5s threshold)"
    fi
else
    log_error "Network policy test failed"
fi

# Generate comprehensive report
log_info "Generating comprehensive validation report..."

cat > comprehensive-validation-report-\$(date +%Y%m%d).md << EOB
# Comprehensive Performance Validation Report - \$(date)

## Test Results Summary

| Test Type | P99 Latency | Threshold | Status |
|-----------|-------------|-----------|--------|
| **Baseline Pods** | \${BASELINE_P99}ms | < 30,000ms | \$(if (( \$(echo "\$BASELINE_P99 < 30000" | bc -l) )); then echo "PASS"; else echo "FAIL"; fi) |
| **Tuned Pods** | \${TUNED_P99}ms | < 15,000ms | \$(if (( \$(echo "\$TUNED_P99 < 15000" | bc -l) )); then echo "PASS"; else echo "FAIL"; fi) |
| **VMI Startup** | \${VMI_P99}ms | < 45,000ms | \$(if (( \$(echo "\$VMI_P99 < 45000" | bc -l) )); then echo "PASS"; else echo "FAIL"; fi) |
| **Network Policy** | \${NETPOL_P99}ms | < 5,000ms | \$(if (( \$(echo "\$NETPOL_P99 < 5000" | bc -l) )); then echo "PASS"; else echo "FAIL"; fi) |

## Performance Improvements

- **Pod Performance**: \$(if [[ "\$BASELINE_P99" != "0" && "\$TUNED_P99" != "0" ]]; then echo "scale=1; (\$BASELINE_P99 - \$TUNED_P99) / \$BASELINE_P99 * 100" | bc -l | sed 's/$/% improvement/'; else echo "TBD"; fi)
- **VM vs Container**: \$(if [[ "\$VMI_P99" != "0" && "\$TUNED_P99" != "0" ]]; then echo "scale=1; \$VMI_P99 / \$TUNED_P99" | bc -l | sed 's/$/x slower (expected)/'; else echo "TBD"; fi)

## Validation Status

\$(if (( \$(echo "\$BASELINE_P99 < 30000 && \$TUNED_P99 < 15000 && \$VMI_P99 < 45000 && \$NETPOL_P99 < 5000" | bc -l) )); then echo "**ALL TESTS PASSED** - Low-latency optimizations are working correctly!"; else echo "**SOME TESTS FAILED** - Review individual test results above"; fi)

## Next Steps

1. Monitor performance trends over time
2. Set up alerting for performance regressions
3. Implement continuous performance testing
4. Fine-tune thresholds based on workload requirements
EOB

log_success "Comprehensive validation completed!"
echo ""
echo "Validation Report:"
echo "=================="
cat comprehensive-validation-report-\$(date +%Y%m%d).md
EOF

chmod +x comprehensive-validation.sh
```
- Run the comprehensive validation suite:

```bash
cd ~/kube-burner-configs

# Execute the comprehensive validation
echo "Running comprehensive performance validation..."
./comprehensive-validation.sh

# The script will:
# 1. Run all test types (baseline, tuned, VMI, network policy)
# 2. Validate against performance thresholds
# 3. Calculate performance improvements
# 4. Generate a comprehensive report
```
Continuous Performance Testing
- Set up automated performance testing with a CronJob:

```bash
cat << EOF | oc apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: performance-validation
  namespace: default
spec:
  schedule: "0 2 * * *"  # Run daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: performance-test
            image: quay.io/cloud-bulldozer/kube-burner:latest
            command: ["/bin/bash"]
            args:
            - -c
            - |
              cd /workspace
              git clone https://github.com/tosin2013/low-latency-performance-workshop.git
              cd low-latency-performance-workshop/kube-burner-configs
              ./comprehensive-validation.sh
              # Send results to monitoring system
              curl -X POST http://webhook-receiver:8080/performance-results \
                -H "Content-Type: application/json" \
                -d @comprehensive-validation-report-\$(date +%Y%m%d).md
            volumeMounts:
            - name: workspace
              mountPath: /workspace
          volumes:
          - name: workspace
            emptyDir: {}
          restartPolicy: OnFailure
EOF
```
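To confirm the CronJob landed and to exercise it without waiting for the 2 AM schedule, you can trigger a one-off Job from it (the manual job name below is arbitrary):

```bash
# Confirm the CronJob and its schedule
oc get cronjob performance-validation -n default

# Trigger an immediate run and follow its logs
oc create job --from=cronjob/performance-validation performance-validation-manual -n default
oc logs -f job/performance-validation-manual -n default
```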
Performance Regression Detection
- Create a performance regression detection script:

```bash
cat << EOF > detect-regressions.sh
#!/bin/bash
# Performance Regression Detection
# Compares current performance against historical baselines

set -euo pipefail

BASELINE_FILE="performance-baseline.json"
CURRENT_RESULTS="comprehensive-validation-report-\$(date +%Y%m%d).md"

# Extract current metrics
CURRENT_POD_P99=\$(grep "Baseline Pods" \$CURRENT_RESULTS | grep -o '[0-9]*ms' | head -1 | sed 's/ms//')
CURRENT_VMI_P99=\$(grep "VMI Startup" \$CURRENT_RESULTS | grep -o '[0-9]*ms' | head -1 | sed 's/ms//')

# Load baseline (create it if it doesn't exist)
if [[ ! -f \$BASELINE_FILE ]]; then
    echo "Creating initial baseline..."
    cat > \$BASELINE_FILE << EOB
{
  "pod_p99": \$CURRENT_POD_P99,
  "vmi_p99": \$CURRENT_VMI_P99,
  "timestamp": "\$(date -Iseconds)"
}
EOB
    echo "Baseline created with current values"
    exit 0
fi

# Compare against baseline
BASELINE_POD_P99=\$(jq -r '.pod_p99' \$BASELINE_FILE)
BASELINE_VMI_P99=\$(jq -r '.vmi_p99' \$BASELINE_FILE)

# Calculate regression percentage (>10% slower = regression)
POD_REGRESSION=\$(echo "scale=2; (\$CURRENT_POD_P99 - \$BASELINE_POD_P99) / \$BASELINE_POD_P99 * 100" | bc -l)
VMI_REGRESSION=\$(echo "scale=2; (\$CURRENT_VMI_P99 - \$BASELINE_VMI_P99) / \$BASELINE_VMI_P99 * 100" | bc -l)

echo "Performance Regression Analysis"
echo "==============================="
echo "Pod P99: \${CURRENT_POD_P99}ms (baseline: \${BASELINE_POD_P99}ms) - \${POD_REGRESSION}% change"
echo "VMI P99: \${CURRENT_VMI_P99}ms (baseline: \${BASELINE_VMI_P99}ms) - \${VMI_REGRESSION}% change"

# Alert on significant regressions
if (( \$(echo "\$POD_REGRESSION > 10" | bc -l) )) || (( \$(echo "\$VMI_REGRESSION > 10" | bc -l) )); then
    echo "PERFORMANCE REGRESSION DETECTED!"
    echo "Consider investigating recent changes or system issues"
    exit 1
else
    echo "No significant performance regression detected"
fi
EOF

chmod +x detect-regressions.sh
```
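A typical workflow is to run the validation suite first, then the regression check; the non-zero exit code on regression makes the script easy to chain in cron or CI:

```bash
# First run seeds performance-baseline.json and exits 0
./comprehensive-validation.sh
./detect-regressions.sh

# Subsequent runs exit 1 when pod or VMI P99 regresses by more than 10%
./detect-regressions.sh || echo "Regression detected - investigate before promoting changes"
```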
Best Practices for Production Monitoring
Monitoring Strategy Best Practices
- Establish Clear Baselines:
  - Document performance baselines for all workload types
  - Update baselines after infrastructure changes
  - Use statistical methods (P50, P95, P99), not just averages (see the recording-rule sketch after this list)
- Implement Layered Monitoring:
  - Infrastructure: node resources, network, storage
  - Platform: Kubernetes/OpenShift metrics
  - Application: workload-specific performance metrics
  - Business: end-user experience metrics
- Set Meaningful Thresholds:
  - Base thresholds on business requirements, not arbitrary values
  - Use different thresholds for different workload types
  - Implement both warning and critical alert levels
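One way to make percentile-based baselines easy to graph and compare over time is to precompute them as Prometheus recording rules. The sketch below reuses the pod startup histogram already used in this module; the workshop:* rule names are illustrative, not a required convention:

```bash
cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-startup-percentiles
  namespace: openshift-monitoring
  labels:
    prometheus: kube-prometheus
    role: recording-rules
spec:
  groups:
  - name: pod-startup-percentiles
    rules:
    # Continuously evaluated P50/P95/P99 pod startup latency
    - record: workshop:pod_startup_duration_seconds:p50
      expr: histogram_quantile(0.50, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))
    - record: workshop:pod_startup_duration_seconds:p95
      expr: histogram_quantile(0.95, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))
    - record: workshop:pod_startup_duration_seconds:p99
      expr: histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le))
EOF
```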
Alerting Best Practices
| Alert Type | Purpose | Example Threshold |
|---|---|---|
| Performance Regression | Detect degradation over time | >20% increase in P99 latency |
| Threshold Breach | Immediate performance issues | Pod startup >30s, VMI startup >45s |
| Resource Exhaustion | Prevent capacity issues | HugePages >90% utilized |
| Configuration Drift | Ensure optimization integrity | CPU isolation breach detected |
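The warning versus critical split can be expressed as two alerts over the same expression with different thresholds, severities, and hold times. This follows the PrometheusRule pattern used earlier in this module; the 15s/30s values are examples to adapt to your own requirements:

```bash
cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-startup-tiered-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: pod-startup-tiered
    rules:
    # Warning tier: early signal that startup latency is drifting upward
    - alert: PodStartupLatencyWarning
      expr: histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le)) > 15
      for: 5m
      labels:
        severity: warning
    # Critical tier: clear breach that needs immediate attention
    - alert: PodStartupLatencyCritical
      expr: histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[5m])) by (le)) > 30
      for: 2m
      labels:
        severity: critical
EOF
```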
Continuous Improvement Process
- Weekly Performance Reviews:
  - Analyze performance trends
  - Identify optimization opportunities
  - Review alert effectiveness
- Monthly Capacity Planning:
  - Assess resource utilization trends
  - Plan for growth and scaling
  - Update performance baselines
- Quarterly Optimization Cycles:
  - Test new performance features
  - Validate optimization effectiveness
  - Update monitoring and alerting
Troubleshooting Performance Issues
Systematic Troubleshooting Approach
- Identify the Scope:
  - Is it affecting all workloads or specific types?
  - When did the issue start?
  - What changed recently?
- Gather Data:
  - Check monitoring dashboards
  - Review recent alerts
  - Examine system logs
- Isolate the Cause:
  - Test with minimal workloads
  - Compare against known-good baselines
  - Use performance profiling tools
- Implement and Validate the Fix:
  - Apply targeted fixes
  - Validate with performance tests
  - Monitor for sustained improvement
Common Performance Issues and Solutions
| Issue | Symptoms | Solution |
|---|---|---|
| CPU Contention | High latency, CPU isolation breaches | Review CPU allocation, check for noisy neighbors |
| Memory Pressure | Increased swap usage, OOM kills | Verify HugePages configuration, check memory limits |
| Network Bottlenecks | High network latency, packet drops | Review network policies, check SR-IOV configuration |
| Storage Performance | High I/O wait times, slow disk operations | Optimize storage classes, check disk utilization |
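A few generic commands help narrow down which row of the table you are dealing with; node names are placeholders and everything here uses standard oc tooling:

```bash
# CPU contention: which nodes and pods are consuming the most CPU?
oc adm top nodes
oc adm top pods -A --sort-by=cpu | head -20

# Memory pressure / HugePages: compare allocated resources and HugePages counters
oc describe node <node-name> | grep -A 10 "Allocated resources"
oc debug node/<node-name> -- chroot /host grep -i huge /proc/meminfo

# Network: recent NetworkPolicy objects and cluster events
oc get networkpolicy -A
oc get events -A --sort-by=.lastTimestamp | tail -20

# Storage: unbound claims and available storage classes
oc get pvc -A | grep -v Bound
oc get storageclass
```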
Performance Testing in CI/CD
- Integrate performance tests in pipelines:
  - Run lightweight performance tests on every deployment
  - Execute comprehensive tests on release candidates
  - Block deployments that fail performance thresholds (see the gate-script sketch after this list)
- Automated performance reporting:
  - Generate performance reports for each build
  - Track performance trends over time
  - Alert on performance regressions
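A lightweight way to block a deployment on a performance regression is a gate script that any CI system can call as a pipeline step. This sketch reuses the kube-burner output format from the validation suite above; the config file name and the budget are assumptions to adapt:

```bash
#!/bin/bash
# ci-performance-gate.sh - fail the pipeline if pod P99 latency exceeds a budget
set -euo pipefail

THRESHOLD_MS="${PERF_BUDGET_MS:-15000}"            # performance budget in ms, override per pipeline
CONFIG="${PERF_TEST_CONFIG:-baseline-config.yml}"  # lightweight test config

# Run the test and pull the Ready P99 from the kube-burner output
kube-burner init -c "$CONFIG" --log-level=warn
P99=$(find collected-metrics/ -name "*podLatencyQuantilesMeasurement*" -type f | head -1 \
      | xargs cat | jq -r '.[] | select(.quantileName == "Ready") | .P99')

echo "Measured pod P99: ${P99}ms (budget: ${THRESHOLD_MS}ms)"

if (( $(echo "$P99 > $THRESHOLD_MS" | bc -l) )); then
    echo "Performance gate FAILED - blocking deployment"
    exit 1
fi
echo "Performance gate passed"
```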
Documentation and Knowledge Management
- Maintain Performance Runbooks (a minimal skeleton is sketched after this list):
  - Document common performance issues and solutions
  - Include step-by-step troubleshooting guides
  - Keep contact information for escalation
- Performance Baseline Documentation:
  - Document expected performance for each workload type
  - Include configuration details and optimization settings
  - Update after significant changes
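A minimal runbook skeleton, written to a file so it can live next to the other workshop artifacts; the sections and filename are suggestions rather than a required format:

```bash
cat > runbook-pod-startup-latency.md << 'EOF'
# Runbook: Pod Startup Latency High

## Symptoms
- PodStartupLatencyHigh alert firing (P99 above the 30s threshold)

## Quick checks
1. oc adm top nodes                              # node resource pressure?
2. oc get events -A --sort-by=.lastTimestamp     # scheduling or image pull failures?
3. Compare against the documented Module 3 baseline

## Escalation
- Owner: <team / contact>
- Related dashboard: Low-Latency Performance Workshop
EOF
```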
Hands-on Exercise: Complete Validation
Exercise Objectives
- Execute comprehensive performance validation across all workshop modules
- Set up monitoring and alerting for production readiness
- Implement continuous performance testing workflows
- Run the comprehensive workshop validator:

```bash
# Validate entire workshop completion
echo "Running Comprehensive Workshop Validation..."
python3 ~/low-latency-performance-workshop/scripts/module06-comprehensive-validator.py

# Generate detailed validation report
python3 ~/low-latency-performance-workshop/scripts/module06-comprehensive-validator.py --report

# Specify a custom metrics directory if needed
python3 ~/low-latency-performance-workshop/scripts/module06-comprehensive-validator.py \
  --metrics-dir ~/kube-burner-configs \
  --report
```
This validator checks:
- Module 3: Baseline performance metrics collected
- Module 4: Performance tuning applied (optional)
- Module 5: VMI and network testing completed
- Cluster health and configuration
- Overall performance improvements
- Generate the workshop summary:

```bash
# Generate comprehensive workshop summary
echo "Generating Workshop Summary..."
python3 ~/low-latency-performance-workshop/scripts/module06-workshop-summary.py

# Save the summary to a file
python3 ~/low-latency-performance-workshop/scripts/module06-workshop-summary.py --no-color \
  > workshop-summary-$(date +%Y%m%d).txt
```
The summary includes:
- Module overview and learning journey
- Performance improvement statistics
- Key technologies and techniques
- Best practices learned
- Production readiness checklist
- Next steps and resources
- Detect performance regressions:

```bash
# Save current metrics as the baseline (first time)
echo "Saving Performance Baseline..."
python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py \
  --save-baseline

# Later, detect regressions against the baseline
echo "Detecting Performance Regressions..."
python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py

# Use a custom threshold (default: 10%)
python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py \
  --threshold 15

# Specify a custom baseline file
python3 ~/low-latency-performance-workshop/scripts/module06-performance-regression-detector.py \
  --baseline /path/to/baseline.json
```
The regression detector:
- Compares current vs. historical performance
- Alerts on regressions exceeding the threshold
- Analyzes P50, P95, and P99 percentiles
- Saves and loads performance baselines
- Provides actionable recommendations

Recommended Workflow: these scripts work together to provide complete workshop validation and ongoing performance monitoring.
- Execute the comprehensive validation (optional bash scripts):

```bash
cd ~/kube-burner-configs

# Run the complete validation suite
echo "Executing comprehensive performance validation..."
./comprehensive-validation.sh

# Check for performance regressions
echo "Checking for performance regressions..."
./detect-regressions.sh

# View the final validation report
echo "Final Validation Report:"
echo "========================"
cat comprehensive-validation-report-$(date +%Y%m%d).md
```
- Generate a comprehensive Python-based analysis:

```bash
cd ~/kube-burner-configs

# Load monitoring configuration from the earlier assessment
source /tmp/monitoring-config 2>/dev/null || {
    echo "Monitoring config not found, detecting current state..."
    CLUSTER_TUNING="UNKNOWN"
    VMI_TESTING="UNKNOWN"
    BASELINE_TESTING="UNKNOWN"
}

echo "Generating Final Workshop Performance Analysis..."
echo "Configuration: Tuning=$CLUSTER_TUNING, VMI=$VMI_TESTING, Baseline=$BASELINE_TESTING"
echo ""

# Generate comprehensive Module 6 analysis
echo "Module 6 Comprehensive Analysis (All Workshop Data)..."
python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 6

echo ""
echo "Generating Final Workshop Report..."
REPORT_FILE="final_workshop_analysis_$(date +%Y%m%d-%H%M).md"

# Generate a comprehensive report with all available data
if [ -d "collected-metrics" ] && [ -d "collected-metrics-tuned" ] && [ -d "collected-metrics-vmi" ]; then
    echo "Full workshop report: Baseline + Tuned + VMI"
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
        --baseline collected-metrics \
        --tuned collected-metrics-tuned \
        --vmi collected-metrics-vmi \
        --report "$REPORT_FILE"
elif [ -d "collected-metrics" ] && [ -d "collected-metrics-vmi" ]; then
    echo "Baseline + VMI report"
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
        --baseline collected-metrics \
        --vmi collected-metrics-vmi \
        --report "$REPORT_FILE"
elif [ -d "collected-metrics-vmi" ]; then
    echo "VMI-only report"
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
        --single collected-metrics-vmi \
        --report "$REPORT_FILE"
else
    echo "No performance metrics found for comprehensive analysis"
    echo "Please ensure you've completed the performance tests from previous modules"
fi

echo ""
echo "Performance Summary:"
python3 ~/low-latency-performance-workshop/scripts/module04-performance-summary.py

echo ""
echo "Generated Analysis Reports:"
ls -la *analysis*.md 2>/dev/null || echo "No analysis reports found"
```
- Verify the monitoring and alerting setup:

```bash
# Check that the alerting rules are active
oc get prometheusrule low-latency-performance-alerts -n openshift-monitoring

# Verify the monitoring stack is healthy
oc get pods -n openshift-monitoring | grep -E "(prometheus|grafana|alertmanager)"

# Test alert firing (optional - creates temporary load)
echo "Testing alert system (creates temporary load)..."
oc run test-load --image=busybox --restart=Never -- sleep 300
```
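To see whether any of the rules are currently firing (not just defined), you can query the Alertmanager API through its route; this assumes your user has permission to view cluster monitoring data:

```bash
# Resolve the Alertmanager route and grab an auth token
ALERTMANAGER=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)

# List active alerts; an empty list means nothing is firing
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$ALERTMANAGER/api/v2/alerts" | jq '.[].labels.alertname'
```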
Module Summary
This module completed the low-latency performance workshop by implementing comprehensive monitoring and validation:
- Set up performance monitoring with Prometheus, Grafana, and custom dashboards
- Created alerting rules for performance regressions and threshold violations
- Implemented comprehensive validation testing all optimizations from Modules 3-5
- Established continuous testing with automated performance validation
- Built regression detection to maintain performance over time
- Documented best practices for production monitoring and troubleshooting
Workshop Performance Journey - Final Results
| Module | Focus | Target Metrics | Achievement Status |
|---|---|---|---|
| Module 3 | Baseline Testing | Pod startup: establish baseline | 5.4s average baseline established |
| Module 4 | Performance Tuning (Optional) | Pod startup: <3s P99 (50-70% improvement if applied) | CPU isolation and HugePages (if completed) |
| Module 5 | Virtualization | VMI startup: <45s P99, NetPol: <5s P99 | Low-latency VMs and network testing |
| Module 6 | Monitoring & Validation | End-to-end validation and monitoring | Complete monitoring and validation suite |
Key Performance Achievements
Based on the comprehensive validation, the workshop demonstrates:
- Container Performance: significant improvement in pod startup latency through CPU isolation and HugePages
- VM Performance: optimized virtual machine startup times with dedicated resources
- Network Performance: validated network policy enforcement latency
- Monitoring Maturity: production-ready monitoring, alerting, and continuous validation
Production Readiness Checklist
- Performance Baselines: established and documented
- Monitoring Stack: Prometheus, Grafana, and Alertmanager configured
- Alerting Rules: performance regression and threshold alerts active
- Continuous Testing: automated validation and regression detection
- Documentation: runbooks and troubleshooting guides available
- Best Practices: implemented monitoring and optimization strategies
Knowledge Check
- What are the key components of a comprehensive performance monitoring strategy?
- How do you detect performance regressions in a production environment?
- What thresholds should be set for pod startup, VMI startup, and network policy latency?
- How would you troubleshoot a sudden increase in container startup latency?
Advanced Topics for Further Learning
- Multi-cluster Performance Monitoring: scaling monitoring across multiple OpenShift clusters
- Advanced Performance Profiling: using tools like perf, flame graphs, and eBPF for deep analysis
- Predictive Performance Analytics: using machine learning for capacity planning and anomaly detection
- Custom Performance Metrics: developing application-specific performance measurements
Workshop Conclusion
Congratulations! You have successfully completed the Low-Latency Performance Workshop for OpenShift 4.19.
You now have the knowledge and tools to:
- Establish performance baselines using modern testing tools
- Apply low-latency optimizations with Performance Profiles and CPU isolation
- Optimize virtual machines for high-performance workloads
- Monitor and validate performance improvements in production
- Maintain performance through continuous testing and alerting
The skills and techniques learned in this workshop are directly applicable to production environments requiring deterministic, low-latency performance.
Next Steps
Continue your low-latency performance journey by:
- Applying these techniques to your production workloads
- Customizing monitoring for your specific application requirements
- Exploring advanced features like SR-IOV networking and real-time kernels
- Contributing back to the community with your performance optimization experiences
Thank you for participating in the Low-Latency Performance Workshop!