Module 5: Low-Latency Virtualization
Module Overview
This module focuses on optimizing virtual machines for low-latency performance using OpenShift Virtualization. You’ll learn how to configure VMs with dedicated CPUs, HugePages, and SR-IOV networking, then validate performance improvements using advanced kube-burner measurements.
Prerequisites
- Completed Module 3 (baseline performance metrics collected)
- OpenShift Virtualization operator installed (from Module 2)
- Single Node OpenShift (SNO) or multi-node cluster
- Baseline performance metrics from Module 3
- Optional: Completed Module 4 (Performance Profiles for enhanced VM performance)
Key Learning Objectives
- Configure OpenShift Virtualization for low-latency workloads
- Optimize Virtual Machine Instances (VMIs) with dedicated resources
- Implement SR-IOV networking for high-performance VM networking
- Measure VMI startup and network latency using kube-burner
- Validate network policy performance in virtualized environments
- Compare VM performance against containerized workloads
OpenShift Virtualization Overview
OpenShift Virtualization enables running virtual machines alongside containers on the same OpenShift cluster, providing:
- Unified Management: VMs and containers managed through the same platform
- Performance Optimization: CPU pinning, HugePages, and NUMA alignment
- Advanced Networking: SR-IOV, Multus, and high-performance networking
- Live Migration: Zero-downtime VM migration between nodes
- Security: VM isolation with OpenShift security policies
Architecture Components
Component | Purpose | Low-Latency Features |
---|---|---|
KubeVirt | VM management engine | CPU pinning, dedicated resources |
Containerized Data Importer (CDI) | VM disk image management | Optimized storage provisioning |
Multus CNI | Multiple network interfaces | SR-IOV and high-performance networking |
Node Feature Discovery | Hardware capability detection | NUMA topology awareness |
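The verification steps below focus on the KubeVirt operator itself; if you also want to spot-check the supporting components from the table, a minimal sketch looks like this (the namespaces and label selectors are typical defaults and may vary slightly between OpenShift versions):

```bash
# Spot-check the supporting components from the table above
# (namespace and label selectors are assumptions, not guaranteed on every cluster)
oc get pods -n openshift-cnv -l kubevirt.io=virt-controller       # KubeVirt control plane
oc get pods -n openshift-cnv -l app=containerized-data-importer   # CDI
oc get pods -n openshift-multus -l app=multus                     # Multus CNI
oc get pods -n openshift-nfd 2>/dev/null || echo "Node Feature Discovery not installed (optional)"
```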
Verifying OpenShift Virtualization Installation
OpenShift Virtualization was deployed in Module 2 via GitOps. Let’s verify it’s ready for low-latency workloads.
- Check if OpenShift Virtualization is installed and ready:

```bash
# Check the HyperConverged operator status
oc get hyperconverged -n openshift-cnv

# Verify virtualization components are running
oc get pods -n openshift-cnv --field-selector=status.phase=Running | head -10

# Check if KVM virtualization is available on the cluster
oc get nodes -o jsonpath='{.items[*].status.allocatable.devices\.kubevirt\.io/kvm}' | grep -q "1k" && echo "✅ KVM available on cluster nodes" || echo "❌ KVM not available"

# Verify the operator CSV status
oc get csv -n openshift-cnv | grep kubevirt-hyperconverged

# Check available VM templates
echo "Available Fedora VM templates:"
oc get templates -n openshift --field-selector metadata.name=fedora-server-small
```
- Check the cluster environment and available resources:

```bash
# Check cluster node configuration
echo "--- Cluster Node Information ---"
oc get nodes -o wide

# Check available CPU resources
echo ""
echo "--- CPU Resources ---"
oc debug node/$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- chroot /host nproc

# Check if Fedora VM DataSource is available
echo ""
echo "--- Available VM DataSources ---"
oc get datasource -n openshift-virtualization-os-images | grep fedora
```
- Check the current performance profile status (it may not exist yet):
# Check if performance profile exists (from Module 4) echo "--- Performance Profile Status ---" PERF_PROFILES=$(oc get performanceprofile --no-headers 2>/dev/null | wc -l) if [ "$PERF_PROFILES" -gt 0 ]; then echo "✅ Performance profile found:" oc get performanceprofile -o custom-columns=NAME:.metadata.name,ISOLATED:.spec.cpu.isolated,RESERVED:.spec.cpu.reserved # Get current HugePages from Performance Profile echo "" echo "--- Current HugePages Configuration ---" PROFILE_NAME=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}') HUGEPAGES_COUNT=$(oc get performanceprofile "$PROFILE_NAME" -o jsonpath='{.spec.hugepages.pages[0].count}' 2>/dev/null || echo "0") echo "Performance Profile HugePages: ${HUGEPAGES_COUNT}GB" if [ "$HUGEPAGES_COUNT" -lt 8 ]; then echo "" echo "⚠️ Current HugePages (${HUGEPAGES_COUNT}GB) may be insufficient for Module 5" echo " Module 4 allocates minimal HugePages (1GB) for demonstration" echo " Module 5 needs more HugePages to run multiple VMs" echo "" echo "💡 Recommended HugePages for Module 5:" echo " • SNO (64GB+ RAM): 8-16GB HugePages" echo " • Multi-Node (64GB+ RAM): 16-24GB HugePages" echo "" echo " The next step will update HugePages automatically!" else echo "✅ HugePages sufficient for Module 5 (${HUGEPAGES_COUNT}GB)" fi else echo "⚠️ No performance profile found" echo " This is expected if Module 4 hasn't been completed yet" echo " VMI tests will use default cluster resources" echo "" echo "💡 Want to see enhanced VM performance?" echo " You can go back to Module 4 to configure performance profiles" echo " This will enable:" echo " • CPU isolation and dedicated CPU placement for VMs" echo " • HugePages for reduced memory latency" echo " • NUMA alignment for optimal performance" echo " • Significant improvement in VMI startup times" echo "" echo " After completing Module 4, return here to see the performance difference!" fi
Understanding HugePages Allocation:
- Module 4: Allocates 1GB HugePages (minimal, for demonstration)
- Module 5: Needs 8-16GB HugePages (for running multiple VMs)
If you see "HugePages may be insufficient", don’t worry! The next step will automatically update HugePages to the optimal amount for your cluster.
Why the difference?
- Module 4 focuses on demonstrating performance tuning concepts
- Module 5 focuses on running actual VMs with realistic workloads
- The scripts automatically handle the transition between modules

- Update the HugePages allocation for VMI testing (if a Performance Profile exists):

```bash
# Update HugePages to support multiple VMs
bash ~/low-latency-performance-workshop/scripts/module05-update-hugepages.sh
```
What This Script Does:
- Detects the current HugePages allocation from Module 4
- Calculates optimal HugePages based on total memory and cluster type
- Accounts for VMI overhead: each VMI needs ~3GB (2GB guest + 1GB virt-launcher)
- Updates the Performance Profile if more HugePages are needed
- Triggers a node reboot if changes are required
Why Update HugePages?
Module 4 allocates minimal HugePages (1GB) for demonstration purposes. Module 5 needs more HugePages to run multiple VMs:
- 1GB HugePages: only 1 small VM possible
- 12GB HugePages: 4 VMs with 2GB memory each
- 24GB HugePages: 8 VMs with 2GB memory each (Module 5 default test)
- 32GB HugePages: 10+ VMs with 2GB memory each
Updated Allocation Strategy:
- SNO (125GB+ RAM): 24GB HugePages (~8 VMIs)
- SNO (64-128GB RAM): 24GB HugePages (~8 VMIs)
- SNO (32-64GB RAM): 12GB HugePages (~4 VMIs)
- Multi-Node (128GB+ worker): 48GB HugePages (~16 VMIs)
- Multi-Node (64-128GB worker): 32GB HugePages (~10 VMIs)
The script automatically calculates the optimal allocation for your cluster!
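If you want to see which allocation tier applies to your node before running the script, a minimal check of total node memory looks like this (it assumes node memory is reported in Ki, which is typical for node status):

```bash
# Rough check of node memory to pick a HugePages tier from the list above
NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
MEM_KI=$(oc get node "$NODE" -o jsonpath='{.status.capacity.memory}' | sed 's/Ki//')
echo "Node $NODE has ~$(( MEM_KI / 1024 / 1024 ))GB of RAM"
```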
If Node Reboots:
This is expected and required for HugePages changes. Wait 10-15 minutes for the node to come back online, then continue with the next step.
If No Performance Profile:
The script will inform you that no Performance Profile exists and suggest completing Module 4 first for enhanced VM performance.
- Validate resources before testing (important learning step!):

```bash
# Validate that your cluster has sufficient resources for VMI testing
bash ~/low-latency-performance-workshop/scripts/module05-validate-vmi-resources.sh
```
Why This Validation Step is Critical:
This is a key learning opportunity that demonstrates real-world capacity planning for virtualized workloads!
What You’ll Learn:
- Resource Calculation: how to calculate VMI memory requirements, including overhead
- Capacity Planning: how many VMs your cluster can support
- Pre-Flight Validation: why validating resources before deployment prevents failures
- Troubleshooting: how to identify and fix resource constraints
What the Script Validates:
- HugePages Availability: checks whether sufficient HugePages are allocated
- VMI Capacity: calculates the maximum concurrent VMIs based on available resources
- Test Scale Validation: verifies that the default test (10 VMIs) will succeed
- CPU Isolation: validates sufficient isolated CPUs for dedicated placement
- Recommendations: provides specific guidance if resources are insufficient
Understanding VMI Memory Requirements:
Each VMI requires more memory than just the guest allocation:
```
VMI guest memory:        2GB   (configured in the VMI spec)
virt-launcher overhead:  1GB   (KubeVirt management pod)
────────────────────────────
Total per VMI:           3GB
```
Example Calculation:
```
Default test:  10 VMIs × 3GB = 30GB required
Your cluster:  16GB HugePages available
Result:        ❌ Insufficient — increase HugePages (24GB at minimum, 30GB for the full test) or reduce the test scale
```
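A quick way to run this arithmetic against your own cluster is sketched below; it assumes 1Gi HugePages and the ~3GB-per-VMI figure used throughout this module:

```bash
# Minimal capacity check: allocatable HugePages vs. the planned test scale
NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
HP_GB=$(oc get node "$NODE" -o jsonpath='{.status.allocatable.hugepages-1Gi}' | sed 's/Gi//')
HP_GB=${HP_GB:-0}
PER_VMI_GB=3        # 2GB guest + ~1GB virt-launcher overhead
PLANNED_VMIS=10     # default Module 5 test scale

echo "Allocatable HugePages: ${HP_GB}GB"
echo "Max concurrent VMIs:   $(( HP_GB / PER_VMI_GB ))"
echo "Planned test needs:    $(( PLANNED_VMIS * PER_VMI_GB ))GB"
if [ "$HP_GB" -ge $(( PLANNED_VMIS * PER_VMI_GB )) ]; then
  echo "✅ Enough HugePages for the default test"
else
  echo "❌ Increase HugePages or reduce the test scale"
fi
```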
If Validation Fails:
The script will provide specific recommendations:
- Option 1: Increase the HugePages allocation (recommended)
- Option 2: Reduce the test scale to match available resources
- Option 3: Run without HugePages (reduced performance)
Real-World Application:
This validation process mirrors production capacity planning:
- ✅ Always validate resources before deploying VMs
- ✅ Account for overhead (virt-launcher, QEMU, etc.)
- ✅ Plan for headroom (don't use 100% of resources)
- ✅ Monitor and adjust based on actual usage

This is exactly what you'd do in production before deploying VMs!
VM Optimization for Low-Latency
Understanding VM Performance Characteristics
Virtual machines have different performance characteristics compared to containers:
- Boot Time: VMs require OS initialization (typically 30-60 seconds)
- Resource Overhead: the hypervisor and guest OS consume additional resources
- I/O Path: the extra virtualization layer affects storage and network performance
- Memory Management: guest OS memory management plus hypervisor overhead
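You can observe the boot-time difference directly on any running VMI. The sketch below computes the time from object creation to the Running phase using the VMI's phase transition timestamps (the VMI name is a placeholder; substitute one from your cluster):

```bash
# Rough startup-time check for a single VMI (VMI_NAME is a placeholder)
VMI_NAME=fedora-perf-example
CREATED=$(oc get vmi "$VMI_NAME" -o jsonpath='{.metadata.creationTimestamp}')
RUNNING=$(oc get vmi "$VMI_NAME" -o jsonpath='{.status.phaseTransitionTimestamps[?(@.phase=="Running")].phaseTransitionTimestamp}')
echo "Created: $CREATED"
echo "Running: $RUNNING"
echo "Startup time: $(( $(date -d "$RUNNING" +%s) - $(date -d "$CREATED" +%s) ))s"
```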
Low-Latency VM Configuration
CPU Optimization
Feature | Purpose | Configuration |
---|---|---|
CPU Pinning | Dedicated CPU cores for the VM | `dedicatedCpuPlacement: true` |
NUMA Alignment | Memory and CPU on the same NUMA node | Automatic with performance profile |
CPU Model | Host CPU features exposed to the VM | `model: host-passthrough` (or `host-model` for compatibility) |
CPU Topology | Optimal vCPU to pCPU mapping | Match the host topology |
Memory Optimization
Feature | Purpose | Configuration |
---|---|---|
HugePages | Reduced TLB misses | `hugepages: pageSize: 1Gi` |
Memory Backing | Shared memory optimization | |
NUMA Policy | Memory locality | |
Memory Overcommit | Disabled for predictable performance | `requests` equal to `limits` (guaranteed memory) |
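As an illustration of how the memory settings above map to a VMI spec, the fragment below shows the relevant fields (values are examples only; this file is not used later in the module):

```bash
# Illustrative fragment only - how the memory features above appear in a VMI spec
cat << 'EOF' > memory-tuning-example.yml
spec:
  domain:
    memory:
      guest: 2Gi
      hugepages:
        pageSize: 1Gi        # must match the cluster's HugePages size
    resources:
      requests:
        memory: 2Gi          # requests == limits -> no overcommit, guaranteed memory
      limits:
        memory: 2Gi
EOF
```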
Creating VMs for Performance Testing
Instead of creating a custom template, we’ll use the existing Fedora template and customize it for our performance testing needs.
- Create a performance-optimized Fedora VM for testing:
# Create a namespace for our VM testing oc new-project vmi-performance-test || oc project vmi-performance-test # Clean up any existing VMs to avoid PVC conflicts echo "🧹 Cleaning up any existing performance test VMs..." oc delete vm --selector=app=vmi-performance-test --ignore-not-found=true oc delete dv --selector=app=vmi-performance-test --ignore-not-found=true # Wait a moment for cleanup to complete sleep 5 # Create a Fedora VM using the existing template with performance optimizations # Generate unique name to avoid PVC conflicts VM_NAME="fedora-perf-$(date +%s)" echo "Creating VM with unique name: $VM_NAME" cat << EOF | oc apply -f - apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: name: $VM_NAME labels: app: vmi-performance-test vm.kubevirt.io/template: fedora-server-small spec: dataVolumeTemplates: - apiVersion: cdi.kubevirt.io/v1beta1 kind: DataVolume metadata: name: $VM_NAME spec: sourceRef: kind: DataSource name: fedora namespace: openshift-virtualization-os-images storage: resources: requests: storage: 30Gi runStrategy: Manual template: metadata: labels: kubevirt.io/domain: $VM_NAME kubevirt.io/size: small spec: domain: cpu: cores: 2 sockets: 1 threads: 1 # Enable performance features if performance profile exists dedicatedCpuPlacement: false # Will be enabled conditionally model: host-model # More compatible than host-passthrough memory: guest: 2Gi # HugePages will be enabled conditionally based on availability devices: disks: - disk: bus: virtio name: rootdisk - disk: bus: virtio name: cloudinitdisk interfaces: - masquerade: {} model: virtio name: default rng: {} features: smm: enabled: true firmware: bootloader: efi: {} networks: - name: default pod: {} terminationGracePeriodSeconds: 180 volumes: - dataVolume: name: $VM_NAME name: rootdisk - cloudInitNoCloud: userData: | #cloud-config user: fedora password: workshop123 chpasswd: { expire: False } packages: - qemu-guest-agent runcmd: - systemctl enable --now qemu-guest-agent - echo "VM ready for performance testing" > /tmp/vm-ready name: cloudinitdisk EOF echo "✅ Fedora VM '$VM_NAME' created for performance testing" # Verify the VM and DataVolume were created echo "" echo "📋 Verifying VM creation:" oc get vm $VM_NAME echo "" echo "📋 Verifying DataVolume creation:" oc get dv $VM_NAME # Check for any PVC binding issues echo "" echo "📋 Checking for PVC issues:" if oc get events -n vmi-performance-test | grep -i "bound incorrectly\|pvc.*conflict" >/dev/null 2>&1; then echo "⚠️ PVC binding issues detected. This may be due to duplicate VM names." echo " The cleanup step above should have resolved this." echo " If issues persist, check: oc get events -n vmi-performance-test" else echo "✅ No PVC binding issues detected" fi
Troubleshooting PVC Conflicts: If you encounter PVC binding errors like "Two claims are bound to the same volume, this one is bound incorrectly", this typically happens when a previous VM with the same name left an old DataVolume or PVC behind. The cleanup commands at the start of this step remove those stale resources, and using a unique, timestamped VM name (as done above) avoids the conflict. If the issue persists, review the namespace events with `oc get events -n vmi-performance-test`.
VMI Latency Testing with Kube-burner
Now let’s measure Virtual Machine Instance startup performance using kube-burner’s VMI latency measurement capabilities. We’ll adapt the test for our SNO environment.
Understanding VirtualMachine vs VirtualMachineInstance Architecture

This is a crucial concept for understanding OpenShift Virtualization performance testing.

What exists in our cluster:
- VirtualMachine objects (the management layer), such as the fedora-perf VM created above
- VirtualMachineInstance objects (the running VMs), created either by a VirtualMachine or directly by kube-burner

Two different approaches:
- VM-managed VMI: the VirtualMachine controller creates and owns the VMI (start/stop, restart policy, persistence)
- Direct VMI: kube-burner creates the VMI directly, with no parent VirtualMachine

Why kube-burner uses direct VMIs:
- ✅ Precise timing: measures pure hypervisor startup
- ✅ No controller overhead: eliminates VM management latency
- ✅ Consistent results: no management-layer variability
- ✅ Automated testing: perfect for ephemeral performance tests

Architecture relationship: a VM-managed VMI carries an owner reference pointing to its VirtualMachine, while a directly created VMI has none.

This architectural difference is why you see different objects in different namespaces!
- Verify the architectural difference yourself:

```bash
# Compare the two approaches in your cluster
echo "--- VirtualMachine Objects (Management Layer) ---"
oc get VirtualMachine -A
echo ""
echo "--- VirtualMachineInstance Objects (Running VMs) ---"
oc get VirtualMachineInstance -A
echo ""
echo "--- Owner Relationships ---"
echo "VM-managed VMI (has owner reference):"
oc get vmi fedora-perf-1759292486 -n vmi-performance-test -o jsonpath='{.metadata.ownerReferences[0].kind}' 2>/dev/null && echo " ← Managed by VirtualMachine" || echo "No owner reference"
echo ""
echo "Direct VMI (no owner reference):"
oc get vmi fedora-vmi-0-1 -n vmi-latency-test-0 -o jsonpath='{.metadata.ownerReferences}' 2>/dev/null
if [ $? -eq 0 ] && [ -n "$(oc get vmi fedora-vmi-0-1 -n vmi-latency-test-0 -o jsonpath='{.metadata.ownerReferences}' 2>/dev/null)" ]; then
  echo "Has owner reference"
else
  echo "No owner reference ← Created directly by kube-burner"
fi
```
- Create a VMI-specific kube-burner configuration adapted for SNO:

```bash
cd ~/kube-burner-configs

cat << EOF > vmi-latency-config.yml
global:
  measurements:
    - name: vmiLatency
      thresholds:
        - conditionType: VMIRunning
          metric: P99
          threshold: 90000ms   # Increased for SNO environment
        - conditionType: VMIScheduled
          metric: P99
          threshold: 60000ms   # Increased for SNO environment

metricsEndpoints:
  - indexer:
      type: local
      metricsDirectory: collected-metrics-vmi

jobs:
  - name: vmi-latency-test
    jobType: create
    jobIterations: 5           # Reduced for SNO environment
    namespace: vmi-latency-test
    namespacedIterations: true
    cleanup: false
    podWait: false
    waitWhenFinished: true
    verifyObjects: true
    errorOnVerify: false
    objects:
      - objectTemplate: fedora-vmi.yml
        replicas: 2            # Small scale for SNO
EOF
```
- Create the Fedora VMI template for testing:

```bash
# Create VMI template using containerDisk for faster, ephemeral testing
# This approach is ideal for performance testing as it doesn't require PVC provisioning
echo "Creating Fedora VMI template for kube-burner testing"

cat << EOF > fedora-vmi.yml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: fedora-vmi-{{.Iteration}}-{{.Replica}}
  labels:
    app: vmi-latency-test
    iteration: "{{.Iteration}}"
spec:
  # No nodeSelector for SNO - will schedule on the single node
  domain:
    cpu:
      cores: 1
      sockets: 1
      threads: 1
      # Performance features will be enabled conditionally
      # Using host-model instead of host-passthrough for better compatibility
      model: host-model
    memory:
      guest: 2Gi  # Minimum required for Fedora
      # HugePages will be added conditionally if available
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
        - name: cloudinitdisk
          disk:
            bus: virtio
      interfaces:
        - name: default
          masquerade: {}
          model: virtio
      rng: {}
    features:
      smm:
        enabled: true
    firmware:
      bootloader:
        efi: {}
  networks:
    - name: default
      pod: {}
  terminationGracePeriodSeconds: 180
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest
    - name: cloudinitdisk
      cloudInitNoCloud:
        userData: |
          #cloud-config
          user: fedora
          password: workshop123
          chpasswd: { expire: False }
          bootcmd:
            - "echo 'Fedora VMI started at' \$(date) > /tmp/vmi-start-time"
EOF
```
Why we use containerDisk instead of DataVolumes for performance testing
For kube-burner performance testing, we use containerDisk instead of DataVolumes because:
- Faster startup: no PVC provisioning or DataVolume import delays
- Simpler template: a single VMI object instead of VMI + DataVolume
- Ephemeral by design: perfect for performance testing where persistence isn't needed
- Consistent results: no storage backend variability affecting measurements
containerDisk approach:
```yaml
volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/containerdisks/fedora:latest
```
DataVolume approach (for production VMs):
```yaml
volumes:
  - name: rootdisk
    dataVolume:
      name: my-vm-disk
---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: my-vm-disk
spec:
  sourceRef:
    kind: DataSource
    name: fedora
    namespace: openshift-virtualization-os-images
```
For this performance testing module, containerDisk provides the most accurate VMI startup measurements!
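Optionally, you can pre-pull the Fedora containerDisk image on the node so the first VMI's measurement is not dominated by image download time (this is a sketch that assumes node debug access and that crictl is available on the host):

```bash
# Optional: warm the node's image cache before the kube-burner run
NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
oc debug node/"$NODE" -- chroot /host crictl pull quay.io/containerdisks/fedora:latest
```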
- Configure the VMI with optimal performance settings:

```bash
# Generate optimized VMI configuration
bash ~/low-latency-performance-workshop/scripts/module05-configure-vmi.sh
```
What This Script Does:
- Auto-detects Performance Profile availability
- Auto-detects the HugePages configuration
- Generates an optimized VMI YAML with:
  - CPU pinning (if a Performance Profile exists)
  - HugePages (if available)
  - An appropriate CPU model (host-passthrough or host-model)
  - Educational comments
Benefits of Using the Script:
- Dynamic Configuration: adapts to your cluster's capabilities
- Educational Feedback: explains what features are enabled and why
- Flexible Options: customize the VMI name, memory, CPUs, and namespace
- Consistent Results: the same configuration across different clusters
Script Options:
- `--name NAME`: VMI name (default: fedora-vmi)
- `--namespace NS`: Namespace (default: default)
- `--memory SIZE`: Memory size (default: 2Gi)
- `--cpus NUM`: Number of CPUs (default: 2)
- `--output FILE`: Output file (default: fedora-vmi.yml)
Example with Custom Settings:
```bash
bash ~/low-latency-performance-workshop/scripts/module05-configure-vmi.sh \
  --name my-vm \
  --memory 4Gi \
  --cpus 4 \
  --output my-vm.yml
```
If No Performance Profile:
The script will generate a VMI configuration with default settings and provide guidance on completing Module 4 for enhanced performance.
- Clean up any existing VMI test resources before starting:

```bash
# Clean up any existing VMI test resources to avoid PVC conflicts
echo "🧹 Cleaning up any existing VMI test resources..."
oc delete vmi --selector=app=vmi-latency-test --all-namespaces --ignore-not-found=true
oc delete dv --selector=app=vmi-latency-test --all-namespaces --ignore-not-found=true

# Wait for cleanup to complete
sleep 5
echo "✅ Cleanup completed"
```
- Run the VMI latency test using the configuration above:

```bash
# Execute the VMI latency test with the containerDisk approach
echo "Starting Fedora VMI latency performance test..."
echo " Test approach: Direct VMI creation with containerDisk (no PVC provisioning)"
echo " Test scale: 5 iterations × 2 replicas = 10 VMIs total"
echo " Environment: Single Node OpenShift (SNO)"
echo " Unique namespaces: vmi-latency-test-0 through vmi-latency-test-4"
echo ""

kube-burner init -c vmi-latency-config.yml --log-level=info

# The test will:
# 1. Create VMIs directly in each namespace using containerDisk
# 2. Measure pure VMI startup latency (no storage provisioning overhead)
# 3. Track VMI lifecycle phases from creation to running
# 4. Generate performance metrics in collected-metrics-vmi/
```
- Understand the test results:

The kube-burner test measures several key VMI startup phases:
# View the key metrics from the test echo "VMI Latency Test Results Summary:" echo "" echo "Key Metrics Measured:" echo "• VMICreated: Time to create VMI object (should be ~0ms)" echo "• VMIPending: Time VMI spends in Pending state" echo "• VMIScheduling: Time to schedule VMI to a node" echo "• VMIScheduled: Time until VMI is scheduled (containerDisk pull + pod creation)" echo "• VMIRunning: Total time until VMI is fully running (includes OS boot)" echo "" echo "Expected Results for SNO Environment with containerDisk:" echo "• VMIScheduled P99: ~30-45 seconds (container image pull + pod start)" echo "• VMIRunning P99: ~45-60 seconds (full VM boot from containerDisk)" echo "• VMIScheduling P99: <1 second (fast on SNO)" echo "" echo "📁 Detailed metrics saved in: collected-metrics-vmi/" ls -la collected-metrics-vmi/
- Monitor VMI creation progress:

```bash
# Watch VMIs being created (press Ctrl+C to exit watch)
echo "Monitoring VMI creation progress..."
echo " Use Ctrl+C to exit the watch command when the test completes"
echo ""

# Watch VMIs and their launcher pods being created
watch -n 5 "echo '--- VMIs ---' && oc get vmi --all-namespaces --selector=app=vmi-latency-test && echo '' && echo '--- Launcher Pods ---' && oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher | grep vmi-latency"
```
- Check the VMI status and verify the architectural difference:
# Comprehensive verification of VMI test results echo "==================================================" echo "📋 VMI Latency Test - Current Status" echo "==================================================" echo "" echo "✅ VirtualMachine Objects (Management Layer):" oc get VirtualMachine -A 2>/dev/null || echo "No VMs found" echo "" echo "✅ VirtualMachineInstance Objects (Running VMs):" oc get VirtualMachineInstance -A 2>/dev/null || echo "No VMIs found" echo "" echo "==================================================" echo "� Kube-burner Test Results" echo "==================================================" echo "" echo "VMIs created by kube-burner test:" oc get vmi --all-namespaces --selector=app=vmi-latency-test -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,IP:.status.interfaces[0].ipAddress,READY:.status.conditions[?\(@.type==\"Ready\"\)].status 2>/dev/null || echo "No test VMIs found" echo "" echo "📋 DataVolume Status (should be empty with containerDisk):" oc get dv --all-namespaces --selector=app=vmi-latency-test 2>/dev/null || echo "No DataVolumes found (expected with containerDisk)" echo "" echo "📋 VMI Launcher Pods:" oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName 2>/dev/null | grep -E "NAMESPACE|vmi-latency" || echo "No launcher pods found" echo "" echo "==================================================" echo "✅ Test Results Summary" echo "==================================================" TOTAL_VMIS=\$(oc get vmi --all-namespaces --selector=app=vmi-latency-test --no-headers 2>/dev/null | wc -l) RUNNING_VMIS=\$(oc get vmi --all-namespaces --selector=app=vmi-latency-test --no-headers 2>/dev/null | grep -c "Running" || echo "0") echo "Total VMIs created: \$TOTAL_VMIS" echo "VMIs in Running phase: \$RUNNING_VMIS" echo "" if [ "\$RUNNING_VMIS" -eq 10 ]; then echo "🎉 SUCCESS! All 10 test VMIs are running!" echo "📊 This demonstrates direct VMI creation with containerDisk" echo "✅ No DataVolumes needed - faster startup for performance testing" echo "" echo "Key Observations:" echo "• All VMIs have IP addresses assigned" echo "• All VMIs are in Ready state" echo "• No PVC/DataVolume provisioning delays" echo "• Pure VMI startup latency measured" elif [ "\$TOTAL_VMIS" -eq 10 ]; then echo "⚠️ All 10 VMIs created, \$RUNNING_VMIS are running" echo " Some may still be pulling containerDisk images" echo " Check: oc get pods --all-namespaces | grep virt-launcher" else echo "⚠️ Expected 10 VMIs, found \$TOTAL_VMIS" echo " Review kube-burner logs for errors" echo "" echo "💡 If VMIs failed, see troubleshooting section below" fi
Troubleshooting VMI Failures
If your VMIs are not running successfully, this section will help you diagnose and fix common issues.
- Check VMI and pod status:

```bash
# Get detailed status of all VMIs
echo "=== VMI Status ==="
oc get vmi --all-namespaces --selector=app=vmi-latency-test
echo ""
echo "=== virt-launcher Pod Status ==="
oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher | grep vmi-latency
echo ""
echo "=== Failed/OOMKilled Pods ==="
oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher | grep -E "OOMKilled|Error|CrashLoop" || echo "No failed pods"
```
- Diagnose OOMKilled VMIs (the most common issue):
# Check if VMIs are OOMKilled due to insufficient HugePages echo "=== Checking for OOMKilled VMIs ===" OOMKILLED_COUNT=$(oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}' 2>/dev/null | grep -o "OOMKilled" | wc -l) if [ "$OOMKILLED_COUNT" -gt 0 ]; then echo "❌ Found $OOMKILLED_COUNT OOMKilled virt-launcher pods" echo "" echo "Root Cause: Insufficient HugePages for VMI memory + overhead" echo "" echo "Explanation:" echo " • Each VMI needs: 2GB guest + 1GB virt-launcher overhead = 3GB total" echo " • Test creates: 10 VMIs × 3GB = 30GB required" echo " • Available HugePages: $(oc get node -o jsonpath='{.items[0].status.allocatable.hugepages-1Gi}' | sed 's/Gi//g')GB" echo "" echo "Solutions:" echo "" echo " Option 1: Increase HugePages (Recommended)" echo " ─────────────────────────────────────────" echo " bash ~/low-latency-performance-workshop/scripts/module05-update-hugepages.sh" echo "" echo " This will:" echo " • Calculate optimal HugePages for your cluster" echo " • Update Performance Profile" echo " • Trigger node reboot (wait 10-15 minutes)" echo " • Allocate sufficient HugePages for 10 VMIs" echo "" echo " Option 2: Reduce Test Scale" echo " ───────────────────────────" echo " Edit ~/kube-burner-configs/vmi-latency-config.yml:" echo "" echo " Current:" echo " jobIterations: 5" echo " replicas: 2" echo " Total: 10 VMIs" echo "" echo " Recommended for 16GB HugePages:" echo " jobIterations: 2" echo " replicas: 2" echo " Total: 4 VMIs (fits in 16GB)" echo "" echo " Then clean up and re-run:" echo " oc delete vmi --selector=app=vmi-latency-test --all-namespaces" echo " kube-burner init -c vmi-latency-config.yml" echo "" else echo "✅ No OOMKilled pods found" fi
- Check the HugePages allocation:
# Detailed HugePages analysis echo "=== HugePages Allocation Analysis ===" echo "" # Get HugePages from Performance Profile PERF_PROFILE=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) if [ -n "$PERF_PROFILE" ]; then HUGEPAGES_COUNT=$(oc get performanceprofile "$PERF_PROFILE" -o jsonpath='{.spec.hugepages.pages[0].count}' 2>/dev/null) HUGEPAGES_SIZE=$(oc get performanceprofile "$PERF_PROFILE" -o jsonpath='{.spec.hugepages.pages[0].size}' 2>/dev/null) echo "Performance Profile: $PERF_PROFILE" echo " Configured: ${HUGEPAGES_COUNT} × ${HUGEPAGES_SIZE} = ${HUGEPAGES_COUNT}GB" fi echo "" # Get HugePages from node NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}') HUGEPAGES_CAPACITY=$(oc get node "$NODE" -o jsonpath='{.status.capacity.hugepages-1Gi}' | sed 's/Gi//g') HUGEPAGES_ALLOCATABLE=$(oc get node "$NODE" -o jsonpath='{.status.allocatable.hugepages-1Gi}' | sed 's/Gi//g') echo "Node: $NODE" echo " Capacity: ${HUGEPAGES_CAPACITY}GB" echo " Allocatable: ${HUGEPAGES_ALLOCATABLE}GB" echo "" echo "VMI Capacity Calculation:" echo " • VMI memory requirement: 3GB per VMI (2GB guest + 1GB overhead)" echo " • Available HugePages: ${HUGEPAGES_ALLOCATABLE}GB" echo " • Max concurrent VMIs: ~$((HUGEPAGES_ALLOCATABLE / 3))" echo " • Test requires: 10 VMIs = 30GB" echo "" if [ "$HUGEPAGES_ALLOCATABLE" -ge 30 ]; then echo "✅ Sufficient HugePages for 10 VMIs" elif [ "$HUGEPAGES_ALLOCATABLE" -ge 24 ]; then echo "⚠️ Sufficient for 8 VMIs, reduce test scale to 8" elif [ "$HUGEPAGES_ALLOCATABLE" -ge 18 ]; then echo "⚠️ Sufficient for 6 VMIs, reduce test scale to 6" elif [ "$HUGEPAGES_ALLOCATABLE" -ge 12 ]; then echo "⚠️ Sufficient for 4 VMIs, reduce test scale to 4" else echo "❌ Insufficient HugePages, increase allocation to at least 24GB" fi
- Check VMI events for errors:

```bash
# Check events for failed VMIs
echo "=== Recent VMI Events ==="
oc get events --all-namespaces --field-selector involvedObject.kind=VirtualMachineInstance --sort-by='.lastTimestamp' | tail -20
```
- View virt-launcher pod logs:

```bash
# Get logs from a failed virt-launcher pod
echo "=== virt-launcher Pod Logs (first failed pod) ==="
FAILED_POD=$(oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o jsonpath='{.items[?(@.status.phase!="Running")].metadata.name}' | awk '{print $1}' | head -1)
FAILED_NS=$(oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o jsonpath='{.items[?(@.status.phase!="Running")].metadata.namespace}' | awk '{print $1}' | head -1)

if [ -n "$FAILED_POD" ]; then
  echo "Pod: $FAILED_POD (namespace: $FAILED_NS)"
  echo ""
  oc logs -n "$FAILED_NS" "$FAILED_POD" --tail=50 2>/dev/null || echo "No logs available"
else
  echo "No failed pods found"
fi
```
Common VMI Failure Patterns:
- OOMKilled virt-launcher pods: insufficient HugePages for guest memory plus overhead (see the diagnosis step above)
- VMIs stuck in Scheduling: not enough isolated CPUs for dedicated CPU placement
- Slow transitions to Running: containerDisk image pulls on first use

Prevention: always run the resource validation script (module05-validate-vmi-resources.sh, shown earlier in this module) before testing.

This will catch resource issues before they cause failures!
Analyzing VMI Latency Results
Now let’s analyze the VMI performance results and understand what the metrics tell us about virtualization performance characteristics.
- Examine the VMI latency metrics generated by kube-burner:
cd ~/kube-burner-configs # Check what metrics were generated echo "📊 VMI Latency Test Results:" ls -la collected-metrics-vmi/ # View the summary of VMI latency measurements echo "" echo "📋 VMI Latency Quantiles (Key Performance Indicators):" echo " All times in milliseconds (ms)" echo "" if [ -f "collected-metrics-vmi/vmiLatencyQuantilesMeasurement-vmi-latency-test.json" ]; then cat collected-metrics-vmi/vmiLatencyQuantilesMeasurement-vmi-latency-test.json | jq -r '.[] | "\(.quantileName) - P99: \(.P99)ms | P50: \(.P50)ms | Avg: \(.avg)ms"' | grep -v "VMReady" | sort else echo "VMI latency quantiles file not found" fi # Show job summary echo "" echo "📈 Test Execution Summary:" if [ -f "collected-metrics-vmi/jobSummary.json" ]; then cat collected-metrics-vmi/jobSummary.json | jq -r '.[] | "Job: \(.jobConfig.name) | Status: \(if .passed then "✅ PASSED" else "❌ FAILED" end) | Duration: \(.elapsedTime)s | QPS: \(.achievedQps)"' else echo "Job summary file not found" fi
- Analyze the VMI startup phases and understand the performance characteristics:
cd ~/kube-burner-configs # Analyze the detailed VMI latency measurements echo "🔍 Detailed VMI Startup Phase Analysis:" echo "" if [ -f "collected-metrics-vmi/vmiLatencyMeasurement-vmi-latency-test.json" ]; then echo "VMI Startup Phases (in chronological order):" echo "1. VMICreated → VMIPending: Object creation time" echo "2. VMIPending → VMIScheduling: Waiting for scheduling" echo "3. VMIScheduling → VMIScheduled: Node assignment + pod creation" echo "4. VMIScheduled → VMIRunning: containerDisk pull + VM boot" echo "" # Show actual timing data echo "📊 Actual Timing Results (Average across all VMIs):" cat collected-metrics-vmi/vmiLatencyMeasurement-vmi-latency-test.json | jq -r ' [.[] | { vmiCreated: .vmiCreatedLatency, vmiPending: .vmiPendingLatency, vmiScheduling: .vmiSchedulingLatency, vmiScheduled: .vmiScheduledLatency, vmiRunning: .vmiRunningLatency, podCreated: .podCreatedLatency, podScheduled: .podScheduledLatency, podInitialized: .podInitializedLatency, podContainersReady: .podContainersReadyLatency }] | { vmiCreated: ([.[].vmiCreated] | add / length), vmiPending: ([.[].vmiPending] | add / length), vmiScheduling: ([.[].vmiScheduling] | add / length), vmiScheduled: ([.[].vmiScheduled] | add / length), vmiRunning: ([.[].vmiRunning] | add / length), podCreated: ([.[].podCreated] | add / length), podScheduled: ([.[].podScheduled] | add / length), podInitialized: ([.[].podInitialized] | add / length), podContainersReady: ([.[].podContainersReady] | add / length) } | to_entries | .[] | " \(.key): \(.value | floor)ms" ' echo "" echo "🎯 Performance Analysis (containerDisk approach):" echo "• VMICreated should be ~0ms (object creation)" echo "• VMIScheduling should be <2000ms (fast scheduling on SNO)" echo "• VMIScheduled includes containerDisk image pull time (major component)" echo "• VMIRunning includes full Fedora boot time from containerDisk (~45-55s typical)" echo "" echo "💡 Key Insight: With containerDisk, most time is spent pulling the container" echo " image and booting the OS. No PVC provisioning or DataVolume import delays!" else echo "❌ VMI latency measurement file not found" echo "This may indicate the test didn't complete successfully" fi
- Analyze VMI performance using the main performance analyzer:

```bash
cd ~/kube-burner-configs

# Use the main performance analyzer for VMI metrics
echo "🎓 Running VMI Performance Analysis..."
python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
  --single collected-metrics-vmi

# This analysis provides:
# • VMI startup phase breakdown and timing analysis
# • Performance bottleneck identification
# • Statistical analysis of latency variations
# • Comparison with performance thresholds
# • Color-coded performance assessment
```
- Compare VMI performance characteristics with the container baselines:
cd ~/kube-burner-configs # Generate comprehensive comparison between VMs and containers echo "📊 VMI vs Container Performance Comparison..." # Check what metrics are available for comparison BASELINE_AVAILABLE=false TUNED_AVAILABLE=false if [ -d "collected-metrics" ]; then echo "✅ Container baseline metrics found" BASELINE_AVAILABLE=true fi if [ -d "collected-metrics-tuned" ]; then echo "✅ Container tuned metrics found" TUNED_AVAILABLE=true fi if [ -d "collected-metrics-vmi" ]; then echo "✅ VMI metrics found" else echo "❌ VMI metrics not found - check test execution above" exit 1 fi echo "" # Module 5 focused analysis - VMI performance with intelligent container context echo "🎯 Module 5 Focused Analysis (VMI Performance with Context)..." python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 5 echo "" echo "💡 Module 5 Learning Focus:" echo " 🔍 VMI startup phases and timing" echo " ⚖️ Virtualization vs containerization trade-offs" echo " 🎯 When to choose VMs vs containers for workloads" if [ "$TUNED_AVAILABLE" = true ]; then echo " 🚀 How performance profiles benefit both VMs and containers" else echo " ℹ️ Performance profiles (Module 4) would improve both VMs and containers" fi echo "" echo "📚 How to Read the Module 5 Analysis:" echo " 1. Individual sections show raw performance for each test type" echo " 2. VMI metrics (🖥️ section) are the focus of this module" echo " 3. Container metrics provide context for comparison" echo " 4. Look for VMI-specific phases: VMICreated → VMIPending → VMIScheduled → VMIRunning" echo "" echo "💡 This comparison explains:" echo "• Why VMs take longer to start than containers (OS boot vs process start)" echo "• The performance trade-offs of virtualization (isolation vs overhead)" echo "• When to use VMs vs containers for different workloads" echo "• How performance profiles affect both VMs and containers"
- Generate a comprehensive performance report:
cd ~/kube-burner-configs # Generate a comprehensive markdown report with all available metrics echo "Generating Comprehensive Performance Report..." # Determine what metrics are available and generate appropriate report BASELINE_AVAILABLE=false TUNED_AVAILABLE=false VMI_AVAILABLE=false [ -d "collected-metrics" ] && BASELINE_AVAILABLE=true [ -d "collected-metrics-tuned" ] && TUNED_AVAILABLE=true [ -d "collected-metrics-vmi" ] && VMI_AVAILABLE=true # Generate Module 5 specific report with available metrics REPORT_FILE="module5-vmi-performance-report-$(date +%Y%m%d-%H%M).md" echo "📄 Generating Module 5 VMI Performance Report..." echo " 🎯 Focus: Virtual machine performance analysis" echo " 📊 Context: VMI startup vs container performance" if [ "$BASELINE_AVAILABLE" = true ] && [ "$TUNED_AVAILABLE" = true ] && [ "$VMI_AVAILABLE" = true ]; then echo " 📈 Scope: VMI + Container baseline + Container tuned" python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \ --baseline collected-metrics \ --tuned collected-metrics-tuned \ --vmi collected-metrics-vmi \ --report "$REPORT_FILE" elif [ "$BASELINE_AVAILABLE" = true ] && [ "$VMI_AVAILABLE" = true ]; then echo " 📈 Scope: VMI + Container baseline" python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \ --baseline collected-metrics \ --vmi collected-metrics-vmi \ --report "$REPORT_FILE" elif [ "$VMI_AVAILABLE" = true ]; then echo " 📈 Scope: VMI standalone analysis" python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \ --single collected-metrics-vmi \ --report "$REPORT_FILE" else echo "❌ No VMI performance metrics found for report generation" exit 1 fi echo "" echo "📄 Performance Report Generated: $REPORT_FILE" echo "📊 Report Summary:" if [ -f "$REPORT_FILE" ]; then head -20 "$REPORT_FILE" echo "" echo "💡 View the complete report: cat $REPORT_FILE" else echo "❌ Report generation failed" fi
SR-IOV Configuration for High-Performance VM Networking
SR-IOV (Single Root I/O Virtualization) provides direct hardware access to Virtual Machines, bypassing the software networking stack for maximum performance. This is particularly important for VMs that require near bare-metal network performance.
Lab Environment Considerations: This workshop supports two approaches for high-performance VM networking: SR-IOV (requires SR-IOV-capable NICs) for production-grade performance, and OVN-Kubernetes User Defined Networks (no special hardware) as a lab simulation.

This module covers both approaches so you can learn SR-IOV concepts and still test in your lab environment.
Choosing Your Networking Approach
Approach | Use Case | Hardware Required | Performance |
---|---|---|---|
Default Pod Network | Basic VMs, development | None | 2-5ms latency |
User Defined Networks | Lab environments, learning, testing | None | 1-3ms latency |
SR-IOV | Production NFV, real-time apps | SR-IOV capable NICs | <1ms latency |
Recommendation for This Workshop: unless your cluster has SR-IOV-capable NICs, use the User Defined Networks approach for the hands-on steps.

Both approaches teach the same concepts:
- Dual-interface VM design
- Network separation (management vs data plane)
- Performance optimization techniques
- Multi-network VM architecture
Understanding SR-IOV Benefits for VMs
Feature | VM with Pod Network | VM with SR-IOV |
---|---|---|
Latency | 2-5ms (through virt-launcher pod) | <1ms (direct hardware access) |
Throughput | 5-20 Gbps (limited by pod network) | Near line-rate (40-100 Gbps) |
CPU Usage | Higher (virtio + pod network overhead) | Lower (hardware offload) |
Isolation | Software-based (pod network) | Hardware-enforced (dedicated VF) |
Network Stack | VM → virtio → virt-launcher → CNI → host | VM → SR-IOV VF → physical NIC |
Why SR-IOV Matters for VMs: traffic on the pod network traverses the virtio device, the virt-launcher pod, and the host CNI stack, while SR-IOV attaches a dedicated Virtual Function directly to the VM and bypasses that software path. SR-IOV is the key technology for achieving container-like network performance in VMs.
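A quick way to see whether any node actually advertises SR-IOV Virtual Functions is to look for extended resources on the nodes; the exact resource name comes from your SriovNetworkNodePolicy (for example `openshift.io/vm_sriov_net`), so treat the filter below as an assumption:

```bash
# List any openshift.io/* extended resources advertised by the nodes (SR-IOV VFs show up here)
oc get nodes -o json | jq -r '
  .items[] | .metadata.name as $n
  | .status.allocatable | to_entries[]
  | select(.key | startswith("openshift.io/"))
  | "\($n): \(.key)=\(.value)"'
```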
Verifying SR-IOV Network Operator
The SR-IOV Network Operator was deployed in Module 2. Let’s verify it’s ready for VM networking:
- Check the SR-IOV operator status:

```bash
# Check SR-IOV operator installation
oc get csv -n openshift-sriov-network-operator

# Verify SR-IOV operator pods
oc get pods -n openshift-sriov-network-operator

# Check if SR-IOV capable nodes are detected
oc get sriovnetworknodestates -n openshift-sriov-network-operator

# List available SR-IOV network node policies
oc get sriovnetworknodepolicy -n openshift-sriov-network-operator

# Check for SR-IOV networks configured for VMs
oc get sriovnetwork -n openshift-sriov-network-operator
```
If SR-IOV hardware is not available or the operator shows no SR-IOV capable nodes, proceed to the Lab Simulation section below to use User Defined Networks instead.
Lab Simulation: High-Performance VM Networking with User Defined Networks
For lab environments without SR-IOV hardware, we can simulate high-performance VM networking using OVN-Kubernetes User Defined Networks (also called Secondary Networks). While not as fast as SR-IOV, this provides better performance than the default pod network and demonstrates the same networking concepts.
Clean Up Previous Test VMIs
Before creating the high-performance VM, clean up VMIs from the previous kube-burner test to free HugePages:
- Check current VMI resource usage:
# Check running VMIs and their HugePages usage echo "=== Current VMIs ===" oc get vmi --all-namespaces echo "" echo "=== HugePages Usage ===" oc get node -o jsonpath='{.items[0].status.allocatable.hugepages-1Gi}' | sed 's/Gi/ GB available/g' echo "" # Calculate VMIs using HugePages VMI_COUNT=$(oc get vmi --all-namespaces --no-headers 2>/dev/null | wc -l) if [ "$VMI_COUNT" -gt 0 ]; then echo "Current VMIs: $VMI_COUNT" echo "Estimated HugePages in use: ~$((VMI_COUNT * 3)) GB (assuming 2GB guest + 1GB overhead per VMI)" echo "" echo "⚠️ Cleanup recommended before creating new VMs" fi
Why Cleanup is Important:
Each VMI consumes HugePages memory that remains allocated even after testing completes:
- VMI Guest Memory: 2GB per VMI (configured in the VMI spec)
- virt-launcher Overhead: ~1GB per VMI (KubeVirt management pod)
- Total per VMI: ~3GB
Example:
```
8 running VMIs × 3GB = 24GB HugePages in use
Available HugePages:   24GB
Result: no HugePages available for new VMs! ❌
```
Best Practice: Always clean up test VMIs before starting new VM deployments to avoid resource exhaustion.
- Clean up the test VMIs and namespaces:
# Delete all VMIs from kube-burner test echo "Cleaning up test VMIs..." oc delete vmi --selector=app=vmi-latency-test --all-namespaces --wait=false # Delete test namespaces for i in {0..4}; do oc delete namespace vmi-latency-test-$i --wait=false 2>/dev/null || true done echo "" echo "Cleanup initiated. Waiting for resources to be freed..." sleep 10 # Verify cleanup echo "" echo "=== Remaining VMIs ===" oc get vmi --all-namespaces echo "" echo "=== HugePages Now Available ===" oc get node -o jsonpath='{.items[0].status.allocatable.hugepages-1Gi}' | sed 's/Gi/ GB available/g' echo ""
If you see VMIs still terminating, wait a few moments for them to fully clean up. You can monitor with `watch oc get vmi --all-namespaces` and press Ctrl+C to exit the watch command.
Create User Defined Network
Why User Defined Networks for Lab Environments:

Performance Comparison:
- Default Pod Network: 2-5ms latency
- User Defined Network: 1-3ms latency (30-50% improvement)
- SR-IOV: <1ms latency (production target)
- Create a User Defined Network for high-performance VM networking:

```bash
cat << EOF | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: vm-high-perf-network
  namespace: default
spec:
  topology: Layer2
  layer2:
    role: Secondary
    subnets:
      - "192.168.100.0/24"
EOF
```
UserDefinedNetwork (UDN) - Modern OpenShift 4.18+ Approach:

This creates a Layer2 User Defined Network using native OVN-Kubernetes integration:

- API: `k8s.ovn.org/v1` (native OVN-Kubernetes, not Multus)
- Topology: Layer2 (recommended for VM networking)
- Role: Secondary (an additional network, not replacing the pod network)
- Subnet: 192.168.100.0/24 (automatic IPAM by OVN-Kubernetes)
- Benefits:
  - Simpler configuration than NetworkAttachmentDefinition
  - Native OVN-Kubernetes IPAM (no manual IPAM configuration needed)
  - Better integration with OpenShift Virtualization
  - Recommended approach for OpenShift 4.18+

Why Layer2?
- VMs can communicate at Layer 2 (like a virtual switch)
- Better for VM-to-VM communication
- Supports VM live migration with persistent IPs
- Simpler than Layer3 for most VM use cases

Note: OpenShift automatically creates a corresponding NetworkAttachmentDefinition for compatibility with VMs.
- Verify the UserDefinedNetwork was created:

```bash
# Check UserDefinedNetwork
echo "=== UserDefinedNetwork ==="
oc get userdefinednetwork vm-high-perf-network -n default -o yaml

echo ""
echo "=== Auto-Generated NetworkAttachmentDefinition ==="
# OpenShift automatically creates a NetworkAttachmentDefinition for VM compatibility
oc get net-attach-def vm-high-perf-network -n default

echo ""
echo "=== Network Details ==="
oc describe userdefinednetwork vm-high-perf-network -n default
```
What Just Happened:

When you create a UserDefinedNetwork, OpenShift automatically:

- Creates the UDN: the Layer2 network with OVN-Kubernetes IPAM
- Auto-generates a NetworkAttachmentDefinition: for backward compatibility with VMs
- Configures OVN: sets up the virtual switch and subnet

Key Point: VMs still reference the network using `multus.networkName` in their spec, but the underlying implementation is now the modern UserDefinedNetwork instead of a manually configured NetworkAttachmentDefinition.

This is why UserDefinedNetwork is better:
- ✅ You define the network once (simple YAML)
- ✅ OpenShift handles the NetworkAttachmentDefinition automatically
- ✅ Native OVN-Kubernetes integration (no manual CNI JSON)
- ✅ Built-in IPAM (no configuration needed)
- Create a high-performance VM with dual network interfaces (lab simulation):
cat << EOF | oc apply -f - apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: name: high-perf-vm-lab namespace: default labels: app: high-perf-vm spec: running: true dataVolumeTemplates: - metadata: name: high-perf-vm-lab-rootdisk spec: storage: resources: requests: storage: 30Gi sourceRef: kind: DataSource name: fedora namespace: openshift-virtualization-os-images template: metadata: labels: kubevirt.io/vm: high-perf-vm-lab app: high-perf-vm spec: domain: cpu: cores: 4 dedicatedCpuPlacement: true # Pin CPUs for low latency memory: hugepages: pageSize: 1Gi # Use 1Gi HugePages (matches cluster configuration) guest: 4Gi resources: requests: memory: 4Gi limits: memory: 4Gi devices: disks: - name: rootdisk disk: bus: virtio - name: cloudinitdisk disk: bus: virtio interfaces: # Primary interface: Pod network (for management) - name: default masquerade: {} # Secondary interface: User Defined Network (for high-performance data) - name: high-perf-net bridge: {} networkInterfaceMultiqueue: true # Enable multi-queue for better performance networks: # Pod network for management traffic - name: default pod: {} # User Defined Network for data plane traffic - name: high-perf-net multus: networkName: vm-high-perf-network volumes: - name: rootdisk dataVolume: name: high-perf-vm-lab-rootdisk - name: cloudinitdisk cloudInitNoCloud: userData: | #cloud-config user: fedora password: fedora chpasswd: { expire: False } runcmd: - nmcli con add type ethernet con-name eth1 ifname eth1 ip4 192.168.100.10/24 - nmcli con up eth1 EOF
Lab VM Configuration Explained:

- Disk Configuration (DataVolume):
  - Uses `dataVolumeTemplates` to create a persistent disk
  - Source: the `fedora` DataSource (VolumeSnapshot) in `openshift-virtualization-os-images`
  - Pre-installed Fedora image (fast boot, even without KVM)
  - 30Gi storage allocation
  - Why not containerDisk? containerDisk is slow without KVM hardware virtualization
- Dual Network Interfaces (same as production SR-IOV):
  - `default`: pod network for management (SSH, monitoring)
  - `high-perf-net`: User Defined Network for the data plane
- Performance Optimizations (educational examples):
  - `dedicatedCpuPlacement: true` pins CPUs to the VM (requires KVM for full benefit)
  - `hugepages: pageSize: 1Gi` uses 1Gi HugePages (matches the cluster config from Module 4)
  - `resources: requests/limits: 4Gi` guarantees the memory allocation
  - `networkInterfaceMultiqueue: true` enables parallel packet processing (4 queues per interface)
  - `bridge: {}` attaches the secondary interface directly to the bridge (better than masquerade)
- HugePages Configuration:
  - The VM requests 4GB of guest memory
  - Uses 4 × 1Gi HugePages (matches the Performance Profile)
  - Plus ~1GB virt-launcher overhead = ~5GB total
  - Must match the cluster's HugePages size (1Gi from Module 4)
  - Note: HugePages work with or without KVM, but provide the best performance with KVM
- Cloud-init Configuration:
  - Creates the user `fedora` with password `fedora`
  - Automatically configures eth1 with a static IP (192.168.100.10/24)
  - Sets up the network interface on boot, so the VM is ready for testing immediately

This simulates the SR-IOV architecture without special hardware!
Note: This VM demonstrates performance features (HugePages, CPU pinning, multi-queue) that are typically used with KVM hardware virtualization. The VM will boot and run successfully even without KVM (using software emulation), but performance features provide maximum benefit when KVM is available.
- Wait for the DataVolume to be created and the VM to start:
# Check DataVolume creation progress echo "=== DataVolume Status ===" oc get dv high-perf-vm-lab-rootdisk -n default # Wait for DataVolume to be ready (cloning from snapshot) echo "" echo "Waiting for DataVolume to be ready (this may take 1-2 minutes)..." oc wait --for=condition=Ready dv/high-perf-vm-lab-rootdisk -n default --timeout=300s # Check VM status echo "" echo "=== VM Status ===" oc get vm high-perf-vm-lab -n default oc get vmi high-perf-vm-lab -n default
DataVolume Creation Process:

When you create a VM with `dataVolumeTemplates`, OpenShift Virtualization:

- Creates a DataVolume: persistent storage for the VM
- Clones from the VolumeSnapshot: copies the Fedora image from the snapshot
- Creates a PVC: a Persistent Volume Claim for the disk
- Starts the VM: once the DataVolume is ready

This process takes 1-2 minutes but results in a fast-booting VM with persistent storage.

Advantages over containerDisk:
- ✅ Faster boot (pre-installed image)
- ✅ Persistent storage (survives VM restarts)
- ✅ Works well without KVM hardware virtualization
- ✅ Same image used by the OpenShift Console VM wizard
- Verify the VM has dual network interfaces:
# Wait for VM to be running oc wait --for=condition=Ready vmi/high-perf-vm-lab --timeout=300s # Check VM network interfaces oc get vmi high-perf-vm-lab -o jsonpath='{.status.interfaces}' | jq # Verify both networks are attached echo "VM Network Configuration:" oc get vmi high-perf-vm-lab -o jsonpath='{.spec.networks}' | jq # Check that VM has both pod network and user defined network oc describe vmi high-perf-vm-lab | grep -A 10 "Interfaces"
- Test the VM's network performance:
# Use the VMI network tester to validate connectivity python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \ --namespace default # Access the VM to verify network interfaces virtctl console high-perf-vm-lab # Inside the VM, check network interfaces ip addr show # You should see: # - eth0: Pod network interface (management) - 10.x.x.x # - eth1: User Defined Network (high-performance) - 192.168.100.10 # Test connectivity on both interfaces ping -c 4 -I eth0 8.8.8.8 # Management network ping -c 4 -I eth1 192.168.100.1 # High-performance network # Check interface statistics ip -s link show eth0 ip -s link show eth1
Lab Simulation Performance Expectations:

While not as fast as SR-IOV (<1ms), this demonstrates:
- Dual-interface VM architecture
- Network separation (control vs data plane)
- Performance optimization techniques
- Production-ready patterns

This is perfect for learning and lab environments!
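For a rough sanity check of the software path, you can ping the VM's virt-launcher pod from the node. This only exercises the pod-network side (the masqueraded guest sits behind it) and the label selector is an assumption, but it gives a feel for the latency numbers above:

```bash
# Rough pod-network latency check toward the lab VM's virt-launcher pod
NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
LAUNCHER_IP=$(oc get pods -n default -l vm.kubevirt.io/name=high-perf-vm-lab \
  -o jsonpath='{.items[0].status.podIP}')
echo "virt-launcher pod IP: $LAUNCHER_IP"
oc debug node/"$NODE" -- chroot /host ping -c 5 "$LAUNCHER_IP"
```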
Configuring SR-IOV for Virtual Machines (Production)
When to Use This Section: follow these steps only if your cluster has SR-IOV-capable NICs and the SR-IOV Network Operator reports SR-IOV-capable nodes (see the verification above).

For Lab Environments: use the User Defined Networks approach above instead.
Unlike pods, VMs require specific SR-IOV network configuration to attach Virtual Functions directly to the VM.
- Create an SR-IOV Network for VM use (production hardware required):

```bash
cat << EOF | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: vm-sriov-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: vm_sriov_net
  networkNamespace: default
  vlan: 100  # Optional: VLAN tagging
  capabilities: '{"ips": true, "mac": true}'
  # Important: This network will be used by VMs
  ipam: |
    {
      "type": "host-local",
      "subnet": "192.168.100.0/24",
      "rangeStart": "192.168.100.10",
      "rangeEnd": "192.168.100.100",
      "gateway": "192.168.100.1"
    }
EOF
```
This creates an SR-IOV network specifically for VM use. The `resourceName` must match the SR-IOV Network Node Policy configured in Module 2.
- Create a high-performance VM with SR-IOV networking:
cat << EOF | oc apply -f - apiVersion: kubevirt.io/v1 kind: VirtualMachine metadata: name: high-performance-vm-sriov namespace: default spec: running: true template: metadata: labels: kubevirt.io/vm: high-performance-vm-sriov spec: domain: cpu: cores: 4 dedicatedCpuPlacement: true # Pin CPUs for low latency memory: hugepages: pageSize: 2Mi # Use HugePages guest: 4Gi devices: disks: - name: containerdisk disk: bus: virtio - name: cloudinitdisk disk: bus: virtio interfaces: # Primary interface: Pod network (for management) - name: default masquerade: {} # Secondary interface: SR-IOV (for high-performance data plane) - name: sriov-net sriov: {} networks: # Pod network for management traffic - name: default pod: {} # SR-IOV network for data plane traffic - name: sriov-net multus: networkName: vm-sriov-network volumes: - name: containerdisk containerDisk: image: quay.io/containerdisks/fedora:latest - name: cloudinitdisk cloudInitNoCloud: userData: | #cloud-config password: fedora chpasswd: { expire: False } EOF
VM SR-IOV Configuration Explained:

- Two Network Interfaces:
  - `default`: pod network for management (SSH, monitoring)
  - `sriov-net`: SR-IOV for high-performance data traffic
- Why Two Interfaces?
  - Management traffic doesn't need SR-IOV performance
  - Data plane traffic gets direct hardware access
  - Separates the control and data planes
- Performance Features:
  - `dedicatedCpuPlacement: true` pins CPUs to the VM
  - `hugepages` reduces memory overhead
  - `sriov: {}` attaches an SR-IOV VF directly to the VM
- Verify the VM has SR-IOV networking:
# Wait for VM to be running oc wait --for=condition=Ready vmi/high-performance-vm-sriov --timeout=300s # Check VM network interfaces oc get vmi high-performance-vm-sriov -o jsonpath='{.status.interfaces}' | jq # Verify SR-IOV VF is attached oc describe vmi high-performance-vm-sriov | grep -A 10 "Interfaces" # Check that VM has both pod network and SR-IOV echo "VM Network Configuration:" oc get vmi high-performance-vm-sriov -o jsonpath='{.spec.networks}' | jq
Testing VM SR-IOV Network Performance
Now let’s test the network performance of the VM with SR-IOV to see the improvement over pod networking.
- Access the VM and check its network interfaces:
# Access the VM console virtctl console high-performance-vm-sriov # Inside the VM, check network interfaces ip addr show # You should see: # - eth0: Pod network interface (management) # - eth1: SR-IOV interface (high-performance) # Check SR-IOV interface details ethtool -i eth1 # Test network performance (requires iperf3 installed) # From another VM or pod, run iperf3 server # Then from this VM: iperf3 -c <server-ip> -i 1 -t 30
- Use the VMI network tester to validate SR-IOV VM connectivity:

```bash
# Test networking to the SR-IOV-enabled VM
python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \
  --namespace default

# This will test connectivity to VMs including SR-IOV-enabled ones
# Expected results:
# - Pod network interface: 2-5ms latency
# - SR-IOV interface: <1ms latency (if tested directly)
```
SR-IOV Performance Expectations for VMs:
The SR-IOV interface provides 5-10x better latency and 2-5x better throughput compared to pod networking for VMs.
Network Policy Latency Testing
Network policies can impact VM networking performance. Let’s test network policy enforcement latency using kube-burner’s network policy latency measurement.
- Create a network policy latency test configuration adapted for SNO:
cd ~/kube-burner-configs cat << EOF > network-policy-latency-config.yml global: measurements: - name: netpolLatency metricsEndpoints: - indexer: type: local metricsDirectory: collected-metrics-netpol jobs: # Job 1: Create pods and namespaces (reduced scale for SNO) - name: network-policy-setup jobType: create jobIterations: 3 # Reduced for SNO namespace: network-policy-perf namespacedIterations: true cleanup: false podWait: true waitWhenFinished: true verifyObjects: true errorOnVerify: false namespaceLabels: kube-burner.io/skip-networkpolicy-latency: "true" objects: - objectTemplate: network-test-pod.yml replicas: 2 # Reduced for SNO inputVars: containerImage: registry.redhat.io/ubi8/ubi-minimal:latest # Job 2: Apply network policies and test connectivity - name: network-policy-test jobType: create jobIterations: 3 # Reduced for SNO namespace: network-policy-perf namespacedIterations: false cleanup: false podWait: false waitWhenFinished: true verifyObjects: true errorOnVerify: false jobPause: 30s # Reduced pause for faster testing objects: - objectTemplate: ingress-network-policy.yml replicas: 1 # Reduced for SNO inputVars: namespaces: 3 # Reduced for SNO EOF
-
Create the network test pod template:
cat << EOF > network-test-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: network-test-pod-{{.Iteration}}-{{.Replica}}
  labels:
    app: network-test
    iteration: "{{.Iteration}}"
    replica: "{{.Replica}}"
spec:
  # No nodeSelector for SNO - the pod will schedule on the single node
  containers:
    - name: network-test-container
      image: {{.containerImage}}
      command: ["/bin/bash"]
      args: ["-c", "microdnf install -y httpd && echo 'Hello from pod {{.Iteration}}-{{.Replica}}' > /var/www/html/index.html && httpd -D FOREGROUND"]
      ports:
        - containerPort: 80
          protocol: TCP
      resources:
        requests:
          memory: "128Mi"   # Increased for httpd
          cpu: "100m"
        limits:
          memory: "256Mi"
          cpu: "200m"
      readinessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 5
  restartPolicy: Never
EOF
-
Create the ingress network policy template:
cat << EOF > ingress-network-policy.yml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress-policy-{{.Iteration}}-{{.Replica}}
spec:
  podSelector:
    matchLabels:
      app: network-test
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: network-policy-perf-{{.Iteration}}
        - podSelector:
            matchLabels:
              app: network-test
      ports:
        - protocol: TCP
          port: 80          # Matches the httpd default port used by the test pods
  # Allow DNS egress so name resolution keeps working once the policy applies
  egress:
    - to: []
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
EOF
-
Run the network policy latency test:
# Execute the network policy latency test adapted for SNO
echo "Starting network policy latency test..."
echo "  Test scale: 3 iterations × 2 replicas = 6 pods total"
echo "  Environment: Single Node OpenShift (SNO)"
echo ""

kube-burner init -c network-policy-latency-config.yml --log-level=info

# This test will:
# 1. Create pods in multiple namespaces (reduced scale for SNO)
# 2. Apply network policies with ingress rules
# 3. Measure network policy enforcement latency
-
Monitor network policy test progress:
# Watch network policies being created (press Ctrl+C to exit)
echo "Monitoring network policy test progress..."
echo "  Use Ctrl+C to exit the watch command when the test completes"
echo ""

watch -n 5 "echo '--- Network Policies ---' && oc get networkpolicy --all-namespaces | grep network-policy-perf && echo '' && echo '--- Test Pods ---' && oc get pods --all-namespaces | grep network-test"
-
Check test results after completion:
# Check final network policy status
echo "📋 Final Network Policy Status:"
oc get networkpolicy --all-namespaces | grep network-policy-perf

# Check pod status
echo ""
echo "📋 Test Pod Status:"
oc get pods --all-namespaces | grep network-test

# Check whether pods are ready and accessible
echo ""
echo "📊 Pod Readiness:"
oc get pods --all-namespaces -o custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[0].ready,STATUS:.status.phase | grep network-test
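If you want a quick look at the raw measurement data before the analysis scripts below, kube-burner writes it to the metricsDirectory configured earlier. The exact file names follow kube-burner's measurement naming and may vary by version, so treat the glob below as a sketch:

# List the raw netpolLatency measurement files
ls ~/kube-burner-configs/collected-metrics-netpol/

# Peek at the latency quantile summaries (adjust the glob if your kube-burner version names files differently)
cat ~/kube-burner-configs/collected-metrics-netpol/*Quantiles*.json | jq '.[] | {quantileName, avg, P99}'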
Educational Analysis Scripts for Virtualization
The workshop provides educational scripts to help you understand VM vs container trade-offs and test VM networking.
-
VM vs Container Comparison - Educational comparison tool:
# Compare VMs and containers comprehensively
python3 ~/low-latency-performance-workshop/scripts/module05-vm-vs-container-comparison.py

# Disable colored output for documentation
python3 ~/low-latency-performance-workshop/scripts/module05-vm-vs-container-comparison.py --no-color
This script provides:

- Architecture and design differences explained
- Startup time comparison (VMs: 60-90s vs containers: 3-10s)
- Resource usage and overhead analysis
- Isolation and security characteristics
- Networking performance comparison
- Use-case guidance for choosing VMs vs containers
-
VMI Network Tester - Test networking against Virtual Machines:
# Test networking against all VMIs in the cluster
python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py

# Test VMIs in a specific namespace
python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \
  --namespace vmi-latency-test-0

# Skip educational explanations
python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \
  --skip-explanation
This script tests:

- VMI connectivity and reachability
- Network latency to virtual machines (not pods!)
- VMI IP assignment and configuration
- Network policy impact on VM traffic
- It creates test pods that ping the VMIs to measure performance

This script helps you understand and validate VM networking performance.
Analyzing Network Policy Latency Results with Python
Use the educational Python scripts to analyze network policy enforcement latency and understand its impact on VM networking performance.
-
Run the network policy performance analyzer:
cd ~/low-latency-performance-workshop/scripts

# Run the educational network policy latency analyzer
echo "🔍 Analyzing Network Policy Performance Impact..."
python3 ~/low-latency-performance-workshop/scripts/module05-network-policy-analyzer.py \
  --metrics-dir ~/kube-burner-configs \
  --analysis-type latency

# The script provides:
# 1. Educational analysis of policy enforcement overhead
# 2. Color-coded performance assessment
# 3. Performance vs security trade-off explanations
# 4. Recommendations for policy optimization
-
Generate comprehensive network policy performance insights:
cd ~/low-latency-performance-workshop/scripts

# Create a detailed educational analysis with report generation
echo "📊 Generating Comprehensive Network Policy Analysis..."
python3 ~/low-latency-performance-workshop/scripts/module05-network-policy-analyzer.py \
  --metrics-dir ~/kube-burner-configs \
  --analysis-type comprehensive \
  --output-format educational

# This educational analysis includes:
# • Statistical analysis of policy enforcement latency
# • Performance vs security trade-off explanations
# • Best practices for low-latency network policies
# • A detailed markdown report with optimization strategies
# • Educational insights about CNI performance impact
Performance Optimization Best Practices
VM Configuration Best Practices
- CPU Optimization (see the sketch after this list):
  - Use dedicatedCpuPlacement: true for guaranteed CPU access
  - Match the VM vCPU count to the NUMA topology
  - Use the host-model CPU model for compatibility (or host-passthrough if supported)
  - Consider specific CPU models (e.g., Haswell-noTSX) for consistent behavior across environments
- Memory Optimization:
  - Configure HugePages to reduce TLB misses
  - Align memory allocation with the NUMA topology
  - Disable memory overcommit for predictable performance
- Storage Optimization:
  - Use high-performance storage classes
  - Configure appropriate I/O schedulers
  - Consider local storage for ultra-low latency
- Network Optimization:
  - Use SR-IOV for direct hardware access
  - Configure multiple network interfaces for traffic separation
  - Optimize network policies for minimal overhead
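As a reference for the CPU and memory items above, here is an illustrative fragment of the corresponding VMI domain settings. Values such as the core count and HugePages size are examples, not workshop defaults; match them to your node's NUMA topology and available HugePages.

spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPlacement: true   # Guaranteed, pinned CPUs (requires the static CPU Manager policy)
      model: host-model             # Or host-passthrough if supported, or a named model such as Haswell-noTSX
      numa:
        guestMappingPassthrough: {} # Align guest NUMA topology with the host (needs HugePages + dedicated CPUs)
    memory:
      hugepages:
        pageSize: 1Gi               # Back guest memory with 1Gi HugePages
    resources:
      requests:
        memory: 4Gi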
Monitoring and Validation
- Key Metrics to Monitor (see the spot-check commands after this list):
  - VMI startup latency (target: < 90 seconds for SNO)
  - Network policy enforcement latency (target: < 10 seconds for SNO)
  - CPU utilization and isolation effectiveness
  - Memory allocation and HugePages usage
- Performance Validation Tools:
  - kube-burner for comprehensive latency testing
  - iperf3 for network throughput testing
  - stress-ng for CPU and memory stress testing
  - fio for storage performance testing
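A few quick node-level spot-checks pair well with these metrics (a sketch; substitute your node name):

# HugePages allocation and how many are currently in use
oc debug node/<node-name> -- chroot /host grep -i huge /proc/meminfo

# CPU Manager pinning assignments (present when the static policy is enabled, e.g. by a performance profile)
oc debug node/<node-name> -- chroot /host cat /var/lib/kubelet/cpu_manager_state

# Overall node CPU and memory utilization during tests
oc adm top nodes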
Module Summary
This module covered low-latency virtualization with OpenShift Virtualization:
- ✅ Verified OpenShift Virtualization deployment from Module 2
- ✅ Configured high-performance VMs with dedicated CPUs and HugePages
- ✅ Measured VMI startup latency using kube-burner's vmiLatency measurement
- ✅ Tested network policy performance with the netpolLatency measurement
- ✅ Compared VM vs container performance to understand trade-offs
- ✅ Implemented SR-IOV networking for ultra-low latency VM traffic
Key Performance Insights
Metric | Without Performance Profile | With Performance Profile | Improvement |
---|---|---|---|
Fedora VMI Startup (P99) | 90-150 seconds | 60-90 seconds | ~30-40% faster |
Network Policy Latency (P99) | 10-20 seconds | 5-10 seconds | ~50% faster |
VM vs Pod Startup | 15-25x slower | 10-15x slower | Reduced overhead |
CPU Consistency | Variable performance | Predictable performance | Eliminated jitter |
Memory Latency | Standard pages | HugePages optimization | Reduced TLB misses |
Key Architectural Learning Points
VirtualMachine vs VirtualMachineInstance Usage Patterns:
Use Case | Object Type | Management | Best For |
---|---|---|---|
Production Workloads | VirtualMachine | Full lifecycle management | Long-running VMs, interactive use |
Performance Testing | VirtualMachineInstance | Direct creation, ephemeral | Automated testing, precise metrics |
Development/Testing | VirtualMachine | Start/stop capability | Development environments |
Latency Measurement | VirtualMachineInstance | No controller overhead | Pure hypervisor performance |
What You Learned:

- ✅ Architecture: VMs create and manage VMIs, but VMIs can exist independently
- ✅ Performance Testing: Direct VMI creation eliminates management overhead
- ✅ Measurement Precision: kube-burner measures pure hypervisor startup time
- ✅ Real-world Usage: Production typically uses VMs for lifecycle management
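To make the distinction concrete, here is a minimal, illustrative pair of manifests. The names and the containerDisk image are assumptions for illustration only, not objects used elsewhere in this workshop.

# A VirtualMachineInstance created directly: no lifecycle controller, which is what kube-burner measures
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: perf-test-vmi
spec:
  domain:
    devices:
      disks:
        - name: rootdisk
          disk:
            bus: virtio
    resources:
      requests:
        memory: 1Gi
  volumes:
    - name: rootdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest   # Example containerDisk image
---
# A VirtualMachine wraps a nearly identical spec under spec.template and adds start/stop lifecycle management
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: production-vm
spec:
  runStrategy: Always      # The controller keeps a VMI running and recreates it if it stops
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest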
Performance Profile Impact on VMs

The performance improvements from Module 4 are even more significant for VMs than for containers because VMs run through an additional virtualization layer (QEMU/KVM): dedicated CPU placement removes scheduling jitter from hypervisor threads, HugePages reduce the cost of the extra guest-to-host memory translation, and NUMA alignment keeps vCPUs close to the memory they use.

Consider completing Module 4 to see these benefits in action!
SNO Environment Considerations
Performance Characteristics:

- Single Node: All workloads compete for the same resources
- Control Plane Overhead: Master components consume CPU and memory
- Storage Limitations: A single storage backend affects VM boot times
- Network Simplicity: Reduced network complexity, but shared bandwidth

Optimization Strategies:

- Resource Allocation: Careful CPU and memory allocation for VMs
- Test Scaling: Reduced test scale to prevent resource exhaustion
- Performance Profiles: Even more important in resource-constrained environments
- Monitoring: Close monitoring of resource utilization during tests
Troubleshooting Common Issues
PVC Binding Conflicts:
# Check for PVC binding issues across all namespaces
oc get events --all-namespaces | grep -i "bound incorrectly"
# Clean up orphaned PVCs if needed
oc get pvc --all-namespaces | grep -E "(Pending|Lost)"
VM Startup Issues:
# Check VM status and events
oc describe vm <vm-name> -n <namespace>
# Check DataVolume import progress
oc get dv -n <namespace> -w
# Check CDI operator logs if DataVolume import fails
oc logs -n openshift-cnv deployment/cdi-deployment
# Check virt-launcher pod logs for VM startup issues
oc logs -n <namespace> -l kubevirt.io/created-by=<vm-name>
CPU Model Compatibility Issues:
# If you see "unsupported configuration: CPU mode 'host-passthrough'" error:
# Check which CPU models the node supports (labels added by the KubeVirt node labeller)
oc get node $(oc get nodes -o jsonpath='{.items[0].metadata.name}') -o json | jq -r '.metadata.labels | keys[]' | grep '^cpu-model'
# The workshop uses 'host-model' for better compatibility
# If issues persist, you can use a specific CPU model:
# model: "Haswell-noTSX" or model: "Skylake-Client"
# Check hypervisor capabilities
oc debug node/<node-name> -- chroot /host cat /proc/cpuinfo | head -20
Resource Constraints:
# Monitor node resource usage during tests
oc adm top nodes
# Check for resource pressure
oc describe node <node-name> | grep -A 10 "Conditions:"
Workshop Progress
- ✅ Module 1: Low-latency fundamentals and concepts
- ✅ Module 2: RHACM and GitOps deployment automation
- ✅ Module 3: Baseline performance measurement and analysis
- ✅ Module 4: Performance tuning with CPU isolation (optional but recommended)
- ✅ Module 5: Low-latency virtualization with OpenShift Virtualization (current)
- 🎯 Next: Module 6 - Monitoring, alerting, and continuous validation
Performance Comparison Opportunity

If you completed this module without performance profiles from Module 4:

1. Record your current VMI performance results from the Python analysis
2. Go back and complete Module 4 to configure performance profiles
3. Return and re-run the VMI tests to see the performance improvement
4. Compare the results to understand the impact of performance tuning on virtualization

This approach provides valuable insights into the performance benefits of proper cluster tuning for virtualized workloads.
Next Steps
In Module 6, you'll learn to:

- Set up comprehensive performance monitoring
- Create alerting for performance regressions
- Validate optimizations across the entire stack
- Implement continuous performance testing
Knowledge Check
- What are the key differences between VM and container startup latency in terms of performance characteristics?
- How does SR-IOV improve network performance for VMs compared to traditional networking?
- What network policy latency thresholds are acceptable for production workloads in SNO environments?
- How do you configure a VM for maximum CPU performance using dedicated CPU placement?
- What are the trade-offs between VM isolation and performance overhead?