Module 5: Low-Latency Virtualization

Module Overview

This module focuses on optimizing virtual machines for low-latency performance using OpenShift Virtualization. You’ll learn how to configure VMs with dedicated CPUs, HugePages, and SR-IOV networking, then validate performance improvements using advanced kube-burner measurements.

Prerequisites

  • Completed Module 3 (Baseline performance metrics collected)

  • OpenShift Virtualization operator installed (from Module 2)

  • Single Node OpenShift (SNO) or multi-node cluster

  • Optional: Completed Module 4 (Performance Profiles for enhanced VM performance)

Key Learning Objectives

  • Configure OpenShift Virtualization for low-latency workloads

  • Optimize Virtual Machine Instances (VMIs) with dedicated resources

  • Implement SR-IOV networking for high-performance VM networking

  • Measure VMI startup and network latency using kube-burner

  • Validate network policy performance in virtualized environments

  • Compare VM performance against containerized workloads

OpenShift Virtualization Overview

OpenShift Virtualization enables running virtual machines alongside containers on the same OpenShift cluster, providing:

  • Unified Management: VMs and containers managed through the same platform

  • Performance Optimization: CPU pinning, HugePages, and NUMA alignment

  • Advanced Networking: SR-IOV, Multus, and high-performance networking

  • Live Migration: Zero-downtime VM migration between nodes

  • Security: VM isolation with OpenShift security policies

Architecture Components

Component                         | Purpose                       | Low-Latency Features
KubeVirt                          | VM management engine          | CPU pinning, dedicated resources
Containerized Data Importer (CDI) | VM disk image management      | Optimized storage provisioning
Multus CNI                        | Multiple network interfaces   | SR-IOV and high-performance networking
Node Feature Discovery            | Hardware capability detection | NUMA topology awareness

Verifying OpenShift Virtualization Installation

OpenShift Virtualization was deployed in Module 2 via GitOps. Let’s verify it’s ready for low-latency workloads.

  1. Check if OpenShift Virtualization is installed and ready:

      # Check the HyperConverged operator status
    oc get hyperconverged -n openshift-cnv
    
      # Verify virtualization components are running
    oc get pods -n openshift-cnv --field-selector=status.phase=Running | head -10
    
      # Check if KVM virtualization is available on the cluster
    oc get nodes -o jsonpath='{.items[*].status.allocatable.devices\.kubevirt\.io/kvm}' | grep -q "1k" && echo "✅ KVM available on cluster nodes" || echo "❌ KVM not available"
    
      # Verify the operator CSV status
    oc get csv -n openshift-cnv | grep kubevirt-hyperconverged
    
      # Check available VM templates
    echo "Available Fedora VM templates:"
    oc get templates -n openshift --field-selector metadata.name=fedora-server-small
  2. Check the cluster environment and available resources:

      # Check cluster node configuration
    echo "--- Cluster Node Information ---"
    oc get nodes -o wide
    
      # Check available CPU resources
    echo ""
    echo "--- CPU Resources ---"
    oc debug node/$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- chroot /host nproc
    
      # Check if Fedora VM DataSource is available
    echo ""
    echo "--- Available VM DataSources ---"
    oc get datasource -n openshift-virtualization-os-images | grep fedora
  3. Check current performance profile status (may not exist yet):

    # Check if performance profile exists (from Module 4)
    echo "--- Performance Profile Status ---"
    PERF_PROFILES=$(oc get performanceprofile --no-headers 2>/dev/null | wc -l)
    if [ "$PERF_PROFILES" -gt 0 ]; then
        echo "✅ Performance profile found:"
        oc get performanceprofile -o custom-columns=NAME:.metadata.name,ISOLATED:.spec.cpu.isolated,RESERVED:.spec.cpu.reserved
    
        # Get current HugePages from Performance Profile
        echo ""
        echo "--- Current HugePages Configuration ---"
        PROFILE_NAME=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}')
        HUGEPAGES_COUNT=$(oc get performanceprofile "$PROFILE_NAME" -o jsonpath='{.spec.hugepages.pages[0].count}' 2>/dev/null || echo "0")
        echo "Performance Profile HugePages: ${HUGEPAGES_COUNT}GB"
    
        if [ "$HUGEPAGES_COUNT" -lt 8 ]; then
            echo ""
            echo "⚠️  Current HugePages (${HUGEPAGES_COUNT}GB) may be insufficient for Module 5"
            echo "   Module 4 allocates minimal HugePages (1GB) for demonstration"
            echo "   Module 5 needs more HugePages to run multiple VMs"
            echo ""
            echo "💡 Recommended HugePages for Module 5:"
            echo "   • SNO (64GB+ RAM): 8-16GB HugePages"
            echo "   • Multi-Node (64GB+ RAM): 16-24GB HugePages"
            echo ""
            echo "   The next step will update HugePages automatically!"
        else
            echo "✅ HugePages sufficient for Module 5 (${HUGEPAGES_COUNT}GB)"
        fi
    else
        echo "⚠️  No performance profile found"
        echo "   This is expected if Module 4 hasn't been completed yet"
        echo "   VMI tests will use default cluster resources"
        echo ""
        echo "💡 Want to see enhanced VM performance?"
        echo "   You can go back to Module 4 to configure performance profiles"
        echo "   This will enable:"
        echo "   • CPU isolation and dedicated CPU placement for VMs"
        echo "   • HugePages for reduced memory latency"
        echo "   • NUMA alignment for optimal performance"
        echo "   • Significant improvement in VMI startup times"
        echo ""
        echo "   After completing Module 4, return here to see the performance difference!"
    fi

    Understanding HugePages Allocation:

    • Module 4: Allocates 1GB HugePages (minimal for demonstration)

    • Module 5: Needs 8-16GB HugePages (for running multiple VMs)

    If you see "HugePages may be insufficient", don’t worry! The next step will automatically update HugePages to the optimal amount for your cluster.

    Why the difference?

    • Module 4 focuses on demonstrating performance tuning concepts

    • Module 5 focuses on running actual VMs with realistic workloads

    • The scripts automatically handle the transition between modules

  4. Update HugePages allocation for VMI testing (if Performance Profile exists):

    # Update HugePages to support multiple VMs
    bash ~/low-latency-performance-workshop/scripts/module05-update-hugepages.sh

    What This Script Does:

    • Detects current HugePages allocation from Module 4

    • Calculates optimal HugePages based on total memory and cluster type

    • Accounts for VMI overhead: Each VMI needs ~3GB (2GB guest + 1GB virt-launcher)

    • Updates Performance Profile if more HugePages are needed

    • Triggers node reboot if changes are required

    Why Update HugePages?

    Module 4 allocates minimal HugePages (1GB) for demonstration purposes. Module 5 needs more HugePages to run multiple VMs:

    • 1GB HugePages: Only 1 small VM possible

    • 12GB HugePages: 4 VMs with 2GB memory each

    • 24GB HugePages: 8 VMs with 2GB memory each

    • 32GB HugePages: 10+ VMs with 2GB memory each (covers the Module 5 default 10-VMI test)

    Updated Allocation Strategy:

    • SNO (64GB+ RAM): 24GB HugePages (~8 VMIs)

    • SNO (32-64GB RAM): 12GB HugePages (~4 VMIs)

    • Multi-Node (128GB+ worker): 48GB HugePages (~16 VMIs)

    • Multi-Node (64-128GB worker): 32GB HugePages (~10 VMIs)

    The script automatically calculates the optimal allocation for your cluster!
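
    For reference, the change the script makes ultimately lands in the Performance Profile's hugepages stanza. A minimal sketch of the equivalent manual update, assuming a profile named performance-profile and a 24GB target (your profile name and count will differ):

    # Hypothetical manual equivalent of what the script automates
    oc patch performanceprofile performance-profile --type=json \
      -p '[{"op": "replace", "path": "/spec/hugepages/pages/0/count", "value": 24}]'

    # Resulting spec fragment:
    #   hugepages:
    #     defaultHugepagesSize: 1G
    #     pages:
    #       - size: 1G
    #         count: 24

    As with the script, changing this value causes the Node Tuning Operator to reconfigure and reboot the affected node.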

    If Node Reboots:

    This is expected and required for HugePages changes. Wait 10-15 minutes for the node to come back online, then continue with the next step.

    If No Performance Profile:

    The script will inform you that no Performance Profile exists and suggest completing Module 4 first for enhanced VM performance.

  5. Validate resources before testing (Important learning step!):

    # Validate that your cluster has sufficient resources for VMI testing
    bash ~/low-latency-performance-workshop/scripts/module05-validate-vmi-resources.sh

    Why This Validation Step is Critical:

    This is a key learning opportunity that demonstrates real-world capacity planning for virtualized workloads!

    What You’ll Learn:

    • Resource Calculation: How to calculate VMI memory requirements including overhead

    • Capacity Planning: How many VMs your cluster can support

    • Pre-Flight Validation: Why validating resources before deployment prevents failures

    • Troubleshooting: How to identify and fix resource constraints

    What the Script Validates:

    1. HugePages Availability: Checks if sufficient HugePages are allocated

    2. VMI Capacity: Calculates max concurrent VMIs based on available resources

    3. Test Scale Validation: Verifies default test (10 VMIs) will succeed

    4. CPU Isolation: Validates sufficient isolated CPUs for dedicated placement

    5. Recommendations: Provides specific guidance if resources are insufficient

    Understanding VMI Memory Requirements:

    Each VMI requires more memory than just the guest allocation:

    VMI Guest Memory:        2GB  (configured in VMI spec)
    virt-launcher overhead:  1GB  (KubeVirt management pod)
    ────────────────────────────
    Total per VMI:           3GB

    Example Calculation:

    Default test: 10 VMIs × 3GB = 30GB required
    Your cluster: 16GB HugePages available
    Result: ❌ Insufficient! (the default 10-VMI test needs 30GB)
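
    The same check can be approximated by hand. A minimal sketch (illustrative only; the workshop validation script is more thorough):

    # Planned scale and per-VMI cost (2GB guest + ~1GB virt-launcher overhead)
    VMIS=10
    REQUIRED=$((VMIS * 3))
    # Allocatable 1Gi HugePages on the (single) node
    AVAILABLE=$(oc get node -o jsonpath='{.items[0].status.allocatable.hugepages-1Gi}' | sed 's/Gi//')
    echo "Required: ${REQUIRED}GB | Available: ${AVAILABLE:-0}GB"
    if [ "${AVAILABLE:-0}" -ge "$REQUIRED" ]; then
        echo "✅ Enough HugePages for $VMIS VMIs"
    else
        echo "❌ Insufficient - increase HugePages or reduce the test scale"
    fi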

    If Validation Fails:

    The script will provide specific recommendations:

    • Option 1: Increase HugePages allocation (recommended)

    • Option 2: Reduce test scale to match available resources

    • Option 3: Run without HugePages (reduced performance)

    Real-World Application:

    This validation process mirrors production capacity planning:

    • ✅ Always validate resources before deploying VMs

    • ✅ Account for overhead (virt-launcher, QEMU, etc.)

    • ✅ Plan for headroom (don’t use 100% of resources)

    • ✅ Monitor and adjust based on actual usage

    This is exactly what you’d do in production before deploying VMs!

VM Optimization for Low-Latency

Understanding VM Performance Characteristics

Virtual machines have different performance characteristics compared to containers:

  • Boot Time: VMs require OS initialization (typically 30-60 seconds)

  • Resource Overhead: Hypervisor and guest OS consume additional resources

  • I/O Path: Additional virtualization layer affects storage and network performance

  • Memory Management: Guest OS memory management plus hypervisor overhead

Low-Latency VM Configuration

CPU Optimization

Feature        | Purpose                          | Configuration
CPU Pinning    | Dedicated CPU cores for VM       | dedicatedCpuPlacement: true
NUMA Alignment | Memory and CPU on same NUMA node | Automatic with performance profile
CPU Model      | Host CPU features exposed to VM  | cpu.model: host-model (compatible) or host-passthrough (if supported)
CPU Topology   | Optimal vCPU to pCPU mapping     | Match host topology

Memory Optimization

Feature           | Purpose                              | Configuration
HugePages         | Reduced TLB misses                   | hugepages.pageSize: 1Gi
Memory Backing    | Shared memory optimization           | memoryBacking.hugepages
NUMA Policy       | Memory locality                      | numaPolicy: preferred
Memory Overcommit | Disabled for predictable performance | memoryOvercommitPercentage: 100
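
In a KubeVirt VMI spec these settings come together under spec.domain. A minimal illustrative fragment (not a complete manifest; the full templates used later in this module show the complete objects):

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
spec:
  domain:
    cpu:
      cores: 2
      dedicatedCpuPlacement: true   # CPU pinning - requires a Performance Profile with isolated CPUs
      model: host-model             # or host-passthrough where the hardware supports it
    memory:
      guest: 2Gi
      hugepages:
        pageSize: 1Gi               # must match the HugePages size configured on the node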

Creating VMs for Performance Testing

Instead of creating a custom template, we’ll use the existing Fedora template and customize it for our performance testing needs.

  1. Create a performance-optimized Fedora VM for testing:

      # Create a namespace for our VM testing
    oc new-project vmi-performance-test || oc project vmi-performance-test
    
      # Clean up any existing VMs to avoid PVC conflicts
    echo "🧹 Cleaning up any existing performance test VMs..."
    oc delete vm --selector=app=vmi-performance-test --ignore-not-found=true
    oc delete dv --selector=app=vmi-performance-test --ignore-not-found=true
    
      # Wait a moment for cleanup to complete
    sleep 5
    
      # Create a Fedora VM using the existing template with performance optimizations
      # Generate unique name to avoid PVC conflicts
    VM_NAME="fedora-perf-$(date +%s)"
    echo "Creating VM with unique name: $VM_NAME"
    
    cat << EOF | oc apply -f -
    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: $VM_NAME
      labels:
        app: vmi-performance-test
        vm.kubevirt.io/template: fedora-server-small
    spec:
      dataVolumeTemplates:
      - apiVersion: cdi.kubevirt.io/v1beta1
        kind: DataVolume
        metadata:
          name: $VM_NAME
        spec:
          sourceRef:
            kind: DataSource
            name: fedora
            namespace: openshift-virtualization-os-images
          storage:
            resources:
              requests:
                storage: 30Gi
      runStrategy: Manual
      template:
        metadata:
          labels:
            kubevirt.io/domain: $VM_NAME
            kubevirt.io/size: small
        spec:
          domain:
            cpu:
              cores: 2
              sockets: 1
              threads: 1
              # Enable performance features if performance profile exists
              dedicatedCpuPlacement: false  # Will be enabled conditionally
              model: host-model  # More compatible than host-passthrough
            memory:
              guest: 2Gi
              # HugePages will be enabled conditionally based on availability
            devices:
              disks:
              - disk:
                  bus: virtio
                name: rootdisk
              - disk:
                  bus: virtio
                name: cloudinitdisk
              interfaces:
              - masquerade: {}
                model: virtio
                name: default
              rng: {}
            features:
              smm:
                enabled: true
            firmware:
              bootloader:
                efi: {}
          networks:
          - name: default
            pod: {}
          terminationGracePeriodSeconds: 180
          volumes:
          - dataVolume:
              name: $VM_NAME
            name: rootdisk
          - cloudInitNoCloud:
              userData: |
                #cloud-config
                user: fedora
                password: workshop123
                chpasswd: { expire: False }
                packages:
                  - qemu-guest-agent
                runcmd:
                  - systemctl enable --now qemu-guest-agent
                  - echo "VM ready for performance testing" > /tmp/vm-ready
            name: cloudinitdisk
    EOF
    
    echo "✅ Fedora VM '$VM_NAME' created for performance testing"
    
      # Verify the VM and DataVolume were created
    echo ""
    echo "📋 Verifying VM creation:"
    oc get vm $VM_NAME
    echo ""
    echo "📋 Verifying DataVolume creation:"
    oc get dv $VM_NAME
    
      # Check for any PVC binding issues
    echo ""
    echo "📋 Checking for PVC issues:"
    if oc get events -n vmi-performance-test | grep -i "bound incorrectly\|pvc.*conflict" >/dev/null 2>&1; then
        echo "⚠️  PVC binding issues detected. This may be due to duplicate VM names."
        echo "   The cleanup step above should have resolved this."
        echo "   If issues persist, check: oc get events -n vmi-performance-test"
    else
        echo "✅ No PVC binding issues detected"
    fi

Troubleshooting PVC Conflicts

If you encounter PVC binding errors like "Two claims are bound to the same volume, this one is bound incorrectly", this typically happens when:

  1. Duplicate VM names: Multiple VMs created with the same name

  2. Incomplete cleanup: Previous test runs left resources behind

Resolution steps:

  # Clean up all performance test VMs and DataVolumes
oc delete vm --selector=app=vmi-performance-test --ignore-not-found=true
oc delete dv --selector=app=vmi-performance-test --ignore-not-found=true

  # Wait for cleanup to complete
sleep 10

  # Check for any remaining PVCs
oc get pvc -n vmi-performance-test

  # If PVCs remain, delete them manually
oc delete pvc <pvc-name> -n vmi-performance-test

VMI Latency Testing with Kube-burner

Now let’s measure Virtual Machine Instance startup performance using kube-burner’s VMI latency measurement capabilities. We’ll adapt the test for our SNO environment.

Understanding VirtualMachine vs VirtualMachineInstance Architecture

This is a crucial concept for understanding OpenShift Virtualization performance testing:

What exists in our cluster (example output after the kube-burner test later in this module):

  # Check VirtualMachine objects (high-level management)
oc get VirtualMachine -A
  # Example: vmi-performance-test/fedora-perf-<timestamp> (1 VM)

  # Check VirtualMachineInstance objects (actual running VMs)
oc get VirtualMachineInstance -A
  # Example: 11 VMIs total (1 managed by the VM + 10 direct VMIs)

Two Different Approaches:

  1. VirtualMachine (VM) Approach - Used for the fedora-perf VM created earlier:

    • Higher-level management object

    • Persistent lifecycle - Can start/stop/restart

    • Production-ready - Survives cluster restarts

    • Creates VMI automatically when started

    • Use case: Interactive testing, production workloads

  2. VirtualMachineInstance (VMI) Approach - Used by kube-burner:

    • Direct hypervisor objects - No management layer

    • Ephemeral - Once deleted, they’re gone

    • Pure performance testing - No controller overhead

    • Created directly by kube-burner templates

    • Use case: Automated latency measurements

Why kube-burner uses direct VMIs:

  • ✅ Precise timing - Measures pure hypervisor startup

  • ✅ No controller overhead - Eliminates VM management latency

  • ✅ Consistent results - No management layer variability

  • ✅ Automated testing - Perfect for ephemeral performance tests

Architecture Relationship:

Production Usage:    VirtualMachine → creates/manages → VirtualMachineInstance
Performance Testing: kube-burner → creates directly → VirtualMachineInstance

This architectural difference is why you see different objects in different namespaces!

  1. Verify the architectural difference yourself:

      # Compare the two approaches in your cluster
    echo "--- VirtualMachine Objects (Management Layer) ---"
    oc get VirtualMachine -A
    echo ""
    echo "--- VirtualMachineInstance Objects (Running VMs) ---"
    oc get VirtualMachineInstance -A
    echo ""
    echo "--- Owner Relationships ---"
    echo "VM-managed VMI (has owner reference):"
    # Look up the VM created earlier; its VMI only exists once the VM has been started
    VM_NAME=$(oc get vm -n vmi-performance-test -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
    oc get vmi "$VM_NAME" -n vmi-performance-test -o jsonpath='{.metadata.ownerReferences[0].kind}' 2>/dev/null && echo " ← Managed by VirtualMachine" || echo "No VMI found for $VM_NAME (start the VM first)"
    
    echo ""
    echo "Direct VMI (no owner reference):"
    OWNER=$(oc get vmi fedora-vmi-0-1 -n vmi-latency-test-0 -o jsonpath='{.metadata.ownerReferences}' 2>/dev/null)
    if [ -n "$OWNER" ]; then
        echo "Has owner reference"
    else
        echo "No owner reference ← Created directly by kube-burner"
    fi
  2. Create a VMI-specific kube-burner configuration adapted for SNO:

    cd ~/kube-burner-configs
    
    cat << EOF > vmi-latency-config.yml
    global:
      measurements:
        - name: vmiLatency
          thresholds:
            - conditionType: VMIRunning
              metric: P99
              threshold: 90000ms  # Increased for SNO environment
            - conditionType: VMIScheduled
              metric: P99
              threshold: 60000ms  # Increased for SNO environment
    
    metricsEndpoints:
      - indexer:
          type: local
          metricsDirectory: collected-metrics-vmi
    
    jobs:
      - name: vmi-latency-test
        jobType: create
        jobIterations: 5  # Reduced for SNO environment
        namespace: vmi-latency-test
        namespacedIterations: true
        cleanup: false
        podWait: false
        waitWhenFinished: true
        verifyObjects: true
        errorOnVerify: false
        objects:
          - objectTemplate: fedora-vmi.yml
            replicas: 2  # Small scale for SNO
    EOF
  3. Create the Fedora VMI template for testing:

      # Create VMI template using containerDisk for faster, ephemeral testing
      # This approach is ideal for performance testing as it doesn't require PVC provisioning
    echo "Creating Fedora VMI template for kube-burner testing"
    
    cat << EOF > fedora-vmi.yml
    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: fedora-vmi-{{.Iteration}}-{{.Replica}}
      labels:
        app: vmi-latency-test
        iteration: "{{.Iteration}}"
    spec:
      # No nodeSelector for SNO - will schedule on the single node
      domain:
        cpu:
          cores: 1
          sockets: 1
          threads: 1
          # Performance features will be enabled conditionally
          # Using host-model instead of host-passthrough for better compatibility
          model: host-model
        memory:
          guest: 2Gi  # Minimum required for Fedora
          # HugePages will be added conditionally if available
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
          - name: cloudinitdisk
            disk:
              bus: virtio
          interfaces:
          - name: default
            masquerade: {}
            model: virtio
          rng: {}
        features:
          smm:
            enabled: true
        firmware:
          bootloader:
            efi: {}
      networks:
      - name: default
        pod: {}
      terminationGracePeriodSeconds: 180
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/containerdisks/fedora:latest
      - name: cloudinitdisk
        cloudInitNoCloud:
          userData: |
            #cloud-config
            user: fedora
            password: workshop123
            chpasswd: { expire: False }
            bootcmd:
              - "echo 'Fedora VMI started at' \$(date) > /tmp/vmi-start-time"
    EOF

    Why we use containerDisk instead of DataVolumes for performance testing

    For kube-burner performance testing, we use containerDisk instead of DataVolumes because:

    1. Faster startup: No PVC provisioning or DataVolume import delays

    2. Simpler template: Single VMI object instead of VMI + DataVolume

    3. Ephemeral by design: Perfect for performance testing where persistence isn’t needed

    4. Consistent results: No storage backend variability affecting measurements

    containerDisk approach:

    volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest

    DataVolume approach (for production VMs):

    volumes:
    - name: rootdisk
      dataVolume:
        name: my-vm-disk
    ---
    apiVersion: cdi.kubevirt.io/v1beta1
    kind: DataVolume
    metadata:
      name: my-vm-disk
    spec:
      sourceRef:
        kind: DataSource
        name: fedora
        namespace: openshift-virtualization-os-images

    For this performance testing module, containerDisk provides the most accurate VMI startup measurements!

  4. Configure VMI with optimal performance settings:

    # Generate optimized VMI configuration
    bash ~/low-latency-performance-workshop/scripts/module05-configure-vmi.sh

    What This Script Does:

    • Auto-detects Performance Profile availability

    • Auto-detects HugePages configuration

    • Generates optimized VMI YAML with:

      • CPU pinning (if Performance Profile exists)

      • HugePages (if available)

      • Appropriate CPU model (host-passthrough or host-model)

      • Educational comments

    Benefits of Using the Script:

    • Dynamic Configuration: Adapts to your cluster’s capabilities

    • Educational Feedback: Explains what features are enabled and why

    • Flexible Options: Customize VMI name, memory, CPUs, namespace

    • Consistent Results: Same configuration across different clusters

    Script Options:

    • --name NAME: VMI name (default: fedora-vmi)

    • --namespace NS: Namespace (default: default)

    • --memory SIZE: Memory size (default: 2Gi)

    • --cpus NUM: Number of CPUs (default: 2)

    • --output FILE: Output file (default: fedora-vmi.yml)

    Example with Custom Settings:

    bash ~/low-latency-performance-workshop/scripts/module05-configure-vmi.sh \
      --name my-vm \
      --memory 4Gi \
      --cpus 4 \
      --output my-vm.yml

    If No Performance Profile:

    The script will generate a VMI configuration with default settings and provide guidance on completing Module 4 for enhanced performance.

  5. Clean up any existing VMI test resources before starting:

      # Clean up any existing VMI test resources to avoid PVC conflicts
    echo "🧹 Cleaning up any existing VMI test resources..."
    oc delete vmi --selector=app=vmi-latency-test --all-namespaces --ignore-not-found=true
    oc delete dv --selector=app=vmi-latency-test --all-namespaces --ignore-not-found=true
    
      # Wait for cleanup to complete
    sleep 5
    
    echo "✅ Cleanup completed"
  6. Run the VMI latency test using the corrected configuration:

      # Execute the VMI latency test with containerDisk approach
    echo "Starting Fedora VMI latency performance test..."
    echo "   Test approach: Direct VMI creation with containerDisk (no PVC provisioning)"
    echo "   Test scale: 5 iterations × 2 replicas = 10 VMIs total"
    echo "   Environment: Single Node OpenShift (SNO)"
    echo "   Unique namespaces: vmi-latency-test-0 through vmi-latency-test-4"
    echo ""
    
    kube-burner init -c vmi-latency-config.yml --log-level=info
    
      # The test will:
      # 1. Create VMIs directly in each namespace using containerDisk
      # 2. Measure pure VMI startup latency (no storage provisioning overhead)
      # 3. Track VMI lifecycle phases from creation to running
      # 4. Generate performance metrics in collected-metrics-vmi/
  7. Understanding the test results:

    The kube-burner test measures several key VMI startup phases:

      # View the key metrics from the test
    echo "VMI Latency Test Results Summary:"
    echo ""
    echo "Key Metrics Measured:"
    echo "• VMICreated: Time to create VMI object (should be ~0ms)"
    echo "• VMIPending: Time VMI spends in Pending state"
    echo "• VMIScheduling: Time to schedule VMI to a node"
    echo "• VMIScheduled: Time until VMI is scheduled (containerDisk pull + pod creation)"
    echo "• VMIRunning: Total time until VMI is fully running (includes OS boot)"
    echo ""
    echo "Expected Results for SNO Environment with containerDisk:"
    echo "• VMIScheduled P99: ~30-45 seconds (container image pull + pod start)"
    echo "• VMIRunning P99: ~45-60 seconds (full VM boot from containerDisk)"
    echo "• VMIScheduling P99: <1 second (fast on SNO)"
    echo ""
    echo "📁 Detailed metrics saved in: collected-metrics-vmi/"
    ls -la collected-metrics-vmi/
  8. Monitor VMI creation progress:

      # Watch VMIs being created (press Ctrl+C to exit watch)
    echo "Monitoring VMI creation progress..."
    echo "   Use Ctrl+C to exit the watch command when test completes"
    echo ""
    
      # Watch VMIs and their launcher pods being created
    watch -n 5 "echo '--- VMIs ---' && oc get vmi --all-namespaces --selector=app=vmi-latency-test && echo '' && echo '--- Launcher Pods ---' && oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher | grep vmi-latency"
  9. Check VMI status and verify the architectural difference:

      # Comprehensive verification of VMI test results
    echo "=================================================="
    echo "📋 VMI Latency Test - Current Status"
    echo "=================================================="
    echo ""
    echo "✅ VirtualMachine Objects (Management Layer):"
    oc get VirtualMachine -A 2>/dev/null || echo "No VMs found"
    echo ""
    echo "✅ VirtualMachineInstance Objects (Running VMs):"
    oc get VirtualMachineInstance -A 2>/dev/null || echo "No VMIs found"
    echo ""
    echo "=================================================="
    echo "� Kube-burner Test Results"
    echo "=================================================="
    echo ""
    echo "VMIs created by kube-burner test:"
    oc get vmi --all-namespaces --selector=app=vmi-latency-test -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,IP:.status.interfaces[0].ipAddress,READY:.status.conditions[?\(@.type==\"Ready\"\)].status 2>/dev/null || echo "No test VMIs found"
    echo ""
    echo "📋 DataVolume Status (should be empty with containerDisk):"
    oc get dv --all-namespaces --selector=app=vmi-latency-test 2>/dev/null || echo "No DataVolumes found (expected with containerDisk)"
    echo ""
    echo "📋 VMI Launcher Pods:"
    oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName 2>/dev/null | grep -E "NAMESPACE|vmi-latency" || echo "No launcher pods found"
    echo ""
    echo "=================================================="
    echo "✅ Test Results Summary"
    echo "=================================================="
    TOTAL_VMIS=$(oc get vmi --all-namespaces --selector=app=vmi-latency-test --no-headers 2>/dev/null | wc -l)
    RUNNING_VMIS=$(oc get vmi --all-namespaces --selector=app=vmi-latency-test --no-headers 2>/dev/null | grep -c "Running")
    echo "Total VMIs created: $TOTAL_VMIS"
    echo "VMIs in Running phase: $RUNNING_VMIS"
    echo ""
    if [ "$RUNNING_VMIS" -eq 10 ]; then
        echo "🎉 SUCCESS! All 10 test VMIs are running!"
        echo "📊 This demonstrates direct VMI creation with containerDisk"
        echo "✅ No DataVolumes needed - faster startup for performance testing"
        echo ""
        echo "Key Observations:"
        echo "• All VMIs have IP addresses assigned"
        echo "• All VMIs are in Ready state"
        echo "• No PVC/DataVolume provisioning delays"
        echo "• Pure VMI startup latency measured"
    elif [ "$TOTAL_VMIS" -eq 10 ]; then
        echo "⚠️  All 10 VMIs created, $RUNNING_VMIS are running"
        echo "   Some may still be pulling containerDisk images"
        echo "   Check: oc get pods --all-namespaces | grep virt-launcher"
    else
        echo "⚠️  Expected 10 VMIs, found $TOTAL_VMIS"
        echo "   Review kube-burner logs for errors"
        echo ""
        echo "💡 If VMIs failed, see troubleshooting section below"
    fi

Troubleshooting VMI Failures

If your VMIs are not running successfully, this section will help you diagnose and fix common issues.

  1. Check VMI and Pod Status:

    # Get detailed status of all VMIs
    echo "=== VMI Status ==="
    oc get vmi --all-namespaces --selector=app=vmi-latency-test
    
    echo ""
    echo "=== virt-launcher Pod Status ==="
    oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher | grep vmi-latency
    
    echo ""
    echo "=== Failed/OOMKilled Pods ==="
    oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher | grep -E "OOMKilled|Error|CrashLoop" || echo "No failed pods"
  2. Diagnose OOMKilled VMIs (Most Common Issue):

    # Check if VMIs are OOMKilled due to insufficient HugePages
    echo "=== Checking for OOMKilled VMIs ==="
    OOMKILLED_COUNT=$(oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}' 2>/dev/null | grep -o "OOMKilled" | wc -l)
    
    if [ "$OOMKILLED_COUNT" -gt 0 ]; then
        echo "❌ Found $OOMKILLED_COUNT OOMKilled virt-launcher pods"
        echo ""
        echo "Root Cause: Insufficient HugePages for VMI memory + overhead"
        echo ""
        echo "Explanation:"
        echo "  • Each VMI needs: 2GB guest + 1GB virt-launcher overhead = 3GB total"
        echo "  • Test creates: 10 VMIs × 3GB = 30GB required"
        echo "  • Available HugePages: $(oc get node -o jsonpath='{.items[0].status.allocatable.hugepages-1Gi}' | sed 's/Gi//g')GB"
        echo ""
        echo "Solutions:"
        echo ""
        echo "  Option 1: Increase HugePages (Recommended)"
        echo "  ─────────────────────────────────────────"
        echo "  bash ~/low-latency-performance-workshop/scripts/module05-update-hugepages.sh"
        echo ""
        echo "  This will:"
        echo "  • Calculate optimal HugePages for your cluster"
        echo "  • Update Performance Profile"
        echo "  • Trigger node reboot (wait 10-15 minutes)"
        echo "  • Allocate sufficient HugePages for 10 VMIs"
        echo ""
        echo "  Option 2: Reduce Test Scale"
        echo "  ───────────────────────────"
        echo "  Edit ~/kube-burner-configs/vmi-latency-config.yml:"
        echo ""
        echo "  Current:"
        echo "    jobIterations: 5"
        echo "    replicas: 2"
        echo "    Total: 10 VMIs"
        echo ""
        echo "  Recommended for 16GB HugePages:"
        echo "    jobIterations: 2"
        echo "    replicas: 2"
        echo "    Total: 4 VMIs (fits in 16GB)"
        echo ""
        echo "  Then clean up and re-run:"
        echo "    oc delete vmi --selector=app=vmi-latency-test --all-namespaces"
        echo "    kube-burner init -c vmi-latency-config.yml"
        echo ""
    else
        echo "✅ No OOMKilled pods found"
    fi
  3. Check HugePages Allocation:

    # Detailed HugePages analysis
    echo "=== HugePages Allocation Analysis ==="
    echo ""
    
    # Get HugePages from Performance Profile
    PERF_PROFILE=$(oc get performanceprofile -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
    if [ -n "$PERF_PROFILE" ]; then
        HUGEPAGES_COUNT=$(oc get performanceprofile "$PERF_PROFILE" -o jsonpath='{.spec.hugepages.pages[0].count}' 2>/dev/null)
        HUGEPAGES_SIZE=$(oc get performanceprofile "$PERF_PROFILE" -o jsonpath='{.spec.hugepages.pages[0].size}' 2>/dev/null)
        echo "Performance Profile: $PERF_PROFILE"
        echo "  Configured: ${HUGEPAGES_COUNT} × ${HUGEPAGES_SIZE} = ${HUGEPAGES_COUNT}GB"
    fi
    
    echo ""
    
    # Get HugePages from node
    NODE=$(oc get nodes -o jsonpath='{.items[0].metadata.name}')
    HUGEPAGES_CAPACITY=$(oc get node "$NODE" -o jsonpath='{.status.capacity.hugepages-1Gi}' | sed 's/Gi//g')
    HUGEPAGES_ALLOCATABLE=$(oc get node "$NODE" -o jsonpath='{.status.allocatable.hugepages-1Gi}' | sed 's/Gi//g')
    
    echo "Node: $NODE"
    echo "  Capacity: ${HUGEPAGES_CAPACITY}GB"
    echo "  Allocatable: ${HUGEPAGES_ALLOCATABLE}GB"
    
    echo ""
    echo "VMI Capacity Calculation:"
    echo "  • VMI memory requirement: 3GB per VMI (2GB guest + 1GB overhead)"
    echo "  • Available HugePages: ${HUGEPAGES_ALLOCATABLE}GB"
    echo "  • Max concurrent VMIs: ~$((HUGEPAGES_ALLOCATABLE / 3))"
    echo "  • Test requires: 10 VMIs = 30GB"
    echo ""
    
    if [ "$HUGEPAGES_ALLOCATABLE" -ge 30 ]; then
        echo "✅ Sufficient HugePages for 10 VMIs"
    elif [ "$HUGEPAGES_ALLOCATABLE" -ge 24 ]; then
        echo "⚠️  Sufficient for 8 VMIs, reduce test scale to 8"
    elif [ "$HUGEPAGES_ALLOCATABLE" -ge 18 ]; then
        echo "⚠️  Sufficient for 6 VMIs, reduce test scale to 6"
    elif [ "$HUGEPAGES_ALLOCATABLE" -ge 12 ]; then
        echo "⚠️  Sufficient for 4 VMIs, reduce test scale to 4"
    else
        echo "❌ Insufficient HugePages, increase allocation to at least 24GB"
    fi
  4. Check VMI Events for Errors:

    # Check events for failed VMIs
    echo "=== Recent VMI Events ==="
    oc get events --all-namespaces --field-selector involvedObject.kind=VirtualMachineInstance --sort-by='.lastTimestamp' | tail -20
  5. View virt-launcher Pod Logs:

    # Get logs from a failed virt-launcher pod
    echo "=== virt-launcher Pod Logs (first failed pod) ==="
    FAILED_POD=$(oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o jsonpath='{.items[?(@.status.phase!="Running")].metadata.name}' | awk '{print $1}' | head -1)
    FAILED_NS=$(oc get pods --all-namespaces --selector=kubevirt.io=virt-launcher -o jsonpath='{.items[?(@.status.phase!="Running")].metadata.namespace}' | awk '{print $1}' | head -1)
    
    if [ -n "$FAILED_POD" ]; then
        echo "Pod: $FAILED_POD (namespace: $FAILED_NS)"
        echo ""
        oc logs -n "$FAILED_NS" "$FAILED_POD" --tail=50 2>/dev/null || echo "No logs available"
    else
        echo "No failed pods found"
    fi

Common VMI Failure Patterns:

  1. OOMKilled = Insufficient memory/HugePages

    • Solution: Increase HugePages or reduce test scale

  2. Scheduling (stuck) = No HugePages available

    • Solution: Increase HugePages allocation

  3. ImagePullBackOff = Cannot pull containerDisk image

    • Solution: Check network connectivity, image registry access

  4. CrashLoopBackOff = VMI starts but crashes

    • Solution: Check virt-launcher logs, verify CPU/memory settings

  5. Pending (stuck) = Cannot schedule to node

    • Solution: Check node resources, taints, tolerations

Prevention:

Always run the validation script before testing:

bash ~/low-latency-performance-workshop/scripts/module05-validate-vmi-resources.sh

This will catch resource issues before they cause failures!

Analyzing VMI Latency Results

Now let’s analyze the VMI performance results and understand what the metrics tell us about virtualization performance characteristics.

  1. Examine the VMI latency metrics generated by kube-burner:

    cd ~/kube-burner-configs
    
      # Check what metrics were generated
    echo "📊 VMI Latency Test Results:"
    ls -la collected-metrics-vmi/
    
      # View the summary of VMI latency measurements
    echo ""
    echo "📋 VMI Latency Quantiles (Key Performance Indicators):"
    echo "   All times in milliseconds (ms)"
    echo ""
    if [ -f "collected-metrics-vmi/vmiLatencyQuantilesMeasurement-vmi-latency-test.json" ]; then
        cat collected-metrics-vmi/vmiLatencyQuantilesMeasurement-vmi-latency-test.json | jq -r '.[] | "\(.quantileName) - P99: \(.P99)ms | P50: \(.P50)ms | Avg: \(.avg)ms"' | grep -v "VMReady" | sort
    else
        echo "VMI latency quantiles file not found"
    fi
    
      # Show job summary
    echo ""
    echo "📈 Test Execution Summary:"
    if [ -f "collected-metrics-vmi/jobSummary.json" ]; then
        cat collected-metrics-vmi/jobSummary.json | jq -r '.[] | "Job: \(.jobConfig.name) | Status: \(if .passed then "✅ PASSED" else "❌ FAILED" end) | Duration: \(.elapsedTime)s | QPS: \(.achievedQps)"'
    else
        echo "Job summary file not found"
    fi
  2. Analyze VMI startup phases and understand the performance characteristics:

    cd ~/kube-burner-configs
    
      # Analyze the detailed VMI latency measurements
    echo "🔍 Detailed VMI Startup Phase Analysis:"
    echo ""
    
    if [ -f "collected-metrics-vmi/vmiLatencyMeasurement-vmi-latency-test.json" ]; then
        echo "VMI Startup Phases (in chronological order):"
        echo "1. VMICreated → VMIPending: Object creation time"
        echo "2. VMIPending → VMIScheduling: Waiting for scheduling"
        echo "3. VMIScheduling → VMIScheduled: Node assignment + pod creation"
        echo "4. VMIScheduled → VMIRunning: containerDisk pull + VM boot"
        echo ""
    
        # Show actual timing data
        echo "📊 Actual Timing Results (Average across all VMIs):"
        cat collected-metrics-vmi/vmiLatencyMeasurement-vmi-latency-test.json | jq -r '
            [.[] | {
                vmiCreated: .vmiCreatedLatency,
                vmiPending: .vmiPendingLatency,
                vmiScheduling: .vmiSchedulingLatency,
                vmiScheduled: .vmiScheduledLatency,
                vmiRunning: .vmiRunningLatency,
                podCreated: .podCreatedLatency,
                podScheduled: .podScheduledLatency,
                podInitialized: .podInitializedLatency,
                podContainersReady: .podContainersReadyLatency
            }] |
            {
                vmiCreated: ([.[].vmiCreated] | add / length),
                vmiPending: ([.[].vmiPending] | add / length),
                vmiScheduling: ([.[].vmiScheduling] | add / length),
                vmiScheduled: ([.[].vmiScheduled] | add / length),
                vmiRunning: ([.[].vmiRunning] | add / length),
                podCreated: ([.[].podCreated] | add / length),
                podScheduled: ([.[].podScheduled] | add / length),
                podInitialized: ([.[].podInitialized] | add / length),
                podContainersReady: ([.[].podContainersReady] | add / length)
            } |
            to_entries |
            .[] |
            "  \(.key): \(.value | floor)ms"
        '
    
        echo ""
        echo "🎯 Performance Analysis (containerDisk approach):"
        echo "• VMICreated should be ~0ms (object creation)"
        echo "• VMIScheduling should be <2000ms (fast scheduling on SNO)"
        echo "• VMIScheduled includes containerDisk image pull time (major component)"
        echo "• VMIRunning includes full Fedora boot time from containerDisk (~45-55s typical)"
        echo ""
        echo "💡 Key Insight: With containerDisk, most time is spent pulling the container"
        echo "   image and booting the OS. No PVC provisioning or DataVolume import delays!"
    
    else
        echo "❌ VMI latency measurement file not found"
        echo "This may indicate the test didn't complete successfully"
    fi
  3. Analyze VMI performance using the main performance analyzer:

    cd ~/kube-burner-configs
    
      # Use the main performance analyzer for VMI metrics
    echo "🎓 Running VMI Performance Analysis..."
    python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
        --single collected-metrics-vmi
    
      # This analysis provides:
      # • VMI startup phase breakdown and timing analysis
      # • Performance bottleneck identification
      # • Statistical analysis of latency variations
      # • Comparison with performance thresholds
      # • Color-coded performance assessment
  4. Compare VMI performance characteristics with container baselines:

    cd ~/kube-burner-configs
    
      # Generate comprehensive comparison between VMs and containers
    echo "📊 VMI vs Container Performance Comparison..."
    
      # Check what metrics are available for comparison
    BASELINE_AVAILABLE=false
    TUNED_AVAILABLE=false
    
    if [ -d "collected-metrics" ]; then
        echo "✅ Container baseline metrics found"
        BASELINE_AVAILABLE=true
    fi
    
    if [ -d "collected-metrics-tuned" ]; then
        echo "✅ Container tuned metrics found"
        TUNED_AVAILABLE=true
    fi
    
    if [ -d "collected-metrics-vmi" ]; then
        echo "✅ VMI metrics found"
    else
        echo "❌ VMI metrics not found - check test execution above"
        exit 1
    fi
    
    echo ""
    
      # Module 5 focused analysis - VMI performance with intelligent container context
    echo "🎯 Module 5 Focused Analysis (VMI Performance with Context)..."
    python3 ~/low-latency-performance-workshop/scripts/module-specific-analysis.py 5
    
    echo ""
    echo "💡 Module 5 Learning Focus:"
    echo "   🔍 VMI startup phases and timing"
    echo "   ⚖️  Virtualization vs containerization trade-offs"
    echo "   🎯 When to choose VMs vs containers for workloads"
    if [ "$TUNED_AVAILABLE" = true ]; then
        echo "   🚀 How performance profiles benefit both VMs and containers"
    else
        echo "   ℹ️  Performance profiles (Module 4) would improve both VMs and containers"
    fi
    
    echo ""
    echo "📚 How to Read the Module 5 Analysis:"
    echo "   1. Individual sections show raw performance for each test type"
    echo "   2. VMI metrics (🖥️ section) are the focus of this module"
    echo "   3. Container metrics provide context for comparison"
    echo "   4. Look for VMI-specific phases: VMICreated → VMIPending → VMIScheduled → VMIRunning"
    
    echo ""
    echo "💡 This comparison explains:"
    echo "• Why VMs take longer to start than containers (OS boot vs process start)"
    echo "• The performance trade-offs of virtualization (isolation vs overhead)"
    echo "• When to use VMs vs containers for different workloads"
    echo "• How performance profiles affect both VMs and containers"
  5. Generate a comprehensive performance report:

    cd ~/kube-burner-configs
    
      # Generate a comprehensive markdown report with all available metrics
    echo "Generating Comprehensive Performance Report..."
    
      # Determine what metrics are available and generate appropriate report
    BASELINE_AVAILABLE=false
    TUNED_AVAILABLE=false
    VMI_AVAILABLE=false
    
    [ -d "collected-metrics" ] && BASELINE_AVAILABLE=true
    [ -d "collected-metrics-tuned" ] && TUNED_AVAILABLE=true
    [ -d "collected-metrics-vmi" ] && VMI_AVAILABLE=true
    
      # Generate Module 5 specific report with available metrics
    REPORT_FILE="module5-vmi-performance-report-$(date +%Y%m%d-%H%M).md"
    
    echo "📄 Generating Module 5 VMI Performance Report..."
    echo "   🎯 Focus: Virtual machine performance analysis"
    echo "   📊 Context: VMI startup vs container performance"
    
    if [ "$BASELINE_AVAILABLE" = true ] && [ "$TUNED_AVAILABLE" = true ] && [ "$VMI_AVAILABLE" = true ]; then
        echo "   📈 Scope: VMI + Container baseline + Container tuned"
        python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
            --baseline collected-metrics \
            --tuned collected-metrics-tuned \
            --vmi collected-metrics-vmi \
            --report "$REPORT_FILE"
    elif [ "$BASELINE_AVAILABLE" = true ] && [ "$VMI_AVAILABLE" = true ]; then
        echo "   📈 Scope: VMI + Container baseline"
        python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
            --baseline collected-metrics \
            --vmi collected-metrics-vmi \
            --report "$REPORT_FILE"
    elif [ "$VMI_AVAILABLE" = true ]; then
        echo "   📈 Scope: VMI standalone analysis"
        python3 ~/low-latency-performance-workshop/scripts/analyze-performance.py \
            --single collected-metrics-vmi \
            --report "$REPORT_FILE"
    else
        echo "❌ No VMI performance metrics found for report generation"
        exit 1
    fi
    
    echo ""
    echo "📄 Performance Report Generated: $REPORT_FILE"
    echo "📊 Report Summary:"
    if [ -f "$REPORT_FILE" ]; then
        head -20 "$REPORT_FILE"
        echo ""
        echo "💡 View the complete report: cat $REPORT_FILE"
    else
        echo "❌ Report generation failed"
    fi

SR-IOV Configuration for High-Performance VM Networking

SR-IOV (Single Root I/O Virtualization) provides direct hardware access to Virtual Machines, bypassing the software networking stack for maximum performance. This is particularly important for VMs that require near bare-metal network performance.

Lab Environment Considerations:

This workshop supports two approaches for high-performance VM networking:

  1. Production SR-IOV (requires SR-IOV capable hardware)

    • Direct hardware access via Virtual Functions (VFs)

    • <1ms latency, near line-rate throughput

    • Requires physical SR-IOV NICs

  2. Lab Simulation with User Defined Networks (works in any environment)

    • Uses OVN-Kubernetes secondary networks

    • Better performance than default pod network

    • No special hardware required

    • Recommended for lab/learning environments

This module covers both approaches so you can learn SR-IOV concepts and test in your lab environment.

Choosing Your Networking Approach

Approach                               | Use Case                            | Hardware Required   | Performance
Default Pod Network                    | Basic VMs, development              | None                | 2-5ms latency
User Defined Networks (Lab Simulation) | Lab environments, learning, testing | None                | 1-3ms latency (30-50% improvement)
SR-IOV (Production)                    | Production NFV, real-time apps      | SR-IOV capable NICs | <1ms latency (near bare-metal)

Recommendation for This Workshop:

  • Lab/Learning Environment: Use User Defined Networks (covered in detail below)

  • Production Environment: Use SR-IOV (also covered for reference)

Both approaches teach the same concepts:

  • Dual-interface VM design

  • Network separation (management vs data plane)

  • Performance optimization techniques

  • Multi-network VM architecture

Understanding SR-IOV Benefits for VMs

Feature       | VM with Pod Network                      | VM with SR-IOV
Latency       | 2-5ms (through virt-launcher pod)        | <1ms (direct hardware access)
Throughput    | 5-20 Gbps (limited by pod network)       | Near line-rate (40-100 Gbps)
CPU Usage     | Higher (virtio + pod network overhead)   | Lower (hardware offload)
Isolation     | Software-based (pod network)             | Hardware-enforced (dedicated VF)
Network Stack | VM → virtio → virt-launcher → CNI → host | VM → SR-IOV VF → physical NIC

Why SR-IOV Matters for VMs:

  • Eliminates Virtualization Overhead: VMs bypass the virt-launcher pod network entirely

  • Direct Hardware Access: Each VM gets a dedicated Virtual Function (VF) from the physical NIC

  • Predictable Performance: Hardware-enforced QoS and isolation

  • Production Workloads: Essential for NFV, real-time applications, and high-throughput VMs

SR-IOV is the key technology for achieving container-like network performance in VMs.
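
For orientation, a production SR-IOV setup generally pairs an SriovNetworkNodePolicy (which carves Virtual Functions out of a physical NIC) with an SriovNetwork that exposes them to workloads. The following is a minimal sketch only - the interface name (ens1f0), VF count, and resource name are placeholders that must match your hardware and are not part of this workshop's GitOps configuration:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-vm-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriov_vm_vfs                 # exposed to pods/VMs as openshift.io/sriov_vm_vfs
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8                                  # number of Virtual Functions to create
  nicSelector:
    pfNames: ["ens1f0"]                      # placeholder - your SR-IOV capable interface
  deviceType: vfio-pci                       # required for VM attachment
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-vm-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriov_vm_vfs
  networkNamespace: default                  # namespace where VMs attach to this network
  ipam: |
    {"type": "static"}

A VM then attaches to this network with an sriov: {} interface and a multus network entry referencing sriov-vm-network, mirroring the dual-interface pattern used in the lab simulation below.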

Verifying SR-IOV Network Operator

The SR-IOV Network Operator was deployed in Module 2. Let’s verify it’s ready for VM networking:

  1. Check SR-IOV operator status:

    # Check SR-IOV operator installation
    oc get csv -n openshift-sriov-network-operator
    
    # Verify SR-IOV operator pods
    oc get pods -n openshift-sriov-network-operator
    
    # Check if SR-IOV capable nodes are detected
    oc get sriovnetworknodestates -n openshift-sriov-network-operator
    
    # List available SR-IOV network node policies
    oc get sriovnetworknodepolicy -n openshift-sriov-network-operator
    
    # Check for SR-IOV networks configured for VMs
    oc get sriovnetwork -n openshift-sriov-network-operator

    If SR-IOV hardware is not available or the operator shows no SR-IOV capable nodes, proceed to the Lab Simulation section below to use User Defined Networks instead.

Lab Simulation: High-Performance VM Networking with User Defined Networks

For lab environments without SR-IOV hardware, we can simulate high-performance VM networking using OVN-Kubernetes User Defined Networks (also called Secondary Networks). While not as fast as SR-IOV, this provides better performance than the default pod network and demonstrates the same networking concepts.

Clean Up Previous Test VMIs

Before creating the high-performance VM, clean up VMIs from the previous kube-burner test to free HugePages:

  1. Check current VMI resource usage:

    # Check running VMIs and their HugePages usage
    echo "=== Current VMIs ==="
    oc get vmi --all-namespaces
    
    echo ""
    echo "=== HugePages Usage ==="
    oc get node -o jsonpath='{.items[0].status.allocatable.hugepages-1Gi}' | sed 's/Gi/ GB available/g'
    echo ""
    
    # Calculate VMIs using HugePages
    VMI_COUNT=$(oc get vmi --all-namespaces --no-headers 2>/dev/null | wc -l)
    if [ "$VMI_COUNT" -gt 0 ]; then
        echo "Current VMIs: $VMI_COUNT"
        echo "Estimated HugePages in use: ~$((VMI_COUNT * 3)) GB (assuming 2GB guest + 1GB overhead per VMI)"
        echo ""
        echo "⚠️  Cleanup recommended before creating new VMs"
    fi

    Why Cleanup is Important:

    Each VMI consumes HugePages memory that remains allocated even after testing completes:

    • VMI Guest Memory: 2GB per VMI (configured in VMI spec)

    • virt-launcher Overhead: ~1GB per VMI (KubeVirt management pod)

    • Total per VMI: ~3GB

    Example:

    8 running VMIs × 3GB = 24GB HugePages in use
    Available HugePages: 24GB
    Result: No HugePages available for new VMs! ❌

    Best Practice: Always clean up test VMIs before starting new VM deployments to avoid resource exhaustion.

  2. Clean up test VMIs and namespaces:

    # Delete all VMIs from kube-burner test
    echo "Cleaning up test VMIs..."
    oc delete vmi --selector=app=vmi-latency-test --all-namespaces --wait=false
    
    # Delete test namespaces
    for i in {0..4}; do
        oc delete namespace vmi-latency-test-$i --wait=false 2>/dev/null || true
    done
    
    echo ""
    echo "Cleanup initiated. Waiting for resources to be freed..."
    sleep 10
    
    # Verify cleanup
    echo ""
    echo "=== Remaining VMIs ==="
    oc get vmi --all-namespaces
    
    echo ""
    echo "=== HugePages Now Available ==="
    oc get node -o jsonpath='{.items[0].status.allocatable.hugepages-1Gi}' | sed 's/Gi/ GB available/g'
    echo ""

    If you see VMIs still terminating, wait a few moments for them to fully clean up. You can monitor with:

    watch oc get vmi --all-namespaces

    Press Ctrl+C to exit the watch command.

Create User Defined Network

Why User Defined Networks for Lab Environments:

  • No Special Hardware Required: Works on any OpenShift cluster

  • Better Performance: Dedicated network namespace, reduced overhead

  • Same Concepts: Dual-interface design, network separation

  • Production Pattern: Many production VMs use secondary networks

  • Learning Value: Understand multi-network VM architecture

Performance Comparison:

  • Default Pod Network: 2-5ms latency

  • User Defined Network: 1-3ms latency (30-50% improvement)

  • SR-IOV: <1ms latency (production target)

  1. Create a User Defined Network for high-performance VM networking:

    cat << EOF | oc apply -f -
    apiVersion: k8s.ovn.org/v1
    kind: UserDefinedNetwork
    metadata:
      name: vm-high-perf-network
      namespace: default
    spec:
      topology: Layer2
      layer2:
        role: Secondary
        subnets:
          - "192.168.100.0/24"
    EOF

    UserDefinedNetwork (UDN) - Modern OpenShift 4.18+ Approach:

    This creates a Layer2 User Defined Network using native OVN-Kubernetes integration:

    • API: k8s.ovn.org/v1 (native OVN-Kubernetes, not Multus)

    • Topology: Layer2 (recommended for VM networking)

    • Role: Secondary (additional network, not replacing pod network)

    • Subnet: 192.168.100.0/24 (automatic IPAM by OVN-K)

    • Benefits:

      • Simpler configuration than NetworkAttachmentDefinition

      • Native OVN-Kubernetes IPAM (no manual IPAM configuration needed)

      • Better integration with OpenShift Virtualization

      • Recommended approach for OpenShift 4.18+

    Why Layer2?

    • VMs can communicate at Layer2 (like a virtual switch)

    • Better for VM-to-VM communication

    • Supports VM live migration with persistent IPs

    • Simpler than Layer3 for most VM use cases

    Note: OpenShift automatically creates a corresponding NetworkAttachmentDefinition for compatibility with VMs.

  2. Verify the UserDefinedNetwork was created:

    # Check UserDefinedNetwork
    echo "=== UserDefinedNetwork ==="
    oc get userdefinednetwork vm-high-perf-network -n default -o yaml
    
    echo ""
    echo "=== Auto-Generated NetworkAttachmentDefinition ==="
    # OpenShift automatically creates a NetworkAttachmentDefinition for VM compatibility
    oc get net-attach-def vm-high-perf-network -n default
    
    echo ""
    echo "=== Network Details ==="
    oc describe userdefinednetwork vm-high-perf-network -n default

    What Just Happened:

    When you create a UserDefinedNetwork, OpenShift automatically:

    1. Creates the UDN: The Layer2 network with OVN-K IPAM

    2. Auto-generates NetworkAttachmentDefinition: For backward compatibility with VMs

    3. Configures OVN: Sets up the virtual switch and subnet

    Key Point: VMs still reference the network using multus.networkName in their spec, but the underlying implementation is now the modern UserDefinedNetwork instead of manual NetworkAttachmentDefinition configuration.

    This is why UserDefinedNetwork is better:

    • ✅ You define the network once (simple YAML)

    • ✅ OpenShift handles the NetworkAttachmentDefinition automatically

    • ✅ Native OVN-K integration (no manual CNI JSON)

    • ✅ Built-in IPAM (no configuration needed)

  3. Create a high-performance VM with dual network interfaces (lab simulation):

    cat << EOF | oc apply -f -
    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: high-perf-vm-lab
      namespace: default
      labels:
        app: high-perf-vm
    spec:
      running: true
      dataVolumeTemplates:
        - metadata:
            name: high-perf-vm-lab-rootdisk
          spec:
            storage:
              resources:
                requests:
                  storage: 30Gi
            sourceRef:
              kind: DataSource
              name: fedora
              namespace: openshift-virtualization-os-images
      template:
        metadata:
          labels:
            kubevirt.io/vm: high-perf-vm-lab
            app: high-perf-vm
        spec:
          domain:
            cpu:
              cores: 4
              dedicatedCpuPlacement: true  # Pin CPUs for low latency
            memory:
              hugepages:
                pageSize: 1Gi  # Use 1Gi HugePages (matches cluster configuration)
              guest: 4Gi
            resources:
              requests:
                memory: 4Gi
              limits:
                memory: 4Gi
            devices:
              disks:
                - name: rootdisk
                  disk:
                    bus: virtio
                - name: cloudinitdisk
                  disk:
                    bus: virtio
              interfaces:
                # Primary interface: Pod network (for management)
                - name: default
                  masquerade: {}
                # Secondary interface: User Defined Network (for high-performance data)
                - name: high-perf-net
                  bridge: {}
              networkInterfaceMultiqueue: true  # Enable multi-queue for better performance
          networks:
            # Pod network for management traffic
            - name: default
              pod: {}
            # User Defined Network for data plane traffic
            - name: high-perf-net
              multus:
                networkName: vm-high-perf-network
          volumes:
            - name: rootdisk
              dataVolume:
                name: high-perf-vm-lab-rootdisk
            - name: cloudinitdisk
              cloudInitNoCloud:
                userData: |
                  #cloud-config
                  user: fedora
                  password: fedora
                  chpasswd: { expire: False }
                  runcmd:
                    - nmcli con add type ethernet con-name eth1 ifname eth1 ip4 192.168.100.10/24
                    - nmcli con up eth1
    EOF

    Lab VM Configuration Explained:

    • Disk Configuration (DataVolume):

      • Uses dataVolumeTemplates to create persistent disk

      • Source: the fedora DataSource in openshift-virtualization-os-images (a golden image, typically backed by a VolumeSnapshot)

      • Pre-installed Fedora image (fast boot, even without KVM)

      • 30Gi storage allocation

      • Why not containerDisk? containerDisk is slow without KVM hardware virtualization

    • Dual Network Interfaces (same as production SR-IOV):

      • default: Pod network for management (SSH, monitoring)

      • high-perf-net: User Defined Network for data plane

    • Performance Optimizations (Educational Examples):

      • dedicatedCpuPlacement: true - Pins CPUs to VM (requires KVM for full benefit)

      • hugepages: pageSize: 1Gi - Uses 1Gi HugePages (matches cluster config from Module 4)

      • resources: requests/limits: 4Gi - Guarantees memory allocation

      • networkInterfaceMultiqueue: true - Parallel packet processing (4 queues per interface)

      • bridge: {} - Direct bridge attachment (better than masquerade)

    • HugePages Configuration:

      • VM requests 4GB guest memory

      • Uses 4 × 1Gi HugePages (matches Performance Profile)

      • Plus ~1GB virt-launcher overhead = ~5GB total

      • Must match cluster’s HugePages size (1Gi from Module 4)

      • Note: HugePages work with or without KVM, but provide the most benefit when KVM is available (see the allocatable-capacity check after the note below)

    • Cloud-init Configuration:

      • Creates user fedora with password fedora

      • Automatically configures eth1 with static IP (192.168.100.10/24)

      • Sets up network interface on boot

      • Ready for testing immediately

    This simulates SR-IOV architecture without special hardware!

    Note: This VM demonstrates performance features (HugePages, CPU pinning, multi-queue) that are typically used with KVM hardware virtualization. The VM will boot and run successfully even without KVM (using software emulation), but performance features provide maximum benefit when KVM is available.
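
    To sanity-check the HugePages math above before starting the VM, you can compare the node’s allocatable 1Gi HugePages with what is already in use. This is a minimal check, assuming the Module 4 Performance Profile configured hugepages-1Gi:

    # Allocatable 1Gi HugePages per node
    oc get nodes -o custom-columns=NAME:.metadata.name,HUGEPAGES_1GI:.status.allocatable.hugepages-1Gi

    # HugePages already allocated to running pods (including virt-launcher pods)
    oc describe node $(oc get nodes -o jsonpath='{.items[0].metadata.name}') | grep -i hugepages-1Gi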

  4. Wait for the DataVolume to be created and the VM to start:

    # Check DataVolume creation progress
    echo "=== DataVolume Status ==="
    oc get dv high-perf-vm-lab-rootdisk -n default
    
    # Wait for DataVolume to be ready (cloning from snapshot)
    echo ""
    echo "Waiting for DataVolume to be ready (this may take 1-2 minutes)..."
    oc wait --for=condition=Ready dv/high-perf-vm-lab-rootdisk -n default --timeout=300s
    
    # Check VM status
    echo ""
    echo "=== VM Status ==="
    oc get vm high-perf-vm-lab -n default
    oc get vmi high-perf-vm-lab -n default

    DataVolume Creation Process:

    When you create a VM with dataVolumeTemplates, OpenShift Virtualization:

    1. Creates a DataVolume - Persistent storage for the VM

    2. Clones from VolumeSnapshot - Copies the Fedora image from the snapshot

    3. Creates a PVC - Persistent Volume Claim for the disk

    4. Starts the VM - Once the DataVolume is ready

    This process takes 1-2 minutes but results in a fast-booting VM with persistent storage.

    Advantages over containerDisk:

    • ✅ Faster boot (pre-installed image)

    • ✅ Persistent storage (survives VM restarts)

    • ✅ Works well without KVM hardware virtualization

    • ✅ Same image used by the OpenShift Console VM wizard

  5. Verify the VM has dual network interfaces:

    # Wait for VM to be running
    oc wait --for=condition=Ready vmi/high-perf-vm-lab --timeout=300s
    
    # Check VM network interfaces
    oc get vmi high-perf-vm-lab -o jsonpath='{.status.interfaces}' | jq
    
    # Verify both networks are attached
    echo "VM Network Configuration:"
    oc get vmi high-perf-vm-lab -o jsonpath='{.spec.networks}' | jq
    
    # Check that VM has both pod network and user defined network
    oc describe vmi high-perf-vm-lab | grep -A 10 "Interfaces"
  6. Test the VM’s network performance:

    # Use the VMI network tester to validate connectivity
    python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \
        --namespace default
    
    # Access the VM to verify network interfaces
    virtctl console high-perf-vm-lab
    
    # Inside the VM, check network interfaces
    ip addr show
    
    # You should see:
    # - eth0: Pod network interface (management) - 10.x.x.x
    # - eth1: User Defined Network (high-performance) - 192.168.100.10
    
    # Test connectivity on both interfaces
    ping -c 4 -I eth0 8.8.8.8  # Management network
    ping -c 4 -I eth1 192.168.100.1  # High-performance network
    
    # Check interface statistics
    ip -s link show eth0
    ip -s link show eth1

Lab Simulation Performance Expectations:

  • Pod Network (eth0): 2-5ms latency, 5-20 Gbps throughput

  • User Defined Network (eth1): 1-3ms latency, 10-30 Gbps throughput

  • Improvement: 30-50% better latency than pod network alone

While not as fast as SR-IOV (<1ms), this demonstrates:

  • Dual-interface VM architecture

  • Network separation (control vs data plane)

  • Performance optimization techniques

  • Production-ready patterns

This is perfect for learning and lab environments!
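
If you want to see the difference yourself, you can attach a helper pod to the same User Defined Network and ping the VM’s data-plane address. This is a minimal sketch, assuming the auto-generated NetworkAttachmentDefinition shares the UDN name (vm-high-perf-network) in the default namespace and that the VM kept the cloud-init address 192.168.100.10; the pod name and image are illustrative:

  # Helper pod attached to the UDN via the auto-generated NetworkAttachmentDefinition
cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: udn-latency-probe
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: vm-high-perf-network
spec:
  containers:
  - name: probe
    image: registry.redhat.io/ubi8/ubi-minimal:latest
    command: ["/bin/bash", "-c", "microdnf install -y iputils && sleep infinity"]
EOF

  # Latency to the VM over the User Defined Network (cloud-init static IP from step 3)
oc exec udn-latency-probe -- ping -c 10 192.168.100.10

Compare the reported averages against the pod-network results from the module05-vmi-network-tester.py script to gauge the relative improvement.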

Configuring SR-IOV for Virtual Machines (Production)

When to Use This Section:

  • You have SR-IOV capable hardware (Intel X710, Mellanox ConnectX-5, etc.)

  • The SR-IOV Network Operator detected SR-IOV capable nodes

  • You need <1ms latency for production workloads

  • You’re deploying NFV or real-time applications

For Lab Environments: Use the User Defined Networks approach above instead.

Unlike pods, VMs require specific SR-IOV network configuration to attach Virtual Functions directly to the VM.

  1. Create an SR-IOV Network for VM use (production hardware required):

    cat << EOF | oc apply -f -
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: vm-sriov-network
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: vm_sriov_net
      networkNamespace: default
      vlan: 100  # Optional: VLAN tagging
      capabilities: '{"ips": true, "mac": true}'
      # Important: This network will be used by VMs
      ipam: |
        {
          "type": "host-local",
          "subnet": "192.168.100.0/24",
          "rangeStart": "192.168.100.10",
          "rangeEnd": "192.168.100.100",
          "gateway": "192.168.100.1"
        }
    EOF

    This creates an SR-IOV network specifically for VM use. The resourceName must match the SR-IOV Network Node Policy configured in Module 2.
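
    Before attaching VMs to this network, it’s worth confirming that the matching policy exists and that nodes actually advertise the Virtual Function resource. A quick check (the vm_sriov_net name comes from the SriovNetwork above; the SR-IOV operator typically exposes it as openshift.io/vm_sriov_net):

    # List SR-IOV node policies and confirm one defines resourceName: vm_sriov_net
    oc get sriovnetworknodepolicy -n openshift-sriov-network-operator

    # Confirm nodes advertise the corresponding allocatable resource
    oc get nodes -o jsonpath='{.items[*].status.allocatable}' | tr ',' '\n' | grep -i sriov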

  2. Create a high-performance VM with SR-IOV networking:

    cat << EOF | oc apply -f -
    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: high-performance-vm-sriov
      namespace: default
    spec:
      running: true
      template:
        metadata:
          labels:
            kubevirt.io/vm: high-performance-vm-sriov
        spec:
          domain:
            cpu:
              cores: 4
              dedicatedCpuPlacement: true  # Pin CPUs for low latency
            memory:
              hugepages:
                pageSize: 2Mi  # Use HugePages
              guest: 4Gi
            devices:
              disks:
                - name: containerdisk
                  disk:
                    bus: virtio
                - name: cloudinitdisk
                  disk:
                    bus: virtio
              interfaces:
                # Primary interface: Pod network (for management)
                - name: default
                  masquerade: {}
                # Secondary interface: SR-IOV (for high-performance data plane)
                - name: sriov-net
                  sriov: {}
          networks:
            # Pod network for management traffic
            - name: default
              pod: {}
            # SR-IOV network for data plane traffic
            - name: sriov-net
              multus:
                networkName: vm-sriov-network
          volumes:
            - name: containerdisk
              containerDisk:
                image: quay.io/containerdisks/fedora:latest
            - name: cloudinitdisk
              cloudInitNoCloud:
                userData: |
                  #cloud-config
                  password: fedora
                  chpasswd: { expire: False }
    EOF

    VM SR-IOV Configuration Explained:

    • Two Network Interfaces:

      • default: Pod network for management (SSH, monitoring)

      • sriov-net: SR-IOV for high-performance data traffic

    • Why Two Interfaces?

      • Management traffic doesn’t need SR-IOV performance

      • Data plane traffic gets direct hardware access

      • Separates control and data planes

    • Performance Features:

      • dedicatedCpuPlacement: true - Pins CPUs to VM

      • hugepages - Reduces memory overhead

      • sriov: {} - Attaches SR-IOV VF directly to VM
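
    One quick way to confirm the VF was actually allocated is to inspect the resource requests on the VM’s virt-launcher pod. This is a minimal check; the vm.kubevirt.io/name label and the openshift.io/vm_sriov_net resource name are assumptions based on the objects above:

    # The virt-launcher pod should request the SR-IOV VF resource
    oc get pod -n default -l vm.kubevirt.io/name=high-performance-vm-sriov \
        -o jsonpath='{.items[0].spec.containers[0].resources.requests}'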

  3. Verify the VM has SR-IOV networking:

    # Wait for VM to be running
    oc wait --for=condition=Ready vmi/high-performance-vm-sriov --timeout=300s
    
    # Check VM network interfaces
    oc get vmi high-performance-vm-sriov -o jsonpath='{.status.interfaces}' | jq
    
    # Verify SR-IOV VF is attached
    oc describe vmi high-performance-vm-sriov | grep -A 10 "Interfaces"
    
    # Check that VM has both pod network and SR-IOV
    echo "VM Network Configuration:"
    oc get vmi high-performance-vm-sriov -o jsonpath='{.spec.networks}' | jq

Testing VM SR-IOV Network Performance

Now let’s test the network performance of the VM with SR-IOV to see the improvement over pod networking.

  1. Access the VM and check network interfaces:

    # Access the VM console
    virtctl console high-performance-vm-sriov
    
    # Inside the VM, check network interfaces
    ip addr show
    
    # You should see:
    # - eth0: Pod network interface (management)
    # - eth1: SR-IOV interface (high-performance)
    
    # Check SR-IOV interface details
    ethtool -i eth1
    
    # Test network performance (requires iperf3 installed)
    # From another VM or pod, run iperf3 server
    # Then from this VM: iperf3 -c <server-ip> -i 1 -t 30
  2. Use the VMI network tester to validate SR-IOV VM connectivity:

    # Test networking to the SR-IOV-enabled VM
    python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \
        --namespace default
    
    # This will test connectivity to VMs including SR-IOV-enabled ones
    # Expected results:
    # - Pod network interface: 2-5ms latency
    # - SR-IOV interface: <1ms latency (if tested directly)

SR-IOV Performance Expectations for VMs:

  • Pod Network (eth0): 2-5ms latency, 5-20 Gbps throughput

  • SR-IOV Network (eth1): <1ms latency, near line-rate throughput

The SR-IOV interface provides 5-10x better latency and 2-5x better throughput compared to pod networking for VMs.

Network Policy Latency Testing

Network policies can impact VM networking performance. Let’s test network policy enforcement latency using kube-burner’s network policy latency measurement.

  1. Create network policy latency test configuration adapted for SNO:

    cd ~/kube-burner-configs
    
    cat << EOF > network-policy-latency-config.yml
    global:
      measurements:
        - name: netpolLatency
    
    metricsEndpoints:
      - indexer:
          type: local
          metricsDirectory: collected-metrics-netpol
    
    jobs:
      # Job 1: Create pods and namespaces (reduced scale for SNO)
      - name: network-policy-setup
        jobType: create
        jobIterations: 3  # Reduced for SNO
        namespace: network-policy-perf
        namespacedIterations: true
        cleanup: false
        podWait: true
        waitWhenFinished: true
        verifyObjects: true
        errorOnVerify: false
        namespaceLabels:
          kube-burner.io/skip-networkpolicy-latency: "true"
        objects:
          - objectTemplate: network-test-pod.yml
            replicas: 2  # Reduced for SNO
            inputVars:
              containerImage: registry.redhat.io/ubi8/ubi-minimal:latest
    
      # Job 2: Apply network policies and test connectivity
      - name: network-policy-test
        jobType: create
        jobIterations: 3  # Reduced for SNO
        namespace: network-policy-perf
        namespacedIterations: false
        cleanup: false
        podWait: false
        waitWhenFinished: true
        verifyObjects: true
        errorOnVerify: false
        jobPause: 30s  # Reduced pause for faster testing
        objects:
          - objectTemplate: ingress-network-policy.yml
            replicas: 1  # Reduced for SNO
            inputVars:
              namespaces: 3  # Reduced for SNO
    EOF
  2. Create the network test pod template:

    cat << EOF > network-test-pod.yml
    apiVersion: v1
    kind: Pod
    metadata:
      name: network-test-pod-{{.Iteration}}-{{.Replica}}
      labels:
        app: network-test
        iteration: "{{.Iteration}}"
        replica: "{{.Replica}}"
    spec:
      # No nodeSelector for SNO - will schedule on the single node
      containers:
      - name: network-test-container
        image: {{.containerImage}}
        command: ["/bin/bash"]
        args: ["-c", "microdnf install -y httpd && echo 'Hello from pod {{.Iteration}}-{{.Replica}}' > /var/www/html/index.html && httpd -D FOREGROUND"]
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          requests:
            memory: "128Mi"  # Increased for httpd
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 5
      restartPolicy: Never
    EOF
  3. Create the ingress network policy template:

    cat << EOF > ingress-network-policy.yml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: ingress-policy-{{.Iteration}}-{{.Replica}}
    spec:
      podSelector:
        matchLabels:
          app: network-test
      policyTypes:
      - Ingress
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              name: network-policy-perf-{{.Iteration}}
        - podSelector:
            matchLabels:
              app: network-test
        ports:
        - protocol: TCP
          port: 80  # Updated to match httpd default port
      # Egress is not restricted by this policy (policyTypes lists only Ingress),
      # so DNS resolution and package installation keep working; this extra rule
      # simply allows ingress on DNS ports from any source
      - from: []
        ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
    EOF
  4. Run the network policy latency test:

      # Execute the network policy latency test adapted for SNO
    echo "Starting network policy latency test..."
    echo "   Test scale: 3 iterations × 2 replicas = 6 pods total"
    echo "   Environment: Single Node OpenShift (SNO)"
    echo ""
    
    kube-burner init -c network-policy-latency-config.yml --log-level=info
    
      # This test will:
      # 1. Create pods in multiple namespaces (reduced scale for SNO)
      # 2. Apply network policies with ingress rules
      # 3. Measure network policy enforcement latency
  5. Monitor network policy test progress:

      # Watch network policies being created (press Ctrl+C to exit)
    echo "Monitoring network policy test progress..."
    echo "   Use Ctrl+C to exit the watch command when test completes"
    echo ""
    
    watch -n 5 "echo '--- Network Policies ---' && oc get networkpolicy --all-namespaces | grep network-policy-perf && echo '' && echo '--- Test Pods ---' && oc get pods --all-namespaces | grep network-test"
  6. Check test results after completion:

      # Check final network policy status
    echo "📋 Final Network Policy Status:"
    oc get networkpolicy --all-namespaces | grep network-policy-perf
    
      # Check pod status
    echo ""
    echo "📋 Test Pod Status:"
    oc get pods --all-namespaces | grep network-test
    
      # Check if pods are ready and accessible
    echo ""
    echo "📊 Pod Readiness:"
    oc get pods --all-namespaces -o custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[0].ready,STATUS:.status.phase | grep network-test
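
The netpolLatency measurement writes its raw results to the local indexer directory defined in the config (collected-metrics-netpol under ~/kube-burner-configs). You can peek at them directly; exact file names can vary by kube-burner version:

  # Raw network policy latency documents from the local indexer
ls ~/kube-burner-configs/collected-metrics-netpol/

  # File names typically contain "netpol"; pretty-print one for a quick look
cat ~/kube-burner-configs/collected-metrics-netpol/*netpol* | jq '.' | head -40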

Educational Analysis Scripts for Virtualization

The workshop provides educational scripts to help you understand VM vs container trade-offs and test VM networking.

  1. VM vs Container Comparison - Educational comparison tool:

    # Compare VMs and containers comprehensively
    python3 ~/low-latency-performance-workshop/scripts/module05-vm-vs-container-comparison.py
    
    # Disable colored output for documentation
    python3 ~/low-latency-performance-workshop/scripts/module05-vm-vs-container-comparison.py --no-color

    This script provides:

    • Architecture and design differences explained

    • Startup time comparison (VMs: 60-90s vs Containers: 3-10s)

    • Resource usage and overhead analysis

    • Isolation and security characteristics

    • Networking performance comparison

    • Use case guidance for choosing VMs vs containers

  2. VMI Network Tester - Test networking against Virtual Machines:

    # Test networking against all VMIs in the cluster
    python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py
    
    # Test VMIs in specific namespace
    python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \
        --namespace vmi-latency-test-0
    
    # Skip educational explanations
    python3 ~/low-latency-performance-workshop/scripts/module05-vmi-network-tester.py \
        --skip-explanation

    This script tests:

    • VMI connectivity and reachability

    • Network latency TO virtual machines (not pods!)

    • VMI IP assignment and configuration

    • Network policy impact on VM traffic

    • Creates test pods that ping VMIs to measure performance

The module05-vmi-network-tester.py script specifically tests networking against VMs (VMIs) rather than pods. This is important because:

  • VMs have different networking characteristics than containers

  • VMI networking goes through the virt-launcher pod

  • Network policies apply differently to VM traffic

  • SR-IOV can bypass the pod network entirely

This script helps you understand and validate VM networking performance.
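
You can see this pairing directly on the cluster: every running VMI is backed by a virt-launcher pod, which is where the pod-network interface and any Multus attachments actually terminate:

  # List the virt-launcher pods backing each VMI
oc get pods -n default -l kubevirt.io=virt-launcher -o wide

  # Compare with the VMIs themselves
oc get vmi -n default -o wide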

Analyzing Network Policy Latency Results with Python

Use the educational Python scripts to analyze network policy enforcement latency and understand its impact on VM networking performance.

  1. Run the network policy performance analyzer:

    cd ~/low-latency-performance-workshop/scripts
    
      # Run the educational network policy latency analyzer
    echo "🔍 Analyzing Network Policy Performance Impact..."
    python3 ~/low-latency-performance-workshop/scripts/module05-network-policy-analyzer.py \
        --metrics-dir ~/kube-burner-configs \
        --analysis-type latency
    
      # The script provides:
      # 1. Educational analysis of policy enforcement overhead
      # 2. Color-coded performance assessment
      # 3. Performance vs security trade-off explanations
      # 4. Recommendations for policy optimization
  2. Generate comprehensive network policy performance insights:

    cd ~/low-latency-performance-workshop/scripts
    
      # Create detailed educational analysis with report generation
    echo "📊 Generating Comprehensive Network Policy Analysis..."
    python3 ~/low-latency-performance-workshop/scripts/module05-network-policy-analyzer.py \
        --metrics-dir ~/kube-burner-configs \
        --analysis-type comprehensive \
        --output-format educational
    
      # This educational analysis includes:
      # • Statistical analysis of policy enforcement latency
      # • Performance vs security trade-off explanations
      # • Best practices for low-latency network policies
      # • Detailed markdown report with optimization strategies
      # • Educational insights about CNI performance impact

Performance Optimization Best Practices

VM Configuration Best Practices

  1. CPU Optimization:

    • Use dedicatedCpuPlacement: true for guaranteed CPU access

    • Match VM vCPU count to NUMA topology

    • Use host-model CPU model for compatibility (or host-passthrough if supported)

    • Consider specific CPU models (e.g., Haswell-noTSX) for consistent behavior across environments (see the example after this list)

  2. Memory Optimization:

    • Configure HugePages for reduced TLB misses

    • Align memory allocation with NUMA topology

    • Disable memory overcommit for predictable performance

  3. Storage Optimization:

    • Use high-performance storage classes

    • Configure appropriate I/O schedulers

    • Consider local storage for ultra-low latency

  4. Network Optimization:

    • Use SR-IOV for direct hardware access

    • Configure multiple network interfaces for traffic separation

    • Optimize network policies for minimal overhead
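
For reference, the CPU recommendations above map to the cpu block of the VM spec. A minimal fragment (values are illustrative; pick one model option):

  # Fragment of spec.template.spec.domain in a VirtualMachine
  cpu:
    cores: 4
    dedicatedCpuPlacement: true
    model: host-model          # broad compatibility
    # model: host-passthrough  # maximum performance when the hypervisor supports it
    # model: Haswell-noTSX     # pin a specific model for consistent behavior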

Monitoring and Validation

  1. Key Metrics to Monitor:

    • VMI startup latency (target: < 90 seconds for SNO)

    • Network policy enforcement latency (target: < 10 seconds for SNO)

    • CPU utilization and isolation effectiveness

    • Memory allocation and HugePages usage

  2. Performance Validation Tools:

    • kube-burner for comprehensive latency testing

    • iperf3 for network throughput testing

    • stress-ng for CPU and memory stress testing

    • fio for storage performance testing (a short hands-on example follows this list)
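
If you want to exercise these tools by hand inside the lab VM (virtctl console high-perf-vm-lab), a short validation pass could look like the following; package names assume the Fedora guest used earlier in this module:

  # Inside the VM: install the validation tools (Fedora guest)
sudo dnf install -y stress-ng fio iperf3

  # CPU consistency under load (look for stable results with dedicated CPUs)
stress-ng --cpu 4 --timeout 30s --metrics-brief

  # Simple 4k random-read latency check against the root disk
fio --name=randread-lat --filename=/tmp/fio.test --size=256M \
    --rw=randread --bs=4k --iodepth=1 --runtime=30 --time_based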

Module Summary

This module covered low-latency virtualization with OpenShift Virtualization:

  • Verified OpenShift Virtualization deployment from Module 2

  • Configured high-performance VMs with dedicated CPUs and HugePages

  • Measured VMI startup latency using kube-burner’s vmiLatency measurement

  • Tested network policy performance with netpolLatency measurement

  • Compared VM vs container performance to understand trade-offs

  • Implemented SR-IOV networking for ultra-low latency networking

Key Performance Insights

Metric | Without Performance Profile | With Performance Profile | Improvement

Fedora VMI Startup (P99) | 90-150 seconds | 60-90 seconds | ~30-40% faster

Network Policy Latency (P99) | 10-20 seconds | 5-10 seconds | ~50% faster

VM vs Pod Startup | 15-25x slower | 10-15x slower | Reduced overhead

CPU Consistency | Variable performance | Predictable performance | Eliminated jitter

Memory Latency | Standard pages | HugePages optimization | Reduced TLB misses

Key Architectural Learning Points

VirtualMachine vs VirtualMachineInstance Usage Patterns:

Use Case | Object Type | Management | Best For

Production Workloads | VirtualMachine | Full lifecycle management | Long-running VMs, interactive use

Performance Testing | VirtualMachineInstance | Direct creation, ephemeral | Automated testing, precise metrics

Development/Testing | VirtualMachine | Start/stop capability | Development environments

Latency Measurement | VirtualMachineInstance | No controller overhead | Pure hypervisor performance

What You Learned:

  • ✅ Architecture: VMs create and manage VMIs, but VMIs can exist independently

  • ✅ Performance Testing: Direct VMI creation eliminates management overhead

  • ✅ Measurement Precision: kube-burner measures pure hypervisor startup time

  • ✅ Real-world Usage: Production typically uses VMs for lifecycle management
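
For contrast with the VirtualMachine objects used throughout this module, here is a minimal sketch of creating a VirtualMachineInstance directly for measurement-style testing (the name and sizing are illustrative, not a workshop asset):

cat << EOF | oc apply -f -
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-direct-example
  namespace: default
spec:
  domain:
    cpu:
      cores: 1
    memory:
      guest: 1Gi
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
      interfaces:
        - name: default
          masquerade: {}
  networks:
    - name: default
      pod: {}
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest
EOF

Because no VirtualMachine controller owns it, deleting the VMI (oc delete vmi vmi-direct-example) terminates it permanently, which is convenient for clean, repeatable latency runs.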

Performance Profile Impact on VMs

The performance improvements from Module 4 are even more significant for VMs than containers because:

  • CPU Isolation: VMs benefit greatly from dedicated CPU cores without interference

  • HugePages: VM memory management sees substantial improvement with large pages

  • NUMA Alignment: VM memory and CPU locality reduces cross-NUMA penalties

  • Reduced Jitter: Consistent performance is critical for VM workloads

Consider completing Module 4 to see these benefits in action!
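
If you are not sure whether the Module 4 tuning is in place, a quick check (object names depend on how Module 4 was applied):

  # Is a PerformanceProfile present on this cluster?
oc get performanceprofile

  # Does the node expose 1Gi HugePages for VMs to consume?
oc get nodes -o custom-columns=NAME:.metadata.name,HUGEPAGES_1GI:.status.allocatable.hugepages-1Gi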

SNO Environment Considerations

Performance Characteristics:

  • Single Node: All workloads compete for the same resources

  • Control Plane Overhead: Master components consume CPU and memory

  • Storage Limitations: Single storage backend affects VM boot times

  • Network Simplicity: Reduced network complexity but shared bandwidth

Optimization Strategies:

  • Resource Allocation: Careful CPU and memory allocation for VMs

  • Test Scaling: Reduced test scale to prevent resource exhaustion

  • Performance Profiles: Even more important in resource-constrained environments

  • Monitoring: Close monitoring of resource utilization during tests

Troubleshooting Common Issues

PVC Binding Conflicts:

  # Check for PVC binding issues across all namespaces
oc get events --all-namespaces | grep -i "bound incorrectly"

  # Clean up orphaned PVCs if needed
oc get pvc --all-namespaces | grep -E "(Pending|Lost)"

VM Startup Issues:

  # Check VM status and events
oc describe vm <vm-name> -n <namespace>

  # Check DataVolume import progress
oc get dv -n <namespace> -w

  # Check CDI operator logs if DataVolume import fails
oc logs -n openshift-cnv deployment/cdi-deployment

  # Check virt-launcher pod logs for VM startup issues
oc logs -n <namespace> -l vm.kubevirt.io/name=<vm-name> -c compute

CPU Model Compatibility Issues:

  # If you see "unsupported configuration: CPU mode 'host-passthrough'" error:

  # Check CPU model labels on the node (from Node Feature Discovery, if installed)
oc get nodes --show-labels | tr ',' '\n' | grep -i cpu-model

  # The workshop uses 'host-model' for better compatibility
  # If issues persist, you can use a specific CPU model:
  # model: "Haswell-noTSX" or model: "Skylake-Client"

  # Check hypervisor capabilities
oc debug node/<node-name> -- chroot /host cat /proc/cpuinfo | head -20

Resource Constraints:

  # Monitor node resource usage during tests
oc adm top nodes

  # Check for resource pressure
oc describe node <node-name> | grep -A 10 "Conditions:"

Workshop Progress

  • Module 1: Low-latency fundamentals and concepts

  • Module 2: RHACM and GitOps deployment automation

  • Module 3: Baseline performance measurement and analysis

  • Module 4: Performance tuning with CPU isolation (optional but recommended)

  • Module 5: Low-latency virtualization with OpenShift Virtualization (current)

  • 🎯 Next: Module 6 - Monitoring, alerting, and continuous validation

Performance Comparison Opportunity

If you completed this module without performance profiles from Module 4:

  1. Record your current VMI performance results from the Python analysis

  2. Go back and complete Module 4 to configure performance profiles

  3. Return and re-run the VMI tests to see the performance improvement

  4. Compare the results to understand the impact of performance tuning on virtualization

This approach provides valuable insights into the performance benefits of proper cluster tuning for virtualized workloads.

Next Steps

In Module 6, you’ll learn to:

  • Set up comprehensive performance monitoring

  • Create alerting for performance regressions

  • Validate optimizations across the entire stack

  • Implement continuous performance testing

Knowledge Check

  1. What are the key differences between VM and container startup latency in terms of performance characteristics?

  2. How does SR-IOV improve network performance for VMs compared to traditional networking?

  3. What network policy latency thresholds are acceptable for production workloads in SNO environments?

  4. How do you configure a VM for maximum CPU performance using dedicated CPU placement?

  5. What are the trade-offs between VM isolation and performance overhead?