Module 7: GPU Workloads (Optional)

Optional Module - GPU Hardware Required

This module is optional and requires GPU-enabled nodes in your OpenShift cluster. If you don’t have GPU hardware available, you can skip this module and proceed to the workshop conclusion.

GPU workloads are increasingly important for:

  • Machine Learning and AI inference

  • High-performance computing (HPC)

  • Video processing and encoding

  • Scientific computing

This module demonstrates how to configure and optimize GPU workloads on OpenShift for low-latency performance.

Module Overview

This module covers deploying and optimizing GPU workloads on OpenShift using the NVIDIA GPU Operator. You’ll learn how to:

  • Install and configure the NVIDIA GPU Operator

  • Verify GPU discovery and availability

  • Deploy GPU-enabled workloads

  • Measure GPU performance

  • Configure GPU passthrough for virtual machines (optional)

Prerequisites

Before starting this module, ensure you have:

  • Completed Module 2 (Environment Setup) - verified cluster access

  • GPU-enabled nodes in your OpenShift cluster

  • Cluster-admin access to install operators

  • Basic understanding of GPU computing concepts

GPU Hardware Requirement

This module requires physical GPU hardware (NVIDIA GPUs) in your cluster nodes. Virtual or emulated GPUs will not work for the exercises in this module.

If you don’t have GPU hardware, you can:

  • Skip this module

  • Review the concepts and configuration examples

  • Use this module as a reference for future GPU deployments

Key Learning Objectives

  • Understand GPU Operator architecture and components

  • Install and configure the NVIDIA GPU Operator

  • Verify GPU discovery and node labeling

  • Deploy GPU-enabled workloads

  • Measure GPU performance and utilization

  • Configure GPU passthrough for VMs (optional)

Understanding GPU on OpenShift

GPU Operator Overview

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs in OpenShift, including:

  • NVIDIA Driver: Kernel modules and user-space libraries

  • Container Toolkit: Runtime for GPU containers

  • Device Plugin: Kubernetes device plugin for GPU discovery

  • DCGM Exporter: Metrics collection for GPU monitoring

  • GPU Feature Discovery: Automatic node labeling based on GPU capabilities

GPU Architecture on OpenShift

Component                 Description
GPU Nodes                 Worker nodes with physical NVIDIA GPUs installed
GPU Operator              Manages all NVIDIA software components automatically
Device Plugin             Makes GPUs available as schedulable resources in Kubernetes
Node Feature Discovery    Automatically labels nodes with GPU capabilities
DCGM Exporter             Provides GPU metrics to Prometheus for monitoring

GPU Workload Types

  • Machine Learning Training: Training neural networks with frameworks like TensorFlow, PyTorch

  • Inference: Running trained models for predictions

  • HPC Workloads: High-performance computing applications

  • Video Processing: Encoding, transcoding, and processing video content

  • Scientific Computing: Computational simulations and data analysis

Hands-on Exercise: Installing the NVIDIA GPU Operator

Step 1: Verify GPU Hardware

Before installing the GPU Operator, verify that your nodes have GPUs available.

  1. Check if nodes have GPUs:

    # List all nodes
    oc get nodes
    
    # Check for GPU labels (if GPU Operator is already partially installed)
    oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
    
    # Check node hardware (requires SSH access to nodes)
    oc debug node/<node-name> -- chroot /host nvidia-smi

GPU Detection

If nvidia-smi is not yet available on the node, that is expected: the GPU Operator installs the NVIDIA drivers automatically in the steps that follow.
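If you want to confirm the hardware is present before any NVIDIA driver exists, you can look for NVIDIA's PCI vendor ID (10de) in sysfs, which is available on every Linux node. A minimal sketch; the node name is a placeholder:

```shell
# List PCI devices whose vendor ID is NVIDIA's (0x10de). This works before
# any NVIDIA software is installed because it only reads sysfs.
oc debug node/<node-name> -- chroot /host \
  sh -c 'grep -l 0x10de /sys/bus/pci/devices/*/vendor'
```

Each path printed corresponds to one NVIDIA PCI device; no output means no NVIDIA hardware was found on that node.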

Step 2: Create Namespace for GPU Operator

  1. Create the namespace for GPU Operator:

    oc create namespace gpu-operator-resources

Step 3: Install GPU Operator via OperatorHub

  1. Install the NVIDIA GPU Operator from OperatorHub:

    # Create OperatorGroup
    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: gpu-operator-group
      namespace: gpu-operator-resources
    spec:
      targetNamespaces:
      - gpu-operator-resources
    EOF
    
    # Create Subscription
    cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: gpu-operator-certified
      namespace: gpu-operator-resources
    spec:
      channel: "v1.11"  # pick a channel that exists in your catalog (oc get packagemanifests -n openshift-marketplace gpu-operator-certified)
      name: gpu-operator-certified
      source: certified-operators
      sourceNamespace: openshift-marketplace
      installPlanApproval: Automatic
    EOF
  2. Wait for the operator to be installed:

    # Watch the CSV until it shows "Succeeded"
    watch oc get csv -n gpu-operator-resources
    
    # Once installed, check operator pods
    oc get pods -n gpu-operator-resources

Step 4: Create ClusterPolicy

The ClusterPolicy custom resource configures all GPU Operator components.

  1. Create a ClusterPolicy:

    cat <<EOF | oc apply -f -
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: cluster-policy
    spec:
      driver:
        enabled: true
      toolkit:
        enabled: true
      devicePlugin:
        enabled: true
      dcgmExporter:
        enabled: true
      gfd:
        enabled: true
    EOF
  2. Monitor the ClusterPolicy status:

    # Check ClusterPolicy status
    oc get clusterpolicy -o yaml
    
    # Watch pods being created
    watch oc get pods -n gpu-operator-resources
    
    # Check for driver daemonset
    oc get daemonset -n gpu-operator-resources

Driver Installation Time

NVIDIA driver installation can take 5-10 minutes. The operator will:

  1. Install kernel modules

  2. Load drivers on GPU nodes

  3. Start the device plugin

  4. Label nodes with GPU information

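Rather than re-running commands by hand, you can block until the operator's validator pod reports ready. A sketch, assuming the validator pod carries the app=nvidia-operator-validator label used in the troubleshooting section of this module:

```shell
# Block until the GPU Operator's validator pod is Ready. The long timeout
# accounts for driver compilation, which is slow on first install.
oc wait --for=condition=Ready pod \
  -l app=nvidia-operator-validator \
  -n gpu-operator-resources --timeout=600s
```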
Verifying GPU Discovery

Step 1: Check Node Labels

GPU Feature Discovery automatically labels nodes with GPU information.

  1. Check for GPU-related node labels:

    # List all GPU-related labels
    oc get nodes -o json | jq -r '.items[] | select(.metadata.labels | to_entries | map(.key) | any(startswith("nvidia.com"))) | {name: .metadata.name, labels: [.metadata.labels | to_entries[] | select(.key | startswith("nvidia.com"))]}'
    
    # Check specific GPU labels
    oc get nodes -l nvidia.com/gpu.present=true
    
    # View all labels on a GPU node
    oc get node <gpu-node-name> --show-labels | grep nvidia

Step 2: Verify GPU Resources

  1. Check if GPUs are available as schedulable resources:

    # Describe a GPU node to see allocatable resources
    oc describe node <gpu-node-name> | grep -A 5 "Allocatable:"
    
    # Should show: nvidia.com/gpu: <number>
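To see the allocatable GPU count for every node at once instead of describing nodes one by one, a jq query over the node list works. A sketch; it assumes jq is installed on your workstation:

```shell
# Print each node's allocatable GPU count ("0" when the node reports none).
oc get nodes -o json | jq -r \
  '.items[] | "\(.metadata.name) \(.status.allocatable["nvidia.com/gpu"] // "0")"'
```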

Step 3: Test GPU Access

  1. Create a test pod to verify GPU access:

    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "nvidia/cuda:12.0.0-base-ubuntu22.04"
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF
    
    # Wait for the pod to run to completion (nvidia-smi exits immediately,
    # so waiting for condition=Ready would never succeed)
    oc wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s
    
    # Check pod logs
    oc logs gpu-test
    
    # Clean up
    oc delete pod gpu-test

Expected Output

The nvidia-smi command should show:

  • GPU model and memory

  • Driver version

  • CUDA version

  • GPU utilization (if any)

If you see errors, check the GPU Operator logs and ClusterPolicy status.

Deploying GPU Workloads

Example: TensorFlow GPU Workload

  1. Create a TensorFlow workload that uses GPU:

    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: tensorflow-gpu
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["python", "-c", "import tensorflow as tf; print('GPUs:', tf.config.list_physical_devices('GPU')); print('TensorFlow version:', tf.__version__)"]
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
    EOF
    
    # Check logs
    oc logs tensorflow-gpu
    
    # Clean up
    oc delete pod tensorflow-gpu

Example: PyTorch GPU Workload

  1. Create a PyTorch workload:

    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: pytorch-gpu
    spec:
      restartPolicy: Never
      containers:
      - name: pytorch
        image: pytorch/pytorch:latest
        command: ["python", "-c", "import torch; print('CUDA available:', torch.cuda.is_available()); print('GPU count:', torch.cuda.device_count() if torch.cuda.is_available() else 0)"]
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
    EOF
    
    # Check logs
    oc logs pytorch-gpu
    
    # Clean up
    oc delete pod pytorch-gpu

GPU Performance Testing

Using DCGM for Metrics

The GPU Operator includes DCGM Exporter for GPU metrics.

  1. Check DCGM Exporter:

    # Check DCGM Exporter pods
    oc get pods -n gpu-operator-resources -l app=nvidia-dcgm-exporter
    
    # Check DCGM metrics endpoint
    oc get svc -n gpu-operator-resources nvidia-dcgm-exporter
    
    # Port-forward to access metrics
    oc port-forward -n gpu-operator-resources svc/nvidia-dcgm-exporter 9400:9400
  2. Access metrics (in another terminal):

    # View metrics
    curl http://localhost:9400/metrics | grep DCGM
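Because cluster monitoring scrapes the DCGM Exporter, you can also query GPU metrics through the cluster's Thanos Querier route instead of port-forwarding. A sketch, assuming cluster monitoring is enabled and the metric DCGM_FI_DEV_GPU_UTIL (GPU utilization, part of DCGM Exporter's default metric set) is being collected:

```shell
# Query current GPU utilization from the in-cluster Prometheus (Thanos Querier).
TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"
```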

GPU Workload Benchmarking

  1. Create a GPU benchmark workload:

    cat <<EOF | oc apply -f -
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: gpu-benchmark
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: benchmark
            image: nvidia/cuda:12.0.0-base-ubuntu22.04
            command:
            - sh
            - -c
            - |
              nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu --format=csv
              nvidia-smi dmon -s u -c 10
            resources:
              limits:
                nvidia.com/gpu: 1
              requests:
                nvidia.com/gpu: 1
    EOF
    
    # Watch job
    oc get job gpu-benchmark -w
    
    # Check logs
    oc logs job/gpu-benchmark
    
    # Clean up
    oc delete job gpu-benchmark

GPU Passthrough for Virtual Machines (Optional)

Advanced Topic

GPU passthrough allows virtual machines to directly access physical GPUs, providing near-native performance for GPU workloads in VMs.

Prerequisites for GPU Passthrough

  • OpenShift Virtualization installed (from Module 2)

  • IOMMU enabled on nodes

  • GPU hardware that supports passthrough

  • Appropriate node configuration
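On OpenShift, the IOMMU is enabled with a kernel argument applied through a MachineConfig. A minimal sketch for Intel hosts (use amd_iommu=on on AMD hardware); the MachineConfig name is an example, and applying it reboots worker nodes one at a time:

```shell
# Enable the IOMMU on all worker nodes via a kernel argument.
# NOTE: the Machine Config Operator will drain and reboot each worker.
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 100-worker-iommu
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  - intel_iommu=on
EOF
```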

Configuring GPU Passthrough

  1. Permit the GPU as a host device in the HyperConverged custom resource (OpenShift Virtualization has no standalone HostDevice kind). The PCI device selector and resource name below are examples for a Tesla T4; substitute the vendor:device ID of your GPU (visible with lspci -nn on the node; NVIDIA's vendor ID is 10de):

    oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv \
      --type merge \
      -p '{"spec":{"permittedHostDevices":{"pciHostDevices":[{"pciDeviceSelector":"10DE:1EB8","resourceName":"nvidia.com/TU104GL_Tesla_T4"}]}}}'
  2. Create a VM with GPU passthrough (deviceName must match the resourceName permitted above):

    cat <<EOF | oc apply -f -
    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: gpu-vm
    spec:
      running: true
      template:
        spec:
          domain:
            devices:
              disks:
              - name: containerdisk
                disk:
                  bus: virtio
              hostDevices:
              - deviceName: nvidia.com/TU104GL_Tesla_T4
                name: gpu1
            resources:
              requests:
                memory: 4Gi
          volumes:
          - name: containerdisk
            containerDisk:
              image: quay.io/kubevirt/fedora-cloud-container-disk-demo
    EOF

GPU Passthrough Complexity

GPU passthrough requires:

  • Specific hardware support (IOMMU, VT-d/AMD-Vi)

  • Proper kernel parameters

  • Compatible drivers

  • Node reboots, in some cases

This is an advanced configuration and may not work on all hardware or cloud providers.
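To check whether the IOMMU is actually active on a node, the kernel exposes IOMMU groups in sysfs. A sketch; the directory is empty (or absent) when the IOMMU is disabled, and the node name is a placeholder:

```shell
# Count IOMMU groups on the node; a non-zero count means the IOMMU is active.
oc debug node/<node-name> -- chroot /host \
  sh -c 'ls /sys/kernel/iommu_groups 2>/dev/null | wc -l'
```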

Module Summary

In this optional module, you have:

  • Installed the NVIDIA GPU Operator

  • Verified GPU discovery and node labeling

  • Deployed GPU-enabled workloads

  • Measured GPU performance using DCGM

  • Explored GPU passthrough for VMs (optional)

Key Takeaways
  • The GPU Operator automates all NVIDIA software component management

  • GPUs are exposed as schedulable Kubernetes resources (nvidia.com/gpu)

  • GPU Feature Discovery automatically labels nodes with GPU capabilities

  • DCGM Exporter provides GPU metrics for monitoring

  • GPU passthrough enables direct GPU access in virtual machines

Next Steps

If you’ve completed all modules, you now have comprehensive knowledge of:

  • Baseline performance measurement

  • CPU isolation and HugePages configuration

  • Real-time kernel tuning

  • Low-latency virtualization

  • Performance monitoring and validation

  • GPU workload optimization (optional)

Troubleshooting

GPU Operator Not Installing

  1. Check operator subscription:

    oc get subscription -n gpu-operator-resources
    oc describe subscription -n gpu-operator-resources
  2. Check for install plan issues:

    oc get installplan -n gpu-operator-resources
    oc describe installplan -n gpu-operator-resources <installplan-name>

GPUs Not Discovered

  1. Check ClusterPolicy status:

    oc get clusterpolicy -o yaml
    oc describe clusterpolicy
  2. Check driver installation:

    oc get daemonset -n gpu-operator-resources
    oc logs -n gpu-operator-resources -l app=nvidia-operator-validator

GPU Pods Failing to Start

  1. Check pod events:

    oc describe pod <gpu-pod-name>
    oc get events --field-selector involvedObject.name=<gpu-pod-name>
  2. Verify GPU resources are available:

    oc describe node <gpu-node> | grep -i gpu