Module 8: AI-Assisted Capacity Operations with OpenShift Lightspeed
Duration: 90 minutes
Optional Module
This module is optional. It requires OpenShift Lightspeed to be installed and configured on your cluster. If Lightspeed is not available in your environment, you can skip this module without affecting any other modules.
Learning Objectives
By the end of this module, you will be able to:
- Describe what OpenShift Lightspeed is, how it is configured, and where it lives in the OCP console
- Query Lightspeed as a developer to debug resource sizing, QoS, and HPA problems
- Query Lightspeed as an infrastructure engineer to reason about node density, etcd limits, and fleet architecture
- Use Lightspeed as a forecasting co-pilot to write PromQL queries, estimate capacity runway, and translate metrics into executive language
- Compare how IBM Granite and Qwen3 models respond differently to the same capacity planning question
- Recognise the boundaries of what an AI assistant can and cannot do for capacity planning
What is OpenShift Lightspeed?
OpenShift Lightspeed is an AI chat assistant embedded directly into the OpenShift web console. It answers questions about OpenShift Container Platform — how things work, why something broke, and what commands to run — without leaving the browser.
Unlike a general-purpose chatbot, Lightspeed has access to a curated Retrieval-Augmented Generation (RAG) database of OCP documentation. This means it can give answers grounded in Red Hat product behaviour rather than generic Kubernetes theory.
What Lightspeed Can Help With
| Category | Examples |
|---|---|
| Debugging | "Why is my pod OOMKilled?" |
| Configuration | "How do I enable HPA on a custom metric?" |
| PromQL authoring | "Write a query for CPU request overcommit ratio per namespace" |
| Forecasting assistance | "I have 180 pods per node and grow at 12 pods/week. When do I hit maxPods?" |
| Architecture decisions | "When should I split one large cluster into multiple smaller ones?" |
What Lightspeed Cannot Do
Lightspeed is a conversational assistant, not an autonomous agent. It cannot apply changes to your cluster, execute commands on your behalf, or take action autonomously; it answers questions and, when cluster interaction is enabled, reads cluster state.
Cluster Interaction (Technology Preview)
Lightspeed includes an optional Model Context Protocol (MCP) server that gives it read-only access to your cluster’s live Kubernetes state. When enabled, the Lightspeed Operator injects an openshift-mcp-server sidecar container into the lightspeed-app-server pod and grants it read-only in-cluster API access.
In this workshop, cluster interaction is enabled on your student cluster. The ocp4_workload_lightspeed role sets introspectionEnabled: true in OLSConfig, which causes the Operator to restart the app server with the MCP sidecar active.
With the MCP active, Lightspeed can answer questions about your actual cluster resources — pods, nodes, deployments, events, configmaps, and routes — instead of giving generic answers based only on its training data. Questions like "which of my namespaces is most over-requested?" or "are any nodes under memory pressure?" will be answered using live Kubernetes API data.
Lab 8E exercises this capability directly.
How Lightspeed is Provisioned
Your student cluster receives OpenShift Lightspeed automatically during provisioning. The platform team uses an AgnosticD v2 workload role called ocp4_workload_lightspeed to install and configure it. Understanding this provisioning model helps you manage Lightspeed environments as a platform engineer.
The AgnosticD Workload Pattern
AgnosticD v2 deploys workloads by listing role names in a workloads: array. For the Capacity Planning Workshop, student-compact-aws.yml includes:
workloads:
- agnosticd.core_workloads.ocp4_workload_cert_manager
- agnosticd.core_workloads.ocp4_workload_openshift_gitops
- ocp4_workload_capacity_planning_workshop
- ocp4_workload_lightspeed # <-- Module 8
Each role receives ACTION: provision and applies its changes idempotently to the cluster. Running agd provision -c openshift-workloads re-runs all workloads safely.
What ocp4_workload_lightspeed Does
The role performs these eight steps in order:
| Step | What It Does |
|---|---|
| 1. Namespace | Creates the `openshift-lightspeed` namespace. |
| 2. OperatorGroup | Creates an OperatorGroup in OwnNamespace mode. This is required — the Lightspeed Operator does not support AllNamespaces install mode. |
| 3. Subscription | Creates a Subscription for the Lightspeed Operator. |
| 4. Wait for operator | Polls until the Operator reports ready before continuing. |
| 5. LiteMaaS Secret | Creates a Kubernetes Secret (`litemaas-credentials`) holding the LiteMaaS API token. |
| 6. OLSConfig | Applies the cluster-scoped `OLSConfig` resource that points Lightspeed at the LiteMaaS models. |
| 7. MCP RBAC | When `introspectionEnabled: true`, grants the MCP sidecar read-only Kubernetes API access. |
| 8. Wait for service | Polls until the `lightspeed-app-server` service is ready before marking the workload complete. |
The LiteMaaS Token Flow
rhpds.litellm_virtual_keys → agnosticd_user_info.litemaas_api_key
↓
ocp4_workload_lightspeed_litemaas_api_token (in student-01-workloads.yml)
↓
Kubernetes Secret: litemaas-credentials (in openshift-lightspeed namespace)
↓
OLSConfig.spec.llm.providers[0].credentialsSecretRef
↓
lightspeed-app-server → LiteMaaS API → granite-3-2-8b-instruct
The rhpds.litellm_virtual_keys Ansible collection creates a per-GUID virtual key with the lab-prod package (Granite + Mistral, 90-day TTL). In production, this key flows automatically through agnosticd_user_info. For local development, pass it directly with -e ocp4_workload_lightspeed_litemaas_api_token=sk-….
Local Development
Platform engineers can re-install or update Lightspeed on any cluster without running a full agd cycle:
# From capacity-planning-lab-guide/ansible/
export KUBECONFIG=/path/to/kubeconfig
ansible-playbook setup-lightspeed.yml \
-e ocp4_workload_lightspeed_litemaas_api_token=<your-key> \
-e openshift_cluster_ingress_domain=apps.<guid>.<base_domain>
This dev/test wrapper calls ocp4_workload_lightspeed with ACTION: provision — the same role the RHDP platform runs in production.
Where to Find Lightspeed in the OCP Console
Lightspeed appears as a lightbulb icon in the top-right toolbar of the OpenShift web console, next to the question-mark help icon. Click it to open the chat panel.
Before starting the labs, open your student cluster console and log in with your assigned credentials. If the lightbulb icon is not visible, the Lightspeed Operator may still be starting up. Wait 2–3 minutes and refresh the page.
The Models in This Workshop
The Lightspeed instance on your cluster has been configured to use two models via the RHDP LiteMaaS AI service:
- Primary (default): `qwen3-14b` — Qwen 3 14B from Alibaba. Strong general reasoning and — critically — reliable MCP tool selection. This is the default because it correctly calls Kubernetes API tools by name when cluster interaction is enabled.
- Comparison (Lab 8D): `granite-3-2-8b-instruct` — IBM Granite 3.2 8B, a Red Hat and IBM enterprise model trained on Red Hat documentation. Faster and optimised for OCP knowledge questions, but inconsistent with MCP tool calling in this Technology Preview release.
Both models are available without any local GPU. They are served by the RHDP Model-as-a-Service (LiteMaaS) platform over an OpenAI-compatible API.
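Because the models are served over an OpenAI-compatible API, any standard client can call them directly, which is handy for scripting capacity reports outside the console. A minimal sketch of assembling such a request by hand; the base URL, key, and endpoint path here are placeholder assumptions modelled on the OpenAI convention, not verified workshop values.

```python
import json

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Assemble an OpenAI-style chat-completion request as (url, headers, body)."""
    url = f"{base_url}/v1/chat/completions"     # OpenAI-convention path (assumed)
    headers = {
        "Authorization": f"Bearer {api_key}",   # your LiteMaaS virtual key goes here
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body

# Placeholder endpoint and key -- substitute your per-GUID values.
url, headers, body = build_chat_request(
    "https://litemaas.example.com", "sk-placeholder",
    "granite-3-2-8b-instruct", "Why is my pod OOMKilled?")
```

Sending this request (with `urllib.request`, `requests`, or `curl`) returns a standard chat-completion JSON response from the selected model.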
Why Qwen 3 as the default? During validation of this workshop, Granite 3.2 8B consistently misnamed MCP tools, so Qwen 3 14B is the default whenever cluster interaction is enabled.
Lab 8A: Developer Queries — The "Smart Debugger"
20 minutes — Revisits Module 3 topics
In Module 3 you experienced OOMKilled pods, CPU throttling, QoS classes, and HPA configuration hands-on. In this lab you will ask Lightspeed about the exact apps you worked with — and compare its answers against what you observed directly.
All interactions in Labs 8A, 8B, and 8C happen inside the Lightspeed chat panel in the OCP web console. There are no terminal commands to run. Open the chat panel now; the lightbulb icon is in the top-right toolbar.
Query 1: OOMKilled diagnosis
Type the following prompt into the Lightspeed chat exactly as shown:
I have a pod called besteffort-app in the capacity-workshop namespace.
It has no memory requests or limits set. How do I check whether it has
been OOMKilled, and what Prometheus metric should I use to determine an
accurate memory request for it?
What to look for in the response
A good answer will mention checking the pod's last termination state for the OOMKilled reason (for example via `oc describe pod besteffort-app`) and a working-set memory metric such as `container_memory_working_set_bytes`, sized from a percentile (for example P95) over a representative window. Does Lightspeed's answer match what you observed directly in Module 3?
|
Using this prompt in your own environment
Replace |
Query 2: QoS classes in plain language
I have three apps in my capacity-workshop namespace:
- guaranteed-app: CPU and memory requests equal limits (200m CPU / 256Mi)
- burstable-app: has a 100m CPU request but no limit
- besteffort-app: no resource requests or limits set at all
Explain which QoS class each app is in, and what happens to each one
when the node runs low on memory. Which gets evicted first?
Reflection
Compare this response to what you saw in Module 3 when you deliberately induced memory pressure. Notice whether Lightspeed's eviction ordering matches the behaviour you observed.
Using this prompt in your own environment
Replace the three app names and resource values with your own deployments. Any namespace with a mix of Guaranteed, Burstable, and BestEffort pods works equally well.
Query 3: HPA with a custom Prometheus metric
My load-generator app in the capacity-workshop namespace has CPU requests
of 100m and limits of 500m. I set up a basic CPU HPA on it in a previous
lab. Show me how to change it to scale based on a custom Prometheus metric
— for example, the number of HTTP requests per second — instead of CPU.
Show me the YAML.
What to look for
The answer should reference the `autoscaling/v2` API with a non-CPU metric type (`Pods` or `External`) and the additional component needed to expose Prometheus metrics to the HPA (such as the Custom Metrics Autoscaler or a Prometheus adapter). If the model gives a generic answer that ignores the existing CPU HPA, follow up with: "I already have a CPU HPA on load-generator. Show me the full updated YAML that replaces the CPU metric with the custom metric."
Using this prompt in your own environment
Replace the app name, namespace, and resource values with your own.
Query 4: CPU throttling PromQL
Write a PromQL query that shows the CPU throttling rate as a percentage
for every pod in my capacity-workshop namespace. My cpu-throttle-demo pod
has a 200m CPU limit but runs a burn loop that tries to use far more.
I want to be able to see that throttling clearly in the query output.
100 * sum by (pod, namespace) (
rate(container_cpu_cfs_throttled_periods_total{namespace="capacity-workshop"}[5m])
) /
sum by (pod, namespace) (
rate(container_cpu_cfs_periods_total{namespace="capacity-workshop"}[5m])
)
Test every PromQL query Lightspeed writes before using it in a dashboard. Run it in the OCP console under Observe → Metrics. Lightspeed can produce syntactically correct queries that don't match your actual metric label names, so verify that the metric and label names return data in your environment before trusting the result.
|
Using this prompt in your own environment
Replace |
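You can sanity-check any throttling percentage Lightspeed produces by reproducing the arithmetic on raw counter deltas. A hedged sketch using the periods-based counters (`container_cpu_cfs_throttled_periods_total` over `container_cpu_cfs_periods_total`), with illustrative numbers rather than real cluster data:

```python
def throttle_percent(throttled_periods: float, total_periods: float) -> float:
    """Percentage of CFS scheduling periods in which the container was throttled."""
    if total_periods == 0:
        return 0.0
    return 100.0 * throttled_periods / total_periods

# A pod with a 200m limit running a burn loop hits its quota in most periods:
print(throttle_percent(2700, 3000))  # 90.0
```

If the burn loop saturates the quota, the percentage approaches 100; a well-sized limit keeps it near zero.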
Lab 8B: Infrastructure Engineer Queries — The "Fleet Advisor"
20 minutes — Revisits Modules 4 & 5
In Modules 4 and 5 you worked through node density mathematics, etcd constraints, and RHACM observability. Now you will ask Lightspeed the operational questions an infrastructure engineer faces — using your actual cluster topology as the context.
Query 1: Increasing maxPods safely
My OpenShift cluster has 3 nodes that are all running as combined
control-plane and worker nodes. I want to support more workloads in
the capacity-workshop namespace. How do I safely raise the maxPods
limit above the default 250 on OpenShift 4.21, and what are the specific
risks of doing this on nodes that run both control-plane and workload pods?
What to look for
A complete answer will cover creating a `KubeletConfig` custom resource with the new `maxPods` value, targeting it at a MachineConfigPool, and the operational risks of higher pod density. Does Lightspeed flag that this triggers a node drain/reboot via MachineConfig? That is critical on a 3-node cluster — the cluster has no dedicated workers to absorb the load during a rolling restart.
Using this prompt in your own environment
Replace the node count and role description with your own cluster topology.
Query 2: etcd sizing and monitoring
Write a PromQL query to check the current etcd database size on my
OpenShift cluster and estimate how quickly it is growing. My cluster
has 3 control-plane nodes. I want to know when I should defrag or
plan for additional control plane capacity.
# Current etcd DB size in bytes
etcd_mvcc_db_total_size_in_bytes
# Growth rate over 7 days (bytes per day)
deriv(etcd_mvcc_db_total_size_in_bytes[7d]) * 86400
Using this prompt in your own environment
The PromQL queries are cluster-agnostic — they work on any OpenShift cluster regardless of size. Update the control-plane node count to match your own topology.
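To turn the growth-rate query into a planning date, extrapolate the current size linearly toward a threshold. A sketch of that arithmetic; the 8 GiB default threshold mirrors the etcd backend quota commonly used on OpenShift control planes, and is an assumption to verify against your own cluster:

```python
GIB = 1024 ** 3

def days_until_threshold(current_bytes: float,
                         growth_bytes_per_day: float,
                         threshold_bytes: float = 8 * GIB) -> float:
    """Days until the etcd DB reaches the threshold at the current linear growth rate."""
    if growth_bytes_per_day <= 0:
        return float("inf")  # flat or shrinking DB: no projected exhaustion
    return (threshold_bytes - current_bytes) / growth_bytes_per_day

# Illustrative numbers: 1.5 GiB today, growing 0.05 GiB/day.
print(round(days_until_threshold(1.5 * GIB, 0.05 * GIB), 1))  # 130.0
```

Feed it the two PromQL results above (current size and `deriv(...) * 86400`) to get a live defrag/expansion horizon.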
Query 3: Cluster architecture trade-offs
I run a 3-node OpenShift training cluster where all nodes are combined
control-plane and workers, and I use a single capacity-workshop namespace
for all student workloads. As I scale to more students and cohorts, when
does it make more sense to keep growing this single cluster versus deploying
separate per-student clusters? What are the operational trade-offs?
Reflection
This question has no single right answer. Notice whether Lightspeed frames the decision around factors like operational overhead, blast radius and isolation, and per-cluster control-plane overhead. Compare this to the Module 4 discussion of when fleet federation makes sense. The 3-node combined topology is a real constraint that a generic "500-node cluster" scenario ignores.
Using this prompt in your own environment
Replace the node count and use-case description with your own. The federation-vs-consolidation trade-off analysis is relevant at any scale — substitute your own workload pattern.
Query 4: RHACM capacity dashboard PromQL
I connected my student cluster to an RHACM hub in Module 5 and I'm
building a Grafana capacity dashboard on the hub. Write three PromQL
queries for my capacity-workshop namespace:
1. CPU request overcommit ratio per namespace across all clusters
2. Memory utilisation versus allocated capacity per cluster
3. The number of pods per node to identify density hotspots
Use the cluster label so each cluster appears separately in the dashboard.
When Lightspeed writes multi-cluster queries for RHACM, it should reference the `cluster` label that distinguishes each managed cluster on the hub. If Lightspeed omits the `cluster` label, ask it to add it so each cluster appears as a separate series in Grafana.
Using this prompt in your own environment
Replace the namespace with your own.
Lab 8C: Forecasting Assistant — The "Planning Copilot"
30 minutes — Revisits Modules 1, 2, & 7
This lab is the capstone of the workshop. You will use Lightspeed to operationalise the forecasting models from Modules 1 and 2, stress-test them against the Black Friday scenario from Module 6, and produce the executive language from Module 7 — all with AI assistance.
How this lab works
The prompts below use the numbers you gathered for your capacity-workshop cluster in earlier modules; substitute your own values where they differ.
Query 1: Pod Velocity PromQL (Module 2 revisited)
Write the PromQL query that calculates Pod Velocity — the number of new
pod deployments per week — across all namespaces for the past 90 days.
This is the foundation of our capacity forecasting model from Module 2.
Expected output
Lightspeed should produce a PromQL query that counts new pod creations per week across all namespaces over the past 90 days. The key concept from Module 2 is that Pod Velocity (new pods/time) is a better predictor of node demand than raw CPU trending for microservices architectures. Ask Lightspeed to explain why this metric matters for capacity forecasting to validate it understands the concept.
Query 2: Runway calculation with the math shown
My capacity-workshop cluster has 3 nodes, each currently running around
65 pods. The Pod Velocity from my Module 2 Prometheus dashboard is
approximately 14 new pods per week. The default maxPods per node is 250.
How many weeks until I hit maxPods and need to add a worker node?
Show your calculation step by step, then write a PromQL expression that
computes this runway automatically from live cluster data.
Current available capacity:
3 nodes × 250 maxPods = 750 total pod slots
Currently used: 3 × 65 = 195 pods
Available: 750 - 195 = 555 pod slots
At 14 new pods/week:
Weeks until full: 555 ÷ 14 = 39.6 weeks (~40 weeks)
PromQL for live runway (in weeks):
(
sum(kube_node_status_allocatable{resource="pods"})
- sum(kube_pod_info{node!=""})
) / <pod_velocity_per_week>
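The worked calculation above generalises to a small function you can rerun with your own numbers (same figures as the example: 3 nodes, 250 maxPods, 195 pods, 14 pods/week):

```python
def runway_weeks(nodes: int, max_pods: int, current_pods: int,
                 pods_per_week: float) -> float:
    """Weeks of pod-slot capacity remaining at the current Pod Velocity."""
    free_slots = nodes * max_pods - current_pods  # 3*250 - 195 = 555
    return free_slots / pods_per_week

print(round(runway_weeks(3, 250, 195, 14), 1))  # 39.6
```

Swapping in your own node count, pod count, and velocity reproduces the same step-by-step maths Lightspeed should show.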
Using this prompt in your own environment
Substitute your own node count, current pod count, and Pod Velocity. If your Module 2 dashboard showed a different velocity, use that number — the step-by-step maths and the PromQL pattern are the same regardless of scale.
Capacity buffer rule of thumb
Never plan to scale at 100% capacity. Ask Lightspeed: "What safety buffer should I apply to this runway estimate for a production platform?" A well-calibrated answer will recommend a 20–30% buffer (scale at 70–80% capacity), accounting for burst traffic, deployment surges, and the time required to provision new nodes (especially in cloud environments, where new node provisioning takes 5–15 minutes).
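Applied to the ~40-week runway from Query 2, a 20% buffer turns the raw runway into an action date: act at 80% of the runway rather than at exhaustion. A sketch (the week-32 result matches the safety-buffer figure used in Query 5):

```python
def action_week(runway_weeks: float, buffer_fraction: float = 0.2) -> float:
    """Week by which to provision, leaving buffer_fraction of the runway in reserve."""
    return runway_weeks * (1 - buffer_fraction)

print(round(action_week(39.6)))  # 32
```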
Query 3: Grafana panel JSON for capacity countdown
Generate a Grafana panel JSON snippet that shows "days until maxPods is
reached" as a single-stat panel per node. This should read from Prometheus
and update automatically. Use a green/yellow/red threshold:
- green: > 60 days
- yellow: 30-60 days
- red: < 30 days
Assume Pod Velocity of 14 pods/week and maxPods of 250.
Lightspeed will produce a JSON panel definition. Before pasting it into your RHACM Grafana dashboard, check the data source name, the threshold values, and the PromQL expression against your environment. Lightspeed often produces valid JSON with correct structure but wrong data source names — a quick edit is usually all that's needed.
Query 4: Black Friday buffer planning (Module 6 revisited)
My production cluster has a CPU request overcommit ratio of 2.3× (meaning
applications have requested 2.3× the actual allocatable CPU). During Black
Friday, we expect a 10× traffic spike over baseline.
What are the risks of running at 2.3× overcommit during a 10× traffic event,
and how much additional node capacity should I plan to provision before the
event? Use the capacity planning framework from the answer.
Reflection
This is the Module 6 Black Friday Chaos Game scenario expressed as an advisory question rather than a live simulation. Lightspeed should discuss the risks of high overcommit under load (CPU throttling and pod evictions) and give a concrete estimate of the additional capacity to provision before the event.
Query 5: Executive summary (Module 7 revisited)
Based on the following capacity data from our capacity-workshop OpenShift
cluster, write a one-paragraph executive summary suitable for a quarterly
budget request:
- Current cluster: 3 combined control-plane/worker nodes on AWS (m7a.2xlarge)
- Active workloads: capacity-workshop namespace with ~195 pods running
- Pod Velocity: 14 new pods/week (growing 8% month-over-month)
- Capacity runway: ~40 weeks at current growth before hitting maxPods
- Action required: provision additional worker nodes before week 32 (safety buffer)
- Estimated cost: $3,200/month per additional m7a.2xlarge node on AWS
Write this for an audience of finance and business leadership, not engineers.
Avoid technical jargon. Focus on cost, risk, and timeline.
Compare to your Module 7 pitch
In Module 7 you wrote a capacity pitch manually. Notice how Lightspeed's version differs in structure, tone, and what it chooses to emphasise. If you are not satisfied with the first output, try this follow-up: "Rewrite the summary to emphasise the risk of not provisioning the nodes. What business impact occurs at week 40 if we do nothing?"
Using this prompt in your own environment
Replace the node type, pod count, velocity, and cost figure with your own cluster's numbers. The executive summary pattern — current state, trend, runway, action, cost — is reusable for any capacity planning communication regardless of environment.
Optional: Live cluster queries (cluster interaction required)
If your instructor has enabled cluster interaction on your Lightspeed instance, the following prompts will return answers based on your actual cluster state rather than hypothetical numbers.
Look at my current cluster state. Which three namespaces have the highest
ratio of CPU requests versus actual CPU usage over the past 24 hours?
Based on current pod counts across all nodes and the deployment velocity
you can observe in cluster events, estimate how many weeks of node capacity
remain before I need to add a worker node.
Cluster interaction is a Technology Preview feature as of OpenShift Lightspeed 1.0. Answers from live cluster queries are more specific but also more dependent on model quality. Your default model, `qwen3-14b`, was selected for its reliable MCP tool calling (see Lab 8D).
Lab 8D: Model Comparison — Granite vs. Qwen3
20 minutes
Both models are pre-configured on your cluster. To switch: click the model name shown just above the chat input field and select from the dropdown. Your default is `qwen3-14b`; select `granite-3-2-8b-instruct` to run the comparison.
The comparison exercise
Pick one of the following prompts — ideally one that gave you a long or complex answer in Lab 8C. Run it against both models and compare side by side.
Suggested prompts for comparison:
- The runway calculation from Lab 8C Query 2 (maths + PromQL)
- The Black Friday buffer planning from Lab 8C Query 4
- The executive summary from Lab 8C Query 5
For each response, evaluate against this rubric:
| Criterion | Granite 3.2 8B | Qwen3 14B |
|---|---|---|
| Answers the question directly without unnecessary preamble | ☐ | ☐ |
| Shows calculation steps (for maths questions) | ☐ | ☐ |
| Uses Red Hat / OCP-specific terminology correctly | ☐ | ☐ |
| Provides working PromQL (for metric queries) | ☐ | ☐ |
| Explains reasoning, not just the answer | ☐ | ☐ |
| Response length is appropriate (not too short, not padded) | ☐ | ☐ |
| Executive language tone (for Query 5) | ☐ | ☐ |
Tick each box (☑) or leave it empty (☐) based on what you observe.
Discussion: Why does model choice matter?
Model selection is a capacity planning decision. Consider the trade-offs: a larger model typically reasons better and calls MCP tools more reliably, but costs more per token and responds more slowly. For a production Lightspeed deployment, the right choice depends on your team's query mix, latency tolerance, and token budget. As the platform engineer responsible for this deployment, you should evaluate both models against your team's actual query patterns — exactly as you did in this lab.
Lab 8E: Live Cluster Queries via MCP — Asking About Your Actual Cluster
10 minutes — requires cluster interaction (pre-enabled on your student cluster)
In Labs 8A–8D, Lightspeed answered every question using its RAG knowledge base — authoritative, but hypothetical. Now that your cluster has the MCP server active, Lightspeed can call the Kubernetes API to read live resource state before forming its answer.
This lab requires the MCP sidecar (cluster interaction) to be active in the `lightspeed-app-server` pod; it is pre-enabled on your student cluster. If responses stay generic, verify that the pod in the `openshift-lightspeed` namespace is running with the sidecar before continuing.
Query 1: What workloads are running in the capacity-workshop namespace?
Type the following into the Lightspeed chat panel:
What deployments and pods are running in the capacity-workshop namespace?
Are any of them in a non-Running state?
What to look for
With MCP enabled, Lightspeed should return the actual deployment names from your cluster rather than a generic answer. Compare this to the RAG-only responses in Labs 8A–8C — those returned general Kubernetes knowledge. This response should reference pods you actually deployed in earlier modules. If Lightspeed gives a generic answer about checking deployments, cluster interaction may still be initialising — wait 2 minutes and try again.
Query 2: Summarise this cluster’s node capacity
How many nodes does this cluster have? What is the total allocatable CPU and
memory across all nodes? Are any nodes in a NotReady or degraded state?
What to look for
Lightspeed should call the Kubernetes API to list nodes and return real node names, actual CPU/memory values, and accurate Ready conditions. This is the baseline data you need before running any capacity projection — and getting it via chat is faster than running the equivalent `oc` commands yourself.
Query 3: Capacity headroom from live data
Based on the nodes in this cluster, estimate how much additional workload capacity
is available. What fraction of allocatable CPU and memory is currently requested?
Reflection
Compare this response to what Lightspeed gave you in Lab 8C Query 2 (the runway calculation). In Lab 8C you supplied hypothetical numbers. Here, Lightspeed reads the actual cluster state. Notice whether the live pod counts and allocatable figures match the assumptions you supplied earlier. This is the core value proposition of MCP introspection: AI-assisted capacity operations against the real environment rather than a hypothetical scenario.
Lab 8 Summary: AI-Assisted Capacity Operations
You have now completed all five labs. Here is what you accomplished:
Lab 8A — Developer Queries
✅ Diagnosed OOMKilled events using Lightspeed as a debugging assistant
✅ Generated QoS explanations and HPA YAML configurations
✅ Authored CPU throttling PromQL with AI assistance
Lab 8B — Infrastructure Engineer Queries
✅ Got step-by-step guidance on increasing maxPods safely
✅ Generated etcd monitoring queries and growth forecasts
✅ Explored cluster architecture trade-offs through conversational AI
Lab 8C — Forecasting Assistant
✅ Rebuilt the Module 2 Pod Velocity model with AI-assisted PromQL
✅ Calculated capacity runway with explicit maths shown by the model
✅ Generated a Grafana panel specification for capacity countdown
✅ Applied Black Friday buffer planning via conversational analysis
✅ Produced an executive capacity summary from raw data
Lab 8D — Model Comparison
✅ Compared Granite 3.2 8B and Qwen3 14B on the same capacity question
✅ Evaluated model choice as a capacity and cost trade-off
Lab 8E — Live Cluster Queries via MCP
✅ Asked Lightspeed questions that returned real cluster data via the MCP metrics toolset
✅ Verified that MCP introspection replaces hypothetical scenarios with actual numbers
Prompt Engineering Tips for Capacity Planning
These patterns consistently improve Lightspeed responses on capacity topics: state concrete numbers (node counts, pod counts, growth rates) rather than asking in the abstract; name the exact namespace and workload; ask the model to show its calculation step by step; request PromQL you can verify under Observe → Metrics; and state the intended audience for any written output.
Key Takeaways
- Lightspeed is a force multiplier, not a replacement for understanding the capacity planning concepts from Modules 1–7 — the quality of your queries is directly proportional to the depth of your domain knowledge
- Granite 3.2 8B is optimised for Red Hat product questions; Qwen3 14B is stronger for complex reasoning and tool-use
- PromQL assistance is one of the highest-value Lightspeed use cases — it removes the barrier of memorising metric names and query syntax
- Cluster interaction (MCP introspection) is enabled on your cluster — Lightspeed's `openshift-mcp-server` sidecar gives it read-only Kubernetes API access, enabling capacity questions answered against real cluster state rather than hypothetical data
- Model selection is a capacity planning decision: token cost, latency, and capability all factor in just like infrastructure sizing decisions
Workshop Complete
Congratulations — you have completed all eight modules of the Strategic Capacity Planning & Forecasting for OpenShift at Scale workshop.
You now have a complete toolkit for running data-driven capacity operations:
| Module | Capability Gained |
|---|---|
| 1 — Planning Horizon | Established baselines using the three-horizon planning framework |
| 2 — Mathematics of Forecasting | Pod Velocity model and predictive dashboard construction |
| 3 — Developer Track | QoS classes, right-sizing with Prometheus P95, HPA configuration |
| 4 — Infrastructure Track | Node density mathematics, etcd limits, kubeletConfig tuning |
| 5 — Fleet Observability | RHACM multi-cluster Thanos dashboards and metric allowlists |
| 6 — Integration Challenge | Real-time incident decision-making under Black Friday pressure |
| 7 — Strategic Roadmapping | 12-month capacity roadmap and executive communication |
| 8 — AI-Assisted Operations | Lightspeed as a debugging co-pilot, PromQL assistant, forecasting advisor, and live-cluster MCP query engine |