As an IT professional deeply immersed in the worlds of artificial intelligence and cloud infrastructure, Dominic Jainy has a unique perspective on the forces shaping modern enterprise technology. We sat down with him to unpack the latest findings from the Cloud Native Computing Foundation, exploring the profound shift that has established Kubernetes not just as an orchestration tool, but as the de facto operating system for AI at scale. Our conversation delves into the drivers behind its widespread adoption, the persistent gap between infrastructure readiness and the pace of AI model deployment, and the surprising evolution of primary challenges from technical complexity to organizational dynamics.
Kubernetes is often called the “operating system” for AI. Beyond simple container orchestration, what specific capabilities make it the backbone for intelligent systems at scale? Please share a concrete, step-by-step example of how it manages a complex AI workload in a production environment.
That’s a fantastic observation, and the term “operating system” really captures the essence of its evolution. It’s not just about starting and stopping containers anymore. Kubernetes provides the fundamental primitives for reliability and scale that AI systems desperately need. Think about capabilities like automated scaling based on real-time demand, self-healing to replace failed model-serving instances without human intervention, and sophisticated networking that routes traffic efficiently. Imagine you have a new generative AI model for customer support. First, you package your model and its dependencies into a container image. Next, you define a Kubernetes Deployment to ensure a desired number of model replicas are always running. Then, you set up a Horizontal Pod Autoscaler to automatically increase the number of replicas when support ticket volume surges and scale them back down during quiet hours to save costs. Finally, you expose this entire scalable system through a single, stable network endpoint called a Service. This transforms a static AI model into a dynamic, resilient, and cost-effective production service, which is a world away from simple orchestration.
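To make that walk-through concrete, here is a minimal sketch using the official Python `kubernetes` client; the image name, namespace, ports, and resource numbers are illustrative assumptions for this example, and in practice the same objects are more commonly written as declarative YAML manifests.

```python
# Minimal sketch: Deployment + HorizontalPodAutoscaler + Service for a model server.
# Assumes a reachable cluster, an existing "ml-serving" namespace, and a hypothetical
# container image "registry.example.com/support-model:1.0" listening on port 8080.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
labels = {"app": "support-model"}

# 1. Deployment: keep a desired number of model-serving replicas running.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="support-model", labels=labels),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="model-server",
                image="registry.example.com/support-model:1.0",
                ports=[client.V1ContainerPort(container_port=8080)],
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "500m", "memory": "1Gi"},
                    limits={"cpu": "2", "memory": "2Gi"},
                ),
            )]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment("ml-serving", deployment)

# 2. HorizontalPodAutoscaler: add replicas when load surges, remove them when it is quiet.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="support-model"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="support-model"),
        min_replicas=2,
        max_replicas=20,
        target_cpu_utilization_percentage=70,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler("ml-serving", hpa)

# 3. Service: one stable endpoint in front of however many replicas exist right now.
service = client.V1Service(
    metadata=client.V1ObjectMeta(name="support-model"),
    spec=client.V1ServiceSpec(
        selector=labels,
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service("ml-serving", service)
```

In a real production setup these objects would usually live in version control as YAML and be applied through a GitOps pipeline, but the three pieces and the relationship between them are exactly as described above.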
With production Kubernetes usage jumping to 82%, it’s clear the technology is now mainstream. What are the key business and technical drivers behind this rapid maturation? Can you detail the typical milestones for a team moving from experimental use to running business-critical applications on Kubernetes?
The jump to 82% is truly staggering and signals a massive shift in confidence. The primary business driver is the relentless need for agility and reliability. Enterprises realize they can’t compete if their deployment cycles are months long or if their systems can’t handle unexpected traffic. Kubernetes provides a standardized, battle-tested platform to solve that. Technically, its open-source nature and vendor-neutrality give companies the freedom to run anywhere without being locked in. The journey from experiment to standard practice is quite predictable. It usually starts with a small, innovative team running a non-critical application. The first milestone is successfully running that first workload in production. This builds immense trust. The next milestone is often the formation of a dedicated platform engineering team to build internal tooling and best practices. The final, most critical milestone is when the C-suite sees the platform not as a cost center but as a strategic enabler, and business-critical applications—the ones that directly generate revenue—are migrated over. That’s the point where it becomes foundational, not experimental.
Many organizations now use Kubernetes for AI inference, yet only 7% deploy models daily. What causes this disconnect between infrastructure readiness and deployment frequency? From your experience, what are the primary bottlenecks, and what specific metrics should a team track to close this gap?
This is the central paradox many AI teams are facing right now. The infrastructure is incredibly powerful and ready for rapid iteration, but the MLOps processes haven’t caught up. That 7% figure feels painfully accurate. The bottleneck is rarely the Kubernetes platform itself; it’s the human-centric and often manual processes around the model lifecycle. This includes things like complex model validation and testing, risk and compliance reviews that can take weeks, and a lack of automated GitOps workflows specifically for models. It’s a cultural and procedural debt. To close this gap, teams must start tracking metrics like “model deployment frequency”—just like software teams track software deployment frequency—and “change lead time,” which is the time from a model being retrained to it serving production traffic. Improving these numbers forces you to automate the validation, security scanning, and rollout processes, bridging that frustrating gap between what the platform can do and what the organization is currently doing.
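As a rough illustration of what tracking those two numbers involves, here is a small sketch; the record format, field names, and the thirty-day window are assumptions for the example, not figures from the survey.

```python
# Sketch: compute "model deployment frequency" and "change lead time" from a
# hypothetical log of model releases.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class ModelRelease:
    model_name: str
    retrained_at: datetime   # when the new model version finished training
    deployed_at: datetime    # when it started serving production traffic

def deployment_frequency(releases: list[ModelRelease], window_days: int = 30) -> float:
    """Average production model deployments per day over the trailing window."""
    cutoff = max(r.deployed_at for r in releases) - timedelta(days=window_days)
    recent = [r for r in releases if r.deployed_at >= cutoff]
    return len(recent) / window_days

def change_lead_time(releases: list[ModelRelease]) -> timedelta:
    """Median time from a model being retrained to it serving production traffic."""
    return median(r.deployed_at - r.retrained_at for r in releases)

releases = [
    ModelRelease("support-bot", datetime(2024, 5, 1, 9), datetime(2024, 5, 15, 14)),
    ModelRelease("support-bot", datetime(2024, 5, 20, 9), datetime(2024, 5, 28, 11)),
]
print(f"deploys/day: {deployment_frequency(releases):.2f}")
print(f"lead time:   {change_lead_time(releases)}")
```

The point is less the arithmetic than the discipline: once these numbers are on a dashboard, the manual validation and review steps that dominate the lead time become impossible to ignore.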
The primary barriers to cloud-native adoption are shifting from technical complexity to organizational issues like team dynamics and internal alignment. What specific cultural or communication challenges do you see most often, and what practical steps can platform engineering leaders take to overcome these hurdles?
This is the most significant finding, in my opinion. For years, the conversation was about YAML complexity and service mesh configuration. Now, it’s about people. The most common challenge I see is the friction between a central platform team trying to create standards and application teams who feel their autonomy is being threatened. There’s a natural tension. Platform leaders must shift their mindset from being gatekeepers to being enablers. A practical first step is to stop building a platform for developers and start building it with them. That means embedding platform engineers with application teams so they experience those teams’ pain points firsthand. Another crucial step is to define and communicate a clear value proposition. Don’t just hand them a tool; explain how this internal developer platform reduces their cognitive load, automates toil, and helps them ship features faster and more safely. It becomes a pull, not a push.

Given that 44% of organizations are still not running their AI/ML workloads on Kubernetes, what does this signal about the overall maturity of AI in production? What would be your recommended first three steps for these organizations to begin leveraging Kubernetes for their own models?
That 44% figure tells us that while the leaders are scaling with Kubernetes, a huge portion of the industry is still in the early, often artisanal, stages of AI production. Many are likely running models on single virtual machines or using managed, black-box AI services. It signals that operationalizing AI is still a frontier for many, and the maturity is concentrated at the top. For those ready to make the leap, my first recommendation is to start small. Don’t try to migrate your most complex model first. Pick a simple, low-risk inference task. Second, focus on containerization. Mastering the art of packaging your model, its dependencies, and a serving layer into a clean, reproducible Docker image is a non-negotiable prerequisite. Third, deploy it on a managed Kubernetes service from a cloud provider. This abstracts away the underlying infrastructural complexity and allows your team to focus on the Kubernetes concepts that matter for AI—deployments, services, and scaling—without needing to become deep cluster administrators overnight.
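To illustrate the “serving layer” piece of that second step, here is a minimal sketch of the kind of HTTP wrapper that typically goes inside the container image; FastAPI, the `model.joblib` artifact, and the feature format are assumptions for the example, not a prescribed stack.

```python
# Sketch of a minimal model-serving layer to package into a container image.
# Assumes FastAPI/uvicorn and a scikit-learn model saved as "model.joblib";
# swap in whatever framework and artifact format your team actually uses.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup, not per request

class PredictionRequest(BaseModel):
    features: list[float]

@app.get("/healthz")
def healthz() -> dict:
    # Kubernetes liveness and readiness probes can point here.
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictionRequest) -> dict:
    prediction = model.predict([req.features])[0]
    return {"prediction": float(prediction)}

# Build the image with `uvicorn main:app --host 0.0.0.0 --port 8080` as its
# entrypoint, push it to a registry, and point a Deployment like the earlier
# sketch at that image.
```

Once this much is containerized and reproducible, a managed Kubernetes service only has to be told how many replicas to run and how to probe them, which is exactly the narrow set of concepts a new team should focus on first.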
OpenTelemetry has become a dominant force for observability. How does adopting a vendor-neutral standard change an organization’s approach to monitoring dynamic workloads like AI? Could you share an anecdote where this level of visibility was critical in diagnosing a complex production issue that older tools might have missed?
Adopting a standard like OpenTelemetry fundamentally changes the game. It shifts observability from being a feature of a specific tool to being an intrinsic property of the software itself. Instead of being locked into a single vendor’s agent and dashboard, you instrument your code once and can send that rich telemetry—logs, metrics, and traces—to any tool you choose, now or in the future. This is liberating. I remember a situation where an AI inference service was experiencing intermittent, high-latency spikes. Traditional monitoring showed CPU and memory were fine. But because the service was instrumented with OpenTelemetry, we could trace a single slow request across multiple microservices. The trace revealed that the latency wasn’t in the model execution but in a downstream call to a feature store that was timing out under load. An older, siloed monitoring tool would have just shown the AI service was slow; OpenTelemetry showed us exactly why and where it was slow, which is a critical distinction when every millisecond counts.
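For readers who have not instrumented a service this way, a minimal Python sketch of the idea is below; the service name, collector endpoint, span names, and the stubbed feature-store call are illustrative assumptions, and the real incident involved far more extensive instrumentation.

```python
# Minimal OpenTelemetry tracing sketch for an inference service (Python SDK).
# Assumes an OTLP-compatible collector reachable at localhost:4317.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "inference-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def fetch_features(user_id: str) -> dict:
    # Placeholder for the downstream feature-store call.
    return {"recent_tickets": 3}

def run_model(features: dict, payload: dict) -> dict:
    # Placeholder for actual model execution.
    return {"prediction": "escalate", "features_used": features}

def handle_request(user_id: str, payload: dict) -> dict:
    # One parent span per request; child spans separate the feature lookup
    # from the model execution, so slowness shows up in the right place.
    with tracer.start_as_current_span("inference.request") as span:
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("feature_store.lookup"):
            features = fetch_features(user_id)
        with tracer.start_as_current_span("model.predict"):
            return run_model(features, payload)
```

Because the trace context propagates across service boundaries when the downstream clients are instrumented as well, a timeout in something like a feature store appears inside the same trace as the model execution rather than as an unexplained latency spike on a dashboard.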
What is your forecast for the convergence of AI and cloud-native infrastructure over the next five years?
Over the next five years, the line between AI platforms and cloud-native platforms will completely blur. We won’t talk about running AI on Kubernetes; we’ll talk about intelligent infrastructure where Kubernetes is the assumed foundation. I predict we’ll see the emergence of “AI-native” platforms that treat models, datasets, and feature stores as first-class citizens within Kubernetes, abstracting away much of the complexity teams face today. Concepts from the AI world, like experiment tracking and model versioning, will be deeply integrated into GitOps and CI/CD pipelines. Ultimately, Kubernetes will become the invisible, reliable engine powering a new generation of intelligent applications, much like Linux became the invisible engine of the web. The focus will shift entirely from managing YAML to managing the flow of intelligence through the system.
