With deep expertise in AI, machine learning, and blockchain, Dominic Jainy has dedicated his career to exploring how advanced technologies can transform industries. Today, he shares his insights on the evolving landscape of multi-cloud management, where the convergence of massive data processing and intelligent automation is revolutionizing how enterprises maintain reliability and control costs. Our conversation will touch upon the practicalities of achieving a unified view across disparate cloud environments, the mechanics of automated issue remediation, and what the future holds for truly autonomous IT operations.
Dynatrace now offers a unified view across AWS, Azure, and Google Cloud to manage operational complexity. Beyond a single dashboard, how does this help platform teams manage reliability and costs? Could you share a specific example of how this prevents a common multi-cloud issue?
It’s about moving beyond just seeing everything to truly understanding it. A single dashboard is table stakes, but what platform teams desperately need is context. The unified view provides a real-time dependency map that shows precisely how a service running on Azure might be impacting a user-facing application hosted on AWS. This visibility is critical for preventing the classic multi-cloud “blame game.” For instance, we often see situations where a retail application on Google Cloud suddenly slows down. The Google Cloud team sees no issues, but what they can’t see is that a critical inventory microservice the application calls, which runs in Azure Kubernetes Service, is experiencing resource contention. Our platform immediately visualizes that cross-cloud dependency, pinpointing the Azure service as the root cause. This turns a potential multi-day troubleshooting nightmare, involving two separate teams burning time and money, into a precise, actionable insight that can be resolved in minutes.
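At its core, that kind of cross-cloud pinpointing is a walk up a live dependency graph. The sketch below is purely illustrative: the service names, providers, and health flags are hypothetical, and it does not use any Dynatrace API.

```python
# Illustrative only: a toy cross-cloud dependency walk, not a Dynatrace API.
# Service names, providers, and health flags are hypothetical.

DEPENDENCIES = {
    "retail-frontend (GCP)": ["checkout-svc (GCP)", "inventory-svc (Azure AKS)"],
    "checkout-svc (GCP)": [],
    "inventory-svc (Azure AKS)": ["inventory-db (Azure)"],
    "inventory-db (Azure)": [],
}

UNHEALTHY = {"inventory-svc (Azure AKS)"}  # e.g. flagged for resource contention


def find_root_causes(service, deps, unhealthy, seen=None):
    """Walk upstream dependencies and return the deepest unhealthy services."""
    seen = seen if seen is not None else set()
    causes = []
    for dep in deps.get(service, []):
        if dep in seen:
            continue
        seen.add(dep)
        downstream = find_root_causes(dep, deps, unhealthy, seen)
        if downstream:
            causes.extend(downstream)      # blame the deeper dependency first
        elif dep in unhealthy:
            causes.append(dep)
    return causes


print(find_root_causes("retail-frontend (GCP)", DEPENDENCIES, UNHEALTHY))
# ['inventory-svc (Azure AKS)'] -> the culprit sits in Azure, not Google Cloud
```

The point is simply that once the topology crosses cloud boundaries, the traversal does too, so the blame lands on the Azure service rather than the Google Cloud frontend.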
Your platform includes automated remediation to resolve issues as they occur. Can you walk us through how this function would identify a cross-cloud performance risk, and what specific steps it takes to resolve it without manual intervention?
Certainly. The process is a seamless flow from detection to resolution, driven by our AI engine. First, the system detects an anomaly using its built-in health indicators—not just a technical metric like high CPU, but a business-level issue like degrading transaction performance. Imagine a financial service application with its frontend on AWS and its AI-powered fraud detection model running on Azure AI Foundry. The platform might notice a slowdown in transaction approvals. Instead of just firing off a vague alert, it uses the Smartscape dependency graph to trace the issue across the cloud boundary directly to the Azure AI service. The AI engine then performs a root-cause analysis and determines the service is overloaded. At this point, the automation kicks in. It triggers a pre-configured remediation workflow—perhaps automatically scaling up the Azure resources or rerouting traffic to a healthy, replicated instance. This entire sequence happens in seconds, often before the operations team is even aware a problem was brewing, completely eliminating the manual effort of diagnosis and response.
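Stripped to its essentials, that detect-trace-remediate flow looks something like the sketch below. The thresholds, service names, and the scaling and rerouting helpers are hypothetical stand-ins, not Dynatrace or Azure SDK calls.

```python
# A minimal sketch of the detect -> diagnose -> remediate loop described above.
# All thresholds, service names, and remediation actions are invented examples.

APPROVAL_LATENCY_SLO_MS = 500          # business-level health indicator


def detect(latency_ms):
    """Flag a business-level anomaly: transaction approvals are slowing down."""
    return latency_ms > APPROVAL_LATENCY_SLO_MS


def trace_root_cause(dependency_graph, degraded_service):
    """Follow the dependency graph across the cloud boundary to the culprit."""
    # e.g. "payments-frontend (AWS)" -> "fraud-model (Azure AI)"
    return dependency_graph[degraded_service]


def remediate(root_cause, saturation):
    """Run a pre-configured workflow: scale out, or reroute to a replica."""
    if saturation > 0.85:
        return f"scale out {root_cause}"
    return f"reroute traffic from {root_cause} to a healthy replica"


graph = {"payments-frontend (AWS)": "fraud-model (Azure AI)"}

if detect(latency_ms=740):
    culprit = trace_root_cause(graph, "payments-frontend (AWS)")
    print(remediate(culprit, saturation=0.92))  # -> "scale out fraud-model (Azure AI)"
```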
The Grail data lakehouse and Smartscape dependency graph are central to your multi-cloud operations. How do these components work together to provide context for an alert? Please describe a scenario where this combination provides insights that a standard monitoring tool might miss.
Think of Grail as the massive, long-term memory and Smartscape as the real-time consciousness of the system. Grail is our data lakehouse, which ingests and indexes immense volumes of telemetry and metadata from every corner of the AWS, Azure, and Google Cloud environments. Smartscape then uses this rich data to build and continuously update a live, topological map of every single dependency. A standard monitoring tool might send an alert saying, “CPU on an AWS instance is at 95%.” That’s data, but it’s not insight. In that same scenario, Grail provides the historical performance data, while Smartscape shows that this specific instance supports a critical, revenue-generating service that is, in turn, dependent on a database in Google Cloud that just received a major schema update. The platform can then correlate the CPU spike with the database change, providing the “why.” This transforms a generic alert into a highly contextualized insight: “The recent database update in Google Cloud has caused an unexpected load on this dependent AWS service, putting customer transactions at risk.” That’s a level of contextual understanding a standard tool, looking at clouds in isolation, would completely miss.
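The correlation step can be pictured with a toy example: match a resource anomaly against recent change events on its upstream dependencies. Every timestamp, service, and event below is invented for illustration; this is not Grail or Smartscape code.

```python
# Illustrative correlation sketch: pair a resource anomaly with the most recent
# upstream change event. All data and service names are hypothetical.
from datetime import datetime, timedelta

anomaly = {"service": "orders-api (AWS)", "metric": "cpu", "value": 0.95,
           "at": datetime(2024, 5, 2, 14, 10)}

# change events harvested from every cloud in the estate
changes = [
    {"service": "orders-db (GCP)", "event": "schema update",
     "at": datetime(2024, 5, 2, 13, 55)},
    {"service": "billing-svc (Azure)", "event": "deployment",
     "at": datetime(2024, 5, 1, 9, 0)},
]

# upstream dependencies of the anomalous service (the topology-map role)
upstream = {"orders-api (AWS)": ["orders-db (GCP)"]}


def likely_causes(anomaly, changes, upstream, window=timedelta(minutes=30)):
    """Return upstream changes that landed shortly before the anomaly."""
    return [c for c in changes
            if c["service"] in upstream.get(anomaly["service"], [])
            and timedelta(0) <= anomaly["at"] - c["at"] <= window]


print(likely_causes(anomaly, changes, upstream))
# -> the GCP schema update, turning "CPU at 95%" into an explanation
```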
A key goal for some customers, like SBS Software, is achieving “fully autonomous operations.” What does this look like in practice for a cloud operations team day-to-day? Could you outline the key capabilities or metrics a team would need to measure its progress toward this goal?
For a cloud operations team, “fully autonomous operations” means a fundamental shift from being reactive firefighters to proactive, strategic innovators. Day-to-day, it means their mornings aren’t spent triaging a flood of overnight alerts. Instead, the platform has already identified, diagnosed, and, in many cases, resolved potential issues automatically. The team’s focus shifts to higher-value work, like optimizing architecture for cost or developing new features. To measure progress, they would track metrics like Mean Time to Resolution (MTTR), which should trend toward zero for common issues. Another key metric is the percentage of incidents resolved without any human intervention. And perhaps most importantly, they would measure “innovation velocity”—how quickly they can deploy new code and services—because they are no longer bogged down by a constant stream of operational toil. It’s about innovating more with less, just as SBS Software described.
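As a rough illustration of how a team might track those metrics, here is a minimal calculation over a handful of invented incident records.

```python
# Hypothetical incident records; the fields and values are assumptions.
incidents = [
    {"minutes_to_resolve": 4,  "human_involved": False},
    {"minutes_to_resolve": 2,  "human_involved": False},
    {"minutes_to_resolve": 95, "human_involved": True},
]

mttr = sum(i["minutes_to_resolve"] for i in incidents) / len(incidents)
auto_rate = sum(not i["human_involved"] for i in incidents) / len(incidents)

print(f"MTTR: {mttr:.1f} min")                      # should trend toward zero
print(f"Resolved without humans: {auto_rate:.0%}")  # should trend toward 100%
```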
Given the intense focus on cloud spending, your automated optimization function continuously assesses resource usage. What specific, actionable recommendations does it provide, and how does it help teams balance performance demands with cost efficiency across different cloud providers?
With cloud spending under such intense scrutiny, teams are constantly caught between ensuring great performance and controlling a spiraling budget. Our automated optimization function acts as a continuous financial advisor for their cloud estate. It doesn’t just show you a bill; it provides specific, actionable recommendations. For example, it might identify a set of virtual machines in AWS that are consistently over-provisioned for their actual workload and recommend a smaller, cheaper instance type. Conversely, it might flag a service in Azure that is under-provisioned and at risk of performance degradation during peak hours, recommending a scale-up to protect the user experience. The key is that it continuously analyzes real-world usage data, allowing teams to make data-driven decisions. This helps them confidently downsize resources without impacting performance and justifies spending increases where they are truly needed, ensuring every dollar spent across their multi-cloud deployment is delivering maximum value.
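A simple way to picture the rightsizing logic is a rule over observed utilization. The thresholds, vCPU counts, and utilization samples below are assumptions for illustration, not provider sizing or pricing APIs.

```python
# Hypothetical rightsizing sketch driven by observed peak CPU utilization.

def recommend(cpu_samples, current_vcpus):
    """Suggest a size change from the ~95th-percentile CPU utilization."""
    peak = sorted(cpu_samples)[int(0.95 * len(cpu_samples)) - 1]  # ~p95
    used_vcpus = peak * current_vcpus
    if peak < 0.40:
        return f"downsize: peak uses ~{used_vcpus:.1f} of {current_vcpus} vCPUs"
    if peak > 0.85:
        return f"scale up: sustained peaks near {peak:.0%} risk degradation"
    return "keep current size"


# An over-provisioned AWS VM vs. an under-provisioned Azure service
print(recommend([0.12, 0.18, 0.22, 0.30, 0.25, 0.15, 0.20, 0.28, 0.10, 0.33], 8))
print(recommend([0.70, 0.82, 0.91, 0.88, 0.95, 0.77, 0.85, 0.90, 0.93, 0.80], 4))
```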
What is your forecast for multi-cloud management and observability over the next three to five years?
Looking ahead, I believe the focus will shift dramatically from passive observation to proactive, AI-driven action. We’re moving beyond simply monitoring multi-cloud environments to truly managing and optimizing them autonomously. In the next three to five years, the expectation will be that observability platforms don’t just find problems; they predict and prevent them. The sheer scale and complexity of applications running across AWS, Azure, and Google Cloud will make manual intervention completely unfeasible. Therefore, platforms that can intelligently automate remediation, optimize resource consumption for both cost and carbon footprint, and secure the entire software delivery lifecycle will become the standard. The winning solutions will be those that can turn an ocean of data into automated decisions that directly improve business outcomes, making the concept of “fully autonomous operations” not just an aspirational goal, but a competitive necessity.
