Dynatrace Boosts Multi-Cloud Automation and Visibility

With deep expertise in AI, machine learning, and blockchain, Dominic Jainy has dedicated his career to exploring how advanced technologies can transform industries. Today, he shares his insights on the evolving landscape of multi-cloud management, where the convergence of massive data processing and intelligent automation is revolutionizing how enterprises maintain reliability and control costs. Our conversation will touch upon the practicalities of achieving a unified view across disparate cloud environments, the mechanics of automated issue remediation, and what the future holds for truly autonomous IT operations.

Dynatrace now offers a unified view across AWS, Azure, and Google Cloud to manage operational complexity. Beyond a single dashboard, how does this help platform teams manage reliability and costs? Could you share a specific example of how this prevents a common multi-cloud issue?

It’s about moving beyond just seeing everything to truly understanding it. A single dashboard is table stakes, but what platform teams desperately need is context. The unified view provides a real-time dependency map that shows precisely how a service running on Azure might be impacting a user-facing application hosted on AWS. This visibility is critical for preventing the classic multi-cloud “blame game.” For instance, we often see situations where a retail application on Google Cloud suddenly slows down. The Google Cloud team sees no issues, but what they can’t see is that a critical inventory microservice they call, which is running in Azure Kubernetes Service, is experiencing resource contention. Our platform immediately visualizes that cross-cloud dependency, pinpointing the Azure service as the root cause. This turns a potential multi-day troubleshooting nightmare, involving two separate teams burning time and money, into a precise, actionable insight that can be resolved in minutes.
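The cross-cloud tracing in that retail example can be pictured as a walk over a dependency graph. The sketch below is only an illustration of the idea, not Dynatrace's implementation or API: the service names, the `DEPENDS_ON` map, and the `UNHEALTHY` set are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "gcp/retail-frontend": ["gcp/checkout-api", "azure-aks/inventory-service"],
    "gcp/checkout-api": ["gcp/payments-db"],
    "azure-aks/inventory-service": ["azure/inventory-db"],
}

# Hypothetical health signal: services currently showing resource contention.
UNHEALTHY = {"azure-aks/inventory-service"}


def find_unhealthy_dependencies(start, depends_on, unhealthy):
    """Breadth-first walk from `start`; return the call path to each
    unhealthy downstream service, regardless of which cloud hosts it."""
    queue = deque([[start]])
    seen = {start}
    findings = []
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and node in unhealthy:
            findings.append(path)
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return findings


paths = find_unhealthy_dependencies("gcp/retail-frontend", DEPENDS_ON, UNHEALTHY)
for path in paths:
    print(" -> ".join(path))
```

Because the walk ignores cloud boundaries, the slow Google Cloud frontend is linked to the contended Azure service in one pass, which is the point the example makes about avoiding the "blame game."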

Your platform includes automated remediation to resolve issues as they occur. Can you walk us through how this function would identify a cross-cloud performance risk, and what specific steps it takes to resolve it without manual intervention?

Certainly. The process is a seamless flow from detection to resolution, driven by our AI engine. First, the system detects an anomaly using its built-in health indicators—not just a technical metric like high CPU, but a business-level issue like degrading transaction performance. Imagine a financial service application with its frontend on AWS and its AI-powered fraud detection model running on Azure AI Foundry. The platform might notice a slowdown in transaction approvals. Instead of just firing off a vague alert, it uses the Smartscape dependency graph to trace the issue across the cloud boundary directly to the Azure AI service. The AI engine then performs a root-cause analysis and determines the service is overloaded. At this point, the automation kicks in. It triggers a pre-configured remediation workflow—perhaps automatically scaling up the Azure resources or rerouting traffic to a healthy, replicated instance. This entire sequence happens in seconds, often before the operations team is even aware a problem was brewing, completely eliminating the manual effort of diagnosis and response.
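As a rough illustration of the last step, a pre-configured remediation workflow can be thought of as a lookup from a diagnosed root cause to an action. This is a minimal sketch under stated assumptions: the `Diagnosis` type, the `RUNBOOK` table, and the action functions are hypothetical names invented here, and the actions only return a description rather than calling any real cloud or Dynatrace API.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    service: str
    root_cause: str  # hypothetical labels, e.g. "overloaded"


def scale_up(service):
    # Placeholder: a real workflow would call the cloud provider here.
    return f"scale up {service}"


def reroute_traffic(service):
    # Placeholder: a real workflow would update routing rules here.
    return f"reroute traffic from {service} to a healthy replica"


def page_on_call(service):
    # Fallback when no automated action is configured for the cause.
    return f"page on-call for {service}"


# Hypothetical pre-configured mapping from root cause to remediation.
RUNBOOK = {
    "overloaded": scale_up,
    "unhealthy_instance": reroute_traffic,
}


def remediate(diagnosis):
    """Pick the configured action for the diagnosed cause, else escalate."""
    action = RUNBOOK.get(diagnosis.root_cause, page_on_call)
    return action(diagnosis.service)


print(remediate(Diagnosis("azure/fraud-detection-model", "overloaded")))
```

Keeping a human escalation path as the default is a design choice in this sketch: only causes with an explicitly configured action are handled without manual intervention.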

The Grail data lakehouse and Smartscape dependency graph are central to your multi-cloud operations. How do these components work together to provide context for an alert? Please describe a scenario where this combination provides insights that a standard monitoring tool might miss.

Think of Grail as the massive, long-term memory and Smartscape as the real-time consciousness of the system. Grail is our data lakehouse, which ingests and indexes immense volumes of telemetry and metadata from every corner of the AWS, Azure, and Google Cloud environments. Smartscape then uses this rich data to build and continuously update a live, topological map of every single dependency. A standard monitoring tool might send an alert saying, “CPU on an AWS instance is at 95%.” That’s data, but it’s not insight. In that same scenario, Grail provides the historical performance data, while Smartscape shows that this specific instance supports a critical, revenue-generating service that is, in turn, dependent on a database in Google Cloud that just received a major schema update. The platform can then correlate the CPU spike with the database change, providing the “why.” This transforms a generic alert into a highly contextualized insight: “The recent database update in Google Cloud has caused an unexpected load on this dependent AWS service, putting customer transactions at risk.” That’s a level of contextual understanding a standard tool, looking at clouds in isolation, would completely miss.
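The correlation described here, linking a CPU spike to a recent change on a dependency, can be sketched in a few lines. This is an assumption-laden illustration rather than how Grail or Smartscape actually work: the event dictionaries, the `depends_on` map, and the 30-minute window are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate(alert, changes, depends_on, window=timedelta(minutes=30)):
    """Return change events on the alerting service's direct dependencies
    that happened within `window` before the alert fired."""
    deps = set(depends_on.get(alert["service"], []))
    return [
        change
        for change in changes
        if change["service"] in deps
        and timedelta(0) <= alert["time"] - change["time"] <= window
    ]


# Hypothetical topology and events for illustration only.
depends_on = {"aws/orders-service": ["gcp/orders-db"]}
alert = {
    "service": "aws/orders-service",
    "metric": "cpu",
    "time": datetime(2025, 1, 1, 12, 20),
}
changes = [
    {"service": "gcp/orders-db", "what": "schema update",
     "time": datetime(2025, 1, 1, 12, 5)},
    {"service": "azure/unrelated-svc", "what": "deploy",
     "time": datetime(2025, 1, 1, 12, 10)},
]

for change in correlate(alert, changes, depends_on):
    print(f"{alert['service']} CPU alert may relate to "
          f"{change['what']} on {change['service']}")
```

The topology filter is what separates this from a plain time-based match: the unrelated Azure deploy falls inside the window but is ignored because nothing links it to the alerting service.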

A key goal for some customers, like SBS Software, is achieving “fully autonomous operations.” What does this look like in practice for a cloud operations team day-to-day? Could you outline the key capabilities or metrics a team would need to measure its progress toward this goal?

For a cloud operations team, “fully autonomous operations” means a fundamental shift from being reactive firefighters to proactive, strategic innovators. Day-to-day, it means their mornings aren’t spent triaging a flood of overnight alerts. Instead, the platform has already identified, diagnosed, and, in many cases, resolved potential issues automatically. The team’s focus shifts to higher-value work, like optimizing architecture for cost or developing new features. To measure progress, they would track metrics like Mean Time to Resolution (MTTR), which should trend toward zero for common issues. Another key metric is the percentage of incidents resolved without any human intervention. And perhaps most importantly, they would measure “innovation velocity”—how quickly they can deploy new code and services—because they are no longer bogged down by a constant stream of operational toil. It’s about innovating more with less, just as SBS Software described.
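Two of the metrics mentioned, MTTR and the share of incidents resolved without human intervention, are simple to compute from an incident log. The sketch below assumes a hypothetical record format with `opened`, `resolved`, and `auto_resolved` fields; real incident data would need to be mapped into that shape.

```python
from datetime import datetime
from statistics import mean


def mttr_minutes(incidents):
    """Mean time to resolution, in minutes, across resolved incidents."""
    return mean(
        (i["resolved"] - i["opened"]).total_seconds() / 60 for i in incidents
    )


def auto_resolved_pct(incidents):
    """Percentage of incidents closed without human intervention."""
    auto = sum(1 for i in incidents if i["auto_resolved"])
    return 100 * auto / len(incidents)


# Hypothetical incident log for illustration only.
incidents = [
    {"opened": datetime(2025, 1, 1, 9, 0),
     "resolved": datetime(2025, 1, 1, 9, 10), "auto_resolved": True},
    {"opened": datetime(2025, 1, 1, 13, 0),
     "resolved": datetime(2025, 1, 1, 13, 30), "auto_resolved": False},
]

print(f"MTTR: {mttr_minutes(incidents):.1f} min")
print(f"Auto-resolved: {auto_resolved_pct(incidents):.0f}%")
```

Tracking both numbers over time, rather than as single snapshots, is what would show whether a team is actually moving toward autonomous operations.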

Given the intense focus on cloud spending, your automated optimization function continuously assesses resource usage. What specific, actionable recommendations does it provide, and how does it help teams balance performance demands with cost efficiency across different cloud providers?

With cloud spending under such intense scrutiny, teams are constantly caught between ensuring great performance and controlling a spiraling budget. Our automated optimization function acts as a continuous financial advisor for their cloud estate. It doesn’t just show you a bill; it provides specific, actionable recommendations. For example, it might identify a set of virtual machines in AWS that are consistently over-provisioned for their actual workload and recommend a smaller, cheaper instance type. Conversely, it might flag a service in Azure that is under-provisioned and at risk of performance degradation during peak hours, recommending a scale-up to protect the user experience. The key is that it continuously analyzes real-world usage data, allowing teams to make data-driven decisions. This helps them confidently downsize resources without impacting performance and justifies spending increases where they are truly needed, ensuring every dollar spent across their multi-cloud deployment is delivering maximum value.
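A very simplified version of this kind of recommendation logic might look like the following. It is a sketch under stated assumptions, not the platform's actual algorithm: the 20% and 80% CPU thresholds, the resource names, and the use of peak utilization as the only signal are all illustrative choices.

```python
def recommend(cpu_samples, low=20.0, high=80.0):
    """Suggest a sizing action from observed CPU utilization (percent).

    Uses peak rather than average usage, so a resource is only flagged
    for downsizing if it never came close to its capacity.
    """
    peak = max(cpu_samples)
    if peak < low:
        return "downsize"
    if peak > high:
        return "scale up"
    return "keep"


# Hypothetical utilization samples per resource, for illustration only.
usage = {
    "aws/vm-batch-01": [5.0, 8.0, 12.0, 9.0],       # consistently idle
    "azure/api-gateway": [60.0, 75.0, 93.0, 88.0],  # saturating at peak
    "gcp/search-svc": [35.0, 50.0, 62.0, 41.0],     # comfortably sized
}

for resource, samples in usage.items():
    print(f"{resource}: {recommend(samples)}")
```

Basing the downsizing decision on peak usage is deliberately conservative: it trades some potential savings for a lower risk of degrading performance during busy periods.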

What is your forecast for multi-cloud management and observability over the next three to five years?

Looking ahead, I believe the focus will shift dramatically from passive observation to proactive, AI-driven action. We’re moving beyond simply monitoring multi-cloud environments to truly managing and optimizing them autonomously. In the next three to five years, the expectation will be that observability platforms don’t just find problems; they predict and prevent them. The sheer scale and complexity of applications running across AWS, Azure, and Google Cloud will make manual intervention completely unfeasible. Therefore, platforms that can intelligently automate remediation, optimize resource consumption for both cost and carbon footprint, and secure the entire software delivery lifecycle will become the standard. The winning solutions will be those that can turn an ocean of data into automated decisions that directly improve business outcomes, making the concept of “fully autonomous operations” not just an aspirational goal, but a competitive necessity.
