Microsoft's Azure SRE Agent: Bringing Agentic AI to Site Reliability Engineering

The landscape of artificial intelligence tools has expanded dramatically over recent years. Initially, many of the most accessible and widely used AI applications focused on tasks like voice recognition, transcription, and simple text generation. While useful, these applications often represented just the tip of the iceberg in terms of AI's potential impact on complex operational environments.

A significant shift is occurring with the development of agentic AI. Unlike tools that merely generate content or perform single tasks, agentic AI systems are designed to understand context, reason, plan, and execute multi-step workflows to achieve a specific goal. This evolution moves beyond simple natural language interfaces for generation and towards using natural language to orchestrate complex operations and manage systems.

The integration of interface descriptions like OpenAPI allows these agents to translate user intent or system events into actual service calls and actions. Newer technologies, such as the Model Context Protocol (MCP), an open standard introduced by Anthropic that Microsoft has adopted across its tools, further refine this by providing defined, structured interfaces for AI agents to interact with applications and services, making outcomes more reliable and predictable.

Building Agents into Workflows: Beyond Simple Chatbots

One of the powerful aspects of agentic workflows is their ability to be triggered not just by human input, but also by system events. This enables completely automated operations that can surface information or require human intervention through familiar interfaces like command lines or collaboration tools such as Microsoft Teams. Microsoft's Adaptive Cards, originally conceived for embedding elements of 'microwork' within Teams conversations, are particularly well-suited for this. They can serve as triggers for steps within a longer workflow managed by an agent, effectively placing a human 'in the loop' to review reports, approve actions, or guide the next steps in a complex process.

Using agents to handle exceptions within established business processes is a logical and impactful application of this technology. This approach can mitigate many of the challenges associated with traditional chatbots, which often struggle with open-ended conversations and unpredictable inputs. Agentic systems, when grounded in a known state and operating within a defined knowledge domain, can use AI tools to orchestrate responses and manage operations more reliably. They are designed to understand the current state of a system, compare it against a desired state or best practices, and take appropriate actions. Crucially, a well-designed system prompt can direct the agent to escalate unexpected events, or situations outside its defined capabilities, to a human operator, maintaining control and safety.
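The compare-and-escalate pattern described here can be sketched in a few lines of Python. This is an illustration only, not how any Microsoft agent is implemented: the setting names, the KNOWN_SETTINGS allowlist, and the Finding type are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of settings this sketch's agent is allowed to correct itself.
KNOWN_SETTINGS = {"replicas", "min_tls_version"}


@dataclass
class Finding:
    setting: str
    desired: object
    actual: object
    action: str  # "remediate" or "escalate"


def compare_state(desired: dict, actual: dict) -> list[Finding]:
    """Return one finding per setting whose actual value differs from the desired one."""
    findings = []
    for setting in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(setting), actual.get(setting)
        if want == have:
            continue
        # Drift outside the agent's defined domain is handed to a human.
        action = "remediate" if setting in KNOWN_SETTINGS else "escalate"
        findings.append(Finding(setting, want, have, action))
    return findings


desired = {"replicas": 3, "min_tls_version": "1.2"}
actual = {"replicas": 2, "min_tls_version": "1.2", "debug_port": 9229}
findings = compare_state(desired, actual)
for f in findings:
    print(f"{f.setting}: desired={f.desired!r} actual={f.actual!r} -> {f.action}")
```

The unexpected debug_port setting is not in the allowlist, so the sketch escalates it rather than guessing at a fix, while the replica shortfall falls inside the defined domain.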

Adding Agents to Site Reliability Engineering (SRE)

A domain ripe for the application of agentic AI is Site Reliability Engineering (SRE). SRE teams are responsible for ensuring the reliability, availability, performance, and scalability of production systems. This often involves a significant amount of repetitive, lower-level tasks such as monitoring alerts, diagnosing common issues, restarting services, rotating certificates, and managing configurations. Automating some of these functions using AI agents can significantly reduce the operational burden on SREs, allowing them to focus on more strategic work, complex problem-solving, and system improvements. Microsoft refers to this application of AI in operations as agentic DevOps.

Modern cloud-native applications, which are increasingly common in today's infrastructure, are often defined and managed through code and configuration files. This includes infrastructure itself, using Infrastructure as Code (IaC) tools like Azure Resource Manager, Terraform, Bicep, or Pulumi. Application APIs can be described using specifications like TypeSpec, and container orchestration platforms like Kubernetes are configured via YAML files.

These configuration files, often stored securely in version control repositories like GitHub, provide a clear definition of the 'desired state' for the system. This desired state serves as a crucial baseline for an SRE agent. In environments using Windows Server, PowerShell's Desired State Configuration (DSC) definitions can similarly provide this foundational ground state.

Once a system is operational, a wealth of data is generated in the form of logs, metrics, and traces. Azure provides comprehensive monitoring tools to collect and collate this data. This information can be stored in a centralized repository like a Microsoft Fabric data lake, where analytics tools using languages like Kusto Query Language (KQL) can query and analyze the data. This analysis generates reports and tables that provide deep insights into the performance and health of complex, distributed systems. These analytics form the backbone of SRE dashboards, enabling engineers to quickly identify issues and deploy fixes, often before users even notice a problem.

Azure Agent Tools for the Rest of Us: Introducing Azure SRE Agent

Given its extensive suite of existing DevOps and monitoring tools, Microsoft possesses many of the foundational components necessary to build sophisticated SRE agents. By combining existing agent capabilities, the expanding roster of MCP servers, and the rich data streams from monitoring and analytics services, Microsoft is well-positioned to provide automated alerts and basic remediation capabilities.

It was therefore not surprising when Microsoft announced at Build 2025 that it was already using a similar system internally and would soon be launching a public preview of an Azure SRE Agent. This follows a long-standing pattern for Microsoft, which frequently develops internal tools to manage its vast cloud infrastructure and then productizes them for its customers. Much of the Azure platform itself evolved from the systems Microsoft built to run its own global cloud applications.

The Azure SRE Agent is designed to assist in managing production services. It uses reasoning large language models (LLMs) to analyze logs, metrics, and other system data to determine the root causes of issues and suggest potential fixes. The underlying approach combines elements of traditional machine learning, looking for anomalies and exceptions, with the reasoning capabilities of LLMs to compare the current state of a system against best practices and its defined desired state configuration. The goal is to identify and help resolve issues as quickly as possible, ideally before they impact end-users.

The primary aim of the Azure SRE Agent is to significantly reduce the workload on site reliability engineers, system administrators, and developers. By automating the detection, diagnosis, and initial steps of remediation for common problems, the agent allows these skilled professionals to remain focused on higher-value tasks without constant interruption from alerts and minor incidents. The agent is designed to operate continuously in the background, using data from normal system operations to refine its understanding and models, tailoring its insights to the specific applications and underlying infrastructure it monitors.

The context model built by the agent, reflecting the state and history of your Azure resources, can be queried at any time using natural language. This interaction model is similar to using the Azure MCP Server within tools like Visual Studio Code, providing a conversational interface to understand complex system states. Because the system is inherently grounded in your actual Azure resources and their associated data, the results and suggestions provided are based on real-time information and historical logs. This is analogous to using a specific retrieval-augmented generation (RAG) AI tool, but without the need for users to manage the complexity of building and maintaining a real-time vector index. Instead, services like Fabric's data agent provide APIs that handle the complexities of querying and retrieving relevant data for the SRE agent.

Furthermore, the integration with Microsoft Fabric and its data capabilities opens up possibilities for visualizing the data and insights generated by the agent. Using appropriate markup or integrated tools, the agent could potentially help generate graphs and charts to illustrate system performance, incident trends, or the impact of changes, making complex data more accessible and understandable for SRE teams.

Event-Driven Automation and Security Integration

Making the Azure SRE Agent event-driven is a critical design choice. This allows it to be directly tied to services like Azure Monitor and Azure's Security Graph. By integrating with Azure Monitor, the agent can automatically pull alert details as they occur, triggering an automated root-cause analysis process. This means that when an alert fires, the agent doesn't just notify an engineer; it immediately begins investigating the potential causes based on available logs and metrics.
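As a rough illustration of the event-driven pattern, the Python sketch below routes an incoming alert to an investigation routine instead of simply notifying someone. The field names are loosely modeled on Azure Monitor's common alert schema and should be treated as assumptions, and the handler functions and registry are hypothetical rather than part of any Azure SDK.

```python
from typing import Callable

# Registry mapping a signal type to an investigation routine (all names hypothetical).
INVESTIGATORS: dict[str, Callable[[dict], str]] = {}


def investigator(signal_type: str):
    """Register a function as the investigation routine for one signal type."""
    def register(func):
        INVESTIGATORS[signal_type] = func
        return func
    return register


@investigator("Metric")
def investigate_metric(essentials: dict) -> str:
    return f"query metrics for {essentials['alertTargetIDs'][0]}"


@investigator("Log")
def investigate_log(essentials: dict) -> str:
    return f"search logs for rule {essentials['alertRule']}"


def on_alert(event: dict) -> str:
    """Start an investigation when an alert fires; ignore alerts that are not firing."""
    essentials = event["data"]["essentials"]
    if essentials["monitorCondition"] != "Fired":
        return "no action"
    handler = INVESTIGATORS.get(essentials["signalType"])
    if handler is None:
        # Unknown signal types are outside this sketch's domain, so hand off to a human.
        return "escalate to on-call engineer"
    return handler(essentials)


sample_alert = {
    "data": {
        "essentials": {
            "alertRule": "high-cpu",
            "signalType": "Metric",
            "monitorCondition": "Fired",
            "alertTargetIDs": ["/subscriptions/example/resourceGroups/rg-demo/vm-1"],
        }
    }
}
print(on_alert(sample_alert))
```

In a real deployment the returned strings would be replaced by actual log and metric queries; the point of the sketch is only that the alert itself starts the analysis.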

Integration with Azure Security Graph allows the agent to use current Azure security policies and recommendations as a baseline for 'best practice'. It can compare the current state of a system's security configuration against this baseline, identify deviations or potential vulnerabilities, and inform users of issues. In some cases, it can even perform basic remediations automatically, in line with Azure's recommended security practices. For example, if a web server is found to be using an outdated version of TLS, the agent could potentially identify this and suggest or even initiate the update process (with approval), helping to ensure applications remain secure and compliant.
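The TLS example can be made concrete with a small baseline check. The sketch below is an assumption-laden illustration: the resource names and the minimum version of 1.2 are invented for the example, and the code only flags deviations, leaving any update to a separate, approved step.

```python
MIN_TLS_VERSION = "1.2"  # assumed baseline for this example, not an official value


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '1.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def below_baseline(resources: dict[str, str], minimum: str = MIN_TLS_VERSION) -> list[str]:
    """Return the names of resources whose TLS version is older than the baseline."""
    floor = parse_version(minimum)
    return sorted(
        name for name, version in resources.items()
        if parse_version(version) < floor
    )


# Hypothetical inventory mapping resource name to its configured minimum TLS version.
inventory = {"web-frontend": "1.0", "orders-api": "1.2", "admin-portal": "1.3"}
flagged = below_baseline(inventory)
print(flagged)
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes that plain string comparison would make with multi-digit version parts.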

When an event triggers an analysis, the agent is able to leverage its access to known Azure data sources to detect exceptions and determine potential causes. It then reports its conclusions to the duty site reliability engineer. This provides the engineer with more than just a notification; it offers a starting point for investigation and potential remediation, significantly accelerating the incident response process.

While the agent can suggest fixes, the option exists to handle basic remediations directly, provided they are approved by a site reliability engineer. This human-in-the-loop model is crucial, especially in the early stages of adopting such powerful automation. The list of approved operations is intentionally kept small and safe, typically including actions like triggering scaling events, restarting services, or rolling back recent changes if they are identified as the cause of an issue. This cautious approach builds trust and ensures that critical systems are not inadvertently disrupted by automated actions.
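A minimal sketch of this approval gate, assuming a hypothetical allowlist and an approve callback standing in for the engineer's decision, might look like the following. The operation names mirror the examples in the text but are not identifiers from the actual product.

```python
from typing import Callable

# Hypothetical allowlist, kept deliberately small.
APPROVED_OPERATIONS = {"scale_out", "restart_service", "rollback_deployment"}


def run_remediation(
    operation: str,
    target: str,
    approve: Callable[[str, str], bool],
) -> str:
    """Run an operation only if it is allowlisted and a human approves it."""
    if operation not in APPROVED_OPERATIONS:
        return f"refused: {operation} is not an approved operation"
    if not approve(operation, target):
        return f"declined: engineer rejected {operation} on {target}"
    # A real system would call the platform here; the sketch only reports the decision.
    return f"executed: {operation} on {target}"


def always_yes(operation: str, target: str) -> bool:
    return True


print(run_remediation("restart_service", "orders-api", always_yes))
print(run_remediation("delete_database", "orders-db", always_yes))
```

The allowlist check comes before the approval prompt, so an engineer is never asked to approve an operation the agent should not be able to run in the first place.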

Continuous Improvement and Reporting

A key aspect of effective SRE and DevOps is learning from incidents and continuously improving systems. The Azure SRE Agent facilitates this by automatically recording its findings and actions. The root-cause analysis performed by the agent, the problem discovery process, and details of any fixes (whether automated or human-approved) are written to the application's GitHub repository as an issue. This practice aligns perfectly with modern DevOps principles, bringing developers directly into the site reliability discussion. By documenting incidents in the development workflow, it ensures that everyone involved is informed, and development teams can incorporate lessons learned into future development cycles to prevent similar problems from recurring.
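To show what such a record could contain, the sketch below assembles a payload in the shape accepted by GitHub's REST endpoint for creating issues (POST /repos/{owner}/{repo}/issues, with title, body, and labels fields). The wording of the issue, the label name, and the helper function are assumptions for illustration; the code builds the payload only and makes no network call.

```python
import json


def build_incident_issue(alert_name: str, root_cause: str, actions: list[str]) -> dict:
    """Assemble a GitHub issue payload summarizing an incident."""
    action_lines = [f"- {action}" for action in actions]
    if not action_lines:
        action_lines = ["- none"]
    body = "\n".join(
        [f"Root cause: {root_cause}", "", "Actions taken:"] + action_lines
    )
    return {
        "title": f"Incident: {alert_name}",
        "body": body,
        "labels": ["incident"],  # hypothetical label
    }


issue = build_incident_issue(
    "high-cpu on orders-api",
    "recent deployment introduced a tight retry loop",
    ["rolled back deployment (approved by on-call engineer)"],
)
print(json.dumps(issue, indent=2))
```

Keeping the payload construction separate from the HTTP call makes the report format easy to review and test before anything is written to a repository.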

Beyond incident-specific reporting, the agent also produces regular daily reports. These reports provide a summary of incidents and their current status, an overview of the overall health of the monitored resource group, a list of possible actions that could be taken to improve performance and health proactively, and suggestions for potential proactive maintenance tasks. This comprehensive reporting helps teams stay informed about the state of their systems and prioritize work effectively.
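The structure of such a report can be approximated with a short summarizer. This is a sketch under stated assumptions: the incident records, status values, and the simple health rule (any unresolved incident means "needs attention") are invented for the example and do not reflect the agent's actual report format.

```python
from collections import Counter


def daily_report(resource_group: str, incidents: list[dict]) -> str:
    """Summarize incident counts, overall health, and open items as plain text."""
    counts = Counter(incident["status"] for incident in incidents)
    open_titles = [i["title"] for i in incidents if i["status"] != "resolved"]
    health = "healthy" if not open_titles else "needs attention"
    lines = [
        f"Daily report for {resource_group}",
        f"Overall health: {health}",
        "Incidents by status: "
        + ", ".join(f"{status}={n}" for status, n in sorted(counts.items())),
    ]
    lines += [f"Open: {title}" for title in open_titles]
    return "\n".join(lines)


incidents = [
    {"title": "High CPU on vm-1", "status": "resolved"},
    {"title": "Expiring certificate", "status": "open"},
]
report = daily_report("rg-demo", incidents)
print(report)
```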

Getting Started with Azure SRE Agent

The Azure SRE Agent is currently available as a gated public preview. Access requires signing up through a specific form. However, Microsoft has made much of the documentation publicly available, allowing interested users to understand how the agent works and its capabilities before gaining access.

For now, interaction with the SRE Agent takes place mainly in the Azure Portal. Getting started appears relatively straightforward: you create an agent instance from the portal, assign it to a specific Azure account and resource group, and select a region for the agent to run from. At the time of the announcement, the agent could run only in the Sweden Central region, although it can monitor resource groups located in any Azure region.

While the Azure Portal is a familiar environment for most SREs working with Azure, it may not be the ideal interface for conversational interactions or receiving real-time, actionable notifications. Site reliability engineers often have the portal open, but integrating the agent's interactions more closely with collaboration tools like Teams using Adaptive Cards, or delivering reports through business intelligence platforms like Power BI, could significantly enhance its usability and fit within typical service operations workflows. Such integrations would bring the agent's insights and actions directly into the tools teams use daily.

The Evolution of AI in Operations

Tools like the Azure SRE Agent represent an important step in the evolution of AI, particularly generative AI and agentic systems. They move beyond simple content generation or chatbot interactions. Instead, they are deeply grounded in real-time operational data and leverage sophisticated reasoning approaches to extract context, identify patterns, and understand the meaning behind system events. This context is then used to construct and execute workflows based on established best practices, desired system states, and current conditions.

Building the agent around a human-in-the-loop model, where human approval is required for significant actions, is a pragmatic approach. While it might introduce a slight delay compared to fully autonomous systems, it is essential for building trust in these new ways of working, especially when dealing with critical production environments. This model allows SRE teams to gain confidence in the agent's capabilities and outputs over time.

It will be fascinating to observe how the Azure SRE Agent evolves. As Microsoft continues to roll out deeper Azure service integrations through MCP servers and as languages like TypeSpec provide richer ways to add context to API descriptions, the agent's ability to understand and interact with the environment will only improve. Deeply grounded AI applications like this are poised to deliver on the promise of AI assistants and copilots, providing tools that genuinely make users' tasks easier, reduce cognitive load, and minimize interruptions. They also serve as valuable examples, demonstrating the types of practical, impactful AI applications that organizations should aim to build as the underlying AI platforms and operational practices continue to mature.
