
Building an Autonomous SRE with AWS DevOps Agent

5 min read | AWS DevOps Blog | May 8, 2026 | Reviewed for accuracy

Practitioner — Hands-on experience recommended

Production downtime is costly, and the AWS DevOps Agent is designed to reduce it. It acts as an autonomous, always-on frontier agent that investigates incidents as they happen, identifies root causes, and suggests mitigation plans without requiring constant human intervention. Your team can stay focused on strategic initiatives while the agent handles operational issues in real time.

When an incident occurs, a CloudWatch alarm fires and triggers an EventBridge rule, which invokes a Lambda function. The function sends a payload to the DevOps Agent webhook, which starts an investigation. The agent then uses its built-in troubleshooting capabilities to query Splunk logs, retrieve deployment history from GitHub, and correlate CloudWatch metrics with deployment events. From this analysis it maps the application topology, identifies root causes, and generates a detailed mitigation plan with remediation steps and rollback procedures.
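
The first half of that flow can be sketched as follows. This is a minimal illustration, not the article's implementation: the event pattern uses the standard CloudWatch alarm state change event that EventBridge receives, while buildIncidentPayload, createHandler, and every payload field other than eventType are hypothetical names chosen for the example. The real field names come from the Webhook Schema of your Agent Space.

```javascript
// EventBridge rule pattern that matches CloudWatch alarms entering the ALARM state.
const alarmEventPattern = {
  source: ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  detail: { state: { value: ["ALARM"] } },
};

// Hypothetical mapping from the EventBridge event to a webhook payload.
// Only eventType appears in the article; the other field names are assumptions.
function buildIncidentPayload(event) {
  const detail = event.detail ?? {};
  const state = detail.state ?? {};
  return {
    eventType: "incident",
    alarmName: detail.alarmName ?? "unknown",
    state: state.value ?? "unknown",
    reason: state.reason ?? "",
    region: event.region ?? "",
    occurredAt: event.time ?? new Date().toISOString(),
  };
}

// Builds a Lambda handler around a caller-supplied send function, which is
// expected to sign the payload and POST it to the DevOps Agent webhook.
function createHandler(send) {
  return async function handler(event) {
    // Ignore anything that is not an alarm in the ALARM state.
    if (event.detail?.state?.value !== "ALARM") {
      return { forwarded: false };
    }
    const payload = buildIncidentPayload(event);
    await send(payload);
    return { forwarded: true, payload };
  };
}
```

Injecting send keeps the mapping logic testable without network access; in a deployed function it would be the signing and POST step shown in the code example at the end of this article.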

In production, make sure a role named 'mcp_user' exists in Splunk before you create a token, because token creation requires it. Copy the token as soon as it is created: if you close the screen, you will need to generate a new one. The agent's configuration parameters include the Agent Space Name, which identifies the Agent Space, and a specific Webhook Schema for the messages sent to it. Getting these details right avoids common setup pitfalls.
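
Because a lost token or a missing setting tends to surface only when an incident fires, a small pre-flight check can help. The sketch below assumes the values collected during setup are stored as environment-style settings; the setting names (AGENT_SPACE_NAME, DEVOPS_AGENT_WEBHOOK_URL, DEVOPS_AGENT_WEBHOOK_SECRET, SPLUNK_MCP_TOKEN) and the loadAgentConfig helper are hypothetical and not part of any AWS or Splunk API.

```javascript
// Hypothetical setting names; replace them with whatever your deployment uses.
const REQUIRED_SETTINGS = [
  "AGENT_SPACE_NAME",
  "DEVOPS_AGENT_WEBHOOK_URL",
  "DEVOPS_AGENT_WEBHOOK_SECRET",
  "SPLUNK_MCP_TOKEN",
];

// Fails fast with a list of every missing or blank setting.
function loadAgentConfig(env) {
  const missing = REQUIRED_SETTINGS.filter(
    (name) => typeof env[name] !== "string" || env[name].trim() === ""
  );
  if (missing.length > 0) {
    throw new Error(`Missing required settings: ${missing.join(", ")}`);
  }
  return {
    agentSpaceName: env.AGENT_SPACE_NAME,
    webhookUrl: env.DEVOPS_AGENT_WEBHOOK_URL,
    webhookSecret: env.DEVOPS_AGENT_WEBHOOK_SECRET,
    splunkToken: env.SPLUNK_MCP_TOKEN,
  };
}
```

In Node.js this could be called as loadAgentConfig(process.env) at startup, so a forgotten Splunk token is reported before the first alarm rather than during it.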

Key takeaways

→ Understand the incident investigation flow initiated by CloudWatch alarms and EventBridge.
→ Configure the Agent Space with appropriate tools and permissions for effective operation.
→ Set up the 'mcp_user' role in Splunk to facilitate token creation and access.

Why it matters

This approach can reduce mean time to recovery (MTTR): investigations start as soon as an alarm fires, so teams respond to incidents faster and maintain higher service availability, which is critical in production environments.

Code examples

JavaScript

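import { createHmac } from "node:crypto";

// Signs the payload with HMAC-SHA256 and posts it to the DevOps Agent webhook.
// webhookUrl and secret are supplied by the caller (for example, from environment variables).
async function sendEventToWebhook(webhookUrl, secret) {
  const payload = {
    eventType: "incident",
    // ... other event data
  };
  const body = JSON.stringify(payload);
  const timestamp = new Date().toISOString();

  // The signed string is "<timestamp>:<JSON body>", keyed with the webhook secret.
  const hmac = createHmac("sha256", secret);
  hmac.update(`${timestamp}:${body}`, "utf8");
  const signature = hmac.digest("base64");

  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-amzn-event-timestamp": timestamp,
      "x-amzn-event-signature": signature,
    },
    body,
  });
  // Surface delivery failures instead of silently dropping the incident.
  if (!response.ok) {
    throw new Error(`Webhook request failed with status ${response.status}`);
  }
  return response;
}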