Amanda Souza

Senior SRE & Tech Lead

Writing about Azure, DevSecOps, SRE, and AI Reliability Engineering.
Creator of the Terraform Provider for MinIO (13M+ users).

Latest Posts

An AI That Watches My Cloud Network So I Don't Have To

Someone asks “can you draw me our network topology?” and you open three consoles. AWS, Azure, GCP. You click through VPCs in one, VNets in another, VPC Networks in a third, and try to mentally stitch them together. By the time you’ve got half the picture, someone pings you on Slack and you lose it all.

When Every Alert Is Critical, Nothing Is

You have 100 alerts. 80 of them are informational. 15 are warnings that nobody looks at. 4 are actual problems. And 1 is critical, buried in a Slack channel with 200 unread messages. The on-call person didn’t see it because every alert looks the same: a wall of orange text in #alerts that everyone muted weeks ago.

Three Dashboards Is Not Observability

You have a logging dashboard. You have a metrics dashboard. You have a traces dashboard. When something breaks, you open all three, squint at timestamps trying to correlate them, and hope the clocks are synchronized. This is not observability. This is three dashboards and a prayer.

The Four Numbers That Tell You Everything

Your service has 47 metrics. You have dashboards for CPU, memory, disk I/O, container restarts, pod count, HTTP status codes by path, database connection pool size, and that one custom metric someone added six months ago that nobody remembers the purpose of. When something breaks, you look at all of them and none of them tell you what’s actually wrong.

The Alert Checklist Nobody Follows

Your alert is called NginxDown. It fires when the nginx pod restarts. The on-call person gets paged, opens Grafana, sees “NginxDown,” and thinks “OK, nginx is down.” Except nginx restarted because the node ran out of memory, which happened because the batch job that runs at midnight leaked memory, which happened because someone deployed a new version at 5 PM without load testing.