Improving robustness requires increasing complexity. Let’s throw more complexity at it?
I’m using this enormously complex system, an LLM, to help me solve a problem that was created by software complexity in the first place.
Lorin Hochstein
This feels like using multiple agents as a sort of redundancy and cross-validation architecture to improve the reliability of agent output.
Alex Ewerlöf
This article explains why end-to-end testing breaks down in microservice-based systems, not due to poor tooling, but because of fundamental architectural and operational mismatches.
Alok Kumar — DZone
LaunchDarkly’s survey data have some interesting things to say about the impact of AI.
[…] while build and deployment velocity have improved, production reliability has not.
LaunchDarkly
Fred Hebert surveyed how AI coding assistants vs. AI SRE tools are marketed and found a stark divide: coding assistants are framed as partners that augment engineers, while AI SREs are framed as replacements for low-value work. The implication is that the people building and buying these tools see incident response as grunt work to be automated away — and that says a lot about how decision-makers perceive the role.
Fred Hebert
I especially like the point that incidents are leadership moments — how you respond tells your team everything about the culture you’re building. This one is aimed at CTOs, but really it’s a great reminder for anyone in a leadership role during incidents.
Joe Mckevitt — Uptime Labs
There’s a really interesting bit in this one about libraries and layers of the system doing their own retries without your knowledge, magnifying retry volume.
David Iyanu Jonathan — DZone
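To make the retry point concrete, here is a rough back-of-the-envelope sketch (mine, not the article's). It assumes each layer retries independently and that every attempt at one layer triggers a full round of attempts in the layer below. The function name and the three-layer stack are hypothetical, chosen only for illustration.

```python
from math import prod


def worst_case_attempts(retries_per_layer):
    """Upper bound on calls reaching the bottom dependency for one request.

    Each layer makes 1 initial attempt plus its configured retries, and
    every attempt at one layer triggers a full round in the layer below,
    so the per-layer attempt counts multiply.
    """
    return prod(1 + r for r in retries_per_layer)


# Hypothetical stack: application code, an HTTP client library, and a proxy,
# each independently configured for 3 retries.
layers = [3, 3, 3]
print(worst_case_attempts(layers))  # 4 * 4 * 4 = 64
```

Under those assumptions, a single user request against a failing dependency can turn into as many as 64 calls, which is why a retry policy hidden inside a library matters even when your own code only retries a few times.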
I like the section on what AI should and shouldn’t do. It’s important to avoid automating away the process of learning from incidents.
incident.io
