Best practices for SRE, incident management, observability, and building reliable systems at scale. Written by practitioners who spent years on call for large-scale infrastructure teams, then built the platform they wished they had.
The Nova AI Ops blog covers the hard problems in modern SRE: reducing alert fatigue without missing real incidents, cutting MTTR from hours to minutes with AI-driven automation, migrating off legacy monitoring stacks without downtime, and building runbooks that AI agents can actually execute. Every article is practical, opinionated, and based on real incidents we or our customers have lived through.