Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.


7 Tips On Building And Maintaining An SRE Team In Your Company

In today's "always on" world, reliability is a primary business KPI. Build a culture of reliability by following these 7 simple tips for building a solid SRE team in your organization. Many of today’s hottest jobs didn’t exist at the turn of the millennium: social media managers, data scientists, and growth hackers were unheard of. Another relatively new role in demand is that of the Site Reliability Engineer, or SRE.


Why AlertOps is the best PagerDuty alternative

We will compare AlertOps to PagerDuty in 3 broad areas. On-call management: Whether your on-call management needs are basic or complex, AlertOps has a solution for you. Creating on-call schedules is simple whether there is one person on call, two or more people on call, or even multiple teams on call. Escalations: Automatic escalations are based on your on-call schedules. Expand the possibilities with Workflows and Escalation Rules.


4 Essential Types of MSP Tools (in 2021)

Managed service providers (MSPs) need the right tools to get the job done quickly and securely. MSP tools provide control over everything from virtual machine (VM) management and database administration to application and server monitoring. They can also help MSPs oversee IT infrastructure. MSP tools are valuable, but not all tools are created equal.


The Key Differences between SLI, SLO, and SLA in SRE

To incentivize reliability in your platform, your team needs shared goals that measure and quantify both the capabilities of your product or service and the customer experience. Define the path to "Always-On" services by understanding a few key SRE fundamentals and their implications: SLIs, SLOs, and SLAs. Framing SRE metrics for building or scaling a product is quite a daunting task.


2021 is the Year of Reliability

There’s no better time than now to dedicate effort to reliable software. If it wasn’t apparent before, this past year has made it more evident than ever: People expect their software tools to work every time, all the time. This shift in the way end users think about software was inevitable as everyday applications entered our lives, much as water and electricity entered our homes.


The Secret of Communicating Incident Retrospectives

In the world of SRE, incidents are unplanned investments in reliability. Why? Because they are valuable opportunities to learn and grow. This perspective can be difficult to communicate to other stakeholders. Some may be upset about the cost incurred or the affected customers. Others might not understand why incidents happen in the first place. It is important to show how the lessons of an incident are relevant to each stakeholder role.


It's Time for Developer-Driven Reliability

Ten years on from “software is eating the world,” it’s safe to say we live in a new digital age. Today’s businesses, from banks and hospitals to transportation and telecommunications, rely upon digital services to power the infrastructure behind everyday modern life. This new world has been made possible by rapid advances in how developers build and deliver software.


Top Reliability and Scaling Practices from Experts at Citrix, Greenlight Financial, and Incognia

Downtime costs more than dollars. It also costs customer happiness and trust. So how do teams maximize reliability while scaling? Tooling, communication, observability, and more all play into a complete reliability strategy. In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed best practices for responding to incidents, scaling for reliability, and engineering with the customer in mind.