An overview of major IT incidents and outages in 2021
At Blameless, we want to embrace all the benefits of the SRE best practices we preach. We’re proud to announce that we’ve started using a new system of feature flagging with canaried and iterative rollouts. This is a system where new releases are broken down and flagged based on the features each part of the release implements. Then, an increasing subset of users are given access to an increasing number of features.
The term Site Reliability Engineer (SRE) first appeared in Google in the early 2000s. In Google’s 2016 SRE Book, Benjamin Treynor Sloss wrote that, generally speaking, “an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” This means that the SRE teams at Google decide how a system should run in production as well as how to make it run that way.
Wondering about Cloud-Native Development? We explain what cloud-native development is and how it can help build fast and reliable applications.
A summary of the Log4j vulnerability, and key takeaways for SREs.
Part of the trepidation of being on-call is encountering unfamiliar emergency scenarios where we are surprised by suddenly not knowing how to do our jobs. We feel lost and alone, complicated by the world around us, powerless to resolve or even mitigate the problem. On-call need not be a solo affair full of fear and anxiety. There are ways we can employ practice and open collaboration outside of incidents to prepare us better.
We talk about reliability a lot from the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application is something that impacts the entire business with significant costs. A mindset of putting reliability first is a business imperative that all teams should share.
As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization.
As the new year approaches, we at Blameless like to ponder the future of Reliability Engineering. For 2021, we predicted that the practice of site reliability engineering (SRE) would continue to grow in terms of adoption, we would see adoption increase faster among smaller organizations, and SRE practices would get more attention to drive adoption compared to hiring. We’re sure you’ll agree that these trends have indeed strengthened in the last year.