Incident management is the process of identifying and resolving problems that occur in IT services. Incident Management is also used as a metric to measure the health of the IT Service Desk. Let’s discuss what incident management is, why it matters to your business, and how you can apply it to your organization.
Another day, another drama! This one, though, is very much of my own making. I have been wanting to try my hand at a bit of chaos engineering for some time now but C&Js just hasn’t been ready. Sarah’s been up for it too, though, at Animapanions. And now that our CIO, Charlie has seen MTTR drop across every single technology team, thanks to the rollout of Moogsoft and the new incident management system (kudos to James), it’s pilot day.
Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.
Cloud Operations Sandbox serves as a simulation tool for budding SREs to learn the best practices from Google and apply them to real cloud services. In this blog, we have compiled a list of FAQs surrounding the use of Google's Cloud Operations Sandbox. The Google SRE sandbox provides an easy way to get started with the core skills you need to become a SRE.
SRE’s Golden Signals are four key metrics used to monitor the health of your service and underlying systems. We will explain what they are, and how they can help you improve service performance.
2020 heralded a year of increased complexity and customer demands, which isn’t going away. In this new normal, organizations will still be tasked with keeping up this break-neck pace. So, what did digital operations look like in 2020 compared to 2019?