San Mateo, CA, USA
Oct 28, 2020   |  By Blameless Community
Blameless recently had the pleasure of interviewing Yury Niño Roa, Site Reliability Engineer, Solutions Architect and Chaos Engineering Advocate at ADL Digital Labs. She’s worked in roles ranging from solutions architect, to software engineering professor, to DevOps engineer, to SRE. Additionally, Yury is an avid blogger and conference speaker who regularly presents at events such as Chaos Conf, DevOpsDays Bogotá, and more.
Oct 27, 2020   |  By Emily Arnott
Onboarding is an essential yet challenging part of the hiring process. As your organization matures, more of its processes become unique. This makes it harder for new employees to get up to speed. Investing in custom processes and tooling to achieve your specific goals is a valuable practice. But you must balance this with an investment in onboarding.
Oct 26, 2020   |  By Hannah Culver
Atlassian JIRA, one of the most popular ticketing systems, allows teams to catalogue incidents, follow-up actions, bugs, stories, and more. As a common tool in any DevOps/SRE operation’s toolchain, JIRA is a key integration at Blameless. Blameless’ integration with JIRA allows teams to automatically generate a ticket within both Blameless and JIRA. This integration also allows teams to track follow-up actions via Blameless’ postmortem tool.
Oct 19, 2020   |  By Emily Arnott
Adopting SRE principles into your organization can be a big undertaking. You’ll need to develop new practices and procedures to minimize the costs of incident coordination. You’ll need to create a retrospective process that encourages continuous learning. You’ll need to shift culture to begin appreciating failure as an opportunity to grow. Your transition to the world of SRE will also require buy-in from all levels of your organization.
Oct 16, 2020   |  By Blameless Community
BOO! Did we scare you? We couldn’t help it; we’re just so happy it’s spooky season. Here’s the October issue of SREview! This monthly zine features epic Tweets, content, and events happening in the SRE and resilience engineering community.
Oct 13, 2020   |  By Emily Arnott
When we talk about the reliability of services, SRE encourages us to take a holistic view. Unreliability in service delivery can be due to anything, from hardware malfunctions to errors in code. One source of unreliability that is often overlooked is security. A security breach can damage customer trust far beyond the impact of the breach itself. Even smaller infractions, like failing a service audit, can make users wary.
Oct 8, 2020   |  By Emily Arnott
As you adopt SRE practices, you’ll find that there are optimization opportunities across every part of your development and operations cycle. SRE breaks down silos and helps learning flow through every stage of the software lifecycle. This forms connections between different teams and roles. Understanding all the new connections formed by SRE practices can be daunting. Building a model of SRE specific to your organization is a good way to keep a clear picture in your head.
Oct 1, 2020   |  By Emily Arnott
Network Operation Centers, or NOCs, serve as hubs for monitoring and incident response. A NOC is usually a physical location in an organization. NOC operators sit at a central desk with screens showing current service data. But the functionality of a NOC can be distributed. Some organizations build virtual NOCs, which can be staffed fully remotely. This allows for distributed teams and follow-the-sun rotations. NOC as a service is another structure gaining in popularity.
Sep 30, 2020   |  By Hannah Culver
Incident retrospectives (or postmortems, post-incident reports, RCAs, etc.) are the most important part of an incident. This is where you take the gift of that experience and turn it into knowledge. This knowledge then feeds back into the product, improving reliability and ensuring that no incident is a wasted learning opportunity. Every incident is an unplanned investment and teams should strive to make the most of it.
Sep 24, 2020   |  By Emily Arnott
We live in the era of software convenience, where we take for granted that hundreds of services are always at our fingertips. These applications become part of our daily routines because they are so reliable. However, this consistency makes reliability work invisible to the end user. It can be difficult to appreciate the effort behind maintaining a high-availability service. Because of that, people may misunderstand exactly what makes a service reliable.

Blameless offers the only complete reliability engineering platform that brings together AI-driven incident resolution, blameless postmortems, SLOs/Error Budgets, and reliability insights reports and dashboards, enabling businesses to optimize reliability and innovation.

Enabling modern software businesses to adopt SRE best practices:

  • Incident Resolution: Use AI to engage the right people and teams in the right way to stop problems fast, ensure customer satisfaction and prevent incidents from happening again.
  • Blameless Postmortems: Learn without pointing fingers, ensuring continuous improvements. We automatically bring relevant information, proper context and industry best practices to your postmortem process.
  • SLOs/Error Budgets: Create SLOs and see your remaining error budgets with the SLO dashboard. Teams gain insight into what parts of the business are consuming the error budget, allowing them to make informed trade-offs between releasing new features and investing in reliability.
  • Reliability Insights: Consume event data across your entire DevOps stack, query the data, and create custom dashboards, so teams can quickly find signals amid the noise in their DevOps data.

The Complete Site Reliability Engineering (SRE) Platform.