Chaos Engineering

Incident Ready: How to Chaos Engineer Your Incident Response Process | FireHydrant

We’re pretty sure using a real incident to test a new response process is not the best idea. So, how do you test your process ahead of time? In this video, FireHydrant CEO, Robert Ross, will share how FireHydrant customers leverage best practices to break, mitigate, resolve, and fireproof incident processes. We’ll show you how to use chaos engineering philosophies to stress test 3 critical parts of a great process.

Breaking Serverless Things on Purpose: Chaos Engineering in Stateless Environments - Emrah Samdan

Serverless enabled us to build highly distributed applications that led to more granular functions and ultimate scalability. However, it also brought the risk of failure from a single microservice to many serverless functions and resources. You might be able to predict and design for certain troublesome issues but there are many, many more that you probably will not be able to easily plan for. How do you build a resilient system under these highly distributed circumstances? The answer is Chaos Engineering: Breaking things on purpose just to experience how the whole system will react.

Chaos Engineering: The Path to Reliability - Kolton Andrus

We’re all here for the same purpose: to ensure the systems we build operate reliably. This is a difficult task, one that must balance people, process and technology during difficult conditions. We operate with incomplete information, assessing risks and dealing with emerging issues. We’ve found Chaos Engineering to be a valuable tool in addressing these concerns. Learn from real world examples what works, what doesn’t, and what the future holds.

Identifying Hidden Dependencies - Liz Fong Jones

You don't need to write automation or deploy on Kubernetes to gain benefits from resilience engineering! Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose. We'll discuss the initial manual experiments we ran, the bugs in our automatic replacement tools we uncovered, and what steps we needed to progress towards continuously running the experiments. Today, no node at Honeycomb lives longer than 12 months, and we automatically recycle nodes every week.
gremlin

Looking back on Chaos Conf 2020

It’s already been a week since we closed our third annual Chaos Conf! While we were forced to take the conference online, this meant that more of you could join us. Over 3,500 people signed up to help make this the world’s largest Chaos Engineering conference. That’s 5x more than 2019, and nearly 10x more than 2018! This is a testament to the growth of Chaos Engineering as a practice across many different industries and around the world.

gremlin

Is your microservice a distributed monolith?

Your team has decided to migrate your monolithic application to a microservices architecture. You’ve modularized your business logic, containerized your codebase, allowed your developers to do polyglot programming, replaced function calls with API calls, built a Kubernetes environment, and fine-tuned your deployment strategy. But soon after hitting deploy, you start noticing problems.

gremlin

Technology Business Management and Chaos Engineering

Get started with Gremlin’s Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. Technology Business Management (TBM) is a decision-making tool that helps organizations maximize the business value of information technology (IT) spending by adjusting management practices. With TBM, IT is transformed to run like a business instead of merely a cost center.

gremlin

Understanding your application's critical path

Don’t wait for an incident to focus on reliability. Learn concrete steps for preventing incidents in the first place in our two-part series, Planning and Architecting for Reliability. It’s 3 a.m. You’re lying comfortably in bed when suddenly your phone starts screeching. It’s an automated high-severity alert telling you that your company’s web application is down. Exhausted, you open the website on your phone and do some basic tests.