Datadog on Gamedays

Sep 13, 2021

As engineers scaling our applications and infrastructure, we accept that failure can and will happen. But how can we get ahead of those potential failures? Gamedays are events that test the resilience of a system under abnormal and turbulent conditions, checking whether our expectations about how it will fail (or not) are correct.

In this session, Ara Pulido, Technical Evangelist, chatted with Mike Petruzelli, a reliability engineer on the Core Resilience team, and Elijah Andrews, a software engineer on the Traffic team. We discussed and showed examples of how gamedays are organized at Datadog, particularly how reliability engineers partner with teams across the organization to run larger events focused on general failures that affect a large part of the system.

After watching this session, you will have a better understanding of what gamedays are and how you can start organizing them at your company.

00:00 - Introduction

03:19 - What are Gamedays?

15:16 - Why do we organize Gamedays?

19:57 - What do we aim for when organizing Gamedays?

22:15 - Gamedays at Datadog

25:20 - Small Scale Gamedays

34:17 - Large Scale Gamedays

43:58 - Q&A