Google Operations

Nov 22, 2021   |  By Kenny Kon
Editor’s note: Today we hear from Kenny Kon, an SRE Director at Sabre. Kenny shares how the company has successfully adopted Google’s SRE framework by leveraging its partnership with Google Cloud. A leader in the travel industry, Sabre Corporation drives innovation across global travel and develops solutions that help airlines, hotels, and travel agencies transform the traveler experience and satisfy the ever-evolving needs of its customers.
Nov 15, 2021   |  By Lee Yanco
Prometheus, the de facto standard for Kubernetes monitoring, works well for many basic deployments, but managing Prometheus infrastructure can become challenging at scale. As Kubernetes deployments continue to play a bigger role in enterprise IT, scaling Prometheus for a large number of metrics across a global footprint has become a pressing need for many organizations.
Nov 10, 2021   |  By Eyamba Ita
The need for relevant and contextual telemetry data to support online services has grown in the last decade as businesses undergo digital transformation. These data are typically the difference between proactively remediating application performance issues and suffering costly service downtime. Distributed tracing is a key capability for improving application performance and reliability, as noted in SRE best practices.
Oct 18, 2021   |  By Rakesh Dhoopar
Whether you are moving your applications to the cloud or modernizing them using Kubernetes, observing cloud-based workloads is more challenging than observing traditional deployments. When monitoring on-prem monoliths, operations teams had full visibility over the entire stack and full control over how telemetry data was collected and which data was collected (from infrastructure to platform to application data).
Oct 5, 2021   |  By Nathan Beach
The newly released 2021 Accelerate State of DevOps Report found that teams who excel at modern operational practices are 1.4 times more likely to report greater software delivery and operational performance and 1.8 times more likely to report better business outcomes. A foundational element of modern operational practices is having monitoring tooling in place to track, analyze, and alert on important metrics.
Sep 7, 2021   |  By Shyam Palani
The stakes of managing Lowes.com have never been higher, and that means spotting, troubleshooting and recovering from incidents as quickly as possible, so that customers can continue to do business on our site. To do that, it’s crucial to have solid incident engineering practices in place. Resolving an incident means mitigating the impact and/or restoring the service to its previous condition.
Aug 20, 2021   |  By Eyamba Ita
Inevitably, in the lifetime of a service or application, developers, DevOps, and SREs will need to investigate the cause of latency. Usually you will start by determining whether it is the application or the underlying infrastructure causing the latency. You have to look for signals that indicate the performance of those resources when the issue occurred.
Aug 18, 2021   |  By Rahul Harpalani
When you are experiencing an issue with your application or service, having deep visibility into both the infrastructure and the software powering your apps and services is critical. Most monitoring services provide insights at the Virtual Machine (VM) level, but few go further. To get a full picture of the state of your application or service, you need to know what processes are running on your infrastructure.
Aug 13, 2021   |  By Roy Nuriel
Keeping the experience of your end user in mind is important when developing applications. Observability tools help your team measure the performance indicators that matter to your users, like uptime. It’s generally a good practice to measure your service internally via metrics and logs, which can give you indications of uptime, but an external signal is very useful as well, wherever feasible.
Aug 12, 2021   |  By Haskell Garon
Troubleshooting production issues with virtual machines (VMs) can be complex and often requires correlating multiple data points and signals across infrastructure and application metrics, as well as raw logs. When your end users are experiencing latency, downtime, or errors, switching between different tools and UIs to perform a root cause analysis can slow your developers down.
Dec 1, 2021   |  By Google Operations
In our last episode, we covered how to best deploy and use Cloud Monitoring. This week, we answer the most important questions about Cloud Logging: What’s the best way to ingest logs? And how do you centralize logs and manage access? Watch this episode of Engineering for Reliability to learn best practices for using Cloud Logging, so you can keep your services reliable and your users happy.
Nov 17, 2021   |  By Google Operations
In our last episode, we covered best practices for deploying and using Cloud Operations in an enterprise environment. But we still left some questions unanswered. How should you monitor your services? How should you deal with alerts? And what about managing cost? In this episode of Engineering for Reliability, Yuri discusses best practices for setting up and using Cloud Monitoring and optimizing monitoring costs.
Nov 5, 2021   |  By Google Operations
Learn about innovations in cloud network security over a global network. This includes Google Cloud innovations released this year spanning DDoS protection and Web Application Firewall (WAF), Google Cloud Armor, Google Cloud firewalls, and Google Cloud IDS, the newest network-based intrusion detection solution.
Nov 3, 2021   |  By Google Operations
How can you get the most value out of Cloud Operations, especially as your Cloud footprint grows? In this episode of Engineering for Reliability, we look at the enterprise best practices for setting up and using Cloud Operations. Watch to learn how to improve the security of your services, better manage capacity, and keep your users happy!
Nov 2, 2021   |  By Google Operations
Prometheus is an open-source monitoring system that helps you collect, store, query, and get alerts on metrics that are important to your applications and infrastructure. In this video, we introduce Google Cloud Managed Service for Prometheus, which is designed to help you scale your monitoring. Watch to learn how you can configure and manage Prometheus to keep up with the metrics from all of your successful services!
Oct 28, 2021   |  By Google Operations
Monitoring CPU load and memory usage is common practice, but with serverless no action is required. In this video, we quickly explain that if your Cloud Run instances start hitting high CPU load, Google Cloud will automatically spin up new instances for you, and scale them back down when load drops!
Oct 22, 2021   |  By Google Operations
Welcome back to What’s New in Networking where we keep you up-to-date on Google Cloud networking. In this episode, David Tu gives you the latest updates for the Network Intelligence Center, Connectivity Tests, and the Performance Dashboard.
Oct 21, 2021   |  By Google Operations
If you’ve ever been surprised by the invoice from your cloud provider, then this video is for you! In this episode of Data Science & Analytics Patterns, we talk about managing your cloud costs by exporting billing data into your data warehouse and analyzing it with Looker.
Oct 20, 2021   |  By Google Operations
You've got Cloud Monitoring all set up in your project, but what do you do if you need to manage multiple projects and unify monitoring across them? In this episode of Engineering for Reliability, we look at Cloud Monitoring metrics scopes and show you how to use them to monitor multiple Cloud projects. Watch to learn how to use the Cloud Console to manage metrics scopes, view metrics from resources in multiple projects, and automate configurations using the API!
Oct 8, 2021   |  By Google Operations
In this video, Google Cloud Developer Advocate Stephanie Wong speaks with Google Fellow Eric Brewer about his experience building infrastructure, including Kubernetes, over the last decade at Google. You’ll get a window into what it was like to help propel Kubernetes into one of the largest open source projects today.

Monitoring and management for services, containers, applications, and infrastructure.

Operations aggregates metrics, logs, and events from infrastructure, giving developers and operators a rich set of observable signals that speed root-cause analysis and reduce mean time to resolution (MTTR). Operations doesn’t require extensive integration or multiple “panes of glass,” and it won’t lock developers into using a particular cloud provider.

Operations is built from the ground up for cloud-powered applications. Whether you’re running on Google Cloud Platform, Amazon Web Services, on-premises infrastructure, or with hybrid clouds, Operations combines metrics, logs, and metadata from all of your cloud accounts and projects into a single comprehensive view of your environment, so you can quickly understand service behavior and take action.