Google Operations

May 7, 2021   |  By Adrian Hilton
A big part of ensuring the availability of your applications is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does every day here at Google Cloud. The end goal of our SRE principles is to improve services and in turn the user experience. The concept of SRE starts with the idea that metrics should be closely tied to business objectives. In addition to business-level SLAs, we also use SLOs and SLIs in SRE planning and practice.
May 3, 2021   |  By John Brier
For decades, application development and operations teams have struggled with the best way to generate, collect, and analyze telemetry data from systems and apps. In 2010, we discussed our approach to telemetry and tracing in the Dapper papers, which eventually spawned the open-source OpenCensus project, which merged with OpenTracing to become OpenTelemetry.
Apr 28, 2021   |  By Rahul Harpalani
Site Reliability Engineering (SRE) and Operations teams responsible for operating virtual machines (VMs) are always looking for ways to provide a more stable, more scalable environment for their development partners. Part of providing that stable experience is having telemetry data (metrics, logs and traces) from systems and applications so you can monitor and troubleshoot effectively.
Apr 27, 2021   |  By Rakesh Dhoopar
As applications move from monolithic architectures to microservices-based architectures, DevOps and Site Reliability Engineering (SRE) teams face new operational challenges. Microservices are updated constantly with new features, and resource managers and schedulers (like Kubernetes and GKE) can add or remove containers in response to changing workloads. The old way of creating alerts based on learned behaviors of your monolithic applications will not work with microservices applications.
Apr 22, 2021   |  By John Day
Elite software development teams automate and integrate monitoring and observability tools more frequently than lower-performing teams, per the Accelerate: State of DevOps report. Organizations that need the highest levels of reliability, security, and scalability for their applications choose Google Kubernetes Engine (GKE). Recently we introduced GKE Autopilot to further simplify Kubernetes operations by automating the management of the cluster infrastructure, control plane, and nodes.
Apr 1, 2021   |  By Charles Baer
System and application logs provide crucial data for operators and developers to troubleshoot and keep applications healthy. Google Cloud automatically captures log data for its services and makes it available in Cloud Logging and Cloud Monitoring. As you add more services to your fleet, tasks such as determining a budget for storing log data and performing granular cross-project analysis can become challenging.
Mar 23, 2021   |  By Rahul Harpalani
Running and troubleshooting production services requires deep visibility into your applications and infrastructure. Virtual machines running on Google Compute Engine (GCE) provide some system logs and metrics without any configuration required, but capturing application and advanced system data has required the installation of both a metrics agent and a logging agent.
Mar 2, 2021   |  By Charles Baer
Troubleshooting an application running on Google Kubernetes Engine (GKE) often means poking around various tools to find the key bit of information in your logs that leads to the root cause. With Cloud Operations, our integrated management suite, we’re working hard to provide the information that you need right where and when you need it. Today, we’re bringing GKE logs closer to where you are—in the Cloud Console—with a new logs tab in your GKE resource details pages.
Feb 26, 2021   |  By Rory Petty
Cloud Monitoring is one of the easiest ways you can gain visibility into the performance, availability, and health of your applications and infrastructure. Today, we’re excited to announce the lifting of three limits within Cloud Monitoring. First, the maximum number of projects that you can view together is now 375 (up from 100). Customers with 375 or fewer projects can view all their metrics at once, by putting all their projects within a single workspace.
Feb 26, 2021   |  By Yuri Grinshteyn
Applications fail. Containers crash. It’s a fact of life that SRE and DevOps teams know all too well. To help navigate life’s hiccups, we’ve previously shared how to debug applications running on Google Kubernetes Engine (GKE). We’ve also updated the GKE dashboard with new easier-to-use troubleshooting flows. Today, we go one step further and show you how you can use these flows to quickly find and resolve issues in your applications and infrastructure.
May 14, 2021   |  By Google Operations
APIs are packages of data and functionality that contain business-critical information. However, as API programs scale, it becomes impossible to manage each API individually. In this video, we demo how Apigee helps simplify API operations and allows you to deliver seamless and connected experiences for your customers.
May 3, 2021   |  By Google Operations
Cloud Logging is a real-time log management tool that allows you to securely store, search, analyze, and alert on all of your log data and events. In this video, we show you what Cloud Logging is and how you can use it to convert logs to log-based metrics for monitoring, alerting, analyzing, and visualizing your applications and infrastructure.
Apr 10, 2021   |  By Google Operations
Almost every app and digital interaction today depends on APIs, so it’s important to be able to find and fix issues fast. Apigee’s API monitoring can alert you to live issues, give you in-depth details for every problem, and recommend a course of action. Take a look at this API monitoring demo from the Apigee team to keep your APIs running smoothly!
Mar 15, 2021   |  By Google Operations
APIs are great tools since they provide developers a simplified way to consume data and functionality that resides in backend systems. However, they are targets for malicious attacks because they contain business-critical information. In this video, we demo how Google Cloud can help you better secure your APIs with Apigee and Cloud Armor. Watch to learn how these tools offer security at multiple levels for your APIs!
Feb 6, 2021   |  By Google Operations
Want to visualize your monitoring data like never before? In this episode of Stack Doctor, we show you how to use the new Dashboard Editor to easily visualize your Cloud Monitoring data. Specifically, we’ll show you how to create a dashboard using gauges, scorecards, and text widgets and how you can utilize the new layouts and chart configuration modes to closely monitor the health of your services!
Feb 1, 2021   |  By Google Operations
Welcome to the Google Cloud Video Learning Series, where we show you how to use Google Cloud services. In this episode, we’ll show you how to export logs from Google Cloud Logging to BigQuery. Customers often export logs to BigQuery to run analytics against the metrics extracted from the logs. BigQuery can help identify unauthorized changes in configuration and inappropriate access to data, thus meeting your organization’s security and analytics requirements.
Jan 28, 2021   |  By Google Operations
Cloud SQL Insights helps you detect, diagnose, and prevent query performance problems for Cloud SQL databases. With Insights, you can monitor performance at an application level and trace the source of a problematic query across the application stack by model, view, controller, route, user, and host. In this video, we introduce you to Cloud SQL Insights and demo how you can use it for self-service, intuitive monitoring and troubleshooting.
Oct 5, 2020   |  By Google Operations
Tune in every week for a new episode and let us know what you think of the latest announcements in the comments below! Products: Apigee, Google Cloud VMs. Featuring Stephanie Wong.
Oct 2, 2020   |  By Google Operations
Formerly known as Stackdriver, Google Cloud Operations Suite is a platform where you can monitor, troubleshoot, and improve application performance on your Google Cloud environment. In this episode of Google Cloud Drawing Board, we show you what Google Cloud Operations Suite is and how you can use it to gain greater observability over your applications.
Sep 5, 2020   |  By Google Operations
We try to automate as much as possible in our environments, but we often treat monitoring as an afterthought. In this episode of Stack Doctor, we show you how to automate your monitoring configurations via Terraform. Watch to learn how you can automate the creation of common resources, such as uptime checks, alerting policies, and dashboards, with Terraform!

Monitoring and management for services, containers, applications, and infrastructure.

Operations aggregates metrics, logs, and events from infrastructure, giving developers and operators a rich set of observable signals that speed root-cause analysis and reduce mean time to resolution (MTTR). Operations doesn’t require extensive integration or multiple “panes of glass,” and it won’t lock developers into using a particular cloud provider.

Operations is built from the ground up for cloud-powered applications. Whether you’re running on Google Cloud Platform, Amazon Web Services, on-premises infrastructure, or a hybrid cloud, Operations combines metrics, logs, and metadata from all of your cloud accounts and projects into a single comprehensive view of your environment, so you can quickly understand service behavior and take action.