Observability¶

Building and deploying applications is only half the battle. The other half is to be able to observe what's going on in your application. This is where observability comes in.

What is observability?¶

Observability is a term used to describe the ability to understand the state of a system by looking at the logs, metrics and traces it produces. This is in contrast to the traditional approach of debugging a system by looking at the code.

We often use the analogy of a car to explain the difference between the two approaches. If you have a problem with your car, you can either look at the code (the engine) or you can look at the car itself (the dashboard). The dashboard gives you a lot of information about the state of the car, and you can use this information to understand what is going on.

The tree pillars of observability are:

Logs - Logs are a record of what has happened in your application. They are useful for debugging, but due to their unstructured format they generally do not scale very well.
Metrics - Metrics are a numerical measurement of something in your application. They are useful for understanding the performance of your application and is generally more scalable than logs both in terms of storage and querying since they are structured data.
Traces - Traces are a record of the path a request takes through your application. They are useful for understanding how a request is processed in your application.

graph
  A[Application] --> B(Logs)
  A --> C(Metrics)
  A --> D(Traces)

  click B "#logs"
  click C "#metrics"
  click D "#traces"

Automatic observability¶

Nais provides a new way to get started with observability. By enabling auto-instrumentation, you can get started with observability without having to write any code. This is the easiest way to get started with observability, as it requires little to no effort on the part of the team developing the application.

Get started with auto-instrumentation

Metrics¶

Metrics are a way to measure the state of your application. Metrics are usually numerical values that can be aggregated and visualized. Metrics are often used to create alerts and dashboards.

We use the OpenMetrics format for metrics. This is a text-based format that is easy to parse and understand. It is also the format used by Prometheus, which is the most popular metrics system.

Learn more about metrics

Prometheus¶

Prometheus is a time-series database that is used to store metrics. It is a very powerful tool that can be used to create alerts and dashboards. Prometheus is used by many open source projects and is the de facto standard for metrics in the cloud native world.

Prometheus is a pull-based system. This means that Prometheus will scrape (pull) metrics from your application. This is in contrast to a push-based system, where your application would push metrics to a central system.

graph LR
  Grafana --> Prometheus
  Prometheus --GET /metrics--> Application

Access Prometheus here

Grafana¶

Grafana is a tool for visualizing metrics. It is used to create dashboards that can be used to monitor your application. Grafana is used by many open source projects and is the de facto standard for metrics in the cloud native world.

Access Grafana here

Logs¶

Logs are a way to understand what is happening in your application. They are usually text-based and are often used for debugging. Since the format of logs is usually not standardized, it can be difficult to query and aggregate logs and thus we recommend using metrics for dashboards and alerting.

Logs that are sent to console (stdout) are collected automatically and can be configured for persistent storage and querying in several ways.

graph LR
  Application --stdout/stderr--> Router
  Router --> A[Grafana Loki]
  Router --> B[Team Logs]

Learn more about logs

Traces¶

With tracing, we can get application performance monitoring (APM). Tracing gives deep insight into the execution of your application. For instance, you can use tracing to see if parallel function are actually run in parallel, or what amount of time your application spends in a given function.

Traces from Nais applications can be collected using the OpenTelemetry standard. Performance metrics are stored and queried from the Tempo component.

Visualization of traces can be done in Grafana, using the *-tempo data sources (one for each environment).

graph LR
  Application --gRPC--> Tempo
  Tempo --> Grafana

Learn more about tracing

Alerts¶

Alerts are a way to notify you when something is wrong with your application, and are usually triggered when a metric or log entry matches a certain condition.

Alerts in Nais are based on application metrics and use Prometheus Alertmanager to send notifications to Slack.

graph LR
  alerts.yaml --> Prometheus
  Prometheus --> Alertmanager
  Alertmanager --> Slack

Learn more about alerts

Learning more¶

Observability is a very broad topic and there is a lot more to learn. Here are some resources that you can use to learn more about observability:

Monitoring, the Prometheus Way

SRE Book - Monitoring distributed systems

SRE Workbook - Monitoring

SRE Workbook - Alerting