
Observability

Building and deploying applications is only half the battle. The other half is being able to observe what is going on inside your application once it is running. This is where observability comes in.

What is observability?

Observability is a term used to describe the ability to understand the state of a system by looking at the logs, metrics and traces it produces. This is in contrast to the traditional approach of debugging a system by looking at the code.

We often use the analogy of a car to explain the difference between the two approaches. If you have a problem with your car, you can either open the hood and inspect the engine (the code), or you can read the dashboard (the logs, metrics and traces). The dashboard tells you a lot about the state of the car, and you can use that information to understand what is going on.

The three pillars of observability are:

  1. Logs - Logs are a record of what has happened in your application. They are useful for debugging, but due to their unstructured format they generally do not scale very well.
  2. Metrics - Metrics are numerical measurements of something in your application. They are useful for understanding the performance of your application and, because they are structured data, are generally more scalable than logs in terms of both storage and querying.
  3. Traces - Traces are a record of the path a request takes through your application. They are useful for understanding how a request is processed in your application.
graph
A[Application] --> B((Logs))
A --> C((Metrics))
A --> D((Traces))

Metrics

Metrics are a way to measure the state of your application. They are usually numerical values that can be aggregated and visualized, and they are often used to create alerts and dashboards.

We use the OpenMetrics format for metrics. This is a text-based format that is easy to parse and understand. It is also the format used by Prometheus, which is the most popular metrics system.
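
To give a feel for what such a text-based format looks like, here is a minimal sketch that renders a counter in the Prometheus text exposition style, which OpenMetrics builds on. The function names and the example metric `http_requests_total` are hypothetical, and the sketch is simplified: it does not escape label values or emit the `# EOF` marker that strict OpenMetrics requires. In a real application you would normally use a client library instead.

```python
from typing import Dict, Iterable, Tuple


def format_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample line, for example: name{key="value"} 3."""
    if labels:
        # Sort labels so the output is stable between calls.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"


def render_counter(
    name: str,
    help_text: str,
    samples: Iterable[Tuple[Dict[str, str], float]],
) -> str:
    """Render a counter with its HELP and TYPE comment lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        lines.append(format_sample(name, labels, value))
    return "\n".join(lines) + "\n"
```

Calling `render_counter("http_requests_total", "Total HTTP requests.", [({"method": "GET"}, 3)])` yields two comment lines followed by one sample line per label combination.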

Get started with metrics

Prometheus

Prometheus is a time-series database that is used to store metrics. It is a very powerful tool that can be used to create alerts and dashboards. Prometheus is used by many open source projects and is the de facto standard for metrics in the cloud native world.

Prometheus is a pull-based system. This means that Prometheus will scrape (pull) metrics from your application. This is in contrast to a push-based system, where your application would push metrics to a central system.

graph LR
  Grafana --> Prometheus
  Prometheus --GET /metrics--> Application
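
The pull model means your application only has to answer `GET /metrics`. The sketch below shows one way to do that with the Python standard library. The metric name `app_requests_total`, the helper names and the bind address are assumptions made for illustration; a real service would more likely rely on a Prometheus client library.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

_lock = threading.Lock()
_request_count = 0  # hypothetical application metric


def record_request() -> None:
    """Increment the counter; call this from your request handling code."""
    global _request_count
    with _lock:
        _request_count += 1


def render_metrics() -> str:
    """Render the current counter value in the text exposition format."""
    with _lock:
        count = _request_count
    return (
        "# HELP app_requests_total Requests handled by the application.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {count}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep scrape requests out of the application log


def start_metrics_server(port: int = 0) -> ThreadingHTTPServer:
    """Serve /metrics in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application never needs to know where Prometheus lives. It only exposes its current values, and Prometheus decides when to scrape them.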

Grafana

Grafana is a tool for visualizing metrics. It is used to create dashboards that let you monitor your application. Grafana is used by many open source projects and is the de facto standard for dashboards in the cloud native world.

Access Grafana here

Logs

Logs are a way to understand what is happening in your application. They are usually text-based and are often used for debugging. Because log formats are usually not standardized, logs can be difficult to query and aggregate. We therefore recommend using metrics for dashboards and alerting.

Logs are collected automatically by fluentd, stored in Elasticsearch and made accessible via Kibana.

graph LR
  Application --stdout/stderr--> Fluentbit
  Fluentbit --> Elasticsearch
  Elasticsearch --> Kibana
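
Since logs are collected from stdout/stderr, one common way to make them easier to query is to write one JSON object per line. The sketch below does this with Python's standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`) and the helper `build_logger` are assumptions for illustration, not a format the platform requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback as a string field.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str = "app", stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or a given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

With this in place, `build_logger().info("order %s created", 42)` prints a single JSON line whose `message` field is `order 42 created`, which is simpler to filter on than free-form text.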

Configure your logs

Alerts

Alerts are a way to notify you when something is wrong with your application, and are usually triggered when a metric or log entry matches a certain condition.

Alerts in NAIS are based on application metrics and use Prometheus Alertmanager to send notifications to Slack.

The alert resource can be used to configure alerts for your applications.

graph LR
  alerts.yaml --> Prometheus
  Prometheus --> Alertmanager
  Alertmanager --> Slack
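
The idea of an alert condition can be sketched in a few lines. This is only a conceptual illustration of "a metric matches a certain condition": in practice Prometheus evaluates the rule expressions itself, and the `AlertRule` class, its fields and the `should_fire` function below are hypothetical names, not part of the alert resource.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AlertRule:
    name: str
    threshold: float   # fire when the metric is above this value
    consecutive: int   # number of most recent samples that must breach


def should_fire(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Return True if the last `consecutive` samples all exceed the threshold."""
    if rule.consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < rule.consecutive:
        return False  # not enough data to decide yet
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)
```

Requiring several consecutive breaches is a simple way to avoid notifying on a single short spike.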

Configure your alerts

Learning more

Observability is a very broad topic and there is a lot more to learn. Here are some resources that you can use to learn more about observability:


Last update: 2022-11-08
Created: 2022-10-20