Prometheus Alerting Rule Reference¶
Prometheus alerts are defined in a PrometheusRule
resource. This resource is part of the Prometheus Operator and is used to define alerts that should be sent to the Alertmanager.
Alertmanager is a component of the Prometheus project that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct Slack channel.
Prometheus alerts are defined using the PromQL query language. The query language is used to specify when an alert should fire, and the PrometheusRule
resource is used to specify the alert and its properties.
Prometheus alerts are sent to the team's Slack channel configured in Console when the alert fires.
graph LR
alerts.yaml --> Prometheus
Prometheus --> Alertmanager
Alertmanager --> Slack
PrometheusRule
¶
.nais/alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
team: my-team
name: my-alerts
namespace: my-team
spec:
groups:
- name: my-app-client-errors
rules:
- alert: HttpClientErrorRateHigh
expr: |
1 - (
sum(
rate(
http_client_request_duration_seconds_count{app="my-app", http_response_status_code="200"}[5m]
)
) by (server_address)
/
sum(
rate(
http_client_request_duration_seconds_count{app="my-app"}[5m]
)
) by (server_address)
) * 100 < 95
for: 10m
annotations:
summary: "High error rate for outbound http requests"
consequence: "Users are experiencing errors when using the application."
action: "Check the logs using `kubectl logs` for errors and check upstream services."
message: "Requests to `{{ $labels.server_address }}` are failing at {{ $value }}% over the last 5 minutes."
runbook_url: "https://github.com/navikt/my-app-runbook/blob/main/HttpClientErrorRateHigh.md"
dashboard_url: "https://grafana.nav.cloud.nais.io/d/000000000/my-app"
labels:
severity: warning
namespace: my-team
groups[]
¶
A PrometheusRule
can contain multiple groups of rules. Each group can contain multiple alert rules.
groups[].name
¶
The name of the group. This is used to group alerts in the Alertmanager.
groups[].rules[]
¶
groups[].rules[].alert
¶
The name of the alert. This is used to identify the alert in the Alertmanager. Typically this is a short, descriptive name on the form CamelCase
.
groups[].rules[].expr
¶
The expression that defines when the alert should fire. This is a PromQL expression that should evaluate to true
when the alert should fire.
We suggest using the Explore page in Grafana to build and test your PromQL expressions before creating a PrometheusRule
.
groups[].rules[].for
¶
For how long time the expr
must evaluate to true
before firing the alert. This is used to prevent flapping alerts and alerting on temporary spikes in metrics.
When the expr
first evaluates to true
the alert will be in pending
state for the duration specified.
Example values: 30s
, 5m
, 1h
.
groups[].rules[].labels
¶
Labels to attach to the alert. These are used to group and filter alerts in the Alertmanager.
groups[].rules[].labels.severity
(required)¶
This will affect what color the notification gets. Possible values are critical
(🔴), warning
(🟡) and info
(🟢).
groups[].rules[].labels.namespace
(required)¶
The team that is responsible for the alert. This is used to route the alert to the correct Slack channel.
groups[].rules[].labels.send_resolved
(optional)¶
If set to false
, no resolved message will be sent when the alert is resolved. This is useful for alerts that are not actionable or where the resolved message is not needed.
groups[].rules[].annotations
¶
groups[].rules[].annotations.summary
(optional)¶
The summary annotation is used to give a short description of the alert. This is useful for the one receiving the alert to understand what the alert is about and is the first line in the alert message in Slack.
groups[].rules[].annotations.consequence
(optional)¶
The consequence annotation is used to describe what happens in the world when this alert fires. This is useful for the one receiving the alert to understand the impact of the alert.
groups[].rules[].annotations.action
(optional)¶
The action annotation is used to describe what the best course of action is to resolve the issue. Good alerts should have a clear action that can be taken to resolve the issue.
groups[].rules[].annotations.message
(optional)¶
The message annotation is used to give a more detailed description of the alert. This is useful for the one receiving the alert to understand the alert in more detail and will printed for each result in the alert expression.
groups[].rules[].annotations.runbook_url
(optional)¶
The runbook URL annotation is used to link to a runbook that describes how to resolve the issue. This is useful for the one receiving the alert to quickly find the information needed to resolve the issue and is added as a link in the alert message in Slack.
groups[].rules[].annotations.dashboard_url
(optional)¶
The dashboard URL annotation is used to link to a dashboard that can help diagnose the issue. This is useful for the one receiving the alert to quickly find the information needed to diagnose the issue and is added as a link in the alert message in Slack.