Alert Manager
Alert Manager provides automatic alerting based on metrics stored in the Prometheus server.
Install Alert Manager
To install Alert Manager, run 04_alertmanager.sh as the root user on the monitoring server.
#!/bin/bash
# all packages are installed as root
if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root" 1>&2
    exit 1
fi
# version of alert manager to install
VERSION="0.27.0"
# download
wget https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/alertmanager-${VERSION}.linux-amd64.tar.gz
# extract contents and remove original archive
tar xvfz alertmanager-${VERSION}.linux-amd64.tar.gz && rm alertmanager-${VERSION}.linux-amd64.tar.gz
# create data folder and let the prometheus user (which runs the service) own it
mkdir -p /var/lib/alertmanager
chown prometheus:prometheus /var/lib/alertmanager
# change into extracted folder
pushd alertmanager-${VERSION}.linux-amd64
# move to bin and etc
mv alertmanager amtool /usr/local/bin/
mv alertmanager.yml /etc/prometheus/alertmanager.yml
# return to parent folder
popd
# and check alert manager is installed
alertmanager --version
# create systemd service file
cat >/etc/systemd/system/alertmanager.service << EOL
[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/local/bin/alertmanager \
--config.file /etc/prometheus/alertmanager.yml \
--storage.path /var/lib/alertmanager/ \
--log.level=info
[Install]
WantedBy=multi-user.target
EOL
# reload systemd, enable the service and restart it
systemctl daemon-reload
systemctl enable alertmanager
systemctl restart alertmanager
# check the service is running
systemctl status alertmanager
After this script finishes successfully, the systemctl status alertmanager command should show the Alert Manager server running, and the Alert Manager Admin UI should be reachable at the IP address of the monitoring server on port 9093.
* alertmanager.service - Alert Manager
Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-05-16 12:23:37 UTC; 17h ago
Main PID: 828 (alertmanager)
Tasks: 8 (limit: 4558)
Memory: 36.5M
CPU: 35.167s
CGroup: /system.slice/alertmanager.service
└─828 /usr/local/bin/alertmanager --config.file /etc/prometheus/alertmanager.yml
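As an additional check from the command line, the built-in health and readiness endpoints of Alert Manager can be queried and the recent service logs inspected. This is an optional sketch that assumes the default port 9093 and the alertmanager systemd unit created by the script above.
# health and readiness endpoints of alert manager (default port 9093)
curl -s http://localhost:9093/-/healthy
curl -s http://localhost:9093/-/ready
# last log lines of the alertmanager systemd unit
journalctl -u alertmanager --no-pager -n 20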
We now need to configure the receiver of the alerts. This is done in the /etc/prometheus/alertmanager.yml file.
Alert Manager supports many receivers, but as we use MS Teams internally, the sample integration below uses the Incoming Webhooks API. The URL here is just a sample, but your actual URL should look similar. Note that the lines are wrapped in this example; actual lines must not be.
route:
  receiver: 'teams'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'teams'
    msteams_configs:
      - webhook_url: 'https://blabla.webhook.office.com/webhookb2/
        bafdbe36-728a-476b-9c91-9755a2102aa3@057485d6-aa30-424a-aa6
        6-b668b2cb137b/IncomingWebhook/94e7c1fce1054e038fc55bb
        61e16bba5/42f6afa4-41ac-4672-99f6-93a4099d46bd'
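Before restarting the service it is a good idea to validate the edited file with amtool, which was installed to /usr/local/bin by the script above. A minimal sketch:
# validate the receiver configuration
amtool check-config /etc/prometheus/alertmanager.yml
# apply the new configuration
systemctl restart alertmanager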
Integrate Alert Manager and Prometheus
To integrate Alert Manager with Prometheus we need to add the following entries to /etc/prometheus/prometheus.yml. The Prometheus server will then look for Alert Manager on localhost, port 9093 and load alerting rules from our file /etc/prometheus/websafety_alerts.yml.
#
# alertmanager configuration
#
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

#
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
#
rule_files:
  - "websafety_alerts.yml"
Configure Alert Rules
The websafety_alerts.yml file should be copied from our repository to /etc/prometheus/websafety_alerts.yml and should contain the following alerts.
groups:
  - name: websafety
    rules:
      #
      # websafety appliance is down
      #
      - alert: InstanceDown
        expr: up{job="websafety_node", app="websafety"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Web Safety Appliance {{ $labels.instance }} is down for more than 5 minutes"
      #
      # websafety appliance has low disk space (less than 10% of the root filesystem available)
      #
      - alert: InstanceLowDiskSpace
        expr: ((node_filesystem_avail_bytes{app='websafety', mountpoint='/'}) / (node_filesystem_size_bytes{app='websafety', mountpoint='/'}) * 100) < 10
        labels:
          severity: critical
        annotations:
          summary: "Web Safety Appliance {{ $labels.instance }} runs low on the HDD space available"
      #
      # websafety appliance has low RAM
      #
      - alert: InstanceLowMemory
        expr: ((node_memory_MemAvailable_bytes{app='websafety'}) / (node_memory_MemTotal_bytes{app='websafety'}) * 100) < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Web Safety Appliance {{ $labels.instance }} runs low on the RAM for more than 5 minutes"
      #
      # websafety appliance high CPU
      #
      - alert: InstanceHighCPUUsage
        expr: 100 * (1 - avg by(instance)(irate(node_cpu_seconds_total{mode='idle'}[1m]))) >= 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Web Safety Appliance {{ $labels.instance }} has high CPU usage for more than 5 minutes"
Verify Alerts
To verify alerts, we will now shut down the 192.168.4.14 virtual appliance. Grafana will readily mark the appliance as Down, as shown on the following screenshot.
The alert will be visible and firing in Prometheus.
The Admin UI of Alert Manager will also show the alert as firing.
Finally, our Prometheus channel in MS Teams will show the incoming alert.
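If shutting down an appliance is not convenient, a synthetic alert can also be pushed directly into Alert Manager to confirm that the MS Teams webhook works. This is an optional sketch; the alert name, labels and annotation below are made-up test values.
# send a test alert straight to alert manager
amtool alert add TestAlert severity=warning instance=test \
  --annotation=summary="Test alert from amtool" \
  --alertmanager.url=http://localhost:9093
# list alerts currently known to alert manager
amtool alert query --alertmanager.url=http://localhost:9093
# list alerts currently firing or pending in prometheus
curl -s http://localhost:9090/api/v1/alerts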