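The Ultimate Alertmanager Configuration Guide: From the "Hack" to the By-the-Book Approach
When Prometheus Operator's AlertmanagerConfig flatly refused to take effect, I went after the encoded configuration itself...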
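Background
After deploying Prometheus Operator, my carefully written AlertmanagerConfig resources simply would not take effect. After countless fruitless debugging sessions, I decided to bypass the Operator and edit the generated default configuration directly. That configuration lives in a Secret, gzip-compressed and base64-encoded (encoded, not actually encrypted). This is a hack, but the results were immediate!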
The Hack: Straight for the Heart
1. Fetch the encoded configuration
kubectl get secret alertmanager-rancher-monitoring-alertmanager-generated \
  -n cattle-monitoring-system -o yaml > secret.yaml
2. Decode the core configuration
# Install yq (writing to /usr/local/bin usually requires root)
wget https://github.com/mikefarah/yq/releases/download/v4.25.1/yq_linux_amd64 -O /usr/local/bin/yq
chmod +x /usr/local/bin/yq
# Decode the Alertmanager configuration (base64, then gzip)
yq eval '.data."alertmanager.yaml.gz"' secret.yaml | base64 -d | gzip -d > alertmanager.yaml
# Decode the template file (base64 only)
yq eval '.data."rancher_defaults.tmpl"' secret.yaml | base64 -d > rancher_defaults.tmpl
3. Modify the configuration (QQ Mail example)
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'xxxx@qq.com'
  smtp_auth_username: 'xxxx@qq.com'
  smtp_auth_password: 'xxxxxxx'
  smtp_require_tls: false
route:
  receiver: "k8s-alarm"
  group_by: [alertname]
  routes:
  - receiver: "null"
    matchers:
    - alertname = "Watchdog"
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
- name: "k8s-alarm"
  email_configs:
  - to: 'test@gmail.cn'
    send_resolved: true
- name: "null"
templates:
- /etc/alertmanager/config/*.tmpl
4. Re-encode and deploy
# Compress the configuration
gzip -c alertmanager.yaml > alertmanager.yaml.gz
# Base64-encode (single line, no wrapping)
ALERTMANAGER_CONFIG=$(base64 -w0 alertmanager.yaml.gz)
TEMPLATE_CONFIG=$(base64 -w0 rancher_defaults.tmpl)
# Build the new Secret; the exported resourceVersion and uid are dropped
# because they belong to the original object, not the copy being created
yq eval ".data.\"alertmanager.yaml.gz\" = \"$ALERTMANAGER_CONFIG\" |
         .data.\"rancher_defaults.tmpl\" = \"$TEMPLATE_CONFIG\" |
         del(.metadata.resourceVersion) | del(.metadata.uid)" secret.yaml > updated-secret.yaml
# Rename the Secret
sed -i 's/name: alertmanager-.*/name: alertmanager-main/' updated-secret.yaml
# Apply it
kubectl apply -f updated-secret.yaml -n cattle-monitoring-system
5. Modify the Alertmanager workload
# Edit the volumes section
volumes:
- name: config-volume
  secret:
    secretName: alertmanager-main  # replaces the default value
Verifying the result
[Screenshots of the result; images not available]
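Beyond watching for the alert e-mail, one check that can be done locally is to confirm that the re-encoded payload decodes back to exactly the file that was edited. The following is a minimal sketch, assuming GNU coreutils (`base64 -w0`, as used in the steps above); the function names `encode_config`, `decode_payload`, and `roundtrip_ok` are hypothetical helpers, not part of any tool.

```shell
# encode_config FILE: gzip FILE, then base64-encode it on a single line,
# mirroring how the payload is stored under .data in the Secret.
encode_config() {
  gzip -c "$1" | base64 -w0
}

# decode_payload: read base64 text on stdin and print the decompressed config.
decode_payload() {
  base64 -d | gzip -d
}

# roundtrip_ok FILE: succeed only if encoding and then decoding FILE
# yields byte-identical content.
roundtrip_ok() {
  encode_config "$1" | decode_payload | cmp -s - "$1"
}
```

Running `roundtrip_ok alertmanager.yaml && echo "payload is consistent"` before applying the Secret catches a truncated or double-encoded payload early.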
Warning: this approach is fast but risky. An Operator upgrade may overwrite the configuration!
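One way to notice such an overwrite is to check which Secret the workload's `config-volume` currently points at. The sketch below is only an illustration: `mounted_config_secret` is a hypothetical helper that reads a workload manifest in `kubectl ... -o yaml` layout on stdin, and the StatefulSet name in the usage comment is an assumption inferred from the Secret name used earlier.

```shell
# mounted_config_secret: read a workload manifest (YAML) on stdin and print
# the secretName referenced by the volume named "config-volume".
mounted_config_secret() {
  awk '
    /- name: config-volume/ { found = 1; next }   # start of the volume entry
    found && /- name:/      { found = 0 }         # next list entry: stop looking
    found && /secretName:/  { print $2; exit }    # second field is the Secret name
  '
}

# Usage (not executed here; the StatefulSet name is an assumption):
#   kubectl get statefulset alertmanager-rancher-monitoring-alertmanager \
#     -n cattle-monitoring-system -o yaml | mounted_config_secret
```

If the output is no longer `alertmanager-main`, the workload has gone back to the Operator-generated Secret and the hack has to be reapplied.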
The By-the-Book Approach: The Elegant Route
1. Configure the alert receiver and route
# k8s-alarm.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: k8s-alarm
  namespace: test
spec:
  receivers:
    - name: tialert
      webhookConfigs:
        - url: https://your-webhook-url
          sendResolved: true
  route:
    groupBy: [alertname]
    groupInterval: 5m
    groupWait: 30s
    matchers:
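      - name: severity
        value: "warning|critical"
        regex: true  # deprecated in newer Prometheus Operator releases in favor of matchType: "=~"
    receiver: tialert
    repeatInterval: 4h
2. Configure the silencing route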
# null.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: silence-watchdog
  namespace: cattle-monitoring-system
spec:
  receivers:
    - name: null-receiver
  route:
    matchers:
      - name: alertname
        value: "Watchdog"
    receiver: null-receiver
3. Custom alert rules
# app-alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-backend-alerts
  namespace: test
  labels:
    prometheus: rancher-monitoring
    role: alert-rules
spec:
  groups:
  - name: app-backend
    rules:
    - alert: HighRequestRate
      expr: |
        sum(rate(http_requests_total{job="app-backend"}[5m])) by (service) > 100
      for: 10m
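      labels:
        severity: critical
        # Assumption: with the Operator's default matcher strategy, an
        # AlertmanagerConfig only receives alerts whose namespace label equals
        # its own namespace; the sum() above drops that label, so set it here.
        namespace: test
      annotations:
        summary: "High request rate on {{ $labels.service }}"
        description: "Request rate is {{ $value }} per second"
Summary and comparison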
[Comparison table; image not available]
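Recommendation: use the hack for quick validation while debugging; in production, always use the by-the-book approach!
Whether you take the hack or the by-the-book route, the goal is the same: an alerting system that is stable, reliable, and under control. While debugging, a well-chosen shortcut can validate an idea quickly, but never let a temporary fix turn into long-term debt. A real operations expert is not someone who never takes shortcuts, but someone who knows when to turn back and distill what the hack taught them into a by-the-book standard.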
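The two approaches also meet in debugging: the decode step from the hack can be used read-only to see whether the Operator actually merged an AlertmanagerConfig into the generated configuration. The sketch below assumes the decoded file lists receivers as top-level `- name:` entries; `list_receivers` is a hypothetical helper, and the exact naming of Operator-generated receivers may differ between versions.

```shell
# list_receivers: read an Alertmanager configuration (YAML) on stdin and
# print the name of each entry under the top-level "receivers:" key.
list_receivers() {
  awk '
    /^receivers:/ { in_recv = 1; next }   # enter the receivers section
    /^[^ -]/      { in_recv = 0 }         # any other top-level key ends it
    in_recv && /^ ? ?- name:/ {
      sub(/^ ? ?- name:[ ]*/, "")         # keep only the value
      gsub(/"/, "")                       # strip double quotes
      print
    }
  '
}

# Usage (not executed here), reusing the Secret from the hack:
#   kubectl get secret alertmanager-rancher-monitoring-alertmanager-generated \
#     -n cattle-monitoring-system -o jsonpath='{.data.alertmanager\.yaml\.gz}' \
#     | base64 -d | gzip -d | list_receivers
```

If the receiver defined in the AlertmanagerConfig never shows up in this list, the resource was probably not selected by the Alertmanager instance, which narrows the search considerably.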