
Prometheus + Grafana + node_exporter + Alertmanager quick host deployment (k8s recommended for production)


  • Architecture
IP            Service(s)                            Description
10.111.178.62 Prometheus / Grafana / Alertmanager   All run as single instances in Docker for simplicity
10.111.0.111  node_exporter                         Managed by systemd for simplicity (use a k8s DaemonSet in production)
10.111.0.112  node_exporter                         Managed by systemd for simplicity (use a k8s DaemonSet in production)

1. 部署 Prometheus

1.1 Install & configure Prometheus

# @see: https://hub.docker.com/r/prom/prometheus/tags
docker pull docker.io/prom/prometheus:v2.30.3

sudo mkdir -p /etc/prometheus

# ----- prometheus.yml -----
# Note: with `sudo cat <<EOF >file` the redirect runs as the unprivileged shell, so use tee:
sudo tee /etc/prometheus/prometheus.yml >/dev/null <<-'EOF'
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 10.111.178.63:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "emqx-alert-rules.yml"
  - "haproxy-alert-rules.yml"
  - "jvm-alert-rules.yml"
  - "kafka-alert-rules.yml"
  - "node-alert-rules.yml"
  - "prometheus-alert-rules.yml"
  - "redis-alert-rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
    - targets: ['10.111.0.111:9100']
      labels:
        instance: collect-node1
    - targets: ['10.111.0.112:9100']
      labels:
        instance: collect-node2
    - targets: ['10.111.178.79:9100']
      labels:
        instance: web-node2
    - targets: ['10.111.178.78:9100']
      labels:
        instance: web-node3
    - targets: ['10.111.178.60:9100']
      labels:
        instance: compute-node1
    - targets: ['10.111.0.11:9100']
      labels:
        instance: kafka-follower1
    - targets: ['10.111.0.12:9100']
      labels:
        instance: kafka-follower2
    - targets: ['10.29.69.150:9100']
      labels:
        instance: kafka-leader1
    - targets: ['10.111.178.58:9100']
      labels:
        instance: sink-node1
    - targets: ['10.111.178.57:9100']
      labels:
        instance: sink-node2
    - targets: ['10.111.178.70:9100']
      labels:
        instance: repo-node1
    - targets: ['10.111.178.63:9100']
      labels:
        instance: devops-node1
    - targets: ['10.111.178.72:9100']
      labels:
        instance: emr-header-1.cluster-125585
    - targets: ['10.111.178.73:9100']
      labels:
        instance: emr-header-2.cluster-125585
    - targets: ['10.111.178.71:9100']
      labels:
        instance: emr-worker-1.cluster-125585
    - targets: ['10.111.178.74:9100']
      labels:
        instance: emr-worker-2.cluster-125585
    - targets: ['10.111.178.75:9100']
      labels:
        instance: emr-worker-3.cluster-125585
    - targets: ['10.111.178.76:9100']
      labels:
        instance: emr-worker-4.cluster-125585
  - job_name: haproxy
    static_configs:
    - targets: ['10.111.0.111:9101']
      labels:
        instance: collect-node1:9101
    - targets: ['10.111.0.112:9101']
      labels:
        instance: collect-node2:9101
  - job_name: iam-gateway
    metrics_path: /actuator/prometheus
    static_configs:
    - targets: ['10.111.178.79:18086']
      labels:
        instance: web-node2:18086
    - targets: ['10.111.178.78:18086']
      labels:
        instance: web-node3:18086
  - job_name: kafka
    static_configs:
    - targets: ['10.111.0.11:10105']
      labels:
        instance: kafka-follower1:10105
    - targets: ['10.111.0.12:10105']
      labels:
        instance: kafka-follower2:10105
    - targets: ['10.29.69.150:10105']
      labels:
        instance: kafka-leader1:10105
  - job_name: emq
    static_configs:
    - targets: ['10.111.0.111:9091']
      labels:
        instance: collect-node1:9091
    - targets: ['10.111.0.112:9091']
      labels:
        instance: collect-node2:9091
EOF
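The node job above repeats the same three-line stanza for every host, which is easy to get wrong when editing by hand. As a sketch (the map below lists only two of the hosts, and `gen_static_configs` is a hypothetical helper, not part of any tool), the stanzas can be generated from an instance-to-address map:

```shell
#!/usr/bin/env bash
# Sketch: generate the per-host static_configs stanzas of the "node" job
# from an instance -> address map (requires bash 4+ for associative arrays).
set -euo pipefail

declare -A NODE_TARGETS=(
  [collect-node1]="10.111.0.111:9100"
  [collect-node2]="10.111.0.112:9100"
)

# Hypothetical helper: prints one stanza per entry, indented for prometheus.yml.
gen_static_configs() {
  local name
  for name in "${!NODE_TARGETS[@]}"; do
    printf -- "    - targets: ['%s']\n" "${NODE_TARGETS[$name]}"
    printf    "      labels:\n"
    printf    "        instance: %s\n" "$name"
  done
}

gen_static_configs
```

The output can be pasted into scrape_configs, or written to a file_sd_configs JSON/YAML file, which Prometheus reloads without a restart.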


sudo mkdir -p /mnt/disk1/prometheus
sudo chmod -R 777 /mnt/disk1/prometheus
# Note: with --network host the -p 9090:9090 mapping is ignored (Docker only
# prints a warning); the process binds host port 9090 directly.
docker run -d --name=prometheus1 -p 9090:9090 \
-v /etc/prometheus/:/etc/prometheus/ \
-v /mnt/disk1/prometheus/:/prometheus/ \
--network host \
--restart=always prom/prometheus:v2.30.3 \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.console.libraries=/usr/share/prometheus/console_libraries \
--web.console.templates=/usr/share/prometheus/consoles \
--storage.tsdb.retention=180d \
--web.listen-address="0.0.0.0:9090" \
--web.enable-admin-api

# [Optional] Temporary test run; the container is removed automatically on
# exit (useful for watching the startup logs on stdout):
#docker run --rm --name=prometheus1 -p 9090:9090 \
#  -v /etc/prometheus/:/etc/prometheus/ \
#  -v /mnt/disk1/prometheus/:/prometheus/ \
#  prom/prometheus:v2.30.3

Except for --storage.tsdb.retention=180d and --web.enable-admin-api, all of the flags above are the official image's default startup arguments. Because the COMMAND and ARGS passed to docker run ... override the image's default command, they must all be specified explicitly. See the official docs: docker-run#cmd-default-command-or-options
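A retention of 180d only makes sense if the disk can hold it. A rough sizing sketch, using the rule of thumb from the Prometheus storage docs (roughly 1-2 bytes per sample after compression; the ingestion rate below is an assumed figure, measure your own via the rate of prometheus_tsdb_head_samples_appended_total):

```shell
# needed_disk ≈ retention_seconds × ingested_samples_per_second × bytes_per_sample
retention_days=180
samples_per_second=10000   # assumption: substitute your real ingestion rate
bytes_per_sample=2         # conservative end of the ~1-2 bytes/sample rule of thumb
needed_bytes=$(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
echo "~$(( needed_bytes / 1024 / 1024 / 1024 )) GiB for ${retention_days}d retention"
```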

2. 部署 Grafana

Install Grafana

# see: https://hub.docker.com/r/grafana/grafana/tags
docker pull docker.io/grafana/grafana:8.2.2

sudo mkdir -p /mnt/disk1/grafana
sudo chmod -R 777 /mnt/disk1/grafana
docker run -d --name=grafana1 -p 3000:3000 --network host --restart=always -v /mnt/disk1/grafana:/var/lib/grafana docker.io/grafana/grafana:8.2.2


  • Console: http://10.111.178.62:3000/ — log in with the default admin/admin; change the password immediately after first login
  • Add the Prometheus data source

3. 部署 node_exporter

3.1 Install & configure

  • Download & install
Install node_exporter

# ------ Downloading installation -------
cd /tmp   # note: `sudo cd` is a no-op, cd is a shell builtin
sudo curl -OL https://github.91chifun.workers.dev/https://github.com//prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
sudo tar -xf node_exporter*
sudo cp node_exporter*/node_exporter /usr/sbin/

# ------ Make configuration -------
sudo tee /etc/sysconfig/node_exporter >/dev/null <<-'EOF'
OPTIONS="--collector.systemd \
--collector.mdadm \
--collector.tcpstat \
--collector.processes \
--web.listen-address=:9100 \
--web.telemetry-path=/metrics"
EOF


  • Service script (recommended, e.g. for CentOS 7.x)
Configure node_exporter (systemd)

sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<-'EOF'
[Unit]
Description=Node Exporter

[Service]
User=root
Group=root
EnvironmentFile=-/etc/sysconfig/node_exporter
ExecStart=/usr/sbin/node_exporter $OPTIONS
ExecReload=/bin/kill -s HUP $MAINPID
StandardOutput=journal
StandardError=journal
LimitNOFILE=64
LimitNPROC=64
LimitCORE=infinity
TimeoutStartSec=10
TimeoutSec=300
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl restart node_exporter
sudo systemctl status node_exporter
sudo journalctl -u node_exporter -f

# Smoke test
curl http://localhost:9100/metrics
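Beyond eyeballing the full /metrics dump, a single value can be pulled out with awk. The payload below is hard-coded (with a made-up value) so the snippet runs standalone; in practice, pipe `curl -s http://localhost:9100/metrics` into `get_metric` instead:

```shell
# Sample of the Prometheus text exposition format (value is illustrative):
sample='# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 8.252928e+08'

# get_metric <name>: read the exposition format on stdin, print the raw value.
get_metric() {
  awk -v m="$1" '$1 == m { print $2 }'
}

printf '%s\n' "$sample" | get_metric node_memory_MemFree_bytes
```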


  • Service script (optional, e.g. for CentOS 6.x)
Configure node_exporter (init.d)

sudo tee /etc/init.d/node_exporter.sh >/dev/null <<-'EOF'
#!/bin/bash
#/*
# * Copyright 2017 ~ 2025 the original author or authors. 
# *
# * Licensed under the Apache License, Version 2.0 (the "License");
# * you may not use this file except in compliance with the License.
# * You may obtain a copy of the License at
# *
# *      http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */

[ -f /etc/sysconfig/network ] && . /etc/sysconfig/network
[ "$NETWORKING" = "no" ] && exit 0

# Environment definition.
binFile="$(command -v node_exporter)"
if [ -f /etc/sysconfig/node_exporter ]; then
  . /etc/sysconfig/node_exporter
  export NODE_EXPORTER_OPTIONS=$OPTIONS
fi

function start() {
  local pids=$(getPids)
  if [ -z "$pids" ]; then
    nohup $binFile $NODE_EXPORTER_OPTIONS > "/var/log/node_exporter.stdout" 2>&1 &
    echo -n "Starting node exporter ..."
    while true
    do
      pids=$(getPids)
      if [ "$pids" == "" ]; then
        echo -n ".";
        sleep 0.8;
      else
        echo $pids >"/run/node_exporter.pid"
        break
      fi
    done
    echo -e "\nStarted node_exporter on "$pids
  else
    echo "Node exporter process is running "$pids
  fi
}

function stop() {
  local pids=$(getPids)
  if [ -z "$pids" ]; then
    echo "Node exporter not running!"
  else
    echo -n "Stopping node_exporter for $pids ..."
    kill -s TERM $pids
    while true
    do
      pids=$(getPids)
      if [ "$pids" == "" ]; then
        \rm -f /run/node_exporter.pid
        break
      else
        echo -n ".";
        sleep 0.8;
      fi
    done
    echo -e "\nStopped node_exporter !"
  fi
}

function status() {
  ps -ef | grep -v grep | grep $binFile
}

function getPids() {
  local pids=$(ps ax | grep -i "$binFile" | grep -v grep | awk '{print $1}')
  echo $pids # Output execution result value.
  return 0 # Return the execution result code.
}

# --- Main call. ---
CMD=$1
case $CMD in
  status)
    status
    ;;
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    stop
    start
    ;;
  *)
    echo $"Usage: $0 {start|stop|restart|status}"
    exit 2
esac
EOF

sudo chmod +x /etc/init.d/node_exporter.sh
sudo /etc/init.d/node_exporter.sh restart
sudo /etc/init.d/node_exporter.sh status

# Smoke test
curl http://localhost:9100/metrics


3.2 Add dashboards (node_exporter)

  • Recommended (simple) Grafana ID: 13978
  • Recommended (popular) Grafana ID: 8919   -   Node Exporter for Prometheus Dashboard CN
  • Recommended (popular) Grafana ID: 11074   -   Node Exporter for Prometheus Dashboard EN
  • Recommended (popular) Grafana ID: 15172   -   enhanced version of 11074: Node Exporter for Prometheus Dashboard EN, or download1: node_exporter-15172-rev6.json, download2: node_exporter-15172-rev6.json
  • Recommended (popular) Grafana ID: 1860   -   Node Exporter Full
  • Recommended (popular) Grafana ID: 315   -   Kubernetes cluster monitoring (via Prometheus)
  • More dashboards: https://grafana.com/grafana/dashboards

3.3 Screenshots

3.4 Important metrics in production

Metric(s)   Enabling flag   On by default   Why it matters
node_memory_MemFree_bytes, node_memory_Buffers_bytes, node_memory_Cached_bytes, node_memory_SwapCached_bytes, node_memory_VmallocUsed_bytes   --collector.meminfo   yes   Current memory usage; the kernel's allocation and caching behavior is complex, so look beyond MemFree alone
node_netstat_Tcp_CurrEstab   --collector.netstat   yes   Troubleshooting connection counts when responses are delayed or refused under high concurrency
node_tcp_connection_states   --collector.tcpstat   no   Same as above, broken down by TCP connection state
node_processes_threads   --collector.processes   no   [Important] Everyone watches memory/OOM while few watch thread counts; we have personally seen a netty server under high concurrency exhaust pid_max (32768 by default on e.g. CentOS 7.9), so that ssh logins and the HTTP service all failed with "fork: Cannot allocate memory"
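The pid_max incident in the last row is exactly the kind of thing an alert rule should catch before ssh stops working. A hypothetical entry for one of the rule files loaded above (e.g. node-alert-rules.yml); the 80%-of-32768 threshold is illustrative, and the real limit should be read from /proc/sys/kernel/pid_max on each host:

```yaml
groups:
  - name: node-threads
    rules:
      - alert: HostThreadsNearPidMax
        # node_processes_threads requires node_exporter's --collector.processes
        expr: node_processes_threads > 32768 * 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} thread count is nearing pid_max"
          value: "{{ $value }}"
```

The "value" annotation matches what the email.tmpl template below renders in its threshold column.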

3.5 Metric types and use cases:

  • Counter: monotonically increasing values, e.g. event counts
  • Gauge: a current value that can go up or down, e.g. database connections
  • Histogram: samples observations into configurable buckets, e.g. response latency
  • Summary: similar to Histogram, but reports quantiles computed on the client side
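The Counter/Gauge distinction matters most when querying: a raw Counter value is rarely useful on its own; what you graph is its per-second rate. A sketch of what PromQL's rate() computes for a Counter, with made-up sample values:

```shell
# Two samples of a counter taken 60 seconds apart:
v1=1000; t1=0
v2=1600; t2=60
# rate() ≈ increase / elapsed seconds (real PromQL also handles counter
# resets, which this sketch ignores)
rate=$(( (v2 - v1) / (t2 - t1) ))
echo "rate: ${rate}/s"
```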

4. Deploy Alertmanager

Configure /etc/alertmanager/alertmanager.yml

sudo mkdir -p /etc/alertmanager

sudo tee /etc/alertmanager/alertmanager.yml >/dev/null <<-'EOF'
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'you_email_user@163.com'
  smtp_auth_username: 'you_email_user@163.com'
  smtp_auth_password: 'you_email_password'
  smtp_require_tls: false # workaround for "smtp 454 Command not permitted when TLS active"
  smtp_hello: '163.com'
  #wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'

templates:
  - 'template/*.tmpl'

route:
  group_by: ['alertname'] # how alerts are grouped
  group_wait: 10s # how long to wait before sending the first notification for a new group
  group_interval: 10s # wait before notifying about new alerts added to a group
  repeat_interval: 1m # how often to resend a still-firing alert; don't set this too low for email or the SMTP server may reject you
  receiver: 'email' # must match a name under receivers below

receivers:
  - name: 'email'
    email_configs:
      - to: '983708408@qq.com'
        html: '{{ template "email.html" . }}' # must match the name defined in email.tmpl
        headers: { Subject: "[WARN] Alert mail" }
  #- name: 'webhook'
  #  webhook_configs:
  #    - url: 'http://127.0.0.1:5001'
  #      send_resolved: true
  #- name: 'wechat'
  #  wechat_configs:
  #    - send_resolved: true
  #      to_party: '1' # receiving department ID
  #      agent_id: '1000002' # WeCom -> custom app -> AgentId
  #      corp_id: '******' # company info (My Company -> CorpId, at the bottom)
  #      api_secret: '******' # WeCom -> custom app -> Secret
  #      message: '{{ template "test_wechat.html" . }}'

# Inhibition rules
inhibit_rules:
  - source_match: # when an alert matching these source labels fires, suppress alerts matching target_match below
      severity: 'critical'
      status: 'High'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF


  • Configure email.tmpl
sudo mkdir -p /etc/alertmanager/template
sudo tee /etc/alertmanager/template/email.tmpl >/dev/null <<-'EOF'
{{ define "email.html" }}
<table border="1">
        <tr>
                <td>Alert</td>
                <td>Instance</td>
                <td>Threshold</td>
                <td>Started at</td>
        </tr>
        {{ range $i, $alert := .Alerts }}
                <tr>
                        <td>{{ index $alert.Labels "alertname" }}</td>
                        <td>{{ index $alert.Labels "instance" }}</td>
                        <td>{{ index $alert.Annotations "value" }}</td>
                        <td>{{ $alert.StartsAt }}</td>
                </tr>
        {{ end }}
</table>
{{ end }}
EOF
  • Deploy the Alertmanager server
# @see: https://hub.docker.com/r/prom/alertmanager/tags
docker pull docker.io/prom/alertmanager:v0.23.0

sudo mkdir -p /mnt/disk1/alertmanager
sudo chmod -R 777 /mnt/disk1/alertmanager
docker run -d --name=alertmanager1 -p 9093:9093 -v /etc/alertmanager/:/etc/alertmanager/ -v /mnt/disk1/alertmanager/:/alertmanager/ --network host --restart=always prom/alertmanager:v0.23.0
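To verify the mail pipeline end to end without waiting for a real incident, an alert can be injected by hand through Alertmanager's v1 API (still present, though deprecated, in v0.23). The labels below are made up; the snippet only builds and prints the payload so it runs standalone, with the actual POST left commented out:

```shell
# Hand-made test alert payload (labels are illustrative):
payload='[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"devops-node1"},"annotations":{"summary":"manual test alert"}}]'
echo "$payload"
# Fire it once Alertmanager is up (it should arrive grouped under alertname=TestAlert):
# curl -XPOST -H 'Content-Type: application/json' -d "$payload" http://10.111.178.62:9093/api/v1/alerts
```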

5. FAQ

5.1 The prometheus container fails to start with an error?

  • ERROR: level=error ts=2021-10-24T13:08:47.353Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied" panic: Unable to create mmap-ed active query log

  • Cause: the official image runs as the nobody user, while the container was started as root on the host, so the container cannot write to the mounted directory. A quick fix is to chmod 777 the mounted host directory (/mnt/disk1/prometheus).

  • Reference: "Prometheus monitoring a k8s cluster with email and WeChat alerting"

5.2 How to delete Prometheus time-series data?

Deleting series requires the admin API to be enabled at startup:
./prometheus --storage.tsdb.retention=180d --web.enable-admin-api
  • Delete all data of metric mysql_global_status_threads_running for the mysql instance rds-node1:3066:
curl -X POST -g 'http://10.111.178.62:9090/api/v1/admin/tsdb/delete_series?match[]=up&match[]=mysql_global_status_threads_running{instance="rds-node1:3066",job="mysql"}'
  • Delete data of metric mysql_global_status_threads_running for instance rds-node1:3066 within the time range 1557903714-1557903954:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?start=1557903714&end=1557903954&match[]=up&match[]=mysql_global_status_threads_running{instance="rds-node1:3066",job="mysql"}'
  • Delete all metrics of instance rds-node1:3066 within the time range 1557903714-1557903954:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?start=1557903714&end=1557903954&match[]=up&match[]={instance="rds-node1:3066",job="mysql"}'
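Worth knowing: delete_series only writes tombstones; the disk space is reclaimed at the next compaction, or immediately via the clean_tombstones admin endpoint. A small sketch (the host below matches this deployment; both endpoints need --web.enable-admin-api):

```shell
PROM="http://10.111.178.62:9090"
# Mark series for deletion (tombstones only, no space freed yet):
# curl -X POST -g "${PROM}/api/v1/admin/tsdb/delete_series?match[]={instance=\"rds-node1:3066\",job=\"mysql\"}"
# Actually reclaim the disk space:
# curl -X POST "${PROM}/api/v1/admin/tsdb/clean_tombstones"
echo "cleanup endpoint: ${PROM}/api/v1/admin/tsdb/clean_tombstones"
```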

5.3 How to extend a Grafana dashboard with custom panels?

-> Left main menu (Dashboard)
-> Manage
-> open an existing dashboard
-> Dashboard settings (imported open-source templates are read-only by default)
-> Make Editable
-> back arrow at the top left (the whole dashboard is now editable)
-> Add panel (top right)
-> Add an empty panel
-> enter the metric expr to display in the Metrics browser (e.g. node_memory_MemFree_bytes{job="node",instance="$instance"})
-> Legend (set a display alias)
-> pick a chart type from the selector at the top right; Graph (old) is commonly used
-> Apply (preview)

5.4 Errors when uploading a custom dashboard to share on grafana.com?

5.5 Forgot the Grafana admin password — how to reset it?

  • Option 1:
docker exec -it grafana1 /bin/sh
grafana-cli admin reset-admin-password 123456
  • Option 2:
# Locate the SQLite db file
find / -name grafana.db
sqlite3 /var/lib/grafana/grafana.db
# Inspect the user table
select * from user;
# Reset the password to admin
update user set password = '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', salt = 'F3FAxVm33R' where login = 'admin';

