Prometheus + Grafana + node_exporter + Alertmanager quick host deployment (k8s is recommended for production)
- Architecture
IP | Services | Description |
---|---|---|
10.111.178.62 | Prometheus / Grafana / Alertmanager | For simplicity, all run single-node in docker bridge port-forward mode |
10.111.0.111 | node_exporter | For simplicity, managed by systemd (use a k8s DaemonSet in production) |
10.111.0.112 | node_exporter | For simplicity, managed by systemd (use a k8s DaemonSet in production) |
1. Deploying Prometheus
1.1 Install and configure Prometheus
# @see: https://hub.docker.com/r/prom/prometheus/tags
docker pull docker.io/prom/prometheus:v2.30.3
sudo mkdir -p /etc/prometheus
# ----- prometheus.yml -----
sudo tee /etc/prometheus/prometheus.yml >/dev/null <<-'EOF'
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.111.178.62:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "emqx-alert-rules.yml"
  - "haproxy-alert-rules.yml"
  - "jvm-alert-rules.yml"
  - "kafka-alert-rules.yml"
  - "node-alert-rules.yml"
  - "prometheus-alert-rules.yml"
  - "redis-alert-rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ['10.111.0.111:9100']
        labels:
          instance: collect-node1
      - targets: ['10.111.0.112:9100']
        labels:
          instance: collect-node2
      - targets: ['10.111.178.79:9100']
        labels:
          instance: web-node2
      - targets: ['10.111.178.78:9100']
        labels:
          instance: web-node3
      - targets: ['10.111.178.60:9100']
        labels:
          instance: compute-node1
      - targets: ['10.111.0.11:9100']
        labels:
          instance: kafka-follower1
      - targets: ['10.111.0.12:9100']
        labels:
          instance: kafka-follower2
      - targets: ['10.29.69.150:9100']
        labels:
          instance: kafka-leader1
      - targets: ['10.111.178.58:9100']
        labels:
          instance: sink-node1
      - targets: ['10.111.178.57:9100']
        labels:
          instance: sink-node2
      - targets: ['10.111.178.70:9100']
        labels:
          instance: repo-node1
      - targets: ['10.111.178.63:9100']
        labels:
          instance: devops-node1
      - targets: ['10.111.178.72:9100']
        labels:
          instance: emr-header-1.cluster-125585
      - targets: ['10.111.178.73:9100']
        labels:
          instance: emr-header-2.cluster-125585
      - targets: ['10.111.178.71:9100']
        labels:
          instance: emr-worker-1.cluster-125585
      - targets: ['10.111.178.74:9100']
        labels:
          instance: emr-worker-2.cluster-125585
      - targets: ['10.111.178.75:9100']
        labels:
          instance: emr-worker-3.cluster-125585
      - targets: ['10.111.178.76:9100']
        labels:
          instance: emr-worker-4.cluster-125585
  - job_name: haproxy
    static_configs:
      - targets: ['10.111.0.111:9101']
        labels:
          instance: collect-node1:9101
      - targets: ['10.111.0.112:9101']
        labels:
          instance: collect-node2:9101
  - job_name: iam-gateway
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['10.111.178.79:18086']
        labels:
          instance: web-node2:18086
      - targets: ['10.111.178.78:18086']
        labels:
          instance: web-node3:18086
  - job_name: kafka
    static_configs:
      - targets: ['10.111.0.11:10105']
        labels:
          instance: kafka-follower1:10105
      - targets: ['10.111.0.12:10105']
        labels:
          instance: kafka-follower2:10105
      - targets: ['10.29.69.150:10105']
        labels:
          instance: kafka-leader1:10105
  - job_name: emq
    static_configs:
      - targets: ['10.111.0.111:9091']
        labels:
          instance: collect-node1:9091
      - targets: ['10.111.0.112:9091']
        labels:
          instance: collect-node2:9091
EOF
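Optionally, sanity-check the configuration with promtool, which ships inside the same image (note it also tries to load the rule files referenced above, so create those first; a minimal example follows below):
docker run --rm -v /etc/prometheus/:/etc/prometheus/ \
  --entrypoint /bin/promtool prom/prometheus:v2.30.3 \
  check config /etc/prometheus/prometheus.yml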
- For more alert rules, see: github.com/wl4g/prometheus-integration/tree/master/prometheus (a minimal example follows below)
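As an illustration of one of the rule files referenced above, a minimal hypothetical node-alert-rules.yml could look like this (the alert name and 90% threshold are illustrative choices, not from the original setup):
sudo tee /etc/prometheus/node-alert-rules.yml >/dev/null <<-'EOF'
groups:
  - name: node
    rules:
      - alert: NodeMemoryUsageHigh
        # Fire when memory usage stays above 90% for 5 minutes (illustrative threshold)
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          value: "{{ $value }}"
          summary: "Memory usage of {{ $labels.instance }} is above 90%"
EOF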
1.2 Start the Prometheus server
sudo mkdir -p /mnt/disk1/prometheus
sudo chmod -R 777 /mnt/disk1/prometheus
docker run -d --name=prometheus1 -p 9090:9090 \
-v /etc/prometheus/:/etc/prometheus/ \
-v /mnt/disk1/prometheus/:/prometheus/ \
--network host \
--restart=always prom/prometheus:v2.30.3 \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.console.libraries=/usr/share/prometheus/console_libraries \
--web.console.templates=/usr/share/prometheus/consoles \
--storage.tsdb.retention=180d \
--web.listen-address="0.0.0.0:9090" \
--web.enable-admin-api
# [Optional] Temporary test run; the container is removed automatically on stop
# (useful for watching the startup logs on stdout)
#docker run --rm --name=prometheus1 -p 9090:9090 \
#  -v /etc/prometheus/:/etc/prometheus/ \
#  -v /mnt/disk1/prometheus/:/prometheus/ \
#  prom/prometheus:v2.30.3
Apart from --storage.tsdb.retention=180d and --web.enable-admin-api, all of the arguments above are the official container's default startup arguments. Because the COMMAND and ARGS passed to docker run ... override the image's default command, they must all be specified explicitly. See the official docs: docker-run#cmd-default-command-or-options
- Web console: http://10.111.178.62:9090/
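Once it is up, a quick way to confirm the server is healthy and to tally scrape-target states (host and port as above; the grep is just a rough count):
curl -s http://10.111.178.62:9090/-/healthy
curl -s 'http://10.111.178.62:9090/api/v1/targets' | grep -o '"health":"[a-z]*"' | sort | uniq -c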
2. Deploying Grafana
Install Grafana
# see: https://hub.docker.com/r/grafana/grafana/tags
docker pull docker.io/grafana/grafana:8.2.2
sudo mkdir -p /mnt/disk1/grafana
sudo chmod -R 777 /mnt/disk1/grafana
docker run -d --name=grafana1 -p 3000:3000 --network host --restart=always \
  -v /mnt/disk1/grafana:/var/lib/grafana docker.io/grafana/grafana:8.2.2
- Web console: http://10.111.178.62:3000/
  Log in as admin; the initial password can be any non-empty string. It is recommended to change the password immediately after the first login.
- Add the Prometheus data source (a scripted alternative follows below)
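If you prefer not to click through the UI, the data source can also be created via Grafana's HTTP API; a sketch, assuming you replace <your-password> with the admin password you set:
curl -s -X POST -H 'Content-Type: application/json' -u 'admin:<your-password>' \
  http://10.111.178.62:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://10.111.178.62:9090","access":"proxy","isDefault":true}'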
3. Deploying node_exporter
- More officially recommended exporters: https://prometheus.io/docs/instrumenting/exporters/
3.1 Install and configure
- Download and install
Install node_exporter
# ------ Download and install -------
cd /tmp
sudo curl -OL https://github.91chifun.workers.dev/https://github.com//prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
sudo tar -xf node_exporter*
sudo cp node_exporter*/node_exporter /usr/sbin/

# ------ Write configuration -------
sudo tee /etc/sysconfig/node_exporter >/dev/null <<-'EOF'
OPTIONS="--collector.systemd \
--collector.mdadm \
--collector.tcpstat \
--collector.processes \
--web.listen-address=:9100 \
--web.telemetry-path=/metrics"
EOF
- Configure the service (recommended; for e.g. CentOS 7.x)
Configure node_exporter (systemd)
sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<-'EOF'
[Unit]
Description=Node Exporter

[Service]
User=root
Group=root
EnvironmentFile=-/etc/sysconfig/node_exporter
ExecStart=/usr/sbin/node_exporter $OPTIONS
ExecReload=/bin/kill -s HUP $MAINPID
StandardOutput=journal
StandardError=journal
LimitNOFILE=64
LimitNPROC=64
LimitCORE=infinity
TimeoutStartSec=10
TimeoutSec=300
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl restart node_exporter
sudo systemctl status node_exporter
sudo journalctl -u node_exporter -f

# Test access
curl http://localhost:9100/metrics
- Configure the service script (optional; for e.g. CentOS 6.x)
Configure node_exporter (init.d)
sudo tee /etc/init.d/node_exporter.sh >/dev/null <<-'EOF'
#!/bin/bash
#/*
# * Copyright 2017 ~ 2025 the original author or authors.
# *
# * Licensed under the Apache License, Version 2.0 (the "License");
# * you may not use this file except in compliance with the License.
# * You may obtain a copy of the License at
# *
# *      http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */
[ -f /etc/sysconfig/network ] && . /etc/sysconfig/network
[ "$NETWORKING" = "no" ] && exit 0

# Environment definition.
binFile="$(command -v node_exporter)"
if [ -f /etc/sysconfig/node_exporter ]; then
  . /etc/sysconfig/node_exporter
  export NODE_EXPORTER_OPTIONS=$OPTIONS
fi

function start() {
  local pids=$(getPids)
  if [ -z "$pids" ]; then
    nohup $binFile $NODE_EXPORTER_OPTIONS > "/var/log/node_exporter.stdout" 2>&1 &
    echo -n "Starting node_exporter ..."
    while true; do
      pids=$(getPids)
      if [ "$pids" == "" ]; then
        echo -n "."; sleep 0.8
      else
        echo $pids >"/run/node_exporter.pid"
        break
      fi
    done
    echo -e "\nStarted node_exporter on "$pids
  else
    echo "node_exporter process is running "$pids
  fi
}

function stop() {
  local pids=$(getPids)
  if [ -z "$pids" ]; then
    echo "node_exporter not running!"
  else
    echo -n "Stopping node_exporter for $pids ..."
    kill -s TERM $pids
    while true; do
      pids=$(getPids)
      if [ "$pids" == "" ]; then
        \rm -f /run/node_exporter.pid
        break
      else
        echo -n "."; sleep 0.8
      fi
    done
    echo -e "\nStopped node_exporter !"
  fi
}

function status() {
  ps -ef | grep -v grep | grep $binFile
}

function getPids() {
  local pids=$(ps ax | grep -i "$binFile" | grep -v grep | awk '{print $1}')
  echo $pids # Output execution result value.
  return 0   # Return the execution result code.
}

# --- Main call. ---
CMD=$1
case $CMD in
  status)
    status
    ;;
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    stop
    start
    ;;
  *)
    echo $"Usage: {start|stop|restart|status}"
    exit 2
esac
EOF
sudo chmod +x /etc/init.d/node_exporter.sh
sudo /etc/init.d/node_exporter.sh restart
sudo /etc/init.d/node_exporter.sh status

# Test access
curl http://localhost:9100/metrics
3.2 Add dashboards (node_exporter)
- Recommended (simple) Grafana ID: 13978
- Recommended (popular) Grafana ID: 8919 - 1 Node Exporter for Prometheus Dashboard CN
- Recommended (popular) Grafana ID: 11074 - 1 Node Exporter for Prometheus Dashboard EN
- Recommended (popular) Grafana ID: 15172 - 1 Node Exporter for Prometheus Dashboard EN (enhanced version based on 11074); or download: node_exporter-15172-rev6.json
- Recommended (popular) Grafana ID: 1860 - Node Exporter Full
- Recommended (popular) Grafana ID: 315 - Kubernetes cluster monitoring (via Prometheus)
- More dashboards: https://grafana.com/grafana/dashboards
3.3 Screenshots
3.4 Important metrics in production
Metric(s) | Enabling flag | Enabled by default | Significance |
---|---|---|---|
node_memory_MemFree_bytes, node_memory_Buffers_bytes, node_memory_Cached_bytes, node_memory_SwapCached_bytes, node_memory_VmallocUsed_bytes | --collector.meminfo | yes | Current memory usage; the kernel's memory allocation mechanisms are complex, so read these together |
node_netstat_Tcp_CurrEstab | --collector.netstat | yes | Diagnosing connection counts when responses are slow or connections are refused under high concurrency |
node_tcp_connection_states | --collector.tcpstat | no | Diagnosing connection counts when responses are slow or connections are refused under high concurrency |
node_processes_threads | --collector.processes | no | [Important] OOM and other memory issues usually get the attention while thread counts are overlooked; we have personally seen a Netty server under high concurrency exhaust pid_max (default 32768 on e.g. CentOS 7.9), leaving ssh unable to log in and the HTTP service failing with fork: Cannot allocate memory (see the commands after this table) |
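To catch the pid_max exhaustion described in the last row before it bites, compare the kernel limit with the live thread count; raising pid_max is a common mitigation (the value below is illustrative):
# Kernel limit vs. current total number of threads
cat /proc/sys/kernel/pid_max
ps -eLf | wc -l
# Raise the limit (illustrative value; add kernel.pid_max to /etc/sysctl.conf to persist)
sudo sysctl -w kernel.pid_max=4194304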
3.5 Metric types and use cases:
- Counter: monotonically increasing values, such as event counts
- Gauge: a current state that can go up or down, such as database connection count
- Histogram: samples observations into buckets, such as response latency
- Summary: similar to Histogram, but quantiles are computed on the client side (see the example queries below)
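In practice the metric type dictates the query shape: Counters are almost always wrapped in rate(), and Histogram buckets feed histogram_quantile(). Two illustrative queries against the HTTP API (the latency metric name is a generic example, not one scraped by this deployment):
# Counter: per-second sample-ingest rate over the last 5 minutes
curl -s 'http://10.111.178.62:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])'
# Histogram: 95th-percentile latency computed from bucket rates
curl -s 'http://10.111.178.62:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'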
4. Deploying Alertmanager
Configure /etc/alertmanager/alertmanager.yml
sudo mkdir -p /etc/alertmanager
sudo tee /etc/alertmanager/alertmanager.yml >/dev/null <<-'EOF'
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'you_email_user@163.com'
  smtp_auth_username: 'you_email_user@163.com'
  smtp_auth_password: 'you_email_password'
  smtp_require_tls: false # see: smtp 454 Command not permitted when TLS active
  smtp_hello: '163.com'
  #wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
templates:
  - 'template/*.tmpl'
route:
  group_by: ['alertname'] # Label to group alerts by
  group_wait: 10s         # How long to wait before sending the first notification for a new group
  group_interval: 10s     # How long to wait before notifying about alerts added to an existing group
  repeat_interval: 1m     # How often to re-send an alert; do not set this too low for email, or the SMTP server may refuse to relay
  receiver: 'email'       # Must match one of the receivers below
receivers:
  - name: 'email'
    email_configs:
      - to: '983708408@qq.com'
        html: '{{ template "email.html" . }}' # Must match the name defined in email.tmpl below
        headers: { Subject: "[WARN] Alert mail" }
  #- name: 'webhook'
  #  webhook_configs:
  #    - url: 'http://127.0.0.1:5001'
  #      send_resolved: true
  #- name: 'wechat'
  #  wechat_configs:
  #    - send_resolved: true
  #      to_party: '1'        # Receiving department ID
  #      agent_id: '1000002'  # WeChat Work -> custom app -> AgentId
  #      corp_id: '******'    # Company info (My Company -> CorpId, at the bottom of the page)
  #      api_secret: '******' # WeChat Work -> custom app -> Secret
  #      message: '{{ template "test_wechat.html" . }}'
# Inhibition rules
inhibit_rules:
  - source_match: # While an alert matching the source labels is firing, suppress alerts matching the target labels
      severity: 'critical'
      status: 'High'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF
- Configure email.tmpl
sudo mkdir -p /etc/alertmanager/template
sudo tee /etc/alertmanager/template/email.tmpl >/dev/null <<-'EOF'
{{ define "email.html" }}
<table border="1">
<tr>
<td>Alert</td>
<td>Instance</td>
<td>Threshold</td>
<td>Start time</td>
</tr>
{{ range $i, $alert := .Alerts }}
<tr>
<td>{{ index $alert.Labels "alertname" }}</td>
<td>{{ index $alert.Labels "instance" }}</td>
<td>{{ index $alert.Annotations "value" }}</td>
<td>{{ $alert.StartsAt }}</td>
</tr>
{{ end }}
</table>
{{ end }}
EOF
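Before starting the server, the whole configuration (including the template) can be validated with amtool from the same image; a quick sanity check assuming the paths above:
docker run --rm -v /etc/alertmanager/:/etc/alertmanager/ \
  --entrypoint /bin/amtool prom/alertmanager:v0.23.0 \
  check-config /etc/alertmanager/alertmanager.yml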
- Deploy the Alertmanager server
# @see: https://hub.docker.com/r/prom/alertmanager/tags
docker pull docker.io/prom/alertmanager:v0.23.0
docker run -d --name=alertmanager1 -p 9093:9093 \
  -v /etc/alertmanager/:/etc/alertmanager/ \
  -v /mnt/disk1/alertmanager/:/alertmanager/ \
  --network host --restart=always prom/alertmanager:v0.23.0
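To verify routing and email delivery end-to-end without waiting for a real alert, you can push a synthetic alert into Alertmanager's v2 API (the labels are arbitrary examples):
curl -s -X POST -H 'Content-Type: application/json' \
  http://10.111.178.62:9093/api/v2/alerts \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"test-node"}}]'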
- Web console: http://10.111.178.62:9093/
- Alertmanager screenshot
5. FAQ
5.1 The Prometheus container fails to start with an error?
- ERROR:
level=error ts=2021-10-24T13:08:47.353Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied" panic: Unable to create mmap-ed active query log
- Cause: the official Prometheus image runs as the nobody user, while the container was started by root on the host, so the process has no write permission on the mounted directory. The simple fix is to chmod the mounted host directory (/mnt/disk1/prometheus) to 777.
5.2 How to delete Prometheus time-series data?
- Official docs: https://prometheus.io/docs/prometheus/latest/querying/api/#delete-series
- First make sure Prometheus was started with --web.enable-admin-api, e.g.:
./prometheus --storage.tsdb.retention=180d --web.enable-admin-api
- Delete all data of the metric mysql_global_status_threads_running for MySQL instance rds-node1:3066:
curl -X POST -g 'http://10.111.178.62:9090/api/v1/admin/tsdb/delete_series?match[]=up&match[]=mysql_global_status_threads_running{instance="rds-node1:3066",job="mysql"}'
- Delete data of the metric mysql_global_status_threads_running for MySQL instance rds-node1:3066 between 1557903714 and 1557903954:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?start=1557903714&end=1557903954&match[]=up&match[]=mysql_global_status_threads_running{instance="rds-node1:3066",job="mysql"}'
- Delete all metrics of MySQL instance rds-node1:3066 between 1557903714 and 1557903954:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?start=1557903714&end=1557903954&match[]=up&match[]={instance="rds-node1:3066",job="mysql"}'
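Note that delete_series only marks the data with tombstones; to actually reclaim disk space, follow up with the clean_tombstones admin endpoint:
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones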
5.3 How to extend a Grafana dashboard with custom panels?
-> Left main menu (Dashboard)
-> Manage
-> Select an existing dashboard (open it)
-> Dashboard settings (imported open-source templates are read-only by default)
-> Make Editable
-> Back (top left; the whole dashboard is now editable)
-> Add panel (top right)
-> Add an empty panel
-> In the Metrics browser, enter the metric expr to display (e.g.: node_memory_MemFree_bytes{job="node",instance="$instance"})
-> Legend (set a display alias)
-> Pick a chart type from the search at the top right; Graph (old) is commonly used
-> Apply (preview)
5.4 Upload fails when sharing a custom dashboard to grafana.com?
- ERROR: Dashboards from Grafana v3.0 or earlier cannot be uploaded.
- The exported dashboard JSON format differs before and after Grafana 3.1: versions above 3.1 add an __inputs section, and Export for sharing externally must be ticked when exporting. Importing a newer dashboard into an older local Grafana remains compatible; see the official note: https://grafana.com/docs/grafana/v7.5/dashboards/export-import/#note
- Upload steps:
- a. Sign in to your account home page: https://grafana.com/auth/sign-in?plcmt=top-nav&cta=myaccount
- b. Navigate: ORG SETTINGS -> My Dashboards -> Upload Dashboards
5.5 How to reset a forgotten Grafana admin password?
- Option 1:
docker exec -it grafana1 /bin/sh
grafana-cli admin reset-admin-password 123456
- Option 2:
# Locate the sqlite db file
find / -name grafana.db
sqlite3 /var/lib/grafana/grafana.db
# Inspect the user table
select * from user;
# Change the password to admin
update user set password = '59acf18b94d7eb0694c61e60ce44c110c7a683ac6a8f09580d626f90f4a242000746579358d77dd9e570e83fa24faa88a8a6', salt = 'F3FAxVm33R' where login = 'admin';
6. References
- Reference 1: node_exporter metrics enabled by default (https://github.com/prometheus/node_exporter#enabled-by-default)
- Reference 2: Extended analysis of node_exporter metrics
- Reference 3: Source-code analysis of the Prometheus client_golang library
- Reference 4: One-click backup and export of all production Grafana dashboards
- Reference 5: Ways to share and embed Grafana dashboards into external systems
- Reference 6: Host monitoring with Prometheus + node_exporter + Grafana
- Reference 7: Integrating Prometheus (rules) + Alertmanager for alerting
- Reference 8: Prometheus + Grafana monitoring for Kubernetes clusters