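Contents
- Deploying Prometheus with Docker Compose
- 1. Prepare the environment
- 2. Prepare the configuration files
- 3. Write the Docker Compose file
- 4. Start the services
- 5. Verify the deployment
- 6. Common operations
- 7. Recommendations for production hardening
- 8. Monitoring additional targets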
Deploying Prometheus with Docker Compose
1. Prepare the environment
- Install Docker (version ≥ 20.10) and Docker Compose (version ≥ 1.29)
- Create the project directory:
    mkdir prometheus && cd prometheus
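The version requirements above can be checked with a small script. This is a minimal sketch, not an official tool: the helper names `parse_version`, `meets_minimum`, and `installed_version` are hypothetical, and it assumes the `docker` and `docker-compose` binaries print a dotted version number when called with `--version`.

```python
import re
import subprocess


def parse_version(text):
    """Return the first dotted number in text as a tuple, e.g. '20.10.7' -> (20, 10, 7)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(text, minimum):
    """True if the version found in text is at least the minimum tuple."""
    return parse_version(text) >= tuple(minimum)


def installed_version(command):
    """Run '<command> --version' and return its raw output (assumes the binary is on PATH)."""
    result = subprocess.run([command, "--version"], capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For example, `meets_minimum(installed_version("docker"), (20, 10))` should be true on a host that satisfies the Docker requirement, and `meets_minimum(installed_version("docker-compose"), (1, 29))` does the same for Docker Compose.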
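2. Prepare the configuration files
- Create the Prometheus configuration file prometheus.yml (basic configuration):
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - /etc/prometheus/alerts.yml   # path inside the container; remove this entry if you skip the optional alert rules
    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]   # monitor Prometheus itself
      # Example: add Node Exporter (must be deployed separately)
      # - job_name: "node"
      #   static_configs:
      #     - targets: ["node-exporter:9100"]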
- Create the alert rule files (optional). Prometheus only loads rule files that are listed under rule_files in prometheus.yml and are readable inside the container, so mount every file you use.
alerts.yml:
    groups:
      - name: example
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
linux_rules.yml:
    groups:
      - name: linux-system-rules
        rules:
          # CPU rules
          - alert: HighCpuLoad
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU load on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}% for last 10 minutes"
          # Memory rules
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 5   # threshold lowered for testing so the alert fires; raise it for production
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}% for last 10 minutes"
          # Swap rules
          - alert: HighSwapUsage
            expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High swap usage on {{ $labels.instance }}"
              description: "Swap usage is {{ $value }}% for last 15 minutes"
          # Disk space rules
          - alert: LowDiskSpace
            expr: (node_filesystem_avail_bytes{mountpoint!~"^(/run|/var/lib/docker).*",fstype!="tmpfs"} / node_filesystem_size_bytes * 100) < 15
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
              description: "Only {{ $value }}% free space left on {{ $labels.mountpoint }}"
          # Disk I/O rules
          - alert: HighDiskIoLoad
            expr: rate(node_disk_io_time_seconds_total[1m]) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk I/O load on {{ $labels.instance }} ({{ $labels.device }})"
              description: "Disk I/O load is {{ $value }}% for last 10 minutes"
          # Network rules
          - alert: HighNetworkErrors
            expr: increase(node_network_receive_errs_total[5m]) > 10 or increase(node_network_transmit_errs_total[5m]) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High network errors on {{ $labels.instance }} ({{ $labels.device }})"
              description: "Network errors detected on interface {{ $labels.device }}"
          # System load rules (on(instance) is needed because the count drops the job label)
          - alert: HighSystemLoad
            expr: node_load5 / on(instance) count by(instance)(node_cpu_seconds_total{mode="system"}) > 1.5
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High system load on {{ $labels.instance }}"
              description: "5-minute load average is {{ $value }} (relative to CPU count)"
          # Node down rules
          - alert: InstanceDown
            expr: up{job="node"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} has been down for more than 5 minutes"
          # File descriptor rules
          - alert: HighFileDescriptorUsage
            expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High file descriptor usage on {{ $labels.instance }}"
              description: "File descriptor usage is {{ $value }}% of maximum"
windows_rules.yml:
    # Metric and label names differ between windows_exporter versions;
    # check them against your exporter's /metrics output before relying on these rules.
    groups:
      - name: windows-system-rules
        rules:
          # CPU rules
          - alert: HighCpuUsageWindows
            expr: 100 - (avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}% for last 10 minutes"
          # Memory rules
          - alert: HighMemoryUsageWindows
            expr: (windows_os_physical_memory_total_bytes - windows_os_physical_memory_free_bytes) / windows_os_physical_memory_total_bytes * 100 > 90
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}% for last 10 minutes"
          # Disk space rules
          - alert: LowDiskSpaceWindows
            expr: (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100) < 95   # threshold raised for testing so the alert fires; lower it for production
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }} ({{ $labels.volume }})"
              description: "Only {{ $value }}% free space left on {{ $labels.volume }}"
          # Disk I/O rules
          - alert: HighDiskIoWindows
            expr: rate(windows_logical_disk_read_seconds_total[5m]) * 100 > 80 or rate(windows_logical_disk_write_seconds_total[5m]) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk I/O on {{ $labels.instance }} ({{ $labels.volume }})"
              description: "Disk I/O utilization is {{ $value }}% for last 10 minutes"
          # Service status rules
          - alert: CriticalServiceDown
            expr: windows_service_status{status!="running"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Critical service down on {{ $labels.instance }}"
              description: "Service {{ $labels.service }} is not running"
          # System boot time rules (fires while uptime is below 5 minutes, i.e. right after a reboot)
          - alert: SystemRebooted
            expr: time() - windows_system_system_up_time < 300
            for: 0m
            labels:
              severity: info
            annotations:
              summary: "System rebooted on {{ $labels.instance }}"
              description: "System was rebooted, uptime is {{ $value }} seconds"
          # Network utilization rules
          - alert: HighNetworkUtilizationWindows
            expr: rate(windows_net_bytes_total[5m]) / windows_net_speed_bits * 8 * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High network utilization on {{ $labels.instance }} ({{ $labels.interface }})"
              description: "Network utilization is {{ $value }}% for last 10 minutes"
          # Process memory leak detection
          - alert: ProcessMemoryLeakWindows
            expr: predict_linear(windows_process_private_bytes[1h], 3600) / 1024 / 1024 / 1024 > 2
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Possible memory leak in {{ $labels.process }} on {{ $labels.instance }}"
              description: "Process {{ $labels.process }} is predicted to exceed 2GB memory in 1 hour"
          # System log error rules
          - alert: SystemLogErrorsWindows
            expr: rate(windows_event_log_errors_total[5m]) > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High system log errors on {{ $labels.instance }}"
              description: "{{ $value }} errors per second in system logs"
linux_recording_rules.yml:
    groups:
      - name: linux-recording-rules
        interval: 1m
        rules:
          # CPU usage (compatible with multiple Node Exporter versions)
          - record: instance:node_cpu_usage:rate5m
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle",job=~".*"}[5m])) * 100)
          # Memory usage (excluding cache and buffers)
          - record: instance:node_memory_usage:ratio
            expr: >
              (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes)
              / node_memory_MemTotal_bytes * 100
          # Disk space usage (filtering out irrelevant mount points)
          - record: instance:node_filesystem_usage:ratio
            expr: >
              (node_filesystem_size_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"}
              - node_filesystem_avail_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"})
              / node_filesystem_size_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"} * 100
          # Network traffic (filtering out virtual interfaces)
          - record: instance:node_network_receive_mbps:rate5m
            expr: sum by(instance)(rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])) * 8 / 1048576
          # System load (normalized by CPU count; on(instance) is needed because the count drops the job label)
          - record: instance:node_load_ratio:rate5m
            expr: node_load5 / on(instance) count by(instance)(node_cpu_seconds_total{mode="system"})
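To make the rules above more concrete, the following sketch reproduces two pieces of their logic in plain Python: the arithmetic behind the HighMemoryUsage expression, and a simplified model of how a `for:` clause delays firing. The function names and sample values are hypothetical, and the state model is only an approximation of Prometheus behaviour (it assumes evenly spaced evaluations and ignores details such as `keep_firing_for`).

```python
def memory_usage_percent(total, free, buffers, cached):
    """Mirror of the HighMemoryUsage expression: used memory as a percentage of total."""
    return (total - free - buffers - cached) / total * 100


def alert_states(results, interval_s, for_s):
    """Return the alert state after each evaluation.

    results    -- one boolean per evaluation: did the expression match?
    interval_s -- seconds between evaluations (evaluation_interval)
    for_s      -- the rule's 'for' duration in seconds
    """
    states = []
    active_since = None
    for index, matched in enumerate(results):
        now = index * interval_s
        if not matched:
            active_since = None          # any miss resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now           # first match: the alert becomes pending
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


GIB = 2 ** 30
# Hypothetical host: 8 GiB total, 1 GiB free, 0.5 GiB buffers, 2.5 GiB cached.
usage = memory_usage_percent(8 * GIB, 1 * GIB, GIB // 2, 5 * GIB // 2)
print(usage)                                    # 50.0
print(alert_states([True] * 5, 15, 60))         # four 'pending', then 'firing'
```

With a 15s evaluation interval and `for: 1m`, the model shows the alert staying pending for four evaluations and firing on the fifth, which is why short spikes do not page anyone.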
3. Write the Docker Compose file
docker-compose.yml:
    version: '3.8'
    services:
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - ./alerts.yml:/etc/prometheus/alerts.yml   # mount the alert rules
          # - ./linux_rules.yml:/etc/prometheus/linux_rules.yml   # mount further rule files the same way
          - prometheus-data:/prometheus   # persist data
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.enable-lifecycle'   # allow hot-reloading the configuration
        ports:
          - "9090:9090"
        restart: unless-stopped
        networks:
          - monitor-net
      # Optional: add Grafana for visualization
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        volumes:
          - grafana-data:/var/lib/grafana
        ports:
          - "3000:3000"
        restart: unless-stopped
        networks:
          - monitor-net
      # Optional: add Node Exporter to monitor the host
      # node-exporter:
      #   image: prom/node-exporter:latest
      #   container_name: node-exporter
      #   restart: unless-stopped
      #   network_mode: host   # host mode is required; scrape it at <host-IP>:9100, since the name node-exporter is not resolvable on monitor-net
      #   pid: host
      #   volumes:
      #     - /:/host:ro,rslave
      #   command:
      #     - '--path.rootfs=/host'
    volumes:
      prometheus-data:
      grafana-data:
    networks:
      monitor-net:
        driver: bridge
4. Start the services
    docker-compose up -d   # start in the background
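Containers started in the background may need a few seconds before Prometheus accepts requests. A minimal sketch of a readiness wait follows; it polls the `/-/ready` management endpoint that Prometheus exposes. The function names, the retry counts, and the base URL used in the usage note are assumptions for illustration.

```python
import http.client
import time
import urllib.request


def is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/-/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/ready", timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP error statuses all mean "not ready yet".
        return False


def wait_until_ready(base_url, attempts=30, delay=1.0, probe=is_ready, sleep=time.sleep):
    """Poll the probe until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:9090")` after `docker-compose up -d` returns True once the server is ready, or False after roughly 30 seconds. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.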
5. Verify the deployment
- Prometheus UI: open http://<server-IP>:9090
  - Check the targets: Status → Targets
  - Query a metric: Graph → enter up to see the state of each target
- Grafana UI (if deployed): http://<server-IP>:3000 (default account admin/admin)
  - Add a Prometheus data source: http://prometheus:9090
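The same `up` check can be run without the UI through the Prometheus HTTP API (`/api/v1/query`). The sketch below is one way to do it with the standard library; the helper names and the base URL in the usage note are assumptions, while the response layout follows the documented instant-query format.

```python
import json
import urllib.parse
import urllib.request


def parse_up(payload):
    """Map 'job/instance' to True (up) or False (down) from an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    states = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        key = f"{labels.get('job', '?')}/{labels.get('instance', '?')}"
        # Each sample value is [timestamp, "<value as string>"].
        states[key] = sample["value"][1] == "1"
    return states


def query_up(base_url):
    """Run the instant query 'up' against a Prometheus server and parse the result."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_up(json.load(response))
```

With the basic configuration from section 2, `query_up("http://localhost:9090")` should report a single entry for the `prometheus` job.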
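6. Common operations
- Reload the configuration without restarting (requires the --web.enable-lifecycle flag set in the Compose file):
    curl -X POST http://localhost:9090/-/reload
- View the logs:
    docker-compose logs -f prometheus
- Stop the services:
    docker-compose down
- Back up data: back up the prometheus-data volume (default location: /var/lib/docker/volumes/...)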
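If the reload is triggered from a deployment script rather than by hand, it helps to detect a rejected configuration. The following sketch sends the same POST request as the curl command above; the function names are hypothetical, and it assumes that Prometheus answers a failed reload with an HTTP error status, which `urllib` surfaces as `HTTPError`.

```python
import urllib.error
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_config(base_url, timeout=10):
    """Trigger a configuration reload; return (ok, message)."""
    try:
        with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as response:
            return response.status == 200, "reloaded"
    except urllib.error.HTTPError as error:
        # The response body usually explains why the configuration was rejected.
        return False, error.read().decode("utf-8", errors="replace")
    except OSError as error:
        return False, f"could not reach Prometheus: {error}"
```

A script can then call `reload_config("http://localhost:9090")` and stop the rollout when the first element of the result is False.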
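7. Recommendations for production hardening
- Security hardening:
  - Enable basic authentication for Prometheus with --web.config.file
  - Restrict the Grafana login policy
- Persistence tuning:
    volumes:
      prometheus-data:
        driver_opts:
          type: nfs
          o: addr=<nfs_server>,rw
          device: ":/path/to/nfs"
- Resource limits:
    prometheus:
      deploy:   # docker-compose 1.x applies these limits only when run with --compatibility
        resources:
          limits:
            cpus: '2'
            memory: 4G
- High availability:
  - Run multiple Prometheus instances with Thanos
  - Use an Alertmanager cluster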
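8. Monitoring additional targets
Add the following jobs under scrape_configs in prometheus.yml:
      # Monitor Docker containers
      - job_name: "docker"
        static_configs:
          - targets: ["docker-host:9323"]   # requires the Docker daemon to be configured to expose metrics
      # Monitor MySQL
      - job_name: "mysql"
        static_configs:
          - targets: ["mysql-exporter:9104"]   # requires mysqld-exporter to be deployed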
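After adding jobs and reloading, it is worth confirming that Prometheus actually picked them up. A minimal sketch using the `/api/v1/targets` endpoint is shown here; the helper names and the base URL in the usage note are assumptions, while the fields `activeTargets`, `labels`, `scrapeUrl`, and `health` come from the API response format.

```python
import json
import urllib.request


def summarize_targets(payload):
    """Group active targets by job as lists of (scrape URL, health) pairs."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        summary.setdefault(job, []).append((target["scrapeUrl"], target["health"]))
    return summary


def fetch_targets(base_url):
    """Fetch and summarize the active targets of a Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return summarize_targets(json.load(response))
```

Running `fetch_targets("http://localhost:9090")` after the reload should list the `docker` and `mysql` jobs next to `prometheus`; a health value of `down` usually means the exporter is not reachable from the Prometheus container.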
Note: for the complete set of configuration options, see the official Prometheus documentation.



















