Prometheus Alerting Rules Reference – Notes – 21运维

Reference examples of Prometheus alerting rules (the blog editor originally garbled some comparison operators; they have been corrected below).
Rules are defined in files, and the directories they are loaded from are configured in the prometheus.yaml configuration file, for example:

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
    - "rule/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

Here we place the rule files under /usr/local/prometheus/server/rule/. You can define any number of alerting-rule files; Prometheus evaluates the loaded rules periodically according to the global evaluation_interval setting. Note that changes to the rule files themselves only take effect after a configuration reload (sending SIGHUP to the process, or POSTing to the /-/reload endpoint when --web.enable-lifecycle is enabled); a full restart of the prometheus daemon is not required.
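Before reloading, it is good practice to validate the rule files first. A sketch of the typical workflow, assuming promtool ships alongside your prometheus binary and the server runs on localhost:9090 with --web.enable-lifecycle:

```
# Validate every rule file before asking Prometheus to load it
promtool check rules /usr/local/prometheus/server/rule/*.yml

# Reload the configuration without restarting the daemon, either via signal:
kill -HUP "$(pidof prometheus)"
# ...or via the lifecycle API (requires --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
```

If promtool reports a parse error, the reload is rejected and Prometheus keeps serving the last valid rule set.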

Below are the common rule definitions we use.
Prometheus alerting rules (similar to Zabbix triggers) are explained using the following rule as a reference:

groups:
  - name: tcp_port_check
    rules:
      - alert: tcp_port_check_failed
        for: 5s
        expr: probe_success{job="tcp_port_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "TCP check failed for {{ $labels.app }} in group {{ $labels.group }}; current probe_success value: {{ $value }}"
          summary: "Port of application {{ $labels.app }} in group {{ $labels.group }} is unreachable"

(1) name: the rule-group name (visible in the Prometheus web UI under Status → Rules).
(2) alert: the alert name. Notification plugins display it as the title, so the affected business is obvious at a glance; all alerts defined in the rule files can also be seen on the Prometheus Alerts page. Note that an alert name must be a valid metric name, so it cannot contain spaces.
(3) for: how long the expression must stay true before the alert fires (scraping and rule evaluation themselves are governed by scrape_interval and evaluation_interval, not by this field).
(4) expr: the trigger expression, the core of the alert; it defines which metric of which target is compared against which threshold.
(5) labels: extra labels attached to the alert, e.g. the severity, similar to Zabbix severity levels.
(6) annotations: human-readable notes that ease review and later maintenance.
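A rule like the one above can be unit-tested offline with `promtool test rules`. A sketch, assuming the rule file is named check_port_rule.yml and its alert is named tcp_port_check_failed; the group/app label values below are made up for the test:

```yaml
# test_tcp_port_check.yml -- run with: promtool test rules test_tcp_port_check.yml
rule_files:
  - check_port_rule.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # probe_success stays at 0, simulating a failed TCP check
      - series: 'probe_success{job="tcp_port_check", group="web", app="nginx"}'
        values: '0 0 0 0'
    alert_rule_test:
      - eval_time: 1m
        alertname: tcp_port_check_failed
        exp_alerts:
          - exp_labels:
              severity: critical
              job: tcp_port_check
              group: web
              app: nginx
```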

1. Basic Linux resource metrics: CPU, memory, network interfaces, disks, etc. You can also write additional rules of your own with PromQL.

# cat linux_rule.yml
groups:
  - name: linux_alert
    rules:
      - alert: linux_load5_over_5
        for: 5s
        expr: node_load5 > 5
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} load5 over 5; current value: {{ $value }}"
          summary: "linux load5 over 5"

      - alert: node_exporter_down
        for: 5s
        expr: up == 0
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}; current value: {{ $value }}"
          summary: "node_exporter up value equals 0"

      - alert: cpu_usage_over_80_percent_1m
        for: 5s
        expr: 100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))) * on(instance) group_left(nodename) node_uname_info > 80
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}; current value: {{ $value }}"
          summary: "cpu used percent over 80% (1m avg)"

      - alert: memory_usage_over_85_percent
        for: 5m
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{instance!~"172..*"})) * 100 > 85
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}; current value: {{ $value }}"
          summary: "memory used percent over 85%"

      - alert: eth0_input_traffic_over_10M
        for: 3m
        expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance!~"172.1.*|172..*"}[1m]) / 128 / 1024) * on(instance) group_left(nodename) node_uname_info > 10
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}; current value: {{ $value }}"
          summary: "eth0 inbound traffic over 10 Mbit/s"

      - alert: eth0_output_traffic_over_10M
        for: 3m
        expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance!~"172.1.*|175.*"}[1m]) / 128 / 1024) * on(instance) group_left(nodename) node_uname_info > 10
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}; current value: {{ $value }}"
          summary: "eth0 outbound traffic over 10 Mbit/s"

      - alert: disk_usage_over_80_percent
        for: 10m
        expr: (node_filesystem_size_bytes{device=~"/dev/.+"} - node_filesystem_free_bytes{device=~"/dev/.+"}) / node_filesystem_size_bytes{device=~"/dev/.+"} * 100 > 80
        labels:
          severity: critical
        annotations:
          description: "partition {{ $labels.mountpoint }} usage over 80%; current value: {{ $value }}"
          summary: "disk usage over 80%"
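The `* on(instance) group_left(...) node_uname_info` factor used in the CPU and traffic expressions deserves a note: node_uname_info always has the sample value 1, so multiplying by it leaves the result numerically unchanged, while group_left copies a label from it (node_exporter exposes the host name as the nodename label) onto the resulting alert series. A minimal illustration of the pattern:

```
# Numerically identical to node_load5, but each series now also
# carries the nodename label taken from node_uname_info.
node_load5 * on(instance) group_left(nodename) node_uname_info
```

The division by `128 / 1024` in the traffic rules is equivalent to `* 8 / 1024 / 1024`, i.e. it converts bytes per second into megabits per second, so the threshold of 10 means 10 Mbit/s.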

2. ICMP monitoring (mainly used to determine whether a target is online, or whether the network is flapping).

# cat check_icmp_rule.yml
groups:
  - name: icmp_check
    rules:
      - alert: icmp_check_failed
        for: 5s
        expr: probe_success{job="icmp_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "ICMP check failed for {{ $labels.hostname }} in group {{ $labels.group }}; current probe_success value: {{ $value }}"
          summary: "Server {{ $labels.hostname }} in group {{ $labels.group }} is unreachable"
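The probe_success series used here is produced by blackbox_exporter. A minimal sketch of the two pieces behind such an icmp_check job, assuming blackbox_exporter listens on 127.0.0.1:9115; the target address is a placeholder (the two files are shown together for brevity):

```yaml
# blackbox.yml (blackbox_exporter side): an ICMP probe module
modules:
  icmp:
    prober: icmp
    timeout: 5s

# prometheus.yaml (Prometheus side): the scrape job that drives the probe
scrape_configs:
  - job_name: icmp_check
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['192.0.2.10']        # host to ping (placeholder address)
    relabel_configs:
      - source_labels: [__address__]   # move the target into the ?target= parameter
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance         # keep the probed host as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9115    # actually scrape blackbox_exporter itself
```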

3. Port monitoring (checks whether a TCP socket is reachable; the port usually corresponds to a business daemon such as a Go service, MySQL, MongoDB, Redis, etc.).

# cat check_port_rule.yml
groups:
  - name: tcp_port_check
    rules:
      - alert: tcp_port_check_failed
        for: 5s
        expr: probe_success{job="tcp_port_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "TCP check failed for {{ $labels.app }} in group {{ $labels.group }}; current probe_success value: {{ $value }}"
          summary: "Port of application {{ $labels.app }} in group {{ $labels.group }} is unreachable"
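On the blackbox_exporter side, a port check like this uses the TCP prober. A sketch of the module definition, assuming the scrape job passes module: [tcp_connect] and lists targets as host:port pairs (e.g. '192.0.2.10:3306' for MySQL):

```yaml
# blackbox.yml: a TCP connect probe module (sketch)
modules:
  tcp_connect:
    prober: tcp
    timeout: 5s
```

probe_success is 1 when the TCP handshake to the target port completes within the timeout, and 0 otherwise.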

4. URL monitoring
This usually checks whether a URL is accessible: a direct request returning a 200, 301 or 302 status code is taken to mean the service is healthy.

# cat check_url_rule.yml
groups:
  - name: http_url_check
    rules:
      - alert: http_url_check_failed
        for: 5s
        expr: probe_success{job="http_url_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "URL check failed for {{ $labels.app }} in group {{ $labels.group }}; current probe_success value: {{ $value }}"
          summary: "URL endpoint of application {{ $labels.app }} in group {{ $labels.group }} is unreachable"
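The 200/301/302 acceptance described above maps directly to valid_status_codes in a blackbox_exporter HTTP probe module. A sketch; the module name http_2xx is a common convention, not mandated:

```yaml
# blackbox.yml: an HTTP probe module that treats 200/301/302 as success
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200, 301, 302]
      follow_redirects: false   # judge the first response rather than chasing redirects
```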

When reposting, please credit: 21运维 » Prometheus Alerting Rules Reference – Notes