1. Prepare the environment

On the monitoring host, download and run prometheus and alertmanager; on the monitored ES cluster nodes, download and run node_exporter and elasticsearch_exporter.
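
A minimal startup sketch is shown below; the archive names and versions are placeholders, and the elasticsearch_exporter flags assume a release that listens on port 9108 (newer releases default to 9114), so adjust them to the binaries you actually downloaded.

# On each monitored ES node (e.g. 10.3.3.5): OS and Elasticsearch exporters
tar xzf node_exporter-*.linux-amd64.tar.gz && cd node_exporter-*.linux-amd64
./node_exporter &                                           # exposes OS metrics on :9100

tar xzf elasticsearch_exporter-*.tar.gz && cd elasticsearch_exporter-*
./elasticsearch_exporter --es.uri=http://localhost:9200 --web.listen-address=:9108 &

# On the monitoring host (s4): Prometheus and Alertmanager
./prometheus --config.file=prometheus.yml &                 # web UI and API on :9090
./alertmanager --config.file=alertmanager.yml &             # receives alerts on :9093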

2. Modify the configuration

1. Prometheus configuration (prometheus.yml):

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - s4:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   - "alerts/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'S5'
    static_configs:
    - targets: ['10.3.3.5:9100']

  - job_name: 'S6_mariadb'
    static_configs:
    - targets: ['10.3.3.6:9104']

  - job_name: 'S5_elasticsearch'
    scrape_interval: 60s
    scrape_timeout:  30s
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - '10.3.3.5:9108'
      labels:
        service: elasticsearch
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*)\:9108'
        target_label:  'instance'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         '.*\.(.*)\.lan.*'
        target_label:  'environment'
        replacement:   '$1'
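
Before (re)starting Prometheus it is worth validating this file; a small sketch, assuming it is run from the Prometheus root directory on s4:

./promtool check config prometheus.yml      # validates the config and every rule file it references
kill -HUP $(pidof prometheus)               # ask a running Prometheus to re-read the configuration

The Targets page (http://s4:9090/targets) should then show the prometheus, S5, S6_mariadb and S5_elasticsearch jobs as UP.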

2. Alertmanager configuration

alertmanager.yml

global:
  # How long Alertmanager goes without receiving an alert before marking it as resolved; default is 5m
  resolve_timeout: 5m
  # Global SMTP server configuration
  smtp_smarthost: 'smtp.126.com:25'
  smtp_from: 'maple34@126.com'
  smtp_auth_username: 'maple34@126.com'
  smtp_auth_password: '{{password}}'


route:
  group_by: ['alertname']
  # group_wait sets how long to wait so that more related alerts can be collected and sent in one batch;
  # alerts that reach the same group within this window are merged into a single notification to the receiver. Default is 30s.
  group_wait: 10s
  # Interval between notifications for the same group; default is 5m
  group_interval: 20s
  # How long to wait before repeating a notification that has already been sent successfully; default is 4h
  repeat_interval: 1m
  receiver: 'mail-error'
  routes:
  - match:
      level: '严重'
    receiver: 'mail-error'
  - match:
      level: '紧急'
    receiver: 'mail-warning'
    repeat_interval: 10m


receivers:
- name: 'mail-error'
  email_configs:
  - to: 345999369@qq.com

- name: 'mail-warning'
  email_configs:
  - to: maple34@126.com

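The file can be checked with amtool, and a hand-crafted alert can be pushed straight to Alertmanager to verify routing and the SMTP settings; a sketch, assuming Alertmanager on s4:9093 and the v1 API (newer releases use /api/v2/alerts):

./amtool check-config alertmanager.yml
curl -XPOST http://s4:9093/api/v1/alerts -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"RoutingTest","level":"严重"}}]'

The level="严重" label matches the first sub-route above, so the test notification should be delivered through the mail-error receiver.
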
3. Alerting rule configuration

On the Prometheus monitoring host, create a rules directory under the Prometheus root directory to hold the alerting rule files (this is the path referenced by rule_files in prometheus.yml).

[admin@s4 prometheus-2.5.0]$ ls rules/
es_alert.yml os_alert.yml
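
The rule files can be syntax-checked with promtool before Prometheus is reloaded; a minimal sketch run from the same directory:

./promtool check rules rules/os_alert.yml rules/es_alert.yml

Once the check passes, reload Prometheus (SIGHUP, as above) so the new rules are picked up.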

OS liveness and memory usage:

An alert fires when a host is down or its system memory usage exceeds 70%.

os_alert.yml

groups:
- name: OS_alert
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."

  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 70
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 70% (current value is:{{ $value }})"

Alerts fire when various ES cluster metrics are abnormal:

es_alert.yml

groups:
- name: ES_Alert
  rules:
  ##########  Cluster health status: red  ###############
  - alert: Elastic_Cluster_Health_RED
    expr: elasticsearch_cluster_health_status{color="red"}==1 
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}"
      description: "Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}."
 
  ##########  Cluster health status: yellow  ###############
  - alert: Elastic_Cluster_Health_Yellow 
    expr: elasticsearch_cluster_health_status{color="yellow"}==1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: " Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}" 
      description: "Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}."
 
  ##########  ES JVM heap usage above 80%  ###############
  - alert: Elasticsearch_JVM_Heap_Too_High
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.8
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }} heap usage is high "
      description: "The heap in {{ $labels.instance }} is over 80% for 15m."
          
  - alert: Elasticsearch_health_up
    expr: elasticsearch_cluster_health_up !=1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: " ElasticSearch node: {{ $labels.instance }} last scrape of the ElasticSearch cluster health failed"                               
      description: "ElasticSearch node: {{ $labels.instance }} last scrape of the ElasticSearch cluster health failed"
          
          
  - alert: Elasticsearch_Count_of_JVM_GC_Runs
    expr: rate(elasticsearch_jvm_gc_collection_seconds_count{}[5m])>5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: Count of JVM GC runs > 5 per sec and has a value of {{ $value }} "
      description: "ElasticSearch node {{ $labels.instance }}: Count of JVM GC runs > 5 per sec and has a value of {{ $value }}"
          
  - alert: Elasticsearch_GC_Run_Time
    expr: rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])>0.3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: " ElasticSearch node {{ $labels.instance }}: GC run time in seconds > 0.3 sec and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: GC run time in seconds > 0.3 sec and has a value of {{ $value }}"
    
  - alert: Elasticsearch_health_timed_out
    expr: elasticsearch_cluster_health_timed_out>0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: " ElasticSearch node {{ $labels.instance }}: Number of cluster health checks timed out > 0 and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: Number of cluster health checks timed out > 0 and has a value of {{ $value }}"