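Elasticsearch Basics Part 30: Monitoring (Cluster health, node stats, slow logs, Stack Monitoring)

2026-05-19 • Elasticsearch: From Basics to Operations

Elasticsearch Basics Part 30, Monitoring. _cluster, _nodes/stats, _cat, Slow Log, Stack Monitoring, Prometheus, Grafana.

This is part 30 of the 38-part series Elasticsearch: From Basics to Operations. Parts 26 to 29 built the operational skeleton (clusters, shards, snapshots, security); part 30, Monitoring, is the pair of eyes that checks at every moment whether that skeleton is still healthy. This is where an operator gets woken by a PagerDuty alert at 3 a.m., and it is also where real incidents get caught before they happen.

📚 Study notes

This post is a set of study notes that rework the official Elasticsearch 8.x docs (Monitoring, _cat APIs, Slow Log) and the operations guides for Stack Monitoring, Metricbeat, and the Prometheus exporter.

Turning on Kibana Stack Monitoring locally even once makes the material stick much better.

The eyes of a production ES cluster: what to watch

The most common "quiet death" in ES operations tends to follow the same script. Fine until yesterday → 0.1% indexing delay in the early hours → 3-second search responses at 9 a.m. → indexing rejections at 10 a.m. → OOM at 11 a.m. The warning signs were clearly there during the 3 to 4 hours in between, but nobody was looking.

The purpose of monitoring is to see those signals before a human would. The signals worth watching fall into five groups.

Cluster level: green/yellow/red status, number of unassigned shards, pending tasks queue. The top-priority signal for whether the cluster is alive right now.

Node level: JVM heap usage, GC frequency and duration, thread pool queues, disk usage. Signals for which node is about to die.

Query level: slow log, p99 search latency, search rejection count. Signals for the quality users actually feel.

Indexing level: bulk indexing rate, indexing rejection count, segment merge time. Signals for whether data is getting in properly.

Cache and memory level: fielddata cache, query cache hit rate, circuit breaker trip count. Signals for whether memory is in danger.

How to capture these five is what the rest of part 30 is about.

_cluster/health: the top-priority traffic light

The first API you hit, and the one you hit most often. It shows the cluster state within about five seconds.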

curl -X GET "localhost:9200/_cluster/health?pretty"

Fields to look at in the response.

{
  "cluster_name": "production-es",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 4,
  "active_primary_shards": 120,
  "active_shards": 230,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 10,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 95.83
}

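status matters most. It has three colors.

green: every primary shard and every replica shard is assigned. Normal.
yellow: every primary is assigned, but some replicas are unassigned. No data loss, but availability is degraded. Not an immediate danger.
red: some primary shards are unassigned. Some indices cannot be read or written. Needs an immediate response.

If unassigned_shards is not 0, diagnose the cause. Ask _cluster/allocation/explain directly why the shard is unassigned.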

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "orders-2026-05",
  "shard": 2,
  "primary": true
}'

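The response spells out the real reason in text: disk watermark exceeded, not enough nodes, an allocation filter conflict, and so on. Part 26 (Cluster Operations) covered this in depth, so it is worth reading alongside.

If number_of_pending_tasks is not 0, a queue of cluster-state tasks that the master node has not yet processed is building up. It usually means the master is overloaded or a large batch of mapping changes is in progress. Alert when it exceeds 100.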
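As a rough sketch of how these fields might feed an automated check, the Python below classifies a _cluster/health response that has already been parsed into a dict. The function name assess_cluster_health and the pending_limit parameter are illustrative assumptions, not part of any Elasticsearch client; the threshold of 100 is the one suggested in the text.

```python
def assess_cluster_health(health: dict, pending_limit: int = 100) -> list[str]:
    """Return human-readable findings for a parsed _cluster/health body."""
    findings = []
    status = health.get("status")
    if status == "red":
        findings.append("CRITICAL: at least one primary shard is unassigned")
    elif status == "yellow":
        findings.append("WARNING: some replica shards are unassigned")

    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        findings.append(f"{unassigned} unassigned shard(s): ask _cluster/allocation/explain")

    if health.get("number_of_pending_tasks", 0) > pending_limit:
        findings.append("master task queue is backing up")
    return findings


# Subset of the sample response shown earlier.
sample = {"status": "yellow", "unassigned_shards": 10, "number_of_pending_tasks": 0}
for finding in assess_cluster_health(sample):
    print(finding)
```

Returning a list of findings rather than a single severity keeps the check easy to extend when more fields need watching.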

You can also check health at the index level.

curl -X GET "localhost:9200/_cluster/health/orders-2026-05?level=shards&pretty"

The level=shards option shows per-shard status in one call.

_cluster/stats: cluster-wide statistics

If _cluster/health answers "is it alive right now?", _cluster/stats answers "how big is it, and where is the weight?".

curl -X GET "localhost:9200/_cluster/stats?pretty"

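Fields you will look at often.

indices.count, indices.shards.total: total number of indices and shards. A shard explosion shows up here (past roughly 10,000 shards, master load climbs).
indices.docs.count, indices.docs.deleted: total documents and documents marked as deleted. If deleted exceeds 30% of count, merges are not reclaiming deleted documents fast enough and a force merge is overdue.
indices.store.size_in_bytes: size of the actual data on disk.
indices.fielddata.memory_size_in_bytes: memory held by the fielddata cache. It balloons when there is a lot of sorting or aggregation on text fields.
indices.segments.count: total number of segments. Dozens to a few hundred per shard is normal; thousands per shard means merging is not keeping up.
nodes.jvm.mem.heap_used_in_bytes vs nodes.jvm.mem.heap_max_in_bytes: heap usage ratio.
nodes.os.mem.total_in_bytes: total OS memory across the cluster.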
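To make the two thresholds in this list concrete, here is a small sketch that flags them from a parsed _cluster/stats body. The function name, flag strings, and sample numbers are hypothetical; only the field paths and the 10,000-shard and 30% thresholds come from the text.

```python
def cluster_stats_flags(stats: dict, max_shards: int = 10_000,
                        max_deleted_ratio: float = 0.30) -> list[str]:
    """Flag the shard-count and deleted-document signals in a _cluster/stats body."""
    indices = stats["indices"]
    flags = []
    if indices["shards"]["total"] > max_shards:
        flags.append("too_many_shards")
    docs = indices["docs"]
    # Ratio of deleted-but-not-yet-merged documents to live documents.
    if docs["count"] > 0 and docs["deleted"] / docs["count"] > max_deleted_ratio:
        flags.append("many_deleted_docs")
    return flags


# Hypothetical numbers, shaped like the real response.
sample = {"indices": {"shards": {"total": 240},
                      "docs": {"count": 1_000_000, "deleted": 400_000}}}
print(cluster_stats_flags(sample))
```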

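This is the API used most often when building a one-page cluster summary panel on an operations dashboard.

_nodes/stats: node-by-node depth

If _cluster/stats is the cluster-wide total, _nodes/stats gives every detail for each individual node. It is the API you spend the most time reading in ES operations.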

curl -X GET "localhost:9200/_nodes/stats?pretty"

The response is enormous, so you usually filter it down to specific metric groups.

curl -X GET "localhost:9200/_nodes/stats/jvm,thread_pool,indices?pretty"

Five key groups.

(1) JVM: heap and GC

"jvm": {
  "mem": {
    "heap_used_percent": 72,
    "heap_used_in_bytes": 23192678656,
    "heap_max_in_bytes": 32212254720
  },
  "gc": {
    "collectors": {
      "young": {
        "collection_count": 12345,
        "collection_time_in_millis": 234567
      },
      "old": {
        "collection_count": 23,
        "collection_time_in_millis": 45678
      }
    }
  }
}

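Signals to watch.

heap_used_percent: 75% is the threshold to watch. Staying above 75% for a sustained period means GC is struggling to free memory and the node is drifting toward the parent circuit breaker (95% of heap by default).
gc.collectors.old.collection_count: more than one Old GC per minute is dangerous. This is a cumulative counter, so watch its rate of increase. ES uses G1GC by default, so frequent Old GCs point to an undersized heap or runaway fielddata.
gc.collectors.old.collection_time_in_millis: when Old GC time grows to multiple seconds, stop-the-world pauses get longer and search latency spikes.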
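Because the GC counters are cumulative since node start, the "per minute" signal has to be derived from two samples. The sketch below shows one way to do that; the function names and the second sample are made up for illustration, and each dict is assumed to be the jvm.gc.collectors.old object from _nodes/stats.

```python
def old_gc_per_minute(prev: dict, curr: dict, interval_seconds: float) -> float:
    """Old-generation collections per minute between two samples."""
    delta = curr["collection_count"] - prev["collection_count"]
    return delta * 60.0 / interval_seconds


def avg_old_gc_pause_ms(prev: dict, curr: dict) -> float:
    """Average pause per Old GC (ms) in the interval; 0.0 if none ran."""
    count = curr["collection_count"] - prev["collection_count"]
    spent = curr["collection_time_in_millis"] - prev["collection_time_in_millis"]
    return spent / count if count > 0 else 0.0


# First sample matches the JSON above; the second is hypothetical, taken 120 s later.
prev = {"collection_count": 23, "collection_time_in_millis": 45678}
curr = {"collection_count": 26, "collection_time_in_millis": 51678}
print(old_gc_per_minute(prev, curr, 120), avg_old_gc_pause_ms(prev, curr))
```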

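(2) Thread Pool: queues and rejections

ES has a separate thread pool for each kind of work. The three you look at most.

search: the thread pool for search queries.
write (named bulk before 6.3): the thread pool for index, update, delete, and bulk requests.
management: the thread pool for cluster management work.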

"thread_pool": {
  "search": {
    "threads": 13,
    "queue": 0,
    "active": 2,
    "rejected": 0,
    "largest": 13,
    "completed": 123456789
  },
  "write": {
    "threads": 8,
    "queue": 5,
    "active": 6,
    "rejected": 12,
    "completed": 9876543
  }
}

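Signals to watch.

queue: work is waiting in the queue. 0 is normal; a persistently positive value means the thread pool cannot keep up.
rejected: the number of tasks rejected because the queue was full as well. It is a cumulative counter since node start, so what matters is whether it is increasing. Any increase means clients are receiving HTTP 429 (es_rejected_execution_exception) on bulk indexing or search requests. Top-priority alert.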
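Since rejected only ever grows while a node is up, an alert usually compares two samples rather than the raw value. A minimal sketch, assuming each argument is the thread_pool object of one node from _nodes/stats (the function name and the second sample are hypothetical):

```python
def new_rejections(prev: dict, curr: dict) -> dict:
    """Per-pool increase of the cumulative `rejected` counter between two samples."""
    increases = {}
    for pool, stats in curr.items():
        now = stats.get("rejected", 0)
        before = prev.get(pool, {}).get("rejected", 0)
        # A smaller value than before means the node restarted and the counter reset.
        delta = now - before if now >= before else now
        if delta > 0:
            increases[pool] = delta
    return increases


prev = {"search": {"rejected": 0}, "write": {"rejected": 12}}
curr = {"search": {"rejected": 0}, "write": {"rejected": 15}}
print(new_rejections(prev, curr))
```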

(3) indices.indexing: indexing rate

"indices": {
  "indexing": {
    "index_total": 12345678,
    "index_time_in_millis": 234567,
    "index_current": 5,
    "index_failed": 0
  }
}

The rate of increase of index_total is the number of documents indexed per second. It is the source of the bulk indexing rate graph.

If index_failed keeps rising, suspect mapping conflicts, circuit breakers, or a full disk.
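The "rate of increase" is simply the difference between two readings of index_total divided by the time between them. A small sketch with a hypothetical second reading taken 60 seconds after the sample above:

```python
def indexing_rate(prev_total: int, curr_total: int, interval_seconds: float) -> float:
    """Documents indexed per second between two index_total readings."""
    # Clamp at zero so a counter reset after a node restart does not yield a negative rate.
    return max(curr_total - prev_total, 0) / interval_seconds


print(indexing_rate(12_345_678, 12_405_678, 60))
```

This is the same calculation a Prometheus rate() or a Kibana derivative aggregation performs on the counter.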

(4) indices.search: search speed

"search": {
  "query_total": 9876543,
  "query_time_in_millis": 1234567,
  "query_current": 2,
  "fetch_total": 9876543,
  "fetch_time_in_millis": 234567
}

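query_time_in_millis / query_total is the average query time. Both are cumulative counters, so compute it over the delta between two samples to see the current value. An average hides the tail, so when it starts climbing toward your p99 target, analyze it together with the slow log.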
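The arithmetic is time divided by count. Worked through with the sample numbers above (the function name is illustrative):

```python
def avg_query_ms(query_total: int, query_time_in_millis: int) -> float:
    """Average time per query in milliseconds; 0.0 when no queries ran."""
    return query_time_in_millis / query_total if query_total else 0.0


lifetime_avg = avg_query_ms(9_876_543, 1_234_567)
print(round(lifetime_avg, 3))
```

For a current value instead of a lifetime average, pass the differences between two samples (delta of query_total, delta of query_time_in_millis) to the same function.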

(5) indices.fielddata and query_cache: caches

"fielddata": {
  "memory_size_in_bytes": 12345678,
  "evictions": 0
},
"query_cache": {
  "memory_size_in_bytes": 23456789,
  "total_count": 12345,
  "hit_count": 11000,
  "miss_count": 1345,
  "evictions": 0
}

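Signals to watch.

fielddata.evictions: 0 is normal. If it grows, fielddata is under memory pressure and the fielddata circuit breaker is not far off.
query_cache.hit_count / total_count: the query cache hit rate. 70% or more is good. At 30% or less, re-examine the query patterns.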
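The hit rate for the sample above works out as follows; total_count there equals hit_count + miss_count (11000 + 1345 = 12345). The function name is illustrative.

```python
def query_cache_hit_rate(query_cache: dict) -> float:
    """hit_count / total_count for a query_cache stats object; 0.0 if unused."""
    total = query_cache.get("total_count", 0)
    return query_cache.get("hit_count", 0) / total if total else 0.0


sample = {"total_count": 12345, "hit_count": 11000, "miss_count": 1345}
rate = query_cache_hit_rate(sample)
print(f"{rate:.1%}")
```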

_cat APIs: quick shell-friendly checks

When pulling everything as JSON is too heavy, _cat is the lightweight set of APIs that returns a table from a single terminal line.

The four used most.

_cat/indices

curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"

health status index            uuid  pri rep docs.count docs.deleted store.size pri.store.size
green  open   orders-2026-05   abc.. 5   1   12345678   234567       45gb       22.5gb
green  open   orders-2026-04   xyz.. 5   1   11234567   123456       42gb       21gb
yellow open   logs-2026-05-19  qrs.. 3   1   9876543    0            8gb        8gb

Health, shard counts, document counts, and disk size at a glance. The s=store.size:desc option sorts the largest indices first.

_cat/shards

curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,docs,store,node"

index           shard prirep state    docs   store  node
orders-2026-05  0     p      STARTED  2.4M   4.5gb  es-data-01
orders-2026-05  0     r      STARTED  2.4M   4.5gb  es-data-02
orders-2026-05  1     p      UNASSIGNED

Shows the location and state of every single shard. If you see UNASSIGNED, go straight to allocation/explain.

_cat/nodes

curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,role,master"

name        heap.percent ram.percent cpu load_1m role  master
es-master-01 35          78          12  1.2     mr    *
es-master-02 32          76          10  1.1     mr    -
es-data-01   72          88          45  3.4     dh    -
es-data-02   68          85          42  3.2     dh    -
es-data-03   75          90          50  4.1     dh    -

Per-node heap, CPU, roles, and the current master at a glance. Used to quickly find nodes whose heap.percent is close to 75%.
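If you want to script that "find the hot nodes" step, one rough approach is to parse the whitespace-separated table. The sketch below assumes the output was requested with ?v so that a header row is present; hot_nodes is a hypothetical helper name.

```python
CAT_NODES = """\
name        heap.percent ram.percent cpu load_1m role  master
es-master-01 35          78          12  1.2     mr    *
es-master-02 32          76          10  1.1     mr    -
es-data-01   72          88          45  3.4     dh    -
es-data-02   68          85          42  3.2     dh    -
es-data-03   75          90          50  4.1     dh    -
"""


def hot_nodes(cat_output: str, threshold: int = 75) -> list[str]:
    """Names of nodes whose heap.percent is at or above the threshold."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("name")
    heap_col = header.index("heap.percent")
    return [cols[name_col]
            for cols in (line.split() for line in lines[1:])
            if int(cols[heap_col]) >= threshold]


print(hot_nodes(CAT_NODES))
```

For anything beyond a quick script, adding format=json to the _cat request avoids column parsing altogether.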

_cat/recovery

curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"

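Right after a snapshot restore or adding a node, this shows which shards are being moved where. Use it to check progress on a cluster that is relocating shards while in service.

These four are the commands an ES operator working from the terminal types every day.

Slow Logs: tracking slow search and indexing operations

When the average is fast but a handful of queries are terrible, the average graph will not show it. Slow Log is the feature where ES itself records slow operations to a log file.

Two kinds.

Index slowlog: records slow indexing operations, that is, how long an individual document took to index.
Search slowlog: records slow search operations, that is, how long the query and fetch phases of a search took on a shard.

Settings are applied per index.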

curl -X PUT "localhost:9200/orders-2026-05/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.query.debug": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "500ms",
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.indexing.slowlog.threshold.index.info": "5s"
}'

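How to read it.

query.warn 5s: log at WARN level if the query phase takes longer than 5 seconds.
query.info 2s: log at INFO level above 2 seconds.
fetch.warn 1s: log at WARN if the fetch phase (retrieving the matching documents) takes longer than 1 second.
indexing.warn 10s: log at WARN if indexing a document takes longer than 10 seconds.

The log file usually lives somewhere like /var/log/elasticsearch/<cluster_name>_index_search_slowlog.json. In 8.x the JSON format is the default, so you can ship it through Filebeat to Kibana as-is for analysis.

An example slow log line.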

{
  "@timestamp": "2026-05-19T03:24:15.234Z",
  "level": "WARN",
  "component": "i.s.s.query",
  "cluster.name": "production-es",
  "node.name": "es-data-03",
  "message": "took[5.2s], took_millis[5234], total_hits[12345 hits], stat[], search_type[QUERY_THEN_FETCH], total_shards[5], source[...]"
}

The source field contains the actual query body, so you can see exactly which query was slow.
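If the slow log lines look like the example above, the duration can be pulled out of the message field with a small parser. This is a sketch under that assumption: the function name is hypothetical, the sample line is shortened, and depending on version and logging configuration the duration may be exposed as its own JSON field instead of being embedded in message.

```python
import json
import re
from typing import Optional

TOOK_MILLIS = re.compile(r"took_millis\[(\d+)\]")


def slowlog_took_millis(line: str) -> Optional[int]:
    """Extract took_millis from one JSON slow log line, or None if absent."""
    record = json.loads(line)
    match = TOOK_MILLIS.search(record.get("message", ""))
    return int(match.group(1)) if match else None


# Shortened version of the example line above.
line = ('{"level": "WARN", "node.name": "es-data-03", '
        '"message": "took[5.2s], took_millis[5234], total_hits[12345 hits], source[...]"}')
print(slowlog_took_millis(line))
```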

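Stack Monitoring: integrated visibility in Kibana

Stack Monitoring (formerly X-Pack Monitoring) is the integrated cluster dashboard that the Elastic Stack itself provides. From the Stack Monitoring menu in Kibana you see clusters, nodes, indices, and logs on one screen.

Two collection modes.

(1) Self-monitoring (legacy)

ES collects its own metrics and stores them in .monitoring-es-* indices. The simplest mode.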

# elasticsearch.yml
xpack.monitoring.collection.enabled: true
xpack.monitoring.elasticsearch.collection.enabled: true

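The problem: if the monitored cluster dies, monitoring dies with it. Not recommended for production.

(2) Metricbeat agent (recommended)

A separate Metricbeat agent polls the ES APIs from outside the cluster to collect metrics. This lets monitoring be fully separated from the monitored cluster.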

# metricbeat.yml
metricbeat.modules:
- module: elasticsearch
  xpack.enabled: true
  period: 10s
  hosts: ["http://es-data-01:9200"]
  username: "remote_monitoring_user"
  password: "***"
output.elasticsearch:
  hosts: ["http://es-monitoring-01:9200"]

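When metrics collected this way are stored in .monitoring-es-* on a separate monitoring cluster, monitoring stays up even if the production cluster dies, so post-incident analysis remains possible.

Dedicated Monitoring Cluster

The production recommendation is to split out a dedicated monitoring cluster. It is usually a small cluster of about 3 nodes that gathers metrics from several production clusters.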

[Production cluster A: 6 nodes] ──┐
                                  ├─ Metricbeat ─→ [Monitoring cluster: 3 nodes] ─→ Kibana
[Production cluster B: 4 nodes] ──┘

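This pattern is the de facto standard setup for 8.x operations.

External tools: Prometheus, Grafana, ElastAlert

Besides Elastic's official Stack Monitoring, pairing ES with standard external tools is also common. The Prometheus + Grafana combination covered in series 6 (Data Infrastructure) and series 7 (Grafana) applies to ES unchanged.

(1) Elasticsearch Exporter

A separate agent that exposes data from ES APIs such as _cluster/health and _nodes/stats in Prometheus format. Two names come up.

prometheus-community/elasticsearch_exporter: written in Go, the most widely used exporter.
Elastic's official Prometheus integration: works in the opposite direction. Metricbeat scrapes Prometheus endpoints and indexes those metrics into Elasticsearch, so it is not a way to expose ES metrics to Prometheus.

Installation is usually a single Docker command.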

docker run -d -p 9114:9114 \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=http://es-data-01:9200

Prometheus scrapes it at /metrics.

(2) Grafana dashboards

Searching for Elasticsearch in Grafana's official dashboard library turns up a long list of ready-made dashboards. The three used most.

Dashboard 14191, Elasticsearch Exporter Quickstart and Dashboard: the most standard.
Dashboard 2322, Elasticsearch (prometheus-community): per-node depth.
Dashboard 13083, Elasticsearch Cluster Health: one-page cluster summary.

One JSON import and the dashboard is ready within 5 minutes.

(3) ElastAlert and Alertmanager

Alerting goes two ways.

Kibana's built-in Alerting: rule-based alerts from Stack Monitoring. Integrates with PagerDuty, Slack, and email.
Prometheus Alertmanager: with the Prometheus + exporter combination, alerts go through the standard Alertmanager.

The five alert rules most often set up in production.

# prometheus-alerts.yml
groups:
- name: elasticsearch
  rules:
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 1m
    annotations:
      summary: "Elasticsearch cluster is RED"

  - alert: ElasticsearchHeapTooHigh
    # Heap only: the same metric is also reported with area="non-heap".
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
    for: 5m
    annotations:
      summary: "ES heap usage above 85% for 5 min"

  - alert: ElasticsearchDiskTooHigh
    # Used percentage derived from the exporter's size and available byte gauges.
    expr: (1 - elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes) * 100 > 85
    for: 5m

  - alert: ElasticsearchUnassignedShards
    expr: elasticsearch_cluster_health_unassigned_shards > 0
    for: 10m

  - alert: ElasticsearchThreadPoolRejected
    expr: rate(elasticsearch_thread_pool_rejected_count[5m]) > 0
    for: 2m

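These five rules catch the large majority of ES incidents.

Common incidents

Incident 1: Slow Log not enabled

Cause: no slow log thresholds were set when the cluster went into production, so slow queries leave nothing in the logs. After an incident there is no way to trace which query was slow.

Fix: set slow log defaults on every production index through an index template. query.warn 5s, query.info 2s, and indexing.warn 10s are the usual baseline.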
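One way to express that baseline is as the request body for a composable index template (PUT _index_template/<name>). The sketch below only builds and prints the body; the helper name, the orders-* pattern, and the priority value are assumptions for illustration, while the setting keys and values are the ones used earlier in this post.

```python
import json


def slowlog_template(index_patterns: list[str], priority: int = 100) -> dict:
    """Build an index template body carrying the baseline slow log thresholds."""
    return {
        "index_patterns": index_patterns,
        "priority": priority,  # hypothetical; must fit your other templates
        "template": {
            "settings": {
                "index.search.slowlog.threshold.query.warn": "5s",
                "index.search.slowlog.threshold.query.info": "2s",
                "index.indexing.slowlog.threshold.index.warn": "10s",
            }
        },
    }


body = slowlog_template(["orders-*"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent with curl in the same way as the _settings call shown earlier. Indices created before the template existed still need that per-index _settings update.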

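Incident 2: Monitoring cluster not separated

Cause: monitoring is left in self-monitoring mode only, so when the production cluster dies, monitoring dies too. Right after the incident there is no data on why it died.

Fix: split it out with Metricbeat + a separate monitoring cluster. A small 3-node cluster is enough.

Incident 3: Heap 75%+ ignored

Cause: heap_used_percent stays above 75%, but there is no alert, or the alert fires and is habitually ignored. It ends in an Old GC storm → stop-the-world → search timeouts → circuit breaker → OOM.

Fix: 85% sustained for 5 minutes is a top-priority PagerDuty page. The root fix is more heap, or less fielddata and fewer shards.

Incident 4: Thread Pool Queue overload

Cause: a bulk indexing client retries endlessly with no backoff, and the write thread pool queue blows up. Eventually requests are rejected and the client loses data.

Fix: configure client retries with exponential backoff. Respond immediately to a rejected > 0 alert.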
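For reference, exponential backoff just means the wait doubles after each failed attempt, up to a cap. A minimal sketch of the delay schedule follows; the base, factor, cap, and retry count are arbitrary illustrative defaults, not values recommended by Elasticsearch.

```python
import random


def backoff_delays(retries: int = 6, base: float = 0.5,
                   factor: float = 2.0, cap: float = 30.0) -> list[float]:
    """Seconds to wait before each retry: base * factor**attempt, capped."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def with_jitter(delay: float, rng: random.Random) -> float:
    """Full jitter: a random wait in [0, delay] so clients do not retry in lockstep."""
    return rng.uniform(0.0, delay)


rng = random.Random()
for delay in backoff_delays():
    print(f"wait up to {delay:.1f}s, this time {with_jitter(delay, rng):.2f}s")
```

A real client would sleep for the jittered delay only after a 429 response, and give up (or send the batch to a dead-letter queue) once the retries are exhausted instead of looping forever.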

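Incident 5: Circuit breaker trips ignored

Cause: the fielddata circuit breaker tripped, but someone only silenced the alert without fixing the root problem. When aggregations or sorts run on a text field, fielddata balloons and the breaker trips.

Fix: no sorting or aggregating on text fields, as a rule. If you need it, add a keyword multi-field. The pattern is in part 8 (Mapping).

Incident 6: Pending tasks pile-up

Cause: a single master node has to process a flood of mapping changes and index creations, and the queue blows up. Cluster state updates get delayed.

Fix: run large operations as a nightly batch. Alert on number_of_pending_tasks > 100. Dedicated master nodes, separate from data nodes, are the standard.

Incident 7: Disk watermark exceeded

Cause: once a node's disk usage passes the low watermark (85%), no new shards are allocated to it; past the high watermark (90%), ES starts relocating existing shards off the node; past flood_stage (95%), every index with a shard on that node is locked read-only.

Fix: alert at 80% so you can act early. Pair it with ILM to delete old indices automatically.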
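The three stages can be summarized as a small lookup. This sketch uses the default percentages quoted above as parameters; the function name and the returned labels are illustrative.

```python
def watermark_stage(used_percent: float, low: float = 85.0,
                    high: float = 90.0, flood: float = 95.0) -> str:
    """Map a node's disk usage percentage to the watermark stage it has reached."""
    if used_percent >= flood:
        return "flood_stage"   # indices with shards on the node become read-only
    if used_percent >= high:
        return "high"          # shards start relocating off the node
    if used_percent >= low:
        return "low"           # no new shards allocated to the node
    return "ok"


for pct in (80, 86, 91, 96):
    print(pct, watermark_stage(pct))
```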

Recommended operating patterns

(1) Three-tier monitoring structure

Collect (Metricbeat) → store (dedicated monitoring ES cluster) → visualize and alert (Kibana + Alertmanager). Full separation from the production cluster is rule number one.

(2) Only five alert rules

Cluster Red, Heap 85%, Disk 85%, Unassigned Shards, Thread Pool Rejected. These five catch most ES incidents. Adding more leads to alert fatigue and ignored alerts, so start with just five.

(3) Slow Log default template

Set slow log defaults on every production index through an index template. If you put it off, the data will not be there right after the real incident.

(4) Separate dashboards

One-page cluster summary (for managers), node depth (for operators), query and indexing depth (for developers): three kinds of dashboards, split by audience. Put everything on one dashboard and nobody looks at it.

One more time before the exam: condensed notes

_cluster/health: green/yellow/red. unassigned_shards and number_of_pending_tasks come first.
_cluster/stats: cluster-wide totals. Shards, documents, disk, and heap at a glance.
_nodes/stats: node-by-node depth. Five groups: JVM, thread pool, indexing, search, cache.
_cat APIs: terminal-friendly tables. _cat/indices, shards, nodes, and recovery are the four you type daily.
Slow Log: per-index query, fetch, and indexing thresholds. JSON log → Filebeat → Kibana.
Stack Monitoring: Kibana's integrated dashboard. Metricbeat + a separate monitoring cluster is the standard.
Prometheus Exporter: elasticsearch_exporter (Go) is the most common. Exposes /metrics.
Grafana Dashboard 14191: the standard Elasticsearch dashboard.
Five core alerts: Cluster Red, Heap 85%, Disk 85%, Unassigned Shards, Thread Pool Rejected.
Disk Watermark: low 85% (allocation blocked), high 90% (shards relocated away), flood 95% (read-only).
Heap sustained at 75% = Old GC pressure building; 85% = PagerDuty.
Three-tier: collect (Metricbeat) → store (monitoring cluster) → visualize (Kibana + Alertmanager).

Other parts in the series

Previous = Part 29, Security: TLS, roles, API keys, audit logs
Next = Part 31, Performance Tuning: search and indexing optimization
Part 26 = Cluster Operations: _cluster/allocation/explain, split-brain, reroute
Part 27 = Shard Allocation: primary, replica, rebalance
Part 28 = Snapshot: repository, restore, SLM
Part 33 = Kibana and ELK: Logstash, Beats, Discover
Part 34 = Observability: APM, logs, and metrics combined
Part 38 = Series wrap-up: decision tree, checklist, certification

In one line: Monitoring = four built-in ES tools, _cluster/health (top priority), _nodes/stats (depth), _cat (shell), and Slow Log (slow queries), combined with Stack Monitoring (Kibana), the Prometheus exporter (Grafana), and Alertmanager (alerting), run as a separated 3-tier setup. That is the visibility standard.

