OpsFire: My AWS Elasticsearch Cluster is YELLOW (Again)

It's that time once again - time for the AWS ES cluster to go yellow. As you may recall, this has happened before, so I figured, hey, I got this, let's sort out some shards!

$  curl -s -XGET 'my-cluster.es.amazonaws.com/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep -i unassign
logstash-2017.09.27 4 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 3 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 2 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 1 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 0 r UNASSIGNED INDEX_CREATED

$ curl -s -XPUT 'my-cluster.es.amazonaws.com/logstash-2017.09.27/_settings' -d '{"number_of_replicas": 0}'
{"acknowledged":true}

$  curl -s -XPUT 'my-cluster.es.amazonaws.com/logstash-2017.09.27/_settings' -d '{"number_of_replicas": 1}'
{"acknowledged":true}

$  curl -s -XGET 'my-cluster.es.amazonaws.com/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep -i unassign
logstash-2017.09.27 1 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 4 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 3 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 2 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 0 r UNASSIGNED REPLICA_ADDED
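
At this point the replicas should start allocating themselves again. If you'd rather watch it happen than take it on faith (foreshadowing), a quick health check shows the status and how many shards are still unassigned. A minimal sketch, using the same placeholder endpoint as above:

$  curl -s 'my-cluster.es.amazonaws.com/_cluster/health' | jq '{status, unassigned_shards}'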

Go back to sleep, wake up the next morning, and what status is my cluster?

YELLOW.

$  curl -s -XGET 'my-cluster.es.amazonaws.com/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep -i unassign
logstash-2017.09.27 4 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 3 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 2 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 1 r UNASSIGNED REPLICA_ADDED
logstash-2017.09.27 0 r UNASSIGNED REPLICA_ADDED

Rookie mistake. Probably worth actually looking into why this happened, eh? Assumptions and whatnot.

$  curl -s -XGET 'my-cluster.es.amazonaws.com/_cluster/allocation/explain' -d '{"index":"logstash-2017.09.27","shard":1,"primary":false}' | jq

{
  "shard": {
    "index": "logstash-2017.09.27",
    "index_uuid": "████████████████████",
    "id": 1,
    "primary": false
  },
  "assigned": false,
  "shard_state_fetch_pending": false,
  "unassigned_info": {
    "reason": "REPLICA_ADDED",
    "at": "2017-09-27T15:29:41.772Z",
    "delayed": false,
    "allocation_status": "no_attempt"
  },
  "allocation_delay_in_millis": 60000,
  "remaining_delay_in_millis": 0,
  "nodes": {
    "████████████████████": {
      "node_name": "████████",
      "store": {
        "shard_copy": "NONE"
      },
      "final_decision": "NO",
      "final_explanation": "the shard cannot be assigned because allocation deciders return a NO decision",
      "weight": -0.725,
      "decisions": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark and has more than allowed [85.0%] used disk, free: [12.461602721437474%]"
        }
      ]
    },
...

You'll note that I piped the output through jq. I highly recommend you do the same. Unless you like big blobs of JSON. If so, I mean, I guess that's your personal preference. But if you're working with Elasticsearch you'll be seeing a lot of it, so you know. Just uh, think about that.
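
For example, if all you care about is which nodes said no and why, a filter like this (written against the explain output shape shown above) cuts it down to just the node names, final decisions, and the deciders behind them:

$  curl -s -XGET 'my-cluster.es.amazonaws.com/_cluster/allocation/explain' \
     -d '{"index":"logstash-2017.09.27","shard":1,"primary":false}' \
     | jq '.nodes[] | {node_name, final_decision, decisions}'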

Moving on.

Reading the output, it looks like my ES cluster is running out of disk: at least one node is above the 85% low disk watermark, so Elasticsearch refuses to put new replica shards on it. Say what? Didn't I configure CloudWatch alarms to warn me about this?

Well, I did, but it turns out that I didn't configure them *quite* the way I needed to. You see, there are two different metrics for free space: free storage space and minimum free storage space (the free space on the node with the least room left). My alarm was watching the former. So even though the cluster as a whole had enough free space, its most-full node did not.

🙃
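
Before reaching for more hardware, it's worth confirming the per-node picture straight from Elasticsearch too. A quick sketch using _cat/allocation, assuming the standard column names; it shows how full each data node actually is:

$  curl -s 'my-cluster.es.amazonaws.com/_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total'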

Moving forward: I added a CloudWatch alarm on the minimum free storage space and increased the storage on my ES cluster.
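
For the curious, the alarm looks roughly like this via the AWS CLI. This is a sketch, not gospel: it assumes the AWS/ES namespace's FreeStorageSpace metric with the Minimum statistic, and the domain name, account ID, SNS topic, and threshold (FreeStorageSpace is reported in megabytes) are all placeholders:

$  aws cloudwatch put-metric-alarm \
     --alarm-name es-minimum-free-storage \
     --namespace AWS/ES \
     --metric-name FreeStorageSpace \
     --dimensions Name=DomainName,Value=my-cluster Name=ClientId,Value=123456789012 \
     --statistic Minimum \
     --period 300 \
     --evaluation-periods 1 \
     --comparison-operator LessThanOrEqualToThreshold \
     --threshold 20480 \
     --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts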

Documented on my frequently used assets page.