It's that time once again - time for the AWS ES cluster to go yellow. As you may recall, this has happened before so I figured, hey, I got this, let's sort out some shards!

$ curl -s -XGET 'my-cluster.es.amazonaws.com/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep -i unassign
logstash-2017.09.27 4 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 3 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 2 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 1 r UNASSIGNED INDEX_CREATED
logstash-2017.09.27 0 r UNASSIGNED INDEX_CREATED

$ curl -s -XPUT 'my-cluster.es.amazonaws.com/logstash-2017.09.27/_settings' -d '{"number_of_replicas": 0}'
{"acknowledged":true}

$ curl -s -XPUT 'my-cluster.es.amazonaws.com/logstash-2017.09.27/_settings' -d '{"number_of_replicas": 1}'
{"acknowledged":true}

$ curl -s -XGET 'my-cluster.es.amazonaws.com/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep -i unassign


Go back to sleep, wake up the next morning, and what status is my cluster?

YELLOW.

$ curl -s -XGET 'my-cluster.es.amazonaws.com/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep -i unassign


Rookie mistake. Probably worth actually looking into why this happened, eh? Assumptions and whatnot.

$ curl -s -XGET 'my-cluster.es.amazonaws.com/_cluster/allocation/explain' -d '{"index":"logstash-2017.09.27","shard":1,"primary":false}' | jq

{
  "shard": {
    "index": "logstash-2017.09.27",
    "index_uuid": "████████████████████",
    "id": 1,
    "primary": false
  },
  "assigned": false,
  "shard_state_fetch_pending": false,
  "unassigned_info": {
    "at": "2017-09-27T15:29:41.772Z",
    "delayed": false,
    "allocation_status": "no_attempt"
  },
  "allocation_delay_in_millis": 60000,
  "remaining_delay_in_millis": 0,
  "nodes": {
    "████████████████████": {
      "node_name": "████████",
      "store": {
        "shard_copy": "NONE"
      },
      "final_decision": "NO",
      "final_explanation": "the shard cannot be assigned because allocation deciders return a NO decision",
      "weight": -0.725,
      "decisions": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark and has more than allowed [85.0%] used disk, free: [12.461602721437474%]"
        }
      ]
    },
    ...
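With a cluster of any size, that explain output gets long. A jq filter can boil it down to just the deciders that said NO. A sketch, using a trimmed heredoc as a stand-in for the real response (the field names match the explain format shown above):

```shell
# Extract just the blocking deciders from an allocation/explain response.
# The heredoc is a trimmed stand-in for the real curl output above.
jq -c '[.nodes[].decisions[] | select(.decision == "NO") | .decider]' <<'EOF'
{
  "nodes": {
    "node-a": {
      "final_decision": "NO",
      "decisions": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark"
        }
      ]
    }
  }
}
EOF
# → ["disk_threshold"]
```

Against the live cluster, you'd pipe the curl command from above into the same jq filter instead of the heredoc.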


You'll note that I piped the output through jq. I highly recommend you do the same. Unless you like big blobs of JSON. If so, I mean, I guess that's your personal preference. But if you're working with Elasticsearch you'll be seeing a lot of it, so you know. Just uh, think about that.

Moving on.

Reading the output, it looks like my ES cluster has run out of space: that node only has ~12.46% disk free, putting it over the default 85% low watermark, so ES refuses to allocate new replicas to it. Say what? Didn't I configure CloudWatch alarms to warn me about this?

Well, I did, but it turns out that I didn't configure them *quite* the way I needed to. You see, there are two different metrics for free space: free storage space and minimum free storage space. Even though the cluster as a whole had enough free space, it didn't have enough minimum free space (the free space on the smallest node) - and the smallest node is where allocation fails first.
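For the record, the alarm I needed looks something like this with the AWS CLI - the `FreeStorageSpace` metric in the `AWS/ES` namespace with the `Minimum` statistic, which tracks the smallest node. The domain name, account ID, SNS topic, and the 20 GB threshold (the metric is reported in megabytes) are all placeholders for your own setup:

```shell
# Alarm on the *smallest* node's free storage, not the cluster total.
# DomainName, ClientId (your AWS account ID), the SNS topic ARN, and
# the threshold are placeholders - adjust for your own cluster.
aws cloudwatch put-metric-alarm \
  --alarm-name es-minimum-free-storage \
  --namespace AWS/ES \
  --metric-name FreeStorageSpace \
  --dimensions Name=DomainName,Value=my-cluster Name=ClientId,Value=123456789012 \
  --statistic Minimum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 20480 \
  --comparison-operator LessThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:es-alerts
```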

Moving forward: I added a CloudWatch alarm on minimum free storage space and increased the storage on my ES cluster.

Documented on my frequently used assets page.