OpsFire: Always Check Your Assumptions

Ran into an interesting situation today - received a seemingly benign request to terminate an EC2 instance that was no longer needed. The instance was monitored by New Relic and maintained by Puppet, so:

$  sudo /etc/init.d/puppet stop
$  sudo yum remove newrelic-infra -y

Then I jumped into the AWS Console and stopped the instance. Easy peasy, right?

Things started to go wrong when an alert from New Relic appeared in our monitoring Slack channel that the instance was not reporting. I logged into the New Relic Infrastructure web UI and, sure enough, the instance was still listed there in the host list.

It did seem odd that New Relic alerted after the host was stopped but not when I removed the package. Reviewing the documentation I found this:

Delete a host from the index (v2) | New Relic Documentation
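As I read the docs, the call is a DELETE against the v2 servers endpoint, keyed by a numeric server ID (the ID below is a made-up placeholder, and the little helper is just mine for illustration - it's not part of New Relic's tooling):

```shell
# Build the v2 delete URL for a given server ID.
# The ID is numeric -- it is NOT the hostname (more on that below).
nr_server_url() {
  printf 'https://api.newrelic.com/v2/servers/%s.json' "$1"
}

nr_server_url 1234567

# The actual call would then look like:
#   curl -X DELETE "$(nr_server_url 1234567)" -H "X-Api-Key:${APIKEY}"
```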

Ok, sure. I wasn't 100% sure what the server ID was for New Relic, being relatively new to this product, so at first I assumed it was the hostname - protip: it isn't. I actually needed to issue another API call to get the server ID as it isn't in the web UI either:

$  curl -X GET 'https://api.newrelic.com/v2/servers.json' -H "X-Api-Key:${APIKEY}"

There was quite a bit of JSON output, so I decided to employ jq to give me what I needed to ID the server:

$  curl -s -X GET 'https://api.newrelic.com/v2/servers.json' -H "X-Api-Key:${APIKEY}" | jq '.servers[] | "\(.id) -- \(.name) -- \(.host)"'

And this is where I started scratching my head. Again. The instance was not in the list of servers. I wondered if we had just enough instances to have paged results, but it turns out we do not. So it wasn't hiding elsewhere.
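Ruling out paging was just a matter of counting what came back. A quick sketch against a saved copy of the response (the payload here is a made-up two-host sample, not our real data):

```shell
# Made-up sample matching the shape of the /v2/servers.json response
cat > /tmp/servers.json <<'EOF'
{"servers":[{"id":42,"name":"web-01","host":"web-01.example.com"},
            {"id":43,"name":"web-02","host":"web-02.example.com"}]}
EOF

# If the count equals the API's page size, there may be more pages
# to fetch; well under it, everything fit on one page.
jq '.servers | length' /tmp/servers.json
```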

At this point, I look in the New Relic web UI again. The host is gone! It's a miracle! Perhaps there's been some lag? Who knows!

So I go back into the AWS Console ...
... I stop the instance ...
... and I get a monitoring alert from New Relic that the host isn't reporting.

@#&$^

Ok. Deep breaths. There's probably some sort of independent health check going on. Which means I'll need to monitor the chatter on the instance. To get started I do some digging and find that New Relic helpfully documents what IP ranges they use:

Networks | New Relic Documentation

Now I can check the inbound/outbound traffic for addresses in these ranges. This means we're using our old frenemy, tcpdump. We'll filter on both CIDR ranges and ports:

$  sudo tcpdump -nn -p -e "(net 50.31.164.0/24 or net 162.247.240.0/22) and (port 80 or port 443)"
-bash: tcpdump: command not found

We'll sigh, install tcpdump on an instance we just want to delete, and try again:

$  sudo yum install -y tcpdump && sudo tcpdump -i eth0 -nn -p -e "(net 50.31.164.0/24 or net 162.247.240.0/22) and (port 80 or port 443)"

No traffic. Checking the ports:

$  sudo tcpdump -i eth0 -nn -p -e "(port 80 or port 443)"

Lots of chatter. Good, I guess? Dropping the ports, checking only the CIDR ranges:

$  sudo tcpdump -i eth0 -nn -p -e "(net 50.31.164.0/24 or net 162.247.240.0/22)"

No traffic. Ok.
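For spot-checking individual addresses from the port capture against the published ranges, a throwaway helper like this works (pure shell, my own sketch, nothing to do with New Relic's tooling):

```shell
# Does an IPv4 address fall inside a CIDR range?
in_cidr() {
  local ip=$1 cidr=$2
  local net=${cidr%/*} bits=${cidr#*/}
  # Convert dotted-quad to a 32-bit integer
  ip_to_int() { local IFS=.; set -- $1; echo $(( ($1<<24)|($2<<16)|($3<<8)|$4 )); }
  local mask=$(( 0xFFFFFFFF << (32 - bits) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

in_cidr 50.31.164.7 50.31.164.0/24  && echo "in range"
in_cidr 10.0.0.1 162.247.240.0/22   || echo "not in range"
```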

Reinstalled the New Relic agent, and noticed that perhaps I hadn't waited long enough for New Relic to alarm after removing it - apparently New Relic does in fact alarm (eventually?) when the agent is removed.

And it was only then that I actually looked at how the alert itself was configured and realized I could have saved myself a lot of work. The alert was "Host not reporting", and it checked everything tagged Class:prod. I changed the Class tag to test: no alarm. Removed the New Relic agent: no alarm. Stopped the instance: no alarm.

It's always embarrassing when the answer is easy, isn't it?

OpsFire Badge

Documented on my frequently used assets page.