Adding EC2 Instance Recovery Alarms with CloudFormation

Automatically Push the “Recover” Button

Instance Recovery is a little-advertised, little-used feature of EC2. It doesn’t take long to set up and promises to recover your instance on the rare occasion that the underlying hardware fails. Recovery resumes the instance on new hardware, retaining its instance ID, private IP addresses, Elastic IP addresses, and all instance metadata.

I’ve deployed it on “snowflake” instances that don’t have the luxury of using Auto Scaling Groups. This gives me a little extra uptime assurance. I don’t think I’ve actually ever seen any EC2 instance get auto-recovered though.

Maybe I’m cargo-culting it, but it’s not much work to set up, so it feels like an easy (potential) win.

Update (2019-09-13): I asked on reddit for examples and Redditron-2000-4 replied. They have 1800 EC2 instances and see 1-3 automatic recoveries a month. That works out to roughly 0.06%-0.17% of instances being recovered each month. Small but significant if you’re looking for even 99.9% uptime!

You can set up a recovery alarm for an instance by clicking through the console, as per the documentation.

I like automating with CloudFormation though. Some time ago I converted the resulting manually-created alarm into a template snippet, and I copy-paste it between projects. Let’s take a look at it.

If you have an EC2 instance in your CloudFormation template’s Resources like so:

Ec2Instance:
  Type: AWS::EC2::Instance
  Properties:
    LaunchTemplate:
      LaunchTemplateId: !Ref LaunchTemplateId
      Version: !Ref LaunchTemplateVersion
    Tags:
    - Key: Name
      Value: Snowflake-Production

…then a basic recovery alarm for it would look like this:

Ec2InstanceAutorecoverAlarm:
  # Recover the instance if its EC2 status checks fail, as per:
  # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
  # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html#AddingRecoverActions
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmActions:
    - !Sub arn:aws:automate:${AWS::Region}:ec2:recover
    AlarmDescription: Recover instance if its status checks fail.
    Namespace: AWS/EC2
    MetricName: StatusCheckFailed_System
    Dimensions:
      - Name: InstanceId
        Value: !Ref Ec2Instance
    EvaluationPeriods: 2
    Period: 60
    Statistic: Minimum
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0

The snippet sets EvaluationPeriods to 2 and Period to 60 seconds, which gives you 2 minutes of a failing instance before it’s recovered. This is as recommended in the manual setup documentation.
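The window before recovery is EvaluationPeriods multiplied by Period, so 2 × 60 seconds here. If you’d rather tune that per stack than hard-code it, one option is to expose it as a template parameter. This is only a sketch: the parameter name RecoveryEvaluationPeriods is hypothetical and not part of the original snippet, and it assumes the Ec2Instance resource from earlier lives in the same template.

```yaml
Parameters:
  # Hypothetical parameter, introduced for illustration only.
  RecoveryEvaluationPeriods:
    Type: Number
    Default: 2
    MinValue: 1
    Description: Consecutive 60-second periods of failed system status checks before recovery.

Resources:
  Ec2InstanceAutorecoverAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmActions:
      - !Sub arn:aws:automate:${AWS::Region}:ec2:recover
      AlarmDescription: Recover instance if its status checks fail.
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Dimensions:
        - Name: InstanceId
          Value: !Ref Ec2Instance  # assumes the Ec2Instance resource shown earlier
      # Window before recovery = EvaluationPeriods x Period (in seconds).
      EvaluationPeriods: !Ref RecoveryEvaluationPeriods
      Period: 60
      Statistic: Minimum
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
```

With the default of 2 this should behave the same as the hard-coded snippet. Raising the value trades slower recovery for fewer false triggers.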

That documentation page also shows how to create alarms to stop, terminate, or reboot instances. I’ve not needed to do any of those, but if you do, you should be able to adapt this template snippet to match.
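As a rough idea of what such an adaptation might look like, here is an untested sketch of a reboot alarm. It swaps in the ec2:reboot action and watches StatusCheckFailed_Instance rather than StatusCheckFailed_System. The resource name Ec2InstanceRebootAlarm is made up for illustration, and the choice of 3 evaluation periods is an assumption based on my reading of the AWS guidance about keeping reboot and recover alarms from racing each other, so check the documentation before relying on it.

```yaml
Ec2InstanceRebootAlarm:  # hypothetical name, not from the original template
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmActions:
    # Same ARN shape as the recover action, with "reboot" as the action.
    - !Sub arn:aws:automate:${AWS::Region}:ec2:reboot
    AlarmDescription: Reboot instance if its instance status checks fail.
    Namespace: AWS/EC2
    # Instance-level check (software/OS problems) rather than the system-level check.
    MetricName: StatusCheckFailed_Instance
    Dimensions:
      - Name: InstanceId
        Value: !Ref Ec2Instance
    # Assumption: 3 x 60 seconds, deliberately different from the recover alarm's 2.
    EvaluationPeriods: 3
    Period: 60
    Statistic: Minimum
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
```

The stop and terminate actions should follow the same pattern with ec2:stop and ec2:terminate in the ARN.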

Fin

Hope this helps you recover,

—Adam

