Adding EC2 Instance Recovery Alarms with CloudFormation
Instance Recovery is a little-advertised, little-used feature of EC2. It doesn’t take long to set up and promises to recover your instance on the rare occasion that the underlying hardware fails. Recovery resumes the instance on new hardware, retaining its instance ID, private IP addresses, Elastic IP addresses, and all instance metadata.
I’ve deployed it on “snowflake” instances that don’t have the luxury of using Auto Scaling Groups. This gives me a little extra uptime assurance. I don’t think I’ve actually ever seen any EC2 instance get auto-recovered though.
Maybe I’m cargo-culting it, but it’s not much work to set up, so it feels like an easy (potential) win.
Update (2019-09-13): I asked on reddit for examples and Redditron-2000-4 replied. They have 1800 EC2 instances and see 1-3 automatic recoveries a month. This is a failover rate of 0.05%-0.15%. Small but significant if you’re looking for even 99.9% uptime!
You can click to set up a recover alarm for an instance on the console as per the documentation.
I like automating with CloudFormation though. I converted the resulting manually-created alarm into a template snippet some time ago, and copy paste it between projects. Let’s take a look at it.
If you have an EC2 instance in your CloudFormation template’s Resources
like so:
Ec2Instance:
Type: AWS::EC2::Instance
Properties:
LaunchTemplate:
LaunchTemplateId: !Ref LaunchTemplateId
Version: !Ref LaunchTemplateVersion
Tags:
- Key: Name
Value: Snowflake-Production
…then a basic recovery alarm for it would look like this:
Ec2InstanceAutorecoverAlarm:
# Recover the instance if its EC2 status checks fail, as per:
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html#AddingRecoverActions
Type: AWS::CloudWatch::Alarm
Properties:
AlarmActions:
- !Sub arn:aws:automate:${AWS::Region}:ec2:recover
AlarmDescription: Recover instance if its status checks fail.
Namespace: AWS/EC2
MetricName: StatusCheckFailed_System
Dimensions:
- Name: InstanceId
Value: !Ref Ec2Instance
EvaluationPeriods: 2
Period: 60
Statistic: Minimum
ComparisonOperator: GreaterThanThreshold
Threshold: 0
The snippet EvaluationPeriods
set to gives you 2 minutes of a failing instance before it’s recovered. This is as recommended in the manual setup documentation.
That documentation page also shows how to create alarms to stop, terminate, or reboot instances. I’ve not needed to do any of those, but if you do, you should be able to adapt this template snippet to match.
Newly updated: my book Boost Your Django DX now covers Django 5.0 and Python 3.12.
One summary email a week, no spam, I pinky promise.
Related posts:
- A Minimum Viable CloudFormation Template
- Running CloudFormation Drift Detection on All Your Stacks
- Testing Boto3 with pytest Fixtures
Tags: aws, cloudformation