I’ve been using AWS for a few years now, and it has been rock solid. Last Sunday one of my sites became unreachable. When I got home a couple of hours later, I was able to ssh into the instance and everything seemed to be working perfectly. I checked the utmp logs and saw that the instance had been rebooted. A while later I got this email from Amazon:
From: Amazon EC2 Notification
Subject: Notice: Degraded Amazon EC2 Instance

Hello,

We have noticed that one or more of your instances are running on a host degraded due to hardware failure.

i-xxxxxx

The host needs to undergo maintenance and will be taken down at 12:00 GMT on 2010-06-23. Your instances will be terminated at this point.

The risk of your instances failing is increased at this point. We cannot determine the health of any applications running on the instances. We recommend that you launch replacement instances and start migrating to them.

Feel free to terminate the instances with the ec2-terminate-instance API when you are done with them.

Sincerely,
The Amazon EC2 Team

It sounded like they would terminate the instance because of the hardware failure, and that would be very bad: this is a high-volume eCommerce site. I looked around for the best way to “clone” the instance and relaunch it, and it turned out to be really simple. When I set up EC2 stuff I always use an EBS volume for the important data like /home, the MySQL storage, and most of the configuration in /etc, such as the Apache vhost configs. I also use an Elastic IP address so I can switch it to another instance easily without modifying DNS records at all. So all I had to do was:
- get your AWS access keys, certs, and user ID onto the instance
- create a folder for the AMI bundling work
- bundle the root volume on the dying instance
$ sudo mkdir /mnt/ami && sudo ec2-bundle-vol -d /mnt/ami -k pk-CKXXXXXXXXXXXX.pem -u 12345678 -c cert-CKXXXXXXXXXXXXXXX.pem
- upload the bundle to S3 and register the AMI
$ ec2-upload-bundle -b somesite-post-degraded -m /mnt/ami/image.manifest.xml -a XXXXXXXXXX -s XXXXXXXXXXXXX/00XX
$ ec2-register somesite-post-degraded/image.manifest.xml
- launch a new instance with the AMI
- detach the EBS volume from the old instance
- attach the EBS volume to the new instance
- re-assign the Elastic IP to the new instance

You can do a lot of these tasks from the AWS Management Console.

All of that took about 2 hours; most of the time was spent waiting for the AMI to bundle and upload, as it was pretty large. Everything worked perfectly after the migration. When I set up the EC2 infrastructure I had planned for things like this, and in theory a migration should go without a glitch, but I had never actually needed to migrate an instance. It’s good to know that everything actually worked as designed.
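
The last four steps in the list (launch, detach, attach, re-assign the IP) have no commands shown, so here is a rough sketch of how they might look with the same EC2 API tools. This is not the exact sequence I ran. The function names, the `run` helper and the `DRY_RUN` switch are made up for illustration, and every ID, keypair name, zone and address you pass in is your own. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the launch / detach / attach / re-assign steps.
# Assumption: the EC2 API tools are installed and configured.
# DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Launch a replacement instance from the newly registered AMI.
# The zone should match the EBS volume's zone, since a volume can
# only be attached to an instance in the same availability zone.
# usage: launch_replacement AMI_ID KEYPAIR ZONE
launch_replacement() {
    run ec2-run-instances "$1" -k "$2" -z "$3"
}

# Detach the EBS volume from the old instance.
# Unmount the filesystem on the old instance first.
# usage: detach_volume VOLUME_ID
detach_volume() {
    run ec2-detach-volume "$1"
}

# Attach the EBS volume to the new instance once it is detached.
# usage: attach_volume VOLUME_ID NEW_INSTANCE_ID DEVICE
attach_volume() {
    run ec2-attach-volume "$1" -i "$2" -d "$3"
}

# Point the Elastic IP at the new instance.
# usage: move_ip ELASTIC_IP NEW_INSTANCE_ID
move_ip() {
    run ec2-associate-address -i "$2" "$1"
}
```

After sourcing the script, something like `attach_volume vol-11111111 i-22222222 /dev/sdf` (placeholder IDs) prints the `ec2-attach-volume` command so you can check it before running it for real with `DRY_RUN=0`. The instance ID for the attach and IP steps comes from the output of the launch step, which is why each step is a separate function rather than one big script.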