Monday, June 6, 2011

OpenStack Nova: basic disaster recovery

Today, I want to look at some issues you may encounter while using OpenStack. The purpose of this post is to share our experience dealing with the hardware and software failures that anyone who attempts to run OpenStack in production will inevitably face.

Software issue


Let's start with the simplest and probably the most frequent issue. Suppose we need to upgrade the kernel, or other software whose upgrade requires a host reboot, on one of the compute nodes. The best approach is to migrate all virtual machines running on this server to other compute nodes first. Unfortunately, this is sometimes impossible, for example because there is no shared storage to perform the migration, or because the other nodes lack the CPU and memory resources to host all the VMs. In that case, the only option is to shut down the virtual machines for the maintenance period. But how should they be started correctly after the host has rebooted? Of course, you may set a special flag in nova.conf so that instances start automatically when the host system boots:


However, you may want to disable it (in fact, setting this flag is a bad idea if you use the nova-volume service).

There are many ways to start virtual machines. Probably the simplest one is to run:


It will recreate and start the libvirt domain using the instance XML. This method works well if you don't have remotely attached volumes; otherwise, nova boot will fail with an error. In that case, you'll need to start the domain manually using the virsh tool, connect the iSCSI device, create an XML file describing it, and attach it to the instance, which is a nightmare if you have lots of instances with volumes.
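To give an idea of what the manual procedure above involves for a single volume, here is a minimal Python sketch that only builds the commands and the disk XML; it does not run anything. The function names, paths, IQN, and portal address are hypothetical placeholders, and the disk definition (raw block device on a virtio bus) is an assumption that may not match your setup.

```python
import shlex
from xml.sax.saxutils import quoteattr


def disk_xml(device_path, target_dev):
    """Build a libvirt <disk> element for a block device (assumed raw/virtio)."""
    return (
        "<disk type='block' device='disk'>\n"
        "  <driver name='qemu' type='raw'/>\n"
        f"  <source dev={quoteattr(device_path)}/>\n"
        f"  <target dev={quoteattr(target_dev)} bus='virtio'/>\n"
        "</disk>\n"
    )


def manual_recovery_commands(domain_xml, domain, target_iqn, portal, disk_xml_path):
    """Return the shell commands, in order, for one instance with one volume."""
    return [
        # Start the domain from the instance's libvirt XML.
        ["virsh", "create", domain_xml],
        # Discover the iSCSI target on the volume host and log in to it.
        ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal],
        ["iscsiadm", "-m", "node", "-T", target_iqn, "-p", portal, "--login"],
        # Attach the disk described in the XML file to the running domain.
        ["virsh", "attach-device", domain, disk_xml_path],
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    commands = manual_recovery_commands(
        domain_xml="/path/to/instance/libvirt.xml",
        domain="instance-00000001",
        target_iqn="iqn.example:volume-00000001",
        portal="192.0.2.10:3260",
        disk_xml_path="/tmp/volume-disk.xml",
    )
    for command in commands:
        print(shlex.join(command))
    print(disk_xml("/dev/sdX", "vdb"))
```

Multiply this by every instance and every attached volume on the host, and it becomes clear why doing it by hand does not scale.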

Hardware issue


Now imagine another situation: a server hosting a compute node develops a hardware problem that we can't fix quickly. The bad thing is that this often happens unpredictably, leaving no chance to move the virtual machines to a safe place first. If you have shared storage, you won't lose instance data; however, the way to recover may be pretty vague. In technical terms, the procedure consists of the following steps:
  • update the host information in the DB for the recovered instance

  • spawn the instance on the compute node

  • search for any attached volumes in the database

  • look up the volume device path and connect to it by iSCSI or some other driver if necessary

  • attach it to the guest system
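The steps above can be sketched in Python as follows. This is an illustration of the order of operations only, not our actual script: the two-table schema is a simplified stand-in for the Nova database, SQLite is used just to keep the example self-contained, and the spawn, connect_volume and attach_volume callbacks are hypothetical hooks for the hypervisor and volume driver.

```python
import sqlite3


def recover_instance(conn, instance_id, new_host, spawn, connect_volume, attach_volume):
    """Recover one instance on new_host, following the five steps in order."""
    # 1. Update the host information in the DB for the recovered instance.
    cur = conn.execute(
        "UPDATE instances SET host = ? WHERE id = ?", (new_host, instance_id)
    )
    if cur.rowcount != 1:
        raise LookupError(f"instance {instance_id} not found")
    conn.commit()

    # 2. Spawn the instance on this compute node.
    spawn(instance_id)

    # 3. Search for any attached volumes in the database.
    volumes = conn.execute(
        "SELECT id, mountpoint FROM volumes WHERE instance_id = ? ORDER BY id",
        (instance_id,),
    ).fetchall()

    for volume_id, mountpoint in volumes:
        # 4. Resolve the device path, connecting over iSCSI (or another driver).
        device_path = connect_volume(volume_id)
        # 5. Attach the device to the guest system.
        attach_volume(instance_id, device_path, mountpoint)
    return volumes


def demo():
    """Run the procedure against an in-memory database with fake callbacks."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE instances (id INTEGER PRIMARY KEY, host TEXT);
        CREATE TABLE volumes (id INTEGER PRIMARY KEY, instance_id INTEGER, mountpoint TEXT);
        INSERT INTO instances VALUES (1, 'compute1');
        INSERT INTO volumes VALUES (7, 1, '/dev/vdb');
        """
    )
    log = []

    def spawn(instance_id):
        log.append(("spawn", instance_id))

    def connect_volume(volume_id):
        log.append(("connect", volume_id))
        return f"/dev/volume-{volume_id}"  # placeholder device path

    def attach_volume(instance_id, device_path, mountpoint):
        log.append(("attach", instance_id, device_path, mountpoint))

    recover_instance(conn, 1, "compute2", spawn, connect_volume, attach_volume)
    host = conn.execute("SELECT host FROM instances WHERE id = 1").fetchone()[0]
    return host, log


if __name__ == "__main__":
    print(demo())
```

Keeping the hypervisor and volume operations behind callbacks means the same sequence can be exercised against a test database before it is pointed at a real compute node.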

Solution


For this situation and the previous one, we developed a Python script that starts a virtual machine on the host where the script is executed. You can find it in our git repository: openstack-utils. All you need to do is copy the script to the compute node where you want to recover the virtual machine and execute:


You can find the instance_id using the nova list command. The only limitation is that the virtual machine's data must be available on the host system.


Of course, in everyday OpenStack usage you will run into plenty of problems that this script can't solve. For example, you may have a storage configuration that mirrors data between two compute nodes, and need to recover the virtual machine on a third node that doesn't have it on its local hard drives. More complex issues require more sophisticated solutions, and we are working to cover most of them.