Just realised I haven't written anything here for a while, which is remiss of me. I seem to have talked quite a bit previously about the virtualisation setup I'm using at work, so I figure I'd better do a quick update. Right, first thing: it works! I have proof! Our main file server's power supply went "Fzzzt!" and made a big green flash and (I'm assured) a nasty smell on Thursday, stopping our server dead. It also managed to fuse the plug connected to the UPS. So, that's our main file server out of action in a way that implies it'll need to be taken apart, the melted bits and bobs removed, and replacements ordered and fitted over the course of a week or so. Not good. Not to worry, though! All our servers are running as virtual machines, and all those virtual machines are being mirrored between separate bits of physical hardware, i.e. we have two machines with a copy of each virtual machine on. So, all we had to do to get the file server back online was to start the virtual machine up on the physical server that was still running. Well, actually, we had to sort that fused plug and UPS out first, but the whole thing still took under half an hour.
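For the record, bringing the machine back up on the surviving host boils down to two commands: promote the DRBD mirror to primary, then boot the Xen guest. A rough sketch, assuming a DRBD resource and Xen config file named `fileserver` (names are illustrative, not our actual setup):

```shell
# On the surviving host: take over as primary for the mirrored block device,
# so it becomes writable on this machine. (With the peer dead, DRBD will be
# in a disconnected state but the local copy is up to date.)
drbdadm primary fileserver

# Then just boot the guest from its usual Xen domain config.
xm create /etc/xen/fileserver.cfg
```

Most of the half hour was the plug and the UPS; the software side really is that quick.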
This quick turn-around was enabled by DRBD, a Linux kernel module that allows you to mirror a block device over an ethernet connection to another machine. This is the nice thing about Linux, the way its various sub-systems tend to work as "stacks" of protocols. There's the network stack, obviously, that's the one that everyone's used to thinking of. But there's the "disk protocol stack" too, which allows you to swap bits in and out as you like to accomplish what you need. We now have a disk stack with physical disks at the bottom, then software RAID (mdadm) on top of that to provide some resilience to hard drive failure and some performance gains, then Linux Logical Volume Management to allow for easy partitioning and resizing of disk space and, importantly, snapshot backups of volumes, then DRBD on top of that to mirror volumes between physical machines to provide more resilience to hardware failure. We might slot in some encryption there, too, over the summer break to keep our important data safe, probably using dm-crypt.
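Building that stack from the bottom up looks roughly like this. Device names, volume names and sizes here are made up for illustration, not our actual layout:

```shell
# 1. Software RAID: mirror two physical disk partitions into one md device.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# 2. LVM on top of the RAID device: one volume group, then a logical
#    volume carved out per virtual machine.
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate --name fileserver --size 100G vg0

# 3. DRBD on top of the logical volume: the LV is named as the backing
#    "disk" in /etc/drbd.conf, and this brings the mirror online.
drbdadm up fileserver

# The snapshot backups come for free from the LVM layer underneath:
lvcreate --snapshot --name fileserver-backup --size 10G /dev/vg0/fileserver
```

The virtual machine then uses the DRBD device (`/dev/drbd0` or similar) as its disk, so every write it makes lands on both physical machines.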
The above setup means that your machines need to be able to communicate with each other for DRBD to be able to do its thing, which proved to be much simpler if each server had a fixed IP address, so the previously discussed setup where each physical machine simply appeared on the network and reported to a central server hasn't quite come about. No worries, I can now get a server up and ready to run virtual machines in a couple of hours flat. That's the other thing - we've now moved to CentOS 5.1, as Xen on Ubuntu 8.04 simply kept on crashing.
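Those fixed addresses end up baked into the DRBD resource definition, which is why dynamic addressing made life awkward. A minimal resource in /etc/drbd.conf looks something like this (hostnames, addresses and device paths all illustrative):

```
resource fileserver {
  protocol C;
  on server1 {
    device    /dev/drbd0;
    disk      /dev/vg0/fileserver;
    address   192.168.1.10:7788;
    meta-disk internal;
  }
  on server2 {
    device    /dev/drbd0;
    disk      /dev/vg0/fileserver;
    address   192.168.1.11:7788;
    meta-disk internal;
  }
}
```

Protocol C makes DRBD wait for the write to hit both machines before reporting it complete, which is the safe choice when the whole point is surviving the loss of one box.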
But wait - aren't you meant to use a fancy (and costly) SAN setup for ensuring virtual machine up-time? Well, maybe, but I don't think so in our situation, and I think our situation holds for many others, too. We're a school, not a business as such, and we don't have stacks of money to spend on IT equipment - we spend money on people, teachers, not equipment.

A single SAN device would give us fail-over if a processing machine went down, i.e. if one of your processing servers conks out you simply switch its virtual machines to another one and carry right on. However, this doesn't cover you for failure of the SAN server itself - a power supply blow-out, for instance. So you wind up wanting to have two SAN devices, mirroring between them, and you want those in separate physical locations in case of fire/flood/etc, and all this necessitates a fancy (and definitely costly) networking setup between said separated physical locations, probably involving fibre channel connections. And all, when you think about it, for the privilege of being able to have a loose coupling between your processor(s) and hard drives. That's all a SAN actually provides, generally at the expense of some performance over what you'd get with a processor sitting right next to a decent RAID controller and a stack of hard drives.

And look at today's processor and memory prices and a typical machine's actual processor usage. In our school, most of our servers' jobs are basically file serving, there's hardly any actual processing to be done. Shifting files around and squirting them down network connections hardly takes up a processor's time much these days, especially one of your fancy quad-core jobbies. So, basically, a decent quad-core processor (or two) and, say, 8GB of RAM isn't going to set you back more than around £500 these days, and would easily handle the processing requirements of half a dozen virtual file servers.
No need to bother spending stacks of money de-coupling the processor and harddrives, just spend the money on a really decent RAID controller.