Server Failure Lesson #7: Be Redundant, Be Redundant
In my mind, redundancy is one of those “Well, duh!” ideas that you just do. How great is it to have two of something when one breaks?
But, the problem with redundancy is that it can be very expensive, especially when you are talking about hardware. Redundant drives, redundant power supplies, redundant network connections…it all adds up.
I had a very interesting conversation with our company’s CEO immediately after we lost our systems. He asked me how much a new server costs. I know most of you right now are rolling your eyes because this can be the equivalent of asking how long a piece of string is. But, to avoid the obvious “it all depends” conversation I said, “About $16,000.” He took one look at me and said, “We’ve lost more than that in productivity today alone! We need to look at having a redundant system in place.”
So, this is what I came up with.
Note: This is a preliminary design that is still in the works. I would love to hear some feedback on this and get some improvements/suggestions on how to make it better. (You know, that whole collaborative Web 2.0 thing!)
There are two things that I want to accomplish with this solution:
- Keep services available in the event of a software failure
- Keep services available in the event of a hardware failure
What I have come up with is two servers that work together to provide a pool of resources to the system. This means that the servers are no longer working alone. Rather, they are working as a group or team. This is not exactly server clustering but it does provide some of the same benefits. This system will work something like this:

Each server will be configured to work with VMWare ESX server. If you are not familiar with ESX server, it is essentially a very small layer that gives the system the ability to run multiple operating systems on the hardware at the same time. In my mind, this was crucial for VMWare to do because the one big issue that I had with VMWare was the large overhead that was required by the hosting operating system. It seemed too inefficient to me.
With ESX server, this issue disappears. While it is true that there is still a bit of overhead required to run the ESX services, it is significantly less than what is required for an entire OS.
I can then install two virtual systems on one physical server. These operating systems will be built into a cluster configuration for even further redundancy.
Now, let’s assume that we have a service that crashed on Virtual Server 1. The clustered configuration would automatically move the service over to Virtual Server 2, removing any downtime that the end user would have experienced.
Next, let’s assume that a virtual server crashes completely. In this instance, Virtual Server 1 would automatically take over all of the services that were on Virtual Server 2. We could restart Virtual Server 2 either on the original hardware or on a second physical server.
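To make the takeover logic concrete, here is a minimal sketch in Python. It is only an illustration of the idea and not how any real clustering product works internally. The `Node` class, the `fail_over` function, and the service names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node, for example Virtual Server 1."""
    name: str
    healthy: bool = True
    services: set = field(default_factory=set)


def fail_over(failed: Node, nodes: list) -> Node:
    """Mark a node as failed and move its services to the first healthy peer."""
    failed.healthy = False
    survivors = [n for n in nodes if n.healthy and n is not failed]
    if not survivors:
        raise RuntimeError("no healthy node left to take over")
    target = survivors[0]
    target.services |= failed.services  # the survivor now runs everything
    failed.services = set()
    return target


# Service names are made up for illustration.
vs1 = Node("Virtual Server 1", services={"mail", "files"})
vs2 = Node("Virtual Server 2", services={"database"})

# Virtual Server 2 crashes; Virtual Server 1 picks up its services.
target = fail_over(vs2, [vs1, vs2])
print(target.name, sorted(target.services))
```

A real cluster would also need health checks to detect the crash and shared storage so the surviving node can reach the data. Neither is modelled here.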

We could also take snapshots of each server at a specific moment in time. This way, if we had a failure in a server because of a driver update or other software installation, we could instantly roll the server back and be back up and running in seconds.
Finally, let’s assume the worst has happened and we lose a complete server. VMWare has a product called VMWare High Availability which will automatically move the systems away from the dead server and host them on the system that is still running. Users may see a short interruption while the virtual machines restart, and some slowdown afterward, but they should not experience an extended outage.
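Whether users see a mild slowdown or something much worse depends on how much spare capacity the surviving server has. The check below is a rough sketch of that arithmetic. The host names, the memory figures, and the function `survives_host_loss` are assumptions for illustration, and it only compares totals, so it ignores how individual virtual machines would actually fit on each host.

```python
def survives_host_loss(host_capacity: dict, vm_load: dict) -> dict:
    """For each host, report whether the remaining hosts could carry
    the total VM load if that one host were lost."""
    total_load = sum(vm_load.values())
    result = {}
    for lost in host_capacity:
        remaining = sum(cap for name, cap in host_capacity.items() if name != lost)
        result[lost] = total_load <= remaining
    return result


# Hypothetical numbers: memory in GB on two physical hosts.
hosts = {"esx1": 32, "esx2": 32}

# Comfortable case: both virtual servers fit on one host.
light = survives_host_loss(hosts, {"vs1": 12, "vs2": 12})

# Tight case: losing either host leaves too little memory.
heavy = survives_host_loss(hosts, {"vs1": 20, "vs2": 20})

print(light)
print(heavy)
```

With two hosts, this means neither one can be loaded past roughly half of the combined capacity if the design is to ride out a hardware failure.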

As I stated earlier, this is a very preliminary concept and I still have a lot of reading and research to do on this but I think that I have the start of a system that will help increase uptime and keep users working.
5 Responses to “Server Failure Lesson #7: Be Redundant, Be Redundant”
-
Salva Says:
July 4th, 2007 at 9:46 am
Hello.
The only thing I miss in your early configuration to be a little bit more complete is a SAN, probably the most expensive part of the configuration. As far as I know ESX with High Availability only works “on the fly” if the server images are stored on a storage area network properly configured to be accessible from all the servers at any time. Evidently this SAN has to be also redundant (2 power sources, 2 controllers, as many disks as you want and 2 data paths for each server you want to connect).
When you consider the possibilities I recommend you have a look at the CXFS product from SGI here:
http://www.sgi.com/products/storage/tech/file_systems.html
And of course feel free to e-mail me if you want any further info.
-
Brent Says:
July 4th, 2007 at 11:15 am
We are using the latest version of ESX on one of my projects. I believe High Availability requires a SAN.
We have been very happy with VMWare as a whole. We have 6 servers running on 3 platforms where the VMWare servers are working as a cluster.
One very nice feature is that we can pull the VMWare images into our lab environment and test changes without affecting production.
-
Josh Says:
July 4th, 2007 at 2:55 pm
Definitely some positive steps.
It’s always going to be a challenge to know when you’ve gone from situation A), having your machines running with enough extra capacity that you could limp along with one machine down for a few days if you had to, to situation B), where all servers are operating near capacity and trying to consolidate will slow things down so much that it’s almost as bad as not having any services. So I highly recommend having a spare ESX server running (actually running, not just in a closet somewhere) to take over if necessary. You’ll set it up N+1, so it’s more cost effective if you have more ESX servers running than just two. For example, have 5 mid-range servers, one as a spare, instead of 3 high-end servers with one as a spare.
-
Tim Fehlman Says:
July 4th, 2007 at 8:11 pm
I am planning on implementing a SAN in my new design (as you will see in a future post).
Tim
-
jb Says:
September 19th, 2007 at 10:52 am
Go get Openfiler, set up a lightweight iSCSI SAN and start testing.

