Server Failure Lesson #1: Poop Happens
It appears that Murphy and his laws were in full effect this week at work. First, I got sick with a nasty case of the flu. Everything that I ate was either return to sender or express exit. As I was settling in to a day of self-pity and TV reruns, the phone started to ring with the news of my second problem.
Apparently the domain controller for our main office crashed and the IT team could not get it to come back up and stay up. So, by 10 o’clock, I was dragging my flu-ridden butt out to the office. I ended up working until 2 AM the next day.
The third problem occurred at 11 AM on day 2. I got a phone call from the tech guy out at our field office telling me that their server did not come back up when they rebooted it. So they now found themselves in the exact same position as we were in at the main office.
Things are starting to get sorted out now. We have new servers running in both locations and we are repointing everything from the old servers to the new ones. We are still getting the occasional person telling us about something that is not working and we are dealing with these as they come up.
One thing that I like to do in situations like this is try to get something positive out of them. And there are definitely some good things coming out of this whole turn of events. One of those positives is that I have learned a lot about recovering an environment and getting it running in short order.
Since I have gained about five years’ worth of experience in the past three days, I’m going to be sharing a number of these lessons with you over the next week or so. I hope that you can learn this stuff from me and not the hard way like I did.
So, the first lesson is Poop Happens! We did everything right and by the book. We did proper backups. We planned for disasters to occur. We were prepared to act in the case of a server loss. And yet, we did not count on me being sick. We were not prepared to lose two servers in such a short period of time. There were a lot of details that we just could not foresee, or if we did think of them in advance, we figured that the odds of them happening were so small that we did not plan for what we would do if they occurred.
What got us by were two key things: experience and flexibility. All of the combined experience that the team had allowed us to come up with solutions to our problems. The fact that one member in the team had tried a solution in a similar situation in the past helped to guide us to success.
The team’s flexibility was just as significant to our success. They were able to think on their feet and come up with some really unique solutions on the fly. Not only did the team think outside the box, they threw the box away! We did things that I thought we would never do.
A big thanks goes out to Kent, Jeff, Mark, John, and Mamood for all of the help and effort that you put in over the past few days. You guys rock!
5 Responses to “Server Failure Lesson #1: Poop Happens”
-
The Sys Admin Says:
November 30th, 1999 at 12:00 am
podcast/videocast with many training videos and screencasts for people working in IT or studying for their MCSE exams. They also touch on PC, Mac, iPod and Zune topics. The Daily Cup of Tech has a great series on server failures that they labeled the Server Failure Lesson
-
Matt D Says:
June 28th, 2007 at 1:32 pm
Sounds like an eventful set of days. I’m very interested in recovery from crashes like this. I work for an IT company doing SMB support and we are the primary administrators on 20+ servers around the area. Our SMB team has never really had to deal with something this intense. Lately we’ve discussed possible ways for data recovery in an event where systems went down that needed to be brought back up ASAP to try to avoid the chaos of trying to learn it all in the moment. I look forward to this series of posts!
-Matt
ps: your USB Drive series is my favorite and my drive has become so handy for me and my co-workers. It’s quite the swiss army knife of the tech world.
-
will Says:
June 28th, 2007 at 2:54 pm
Sh!t Happens,
DooDoo Occurs
-
Richard in Kunming Says:
June 28th, 2007 at 11:36 pm
Wow, that’s quite the forced march. I thought I had a heck of a day last week - got hit by a minivan while riding my bicycle to language class - but it pales compared to your events (seriously)! I think you summed it up best with your first sentence… Murphy’s Law. One of the most powerful forces in existence in my opinion.
Thanks for sharing about it, your experience is valuable,
-
Greg D. Says:
June 29th, 2007 at 3:43 pm
It is things like this that have made me a huge fan of “Bare Metal Restore” products, disk images, and mirrored disks. Time invested in areas like these pays off big when disaster strikes.
I love the war stories. Give us all the gory details. They can save us all a great deal of time and pain.

