Server Failure Lesson #9: Manage the Insanity
When you have a significant system failure like we did, things have a tendency to get a bit crazy rather quickly. There are several reasons for this, some of which include:
- You are under a lot of pressure to get things done as quickly as possible
- People outside of the IT department may have little or nothing to do
- Many tasks can be performed by only one person in the company
- Everyone believes that their own need is the most important and should be handled first
Because this is such a stressful time, it is important to keep a clear head and manage the situation as quickly as possible. I have put together a few key action items that you can do when this happens to you so that you can keep things on track.
Appoint a User Representative
Invite a user who has little or nothing to do because of the system downtime to act as a liaison between the user population and the IT team. All users go through this representative with any issues and problems that they are having, and the liaison compiles user needs and communicates them to the IT team.
There are several advantages to doing this:
- This frees up technical personnel to do their jobs
- The users know exactly who to go to with questions and concerns
- The users feel like they are more involved with the resolution process
- The users feel like they are being heard
- It gives one user something to do instead of being bored
- It prevents IT staff from getting sidetracked by users with less important tasks
Make a Plan of Attack
It is far too easy to just run about like a chicken with its head cut off, doing whatever comes up. Often, all you end up doing is working on the symptoms instead of solving the root of the problem.
Take some time at the very beginning of the event and assess what really needs to be done. Create a plan and then put that plan into action.
This can be one of the most difficult things to do because your first reaction is to hit the ground running and do something. But, without knowing what that something is, you are just wasting cycles.
Prepare Contingency Plans
No matter how good your plan may be, there are going to be other things that go wrong. Try to anticipate what some of these other things may be and prepare for them.
This can be incredibly difficult to do, especially since there are literally thousands of other things that can go wrong in any given situation. But, if you understand your computing environment and your systems’ “personalities”, you will have a pretty good feel for what else might be creeping up on you in the disaster department.
Get Your Resources Together
Make sure that you have all of your required resources available to you when you start to get things together. Some of these resources include:
- consultants and their contact information
- important websites
- support information for hardware and software
- IT staff schedules
- company IT policy manuals and documentation
- software and software keys/serial numbers
- hardware inventory levels
- vendor contacts
- system recovery tools
Constantly Communicate
Everyone needs to know what is going on. Management needs to know how this will affect their business. Users need to know when they will be back up and running. Your IT staff needs to know where you are in your recovery plan. It is up to you to make sure that everyone knows what they need to know.
Here are a few ideas that you can implement to help you communicate:
- Set up a wiki that everyone can access so they know what the plan is and where you are in the plan
- Put up a whiteboard where people can add their issues and where they can be checked off
- Plan a five-minute “what’s new” meeting every four hours with your team
- Deliver a written update to management every four hours
- Give everyone two-way radios
Be Careful with ETAs
As a rule, I try not to provide time estimates for the completion of any task in an emergency or disaster situation. There are two good reasons for this:
- I do not have the time or the luxury to perform an appropriate analysis to give a true and accurate time estimate
- People take time estimates as time deadlines. Regardless of what you say, they will hear what they want
Set Up a War Room
This does not have to be an actual room, but it should at least be a central place where everyone knows they can get answers and information.
Prioritize and Triage
It is important to set priorities and determine which tasks need to be performed in what order. The hard part of this is that you need to tell people that their issues are going to have to wait until later. They are not going to like this but it is critical to get things up and running as soon as possible.
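As a rough illustration of the triage step, the sketch below sorts a list of outstanding issues so that anything blocking the recovery itself comes first, followed by whatever is holding up the most users. The Issue fields, the sample issues, and the ranking rule are all hypothetical. They show one possible way to order the work, not a method taken from our actual recovery.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # All fields are illustrative assumptions, not taken from the post.
    title: str
    users_blocked: int     # how many people cannot work until this is fixed
    blocks_recovery: bool  # True if other recovery tasks depend on this one


def triage(issues):
    """Return the issues in the order they should be worked on.

    Issues that block the recovery itself come first; within each group,
    issues that block more users come earlier.
    """
    return sorted(
        issues,
        key=lambda i: (not i.blocks_recovery, -i.users_blocked),
    )


# Hypothetical sample data.
issues = [
    Issue("Restore one user's desktop wallpaper", 1, False),
    Issue("Bring the file server back online", 40, True),
    Issue("Reconnect the shared printer", 12, False),
]

for rank, issue in enumerate(triage(issues), start=1):
    print(rank, issue.title)
```

Writing the rule down, even this crudely, gives you something concrete to point to when you have to tell someone that their issue is going to wait.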
Keep IT Staff Busy
Your IT team wants to be a part of the solution. If you want to frustrate them, have them sit there bored while you run around like a maniac.
The best way to keep your staff busy is to know what their technical strengths are and let them use these skills.
If you do not need all of your staff for a short period of time, send them out for a coffee break. If you do not need them for a longer period of time, send them home for some sleep and then have them return when you need them. This way, people will be rested and able to operate at peak performance.
3 Responses to “Server Failure Lesson #9: Manage the Insanity”
-
RevFry Says:
July 6th, 2007 at 12:17 pm
I think by now we all realize you had some problems with your server. Can we move on to other topics now?
Rev
-
Nate Says:
July 6th, 2007 at 1:44 pm
Tim,
I think this is one of the best articles of this series. I’m really grateful that you’ve taken time out of your schedule to provide such valuable information freely. I am confident in your ability to decide when to move to other topics, as opposed to someone who has no idea what you are about to post next.
Thanks.
-
Greg Says:
July 9th, 2007 at 1:27 pm
Two way radios may not always be great. Sometimes they can get very annoying, but it is good for keeping everyone in the know. The staff just need to remember that they need to keep on their tasks and listen selectively.
