Server Failure Lesson #5: Server Tendrils Go Deep
When a central computer goes down, it can be a huge problem because the effects are so far-reaching. Our server was not only a domain controller; it also handled file (both user and workgroup data), print, DNS, DHCP, and RADIUS services, along with a few more esoteric functions.
Each one of these functions caused us a different problem, and together they made the end users’ lives pretty miserable for a few days: practically nothing was where they were used to finding it, and what they were able to get going was horrifically slow.
There were a lot of obvious problems. Files could not be found on the mapped network drive, Internet access was lost because DNS was unavailable, and some users lost the network completely because they could not get an IP address from DHCP.
But there were a lot of unexpected problems that came up as well simply because the server was so deeply embedded into each workstation’s operating system. These are some of the big “who would have thought” moments that we experienced.
Secondary DNS Slows Down Network
Most of us put a secondary DNS server into our TCP/IP settings when we are configuring DHCP or static IP addresses. The purpose of this secondary DNS setting (as I understand it) is to provide name resolution in the event that the primary DNS fails. So I was quite surprised when people started telling me that the Internet was slow even though they were not at the location where the server had failed.
As soon as I reconfigured DHCP to point to a different secondary DNS server that was still up and running, all of the systems immediately sped up (once they renewed their IP address). I’m not exactly sure why this works this way (unless I understand the purpose of the secondary DNS server wrong) but it sure made my life easier.
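A quick way to catch this kind of problem is to check every DNS server that DHCP hands out, not just the primary. The following is a minimal sketch in Python, not a definitive tool: the function names and the IP addresses in the usage note are hypothetical, and a TCP connection to port 53 is only a rough stand-in for "this server answers DNS queries".

```python
import socket


def tcp_probe(ip, port=53, timeout_s=2.0):
    """Return True if a TCP connection to ip:port succeeds within timeout_s.

    Assumption: the DNS server accepts TCP on port 53. A server that only
    answers over UDP would be reported as down by this probe.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def unreachable_dns_servers(servers, probe=tcp_probe):
    """Return the servers that fail the probe, in the order they were given."""
    return [ip for ip in servers if not probe(ip)]
```

Calling `unreachable_dns_servers(["10.0.0.1", "10.0.0.5"])` with the primary and secondary addresses from a DHCP scope would list any that no longer respond. The `probe` parameter is there so the check can be swapped out or faked without touching the network.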
Folder Redirection Bites Me In The Butt…
One of the things that I was quite proud of when I built the network initially was that I automatically redirected each user’s My Documents to their share on the network. This way, they could use My Documents as they normally do, and all I would have to do is back up the server, since none of their documents would be stored locally (we have a policy that prohibits users from saving files elsewhere on their systems).
Unfortunately, when the server disappeared, so did all of the users’ documents. We were able to quickly give them a shortcut on their desktop to their new documents, but because their original My Documents, pointing to the now-defunct server, was on their Start menu, it was excruciatingly slow.
Plus, because we used a group policy to apply the folder redirection, we were not able to change the location of their My Documents until we could change the policy and force it down to the machine (more about this in a future lesson).
…With The Help of Office
The other thing about folder redirection is that this network location now becomes the default location for a number of different applications. I found references to the old server in Word, Excel, PowerPoint, and Publisher. There are probably others as well.
Because of how these applications are written, they look to see if these network locations are available prior to giving the user access to the applications. Thus, each time a user opened up a Word or Excel file, they had to wait for the timeout period before they could use the program.
This timeout wait period was also seen when trying to open or save files from the File menu because it was waiting for either the network location to become available (which was never going to happen) or for the timeout to expire. Very frustrating!
Network Printers
Just as the folder redirection caused problems with opening and saving files, network print queues that were stored on the dead server caused similar problems for printing. Whenever an application needed to print, the program would have to wait for the network printers to become available or for the timeout to be reached. Since the printers were not coming online anytime in the near future, waiting for the timeout each time they printed became extremely painful.
Most Recently Used Lists
Most recently used lists are handy when you want to open up a file that you just worked on. They are standard both within the operating system and within applications. Unfortunately, when these MRUs (as they are sometimes called) point to a file or location that does not exist, the system sometimes has to wait for a timeout before giving you access to the list.
References in the Registry
I did a search of the registry for the name of the now dead server on one workstation. I found literally thousands of references to the system. There were so many that it would literally be impossible to check each and every one to determine what it does. I am sure that there have to be some of these entries that are causing system slowdown and system errors.
As a side note, I am working on a test system and doing a complete search for the old server name in the registry and replacing it with the new server name just to see if this will resolve some of the problems. I’ll keep you posted.
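Before replacing anything, it can help to see where those references cluster. The following is a minimal sketch in Python that tallies matches per registry key in an exported .reg file. It assumes the export has already been read into a string, and the function name is hypothetical. A plain text search like this will miss values that the export stores as hex bytes (the `hex:` style lines), so treat the counts as a lower bound.

```python
from collections import Counter


def count_references(reg_text, server_name):
    """Count lines that mention server_name, grouped by registry key.

    reg_text is the decoded text of a .reg export. A key line looks like
    [HKEY_...\\Some\\Key]; the value lines that follow belong to that key.
    The match is case-insensitive and covers both key names and values.
    """
    needle = server_name.lower()
    counts = Counter()
    current_key = None
    for raw_line in reg_text.splitlines():
        line = raw_line.strip()
        if line.startswith("[") and line.endswith("]"):
            current_key = line[1:-1]
        if current_key is not None and needle in line.lower():
            counts[current_key] += 1
    return counts
```

Sorting the result with `counts.most_common(20)` gives a short list of keys to inspect by hand, which is more manageable than walking through thousands of individual hits.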
References in Files
Just as there are locations in the registry that refer to the old server, it is very easy for there to be references to the old server in files. A perfect example is Excel, where you can place data in a workbook that is sourced directly from another data file. None of these files will work properly anymore because the source files no longer exist in their original location.
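A rough way to find such files is to scan a folder tree for the old server name in the raw bytes of each file. The following is a minimal sketch in Python under several assumptions: the function name is hypothetical, the server name is plain ASCII, each file is small enough to read into memory, and the name is stored uncompressed. Formats that compress their contents will hide the name from this kind of search.

```python
import os


def files_mentioning(root, server_name):
    """Yield paths under root whose raw bytes contain server_name.

    Checks both single-byte (ASCII) and UTF-16-LE encodings of the name,
    case-insensitively for ASCII letters. Unreadable files are skipped.
    """
    lowered = server_name.lower()
    needles = [lowered.encode("ascii"), lowered.encode("utf-16-le")]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read().lower()
            except OSError:
                continue
            if any(needle in data for needle in needles):
                yield path
```

Running `list(files_mentioning(some_folder, "OLDSRV"))` against a copy of the user shares would give a starting list of documents to open and repoint, though it cannot say what kind of link each file contains.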
Just the Tip of the Iceberg
These are just the things that we have caught over the past couple of days. I am sure that, as time progresses, we will continue to find more references to the old server on workstations.
3 Responses to “Server Failure Lesson #5: Server Tendrils Go Deep”
-
Aaron Says:
July 2nd, 2007 at 9:49 am
As I was reading this, I thought to myself, “he sure has a bunch of critical services running on one box.” Having been in education for most of my professional IT career, I have had the good fortune of being able to spread crucial services around to several different servers, reducing my single points of failure. Reading this brought back vivid memories of single-server shops I’ve been at where, if that server got borked, the business processes would literally shut down. How do you explain to management what the problem is (concentrating too many services in one server) and get it corrected before a problem like this happens? I’m sure that now you would have no problem showing the management at your location that you need some cash for a few extra servers, now that they have experienced a problem of this magnitude.
How did you/do you think you will deal with this?
-
Ingo Says:
July 3rd, 2007 at 3:23 pm
When the primary DNS server is not responding, the client tries 3 times to reach it before it moves on to the second one.
This could cause the delay you mentioned.
-
Bryan Sullo Says:
November 5th, 2007 at 4:12 pm
How did you recover from the folder redirection issue?
I am in a similar situation.
I have users who have modified their offline My Documents and I am now wary to change the redirection to a new server.
What will happen?
Will the offline files from the PCs be written to the new location?
Will the contents of the new location take precedence and kill all the offline changes?
Will there be some sort of conflict that has to be resolved on each PC?
