OK. As you may have noticed, I have not really been on the ball when it comes to posting lately. I have resorted to lamely posting links to “cool” websites which I guess puts out a post but it does not really give you anything that you couldn’t get by searching through Google. Once again, I apologize for my unannounced absence.
It All Began So Innocently…
Here is what transpired to tear me away from my beloved blog. I was supposed to go to a satellite office on Tuesday to make sure everything was running fine over there while their tech was away at training. It was going to be a quick “Hey, how ya doin’? Everything runnin’ fine?” trip. I was planning for a long day (the office is a three-hour drive each way) but I expected that I would be home by 8 o’clock that night.
I was about five minutes from the satellite office when I got the phone call. A primary application was down. No big deal. There were any number of reasons that a user could not access this application. As I pulled into the office, I could not have known that this was the beginning of four very long and stressful days.
I went and checked the server that housed the application. The application was on its own server, so no other systems were affected. I tried to log into the application with no luck. So, when in doubt, reboot! Still no luck. I checked the service and it was refusing to start up. This is where things began to get weird.
I looked at the D: drive that housed the data. I attempted to save a text file to the D: drive but I got write errors. It started to look like I was in for a bit of a challenge. Time to check out the server itself.
We Might Have a Problem…
The first thing that caught my eye was that one of the drives (Drive 0) was showing an amber status light instead of the normal green. This usually indicates that the drive has failed. No worries, because I had the two drives in a mirrored configuration.
For those of you who are not familiar with drive mirroring, this is how it works: You have two physical hard drives that work in unison. They appear as a single unit from the operating system’s perspective. (See the diagram below.) This means that when I write a file to the D: drive, for example, it will actually be written to both Drive 0 and Drive 1. There is a layer in between which acts as a “translator” between the hardware and the operating system. This layer creates containers that the operating system sees as separate partitions or drives.
If a single drive fails, the system can continue to run because data is simply read from the remaining drive. The failed drive can then be replaced, sometimes without even turning off the computer if the drives are hot-swappable, and the system will automatically rebuild the mirror and carry on its merry way.
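To make that “translator” layer a little more concrete, here is a toy Python sketch of the idea. It is purely illustrative; the Drive and Mirror classes are invented stand-ins for this post, not how a real RAID controller actually works:

```python
class DriveFailure(Exception):
    """Raised when a simulated physical drive has failed."""

class Drive:
    """A toy stand-in for one physical disk: a dict of block -> data."""
    def __init__(self):
        self.blocks = {}
        self.failed = False

    def write(self, block, data):
        if self.failed:
            raise DriveFailure(block)
        self.blocks[block] = data

    def read(self, block):
        if self.failed:
            raise DriveFailure(block)
        return self.blocks[block]

class Mirror:
    """The 'translator' layer: one logical drive backed by two physical ones."""
    def __init__(self, drive0, drive1):
        self.drives = [drive0, drive1]

    def write(self, block, data):
        # Every logical write goes to every healthy drive.
        for d in self.drives:
            if not d.failed:
                d.write(block, data)

    def read(self, block):
        # Read from the first healthy drive we find.
        for d in self.drives:
            if not d.failed:
                return d.read(block)
        raise DriveFailure("both drives down")

d0, d1 = Drive(), Drive()
mirror = Mirror(d0, d1)
mirror.write("sector-7", b"application data")
d0.failed = True                 # Drive 0 shows the amber light...
print(mirror.read("sector-7"))   # ...but the data survives on Drive 1
```

The operating system only ever talks to the Mirror object, which is why a single failed drive should be invisible to it.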
I wanted to make sure that I wasn’t about to commit a major faux pas so I contacted the manufacturer’s tech support. They agreed that it looked like a drive failure was the issue. It would take up to ten working days to get a new drive out to me but it should continue to run just fine with only one drive. (Just hope that the other drive does not fail during those two weeks!) Pop out the old drive and everything should be fine.
Worst Case Scenario Realized…
How wrong can one person be? The second that I pulled out the drive, all of the data on D: disappeared! I would have died had I not remembered that we backed up the database daily! Which got me thinking. So I went to the backup server and promptly lost it when I noticed that the backup had been copying the same stale data for several months because the database export function had locked up without reporting any errors!
So, at that moment I realized the following:
- I just watched all my data disappear into the ether
- The last backup of the MIA data is several months old
- I am completely ignorant when it comes to the database and how it saves the information on the server
- Anywhere but here is sounding really nice about now
I thought that maybe if I put the hard drive back into the system, the data would come back. Alas, this was not the case! I decided to call the hardware manufacturer’s tech support again to see if they had any more bright ideas (I was obviously desperate at this point). While I was waiting on hold for tech support, the data suddenly reappeared! It was a miracle and a full 45 days before Christmas!
Everyone Remain Calm…
I immediately attempted to copy everything from that drive to another system. I got about 1% done and the data disappeared again! I was not impressed. But, after a few minutes of giving the server the evil eye, the data reappeared. Only to disappear again a few minutes later! How was I going to get this data off the system? Suddenly, the bulb went on in my head!
I wrote a quick batch file that recursively called robocopy. I then let it run overnight in hopes that when I returned the next morning, I would have all of the data.
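I no longer have the original batch file, but the idea was simply “keep sweeping the drive and retrying until everything has copied.” Here is a rough Python equivalent of that loop, written for illustration (the paths are made up, and robocopy itself handles some of this for you):

```python
import shutil
import time
from pathlib import Path

def copy_until_done(src, dst, max_passes=100, delay=5):
    """Repeatedly sweep the source tree, copying anything not yet copied.

    A transient read error (the drive vanishing mid-copy) just means the
    file gets retried on the next pass instead of aborting the whole job.
    """
    src, dst = Path(src), Path(dst)
    for _ in range(max_passes):
        pending = False
        for f in src.rglob("*"):
            if not f.is_file():
                continue
            target = dst / f.relative_to(src)
            if target.exists() and target.stat().st_size == f.stat().st_size:
                continue  # already copied on an earlier pass
            try:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)
            except OSError:
                pending = True  # drive flaked out again; retry next pass
        if not pending:
            return True
        time.sleep(delay)
    return False

# Hypothetical usage, run overnight:
# copy_until_done(r"D:\appdata", r"E:\rescue")
```

The trick is that each pass skips what already made it across, so every window where the drive reappeared moved the job a little further along.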
Luckily, this worked out just fine. I’m not sure how many times the copy hit errors and failures, but in the end I was able to get everything off the drive! I could hear the bullet whiz by as I dodged it!
Now, what do I do with this data? I know absolutely nothing about the database. I reluctantly contacted the manufacturer of the application that crashed. They had not been very helpful in the past because they only make the application, and they do not support the database that it runs on. But, to my surprise, when I contacted them, I discovered that they had been purchased by another company which not only supports this database but also hosts the application for companies that do not have the in-house technical skills to properly manage it.
It took an entire day, but by the time I got back home that night (I decided to make the three-hour trip back so that I could spend a couple of hours with my family), we had a plan in place!
Here’s the Plan…
At 4:00 AM Thursday morning, I headed back to the satellite office. I packed up the entire server and put it in the back seat of one of the 4×4 trucks that the company owns. The roads were kind of icy and everyone wanted me back in one piece (or at least the server). I then traveled another three hours to a different city where the company I hired had their data hub. I met the recovery team that I hired. Miracle of miracles, we were able to get all of the data off the server and onto their ASP hosting. By this time, it was getting late so I slept in the city for the night.
Bright and early the next morning (Actually, it wasn’t really all that bright. I was up at least three hours before the sun was!), I was back on the road to the satellite office. I needed to wait for the recovery team to properly set up the data. Once they confirmed that everything was ready to go, I needed to modify all of the computers on the network in order for them to point to the new server instead of the old (and dead) server.
I simply modified a version of my Modify Every Computer on the Network script to change all of the computers. On each machine it was just a matter of altering a couple of text files to point at the new server.
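The substitution itself is nothing fancy. A hedged Python sketch of the per-file change (the config file name and both server names here are invented for illustration):

```python
from pathlib import Path

def repoint_server(config_path, old_server, new_server):
    """Replace every reference to the dead server with the new host.

    Returns True if the file actually changed, so a wrapper script can
    log which machines were touched.
    """
    path = Path(config_path)
    text = path.read_text()
    if old_server not in text:
        return False
    path.write_text(text.replace(old_server, new_server))
    return True

# Hypothetical usage on one machine:
# repoint_server("app.ini", "OLDSRV01", "NEWHOST01")
```

Run that against the right couple of files on every computer and they all quietly start talking to the new server.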
There were a number of remote systems that were not connected to the network. When the tech for the satellite office returned from training, he had a quick installer, also created using AutoIt, waiting for him. He could simply execute the installer on each system and then move on to the next.
I Love It When A Plan Comes Together…
I have to admit, it was a pretty rough week. Here’s how things summed up, by the numbers, once all was said and done:
- 4 - number of days of sheer panic
- 12 - number of hours of sleep I got over the four days
- 18 - number of hours I spent driving
- 3 - number of meals I actually ate (all at Subway)
- 3 - number of different cities I slept in over the four days
- 2000 - approximate number of kilometers I drove
With that being said, it ended up being a really rewarding experience. My boss and the president of the company have each recognized my efforts and that goes a long way.
Special shout-outs to Trent for organizing the team that helped recover the data, Dave for managing the data recovery, Keith for thinking outside the box when it came to getting information from an extremely flaky system, and Naeem for the actual data recovery and transfer to a new system. You guys are the best!