Long Time, No…Post?
OK. As you may have noticed, I have not really been on the ball when it comes to posting lately. I have resorted to lamely posting links to “cool” websites which I guess puts out a post but it does not really give you anything that you couldn’t get by searching through Google. Once again, I apologize for my unannounced absence.
It All Began So Innocently…
Here is what transpired to tear me away from my beloved blog. I was supposed to go to a satellite office on Tuesday to just make sure everything was running fine over there while their tech was on training. It was going to be a quick “Hey, how ya doin’? Everything runnin’ fine?” trip. I was planning for a long day (the office was three hours each way) but I expected that I would be home by 8 o’clock that night.
I was about five minutes from the satellite office when I got the phone call. A primary application was down. No big deal. There were any number of reasons that a user could not access this application. As I pulled into the office, I could not have known that this was the beginning of four very long and stressful days.
I went and checked the server that housed the application. The application was on its own server so there were no other systems that were affected. I tried to log into the application with no luck. So, when in doubt, reboot! Still, no luck. I checked the service and it was refusing to start up. This is where things began to get weird.
I looked at the D: drive that housed the data. I attempted to save a text file to the D: drive but I got write errors. It started to look like I was in for a bit of a challenge. Time to check out the server itself.
We Might Have a Problem…
The first thing that caught my eye was the fact that one of the drives (Drive 0) was showing an amber status light instead of the normal green. This usually indicates that the drive has failed. No worries because I had the two drives in a mirrored configuration.
For those of you who are not familiar with drive mirroring, this is how it works: You have two physical hard drives that work in unison. They appear like a single unit from the operating system’s perspective. (See the diagram below.) This means that when I write a file to the D: drive, for example, it will actually write it to both Drive 0 and Drive 1. There is a layer in between which acts as a “translator” between the hardware and the operating system. This layer creates containers that the operating system sees as separate partitions or drives.
If a single drive fails, the system can continue to run because all reads and writes then go to the one surviving drive. The failed drive can then be replaced, sometimes without even turning off the computer if the drives are hot swappable, and the system will automatically rebuild the mirror onto the new drive and carry on its merry way.
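For the code-minded, the idea can be sketched in a few lines of Python. This is a toy model only, not how a real RAID controller works: the class name `MirroredVolume` and its methods are made up for illustration, and each "drive" is just a dictionary held in memory.

```python
class MirroredVolume:
    """Toy mirror: every write goes to all healthy member drives."""

    def __init__(self, drive_count=2):
        # Each "drive" is a dict mapping a block number to its data.
        self.drives = [dict() for _ in range(drive_count)]
        self.healthy = [True] * drive_count

    def write(self, block, data):
        if not any(self.healthy):
            raise IOError("no healthy drives left")
        # The same data lands on every healthy drive.
        for drive, ok in zip(self.drives, self.healthy):
            if ok:
                drive[block] = data

    def read(self, block):
        # Any healthy drive can serve the read.
        for drive, ok in zip(self.drives, self.healthy):
            if ok:
                return drive[block]
        raise IOError("no healthy drives left")

    def fail(self, index):
        # Simulate a dead drive: mark it unhealthy and lose its contents.
        self.healthy[index] = False
        self.drives[index] = {}

    def replace_and_rebuild(self, index):
        # Copy everything from a surviving drive onto the replacement.
        survivors = [d for d, ok in zip(self.drives, self.healthy) if ok]
        if not survivors:
            raise IOError("nothing left to rebuild from")
        self.drives[index] = dict(survivors[0])
        self.healthy[index] = True
```

In this model, writing a block, failing Drive 0, and then reading the block back still works because Drive 1 holds an identical copy. That is exactly the behaviour I was counting on.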

I wanted to make sure that I wasn’t about to commit a major faux pas so I contacted the manufacturer’s tech support. They agreed that it looked like a drive failure was the issue. It would take up to ten working days to get a new drive out to me but it should continue to run just fine with only one drive. (Just hope that the other drive does not fail during those two weeks!) Pop out the old drive and everything should be fine.
Worst Case Scenario Realized…
How wrong can one person be? The second that I pulled out the drive, all of the data on D: disappeared! I would have died had I not remembered that we backed up the database every day! Which got me thinking. So I went to the backup server and promptly lost it when I noticed that the backup had been saving the same stale data for several months: the database export function had locked up without reporting any errors!
So, at that moment I realized the following:
- I just watched all my data disappear into the ether
- The last backup of the MIA data is several months old
- I am completely ignorant when it comes to the database and how it saves the information on the server
- Anywhere but here is sounding really nice about now
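In hindsight, a simple freshness check on the export file would have flagged the stale backup long before I needed it. Here is a minimal sketch in Python, assuming the export lands in a single file whose modification time changes on every successful run. The function name and the 26-hour threshold are hypothetical choices for illustration, not something that was running on that server.

```python
import os
import time


def export_is_stale(path, max_age_hours=26, now=None):
    """Return True when the export file is missing or older than max_age_hours."""
    if not os.path.exists(path):
        return True  # no export at all is as bad as an old one
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600
    return age_hours > max_age_hours
```

Run from a daily scheduled task before the backup job, a check like this turns a silent failure into a loud one: if it returns True, send an alert instead of backing up the same old file again.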
I thought that maybe if I put the hard drive back into the system, the data would come back. Alas, this was not the case! I decided to call the hardware manufacturer’s tech support again to see if they had any more bright ideas (I was obviously desperate at this point). While I was waiting on hold for tech support, the data suddenly reappeared! It was a miracle and a full 45 days before Christmas!
Everyone Remain Calm…
I immediately attempted to copy everything from that drive to another system. I got about 1% done and the data disappeared again! I was not impressed. But, after a few minutes of giving the server the evil eye, the data reappeared. Only to disappear again a few minutes later! How was I going to get this data off the system? Suddenly, the bulb went on in my head!

I wrote a quick batch file that recursively called robocopy. I then let it run overnight in hopes that when I returned the next morning, I would have all of the data.
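The batch file itself is not reproduced here. The same keep-trying idea can be sketched in Python, with the caveat that this version uses `shutil` from the standard library instead of robocopy. The function names, the pass limit, and the wait time are assumptions for illustration only.

```python
import os
import shutil
import time


def copy_pass(src, dst):
    """Copy every file not yet present in dst; return the number of failures."""
    if not os.path.isdir(src):
        return 1  # the source volume has vanished; count the pass as failed

    failures = 0

    def note_error(_error):
        # os.walk calls this when a folder cannot be listed.
        nonlocal failures
        failures += 1

    for folder, _dirs, files in os.walk(src, onerror=note_error):
        relative = os.path.relpath(folder, src)
        target_folder = os.path.normpath(os.path.join(dst, relative))
        try:
            os.makedirs(target_folder, exist_ok=True)
        except OSError:
            failures += 1
            continue
        for name in files:
            source_file = os.path.join(folder, name)
            target_file = os.path.join(target_folder, name)
            try:
                if (os.path.exists(target_file)
                        and os.path.getsize(target_file) == os.path.getsize(source_file)):
                    continue  # already copied on an earlier pass
                shutil.copy2(source_file, target_file)
            except OSError:
                failures += 1  # drive dropped out mid-copy; retry next pass
    return failures


def copy_until_clean(src, dst, max_passes=1000, wait_seconds=60):
    """Repeat copy_pass until one pass finishes with no failures."""
    for attempt in range(1, max_passes + 1):
        if copy_pass(src, dst) == 0:
            return attempt
        time.sleep(wait_seconds)  # give the flaky drive time to reappear
    raise RuntimeError("gave up after %d passes" % max_passes)
```

The important design choice is that files already copied are skipped, so each pass only has to pick up where the last one fell over. A size match is a weak check, though, so anything truly critical deserves a checksum comparison afterwards.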
Luckily, this worked out just fine. I’m not sure how many times there were errors and failures with the copy but in the end I was able to get everything off the drive! I could hear the bullet whiz by as I dodged it!
Now, what do I do with this data? I know absolutely nothing about the database. I reluctantly contacted the manufacturer of the application that crashed. They had not been very helpful in the past because they only make the application, and they do not support the database that it runs on. But, to my surprise, when I contacted them, I discovered that they had been purchased by another company which not only supports this database but also hosts the application for companies who do not have the in-house technical skills to properly manage it.
It took an entire day but by the time I got back home that night (I decided to make the three-hour trip back so that I could spend a couple of hours with my family), we had a plan in place!
Here’s the Plan…
At 4:00 AM Thursday morning, I headed back to the satellite office. I packed up the entire server and put it in the back seat of one of the 4×4 trucks that the company owns. The roads were kind of icy and everyone wanted me back in one piece (or at least the server). I then traveled another three hours to a different city where the company I hired had their data hub. I met the recovery team that I hired. Miracle of miracles, we were able to get all of the data off the server and onto their ASP hosting. By this time, it was getting late so I slept in the city for the night.
Bright and early the next morning (Actually, it wasn’t really all that bright. I was up at least three hours before the sun was!), I was back on the road to the satellite office. I needed to wait for the recovery team to properly set up the data. Once they confirmed that everything was ready to go, I needed to modify all of the computers on the network in order for them to point to the new server instead of the old (and dead) server.
I simply modified a version of my Modify Every Computer on the Network script to change all of the computers. It was a simple matter of altering a couple of text files which was a relatively easy change.
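The script itself is not shown here. As a rough sketch of the text-file change in Python, assuming the old server name appears literally in each configuration file, something like the following would do it. The function name `repoint_file` is hypothetical, as are any server names you pass in.

```python
def repoint_file(path, old_server, new_server, encoding="utf-8"):
    """Replace old_server with new_server in one text file.

    Returns the number of replacements made. A .bak copy of the original
    is written first whenever something is going to change.
    """
    # newline="" keeps the file's original line endings untouched.
    with open(path, "r", encoding=encoding, newline="") as handle:
        text = handle.read()

    count = text.count(old_server)
    if count:
        with open(path + ".bak", "w", encoding=encoding, newline="") as backup:
            backup.write(text)
        with open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(text.replace(old_server, new_server))
    return count
```

Returning the count makes it easy to log which machines were actually changed, and the .bak file gives you a way back if the new server name turns out to be wrong.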
There were a number of remote systems that were not connected to the network. When the tech for the satellite office returned from training, he had a quick installer that I created waiting for him. This installer, also created using AutoIt, allowed him to simply execute the installer and then move on to the next system.
I Love It When A Plan Comes Together…
I have to admit, it was a pretty rough week. Here’s how things summed up, by the numbers, once all was said and done:
- 4 - number of days of sheer panic
- 12 - number of hours of sleep I got over the four days
- 18 - number of hours I spent driving
- 3 - number of meals I actually ate (all at Subway)
- 3 - number of different cities I slept in over the four days
- 2000 - approximate number of kilometers I drove
With that being said, it ended up being a really rewarding experience. My boss and the president of the company have each recognized my efforts and that goes a long way.
Special shouts out to Trent for organizing the team that helped recover the data, Dave for managing the data recovery, Keith for thinking outside the box when it came to getting information from an extremely flaky system, and Naeem for the actual data recovery and transfer to a new system. You guys are the best!
6 Responses to “Long Time, No…Post?”
-
Matthew Says:
November 14th, 2006 at 4:49 am
Congrats on solving the issue! I know IT systems always surprise the hell out of you at the most inconvenient moment. Had to deal with some weird disk behaviour myself recently.
Just wondering: if the app/db is mission critical, why is your corporate management NOT hesitant about hosting the app at an external site? Would the funds needed not be better spent giving you additional training?
Just some food for thought: A business owner (a client of mine) demanded access to the admin account on the file server. Not to use it as a user, but just as assurance. It turns out he used the account to change some access rights on folders that were so hush-hush that he did not trust anyone with access to the files. Not even me as the admin (of course I couldn’t care less about the contents of the files)! He never told me there were special folders with very limited access. In daily practice this meant that backups of those folders did not work either, but the failures did not show up as errors. He had deleted everyone from the ACL except himself (even the system account).
The main problem arose when we had a major disk failure in our RAID 1 setup and decided to get two identical bigger disks as replacements. When moving data I always do a screen dump of the dir /s with the number of folders/files and their sizes just before and after a move, just for a quick visual check. It turns out that not being able to access the files also leaves those files out of the file and folder counts and their respective sizes… I never knew the files were there, and I never knew the files did not copy, I never knew… The old defective drive was destroyed, and the old working drive was wiped entirely just three weeks before the owner started looking for his files (that are not to be found).
The hard way: never give out the admin password to a person, not even the owner. I write all account info on a paper, put it in a sealed envelope and have that stored in a safe in the clients building.
My visual check was not foolproof. Any thoughts?
Should I do a ‘take-ownership’ on all files every time we do a data migration? Any ideas?
Sorry for the long reply… maybe I should start my own blog, haha. Thank you for your horror story. I am not the only one having nightly adventures behind a server screen, I guess…

-
Stefan Says:
November 14th, 2006 at 7:35 am
Tim,
good work, these kinds of stories make DCOT so much fun to read. I actually got to know your blog through your USB drive articles, which are just great.
One thing came to my attention when I saw this post in my feed reader (Google Reader, btw): It seems that you are gradually increasing the amount of advertising on your blog - this time you even included an image ad in your feed.
More ads, less posts - I think your fans and readers would like you to reverse this trend
But nonetheless, good work, and - get some rest
Stefan.
-
Tim Fehlman Says:
November 14th, 2006 at 10:31 am
Stefan,
Thanks for the positive words and the feedback. Yes, the number of ads and where they are showing up is changing. My long term goal is to be able to blog full time. This means that I need to figure out how to either have someone pay me to write my blog or have the blog generate income. Since no one is paying me to blog, I need to get the blog to generate revenue. Hence the ads.
But, there needs to be a balance. I know that it really annoys me when I need to spend a couple of minutes navigating around a website’s ads just to find a postage stamp size article. I hope that I am not doing that.
As for the fewer posts, that is going to be rectified very shortly. I am hoping to start sending out two or three posts a day, usually with interesting tech related news and stories. And, a couple of times a week, some unique content that you can only find at DCoT.
I am playing with the advertising right now so you will continue to see the ads increasing and decreasing over the next few weeks/months.
-
Tim Fehlman Says:
November 14th, 2006 at 2:10 pm
Matthew,
Check out HowTo Cover Your Butt When Migrating Data for a rather long winded answer to your question.
Tim
-
matthew Says:
November 14th, 2006 at 5:44 pm
Thanks Tim, have read your reply. Nice ideas!
Doing a full image from the disks in original state is an idea, provided you can get the raid controllers working that way. (via floppy-boot, BARTPE, KNOPPIX or other solutions)
Keeping the info on stock is logical, did that for almost three weeks. Should have been much longer, i agree.
To be honest, I am afraid that the copy-to-other-disk (your option2) will break most file-rights… correct? In most situations that is not really acceptable.
BTW, I keep running into more and more servers without floppies in them. USB-floppy drives are cheap, bring one!
Thanks for your stories, you gained a reader for life. I hope you find the best ads/content trade-off so that you can do the writing stuff full-time.
greetz,
Matthew
Amsterdam, NL
-
Stormy Says:
January 10th, 2007 at 1:12 am
Yup, I know the story well. There’s some kind of evil spirit that floats around the businesses of the world and every once in a while it decides to visit you….
