I have been thinking lately about creating a distributed file archive system with redundancy, and I think that I have come up with a viable proof of concept. The entire process is manual at this point but, with a bit of work, I think that I could automate it and make it usable.

What Is It?

The whole idea came to me from a comment left on a tumblog post. Essentially, JD asked about whether or not someone could point him in the right direction for something like this. I gave it some thought and I think I have a viable model.

Essentially, the question was whether or not we could use all of the unused storage on the workstations and laptops in a small enterprise environment as a backup or archive solution. To me, this seemed like a logical use of resources, especially for a small IT shop where budgets are tight or for a home with the now-common one-computer-per-person setup.

On the surface, this seemed like a wonderful idea but there were issues.

No Redundancy

The biggest issue that I saw with a solution that uses this concept is the hard drive. Workstations are typically single drive systems. There is rarely any redundancy in place for these drives. If that drive fails, your data is gone.

Now, if this is a simple backup solution, this may be less of an issue because the backup is only a copy: the original data still exists where it came from. Things get a bit more risky when we are talking about an archive system.

The purpose of an archive is to move data to a storage location for later access. By definition, you do not have a copy back where the original was located. Now what should we do?

Parchive to the Rescue

The answer to this problem is to use parchive files for redundancy. What are parchive files? Here is what the Parchive Project says about parchive files:

The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal. Our new goal with version 2.0 of the specification is to improve. It extends the idea of version 1.0 and takes the recovery process beyond the file-level barrier. This allows for more effective protection with less recovery data, and removes some previous limitations on the number of recoverable parts.
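To make the "RAID-like" recovery idea concrete, here is a minimal sketch in Python. It is not how PAR2 works internally: PAR2 uses Reed-Solomon coding, while this toy uses plain XOR parity, which can rebuild exactly one lost piece. The function names and sample data are made up for illustration.

```python
from functools import reduce


def xor_bytes(left: bytes, right: bytes) -> bytes:
    """XOR two equally sized byte strings together."""
    return bytes(a ^ b for a, b in zip(left, right))


def make_parity(pieces: list[bytes]) -> bytes:
    """Build one parity block from equally sized pieces."""
    return reduce(xor_bytes, pieces)


def recover_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing piece from the survivors and the parity block."""
    return reduce(xor_bytes, surviving, parity)


# Three hypothetical pieces of equal size; pretend the second one is lost.
pieces = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(pieces)
rebuilt = recover_missing([pieces[0], pieces[2]], parity)
assert rebuilt == pieces[1]
```

The point is only that a small amount of extra recovery data can stand in for any one missing piece, so you do not need a full second copy of everything.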

How The System Would Work

Let’s use a common scenario to examine how to use parchive files to create a redundant archive storage grid.

Let’s say, for example, that you have six computers on your network, your computer and five others. Your connections to these computers would look something like this:

[Image: Network]

Let’s also assume that you have write access to a share on each of these computers.

Now, you want to archive your data by distributing it on each of the systems. For our example, we are going to assume that you have a 697 MB file called ubuntu.iso that you want to archive and each system has 150 MB of free disk space.

You compress the file to save disk space. You now have a file ubuntu.zip that is 681 MB in size.

You now split the ubuntu.zip file into five equally sized files. You are now left with the following files:

  • ubuntu.zip.001
  • ubuntu.zip.002
  • ubuntu.zip.003
  • ubuntu.zip.004
  • ubuntu.zip.005

Each file is 136 MB in size.
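The split itself is nothing more than cutting the compressed file into consecutive chunks. A rough sketch in Python, assuming the same .001, .002, ... naming as above (the function name split_file is made up, and each piece is read into memory in one go, which is fine for a proof of concept but not for very large files):

```python
from pathlib import Path


def split_file(source: Path, parts: int) -> list[Path]:
    """Split `source` into `parts` consecutive pieces named source.001, source.002, ..."""
    data_size = source.stat().st_size
    # Round up so that `parts` pieces always cover the whole file;
    # the last piece may be a little smaller than the others.
    piece_size = -(-data_size // parts)
    pieces = []
    with source.open("rb") as src:
        for index in range(1, parts + 1):
            piece = source.with_name(f"{source.name}.{index:03d}")
            piece.write_bytes(src.read(piece_size))
            pieces.append(piece)
    return pieces
```

Calling split_file(Path("ubuntu.zip"), 5) would produce the five files listed above next to the original.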

You place one file on each computer. So:

  • ubuntu.zip.001 on Computer 1
  • ubuntu.zip.002 on Computer 2
  • ubuntu.zip.003 on Computer 3
  • ubuntu.zip.004 on Computer 4
  • ubuntu.zip.005 on Computer 5

This creates a total of 681 MB of used storage.

Accounting for Hard Drive Failure

This scenario works well as long as nothing goes wrong! But, if you were to lose the hard drive on just one of the workstations, all of the data in ubuntu.iso is gone!

One option would be to put duplicate files on each system. So, you could do the following:

  • ubuntu.zip.001 and ubuntu.zip.002 on Computer 1
  • ubuntu.zip.002 and ubuntu.zip.003 on Computer 2
  • ubuntu.zip.003 and ubuntu.zip.004 on Computer 3
  • ubuntu.zip.004 and ubuntu.zip.005 on Computer 4
  • ubuntu.zip.005 and ubuntu.zip.001 on Computer 5

This would require 1,362 MB of storage to ensure that if one of the systems crashed, you would be able to recover all of your data.

But, if we were to create parchive files, the amount of data that we would have to store would become significantly less. In our example, we would need to create five parchive files with a redundancy of 25%. One parchive volume file and the main par file would accompany each file. The file distribution would look like this:

  • ubuntu.zip.001, ubuntu.zip.vol000+94.PAR2, and ubuntu.zip.par2 on Computer 1
  • ubuntu.zip.002, ubuntu.zip.vol094+94.PAR2, and ubuntu.zip.par2 on Computer 2
  • ubuntu.zip.003, ubuntu.zip.vol188+93.PAR2, and ubuntu.zip.par2 on Computer 3
  • ubuntu.zip.004, ubuntu.zip.vol281+93.PAR2, and ubuntu.zip.par2 on Computer 4
  • ubuntu.zip.005, ubuntu.zip.vol374+93.PAR2, and ubuntu.zip.par2 on Computer 5

The total required amount of disk space would be approximately 854 MB! This is 508 MB less disk storage than the previous solution, a savings of 37.3%!
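If you want to check the comparison yourself, the arithmetic can be sketched in a few lines of Python. The variable names are made up for this example. Note that the bare formula gives roughly 851 MB and a saving of 37.5%; the slightly higher 854 MB quoted above presumably reflects the extra size of the par2 files themselves, which this sketch does not model.

```python
archive_mb = 681      # size of ubuntu.zip in MB
redundancy_pct = 25   # parchive redundancy used in the example

# Option 1: keep every piece on two computers.
duplicated_mb = archive_mb * 2

# Option 2: keep every piece once, plus 25% recovery data.
protected_mb = archive_mb * (1 + redundancy_pct / 100)

saved_mb = duplicated_mb - protected_mb
saving_pct = saved_mb / duplicated_mb * 100

print(f"Duplicated pieces: {duplicated_mb:.0f} MB")
print(f"Pieces + parchive: {protected_mb:.0f} MB")
print(f"Saved: {saved_mb:.0f} MB ({saving_pct:.1f}%)")
```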

The More, The Merrier

The nice thing about this solution is that the more workstations you have, the less redundancy overhead you require to survive the loss of a single workstation. See the table below:

Workstation Count    Redundancy Overhead
2                    100.00%
3                    50.00%
4                    33.33%
5                    25.00%
10                   11.11%
25                   4.17%
50                   2.04%
100                  1.01%

The Math

There are a lot of calculations being made for these configurations. All of them are based on the number of archive locations. For these calculations, let’s assume that the number of archive locations is represented by a and the compressed file size in bytes is represented by z.

The number of files (f) equals the number of archive locations (a). This should be used for both splitting the compressed file and determining the number of parchive files to create.

We also need to plan how redundant we want our system to be. So, the number of locations that can be dead is represented by d. Please note that it is very important that d < a (i.e. the number of archive locations must be greater than the number of dead locations).

Redundancy

The percentage of redundancy (r%) required can be calculated as follows:

r% = d / (a - d) * 100

Total Storage Required

The total storage (s) required for an individual file:

s = z * (r% / 100) + z

Split File Size

Size of each file in bytes (b) when the compressed file is split:

b = z / f
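Here is one way the three formulas could be written as Python functions. This is only a sketch: the function names are made up, the split size is rounded up to a whole byte (an assumption, since the formula above does not say what to do with a remainder), and 1 MB is taken to be 1024 * 1024 bytes.

```python
def redundancy_percent(a: int, d: int) -> float:
    """r% = d / (a - d) * 100, valid only while 0 <= d < a."""
    if not 0 <= d < a:
        raise ValueError("d must satisfy 0 <= d < a")
    return d / (a - d) * 100


def total_storage(z: int, r_percent: float) -> float:
    """s = z * r% + z, with r% given as a percentage (25 means 25%)."""
    return z * r_percent / 100 + z


def split_size(z: int, f: int) -> int:
    """b = z / f, rounded up so that f pieces always cover all z bytes."""
    return -(-z // f)


if __name__ == "__main__":
    z = 681 * 1024 * 1024  # the 681 MB example, assuming 1 MB = 1024 * 1024 bytes
    for a in (2, 5, 10, 100):
        r = redundancy_percent(a, d=1)
        print(f"a={a:3d}  r%={r:6.2f}  s={total_storage(z, r):,.0f}  b={split_size(z, a):,}")
```

Setting d to 2 or more shows how quickly the overhead grows if you want to survive more than one dead location at a time.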

Using the Calculations in QuickPar

I use QuickPar to create the parchive files. Here is a screenshot to show you where these calculations come in place in the QuickPar application.

[Image: Par Calculations]

Perform Your Own Manual Proof of Concept

Here is how you can do your own proof of concept for this type of a system:

Archiving

  1. Download the software that you will work with. I use QuickPar to create parchive files and 7-Zip for file compression and splitting. I use these because they are freely available on the Internet.
  2. Create archive locations. Since this is a proof of concept, they can be locations on remote systems or different folders on your own computer.
  3. Compress your file using 7-Zip.
  4. Determine the size of each file for splitting (b) and split the file using 7-Zip.
  5. Create the parchive files using QuickPar and the calculations provided above.
  6. Move the files to your archive locations as indicated in the example above.
  7. Move (or, if you are interested in living dangerously, delete) the original file and the compressed file.

Recovery

  1. Copy all but one location’s worth of the split and parchive files back into the original location. This will simulate the failure of one system.
  2. Open up the main parchive file (this is the smallest file) in QuickPar.
  3. Rebuild the lost/damaged files in QuickPar.
  4. Recombine the files in 7-Zip.
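Once the missing piece has been rebuilt, recombining is just a matter of gluing the numbered pieces back together in order. A minimal sketch in Python, assuming the pieces are plain consecutive chunks named with a three-digit suffix as in the example (the function name join_pieces and its parameters are made up for illustration):

```python
from pathlib import Path


def join_pieces(pieces_dir: Path, base_name: str, target: Path) -> int:
    """Concatenate base_name.001, base_name.002, ... into `target`.

    Returns the number of bytes written.
    """
    # Zero-padded suffixes sort correctly as plain strings.
    pieces = sorted(pieces_dir.glob(f"{base_name}.[0-9][0-9][0-9]"))
    if not pieces:
        raise FileNotFoundError(f"no pieces of {base_name} found in {pieces_dir}")
    written = 0
    with target.open("wb") as out:
        for piece in pieces:
            written += out.write(piece.read_bytes())
    return written
```

This does not check that every piece is present, so it is only safe to run after the parchive repair step has succeeded.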

Conclusion

Once again, this is a proof of concept just to show how a system like this would work. My next step would be to get AutoIt fired up and use command line versions of 7-Zip and QuickPar to automate the entire process.

So, what do you think of this idea? How could you use it in your environment? Let me know in the comments.

If you found this post useful, why don't you buy me a cup of coffee to show your gratitude?

I am not the type of person who lets different ways of doing things get in my way. I know that I can put hot water into a cup and then add a tea bag. I could also put the tea bag into the cup and then add hot water. In the end, I still get tea.

But, there are some people, especially when it comes to technology, who need to have things done just so. These are the people who, when asked about a specific web page, will tell you, “I’m not sure what the page is called but you can get to it by going to Google, searching for ‘life hack’, click on the first website that comes up, then click on ‘ALL’ near the top of the page. Then, look for their search box and type in ‘Tephlon’. Click on the first result that you find and then look for the word ‘Tephlon’ in that page. It will be a link so click on it. Then click on the coffee mug. It will take you to a page. On that page it will say ‘web’ and then have a website address next to it. Click on that link and you are there! Pretty simple, hey?”

OK. Maybe that is a bit of an extreme example but it gets my point across. These are the users who know exactly what they do on a computer and that’s it. They do not know how to do anything else and that is the way they like it!

So, if you ever need them to install a piece of software from a zip archive, you’re toast because they are used to those nice little installers. If your instructions are more complicated than “Click Next. Next. Next. Next. Finish”, they glaze over and you are stuck doing the installation for them.

Luckily, there are a whole pile of free installers available on the Internet that are just waiting for you to download and build. Here are my favorites:

Read the rest of the story…

According to WiGLE, the Wireless Geographic Logging Engine, over 40% of the wireless networks in their database do not have WEP encryption. This means that there is a very large supply of “free” Internet bandwidth floating about among the radio waves.

Mark Hoekstra of Geek Technique thought that it was a real shame all that free bandwidth was going to waste so he built the little jewel on the right and called it Slurpr. Essentially, Slurpr looks for open (and potentially not so open) wireless networks and connects to them. It then takes all of those relatively small (up to 54 Mbps) connections and aggregates them into one massive Internet link!

Now, don’t get me wrong. I’m not necessarily advocating that you actually do this (but, man, would it ever be cool to try). The website is more than aware of the potential legal implications. But, they are planning to make it available in the near future.

A great big “Thank You!” to everyone who donated to the release of the source code for DCoT Menu. Because of your generosity, DCoT Menu is now available to everyone as open source code. Feel free to download the code and start making your own derivative branches of this application.

Here are the generous people who we all have to thank for this:

So, here is what you have all been waiting for, the DCoT Menu source code: Read the rest of the story…

I have been getting a number of requests for access to the source code for some of the applications that I have made available as freeware on Daily Cup of Tech. Apparently, there are a number of people who would like to learn how I do some of the coding and also follow my examples.

I have been really at odds about what to do regarding this. One of the main goals of Daily Cup of Tech is to help others learn about technology and how to do things for themselves. And learning by example is an excellent way to do this.

On the other hand, I put a lot of time, thought, and effort into these applications. I do not think it is unreasonable for me to expect a little something in return. After all, I am doing this stuff as a hobby. Any time that I spend working on the blog and on applications for the blog comes right off my family time.

So, what is a blogger supposed to do? I want to help others but I don’t want to screw myself over in the meantime.

Donation Source

What I came up with is a new concept called donation source. Essentially, I will still release all of my DCoT apps initially as freeware with the source code being closed. But, you will be able to release the code by donating to the project. Most projects’ donation levels are set to $100, so it is not a ridiculous amount of money to make the code accessible, and I get a bit of monetary encouragement to push me to the next project.

First Projects

And, to add to this announcement, I am officially making DCoT Menu and External IP the first two projects eligible for code release. So, this is your opportunity to make this source code available to everyone.

Go forth and Release The Code!
