Distributed File Archive Proof of Concept
I have been giving some thought lately to creating a distributed file archive system with redundancy, and I think that I have come up with a viable proof of concept. The entire process is manual at this point but, with a bit of work, I think that I could automate it and make it usable.
What Is It?
The whole idea came to me from a comment left on a tumblog post. JD asked whether someone could point him in the right direction for something like this. I gave it some thought and I think I have a viable model.
The question was whether we could use all of the unused storage on all of the workstations and laptops in a small enterprise environment as a backup or archive solution. To me, this seemed like a logical use of resources, especially for a small IT shop where the budgets are small or for a home with a now-common one-computer-per-person setup.
On the surface, this seemed like a wonderful idea but there were issues.
No Redundancy
The biggest issue that I saw with a solution that uses this concept is the hard drive. Workstations are typically single drive systems. There is rarely any redundancy in place for these drives. If that drive fails, your data is gone.
Now, if this is a simple backup solution, this may be less of an issue because the backup is only a copy and the original data still exists. Things get a bit more risky when we are talking about an archive system.
The purpose of an archive is to move data to a storage location for later access. By definition, you do not have a copy back where the original was located. Now what should we do?
Parchive to the Rescue
The answer to this problem is to use parchive files for redundancy. What are parchive files? Here is what the Parchive Project says about parchive files:
The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal. Our new goal with version 2.0 of the specification is to improve. It extends the idea of version 1.0 and takes the recovery process beyond the file-level barrier. This allows for more effective protection with less recovery data, and removes some previous limitations on the number of recoverable parts.
How The System Would Work
Let’s use a common scenario to examine how to use parchive files to create a redundant archive storage grid.
Let’s say, for example, that you have six computers on your network, your computer and five others. Your connections to these computers would look something like this:

Let’s also assume that you have write access to a share on each of these computers.
Now, you want to archive your data by distributing it on each of the systems. For our example, we are going to assume that you have a 697 MB file called ubuntu.iso that you want to archive and each system has 150 MB of free disk space.
You compress the file to save disk space. You now have a file ubuntu.zip that is 681 MB in size.
You now split the ubuntu.zip file into five equally sized files. You are now left with the following files:
- ubuntu.zip.001
- ubuntu.zip.002
- ubuntu.zip.003
- ubuntu.zip.004
- ubuntu.zip.005
Each file is 136 MB in size.
You place one file on each computer. So:
- ubuntu.zip.001 on Computer 1
- ubuntu.zip.002 on Computer 2
- ubuntu.zip.003 on Computer 3
- ubuntu.zip.004 on Computer 4
- ubuntu.zip.005 on Computer 5
This creates a total of 681 MB of used storage.
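The splitting step above can be sketched in Python. This is only a rough stand-in for the splitting that 7-Zip does in the post: the function name `split_file` and its parameters are made up for illustration, and it reads each piece into memory, which is fine for a proof of concept but not for very large files.

```python
from pathlib import Path


def split_file(source: Path, parts: int, dest_dir: Path) -> list[Path]:
    """Split `source` into `parts` pieces named <name>.001, <name>.002, ...

    Hypothetical helper: a plain byte-wise split, not 7-Zip itself.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    data_size = source.stat().st_size
    # Ceiling division so that `parts` pieces always cover the whole file.
    piece_size = max(1, -(-data_size // parts))
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with source.open("rb") as src:
        for index in range(1, parts + 1):
            piece = dest_dir / f"{source.name}.{index:03d}"
            # The last piece may be shorter than the others.
            piece.write_bytes(src.read(piece_size))
            written.append(piece)
    return written
```

Calling `split_file(Path("ubuntu.zip"), 5, Path("parts"))` would produce ubuntu.zip.001 through ubuntu.zip.005 in the parts folder, one for each computer.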
Accounting for Hard Drive Failure
This scenario works well as long as nothing goes wrong! But, if you were to lose the hard drive on just one of the workstations, all of the data in ubuntu.iso is gone!
One option would be to put duplicate files on each system. So, you could do the following:
- ubuntu.zip.001 and ubuntu.zip.002 on Computer 1
- ubuntu.zip.002 and ubuntu.zip.003 on Computer 2
- ubuntu.zip.003 and ubuntu.zip.004 on Computer 3
- ubuntu.zip.004 and ubuntu.zip.005 on Computer 4
- ubuntu.zip.005 and ubuntu.zip.001 on Computer 5
This would require 1,362 MB of storage to ensure that if one of the systems crashed, you would be able to recover all of your data.
But, if we were to create parchive files, we would have to store significantly less data. In our example, we would need to create five parchive volume files with a total redundancy of 25%. One parchive volume file and the main par file would accompany each split file. The file distribution would look like this:
- ubuntu.zip.001, ubuntu.zip.vol000+94.PAR2, and ubuntu.zip.par2 on Computer 1
- ubuntu.zip.002, ubuntu.zip.vol094+94.PAR2, and ubuntu.zip.par2 on Computer 2
- ubuntu.zip.003, ubuntu.zip.vol188+93.PAR2, and ubuntu.zip.par2 on Computer 3
- ubuntu.zip.004, ubuntu.zip.vol281+93.PAR2, and ubuntu.zip.par2 on Computer 4
- ubuntu.zip.005, ubuntu.zip.vol374+93.PAR2, and ubuntu.zip.par2 on Computer 5
The total required amount of disk space would be approximately 854 MB! This is 508 MB less disk storage than the previous solution, a savings of 37.3%!
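The comparison between the two options can be checked with a few lines of Python. This is a back-of-the-envelope sketch using the sizes from the example; the variable names are mine. It gives the theoretical minimum for the parchive option, so it comes out slightly below the roughly 854 MB quoted above, presumably because real parchive files carry some overhead (that explanation is my assumption).

```python
compressed_mb = 681        # size of ubuntu.zip in the example
locations = 5              # computers holding one split file each
tolerated_failures = 1     # computers we want to be able to lose

# Option 1: every split file is stored on two computers.
duplicated_mb = 2 * compressed_mb

# Option 2: every split file is stored once, plus 25% recovery data.
redundancy = tolerated_failures / (locations - tolerated_failures)
parity_mb = compressed_mb * (1 + redundancy)

saved_mb = duplicated_mb - parity_mb
saved_percent = saved_mb / duplicated_mb * 100

print(f"Duplicated files: {duplicated_mb} MB")
print(f"Parchive files:   {parity_mb} MB")
print(f"Savings:          {saved_mb} MB ({saved_percent:.1f}%)")
```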
The More, The Merrier
The nice thing about this solution is that the more workstations you have, the less redundancy overhead you need to survive the loss of a single workstation. See the table below:
| Workstation Count | Redundancy Overhead |
| 2 | 100.00% |
| 3 | 50.00% |
| 4 | 33.33% |
| 5 | 25.00% |
| 10 | 11.11% |
| 25 | 4.17% |
| 50 | 2.04% |
| 100 | 1.01% |
The Math
There are a lot of calculations being made for these configurations. All of them are based on the number of archive locations. For these calculations, let’s assume that the number of archive locations is represented by a and the compressed file size in bytes is represented by z.
The number of files (f) equals the number of archive locations (a). This should be used for both splitting the compressed file and determining the number of parchive files to create.
We also need to plan how redundant we want our system to be. So, the number of locations that can be dead is represented by d. Please note that it is very important that d < a (i.e. the number of archive locations must be greater than the number of dead locations).
Redundancy
The percentage of redundancy (r%) required can be calculated as follows:
r% = d / (a - d) * 100
Total Storage Required
The total storage (s) required for an individual file, where r% is the percentage calculated above:
s = z * (r% / 100) + z
Split File Size
Size of each file in bytes (b) when the compressed file is split:
b = z / f
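The three formulas translate directly into Python. The sketch below keeps the same letters (a, d, z, f) as the text; the function names are my own and the check that d is smaller than a is simply the rule stated earlier.

```python
def redundancy_percent(a: int, d: int) -> float:
    """r% = d / (a - d) * 100, where d locations out of a may be dead."""
    if not 0 <= d < a:
        raise ValueError("d must be at least 0 and smaller than a")
    return d / (a - d) * 100


def total_storage(z: float, r_percent: float) -> float:
    """s = z * (r% / 100) + z, in the same unit as z."""
    return z * r_percent / 100 + z


def split_file_size(z: float, f: int) -> float:
    """b = z / f, the size of each piece when z is split into f files."""
    if f < 1:
        raise ValueError("f must be at least 1")
    return z / f
```

For the example in this post, `redundancy_percent(5, 1)` gives 25.0, `total_storage(681, 25.0)` gives 851.25 and `split_file_size(681, 5)` gives about 136.2 (all in MB rather than bytes).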
Using the Calculations in QuickPar
I use QuickPar to create the parchive files. Here is a screenshot to show you where these calculations come into play in the QuickPar application.

Perform Your Own Manual Proof of Concept
Here is how you can do your own proof of concept for this type of a system:
Archiving
- Download the software that you will work with. I use QuickPar to create parchive files and 7-Zip for file compression and splitting. I use these because they are freely available on the Internet.
- Create archive locations. Since this is a proof of concept, they can be locations on remote systems or different folders on your own computer.
- Compress your file using 7-Zip.
- Determine the size of each file for splitting (b) and split the file using 7-Zip.
- Create the parchive files using QuickPar and the calculations provided above.
- Move the files to your archive locations as indicated in the example above.
- Move (or, if you are interested in living dangerously, delete) the original file and the compressed file.
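The step of moving the files to the archive locations could be scripted along these lines. This is a minimal sketch and not part of the manual process: the `distribute` function is hypothetical, the archive locations are assumed to be plain folders or mounted shares, and the files are copied rather than moved so that nothing is lost if a copy fails.

```python
import shutil
from pathlib import Path


def distribute(split_files: list[Path], recovery_files: list[Path],
               index_file: Path, locations: list[Path]) -> dict[Path, list[Path]]:
    """Copy one split file, one recovery volume and the main par file
    to each archive location, as in the example distribution above."""
    if not (len(split_files) == len(recovery_files) == len(locations)):
        raise ValueError("need one split file and one recovery file per location")
    placed: dict[Path, list[Path]] = {}
    for part, volume, location in zip(split_files, recovery_files, locations):
        location.mkdir(parents=True, exist_ok=True)
        placed[location] = []
        for item in (part, volume, index_file):
            target = location / item.name
            shutil.copy2(item, target)  # copy2 also keeps timestamps
            placed[location].append(target)
    return placed
```

The returned dictionary records which files ended up in which location, which could later serve as the list of where all of the pieces are stored.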
Recovery
- Copy all but one location’s worth of the split and parchive files back into the original location. This will simulate the failure of one system.
- Open up the main parchive file (this is the smallest file) in QuickPar.
- Rebuild the lost/damaged files in QuickPar.
- Recombine the files in 7-Zip.
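The recombining step can also be sketched in Python, assuming the split files are plain byte-wise pieces of the compressed file. The helper names are made up, and comparing a SHA-256 checksum against one recorded before archiving is my own addition to the process, not something the steps above do.

```python
import hashlib
import shutil
from pathlib import Path


def join_parts(parts: list[Path], target: Path) -> None:
    """Concatenate the split files back into one file.

    Sorting by name puts .001, .002, ... in the right order.
    """
    with target.open("wb") as out:
        for part in sorted(parts, key=lambda p: p.name):
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the checksum of the rejoined file matches the one taken from the compressed file before it was split, the recovery worked.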
Conclusion
Once again, this is a proof of concept just to show how a system like this would work. My next step would be to get AutoIt fired up and use command line versions of 7-Zip and QuickPar to automate the entire process.
So, what do you think of this idea? How could you use it in your environment? Let me know in the comments.
27 Responses to “Distributed File Archive Proof of Concept”
-
Andrew Says:
August 16th, 2007 at 11:23 am
How would you handle different computers having different size HDs? Also, you’d probably need a client/server architecture to handle the “who’s the boss of the data” problem. I could see this really taking off as a web based service. All clients report back to and get instructions from a main web server that tracks what is being archived or backed up on the system and would allow a user to interact with their data from anywhere in the world.
- Andrew
-
Tim Fehlman Says:
August 16th, 2007 at 11:55 am
@Andrew
You could get the remote systems to report their disk space to you and then you could set a limit on how much free disk space they would need available to be used as a storage location.
As for a client/server architecture, this is how I originally conceived of the idea. But, I think it could probably be set up either client/server or stand alone, depending on whether or not you wanted to keep data only on your local LAN or distribute it to anonymous users over the Internet.
As for a web service, I think this could be a very cool project to run open source and give each user a client and have a central web server that they could register with and use.
Definitely something to think about!
Tim
-
Andrew Says:
August 16th, 2007 at 12:11 pm
@Tim
Yeah, I guess if you handle each file individually you could just fill a smaller drive to a certain point, then flag it as full and all future files would be split over the remaining computers. So your efficiency would go down when the smaller drive filled up, but oh well.
You could also have the clients periodically check the amount of drive space left on their host computer. If the space got low, the client could signal the server, and the server could remove some of the parts that are being stored on that computer and redistribute the load to the other computers on the network.
But yeah, I think you’d really need a server computer running some sort of database. Then the question pops up….what happens if the server goes down ;-). I guess you could have some process by which you could install the server software on a replacement box and have it re-build the index.
- Andrew
-
Jim Says:
August 16th, 2007 at 1:13 pm
Basically, a very good idea. What are your thoughts on managing the archives? Would you envision a list of archived data maintained somewhere?
I would really like to see some kind of automated process where files not accessed for, say, more than 30 days could be automatically archived, perhaps even in the fashion you suggest. But a zero length file is still left in place, and when a user accesses that zero length file the archive retrieval process begins and restores the file by gathering all the pieces.
-
Tim Fehlman Says:
August 16th, 2007 at 1:33 pm
@Jim
Archive management would depend on the structure. For example, it could be something as simple as a text file that replaces the archived file. This file could tell the program where all of the pieces for the file are located and use that information to check on the availability of files and recovery of files.
I had also thought that an archive scanner which works much like a traditional backup system would also be beneficial. A scheduled archive run would search for and archive all files based on certain criteria (e.g. size, age, type, etc.).
Tim
-
Jim Says:
August 16th, 2007 at 1:42 pm
Tim,
Replacing the original file with a text one which contains the list of pieces wouldn’t necessarily work because the representative file would be a different type than the original (exe, com, pdf, etc.). Not really a good idea. If somehow a snippet of the original file could contain the unarchive program location and file list, then when the file was accessed, unknown to the user, the restore/unarchive would execute while holding up the user’s file access until the file is completely restored.
-Jim
-
Tim Fehlman Says:
August 16th, 2007 at 1:47 pm
@Jim
I was not actually thinking of saving the file with the exact same file name. Rather I was thinking of creating a custom extension.
For example, if you were to archive a file called data.pdf, the file that replaced it would be something like data.pdf.dfa.
The autorecover concept is a cool one. I’m not certain how it could be implemented.
Tim
-
Jim Says:
August 16th, 2007 at 1:49 pm
Tim,
That sounds like it could work.
-Jim
-
Jim Says:
August 16th, 2007 at 1:51 pm
Tim,
Oh, I don’t know how the autorecover would work either.
-Jim
-
Andrew Says:
August 16th, 2007 at 2:10 pm
But, if 2 users shut off their computers, the whole system goes down! Doesn’t that put a bit of a damper on things?
- Andrew
-
Tim Fehlman Says:
August 16th, 2007 at 2:16 pm
@Andrew
This is only true in this specific scenario. If you were to expand this to a larger enterprise of, say, 100 computers, then you could keep the same 25% redundancy and it would take more than 20 computers shutting down before you were in trouble (from the redundancy formula, d = 20 when a = 100 and r% = 25).
Or, by changing the level of redundancy in the parchive file, you could configure any number of systems to be required, down to as few as one or two machines.
Now, the closer you get to the two-machine limit, the larger your parchive files get. Just like everything, there is a tradeoff.
Tim
-
Andrew Says:
August 16th, 2007 at 2:28 pm
@Tim
Maybe I don’t fully understand the capabilities of the Parchive app. I understand that if you have 100 computers, you PAR your 1 large file into 100 small ones and distribute them. Are you saying that you can design those PAR files so that they can be missing X number of parts and still be able to re-construct the data? I assumed that it functioned like RAID where you can have any number of files, but only be missing 1 of those files to re-construct the data.
- Andrew
-
Tim Fehlman Says:
August 16th, 2007 at 2:33 pm
@Andrew,
Yes, I can build a predetermined level of redundancy into the system. So, I could create an “array” of 100 computers with a 25% redundancy built into the parchive files that would allow me to lose about 20 workstations before I ran into trouble.
Tim
-
Andrew Says:
August 16th, 2007 at 2:43 pm
@Tim
Oh wow, that makes all the difference!
So you could have a specified drive or folder that is watched by the client. If a new file shows up it is PARed and the pieces are sent off to the other currently available computers. The client could then tell the server that it’s still waiting for computer x, y and z and the server could send a notification when those systems come online and report to the server.
- Andrew
-
Tim Fehlman Says:
August 16th, 2007 at 2:51 pm
@Andrew
Yep! There’s nothing stopping us from creating something like that!
Tim
-
Tony Says:
August 16th, 2007 at 2:54 pm
What about allowing for multiple failures? Add some encryption and you might be able to use something like this to back up an entire server!
-
Tim Fehlman Says:
August 16th, 2007 at 3:05 pm
@Tony
Multiple failures can easily be overcome (see my replies to Andrew above). As for encryption, I’m not certain it would be required since no single machine actually holds all of the information required to access your data.
But, for the sake of argument, this would not necessarily be too difficult because you could password encrypt the compressed file prior to splitting it and if you wanted encryption over the wire, you could easily do SSH tunneling, SSL, or even SCP for data transmission.
Tim
-
Han Says:
August 17th, 2007 at 2:14 am
Tim:
I have one question. As you said, you have 5 par block files and 1 main par2 file, which is 6 files, and you place those files in the following way:
ubuntu.zip.001, ubuntu.zip.vol000+94.PAR2, and ubuntu.zip.par2 on Computer 1
ubuntu.zip.002, ubuntu.zip.vol094+94.PAR2, and ubuntu.zip.par2 on Computer 2
ubuntu.zip.003, ubuntu.zip.vol188+93.PAR2, and ubuntu.zip.par2 on Computer 3
ubuntu.zip.004, ubuntu.zip.vol281+93.PAR2, and ubuntu.zip.par2 on Computer 4
ubuntu.zip.005, ubuntu.zip.vol374+93.PAR2, and ubuntu.zip.par2 on Computer 5
However, if computer 5 is offline, then the ubuntu.zip.vol374+93.PAR2 file is inaccessible, thus we are missing one par2 block file, so the repair is impossible. So how could I recover ubuntu.zip.005?
Thanks.
-
Tim Fehlman Says:
August 17th, 2007 at 7:03 am
@Han
There is enough information in the four remaining parity files and the four remaining split files to rebuild ubuntu.zip.005. It does not matter which parity file or which split data file is missing. The parity files can rebuild it from scratch.
The key is having enough redundancy in place so that you do not need that one split file and that one parchive file. That is where the formulas come into play.
Tim
-
Andrew Says:
August 17th, 2007 at 8:13 am
If and when a computer fails, would we need to re-combine ALL the files that involved that computer in order to re-create the data and re-span it across the available computers (or re-populate a replacement system)? Is there a semi easy way to ONLY generate the missing file and corresponding PAR files? I’m sure there has to be, but I didn’t see any way (in the QuickPar interface) to have the app recover the PAR files that were missing.
- Andrew
-
Andrew Says:
August 17th, 2007 at 10:32 am
I’ve just answered my own question. YES you can easily re-create PAR files for replaced systems without re-combining all the files and re-PARing everything.
In this example I’ll take a system that has 10 storage locations (for easy math) and 2 allowed failures (25% parity). If 2 locations are down, you would need to:
1.) Copy all the source files and PAR files from the other locations.
2.) Rebuild the 2 missing files.
3.) Because you’re missing 2 of 10 PAR files in a system that has 25% parity, you would use (.25 / 10 * 2). This is (parity / locations * number of locations to repopulate). This gives us 5%. So if we use QuickPar to create 2 recovery files at 5% parity, we have our 2 replacement PAR files to be distributed with the newly restored source files.
I hope this all makes sense.
- Andrew
-
Tim Fehlman Says:
August 17th, 2007 at 10:36 am
@Andrew
Good work! You have discovered exactly what I have discovered. Pretty cool, hey?
Tim
-
JD Says:
August 17th, 2007 at 11:07 am
Sweet! When I first asked the question I didn’t even think about using PAR files even though I have used them for years. So now I (we) just have to come up with a front end program that handles all the network shares, distribution and recovery.
Lag time would probably be a bit of an issue as well. In an office with a typical 100 Mbps network, grabbing 20 100 Meg files from 20 different computers would not be speedy. But for backup and archive (something my company NEVER has enough space for) it would be great.
Glad my suggestion provided the impetus for such an interesting thought experiment. If anyone commercializes this I hope they give me a free copy.
-
Tim Fehlman Says:
August 17th, 2007 at 11:16 am
@JD
The funny thing is that as soon as you posted the comment, the first thing that popped into my head was “PAR files!” and I almost put it right into the comment.
But I knew that this could be a whole lot more so I wanted to do some research before I brought this idea out. In fact, I wanted to wait even longer until I had a bit of a program in place that everyone could play with but I was so excited by the idea that I just needed to get it out.
Plus, there have been some really cool ideas that I would not have come up with on my own that were presented in the comments.
I may start working on a rudimentary AutoIt script/program that we can all work on to see if we can come up with a viable tool.
Tim
-
Bhasker Says:
August 17th, 2007 at 12:55 pm
Is this in any way the same as Distributed Internet Backup (http://web.archive.org/web/20060424061705/http://www.csua.berkeley.edu/~emin/source_code/dibs/), which I always wanted to become more popular and long-lived than it appears to have been?
-
Adrian Says:
August 21st, 2007 at 2:03 am
There are other existing systems out there that do a similar job already. See cleversafe for example: http://www.cleversafe.org/
