Which Compression Format to Use for Archiving
As my collection of digital documents grows, I want to ensure that my documents and data will be available to me far into the future.
One of the challenges of managing all my data is organizing and backing up the many, many files I have.
I want to save almost all of my files and documents, but I suspect I will rarely, if ever, access many of them in the future. To reduce the overall number of files I have to deal with, I want to combine and compress these files into a single archive. Still, I want to be sure that if I ever revisit these files they will be safe and accessible, even years from now.
After some brainstorming I have arrived at a set of criteria that I believe will help keep my data safe while using compression.
- The compression tool must be open source.
- The compression format must be open.
- The tool must be popular enough to be supported by the community.
- Ideally there would be multiple implementations.
- The format must be resilient to data loss.
You can find a more complete list of compression formats on Wikipedia.
A note about tar: it is not a compression tool itself. Rather, it collects many
files into a single file that can then be compressed. We often see
.tgz files, which are many files and directories collected into a single tar file
and then compressed with gzip. tar is needed alongside compression tools that
do not themselves collect many files into an archive, such as bzip2 or xz,
which only compress a single file.
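To make the split between bundling and compressing concrete, here is a minimal Python sketch using the standard library's tarfile module. The helper names make_tgz and list_tgz and the sample file names are made up for illustration; the point is only that tar does the collecting and gzip compresses the resulting single stream.

```python
import io
import tarfile


def make_tgz(files):
    """Bundle a {name: bytes} mapping into one tar stream, gzip-compressed as a whole."""
    buf = io.BytesIO()
    # Mode "w:gz" writes a tar archive and passes the entire stream through gzip.
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_tgz(blob):
    """Return the member names stored inside a .tgz held in memory."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    blob = make_tgz({"a.txt": b"hello", "docs/b.txt": b"world"})
    print(list_tgz(blob))
```

The result is one gzip stream wrapping one tar file, which is why a .tgz behaves like a single compressed block rather than a set of independently compressed files.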
The tool must be open source
Formats that are locked down, hidden away, and selfishly hoarded die and are forgotten, their data lost to time and unpopular tools. How many dissertations have been lost to the WordPerfect format?
To ensure that the tool I use is going to be around in several years, it must be open source. It must be freely available so that any interested party can build on it and extend its life if the original developer chooses not to.
This rules out rar, which has a closed source compressor. It does have a source-available decompressor, but that is hindered by a clause forbidding reverse engineering. This makes it wholly unacceptable for my needs.
Along the same lines as the source being available is the need for the format, the algorithm behind the compression, to be open and accessible. Even in the unlikely event that every implementation is destroyed, there should be sufficient information available to write a compressor and decompressor from scratch.
The tool should be popular
It’s not enough for the tool to be open source. There are plenty of open source tools that have faded in use, that lack updates, and are no longer being maintained.
While open source makes it significantly easier to have and use the tool, it would require time and attention to revive a long dormant compression tool on my own.
Choosing a tool that is popular, with a strong community, ensures that not only will it be around in a few years, but that it will be thriving, with code updates for new platforms, features, and bug fixes.
A popular tool helps with longevity.
All of zip, 7zip, xz, and bzip2 fulfill this requirement.
More than just popularity, the availability of multiple implementations goes a long way toward ensuring future availability. It means that more than one group of people has decided that this format is worth their time and investment.
Multiple implementations mean that if any one group and their code stops development, dies, goes out of business, or otherwise disappears from the earth, there will be other versions of the tool available.
Most of the formats I’m looking at fulfill this requirement; however, zip is the big winner here. There are many, many tools that support zip, including those from WinZip, Microsoft, and Apple. It’s necessary to have a community supported format, but it is a huge bonus that businesses also support the zip format.
Resiliency and durability
The last criterion is the most important: the format has to be resilient. It has to expect that damage will happen and have a strategy for dealing with that damage, or at least for working around it.
Of the reviewed formats, only zip does not use solid compression as its strategy for combining multiple files into a single archive. A solid archive is one where many files are concatenated into a single block that is then compressed. This single block is unwise, as bit rot damage to the archive could prevent any of the data from being recovered during decompression. Zip instead compresses each file separately, so damage to one file’s data need not affect the others.
7zip in solid mode, and tar combined with xz or bzip2, all produce solid archives.
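The difference can be sketched in Python with the standard zipfile and tarfile modules. This is an illustration under stated assumptions, not a rigorous test: the sample files and helper names are invented, and damage is simulated by flipping only the very first byte of each archive, a position chosen because its effect is deterministic. Damage elsewhere in a solid stream usually, though not always, has a similarly wide effect, and dedicated recovery tools may do better than the standard library readers used here.

```python
import io
import tarfile
import zipfile

# Hypothetical sample data for the demonstration.
FILES = {"first.txt": b"first document\n" * 50,
         "second.txt": b"second document\n" * 50}


def build_zip(files):
    """Zip: every member gets its own header and its own compressed stream."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def build_tgz(files):
    """tar + gzip: all members are concatenated, then compressed as one stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def flip_byte(blob, offset=0):
    """Simulate bit rot by inverting every bit of one byte."""
    damaged = bytearray(blob)
    damaged[offset] ^= 0xFF
    return bytes(damaged)


def readable_from_zip(blob, files):
    """Names of members that still read back byte-for-byte from a zip."""
    ok = []
    try:
        zf = zipfile.ZipFile(io.BytesIO(blob))
    except Exception:
        return ok
    with zf:
        for name, data in files.items():
            try:
                if zf.read(name) == data:
                    ok.append(name)
            except Exception:
                pass  # this member is lost, keep trying the others
    return ok


def readable_from_tgz(blob, files):
    """Names of members that still read back byte-for-byte from a .tgz."""
    ok = []
    try:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
            for name, data in files.items():
                try:
                    member = tar.extractfile(name)
                    if member is not None and member.read() == data:
                        ok.append(name)
                except Exception:
                    pass
    except Exception:
        pass  # the single compressed stream could not be opened at all
    return ok


if __name__ == "__main__":
    print(readable_from_zip(flip_byte(build_zip(FILES)), FILES))  # ['second.txt']
    print(readable_from_tgz(flip_byte(build_tgz(FILES)), FILES))  # []
```

With the same one-byte damage, the zip loses only the member whose header was hit, while the standard library cannot read anything from the gzip-compressed tar.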
Bonus: Parity for safety
For additional resilience the par2 parity tool should be used. This tool creates a set of parity files that contain data that can be used to recover parts of a file should it be damaged from bit rot.
When creating the parity files, the amount of resilience can be specified. The default recommended when using par2 for Usenet is 10%, but any percentage can be specified. That percentage is the maximum amount of the original archive file that can be damaged or missing while still allowing a full recovery.
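A rough way to reason about that percentage is to count blocks rather than bytes, since par2 repairs whole blocks. The Python sketch below is a simplified model, not par2's actual algorithm: the function names, the block size, and the rounding are my own assumptions, and it only captures the idea that each damaged block uses up one recovery block.

```python
def source_blocks(file_size, block_size):
    """Number of blocks the archive is split into (ceiling division)."""
    return -(-file_size // block_size)


def recovery_blocks(file_size, block_size, redundancy_pct):
    """Recovery blocks available at a given redundancy (rounding down is an assumption)."""
    return source_blocks(file_size, block_size) * redundancy_pct // 100


def is_repairable(damaged_offsets, file_size, block_size, redundancy_pct):
    """True if the number of distinct damaged blocks fits within the recovery blocks."""
    damaged = {offset // block_size for offset in damaged_offsets}
    return len(damaged) <= recovery_blocks(file_size, block_size, redundancy_pct)


if __name__ == "__main__":
    size, block = 100_000_000, 1_000_000   # hypothetical 100 MB archive, 1 MB blocks
    scattered = [i * block for i in range(11)]       # 11 flipped bytes, 11 different blocks
    clustered = [5_000_000 + i for i in range(1000)]  # 1000 flipped bytes, all in one block
    print(recovery_blocks(size, block, 10))           # 10
    print(is_repairable(scattered, size, block, 10))  # False
    print(is_repairable(clustered, size, block, 10))  # True
```

Under this model, a few scattered bit flips can cost as much recovery data as a much larger contiguous hole, which is worth keeping in mind when picking the percentage.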
For unimportant files no par2 files need be created. For important documents par2 files should be created with a resilience percentage that makes sense for how important the document is.
Parity files should be used in conjunction with zip, so that the compressed files and their parity files end up in a single convenient file.
For now I expect to zip up groups of files that I know I want to archive but do not expect to access frequently. For those that are important I will create par2 parity files, and then zip the archive and its parity files up again.
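As a rough sketch of that workflow in Python: the helper below zips a group of files and immediately re-reads the archive to check every member's CRC. The function names and file names are hypothetical. The par2 step is only built as a command line and not run, and it assumes the par2cmdline tool, whose create command takes a redundancy option of the form -r followed by a percentage; check your installed version before relying on it.

```python
import zipfile
from pathlib import Path


def zip_group(paths, dest):
    """Compress each path into dest as its own zip member, then verify the result."""
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path, arcname=Path(path).name)
    with zipfile.ZipFile(dest) as zf:
        bad = zf.testzip()  # name of the first member with a bad CRC, or None
        if bad is not None:
            raise RuntimeError(f"corrupt member right after writing: {bad}")
    return Path(dest)


def par2_create_command(archive, redundancy_pct=10):
    """Build (but do not run) a par2cmdline invocation; the flags are an assumption."""
    archive = Path(archive)
    return ["par2", "create", f"-r{redundancy_pct}", f"{archive}.par2", str(archive)]
```

The returned list could be passed to subprocess.run on a machine with par2 installed, and the archive plus the generated parity files could then be passed to zip_group again to produce the final single file.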
I think this is the best option for convenient archiving and long-term resilience and durability.