emergency file server cleanup
Antonio Diaz Diaz
antonio at gnu.org
Mon Sep 29 19:20:19 UTC 2014
Hello Alexandre,
Alexandre Oliva wrote:
>> Why? Lzip can compress more than xz with a bit of tuning via --options.
>
> Maybe it can, but when I compared the sizes of the files to decide which
> one to keep, .xz files were consistently (if slightly) smaller than .lz
> ones.
I guess you mainly mean tarballs, because lzip compresses patches and
xdeltas better than xz, sometimes even when passing the --extreme option
to xz. Updating lzip to 1.16 gives even better results:
  98923 2014-09-29 03:49 linux-libre-3.17-rc7-gnu.xdelta.lz (1.16)
  99065 2014-09-29 03:49 linux-libre-3.17-rc7-gnu.xdelta.lz
  99096 2014-09-29 03:49 linux-libre-3.17-rc7-gnu.xdelta.xz
7268517 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.lz (1.16)
7284746 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.lz
7272508 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.xz (-9e)
7344044 2014-09-29 05:19 patch-3.16-gnu-3.17-rc7-gnu.xz (-9)
  81530 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.lz (1.16)
  81638 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.lz
  81724 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.xz (-9e)
  82104 2014-09-29 05:40 patch-3.17-rc6-gnu-3.17-rc7-gnu.xz (-9)
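To put the differences in the listing in proportion, here is a small Python sketch that computes how much smaller the best lzip file is than the best xz file in each group. The sizes are copied from the listing; the dictionary keys (lz_1.16, lz, xz, xz_9e, xz_9) and the helper names are only labels made up for this illustration.

```python
# Sizes in bytes, copied from the listing above. The keys are
# illustrative labels: "lz_1.16" is lzip 1.16, "lz" the older lzip,
# "xz" the xz file whose options are not stated in the listing.
sizes = {
    "linux-libre-3.17-rc7-gnu.xdelta": {
        "lz_1.16": 98923, "lz": 99065, "xz": 99096},
    "patch-3.16-gnu-3.17-rc7-gnu": {
        "lz_1.16": 7268517, "lz": 7284746, "xz_9e": 7272508, "xz_9": 7344044},
    "patch-3.17-rc6-gnu-3.17-rc7-gnu": {
        "lz_1.16": 81530, "lz": 81638, "xz_9e": 81724, "xz_9": 82104},
}

def best(entries, prefix):
    """Smallest size among the entries whose label starts with prefix."""
    return min(v for k, v in entries.items() if k.startswith(prefix))

def saving_percent(smaller, larger):
    """How much smaller 'smaller' is than 'larger', in percent of 'larger'."""
    return 100.0 * (larger - smaller) / larger

for name, entries in sizes.items():
    lz = best(entries, "lz")
    xz = best(entries, "xz")
    print(f"{name}: best lzip is {saving_percent(lz, xz):.3f}% "
          f"smaller than best xz")
```

The gains are fractions of a percent, which matches the "if slightly" in the quoted text, only in the opposite direction for these patch and xdelta files.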
> Maybe I'm not using the best options to compress tarballs, vcdiffs and
> xdeltas with lzip. Suggestions are certainly welcome.
Vcdiff is already a compressed format. I guess the best option is not to
compress it again and just distribute one plain .vcdiff file per
release. You save about 66% in size and the (re)compressing time.
About tarballs, when LZMA-utils was renamed to XZ-utils its developers
changed the name of the "algorithm" to LZMA2 and at the same time
increased the dictionary size of option -9 from 32 MiB to 64 MiB,
misleading users into thinking that the increase in compression ratio
was because of the new "algorithm". (BTW, LZMA2 is not an algorithm, but
a container format).
As you can see near the end of the lzip benchmark[1], passing to lzip
the arguments equivalent to those of "xz -9" (or to xz the arguments
equivalent to those of "lzip -9") will usually make lzip compress more
than xz. But I do not recommend doing so, because with plain "-9" on
both compressors lzip usually compresses large files about as much as
xz, while using half the RAM to compress and half the RAM to decompress.
In the case of small files the difference in memory required to
decompress is even larger. The massif tool of valgrind finds that lzip
uses 443,384 bytes to decompress 'patch-3.17-rc6-gnu-3.17-rc7-gnu',
while xz uses 67,154,552 bytes.
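As a quick check of what those two massif figures mean, this short Python sketch computes their ratio and converts them to binary units. Only the two byte counts come from the measurement above; the variable names are made up for the illustration.

```python
# Peak memory in bytes reported by valgrind's massif, from the text above.
lzip_bytes = 443_384
xz_bytes = 67_154_552

ratio = xz_bytes / lzip_bytes
print(f"xz needs about {ratio:.0f} times the memory of lzip")
print(f"xz:   {xz_bytes / 2**20:.2f} MiB")
print(f"lzip: {lzip_bytes / 2**10:.0f} KiB")
```

The xz figure is just over 64 MiB, which is the dictionary size of "xz -9" mentioned earlier.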
In the lzip benchmark you can also see that every one of the 43 xz
tarballs distributed on ftp.gnu.org was compressed better by lzip.
[1] http://www.nongnu.org/lzip/lzip_benchmark.txt
>> Lzip was designed for long-term archiving, having a
>> tool to recover corrupt files.
>
> I very much doubt it could recover corrupt files to the point that the
> original signature would match, because that would require a lot of
> redundancy to be added, which is the opposite of what a compressor is
> supposed to do. And if the original signature doesn't match, I wouldn't
> trust the result, especially given that we have alternate paths to
> obtain the tarballs.
Lziprecover is so awesome that people can't believe it. :-) Most think
it is just like bzip2recover.
Lziprecover can perfectly repair most files with a single-byte error in
them, without the need for any extra redundancy at all. The repaired file
will be identical bit for bit to the original.
Just get one linux-libre tarball and modify the value of a byte (near
the beginning for a quick test). For example, I modified the byte at
offset 1000 in 'linux-libre-3.12.5-gnu.tar.lz' and lziprecover repaired
it in 12 seconds:
e871ba7561ed4833e9349f40d2975f53 linux-libre-3.12.5-gnu.tar.lz
e871ba7561ed4833e9349f40d2975f53 linux-libre-3.12.5-gnu1k.tar_fixed.lz
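To illustrate why no extra redundancy is needed in principle, here is a toy Python sketch. It is not lziprecover's actual algorithm and it does not use the lzip format (the Python standard library has no lzip support); it uses a gzip container from the zlib module instead, and the function names are made up. The idea it shows is only this: the integrity check the format already stores (a CRC-32 here) is enough to recognize the right correction, so a brute-force search over single-byte changes can find it.

```python
import zlib

def compress_gzip(data: bytes) -> bytes:
    """Compress into a gzip container, which stores a CRC-32 of the data."""
    c = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 selects gzip framing
    return c.compress(data) + c.flush()

def try_decompress(blob: bytes):
    """Return the decompressed data, or None if the stream or its CRC is bad."""
    d = zlib.decompressobj(31)
    try:
        out = d.decompress(blob)
    except zlib.error:
        return None
    # Reject truncated streams and streams followed by trailing bytes.
    if not d.eof or d.unused_data:
        return None
    return out

def repair_single_byte(blob: bytes):
    """Try every value at every position; return the first candidate that
    passes the integrity check, or None if no single-byte change helps."""
    if try_decompress(blob) is not None:
        return blob  # nothing to repair
    candidate = bytearray(blob)
    for pos in range(len(candidate)):
        saved = candidate[pos]
        for value in range(256):
            if value == saved:
                continue
            candidate[pos] = value
            if try_decompress(bytes(candidate)) is not None:
                return bytes(candidate)
        candidate[pos] = saved  # restore before moving to the next position
    return None

# Hypothetical test data; any small input would do.
data = b"".join(b"line %d: linux-libre patch hunk\n" % i for i in range(40))
original = compress_gzip(data)

damaged = bytearray(original)
damaged[len(damaged) // 2] ^= 0xFF  # corrupt one byte in the middle
damaged = bytes(damaged)

repaired = repair_single_byte(damaged)
print("recovers the original data:",
      repaired is not None and try_decompress(repaired) == data)
print("bit-identical to the original file:", repaired == original)
```

This exhaustive search is only practical for tiny files, and it accepts the first candidate that passes the check. A real tool has to be much smarter about where to look in order to handle a tarball in seconds.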
One byte may seem small, but most file corruptions not produced by I/O
errors just affect one byte, or even one bit, of the file. Also, unlike
magnetic media, where errors usually affect a whole sector, solid-state
devices tend to produce single-byte errors, making lzip the perfect
format for data stored on such devices.
Even if the repair capability of lziprecover is not needed for
linux-libre files, it may save the irreplaceable data of many users,
which they would lose if they used bzip2 or xz.
As the author of GNU ddrescue I know about the tragedy of losing data
and how to increase the probability of recovering it. If I have spent 6
years developing a whole family of tools around a compression format you
can be sure that it is the best for users. If it weren't, I would just
have continued developing my projects and using the best format for my
tarballs. Data compression should not be seen as a popularity contest,
but as a service to humankind.
Be the change you wish to see in the world. Drop xz tarballs altogether. ;-)
Best regards,
Antonio.