Decompression Problem Broken Compressed Data


Analisa Wack

Jul 25, 2024, 6:39:09 AM
to asglocatkber

I'm using GZipStream to compress / decompress data. I chose this over DeflateStream since the documentation states that GZipStream also adds a CRC to detect corrupt data, which is another feature I wanted. My "positive" unit tests are working well in that I can compress some data, save the compressed byte array and then successfully decompress it again. The .NET GZipStream compress and decompress problem post helped me realize that I needed to close the GZipStream before accessing the compressed or decompressed data.

Next, I continued to write a "negative" unit test to be sure corrupt data could be detected. I had previously used the example for the GZipStream class from MSDN to compress a file, open the compressed file with a text editor, change a byte to corrupt it (as if opening it with a text editor wasn't bad enough!), save it and then decompress it to be sure that I got an InvalidDataException as expected.

When I wrote the unit test, I picked an arbitrary byte to corrupt (e.g., compressedDataBytes[50] = 0x99) and got an InvalidDataException. So far so good. I was curious, so I chose another byte, but to my surprise I did not get an exception. This may be okay (e.g., I coincidentally hit an unused byte in a block of data), so long as the data could still be recovered successfully. However, I didn't get the correct data back either!

To be sure "it wasn't me", I took the cleaned-up code from the bottom of .NET GZipStream compress and decompress problem and modified it to sequentially corrupt each byte of the compressed data until it failed to decompress properly. Here are the changes (note that I'm using the Visual Studio 2010 test framework):

So, this means there were actually 7 cases where corrupting the data made no difference (the string was successfully recovered), but corrupting byte 11 neither threw an exception, nor recovered the data.

It should be vanishingly rare for a corruption anywhere else in the stream to go undetected. Most of the time the decompressor will detect an error in the format of the compressed data without ever getting to the point of checking the CRC. If it does get to the CRC check, that check should fail very nearly all the time for a corrupted input stream. ("Nearly all the time" means a probability of about 1 - 2^-32.)

I just tried it (in C with zlib) using your sample string, which produces an 84-byte gzip stream. Incrementing each of the 84 bytes in turn, leaving the rest unchanged as you did, resulted in: two incorrect header checks, one invalid compression method, seven successes, one invalid block type, four invalid distances sets, seven invalid code lengths sets, four missing end-of-blocks, eleven invalid bit length repeats, three invalid bit length repeats, two invalid bit length repeats, two unexpected ends of stream, 36 incorrect data checks (that's the actual CRC error), and four incorrect length checks (another check in the gzip format, for the correct uncompressed data length). In no case did a corrupted compressed stream go undetected.

After downloading a genome from NCBI as a .gz file, when I go to see the header of the file in Galaxy it shows nothing and gives me the error "Problem decompressing gzipped data". Am I missing a simple unzip step in Galaxy?

Use FTP to upload to Galaxy: -upload/. Compressed .gz data should load without problems from this source; however, you could always upload the uncompressed version. That should not really be necessary if the data is complete/intact, though.

I also noticed that you had some SRA formatted data loaded. When using the tool Get Data: EBI SRA, choose the link(s) under the table header "FASTQ files (Galaxy)" to import the data instead of the SRA formatted data under the header "NCBI SRA file (Galaxy)": -data/

What I'm posting here is not an "issue" but a solution: a patch for the Linux kernel that might solve the SQUASHFS problem for some devices [don't expect this to be a magic solution if your hardware is actually faulty].

I'm a developer; I started my journey developing for AOSP when I was younger (15 or 16). I may not be the smartest, the wisest, or the most expert, but I do have strong willpower. My device, the EA6350v3, works fine. It's a nice device: some issues with the VLAN and the switch (which I've fixed myself), and the fact that ath10k is far from perfect, but still a nice experience.

It worked flawlessly and amazingly running Linux 4.14 and Linux 4.19, but that quickly changed when Linux 5.4 came to master. Since then, when using the USB port and the WiFi intensively at the same time (say, using the device as a NAS), and particularly with Samba, since its binary weighs around 30 MB, the device will spam the log like this:

Spamming the whole kernel log is not the actual issue: the process needing data from the ROM will eventually crash. That process can even be init, since init must live in the flashable SQUASHFS partition and may need a shared library, or some of its code swapped out of memory, or to run some BusyBox tool; and if init crashes, the kernel immediately panics.

So, the problem is not physical. The problem only happens on Linux 5.4; it's a software problem, an issue related to how the CPU, the RAM, the DMA and the respective drivers [ath10k, usbcore, ubi/ubiblock/ubifs, spi-qup, mtd/mtdblock] interact, but I can't find out what's wrong.

Investigating UBI, I found that UBI is resilient to errors: it can detect and correct many problems with actually bad flash chips [not the issue here]. Indeed, UBI detects a failure and tries again, as the log clearly states, and most of the time it succeeds. This means that programs and drivers that talk directly to the UBI layer succeed, like UBIFS, ubiupdatevol and many more. SQUASHFS should do the same.

This means SQUASHFS needs to be fixed. Since UBI corrects the error transparently to the upper layers, why isn't SQUASHFS satisfied? When UBI fixes the error, the data in memory is correct, and that should be transparent to SQUASHFS; so what's the deal?

The Linux kernel will not allow a corrupted page in memory. It will fight back against any driver attempting to corrupt the kernel's memory [corrupting memory on purpose is another story], including the SPI, MTD and NAND drivers. UBI itself taints its buffer, retries the read, and performs a checksum over the buffer. This means that after UBI retries, the data is indeed present in memory, and it's correct, but SQUASHFS is unable to recognize it as such. Instead, SQUASHFS enters a failure loop, even though the data is technically correct.

This is because, somehow, the page cache (a section of memory used by Linux to optimize disk usage) keeps the "corrupted" pages in memory. Almost all file systems in the kernel know how to deal with that situation, like F2FS on bad flash or EXT4 with a faulty data cable, but SQUASHFS doesn't. From this point onwards, "corrupted" in quotes means that the data looks corrupted to SQUASHFS even though no corruption is present (as guaranteed by UBI); it's synonymous with "poisoned" below.

This is the first step to fix SQUASHFS: invalidate the pages and fragments that are known to be "corrupted", so they never stay in the "front" cache. The squashfs_cache_get function in cache.c contains the "front" cache lookup logic of SQUASHFS, including the call that will eventually read the data into memory and decompress it, either into a buffer or directly into the page cache [not discussed here].

If you look at the logs closely, SQUASHFS only attempts to decompress once. This is because the data goes into SQUASHFS's "front" cache even when decompression fails. This means that upon a decompression failure, the cache will remain poisoned until Linux needs to free some RAM and the page is lucky enough to get evicted; only then will Linux attempt to read it again.

Invalidating the "front" cache entries, by changing their block property to SQUASHFS_INVALID_BLK, will force SQUASHFS to try to read and decompress the data again; otherwise, the page marked as faulty will stay in RAM for an indefinite amount of time. However, this does not resolve any poisoning in the "back" cache, and if that is poisoned, it will fail over and over again.

This alone may fix some problems, but the issue with doing only this is that the "corrupted" data will still be served to all waiting processes at least once. The read will be attempted again only when another process waits for the same data, after all the original waiters have already used the corrupted data or received the SIGBUS kill signal from the kernel; and since the page is not held as corrupted in the page cache, it will enter contention. A better solution is to avoid poisoning the "front" cache in the first place.

SQUASHFS manages its "front" cache (the decompressed data actually used) in the function described above. The front cache can be invalidated, which will force SQUASHFS to read from the backing device and decompress again. Then there is the second cache: the "back" cache, which holds the raw SQUASHFS image, the compressed data as stored on disk. SQUASHFS uses the ll_rw_block function (no longer present in the mainline kernel) to read pages from the backing device. Linux optimizes these calls by putting the data, again, into its page cache.

So, the very first time, the data will actually be read from disk, but the second and subsequent calls, even when explicitly requested, will not; and the "corrupted" compressed data will remain in memory indefinitely. Two things need to be done: implement retry logic for the squashfs_read_data function, so it retries without poisoning the "front" cache; and evict the data from RAM, so that ll_rw_block may attempt to re-fetch it from disk.

The first one is trivial: rename the original function to __squashfs_read_data, make it static inline, and wrap it in a new squashfs_read_data with the same signature that loops n times while the real function returns an error. This allows retrying without poisoning the "front" cache.
