I've been investigating this SD-Card problem for a while and I think
it's time to share my observations.
As we know, 'apt-get update' is very good at triggering the problem.
I've managed to reduce it to a small test program
which almost always causes the card to fail. The program is attached
(mmap_stress.c).
The current theory is as follows:
Typically we fail on writing a page with error -EILSEQ (CRC error).
Now the code that calculates the CRC is here (this is from a .37+ kernel):
drivers/mmc/host/mmc_spi.c:mmc_spi_writeblock():

	...
	if (host->mmc->use_spi_crc)
		scratch->crc_val = cpu_to_be16(
				crc_itu_t(0, t->tx_buf, t->len));
	if (host->dma_dev)
		dma_sync_single_for_device(host->dma_dev,
				host->data_dma, sizeof(*scratch),
				DMA_BIDIRECTIONAL);

	status = spi_sync_locked(spi, &host->m);

We can see that the CRC is calculated over the TX buffer *before* the
message is sent to the bus. Now we know that doing an SPI transfer is going
to sleep, which causes a context switch.
Then apt-get (or mmap_stress) is scheduled onto the processor and
changes the mmapped file, presumably in the same page
which is currently being transferred to the card. The data on the wire
no longer matches the CRC computed earlier, so the card
fails to verify the CRC -> bang.
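
The effect is easy to show in user space. This is only a sketch, not the
driver code: crc16_itu_t() is my own bitwise stand-in for the polynomial
used by the kernel's crc_itu_t() (x^16 + x^12 + x^5 + 1, initial value 0),
and crc_survives_late_write() is an invented helper that plays the role of
a task touching the buffer after the CRC has been taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, polynomial 0x1021, MSB first. User-space stand-in for
 * the kernel's table-driven crc_itu_t(); the name is hypothetical. */
uint16_t crc16_itu_t(uint16_t crc, const uint8_t *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	while (len--) {
		crc ^= (uint16_t)(*buf++ << 8);
		for (int bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x1021);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}

/*
 * Hypothetical helper: compute the CRC (as the driver does), then let
 * "another task" flip one byte, then compute the CRC the receiver would
 * derive from the bytes actually sent.
 * Returns 1 if both CRCs still match, 0 if the late write broke the CRC.
 */
int crc_survives_late_write(uint8_t *block, size_t len, size_t offset)
{
	assert(offset < len);
	uint16_t crc_before = crc16_itu_t(0, block, len);  /* driver side */
	block[offset] ^= 0xff;                             /* late writer */
	uint16_t crc_on_wire = crc16_itu_t(0, block, len); /* card side   */
	return crc_before == crc_on_wire;
}
```

For a single changed byte the helper always returns 0, because a 16-bit
CRC detects every error burst of 16 bits or fewer.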
Now this behavior is normal for mmapped files: the contents can change
at any time, and the only way of making sure that the disk has the latest
data is to call msync() or similar.
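
For illustration, here is a sketch of that access pattern. It is NOT the
attached mmap_stress.c; dirty_mapped_file() and its parameters are made up
for this example. It keeps rewriting a shared mapping while writeback has
already been requested, and relies on a final msync(MS_SYNC) to get the
last state onto the disk.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `path` shared, rewrite every byte `rounds`
 * times while asking for asynchronous writeback, then force the final
 * contents out with MS_SYNC. Returns 0 on success, -1 on error.
 */
int dirty_mapped_file(const char *path, size_t size, int rounds)
{
	assert(path != NULL && size > 0);

	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)size) < 0) {
		close(fd);
		return -1;
	}

	unsigned char *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	if (map == MAP_FAILED)
		return -1;

	int ret = 0;
	for (int i = 0; i < rounds && ret == 0; i++) {
		memset(map, i & 0xff, size);
		/* Request writeback but keep going: the next round may
		 * modify pages that are still on their way to the disk. */
		if (msync(map, size, MS_ASYNC) < 0)
			ret = -1;
	}
	/* Only after this returns is the file known to hold the last round. */
	if (ret == 0 && msync(map, size, MS_SYNC) < 0)
		ret = -1;

	munmap(map, size);
	return ret;
}
```

Whether this particular loop is enough to trigger the failure on a given
card is an assumption; I have not measured it.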
We can disable CRCs on SD card transfers by passing
'mmc_core.use_spi_crc=0' to the kernel.
I've been running some tests with data CRCs disabled and so far the
results are good -> no errors whatsoever. I'm continuing to test, but
I suggest that others try the option as well.
Regards,
MW
2011/1/13 Mika Westerberg <mika.we...@iki.fi>:
> Hello,
>
> I've been investigating this SD-Card problem for a while and I think
> it's time to share my observations.
>
> As we know, 'apt-get update' is very good in triggering the problem.
> I've managed to reduce it into a small test program
> which almost always causes the card to fail. The program is attached
> (mmap_stress.c).
Thank you, I'll try it out!
Could it also be that the ext2 fs is not well suited to sim1? I mean, should I try btrfs
or ext4 (without journal)?
The following are my latest errors. I got them with the SD card (2GB, not
SDHC) that Sergio sent me...
I saw better results when I moved from SDHC to SD.
I got only this error, and it showed up only after an apt-get update
with a heavily populated sources.list...
Fetched 14.4MB in 4min16s (56.1kB/s)
mmcblk0: error -84 transferring data, sector 3210145, nr 48, card
status 0x0
end_request: I/O error, dev mmcblk0, sector 3210145
end_request: I/O error, dev mmcblk0, sector 3210153
end_request: I/O error, dev mmcblk0, sector 3210161
end_request: I/O error, dev mmcblk0, sector 3210169
end_request: I/O error, dev mmcblk0, sector 3210177
end_request: I/O error, dev mmcblk0, sector 3210185
mmcblk0: error -84 transferring data, sector 267, nr 2, card status
0x0
end_request: I/O error, dev mmcblk0, sector 267
Buffer I/O error on device mmcblk0p1, logical block 8
lost page write due to I/O error on mmcblk0p1
------------[ cut here ]------------
WARNING: at fs/buffer.c:1152 mark_buffer_dirty+0x34/0xec()
[<c002c118>] (unwind_backtrace+0x0/0xec) from [<c003e14c>]
(warn_slowpath_common+0x44/0x5c)
[<c003e14c>] (warn_slowpath_common+0x44/0x5c) from [<c003e180>]
(warn_slowpath_null+0x1c/0x24)
[<c003e180>] (warn_slowpath_null+0x1c/0x24) from [<c00be074>]
(mark_buffer_dirty+0x34/0xec)
[<c00be074>] (mark_buffer_dirty+0x34/0xec) from [<c00fa9a4>]
(ext2_new_blocks+0x3e8/0x53c)
[<c00fa9a4>] (ext2_new_blocks+0x3e8/0x53c) from [<c00fe5a0>]
(ext2_get_block+0x3b4/0x78c)
[<c00fe5a0>] (ext2_get_block+0x3b4/0x78c) from [<c00bf284>]
(block_prepare_write+0x1c4/0x4c8)
[<c00bf284>] (block_prepare_write+0x1c4/0x4c8) from [<c00bf71c>]
(block_write_begin+0x48/0x78)
[<c00bf71c>] (block_write_begin+0x48/0x78) from [<c00fd75c>]
(ext2_write_begin+0x3c/0x64)
------------[ cut here ]------------
WARNING: at fs/buffer.c:1152 mark_buffer_dirty+0x34/0xec()
[<c00fd75c>] (ext2_write_begin+0x3c/0x64) from [<c006ae28>]
(generic_file_buffered_write+0xd8/0x240)
[<c002c118>] (unwind_backtrace+0x0/0xec) from [<c003e14c>]
(warn_slowpath_common+0x44/0x5c)
[<c006ae28>] (generic_file_buffered_write+0xd8/0x240) from
[<c006cf2c>] (__generic_file_aio_write+0x460/0x4a4)
[<c003e14c>] (warn_slowpath_common+0x44/0x5c) from [<c003e180>]
(warn_slowpath_null+0x1c/0x24)
[<c006cf2c>] (__generic_file_aio_write+0x460/0x4a4) from [<c006cfdc>]
(generic_file_aio_write+0x6c/0xd4)
[<c003e180>] (warn_slowpath_null+0x1c/0x24) from [<c00be074>]
(mark_buffer_dirty+0x34/0xec)
[<c006cfdc>] (generic_file_aio_write+0x6c/0xd4) from [<c0099798>]
(do_sync_write+0xa8/0xf4)
[<c00be074>] (mark_buffer_dirty+0x34/0xec) from [<c00fa9a4>]
(ext2_new_blocks+0x3e8/0x53c)
[<c0099798>] (do_sync_write+0xa8/0xf4) from [<c009a194>]
(vfs_write+0xb0/0x13c)
[<c009a194>] (vfs_write+0xb0/0x13c) from [<c009a2d4>]
(sys_write+0x40/0x6c)
[<c00fa9a4>] (ext2_new_blocks+0x3e8/0x53c) from [<c00fe5a0>]
(ext2_get_block+0x3b4/0x78c)
[<c009a2d4>] (sys_write+0x40/0x6c) from [<c0026dc0>]
(ret_fast_syscall+0x0/0x2c)
[<c00fe5a0>] (ext2_get_block+0x3b4/0x78c) from [<c00bf284>]
(block_prepare_write+0x1c4/0x4c8)
---[ end trace 2567634ba15c3eaa ]---
[<c00bf284>] (block_prepare_write+0x1c4/0x4c8) from [<c00bf71c>]
(block_write_begin+0x48/0x78)
[<c00bf71c>] (block_write_begin+0x48/0x78) from [<c00fd75c>]
(ext2_write_begin+0x3c/0x64)
[<c00fd75c>] (ext2_write_begin+0x3c/0x64) from [<c006ae28>]
(generic_file_buffered_write+0xd8/0x240)
[<c006ae28>] (generic_file_buffered_write+0xd8/0x240) from
[<c006cf2c>] (__generic_file_aio_write+0x460/0x4a4)
[<c006cf2c>] (__generic_file_aio_write+0x460/0x4a4) from [<c006cfdc>]
(generic_file_aio_write+0x6c/0xd4)
[<c006cfdc>] (generic_file_aio_write+0x6c/0xd4) from [<c0099798>]
(do_sync_write+0xa8/0xf4)
[<c0099798>] (do_sync_write+0xa8/0xf4) from [<c009a194>]
(vfs_write+0xb0/0x13c)
[<c009a194>] (vfs_write+0xb0/0x13c) from [<c009a2d4>]
(sys_write+0x40/0x6c)
[<c009a2d4>] (sys_write+0x40/0x6c) from [<c0026dc0>]
(ret_fast_syscall+0x0/0x2c)
---[ end trace 2567634ba15c3eab ]---
EXT2-fs (mmcblk0p1): error: ext2_fsync: detected IO error when writing
metadata buffers
Reading package lists... Error!
W: GPG error: http://packages.enlightenment.org squeeze Release: The
following signatures couldn't be verified because the public key is
no2
W: GPG error: http://www.debian-multimedia.org squeeze Release: The
following signatures couldn't be verified because the public key is
not7
E: Problem syncing the file - sync (5 Input/output error)
E: The package lists or status file could not be parsed or opened.
Ok, I've tested this on all of my SD cards, and both 'apt-get update'
and 'mmap_stress' work if we disable the data CRCs. I tested with ext3
and ext4 filesystems.
One can disable the data CRCs with the following at the U-Boot prompt:
setenv bootargs ${bootargs} mmc_core.use_spi_crc=0
(and optionally saveenv)
The above works if MMC support is compiled into the kernel. Otherwise
you can do it with modprobe, like
modprobe mmc_core use_spi_crc=0
(or something similar, I didn't try this)
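
To check after boot that the bootargs change really reached the kernel,
one can look at /proc/cmdline. Below is a small sketch that does this.
Both function names are invented for this example, and it only covers the
bootargs case: an option given to modprobe does not appear in
/proc/cmdline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the whitespace-separated command
 * line `cmdline` contains exactly the token `param`, else 0.
 */
int cmdline_has_token(const char *cmdline, const char *param)
{
	assert(cmdline != NULL && param != NULL);
	size_t plen = strlen(param);
	const char *p = cmdline;

	while (*p) {
		while (*p == ' ' || *p == '\t' || *p == '\n')
			p++;				/* skip separators */
		size_t tlen = strcspn(p, " \t\n");	/* length of token */
		if (tlen > 0 && tlen == plen && strncmp(p, param, plen) == 0)
			return 1;
		p += tlen;
	}
	return 0;
}

/* Same check against the running kernel; returns -1 if /proc/cmdline
 * cannot be read. */
int running_kernel_has_token(const char *param)
{
	char line[4096];
	FILE *f = fopen("/proc/cmdline", "r");
	if (!f)
		return -1;
	char *ok = fgets(line, sizeof(line), f);
	fclose(f);
	return ok ? cmdline_has_token(line, param) : -1;
}
```

Usage would be running_kernel_has_token("mmc_core.use_spi_crc=0").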
It would be great if someone tries the options and reports back
whether it helps or not :)
Best regards,
MW
2011/1/17 Mika Westerberg <mika.we...@iki.fi>:
> Ok, I've tested this on all of my SD-Cards and both 'apt-get update'
> and 'mmap_stress'
> work if we disable the data CRCs. I tested with ext3 and ext4 filesystems.
>
> One can disable the data CRCs with following on the u-boot prompt:
>
> setenv bootargs ${bootargs} mmc_core.use_spi_crc=0
> (and optional saveenv)
Thank you! That solves the problem! :D
I've added it to the wiki at
http://code.google.com/p/sim1/wiki/BootLoader
Is disabling CRC a kind of workaround for a problem that is still to be fixed, or
was it just a U-Boot misconfiguration?
I tested on btrfs with apt-get update; it has never gotten this far before... :)
federico
It is certainly "a workaround", but probably the only thing we can do to
prevent failures with this kind of program.
Since mmcqd (the kernel thread which handles the request queue for the
SD card) can be scheduled out while it is waiting for the SPI layer to
complete a transfer, it is certainly possible that some program modifies
its own data (which is currently being transmitted to the SD card).
We have some alternatives, though:
1. Fix the programs (yeah, right ;)
2. Make a copy of the TX data and pass that to the SPI layer. This
   ensures that no one is able to modify it while the data is transmitted.
   This could have a negative effect on performance.
(possibly some more)
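
Alternative 2 could look roughly like this, shown in user space. It is a
sketch of the idea only and not a patch for mmc_spi.c: struct tx_snapshot,
its functions and SD_BLOCK_SIZE are names invented for the example, and
crc16_itu_t() is again a bitwise stand-in for the kernel's crc_itu_t().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SD_BLOCK_SIZE 512

/* Hypothetical private copy of one block plus the CRC taken over it. */
struct tx_snapshot {
	uint8_t  data[SD_BLOCK_SIZE];
	uint16_t crc;
};

/* Bitwise CRC-16, polynomial 0x1021, MSB first. */
static uint16_t crc16_itu_t(uint16_t crc, const uint8_t *buf, size_t len)
{
	while (len--) {
		crc ^= (uint16_t)(*buf++ << 8);
		for (int bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x1021);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}

/* Copy first, then compute the CRC over the copy, never over `src`. */
void tx_snapshot_take(struct tx_snapshot *snap, const uint8_t *src)
{
	assert(snap != NULL && src != NULL);
	memcpy(snap->data, src, SD_BLOCK_SIZE);
	snap->crc = crc16_itu_t(0, snap->data, SD_BLOCK_SIZE);
}

/* What the card would check: do the bytes sent match the CRC sent? */
int tx_snapshot_consistent(const struct tx_snapshot *snap)
{
	return crc16_itu_t(0, snap->data, SD_BLOCK_SIZE) == snap->crc;
}
```

Later writes to the source page can no longer break the CRC, since data
and CRC both come from the private copy. The copy itself may still catch
the page in the middle of an update, which is no different from what any
writeback of an mmapped page can see. The extra memcpy() per block is the
performance cost mentioned above.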
If a CRC check is needed, you can use a filesystem which supports checksums.
Btrfs seems to have support for this.
> I tested on btrfs and apt-get update, never went so far... :)
Great :) Thanks for testing.
Wow! Does that also make the memory-to-memory DMA interface work for
RAM-to-RAM block transfers?
Certainly your work is interesting and I'd be happy to try it when it
"seems to work".
Cheers
M
Hi
To check how likely undetected data errors are, I've run stress
tests writing large amounts of random data to cards, then reading it
back and checking that it is what it should be.
With 640MB on SDHC and plain SD cards and 128MB on old MMC, there
are NO data errors. 100% OK.
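
For anyone who wants to repeat this kind of test, here is a minimal
sketch of a write-then-verify pass. It is not the program used for the
numbers above (that one is not shown in this thread); write_and_verify(),
the 4096-byte block size and the xorshift32 generator are all choices made
for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Small deterministic PRNG, so the verify pass can regenerate the same
 * byte stream instead of keeping a copy in memory. */
static uint32_t xorshift32(uint32_t *state)
{
	uint32_t x = *state;
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return *state = x;
}

static void fill_block(uint8_t *buf, size_t len, uint32_t *state)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (uint8_t)xorshift32(state);
}

/*
 * Hypothetical helper: write `nblocks` pseudo-random blocks to `path`,
 * fsync, read them back and compare.
 * Returns the number of mismatching blocks, or -1 on an I/O error.
 */
long write_and_verify(const char *path, long nblocks, uint32_t seed)
{
	enum { BLK = 4096 };
	uint8_t out[BLK], in[BLK];

	assert(path != NULL && nblocks > 0 && seed != 0);
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	uint32_t state = seed;
	for (long b = 0; b < nblocks; b++) {
		fill_block(out, BLK, &state);
		if (write(fd, out, BLK) != BLK)
			goto fail;
	}
	if (fsync(fd) < 0 || lseek(fd, 0, SEEK_SET) < 0)
		goto fail;

	long bad = 0;
	state = seed;		/* replay the same stream for comparison */
	for (long b = 0; b < nblocks; b++) {
		fill_block(out, BLK, &state);
		if (read(fd, in, BLK) != BLK)
			goto fail;
		if (memcmp(out, in, BLK) != 0)
			bad++;
	}
	close(fd);
	return bad;
fail:
	close(fd);
	return -1;
}
```

Note that reading back right after writing will usually be served from
the page cache rather than from the card. For a meaningful result the
card has to be unmounted and remounted (or the caches dropped) between
the write pass and the read pass, which this sketch does not do.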
Cheers
M
Cool, that's impressive :)
And if there is still a need to make sure that no corruption occurs,
people can use a filesystem which checksums the data
and metadata.
Thanks,
MW