Doug Anderson has uploaded a new change for review.
Change subject: WIP: Recover from bit errors that show up across sleep/wake
......................................................................
WIP: Recover from bit errors that show up across sleep/wake
This is WIP code. Specific issues:
* Has a bunch of debugging printouts.
* Has a purposeful corruption of memory to test.
* Detection of when to run needs to be finished.
* Code needs to be cleaned.
* Optimization / testing.
* ...and a bunch more.
Uploading for people that are interested. At this point the
feedback I'm most interested in is where to put the "bitfix"
code and what to name it. ...but can wait and move it once
things are more final, too.
The best description of what's going on is in 'bitfix-snow.c'.
BUG=chrome-os-partner:15655
TEST=powerd_suspend and look at log. See:
[ 29.953878] s3c_pm_check: Restore CRC error at 71734000 (4523f21e vs ffffffff)
[ 29.953878] s3c_pm_check: trying again
[ 29.953878] bitfix_recover_chunk: 71734000
[ 29.953878] s3c_pm_check: better now
Change-Id: I9c7ac2f85b8d9398c93f486f9401daee1526f571
Signed-off-by: Doug Anderson <diand...@chromium.org>
---
M arch/arm/mach-exynos/Kconfig
M arch/arm/mach-exynos/Makefile
A arch/arm/mach-exynos/bitfix-snow.c
A arch/arm/mach-exynos/include/mach/bitfix-snow.h
M arch/arm/mach-exynos/mach-exynos5-dt.c
M arch/arm/plat-samsung/pm-check.c
M chromeos/config/armel/chromeos-exynos5.flavour.config
7 files changed, 700 insertions(+), 79 deletions(-)
Discussion subject changed to "samsung: snow: bitfix: Recover from bit errors across sleep/... [chromiumos/third_party/kernel : chromeos-3.4]" by Doug Anderson (Code Review)
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 8:
I think we're ready for review now. Running more testing tonight.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 8: I would prefer that you didn't submit this
(1 inline comment)
....................................................
File arch/arm/mach-exynos/include/mach/bitfix-snow.h
Line 29: Actually, probably need no-ops of these if no config.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 10:
Fix for style nits from checkpatch plus now handles when CONFIG_SNOW_BITFIX isn't defined (whoops!).
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 10:
Unfortunately I've only had time to give a quick look and won't be able to look at this in more detail until after lunch (my time, not yours, at least).
High level looks good, it might be worth it to change pm_check back to having CHUNK_SIZE as a #define shared with ( taken from?) bitfix.c. Having it be runtime configurable was mainly done to aid testing I was doing earlier and I don't think we have any real benefit from it. Simplifies the code somewhat in terms of calls between modules.
Also I'm mostly done testing this but https://gerrit.chromium.org/gerrit/#/c/36944/ might be worth running. It 's a wrapper around powerd_suspend that allocated all free memory and fills it with known patterns as a stopgap against bugs in the kernel code. A good way to sanity check that we're doing the right things (and fixing properly).
Olof Johansson has posted comments on this change.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 10: Looks good to me, but someone else must approve
(1 inline comment)
I think this is looking reasonable. I have not peeled the onion all the way back and checked all calculations, so giving +1 until either myself or someone else does it.
I included an include file nit below, and I think some of the code could be tightened up a bit by choosing shorter variable names in some locations (to avoid line wraps, etc). But overall it seems reasonable.
....................................................
File arch/arm/mach-exynos/include/mach/bitfix-snow.h
Line 17: #define _MACH_EXYNOS_BITFIX_SNOW_H
This should just go as a local header in arch/arm/mach-exynos/ and be included as "header.h". It's only needed in include/mach if it's used outside of the mach directory (there's been lots of cleanups and removal of these kind of headers in upstream recently).
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 10: (2 inline comments)
I'll do a pass of cleanup to try to simplify variable names. Then I'll search around for someone to give an in-depth review.
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 427: * Clear each page in prep for xor to recover. Keep a copy for
Actually, should probably recover to compare_chunk and then copy. That should make us more robust to bitflips in kernel globals.
Said another way: if there's a bitflip in kernel globals it may be in something unimportant and that's why we're still running code. We should still recover the problem but we shouldn't zero the memory while we're doing recovery.
....................................................
File arch/arm/mach-exynos/include/mach/bitfix-snow.h
Line 17: #define _MACH_EXYNOS_BITFIX_SNOW_H
Yeah, I had it local until I needed it included by <arch/arm/plat-samsung/pm-check.c>
I tried to keep the changes to pm-check minimal but needed to hook in somewhere. I could try to architect some sort of generic hook there but it seemed overkill?
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 10: (1 inline comment)
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 427: * Clear each page in prep for xor to recover. Keep a copy for
Done
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 14:
Comparing to patch set #10:
* Fixed my own suggestion of always recovering to a single chunk.
* Shortened a few names. I can shorten more if people insist but I think the longer names help me. I did eliminate some wrapping.
* Added the ability to force bitfix on using a config option.
* Updated description a little.
I published my draft changes since I did the fixups and renames in two passes (in case that helps anyone comparing from patch set #10). ...then I fixed a few bugs.
Re-running tests with this new CL but expect them to continue passing. Will set verified in a few hours.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 14: (2 inline comments)
uh, comment review :) (Code looks tidy, but I didn't spend the time needed to be a solid reviewer.)
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 75: * We'll then rely on an out-of-band singnal (CRC) to identify bad chunks (8K)
typo: signal
Line 77: * will recompute it. Since all bad chunks are in a single corrutpion unit
typo: corruption
Olof Johansson has posted comments on this change.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 10: (1 inline comment)
....................................................
File arch/arm/mach-exynos/include/mach/bitfix-snow.h
Line 17: #define _MACH_EXYNOS_BITFIX_SNOW_H
Ok, let's deal with it when it breaks then instead.
Olof Johansson has posted comments on this change.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 14: I would prefer that you didn't submit this
(9 inline comments)
I have reviewed a good portion of the logic now, and it looks correct to me so far. But I have a handful of comments on general readability below that I think would be useful to change in case others need to get into this in the future.
Also, please hold off uploading a new revision with these things changed if others are still reviewing, since it is annoying to have code change underneath of your review. :)
With the below fixed, I think I am ready to +2. But I still want at least one more person to take a close look.
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 376: * @skip_fn: This will be called chunk at a time. If a chunk was never
The name skip_fn makes me think it's a function to call when a chunk is skipped, not to check if it should be skipped. I think a better name would help readability.
Line 381: static void _bitfix_recover_chunk(phys_addr_t dest_chunk,
This is no longer dest_chunk, and threw me in a loop because of it. Well, it's the eventual destination, but not the target of the recovery in this function.
Line 384: const u32 dest_corruption_unit = bitfix_get_corruption_unit(dest_chunk);
Using something like "cu" to refer to a corruption unit though the file might make sections like this a bit easier to follow, variable names get pretty long
Line 400: * need to process it now for recovery.
Not following the language here -- "was skipped"? You mean that skip_fn() will return true on the xor chunk and we still want to process it, right?
Line 406: for (pgnum = 0; pgnum < (CHUNK_SIZE / PAGE_SIZE); pgnum++) {
If you += PAGE_SIZE here and compare accordingly, you could get rid of a lot of * PAGE_SIZE below. Otherwise you can use PAGES_PER_CHUNK Here. But I prefer the former.
Line 446: if (recover_chunk == NULL)
you can do:
if (WARN_ON_ONCE(recover_chunk == NULL))
return;
here, if you wawnt.
Line 510: recover_chunk = kmalloc(CHUNK_SIZE, GFP_KERNEL);
Is it worth it to always allocate this instead of keeping it allocated? The chance of getting an order-2 allocation fulfilled will decrease with uptime of the system. If not, also consider GFP_ATOMIC.
Line 513: * Zero out all of the xor superchunk.
/* can be a single line comment like this */
Line 615: #ifdef CONFIG_SNOW_BITFIX_FORCE
This still makes it hard to enable this for testing purposes, since it would require a rebuild.
I think we'd be better off always reserving the memory (since it's only .1% of RAM), and have a module_parameter() to force it, since that could also be changed at runtime through /sys. That would mean we could force it on for non-affected hardware to check for regressions in the lab as part of regular test suites.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 14: (24 inline comments)
lots of nits but no problems (take some of my comments w/ a grain of salt given I read the bitfix code before the pm code--which would've clarified things)
I find all the units/chunks/superchunks/whatever confusing but not a big deal.
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 47: * - The 0x00000003 bit has to do with byte. Each byte in a 32-bit word goes
"has to do with byte" <what?>; also 3 is 2 bits :)
Line 68: * (unless we get a double-corruption which should be extremely rate).
I would s/unit/set/ (or maybe bucket) but not important.
Line 75: * We'll then rely on an out-of-band singnal (CRC) to identify bad chunks (8K)
where does this "out-of-band signal" come from?
Line 77: * will recompute it. Since all bad chunks are in a single corrutpion unit
"recompute it", recompute <what>? the memory contents?
Line 232: * @size: Size of region 1.
in bytes
Line 234: * @whatsz: Size of region 2.
in bytes
Line 244: if (what >= (ptr + size))
I assume there's no need to worry about wrap-around
Line 287: * elements.
I prefer an assert to a comment
Line 299: * Run xor on the given page; storing in the given corruption unit.
storing <what>?
Line 301: * We'll xor the given page into the equivalnet page in the given corruption
equivalent
Line 325: * Print out comparisons of which bits were fixed in a page.
s/comparisons of //
Line 347: * - Only pages that have been processed can be recovered.
"processed"? you mean only a page that has an xor recorded in the xor block? (or whatever you call it)
Line 349: * should indicate that it was skipped.
got confused by this comment being separated from _bitfix_recover_chunk where skip_fn is used
Line 355: * page instead of on a chunk-by-chunk.
s/a // or add " basis"
Line 375: * This is the start of the bad chunk.
should you assert the address is the _start_ of a chunk?
Line 422: * We can only recovery a chunk whose pages were processed originally with
s/recovery/recover/
Line 446: if (recover_chunk == NULL)
why does this happen? do you want a pr_err or similar?
Line 453: * We recover to recover_chunk instead of directly to memory and then
Maybe: "We recover to the recover chunk and then copy instead of directly to the destination. This allows us to ignore bit errors in the memory used for the recover chunk."
Line 456: * weren't super serious enough to prevent bitfix from running. In
s/super//
Line 463: * Do comparisons to characterize the corruption.
how expensive is this? do you want to make it optional? (sounds like no since you say there is negligible impact on resume time)
Line 513: * Zero out all of the xor superchunk.
s/out all of//
....................................................
File arch/arm/plat-samsung/pm-check.c
Line 372: bitfix_recover_chunk(addr, s3c_pm_should_skip);
thought bitfix knew when it could recover a chunk; why do you need to re-crc the data? paranoia?
....................................................
Commit Message
Line 67: [ 0.000000] bitfix_reserve: Reserved 00200000 bytes
assume you tested the new bios and see 0 impact
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 14: (36 inline comments)
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 47: * - The 0x00000003 bit has to do with byte. Each byte in a 32-bit word goes
Done
Line 68: * (unless we get a double-corruption which should be extremely rate).
Not going to change terminology now...
From Bryan: s/rate/rare
Done
Line 75: * We'll then rely on an out-of-band singnal (CRC) to identify bad chunks (8K)
Done
Line 75: * We'll then rely on an out-of-band singnal (CRC) to identify bad chunks (8K)
pm-check
Line 77: * will recompute it. Since all bad chunks are in a single corrutpion unit
Done
Line 77: * will recompute it. Since all bad chunks are in a single corrutpion unit
Done
Line 196: #define SUPERCHUNK_SIZE (1 << CORRUPTION_UNIT_OFFSET)
From Bryan: should be based on SUPERCHUNK_BITS for clarity.
Done.
Line 232: * @size: Size of region 1.
Done
Line 234: * @whatsz: Size of region 2.
Done
Line 244: if (what >= (ptr + size))
No, there shouldn't be. We are comparing if two chunks of memory overlap and those chunks must fit fully within phys_addr_t.
Line 287: * elements.
I'll put a BUG_ON() for the count but not for the dest/src. In fact, I'll remove the comment for dest/src. The fact that they are "u32 *" means that they must be aligned. Validating at that level doesn't make sense to me.
Line 299: * Run xor on the given page; storing in the given corruption unit.
Hopefully I've clarified this now.
Line 301: * We'll xor the given page into the equivalnet page in the given corruption
Done
Line 325: * Print out comparisons of which bits were fixed in a page.
Done
Line 347: * - Only pages that have been processed can be recovered.
Done
Line 349: * should indicate that it was skipped.
Done
Line 355: * page instead of on a chunk-by-chunk.
Done
Line 375: * This is the start of the bad chunk.
Could, but this an internal function called in one place.
Line 376: * @skip_fn: This will be called chunk at a time. If a chunk was never
Done
Line 381: static void _bitfix_recover_chunk(phys_addr_t dest_chunk,
Done
Line 384: const u32 dest_corruption_unit = bitfix_get_corruption_unit(dest_chunk);
Happy with that. Done.
Line 400: * need to process it now for recovery.
Done
Line 406: for (pgnum = 0; pgnum < (CHUNK_SIZE / PAGE_SIZE); pgnum++) {
Done
Line 422: * We can only recovery a chunk whose pages were processed originally with
Done
Line 446: if (recover_chunk == NULL)
Done
Line 446: if (recover_chunk == NULL)
Code is gone. Memory is allocated in globals now.
Line 453: * We recover to recover_chunk instead of directly to memory and then
Reworded. See what you think.
Line 456: * weren't super serious enough to prevent bitfix from running. In
Reworded whole section.
Line 463: * Do comparisons to characterize the corruption.
We don't get bit errors often (3.5% of the time?). When we do we don't seem to get more than 100 in an extreme case. The extra speed impact here doesn't seem huge compared to processing all 2 gigs of memory.
Line 510: recover_chunk = kmalloc(CHUNK_SIZE, GFP_KERNEL);
Sure. It does make things simpler, and it's only 8K. The nice thing is that it can now go in the special suspend_volatile section.
Line 513: * Zero out all of the xor superchunk.
Done
Line 513: * Zero out all of the xor superchunk.
Done
Line 615: #ifdef CONFIG_SNOW_BITFIX_FORCE
Done
....................................................
File arch/arm/plat-samsung/pm-check.c
Line 372: bitfix_recover_chunk(addr, s3c_pm_should_skip);
Not completely paranoia:
* In the case of the failure of 2 chips at the same time bitfix will actually do the wrong thing. ...but it has no way of knowing--it just xors stuff together with no real checksum.
* There is a possibility that something more drastic could go wrong also. We can get bit flips in the code that runs bitfix or in other places. I'd rather be sure that we really fixed things.
....................................................
Commit Message
Line 67: [ 0.000000] bitfix_reserve: Reserved 00200000 bytes
With Olof's feedback we're actually going to allocate memory always.
...but yes, new bios shouldn't be getting automatically enabled.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 15:
Re-running with the latest set of patches, but early testing shows that we're still good.
Incorporated everyone's feedback that I've received and I don't _think_ there's anyone currently reviewing.
Will set "Verified" in a few hours if no problems found.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 15: Looks good to me, but someone else must approve
There is a lot of good work here, Doug.
I do not see any problems, but it is a bit much for me to +2 in one day.
Olof Johansson has posted comments on this change.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 15: Looks good to me, approved
LGTM, but please wait for Sam to give a review since he had comments before.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 15: Looks good to me, approved
(1 inline comment)
just a nit you can leave for later; no need for round-trip if you decide to change the comment
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 467: * We recover to recover_chunk and then copy instead of recoverying
recovering
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 15: (1 inline comment)
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 467: * We recover to recover_chunk and then copy instead of recoverying
Done
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 17:
Sorry to all for the churn. Needed to rebase since we decided to put _suspend_volatile in "bss" and it made sense to rename the tag then.
Also:
* Fixed Sam's typo.
* Use generic addr_overlap instead of copying in_range function.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 17: (1 inline comment)
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 18: #include <linux/addr_overlap.h>
I do not see this include file.
Change subject: samsung: snow: bitfix: Recover from bit errors across sleep/wake
......................................................................
Patch Set 17: (1 inline comment)
....................................................
File arch/arm/mach-exynos/bitfix-snow.c
Line 18: #include <linux/addr_overlap.h>
Sorry--this CL the last in a big chain. addr_overlap is the first.