Memtest Determine Which Stick Is Bad

0 views
Skip to first unread message

Shawana Kallhoff

unread,
Aug 5, 2024, 1:34:17 AM8/5/24
to camppesttradquad
Oneor many of the memory cards out of total 4GB (4x1GB) are failing. The following is memtest progress screenshot. It is always pointing to memory at 19xx MB and 50xx MB. Error bits are always 04000000. How can I determine which of those 4 cards are failing? The strange thing is that if I test them individually they are not failing. I've change Motherboard.

But, just because one failed a test, don't assume that the other doesn't fail (you could have two failing memory modules) -- where you've detected a failure with two memory modules, test each of those two separately afterwards.


Important note: With features like memory interleaving, and poor memory module socket numbering schemes by some motherboard vendors, it can be difficult to know which module is represented by a given address.


I know it's an old post but maybe someone will have the same problem. It could be that the problem appears when the mainboard is entering in dual channel configuration, so the problem could be from the processor, especially some of the the pins from underneath it could be bended.


You didn't specify the laptop or RAM model, but I assume your laptop only has 2 DIMM slots. It's possible both of the new DIMMs are bad or incompatible (e.g. the laptop may not support high-density modules), or the laptop does not support 16 GB of RAM (in which case, it probably only supports 8 GB or 4 GB). Try installing one of the new sticks and one of the old sticks, and run the test again. If it works fine, swap out the new stick with the other new stick.


Please be aware that not all errors reported by MemTest86 are due to bad memory. The test implicitly tests the CPU, L1 and L2 caches as well as the motherboard. It is impossible for the test to determine what causes the failure to occur. However, most failures will be due to a problem with memory module. When it is not, the only option is to replace parts until the failure is corrected.


Sometimes memory errors show up due to component incompatibility. A memory module may work fine in one system and not in another. This is not uncommon and is a source of confusion. In these situations the components are not necessarily bad but have marginal conditions that when combined with other components will cause errors.


Often the memory works in a different system or the vendor insists that it is good. In these cases the memory is not necessarily bad but is not able to operate reliably at full speed. Sometimes more conservative memory timings on the motherboard will correct these errors. In other cases the only option is to replace the memory with better quality, higher speed memory. Don't buy cheap memory and expect it to work reliably. On occasion "block move" test errors will occur even with name brand memory and a quality motherboard. These errors are legitimate and should be corrected.


All valid memory errors should be corrected. It is possible that a particular error will never show up in normal operation. However, operating with marginal memory is risky and can result in data loss and even disk corruption. Even if there is no overt indication of problems you cannot assume that your system is unaffected. Sometimes intermittent errors can cause problems that do not show up for a long time. You can be sure that Murphy will get you if you know about a memory error and ignore it.


We are often asked about the reliability of errors reported by MemTest86. In the vast majority of cases errors reported by the test are valid. There are some systems that cause MemTest86 to be confused about the size of memory and it will try to test non-existent memory. This will cause a large number of consecutive addresses to be reported as bad and generally there will be many bits in error. If you have a relatively small number of failing addresses and only one or two bits in error you can be certain that the errors are valid. Also intermittent errors are without exception valid. Frequently memory vendors question if MemTest86 supports their particular memory type or a chipset. MemTest86 is designed to work with all memory types and all chipsets.


The Hammer Test is designed to detect RAM modules that are susceptible to disturbance errors caused by charge leakage. This phenomenon is characterized in the research paper Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors by Yoongu Kim et al. According to the research, a significant number of RAM modules manufactured 2010 or newer are affected by this defect. In simple terms, susceptible RAM modules can be subjected to disturbance errors when repeatedly accessing addresses in the same memory bank but different rows in a short period of time. Errors occur when the repeated access causes charge loss in a memory cell, before the cell contents can be refreshed at the next DRAM refresh interval.


Starting from MemTest86 v6.2, the user may see a warning indicating that the RAM may be vulnerable to high frequency row hammer bit flips. This warning appears when errors are detected during the first pass (maximum hammer rate) but no errors are detected during the second pass (lower hammer rate). See MemTest86 Test Algorithms for a description of the two passes that are performed during the Hammer Test (Test 13). When performing the second pass, address pairs are hammered only at the rate deemed as the maximum allowable by memory vendors (200K accesses per 64ms). Once this rate is exceeded, the integrity of memory contents may no longer be guaranteed. If errors are detected in both passes, errors are reported as normal.


The errors detected during Test 13, albeit exposed only in extreme memory access cases, are most certainly real errors. During typical home PC usage (eg. web browsing, word processing, etc.), it is less likely that the memory usage pattern will fall into the extreme case that make it vulnerable to disturbance errors. It may be of greater concern if you were running highly sensitive equipment such as medical equipment, aircraft control systems, or bank database servers. It is impossible to predict with any accuracy if these errors will occur in real life applications. One would need to do a major scientific study of 1000 of computers and their usage patterns, then do a forensic analysis of each application to study how it makes use of the RAM while it executes. To date, we have only seen 1-bit errors as a result of running the Hammer Test.


Depending on your willingness to live with the possibility of these errors manifesting itself as real problems, you may choose to do nothing and accept the risk. For home use you may be willing to live with the errors. In our experience, we have several machines that have been stable for home/office use despite experiencing errors in the Hammer Test.


You may also choose to replace the RAM with modules that have been known to pass the Hammer Test. Choose RAM modules of different brand/model as it is likely that the RAM modules with the same model would still fail the Hammer test.


For sensitive equipment requiring high availability/reliability, you would replace the RAM without question and would probably switch to RAM with error correction such as ECC RAM. Even a 1-bit error can result in catastrophic consequences for say, a bank account balance. Note that not all motherboards support ECC memory, so consult the motherboard specifications before purchasing ECC RAM.


The ability of MemTest86 to detect and report on row hammer errors depends on several factors and what mitigations are in place. To generate errors adjacent memory rows must be repeatedly accessed. But hardware features such as multiple channels, interleaving, scrambling, Channel Hashing, NUMA & XOR schemes make it nearly impossible (for an arbitrary CPU & RAM stick) to know which memory addresses correspond to which rows in the RAM. Various mitigations might also be in place. Different BIOS firmware might set the refresh interval to different values (tREFI). The shorter the interval the more resistant the RAM will be to errors. But shorter intervals result in higher power consumption and increased processing overhead. Some CPUs also support pseudo target row refresh (pTRR) that can be used in combination with pTRR-compliant RAM. This field allows the RAM stick to indicate the MAC (Maximum Active Count) level which is the RAM can support. A typical value might be 200,000 row activations. Some CPUs also support the Joint Electron Design Engineering Council (JEDEC) Targeted Row Refresh (TRR) algorithm. The TRR is an improved version of the previously implemented pTRR algorithm and does not inflict any performance drop or additional power usage. As a result the row hammer test implemented in MemTest86 maybe not be the worst case possible and vulnerabilities in the underlying RAM might be undetectable due to the mitigations in place in the BIOS and CPU.


Most memory systems nowadays operate in multiple channel mode in order to increase the transfer rate between the RAM modules and the memory controller. It is recommended that modules with identical specifications (ie. "matching modules") when running in multi-channel mode. Some motherboards also have compatibility issues with certain brand/models of RAM when running in multi-channel mode.


When you see errors while running MemTest86 with multiple RAM modules installed, but not when they are tested individually, it is likely that the multi-channel configuration is the culprit. This could be due to mismatched RAM specifications, or simply using brands/models of RAM that is incompatible with the motherboard. Most motherboard vendors release a list of known compatible RAM models that have been tested to work with your motherboard. Replace the modules with a matching set of known good ones and see if you get better results.


When MemTest86 detects errors during the memory tests, the memory address, actual and expected data are reported to the user. The memory address is the location in system memory where the data contained does not match what was expected. This is the address that is specified by the CPU to the memory controller when requesting data from DRAM. The memory controller then decodes this memory address to identify the specific channel, DIMM, rank, DRAM chip, bank, row and column in DRAM using a chipset-specific address decoding scheme.

3a8082e126
Reply all
Reply to author
Forward
0 new messages