I’ve got a Supermicro-based server that I’m in the process of setting up for Xen hosting purposes. After 3 or 4 days of uptime under light load (it’s not in production yet), sitting in its rack in a datacentre, weird things start to happen.
I get random kernel panics and oopses; it locks up or spontaneously reboots. When I power-cycle it, the serial console gets a bit garbled and slow once it gets past the BIOS screen, and it rarely manages to boot a kernel after that. If I turn it off for several hours and try again, I can usually get it booted. The BIOS event log contains multi-bit ECC errors.
Some of this sounds like overheating (nothing will boot, yet it “gets better” after a few hours without power), but the ECC errors suggest bad RAM. The server has 4×1GB of DDR2-533.
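Next time it’s up I can also cross-check the BIOS event log against the kernel’s EDAC counters. A minimal sketch, assuming an EDAC driver for this chipset is available and loaded (which I haven’t verified); the sysfs paths are the classic EDAC layout:

    #!/usr/bin/env python
    # Sketch: dump the kernel's EDAC error counters as a cross-check on the
    # BIOS event log. Assumes an EDAC driver for this chipset is loaded;
    # the sysfs layout below is the classic per-csrow EDAC one.
    import glob, os

    for row in sorted(glob.glob("/sys/devices/system/edac/mc/mc*/csrow*")):
        counts = {}
        for name in ("ce_count", "ue_count"):  # correctable / uncorrectable
            with open(os.path.join(row, name)) as f:
                counts[name] = f.read().strip()
        print("%s: CE=%s UE=%s" % (row, counts["ce_count"], counts["ue_count"]))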
Earlier this evening I booted the server into memtest86+ and left it running with the default settings for over 5 hours. In that time it completed 4 passes without error. I know it can take a long time for memtest86+ to find errors, so I may let it run for a few days.
I did get curious though and poked around in memtest86+’s configuration. When I change the memory map from “BIOS-std” to “BIOS-all” I get this:
    Memtest86+ v1.70          | Pass  1%
    Pentium D (65nm) 3192 MHz | Test 50% ###################
    L1 Cache:   16K 22322MB/s | Test #0  [Address test, walking ones]
    L2 Cache: 2048K 17348MB/s | Testing: 120K - 4096M 4112M
    Memory  : 4112M  3117MB/s | Pattern: 00000000
    Chipset :

    WallTime   Cached  RsvdMem  MemMap    Cache  ECC  Test  Pass  Errors  ECC Errs
    ---------  ------  -------  --------  -----  ---  ----  ----  ------  --------
      0:00:01   4112M       0K  e820-All  on     off  Std      0      22         0
    -----------------------------------------------------------------------------
    Tst  Pass  Failing Address          Good      Bad       Err-Bits  Count  Chan
    ---  ----  -----------------------  --------  --------  --------  -----  ----
      0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f04000      1
      0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f08000      1
      0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f10000      1
      0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f20000      1
      0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f40000      1
      0     0  000e7f00000 -  3711.0MB  ffffffff  00000000  e7f80000      1
      0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000004      1
      0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000008      1
      0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000010      1
      0     0  000ff000000 -  4080.0MB  ffffffff  00000000  ff000020      1

    (ESC)Reboot  (c)configuration  (SP)scroll_lock  (CR)scroll_unlock     LOCKED
Instant errors, and all in the top ~400M of RAM.
But is this just a misconfiguration of memtest86+ on my part? Is it expected that this setting will fail? Should I be taking out the last stick of RAM and seeing if life gets better? Is there some PAE issue where memtest86+ can’t address above that point?
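One check I can script from a booted Linux on the box: whether those failing addresses fall inside regions the BIOS reserves (PCI MMIO, ACPI tables and so on) rather than in RAM proper. A rough sketch using /proc/iomem, with the two base addresses taken from the memtest screen above:

    #!/usr/bin/env python
    # Rough sketch: check whether memtest86+'s failing addresses fall inside
    # BIOS-reserved regions rather than actual RAM. Needs root on most
    # systems so that /proc/iomem shows real addresses.
    FAILING = [0xe7f00000, 0xff000000]  # from the memtest86+ screen

    with open("/proc/iomem") as f:
        for line in f:
            rng, name = line.rstrip("\n").split(" : ", 1)
            lo, hi = [int(x, 16) for x in rng.strip().split("-")]
            for addr in FAILING:
                if lo <= addr <= hi:
                    print("0x%08x is in %s (%s)" % (addr, name, rng.strip()))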
My situation is made more difficult by the fact that this server is in a datacentre in San Francisco and I am in the UK; my only means of interaction with it is by serial console and a remote PDU to power cycle if necessary. Graham’s going there for me in a couple of days and Paul may go there in a few weeks, so I’d like to be able to make some suggestions of things they could try when they get there.
5 thoughts on “Dear Lazyweb, am I using memtest86+ correctly?”
Not quite answering your question, but thought that I’d just mention that I have used memtest86+ to test the RAM on an amd64 system with 32GB RAM quite happily. I’d upgraded the RAM from 8GB to 32GB and wanted to give the system a good memtest ‘soak’ to be happy the new RAM was good.
My tests worked fine and showed no errors, so it’s unlikely you’re hitting any kind of memtest/kernel limitation, assuming you’re using a recent version of memtest86+.
Yes, in my experience this is normal behaviour for memtest86+. I’m a user, not an expert.
But in every case where I’ve had kernel panics/crashes and ECC errors, memtest86+ has detected a section of bad memory. My vendor doesn’t force me to trace it down to the exact DIMM, so I just pull pairs of DIMMs and test a pair at a time with memtest86+ until I find the guilty pair, then swap it with the vendor for a new pair.
This certainly looks like you’ve got a bad DIMM.
I don’t know how to tell from memtest86+ which physical DIMM it is, so I just use the brute-force approach outlined above. OTOH, depending on your vendor, maybe they’ll just send you 4 new DIMMs that you can swap in, and they can handle figuring out the bad one.
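One possible shortcut to the brute force, though I’ve never relied on it: some BIOSes publish a physical-address-to-DIMM mapping in the DMI tables (type 20, “Memory Device Mapped Address”), which dmidecode can read. A speculative sketch, assuming your BIOS actually populates that table:

    #!/usr/bin/env python
    # Speculative sketch: ask the DMI tables which DIMM covers a failing
    # address. Relies on the BIOS populating DMI type 20 ("Memory Device
    # Mapped Address"), which not all BIOSes do. Run as root.
    import re, subprocess

    FAILING = 0xe7f00000  # first failing address reported by memtest86+

    out = subprocess.check_output(["dmidecode", "-t", "20"]).decode()
    for block in out.split("\n\n"):
        lo = re.search(r"Starting Address:\s*0x([0-9A-Fa-f]+)", block)
        hi = re.search(r"Ending Address:\s*0x([0-9A-Fa-f]+)", block)
        if lo and hi and int(lo.group(1), 16) <= FAILING <= int(hi.group(1), 16):
            print(block)  # includes the Physical Device Handle to chase up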
I had similar behavior on my Dell Inspiron 700m – memory configuration e820-Std never fails, e820-All fails in a block of high memory. It does make me wonder whether e820-All maps something that always fails, or whether my machine really has bad RAM. I’m going to take the memtest86 bootable disks to some other machines and see if e820-All always fails on all machines.
The BIOS-ALL (e820-All) memory config setting gives spurious error reports; it did so on every machine I tried. Then I found this FAQ http://forum.x86-secret.com/showthread.php?t=2807 :
– When I select BIOS-ALL I get many errors or my machine crashes.
This is normal. With today’s computers this option should never be selected. When selecting BIOS-ALL, memtest will ignore the e820 memory map supplied by the BIOS and will also test areas reserved by the BIOS for the processor or the BIOS. See the question about which memory is tested for more information.
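That fits the screenshot above: 3711MB and 4080MB are right where BIOSes typically park reserved ranges just below 4GB. To see exactly which ranges BIOS-std honours on a given box, you can list the e820 map from the kernel log. A sketch, assuming the classic one-range-per-line dmesg format of older kernels:

    #!/usr/bin/env python
    # Sketch: list the BIOS e820 ranges from the kernel log, showing which
    # areas BIOS-std skips but BIOS-All tests anyway. Assumes the classic
    # "BIOS-e820: <start> - <end> (<type>)" dmesg format of older kernels;
    # newer kernels print "BIOS-e820: [mem 0x...-0x...] <type>" instead.
    import re, subprocess

    dmesg = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
    for lo, hi, typ in re.findall(
            r"BIOS-e820: ([0-9a-f]+) - ([0-9a-f]+) \(([^)]+)\)", dmesg):
        print("%s - %s  %s" % (lo, hi, typ))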